Traditional information system experts, database admins, and data architects are not going to like this, but maybe that discomfort is exactly what's needed for change. Why change? Because there are terabytes of development and humanitarian data gathering dust on virtual shelves, millions spent on gathering data, writing reports, and building "open" datasets that no one uses, and yet another library of reports and resources that the occasional research graduate will painstakingly click and download, click and download... until the coffee runs out. Meanwhile, the practitioner who should be the one learning from this data has already given up, going with the direction of the wind rather than making data-driven decisions.
Traditional data architectures need to evolve beyond simple storage and complicated query structures to incorporate embedded intelligence, synthesis capabilities, and edge processing. This transformation is driven by the growing opportunity presented by AI-powered knowledge retrieval systems, LLMs, and RAG frameworks, which require new approaches to data storage and retrieval.
Traditional data stores serve as static repositories, organizing information through complex schemas and indices accessed via overly complicated query structures. However, these conventional structures, designed primarily for exact-match queries and simple filtering, limit the capabilities of emerging knowledge and information retrieval systems (LLM-based systems, RAG frameworks, and agentic tools). Contemporary information retrieval demands data stores that embed intelligence directly into their structure and include rich semantic data, contextual understanding, simplified metadata, and multi-modal retrieval capabilities.
Augmented storage structures
The data systems we need require storage mechanisms that extend beyond conventional object stores:
We're seeing the emergence of small language (transformer) models integrated directly within database systems, with experimental implementations in PostgreSQL (see PostgresML) demonstrating embedded intelligence capabilities. While this technology matures, a thin edge processing layer can provide these intelligent data operations today without requiring complex client-side implementations.
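To make that concrete, here is a rough sketch of what in-database embedding could look like, assuming a PostgreSQL instance with the PostgresML and pgvector extensions installed. The model name, table, and connection string are placeholders, and the exact function signatures and casts will depend on the installed versions.

```python
# Sketch: semantic search where the embedding happens inside the database itself.
# Assumes PostgreSQL with the PostgresML (pgml) and pgvector extensions, and a
# `projects` table whose `embedding` column was built with the same model.
import psycopg2

conn = psycopg2.connect("dbname=aid_projects user=edge_worker")  # placeholder DSN

with conn.cursor() as cur:
    # Embed the incoming query text in the database, so no client-side model is needed.
    cur.execute(
        """
        SELECT title
        FROM projects
        ORDER BY embedding <=> pgml.embed('intfloat/e5-small-v2', %s)::vector
        LIMIT 5;
        """,
        ("clean water access in east africa",),
    )
    for (title,) in cur.fetchall():
        print(title)
```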
Edge processing architecture
We need edge processing architecture that shifts complex data operations to where they belong—at the data provider level—while maintaining clear boundaries between processing layers. This approach would enable tool builders to focus on application logic rather than implementing redundant data processing infrastructure.
Edge functions serve as a lightweight computational layer between data storage and API endpoints, performing operations that leverage deep understanding of the underlying data.
1. Query processing
The edge layer should handle natural language inputs through query understanding and transformation. By leveraging domain-optimized embedding models fine-tuned specifically for the dataset, queries are automatically converted into appropriate vector representations, ensuring semantic matching with stored data embeddings. Query transformation and decomposition patterns can be tailored to the specific data structure and domain, drawing on the data provider's deep understanding of their own content, which is exactly why this should sit with the data provider rather than the tool builder. This centralized approach eliminates the need for individual tools to implement their own embedding generation and query processing logic, ensuring consistency and efficiency.
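A minimal sketch of what this could look like at the edge, assuming the data provider publishes a domain fine-tuned embedding model (the model id below is hypothetical, and the decomposition rule is deliberately simplistic):

```python
# Sketch of edge-side query understanding, assuming a provider-published,
# domain fine-tuned sentence-transformers model (placeholder model id).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("data-provider/aid-domain-embeddings")  # hypothetical

def prepare_query(natural_language_query: str) -> dict:
    """Turn a raw user query into the representations the data store actually needs."""
    # Dense vector for semantic matching against pre-computed document embeddings.
    dense = model.encode(natural_language_query, normalize_embeddings=True)

    # Naive decomposition of compound asks into sub-queries; a real implementation
    # would use provider-specific rules or a small language model.
    sub_queries = [q.strip() for q in natural_language_query.split(" and ") if q.strip()]

    return {
        "original": natural_language_query,
        "vector": dense.tolist(),
        "sub_queries": sub_queries,
    }

prepared = prepare_query("agriculture projects in east africa and their funding status")
```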
2. Result optimization
When combining results from multiple retrieval methods (dense vectors, sparse vectors, keyword matching), fusion techniques can adjust scores based on retrieval method reliability and query context. This hybrid approach, incorporating both semantic and lexical search with carefully tuned k-values for results fusion, consistently demonstrates superior retrieval performance compared to single-method approaches.
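As an illustration, reciprocal rank fusion (RRF) is one common way to combine ranked lists from different retrieval methods; the snippet below is a minimal sketch, with k=60 as a conventional default rather than a value tuned for any particular dataset.

```python
# Minimal reciprocal rank fusion (RRF) sketch for merging ranked result lists
# from dense, sparse, and keyword retrieval. The k constant dampens the influence
# of lower-ranked items.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_12", "doc_7", "doc_3"]
keyword_hits = ["doc_7", "doc_9", "doc_12"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])  # doc_7 and doc_12 rise to the top
```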
Re-ranking has emerged as an effective component for improving retrieval relevance. The edge layer could implement dynamic re-ranking strategies using models specifically fine-tuned for the dataset's characteristics and domain.
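A sketch of what such re-ranking might look like with a cross-encoder, again assuming a hypothetical provider fine-tuned model:

```python
# Sketch of edge-side re-ranking with a cross-encoder fine-tuned by the data
# provider on its own domain (the model id is a placeholder).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("data-provider/aid-domain-reranker")  # hypothetical

def rerank(query: str, candidates: list[dict], top_n: int = 10) -> list[dict]:
    """Re-score fused candidates against the query and keep only the best ones."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]
```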
3. Performance enhancement
Intelligent caching mechanisms at the edge can optimize frequent query patterns and reduce latency and AI processing costs while maintaining result freshness. The system could implement invalidation strategies based on content updates and query patterns. For dynamic content, incremental updates ensure cache coherence without full recomputation.
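One possible shape for such a cache, sketched here with an in-memory dictionary and a provider-supplied content version stamp for invalidation (a real edge runtime would more likely use its own key-value store such as Redis or a platform-provided cache):

```python
# Sketch of an edge cache keyed on normalized queries, invalidated either by TTL
# or by a content version stamp the data provider bumps when the dataset changes.
import hashlib
import time

class EdgeQueryCache:
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, str, object]] = {}

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str, content_version: str):
        entry = self.store.get(self._key(query))
        if entry is None:
            return None
        cached_at, cached_version, results = entry
        # Invalidate on dataset updates or stale TTL instead of recomputing everything.
        if cached_version != content_version or time.time() - cached_at > self.ttl:
            return None
        return results

    def put(self, query: str, content_version: str, results) -> None:
        self.store[self._key(query)] = (time.time(), content_version, results)
```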
4. Content transformation
While more complex transformations may occur at the API layer, the edge layer could perform rapid, targeted synthesis of retrieved information. Synthesis at this level can use the provided user query to produce contextualized summaries of larger unstructured content in the result sets. The edge layer could also quickly normalize output formats and adapt schemas for client consumption, though more complex transformations are typically reserved for higher layers in the architecture.
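A sketch of query-focused synthesis at the edge; the `generate` callable stands in for whichever small, fast model the edge runtime can reach (hosted or local) and is an assumption, not a reference to any specific provider's stack:

```python
# Sketch of edge-level contextualized synthesis: a query-focused summary of the
# larger unstructured documents in a result set. `generate` is a placeholder for
# a call to whatever model the edge runtime has available.
from typing import Callable

def synthesize(query: str, documents: list[str], generate: Callable[[str], str],
               max_chars_per_doc: int = 2000) -> str:
    """Summarize retrieved documents in the context of the user's query."""
    context = "\n\n".join(doc[:max_chars_per_doc] for doc in documents)
    prompt = (
        "Summarize the excerpts below only as they relate to the question.\n"
        f"Question: {query}\n\nExcerpts:\n{context}"
    )
    return generate(prompt)
```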
Edge processing shifts complex operations to the data provider level, where domain expertise can be leveraged effectively. Key benefits include:
Domain optimization
Efficiency gains
Resource management
Traditional API endpoints for development and humanitarian datasets often present unnecessary complexity and barriers to tool development. These conventional interfaces—built around rigid standards and complex querying patterns—create significant friction in building modern knowledge retrieval systems. This results in tool developers routinely downloading and reprocessing entire datasets, leading to inefficiency, redundancy, and increased environmental impact.
On that note, quick kudos to HDX for building rapid, low-latency endpoints. They are still too complex in my opinion, but they are fast, which in itself is significant.
The API layer must evolve to prioritize practical utility and accessibility while maintaining data integrity. We should focus on:
Query simplification
Intelligent query processing
Standardized response patterns
Here are some practical examples of how this would change:
Query handling
GET /data/search?q="agriculture projects in east africa"
Instead of:
GET /data/search?sector_code=311&region_code=EAF&activity_status=2
Geographic flexibility
GET /data/location?q="myanmar"
Accepts: "Myanmar", "Burma", "မြန်မာ", or standard codes
Sector searching
GET /data/sectors?q="clean water access"
Instead of:
GET /data/sectors?code=14030&vocabulary=DAC-5
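From the tool builder's side, calling such an endpoint could be as simple as the sketch below; the base URL and response shape are assumptions for illustration only.

```python
# Sketch of tool-builder code against a simplified, natural-language search endpoint.
# The base URL, query parameter, and response fields are illustrative assumptions.
import requests

BASE_URL = "https://data.example.org"  # placeholder data provider

resp = requests.get(
    f"{BASE_URL}/data/search",
    params={"q": "agriculture projects in east africa"},
    timeout=10,
)
resp.raise_for_status()

for record in resp.json().get("results", []):
    print(record.get("title"), "-", record.get("country"))
```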
The API layer leverages the edge processing capabilities described above: query understanding, result fusion and re-ranking, caching, and content transformation.
For tool builders, the evolved API layer dramatically streamlines development by eliminating complex query construction and data reprocessing requirements. Teams can focus on building innovative solutions rather than managing data infrastructure, enabling rapid integration and significantly lower overhead costs. Eliminating the need for data replication further reduces technical debt and operational complexity while keeping data fresh.
For data providers, this approach substantially decreases support requirements while driving higher utilization of their datasets. By handling complexity at the source, providers could see greater user adoption across diverse use cases and improved accessibility of their data. This centralized approach to data processing and access ultimately leads to more valuable and impactful datasets.
For the broader ecosystem, these improvements yield significant reductions in computational redundancy and environmental impact from eliminated reprocessing needs. The standardized yet flexible approach enhances data consistency while enabling diverse tool development. This results in more efficient resource utilization across the entire data ecosystem, fostering innovation while reducing waste, and, most of all, expanding the actual use of the data rather than letting it gather dust on virtual development and humanitarian shelves.
The design should prioritize simplicity first, emphasizing intuitive query patterns and clear response formats with minimal required parameters. While maintaining flexibility, the system enforces standardized approaches through consistent response patterns and comprehensive documentation. A developer-centric focus ensures practical use cases and rapid integration capabilities, supported by clear examples. Performance optimization remains critical, implementing efficient query processing, smart caching, and optimized response sizes to ensure scalability and responsiveness.
The bottom line is that this evolution in API design fundamentally shifts complexity from tool builders to data providers, where it can be handled more efficiently and consistently. By implementing these patterns, data providers can significantly reduce barriers to entry for tool developers while improving overall system efficiency and reducing the environmental impact of redundant processing. This is what will make all the hard work of making the "data" available pay off: it will actually be used!
The future of development and humanitarian data systems demands more than just making information available – it requires making it intelligently accessible. By embedding intelligence directly into data architectures through vector storage, edge processing, and intuitive APIs, we can dramatically reduce the barriers for AI-powered tools while minimizing environmental impact from redundant processing. This transformation isn't just about technological advancement; it's about ensuring that the valuable data collected in development and humanitarian contexts can effectively serve its ultimate purpose: improving lives through better-informed decision making and more efficient development programming and aid delivery.
As we move forward, the focus must shift from simply storing and sharing data to building systems that actively facilitate its intelligent use. The next generation of humanitarian data infrastructure will be judged not by how much data it can hold, but by how effectively it can serve the AI-powered tools that increasingly drive development and humanitarian work.