Data Systems  |  November 23, 2024

Next-generation data systems to serve AI-powered tools for development and humanitarian use

Introduction

Traditional information system experts, database admins and data architects are not going to like this, but maybe it’s the discomfort that’s needed for change. Why change? Because there are terabytes of development and humanitarian data gathering dust on virtual shelves, millions spent on gathering data, writing reports and building “open” datasets that no one uses, and yet another library of reports and resources that the occasional research graduate will painstakingly click and download, click and download... until the coffee runs out. It’s the practitioner who should be learning from this data, but the practitioner has already given up and is going with the direction of the wind rather than making data-driven decisions.

Traditional data architectures need to evolve beyond simple storage and complicated query structures to incorporate embedded intelligence, synthesis capabilities, and edge processing. This transformation is driven by the growing opportunity that AI-powered knowledge retrieval systems, LLMs and RAG frameworks present, all of which require new approaches to data storage and retrieval.

Data layer evolution

Traditional data stores serve as static repositories, organizing information through complex schemas, indices and overly complicated query structures. However, these conventional structures, designed primarily for exact-match queries and simple filtering, limit the capabilities of emerging knowledge and information retrieval systems (LLM-based systems, RAG frameworks, and agentic tools). Contemporary information retrieval demands data stores that embed intelligence directly into their structure and include rich semantic data, contextual understanding, simplified metadata and multi-modal retrieval capabilities.

Augmented storage structures

The data systems we need require storage mechanisms that extend beyond conventional object stores:

  • Dense vector embeddings for semantic search and result ranking, with clear documentation of which embedding model was used. Recent research suggests that smaller models work just fine for simple search (i.e. stop using OpenAI’s text-embedding-3 models with so many dimensions!). See, for example, quantized or binary embeddings, or preferably a fine-tuned embedding model specialized on the data being served.
  • Smaller, faster embedding models that build on recent advances in sentence transformers, such as Model2Vec.
  • Sparse vector implementations (BM25, TF-IDF) for keyword search and segmentation.
  • Hybrid retrieval architectures combining multiple search modalities (sparse and dense).
  • More useful metadata extraction from unstructured data for segmented search.
  • Entity relationship graphs for complex queries: where applicable, build a knowledge graph and allow a graph query language rather than SQL.
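
To make the binary embeddings idea concrete, here is a minimal sketch of sign-based quantization with Hamming-distance ranking. It assumes the simplest scheme (one bit per dimension, set when the component is positive). The function names and the toy 4-dimensional vectors are illustrative only, and real embeddings would come from whichever embedding model the provider documents.

```python
from typing import Dict, List, Sequence


def binarize(vector: Sequence[float]) -> int:
    """Pack a dense vector into an int: bit i is 1 when component i is positive."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two binary codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids ordered from the closest to the farthest binary code."""
    q = binarize(query)
    return sorted(docs, key=lambda doc_id: hamming(q, binarize(docs[doc_id])))


# Hypothetical toy vectors standing in for real model output.
docs = {
    "water_report": [0.9, -0.2, 0.4, -0.7],
    "crop_survey": [-0.5, 0.8, -0.1, 0.3],
}
print(rank_by_hamming([0.7, -0.1, 0.2, -0.9], docs))
# ['water_report', 'crop_survey']
```

Each vector shrinks to one bit per dimension, which is why this kind of quantization is attractive for simple search. A common pattern is to use the cheap binary pass to shortlist candidates and then re-score the shortlist with the full-precision vectors.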

Edge processing integration

We're seeing the emergence of small language (transformer) models integrated directly within database systems, with experimental implementations in PostgreSQL (see PostgresML) demonstrating embedded intelligence capabilities. While this technology matures, a thin edge processing layer can deliver these data operations today, without requiring complex client-side implementations.

Edge processing architecture

We need edge processing architecture that shifts complex data operations to where they belong—at the data provider level—while maintaining clear boundaries between processing layers. This approach would enable tool builders to focus on application logic rather than implementing redundant data processing infrastructure.

1. Query processing

The edge layer should handle natural language inputs through query understanding and transformation. Using domain-optimized embedding models fine-tuned specifically for the dataset, it converts queries automatically into vector representations that match the stored data embeddings semantically. Query transformation and decomposition patterns can be tailored to the specific data structure and domain, drawing on the data provider's deep understanding of their content, which is why this work should sit with the data provider rather than the tool builder. This centralized approach eliminates the need for individual tools to implement their own embedding generation and query processing logic, ensuring consistency and efficiency.

2. Result optimization

When combining results from multiple retrieval methods (dense vectors, sparse vectors, keyword matching), fusion techniques can adjust scores based on retrieval method reliability and query context. This hybrid approach, incorporating both semantic and lexical search with carefully tuned k-values for results fusion, consistently demonstrates superior retrieval performance compared to single-method approaches.
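
One widely used fusion technique with a tunable k-value is reciprocal rank fusion (RRF). The sketch below assumes that is the kind of fusion meant here. The function name, the document ids and the default k = 60 (a commonly used starting point) are illustrative rather than taken from any particular system.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense (semantic) and a sparse (BM25) retriever.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([dense, sparse]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

A larger k flattens the difference between top and lower ranks, so documents found by several retrievers win more easily. A smaller k rewards whichever retriever ranked a document first. RRF only uses ranks, so it needs no score normalization across retrieval methods. Weighting by method reliability would need a per-retriever multiplier on top of this.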

Re-ranking has emerged as an effective component for improving retrieval relevance. The edge layer could implement dynamic re-ranking strategies using models specifically fine-tuned for the dataset's characteristics and domain.

3. Performance enhancement

Intelligent caching mechanisms at the edge can optimize frequent query patterns and reduce latency and AI processing costs while maintaining result freshness. The system could implement invalidation strategies based on content updates and query patterns. For dynamic content, incremental updates ensure cache coherence without full recomputation.
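
As a rough illustration of caching with invalidation, the sketch below combines a time-to-live with a dataset version counter that is bumped whenever content changes. The class and method names are hypothetical, and a production edge layer would more likely use a shared cache service than an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class QueryCache:
    """TTL cache whose entries are also dropped once the dataset version changes."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.dataset_version = 0
        self._entries: Dict[str, Tuple[float, int, Any]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[Any]:
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, version, value = entry
        expired = self.clock() - stored_at > self.ttl
        if expired or version != self.dataset_version:
            del self._entries[key]  # stale: too old, or cached before a content update
            return None
        return value

    def put(self, query: str, value: Any) -> None:
        self._entries[self._key(query)] = (self.clock(), self.dataset_version, value)

    def content_updated(self) -> None:
        """Call when the underlying data changes; older entries become misses."""
        self.dataset_version += 1


cache = QueryCache(ttl_seconds=300)
cache.put("Flood response  Kenya", ["doc_1"])
print(cache.get("flood response kenya"))  # ['doc_1']
cache.content_updated()
print(cache.get("flood response kenya"))  # None
```

A single global version counter is the bluntest invalidation strategy. Incremental updates would track versions per dataset or per document, so that a change invalidates only the affected entries.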

4. Content transformation

While more complex transformations may occur at the API layer, the edge layer could perform rapid, targeted synthesis of retrieved information. Contextualized synthesis can be performed at this level using the provided user query to provide contextualized summaries of larger unstructured content in the result sets. The edge layer could also quickly normalize output formats and adapt schemas for client consumption, though more complex transformations are typically reserved for higher layers in the architecture.

Implementation rationale for edge processing

Edge processing shifts complex operations to the data provider level, where domain expertise can be leveraged effectively. Key benefits include:

Domain optimization

  • Fine-tuned embedding models specific to dataset characteristics
  • Custom ranking algorithms optimized for data patterns
  • Consistent processing across all client applications

Efficiency gains

  • Eliminates redundant processing across tools
  • Centralizes complex operations at data layer
  • Enables tools to focus on unique value propositions
  • Greatly reduces the environmental impact of redundant compute, including the AI compute spent by every tool building its own copy of the dataset

Resource management

  • Shared computation and intelligent caching
  • Minimized latency and bandwidth consumption
  • Optimal performance through proximity to data source

API layer evolution

Traditional API endpoints for development and humanitarian datasets often present unnecessary complexity and barriers to tool development. These conventional interfaces—built around rigid standards and complex querying patterns—create significant friction in building modern knowledge retrieval systems. This results in tool developers routinely downloading and reprocessing entire datasets, leading to inefficiency, redundancy, and increased environmental impact.

On that note, quick kudos to HDX for building rapid, low-latency endpoints. Though still too complex in my opinion, they are fast, which in itself is significant.

Modern API architecture

The API layer must evolve to prioritize practical utility and accessibility while maintaining data integrity. We should focus on:

Query simplification

  • Natural language query acceptance instead of strict parameter formats
  • Flexible geographic referencing beyond ISO codes
  • Intuitive sector and theme searching without requiring knowledge of specific taxonomies
  • Smart metadata interpretation using contextual understanding
  • Multi-format identifier acceptance (e.g., different country name variations)

Intelligent query processing

  • Automatic query decomposition and interpretation
  • Cross-reference resolution across different standards
  • Smart mapping of natural language to formal taxonomies
  • Contextual understanding of domain-specific terminology
  • Automated handling of multi-hop queries

Standardized response patterns

  • Consistent result formatting across different query types
  • Unified schema for cross-dataset queries
  • Flexible output formats supporting various use cases
  • Built-in pagination and filtering capabilities
  • Clear error handling and feedback mechanisms
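
A standardized response pattern could look something like the following sketch: one envelope shape for every query type, with pagination and an explicit error slot. All field names, the error code and the `paginate` helper are assumptions made for illustration, not an existing standard.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ApiError:
    code: str
    message: str


@dataclass
class Envelope:
    """Hypothetical response shape shared by all endpoints."""
    query: str
    results: List[Dict[str, Any]] = field(default_factory=list)
    total: int = 0
    page: int = 1
    page_size: int = 20
    error: Optional[ApiError] = None


def paginate(query: str, hits: List[Dict[str, Any]], page: int = 1, page_size: int = 20) -> Envelope:
    """Slice the full hit list into one page, or return a clear error for bad input."""
    if page < 1 or page_size < 1:
        problem = ApiError("invalid_pagination", "page and page_size must be >= 1")
        return Envelope(query=query, page=page, page_size=page_size, error=problem)
    start = (page - 1) * page_size
    return Envelope(query=query, results=hits[start:start + page_size],
                    total=len(hits), page=page, page_size=page_size)


hits = [{"id": i} for i in range(5)]
envelope = paginate("clean water access", hits, page=2, page_size=2)
print(asdict(envelope)["results"])  # [{'id': 2}, {'id': 3}]
```

Because every response carries the same fields, a tool builder can write one parser and reuse it across datasets, and `asdict` gives a plain dictionary that is ready to serialize as JSON.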

Implementation patterns

Here are some practical examples of how this would change:

Query handling

GET /data/search?q="agriculture projects in east africa"

Instead of:

GET /data/search?sector_code=311&region_code=EAF&activity_status=2
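
The natural language request above is written for readability. On the wire, the query text needs URL encoding. The sketch below shows how a client might build it with Python's standard library. The `/data/search` path and `q` parameter are the illustrative ones from this example, not a real endpoint, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(question: str, path: str = "/data/search") -> str:
    """Return the path plus a URL-encoded query string (hypothetical endpoint)."""
    return f"{path}?{urlencode({'q': question})}"


url = build_search_url("agriculture projects in east africa")
print(url)  # /data/search?q=agriculture+projects+in+east+africa

# The receiving side recovers the original text when it parses the query string.
print(parse_qs(urlsplit(url).query)["q"][0])  # agriculture projects in east africa
```

The client never has to look up a sector code, region code or status code: the only parameter is the question itself.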

Geographic flexibility

GET /data/location?q="myanmar"

Accepts: "Myanmar", "Burma", "မြန်မာ", or standard codes
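
Behind such an endpoint, the simplest form of multi-format acceptance is an alias table resolved after normalization. The sketch below assumes exact alias matching only. The table contents and function names are illustrative, and a real service would add fuzzy or embedding-based matching for misspellings and less common variants.

```python
import unicodedata
from typing import Optional

# Hypothetical alias table: ISO 3166-1 alpha-3 code -> names and codes accepted for it.
_ALIASES = {
    "MMR": ["Myanmar", "Burma", "မြန်မာ", "MM", "MMR"],
}


def _normalize(name: str) -> str:
    """Unicode-normalize, trim and case-fold so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", name).strip().casefold()


_LOOKUP = {
    _normalize(alias): code
    for code, aliases in _ALIASES.items()
    for alias in aliases
}


def resolve_location(text: str) -> Optional[str]:
    """Map a user-supplied place name or code to a canonical code, or None if unknown."""
    return _LOOKUP.get(_normalize(text))


print(resolve_location(" burma "))   # MMR
print(resolve_location("မြန်မာ"))     # MMR
print(resolve_location("Atlantis"))  # None
```

Keeping this table at the data provider means every tool resolves "Burma" the same way, instead of each one maintaining its own partial list.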

Sector searching

GET /data/sectors?q="clean water access"

Instead of:

GET /data/sectors?code=14030&vocabulary=DAC-5

Integration with edge processing

The API layer leverages edge processing capabilities for:

  • Query interpretation and expansion
  • Standard code mapping and resolution
  • Response synthesis and formatting
  • Cross-reference resolution
  • Contextual result ranking

Benefits

For tool builders, the evolved API layer dramatically streamlines development by eliminating complex query construction and data reprocessing requirements. Teams can focus on building innovative solutions rather than managing data infrastructure, enabling rapid integration and significantly lower overhead costs. Eliminating the need for data replication further reduces technical debt and operational complexity, and keeps data fresh.

For data providers, this approach substantially decreases support requirements while driving higher utilization of their datasets. By handling complexity at the source, providers could see greater user adoption across diverse use cases and improved accessibility of their data. This centralized approach to data processing and access ultimately leads to more valuable and impactful datasets.

For the broader ecosystem, these improvements yield significant reductions in computational redundancy and in the environmental impact of needless reprocessing. The standardized yet flexible approach enhances data consistency while enabling diverse tool development. This results in more efficient resource utilization across the entire data ecosystem, fostering innovation while reducing waste. Most of all, it expands the actual use of the data, rather than leaving it to gather dust on virtual development and humanitarian shelves.

Design principles

The design should prioritize simplicity first, emphasizing intuitive query patterns and clear response formats with minimal required parameters. While maintaining flexibility, the system enforces standardized approaches through consistent response patterns and comprehensive documentation. A developer-centric focus ensures practical use cases and rapid integration capabilities, supported by clear examples. Performance optimization remains critical, implementing efficient query processing, smart caching, and optimized response sizes to ensure scalability and responsiveness.

The bottom line is that this evolution in API design fundamentally shifts complexity from tool builders to data providers, where it can be handled more efficiently and consistently. By implementing these patterns, data providers can significantly reduce barriers to entry for tool developers while improving overall system efficiency and reducing environmental impact from redundant processing. This is what will make all their hard work in making the “data” available pay off: the data will actually be used!

Conclusion

The future of development and humanitarian data systems demands more than just making information available – it requires making it intelligently accessible. By embedding intelligence directly into data architectures through vector storage, edge processing, and intuitive APIs, we can dramatically reduce the barriers for AI-powered tools while minimizing environmental impact from redundant processing. This transformation isn't just about technological advancement; it's about ensuring that the valuable data collected in development and humanitarian contexts can effectively serve its ultimate purpose: improving lives through better-informed decision making and more efficient development programming and aid delivery.

As we move forward, the focus must shift from simply storing and sharing data to building systems that actively facilitate its intelligent use. The next generation of humanitarian data infrastructure will be judged not by how much data it can hold, but by how effectively it can serve the AI-powered tools that increasingly drive development and humanitarian work.