Knowledge Graphs  |  January 6, 2025

Creating multi-dimensional knowledge graphs from humanitarian reports: A case study of Ukraine crisis documentation

with Javier Fabra (formerly Department for Evaluation at NORAD)

Summary

This exploration and proof-of-concept presents a novel methodology for extracting rich, temporally aware knowledge graphs from humanitarian sector reports, focusing on a single subset of documentation: the 200 most recent public reports covering the Ukraine crisis published on ReliefWeb. The approach can be applied to any document set, including internal documents and reports. While structured data repositories like the Humanitarian Data Exchange (HDX) provide valuable quantitative insights, substantial qualitative information remains embedded within narrative reports. Our approach combines multimodal analysis techniques to construct comprehensive multi-dimensional knowledge graphs capturing organizational relationships, geographic contexts, and temporal dynamics of activities in the humanitarian sector.

image.png

This proof of concept was built from scratch over a weekend, including the full data extraction, preparation, user interface design, and the writing of this article, demonstrating what could potentially be achieved with additional time and focused effort.

️ Try it out here: https://demos.baobabtech.ai/knowledge-graphs ⚠️ still buggy, click on nodes or refresh page as needed.


Introduction

The humanitarian sector generates vast amounts of qualitative information through narrative reports, complementing structured data repositories like HDX. While HDX offers crucial categorized metrics, most of them quantitative, narrative reports contain rich contextual information about operational relationships, geographic scope, and the temporal evolution of humanitarian responses. This research addresses the challenge of systematically extracting and structuring this qualitative information through automated knowledge graph construction.

image.png

Approach

Our goal was to create a streamlined proof-of-concept using a Large Multi-Modal model to analyze humanitarian reports. While this implementation proves that automated report processing is viable, a production system would perform better with custom-tuned visual and text models to maximize both accuracy and cost-effectiveness at scale.

Knowledge Graph Schema

The knowledge graph schema consists of three primary node or entity types:

  1. Organizations
  2. Locations
  3. Activities

These categories are deliberately broad to encompass informal entities. For instance, organizations include working groups and other collaborative bodies.

The nodes are connected through links, or edges, of three types:

  1. "OPERATES_IN" for organizations (source) operating or present in a location (target).
  2. "IMPLEMENTS" for organizations (source) implementing or related to an activity (target).
  3. "LOCATED_IN" for activities (source) located in or related to a specific place (target).

Each relationship in the graph carries temporal attributes, enabling timeline-based analysis of humanitarian operations, activities in the region, presence or mention of organizations or groups in a location, and evolution of organizational networks.
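To make these temporal attributes concrete, here is a minimal sketch of timeline filtering over links in this schema. The function name and dict shapes are illustrative, not taken from the actual codebase; undated links are treated as always active, mirroring the schema's optional dates.

```python
from datetime import date

def links_active_between(links, window_from, window_to):
    """Return links whose [date_from, date_to] range overlaps the query window.

    Links with no dates are treated as always active, matching the
    schema's optional temporal attributes.
    """
    result = []
    for link in links:
        d_from = link.get("date_from") or date.min
        d_to = link.get("date_to") or link.get("date_from") or date.max
        if d_from <= window_to and d_to >= window_from:
            result.append(link)
    return result

links = [
    {"source_id": 1, "target_id": 2, "type": "IMPLEMENTS",
     "date_from": date(2023, 1, 1), "date_to": date(2023, 12, 31)},
    {"source_id": 1, "target_id": 5, "type": "OPERATES_IN"},  # undated link
]
# Query a window in mid-2024: only the undated link overlaps it.
active = links_active_between(links, date(2024, 6, 1), date(2024, 6, 30))
```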

Multimodal processing

We used a dual-mode processing pipeline that analyzes both visual and textual elements of humanitarian reports. For each page of each PDF report, we process:

  1. Visual snapshot: Captures maps, tables, and other visual elements that contain crucial information often not fully represented in extracted text
  2. Extracted text: Processes narrative content and structured text to capture the detailed written information

The multimodal input feeds into a large language model prompted for entity and relationship extraction, enabling comprehensive interpretation of both visual and textual information sources. This architecture allows for:

  • Parallel processing of visual and textual elements
  • Cross-validation between different information modalities
  • Integrated analysis of complex report components
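The per-page fan-out can be sketched as below. `analyze_page` here is a hypothetical stub standing in for the real multimodal LLM call; only the parallelization pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_page(page):
    """Stand-in for the multimodal LLM call (hypothetical).

    `page` is a dict holding the page snapshot and extracted text;
    the real call would return extracted nodes and links.
    """
    return {"page": page["number"], "nodes": [], "links": []}

def process_report(pages, max_workers=8):
    # Each page is analyzed independently, so pages can be processed in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_page, pages))

pages = [{"number": n, "image": b"...", "text": "..."} for n in range(1, 4)]
results = process_report(pages)
```

`pool.map` preserves page order, so per-page results can later be stitched back to their source pages for traceability.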

Context Integration

Since the model processes each page in parallel, it lacks the broader context of the report, which can limit the quality of extracted information. To address this limitation, we inject contextual information through three key components:

  1. Overall report context: Including metadata about the report's origin, purpose, and scope
  2. Page-specific context: Textual content and structure of each individual page
  3. Visual representation: High-quality image capture of the PDF page

This three-part structure ensures the model maintains awareness of both the broader document context and page-specific details during analysis. While this implementation is intentionally simplified, it demonstrates the potential for more sophisticated approaches in production environments—such as injecting summaries of previous pages to address the challenge of report sections split across page breaks.
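A minimal sketch of assembling the three-part content for one page. The block layout follows Anthropic's Messages API content-block format (a list of image and text blocks); the helper name and prompt wording are illustrative assumptions.

```python
import base64

def build_page_content(report_context, page_text, page_png_bytes):
    """Assemble the three-part per-page prompt content:
    report context + page text + page image."""
    return [
        # Visual representation: capture of the PDF page, base64-encoded
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/png",
                    "data": base64.b64encode(page_png_bytes).decode("ascii")}},
        # Overall report context: origin, purpose, and scope
        {"type": "text", "text": f"Report context:\n{report_context}"},
        # Page-specific context: extracted text of this page
        {"type": "text", "text": f"Page text:\n{page_text}"},
    ]

blocks = build_page_content(
    "UNFPA Ukraine situation report, October 2024",
    "UNFPA launched a national network of 11 Survivor Relief Centres...",
    b"\x89PNG...",  # raw PNG bytes of the page snapshot
)
```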

Extractor prompt

This is the main prompt used with Anthropic’s Sonnet-3.5-2024-10-22, a multimodal model capable of processing both image and text input.

💡

In production and at scale, we would break this process down into multiple steps using smaller vision and text models, generating specific plain-English phrases that could then be transformed into the graph schema: node ↔ link ↔ node, represented here as a JSON array of node and link objects (see sample outputs below).

You are an expert in extracting information from PDF reports. We want to capture who is doing what where. Your task is to analyse the information from the content provided (text and or image) and return a JSON object for use with a knowledge graph using this schema { nodes, links, metadata}. Make sure a link ALWAYS has the 2 nodes existing): { nodes: [ {id:Int, type: Enum("org","activity","location"), label: string, act_date_from: Date, act_date_to: Date, excerpt: string}, {id:2, ... }], links:[{source_id: Int, target_id: Int, type: Enum("OPERATES_IN", "IMPLEMENTS", "LOCATED_IN"), date_from:Date, date_to:Date, excerpt: string} ], metadata: {timeline: { from: Date, to: Date}} }. For the nodes: - ONLY use either `org`, `activity`, `location` as the type. - If type `org`: the label is the organization name (could be acronym or full name). - If type `activity`: the label is the action the organization did, could be a service, supplies, etc. any action or activity, write label in 8 words max. - If type `location`: the label is the location name, try and find the specific location within the country if possible. For any `location` nodes use Administrative divisions or city/town/village names where possible (from the content provided) such as "Cherkasy Oblast" or "Uman Raion" or "Kyiv", so admin level or city/town/village. Avoid locations that aren't geographical. For `org` nodes, use the organization's name or accronym. For the links, use ONLY the following relationship types: - "OPERATES_IN" for organizations (source) operating in a location (target). - "IMPLEMENTS" for organizations (source) implementing an activity (target). - "LOCATED_IN" for activities (source) located in a specific place (target). The date_from and date_to should be dates ONLY IF mentioned in the content related to that specific action/activity, leave blank if needed. The excerpt is the exact short section of text from which you extracted the information to create the node or link. 
ALWAYS ensure every link you create has 2 nodes and the ids correspond. Generate as many as you can find on the page. If many locations are mentioned, make sure to create a location node for each one and a link between the org and the location and the activity and location (if activity was done or focused on that location). For the metadata: - The timeline is the date range of the report or any specific to and from dates the put the information in context of time/date. - If there is not end date just use the "from" If the organization is doing an activity in a location make sure to have links for both: between the org and the activity, and between the activity and location. If you can't find anything to generate from the content provided that is fine, just return { error: "nothing found" }. ONLY return the JSON object, no comments, nothing else.
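A reply produced by this prompt should be validated before ingestion. Here is a minimal, hypothetical sketch that enforces the prompt's invariants: node and link types must come from the enums, every link must reference two existing node ids, and the "nothing found" escape hatch is handled.

```python
import json

VALID_NODE_TYPES = {"org", "activity", "location"}
VALID_LINK_TYPES = {"OPERATES_IN", "IMPLEMENTS", "LOCATED_IN"}

def parse_extraction(raw):
    """Parse the model's JSON reply and enforce the prompt's invariants."""
    data = json.loads(raw)
    if "error" in data:  # the prompt's "nothing found" case
        return None
    node_ids = set()
    for node in data["nodes"]:
        if node["type"] not in VALID_NODE_TYPES:
            raise ValueError(f"bad node type: {node['type']}")
        node_ids.add(node["id"])
    for link in data["links"]:
        if link["type"] not in VALID_LINK_TYPES:
            raise ValueError(f"bad link type: {link['type']}")
        if not {link["source_id"], link["target_id"]} <= node_ids:
            raise ValueError("link references a missing node")
    return data

raw = ('{"nodes": [{"id": 1, "type": "org", "label": "UNFPA"},'
       ' {"id": 2, "type": "activity", "label": "Mobile clinics"}],'
       ' "links": [{"source_id": 1, "target_id": 2, "type": "IMPLEMENTS"}],'
       ' "metadata": {}}')
graph = parse_extraction(raw)
```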

Sample outputs

For the nodes (entities)

{
  "nodes": [
    {
      "id": 1,
      "type": "org",
      "label": "UNFPA",
      "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres"
    },
    {
      "id": 2,
      "type": "activity",
      "label": "Network of 11 Survivor Relief Centres",
      "act_date_from": "2023-01-01",
      "act_date_to": "2023-12-31",
      "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres, which provide a unique support services model for people affected by the war"
    },
    {
      "id": 3,
      "type": "activity",
      "label": "29 women-and-girl-friendly spaces",
      "act_date_from": "2023-01-01",
      "act_date_to": "2023-12-31",
      "excerpt": "Across different Ukrainian cities, 29 women-and-girl-friendly spaces were opened"
    },

For the links

"links": [
    {
      "source_id": 1,
      "target_id": 2,
      "type": "IMPLEMENTS",
      "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres"
    },
    {
      "source_id": 1,
      "target_id": 3,
      "type": "IMPLEMENTS",
      "excerpt": "Across different Ukrainian cities, 29 women-and-girl-friendly spaces were opened"
    },
    {
      "source_id": 1,
      "target_id": 4,
      "type": "IMPLEMENTS",
      "excerpt": "35,000 people received social and psychological support within the framework of 109 mobile psychosocial support (PSS) teams"
    },
    {
      "source_id": 1,
      "target_id": 5,
      "type": "IMPLEMENTS",
      "excerpt": "SRH teams opened 27 mobile clinics and one mobile maternity unit"
    },

Entity Extraction and Relationship Mapping

Processing Pipeline

The extraction process employs a page-by-page analysis using a large multimodal language model. The multimodal approach ensures capture of:

  • Visual context from maps and diagrams
  • Tabular data and structured information
  • Narrative descriptions and qualitative assessments
  • Temporal markers and operational timelines

Temporal Data

Temporal data is stored both at the node/link level and at the page/report level (metadata).

In some cases the model identified dates specific to a node. In the example below, while the report covered October 2024, the model identified a timeframe for this particular project.

"nodes": [
  {
    "id": 7,
    "type": "location",
    "label": "Novy Bug",
    "excerpt": "In December 2023, the ETC implemented a pilot Project in the Mykolaiv region, the city of Novy Bug"
  },
  {
    "id": 8,
    "type": "activity",
    "label": "Provide connectivity to Invincibility Points",
    "act_date_from": "2023-12-01",
    "excerpt": "providing data connectivity services to eight Invincibility Points (IPs). IPs are the Ukrainian authorities' project"
  }
  ....
],
"metadata": {
  "timeline": { "from": "2024-10-01", "to": "2024-10-31" }
}

Node de-duplication and grouping

Given the volume of free-text labels, we needed to de-duplicate the location and organization nodes and group the activities. We used basic text comparison of labels together with the minishlab/M2V_base_output model2vec model for fast semantic similarity comparison.

Organization and location de-duplication

The processing of the raw generated nodes and links followed two stages:

  1. Entity Deduplication
    • Organization name normalization (exact lowercase match)
    • Location entity consolidation (similarity of >0.95 using the embedding model)
    • Activity node deduplication (similarity of >0.95 using the embedding model)
  2. Activity Clustering
    • Initial parameter: 100 clusters
    • Semantic similarity analysis
    • Hierarchical relationship mapping
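The deduplication stage can be sketched as below. The character-trigram embedding here is only a stand-in for the model2vec (minishlab/M2V_base_output) embeddings used in the actual pipeline; the greedy merge logic is the point.

```python
from collections import Counter
from math import sqrt

def trigram_vector(text):
    """Toy character-trigram embedding; a stand-in for model2vec embeddings."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def dedupe(labels, threshold=0.95):
    """Greedy dedup: a label merges into the first kept label it matches,
    via exact lowercase match or embedding similarity above the threshold."""
    kept, mapping = [], {}
    for label in labels:
        for k in kept:
            if (label.lower() == k.lower()
                    or cosine(trigram_vector(label), trigram_vector(k)) >= threshold):
                mapping[label] = k
                break
        else:
            kept.append(label)
            mapping[label] = label
    return mapping

m = dedupe(["Kyiv", "KYIV", "Mykolaiv Oblast"])
```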

Unsupervised Activity Classification

Rather than imposing predetermined categories, we employ K-means clustering to discover natural groupings within extracted activities. This data-driven approach provides:

  1. Emergent patterns in humanitarian operations
  2. Natural classification of intervention types
  3. Discovery of operational relationships
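For illustration, a minimal pure-Python K-means over toy 2D "embeddings"; the pipeline would run an off-the-shelf implementation over the real activity-label embeddings, so treat this as a sketch of the algorithm only.

```python
import random
from math import dist

def kmeans(vectors, k, iters=50, seed=0):
    """Minimal K-means: assign each vector to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda i: dist(v, centroids[i]))].append(v)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

# Toy 2D "embeddings" with two obvious groups
vecs = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
centroids, clusters = kmeans(vecs, k=2)
```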

We first analyzed the model’s behaviour with the chosen 8-word-limit activity labels generated by the multimodal model:

Analyzing 307 unique activities...
Generating embeddings...
Analysis completed in 0.28 seconds

Threshold 0.9: Found 11 pairs with similarity >= 0.9
Top 5 most similar pairs in this band:
  Similarity 1.000: 'Child Protection Case Management Services' <-> 'Child protection case management services'
  Similarity 1.000: 'Mental health and psychosocial support services' <-> 'Mental Health and Psychosocial Support Services'
  Similarity 0.955: 'Partial deduplication reporting pilot' <-> 'Pilot partial deduplication case reporting'
  Similarity 0.951: 'Research study on adolescent GBV experiences' <-> 'Research on adolescent GBV experiences'
  Similarity 0.947: 'Provide child protection and GBV services' <-> 'Child protection and GBV services'

Threshold 0.85: Found 15 pairs with similarity >= 0.85
Top 5 most similar pairs in this band:
  Similarity 0.897: 'Air and shelling attacks' <-> 'Air attacks and shelling on infrastructure'
  Similarity 0.890: 'Attacks on energy infrastructure' <-> 'Aerial attacks on energy infrastructure'
  Similarity 0.887: 'Infrastructure and utilities damage assessment' <-> 'Water infrastructure and utilities assessment'
  Similarity 0.887: 'Deploy mobile data connectivity for humanitarian response' <-> 'Mobile data connectivity support for humanitarian response'
  Similarity 0.887: 'Air attacks on energy infrastructure' <-> 'Aerial attacks on energy infrastructure'

Threshold 0.8: Found 12 pairs with similarity >= 0.8
Top 5 most similar pairs in this band:
  Similarity 0.845: 'Provide education and learning support' <-> 'Education support'
  Similarity 0.838: 'Protection Cluster Coordination' <-> 'Protection Cluster Coordination Hub Operations'
  Similarity 0.834: 'Vehicle fuel availability monitoring' <-> 'Heating fuel availability monitoring'
  Similarity 0.825: 'Emergency Cash Support for Protection' <-> 'Cash for Protection assistance program'
  Similarity 0.824: 'Mine Action' <-> 'Mine Action Operations'

Threshold 0.75: Found 24 pairs with similarity >= 0.75
Top 5 most similar pairs in this band:
  Similarity 0.800: 'Operate helpdesk services' <-> 'ICT Helpdesk Services'
  Similarity 0.794: 'GBV study data collection and analysis' <-> 'Data collection and analysis'
  Similarity 0.793: 'Gender-Based Violence protection' <-> 'Gender Based Violence Response'
  Similarity 0.790: 'Analysis of frontline buffer zones' <-> 'Analysis of frontline hromadas buffer zones'
  Similarity 0.788: 'Food distribution and cash assistance' <-> 'Provide humanitarian food and cash assistance'

Picking a 0.8 threshold and clustering:

Generated 30 activity clusters with similarity >= 0.8

Top 5 clusters by size:
Cluster 1 (4 activities):
  - Child Protection
  - Child Protection Case Management Services
  - Child Protection Services
  - Child protection case management services
Cluster 2 (4 activities):
  - Aerial attacks on energy infrastructure
  - Air attacks on energy infrastructure
  - Attacks on energy infrastructure
  - Missile and drone attacks on energy infrastructure
Cluster 3 (3 activities):
  - Protection Cluster Coordination
  - Protection Cluster Coordination Hub Operations
  - Protection Coordination Hub Operations
Cluster 4 (3 activities):
  - Boost inclusive education for disabled children
  - Inclusive education initiative for 160,000 disabled children
  - Inclusive education support for disabled children
Cluster 5 (2 activities):
  - Legal Assistance for Child Protection
  - Provide child protection legal assistance

Using K-means clustering for activities

Clusters:
Cluster 1: Health assistance and medical support
  Variants (31 activities):
  - Health assistance and medical support (similarity: 0.843)
  - Health assistance services (similarity: 0.833)
  - Provide medical care services (similarity: 0.825)
  - Provide legal aid and social services (similarity: 0.785)
  - Mental health support services (similarity: 0.756)
  ... and 26 more
Cluster 2: Food distribution and cash assistance
  Variants (28 activities):
  - Food distribution and cash assistance (similarity: 0.885)
  - Provide humanitarian food and cash assistance (similarity: 0.832)
  - Distribute cash assistance for winter needs (similarity: 0.791)
  - Winter assistance and cash support to households (similarity: 0.764)
  - Distribution of agricultural inputs and cash assistance (similarity: 0.762)
  ... and 23 more
Cluster 3: Community infrastructure and needs assessment research
  Variants (28 activities):
  - Community infrastructure and needs assessment research (similarity: 0.832)
  - Infrastructure and utilities damage assessment (similarity: 0.806)
  - Water infrastructure and utilities assessment (similarity: 0.782)
  - Road infrastructure assessment and maintenance (similarity: 0.771)
  - Communal services infrastructure assessment (similarity: 0.752)
  ... and 23 more
Cluster 4: Protection and assistance for displaced people
  Variants (21 activities):
  - Protection and assistance for displaced people (similarity: 0.796)
  - Protection and critical services for refugees (similarity: 0.748)
  - Humanitarian assistance and response operations (similarity: 0.735)
  - Provide humanitarian aid and evacuation support (similarity: 0.729)
  - Displacement and evacuation operations (similarity: 0.697)
  ... and 16 more
Cluster 5: Child Protection Services
  Variants (19 activities):
  - Child Protection Services (similarity: 0.918)
  - Legal Assistance for Child Protection (similarity: 0.888)
  - Provide child protection legal assistance (similarity: 0.863)
  - Child Protection Case Management Services (similarity: 0.829)
  - Child protection case management services (similarity: 0.829)
  ... and 14 more
Cluster 6: Provide education and learning support
  Variants (16 activities):
  - Provide education and learning support (similarity: 0.784)
  - Education and early learning programs (similarity: 0.763)
  - Education and skills development programs (similarity: 0.740)
  - Education support (similarity: 0.726)
  - Gender and disability in cash programs training (similarity: 0.690)
  ... and 11 more

Activity Category nodes

We created an Activity Category node for clearer visual representation using the cluster node labels, connecting each descriptive activity node to its category using BELONGS_TO links.
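A sketch of that category-node construction. The `activity_category` type name and dict shapes are illustrative assumptions (the extraction schema itself only defines three node types):

```python
from itertools import count

def add_category_nodes(nodes, clusters, next_id):
    """Create one Activity Category node per cluster and connect each
    member activity to it with a BELONGS_TO link.

    `clusters` maps a category label to the ids of its activity nodes.
    """
    ids = count(next_id)
    links = []
    for label, activity_ids in clusters.items():
        cat_id = next(ids)
        nodes.append({"id": cat_id, "type": "activity_category", "label": label})
        for aid in activity_ids:
            links.append({"source_id": aid, "target_id": cat_id, "type": "BELONGS_TO"})
    return nodes, links

nodes = [
    {"id": 1, "type": "activity", "label": "Child Protection Case Management Services"},
    {"id": 2, "type": "activity", "label": "Provide child protection legal assistance"},
]
nodes, links = add_category_nodes(nodes, {"Child Protection Services": [1, 2]}, next_id=100)
```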

We then used Google's gemma2-9b-it through Groq's fast inference to generate labels from the cluster variants.

The detailed activity nodes (in grey) are shown only when an Activity Category node (in red) is selected.

image.png

image.png

Source tracing

Traceability back to the source is critical for this type of AI-extraction tool. We have implemented a complete trace chain back to the original nodes and links, enabling human verification, validation, and exploration.

When you select an activity, a dialog window opens showing all detailed activities and nodes traced back to the original documents, including the relevant excerpts.

image.png

You can click on the source, and an integrated PDF viewer will open the document at the exact page from which the excerpt came. (Red boxes and arrows added by author)

image.png

Unexpected Organic Activities Insights

Due to the deliberately loose restrictions on what counts as an "activity", we captured a significant number of events (rather than humanitarian activities). This produced interesting insights and opens up the question of how to distinguish events from activities. For example, details around power outages, with their temporal components nicely captured:

image.png

image.png

(Red circle and arrows added by author)

Limitations

Current methodological limitations include:

  1. Single-shot kitchen-sink approach to extraction likely led to missed nodes and links
  2. Activity clustering may oversimplify complex relationships and lose important nuances
  3. Organization and location deduplication remains incomplete
  4. Limited temporal analysis capabilities, particularly for historical querying
  5. Simplified node and edge structure creates ambiguous relationships that may not accurately represent the complexity of humanitarian operations
  6. Current approach sacrifices granular details in favor of data consolidation

Future directions

  1. Refine the extraction process to better capture actor-action relationships
  2. Implement temporal analysis tools for analyzing historical patterns
  3. Expand the dataset beyond the current 200 reports
  4. Develop more sophisticated querying mechanisms (using Cypher)
  5. Scale by introducing query-based backend rather than client-side filtering
  6. Connect with structured humanitarian data sources and taxonomies without losing detail (e.g. map to official orgs and sectors from HDX)
  7. Implement alternatives to force graphs such as trees or simple table lists for export

Conclusion

This case study shows both the potential and current constraints of using knowledge graphs to analyze humanitarian reports. While the approach can extract structured information from narrative reports, significant work remains to address technical limitations and improve accuracy. The methodology requires further development and testing before operational deployment.


🔢 Code and data

Extraction

The Python code used to extract the data, including all PDFs, extracted page images, intermediary outputs, and analysis scripts, is available here (zipped, 600 MB, mostly due to the PDFs and per-page images). We will share a git repo soon.

User interface

The visual components (React) for the UI are available here (zipped, 11 KB). I used simple react-force-2d and shadcn for this prototype UI. I suggest connecting to a Neo4j instance and querying properly with Cypher; this would allow scaling to millions of nodes.