with Javier Fabra (formerly Department for Evaluation at NORAD)
This exploration and proof-of-concept presents a novel methodology for extracting rich, temporal-aware knowledge graphs from humanitarian sector reports, focusing specifically on a single sub-set of documentation: 200 most recent public reports covering the Ukraine crisis published by Relief Web. This approach can be applied to any document set, including internal documents an reports. While structured data repositories like the Humanitarian Data Exchange (HDX) provide valuable quantitative insights, substantial qualitative information remains embedded within narrative reports. Our approach combines multimodal analysis techniques to construct comprehensive multi-dimensional knowledge graphs capturing organizational relationships, geographic contexts, and temporal dynamics of activities in the humanitarian sector.
This proof of concept was built from scratch over a weekend, including the full data extraction, preparation, user interface design, and the writing of this article - demonstrating what could potentially achieved with additional time and focused effort.
️ Try it out here: https://demos.baobabtech.ai/knowledge-graphs ⚠️ still buggy, click on nodes or refresh page as needed.
The humanitarian sector generates vast amounts of qualitative information through narrative reports, complementing structured data repositories like HDX. While HDX offers crucial categorized qualitative and mostly quantitative metrics, narrative reports contain rich contextual information about operational relationships, geographic scope, and temporal evolution of humanitarian responses. This research addresses the challenge of systematically extracting and structuring this qualitative information through automated knowledge graph construction.
Our goal was to create a streamlined proof-of-concept using a Large Multi-Modal model to analyze humanitarian reports. While this implementation proves that automated report processing is viable, a production system would perform better with custom-tuned visual and text models to maximize both accuracy and cost-effectiveness at scale.
The knowledge graph schema consists of three primary node or entity types:
These categories are deliberately broad to encompass informal entities. For instance, organizations include working groups and other collaborative bodies.
The nodes are connected though links or edges, these are:
Each relationship in the graph carries temporal attributes, enabling timeline-based analysis of humanitarian operations, activities in the region, presence or mention of organizations or groups in a location, and evolution of organizational networks.
We used a dual-mode processing pipeline that analyzes both visual and textual elements of humanitarian reports. For each page of each PDF report, we process:
The multimodal input feeds into a large language model prompted for entity and relationship extraction, enabling comprehensive interpretation of both visual and textual information sources. This architecture allows for:
Since the model processes each page in parallel, it lacks the broader context of the report, which can limit the quality of extracted information. To address this limitation, we inject contextual information through three key components:
This three-part structure ensures the model maintains awareness of both the broader document context and page-specific details during analysis. While this implementation is intentionally simplified, it demonstrates the potential for more sophisticated approaches in production environments—such as injecting summaries of previous pages to address the challenge of report sections split across page breaks.
This is the main prompt used with Anthropic’s Sonnet-3.5-2024-10-22
which is a multi-modal model capability of vision and text input.
In production and for scale, we would break down this process into multiple steps using smaller vision models and text models, generated specific plain english phrases which then could be transformed into the Graph model schema: node ↔ link ↔ node represented here as a JSON array of node and link objects (see sample outputs below)
You are an expert in extracting information from PDF reports. We want to capture who is doing what where. Your task is to analyse the information from the content provided (text and or image) and return a JSON object for use with a knowledge graph using this schema { nodes, links, metadata}. Make sure a link ALWAYS has the 2 nodes existing): { nodes: [ {id:Int, type: Enum("org","activity","location"), label: string, act_date_from: Date, act_date_to: Date, excerpt: string}, {id:2, ... }], links:[{source_id: Int, target_id: Int, type: Enum("OPERATES_IN", "IMPLEMENTS", "LOCATED_IN"), date_from:Date, date_to:Date, excerpt: string} ], metadata: {timeline: { from: Date, to: Date}} }. For the nodes: - ONLY use either `org`, `activity`, `location` as the type. - If type `org`: the label is the organization name (could be acronym or full name). - If type `activity`: the label is the action the organization did, could be a service, supplies, etc. any action or activity, write label in 8 words max. - If type `location`: the label is the location name, try and find the specific location within the country if possible. For any `location` nodes use Administrative divisions or city/town/village names where possible (from the content provided) such as "Cherkasy Oblast" or "Uman Raion" or "Kyiv", so admin level or city/town/village. Avoid locations that aren't geographical. For `org` nodes, use the organization's name or accronym. For the links, use ONLY the following relationship types: - "OPERATES_IN" for organizations (source) operating in a location (target). - "IMPLEMENTS" for organizations (source) implementing an activity (target). - "LOCATED_IN" for activities (source) located in a specific place (target). The date_from and date_to should be dates ONLY IF mentioned in the content related to that specific action/activity, leave blank if needed. The excerpt is the exact short section of text from which you extracted the information to create the node or link. ALWAYS ensure every link you create has 2 nodes and the ids correspond. Generate as many as you can find on the page. If many locations are mentioned, make sure to create a location node for each one and a link between the org and the location and the activity and location (if activity was done or focused on that location). For the metadata: - The timeline is the date range of the report or any specific to and from dates the put the information in context of time/date. - If there is not end date just use the "from" If the organization is doing an activity in a location make sure to have links for both: between the org and the activity, and between the activity and location. If you can't find anything to generate from the content provided that is fine, just return { error: "nothing found" }. ONLY return the JSON object, no comments, nothing else.
For the nodes (entities)
{ "nodes": [ { "id": 1, "type": "org", "label": "UNFPA", "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres" }, { "id": 2, "type": "activity", "label": "Network of 11 Survivor Relief Centres", "act_date_from": "2023-01-01", "act_date_to": "2023-12-31", "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres, which provide a unique support services model for people affected by the war" }, { "id": 3, "type": "activity", "label": "29 women-and-girl-friendly spaces", "act_date_from": "2023-01-01", "act_date_to": "2023-12-31", "excerpt": "Across different Ukrainian cities, 29 women-and-girl-friendly spaces were opened" },
For the links
"links": [ { "source_id": 1, "target_id": 2, "type": "IMPLEMENTS", "excerpt": "UNFPA launched a national network of 11 Survivor Relief Centres" }, { "source_id": 1, "target_id": 3, "type": "IMPLEMENTS", "excerpt": "Across different Ukrainian cities, 29 women-and-girl-friendly spaces were opened" }, { "source_id": 1, "target_id": 4, "type": "IMPLEMENTS", "excerpt": "35,000 people received social and psychological support within the framework of 109 mobile psychosocial support (PSS) teams" }, { "source_id": 1, "target_id": 5, "type": "IMPLEMENTS", "excerpt": "SRH teams opened 27 mobile clinics and one mobile maternity unit" },
The extraction process employs a page-by-page analysis using a high-level language model. The multimodal approach ensures capture of:
Temporal data is stored at the node/link level and the page/report level (metadata)
In some cases the model identified dates that are specific to the node. In this case we see that while the report was in October 2024, the model identified a time for this specific project.
nodes: [ { "id": 7, "type": "location", "label": "Novy Bug", "excerpt": "In December 2023, the ETC implemented a pilot Project in the Mykolaiv region, the city of Novy Bug" }, { "id": 8, "type": "activity", "label": "Provide connectivity to Invincibility Points", "act_date_from": "2023-12-01", "excerpt": "providing data connectivity services to eight Invincibility Points (IPs). IPs are the Ukrainian authorities' project" } .... ], { "metadata": { "timeline": { "from": "2024-10-01", "to": "2024-10-31" }
Given the number of open free text, we need to de-duplicate the location and organization nodes and group the activities, we used basic text comparison of labels and the minishlab/M2V_base_output
model2vec model for fast semantic similarity comparison.
The processing of the raw generated nodes and links followed two stages:
Rather than imposing predetermined categories, we employ K-means clustering to discover natural groupings within extracted activities. This data-driven approach provides:
We first analyzed the model’s behaviour with the chosen 8 word limit activity labels generated/extract from the multi-modal model:
Analyzing 307 unique activities... Generating embeddings... Analysis completed in 0.28 seconds Threshold 0.9: Found 11 pairs with similarity >= 0.9 Top 5 most similar pairs in this band: Similarity 1.000: 'Child Protection Case Management Services' <-> 'Child protection case management services' Similarity 1.000: 'Mental health and psychosocial support services' <-> 'Mental Health and Psychosocial Support Services' Similarity 0.955: 'Partial deduplication reporting pilot' <-> 'Pilot partial deduplication case reporting' Similarity 0.951: 'Research study on adolescent GBV experiences' <-> 'Research on adolescent GBV experiences' Similarity 0.947: 'Provide child protection and GBV services' <-> 'Child protection and GBV services' Threshold 0.85: Found 15 pairs with similarity >= 0.85 Top 5 most similar pairs in this band: Similarity 0.897: 'Air and shelling attacks' <-> 'Air attacks and shelling on infrastructure' Similarity 0.890: 'Attacks on energy infrastructure' <-> 'Aerial attacks on energy infrastructure' Similarity 0.887: 'Infrastructure and utilities damage assessment' <-> 'Water infrastructure and utilities assessment' Similarity 0.887: 'Deploy mobile data connectivity for humanitarian response' <-> 'Mobile data connectivity support for humanitarian response' Similarity 0.887: 'Air attacks on energy infrastructure' <-> 'Aerial attacks on energy infrastructure' Threshold 0.8: Found 12 pairs with similarity >= 0.8 Top 5 most similar pairs in this band: Similarity 0.845: 'Provide education and learning support' <-> 'Education support' Similarity 0.838: 'Protection Cluster Coordination' <-> 'Protection Cluster Coordination Hub Operations' Similarity 0.834: 'Vehicle fuel availability monitoring' <-> 'Heating fuel availability monitoring' Similarity 0.825: 'Emergency Cash Support for Protection' <-> 'Cash for Protection assistance program' Similarity 0.824: 'Mine Action' <-> 'Mine Action Operations' Threshold 0.75: Found 24 pairs with similarity >= 0.75 Top 5 most similar pairs in this band: Similarity 0.800: 'Operate helpdesk services' <-> 'ICT Helpdesk Services' Similarity 0.794: 'GBV study data collection and analysis' <-> 'Data collection and analysis' Similarity 0.793: 'Gender-Based Violence protection' <-> 'Gender Based Violence Response' Similarity 0.790: 'Analysis of frontline buffer zones' <-> 'Analysis of frontline hromadas buffer zones' Similarity 0.788: 'Food distribution and cash assistance' <-> 'Provide humanitarian food and cash assistance'
Picking 0.8 threshold and clustering:
Generated 30 activity clusters with similarity >= 0.8 Top 5 clusters by size: Cluster 1 (4 activities): - Child Protection - Child Protection Case Management Services - Child Protection Services - Child protection case management services Cluster 2 (4 activities): - Aerial attacks on energy infrastructure - Air attacks on energy infrastructure - Attacks on energy infrastructure - Missile and drone attacks on energy infrastructure Cluster 3 (3 activities): - Protection Cluster Coordination - Protection Cluster Coordination Hub Operations - Protection Coordination Hub Operations Cluster 4 (3 activities): - Boost inclusive education for disabled children - Inclusive education initiative for 160,000 disabled children - Inclusive education support for disabled children Cluster 5 (2 activities): - Legal Assistance for Child Protection - Provide child protection legal assistance
Clusters: Cluster 1: Health assistance and medical support Variants (31 activities): - Health assistance and medical support (similarity: 0.843) - Health assistance services (similarity: 0.833) - Provide medical care services (similarity: 0.825) - Provide legal aid and social services (similarity: 0.785) - Mental health support services (similarity: 0.756) ... and 26 more Cluster 2: Food distribution and cash assistance Variants (28 activities): - Food distribution and cash assistance (similarity: 0.885) - Provide humanitarian food and cash assistance (similarity: 0.832) - Distribute cash assistance for winter needs (similarity: 0.791) - Winter assistance and cash support to households (similarity: 0.764) - Distribution of agricultural inputs and cash assistance (similarity: 0.762) ... and 23 more Cluster 3: Community infrastructure and needs assessment research Variants (28 activities): - Community infrastructure and needs assessment research (similarity: 0.832) - Infrastructure and utilities damage assessment (similarity: 0.806) - Water infrastructure and utilities assessment (similarity: 0.782) - Road infrastructure assessment and maintenance (similarity: 0.771) - Communal services infrastructure assessment (similarity: 0.752) ... and 23 more Cluster 4: Protection and assistance for displaced people Variants (21 activities): - Protection and assistance for displaced people (similarity: 0.796) - Protection and critical services for refugees (similarity: 0.748) - Humanitarian assistance and response operations (similarity: 0.735) - Provide humanitarian aid and evacuation support (similarity: 0.729) - Displacement and evacuation operations (similarity: 0.697) ... and 16 more Cluster 5: Child Protection Services Variants (19 activities): - Child Protection Services (similarity: 0.918) - Legal Assistance for Child Protection (similarity: 0.888) - Provide child protection legal assistance (similarity: 0.863) - Child Protection Case Management Services (similarity: 0.829) - Child protection case management services (similarity: 0.829) ... and 14 more Cluster 6: Provide education and learning support Variants (16 activities): - Provide education and learning support (similarity: 0.784) - Education and early learning programs (similarity: 0.763) - Education and skills development programs (similarity: 0.740) - Education support (similarity: 0.726) - Gender and disability in cash programs training (similarity: 0.690) ... and 11 more
We created an Activity Category
node for clearer visual representation using the cluster node labels, connecting each descriptive activity node to its category using BELONGS_TO
links.
We then used Google's gemini-9b-it
through Groq's fast inference to generate labels from the cluster variants.
The detailed Activity
type nodes (in grey) only show when an Activity Category
node (in red) is selected
Traceability back to the source is critical for this type of AI-extraction tool. We have implemented a complete trace chain back to the original nodes and links, enabling human verification, validation, and exploration.
When you select an activity, a dialog window opens showing all detailed activities and nodes traced back to the original documents, including the relevant excerpts.
You can click on the source and an integrated PDF viewer will open the document at the exact page from which the excerpt came from. (Red boxes and arrows added by author)
Due to the very loose restrictions on what "activities" should be captured, we captured a significant number of events (rather than humanitarian activities), providing interesting insights and stimulating the opportunity to explore event
versus activity
distinctions. For example, details around power outages, which included nicely captured temporal components:
(Red circle and arrows added by author)
Current methodological limitations include:
This case study shows both the potential and current constraints of using knowledge graphs to analyze humanitarian reports. While the approach can extract structured information from narrative reports, significant work remains to address technical limitations and improve accuracy. The methodology requires further development and testing before operational deployment.
The python code used to extract the data, including all PDFs and extract images and intermediary outputs, and analysis scripts are available here (zipped 600mb - mostly due to pdfs and images (1 per page), will share git repo soon.
The visual components (react) for the UI interface is available here (zipped 11kb): I used simple react-force-2d and shadcn for this prototyped ui. Suggest connecting to a neo4j instance and proper querying using cypher. This would allow to scale to millions of nodes.