Evaluations, Evidence  |  March 22, 2025

Using AI for systematic program analysis

TL;DR - We introduce a new method for reviewing development and humanitarian programs using AI. The approach pairs large language models (LLMs) with human expertise to produce results within minutes rather than months. The integration of AI and human judgment aims to enhance the scale, consistency, speed, and depth of program analysis while maintaining accountability and learning. The methodology combines structured rubrics, AI-powered document analysis, and human-in-the-loop verification. The framework is demonstrated through an example of assessing whether development programs were locally-led across six dimensions. The goal is a scalable, transparent, and systematic framework for assessing program effectiveness and alignment with key principles. The process includes setting up the framework, conducting intermediate and final analyses, and ensuring quality control. A proof of concept is being built and will be shared soon.

Introduction

The development and humanitarian sectors generate vast amounts of documentation about their programs, including project evaluations, logical frameworks, theories of change, and various monitoring reports. However, systematically analyzing this wealth of information to assess program effectiveness and alignment with key principles has traditionally been a labor-intensive process. This means that critical lessons about what works and what doesn't are often lost or hidden. This article explores an innovative approach that combines the analytical capabilities of large language models (LLMs) with human expertise to create a scalable, transparent, and systematic program analysis framework.

The challenge of program analysis

Development and humanitarian programs are complex interventions with multiple stakeholders, objectives, and outcomes. Traditional program analysis often faces several challenges:

  1. Volume of Information: Programs generate extensive documentation across - and sometimes after - their lifecycle
  2. Consistency: Different analysts may interpret program elements differently
  3. Traceability: Linking conclusions to specific evidence can be challenging
  4. Scalability: In-depth analysis requires significant time and expertise
  5. Standardization: Comparing programs across different contexts needs structured frameworks

A novel analytical framework

Our approach combines human expertise in framework design with AI's capability to process and analyze large volumes of text. The framework consists of three main components:

  1. Structured rubrics
  2. AI-powered document analysis (multi-step)
  3. Human verification of intermediate outputs

The human should have expertise in the area of analysis. In our example case, the human had decades of experience in evaluating water, sanitation, and hygiene (WASH) programs.

The Analysis Process

Our methodology follows a systematic approach. While we use the assessment of locally-led development programs as our primary example, it's important to note that this framework can be adapted for various types of program analysis. Below, we indicate which elements are specific to our locally-led development case study and which are part of the general framework, and we provide guidance for adapting the approach to other applications.

Phase 1: Framework Setup and Data Preparation

1.1. Rubric Development

The foundation of the analysis is a carefully designed rubric system. While the specific rubrics or dimensions will vary based on what's being assessed, the process of developing these rubrics remains consistent:

  1. Rubric Definition

    • Identifying key areas for assessment
    • Developing clear definitions for each dimension
    • Establishing assessment boundaries and scope
  2. Question Framework

    • Creating objective questions for each dimension
    • Designing prompts that elicit specific, measurable responses
    • Establishing scoring criteria and thresholds
  3. Evidence Requirements

    • Defining acceptable types of evidence
    • Setting standards for evidence quality

Taking the example of assessing locally-led development programs, we developed six key rubrics.

Rubric 1: Stakeholders
Definition: All of the actors engaged in the program.
Objective questions:
  1. Create a stakeholder map.
  2. Use this map to determine whether each actor is local or not. Examples: local staff of foreign-led funders and international organizations are not local; local means their headquarters is in the community, city, country, or region where the work is done.

Rubric 2: Local Agenda Setting
Definition: The program's priorities and objectives are determined primarily by local actors during the design phase, prior to implementation. Local individuals, communities, or organizations play a leading role in identifying needs and setting the agenda. External partners may provide input, but final decisions on program focus align closely with local perspectives and desires.
Objective questions:
  1. Who determined the priorities and objectives for the program? For example, the program objectives align with national or regional WASH policy or needs assessments; are based on community surveys conducted before the program was designed; or were determined based on the funder's priorities/strategies (e.g., U.S. Global Water Strategy).
  2. When were the program priorities and objectives determined and by whom?

Rubric 3: Local Solution Development
Definition: Solutions and strategies are primarily developed by local stakeholders. The program demonstrates a strong reliance on local knowledge and expertise in designing interventions. External partners may offer technical support, but the core ideas and approaches originate from or are significantly shaped by local actors.
Objective questions:
  1. What technical solutions and approaches were used in the program, and how were they selected and by whom?

Rubric 4: Local Resource Mobilization
Definition: The program leverages local resources, both human and financial, to a significant degree. Local actors contribute meaningfully to the program's implementation, either through direct funding, in-kind contributions, or by providing essential human resources. The program builds on existing local capacities rather than relying primarily on external inputs such as foreign consultants.
Objective questions:
  1. What resources went into the program (financial, staffing, materials, other)?
  2. What percentage of those resources did local actors contribute to the program?

Rubric 5: Local Decision-Making Power
Definition: Key decisions throughout the program cycle are made predominantly by local stakeholders. This includes decisions on resource allocation, implementation strategies, and adaptive management. The governance structure of the program gives substantial weight to local voices in steering the initiative.
Objective questions:
  1. What decisions needed to be made during the program cycle?
  2. How were those decisions made and by whom?
  3. What were the success metrics based on?

Rubric 6: Capacity Building for Sustainability
Definition: The program has a clear focus on strengthening local capacities for long-term self-reliance. It includes specific components or strategies aimed at enhancing local leadership, technical skills, and organizational capabilities. The program design anticipates and prepares for a gradual transition to full local ownership and management.
Objective questions:
  1. Who are the actors responsible for maintaining the outcomes?
  2. What type of capacity building activities were part of the program, and how did they support the responsible actors?
💡 The definitions are used as context for the LLM during extraction, and the questions are provided to the LLM, which is prompted to answer them using the content of the program documents (see below), generating traceable intermediate outputs.
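To make this concrete, each rubric can be held in a small data structure and expanded into an extraction prompt that carries its definition as context. The sketch below is illustrative only: the class name, the prompt wording, and the abridged dimension shown are assumptions, not the exact prompts used in our experiment.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    name: str
    definition: str        # used as context for the LLM during extraction
    questions: list[str]   # objective questions the LLM is asked to answer

# One of the six dimensions from the table above (definition abridged).
local_agenda_setting = Rubric(
    name="Local Agenda Setting",
    definition=(
        "The program's priorities and objectives are determined primarily by "
        "local actors during the design phase, prior to implementation."
    ),
    questions=[
        "Who determined the priorities and objectives for the program?",
        "When were the program priorities and objectives determined and by whom?",
    ],
)

def build_extraction_prompt(rubric: Rubric, document_text: str) -> str:
    """Assemble a prompt asking the LLM to answer the rubric's questions using
    only the supplied program document, with traceable citations."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(rubric.questions))
    return (
        f"Dimension: {rubric.name}\n"
        f"Definition (context): {rubric.definition}\n\n"
        "Answer the following questions using ONLY the document below. "
        "Quote the exact supporting text and give the page number for each answer. "
        "If the document does not contain the answer, say 'not found'.\n\n"
        f"Questions:\n{numbered}\n\n"
        f"Document:\n{document_text}"
    )
```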

1.2 Data Collection and Organization

A core component of any AI-powered program analysis framework lies in how we organize and present documentation to the system. While sophisticated data integration methods exist, we recommend starting with the simplest approach: a well-curated folder of program documents.

For the purpose of our experiment, we downloaded various documents from the USAID Development Experience Clearinghouse (DEC) and the Global Waters.org websites, both of which have been shut down as of January 2025.

Large language models (LLMs) can effectively differentiate between document types, making it possible to start with a basic collection of:

  • Program evaluations and inception reports
  • Logical frameworks and theories of change
  • Work plans and implementation updates
  • Budget documents and financial reports
  • Stakeholder meeting minutes and feedback

This focused approach offers several advantages. First, it minimizes noise that typically comes from broader data sources like web searches or organizational databases. Second, it allows for direct control over document quality and relevance. Third, it provides a clear chain of evidence for any conclusions drawn by the analysis.

As your analysis needs grow, the framework can scale to incorporate more sophisticated data access methods:

  • Web-based document repositories
  • Organizational knowledge management systems
  • API connections to program databases
  • Automated document retrieval systems

During this exercise, we found that starting with a broader set of documents introduced unnecessary complexity and noise into the analysis. We recommend beginning with a carefully curated set of high-quality program documents that directly support your assessment objectives. This allows you to refine the analysis process before expanding to more complex data sources.

LLMs excel at understanding document context and content - they can readily distinguish between an evaluation report and a blog post, or between a logical framework and a financial statement. This capability means we can focus more on document quality and relevance rather than rigid classification systems or complex metadata frameworks.
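As a minimal sketch of this starting point, the snippet below reads a curated folder of PDFs and asks an LLM to label each one with a document type before any rubric analysis begins. The folder name, type labels, and model identifier are placeholders, and pypdf plus the Anthropic SDK are simply one possible tooling choice.

```python
from pathlib import Path

import anthropic          # any LLM client could be substituted
from pypdf import PdfReader

client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-7-sonnet-latest"        # placeholder model identifier

DOC_TYPES = ["evaluation report", "logical framework", "work plan",
             "budget/financial report", "meeting minutes", "other"]

def load_text(path: Path, max_pages: int = 3) -> str:
    """Extract text from the first few pages; usually enough for classification."""
    reader = PdfReader(str(path))
    n = min(max_pages, len(reader.pages))
    return "\n".join(reader.pages[i].extract_text() or "" for i in range(n))

def classify_document(text: str) -> str:
    """Ask the LLM which document type this excerpt most likely belongs to."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=20,
        system="You classify development program documents.",
        messages=[{"role": "user", "content":
                   f"Classify this document as one of {DOC_TYPES}. "
                   f"Reply with the label only.\n\n{text[:4000]}"}],
    )
    return msg.content[0].text.strip()

# Build a small corpus from the curated folder of program documents.
corpus = {}
for pdf in Path("program_documents").glob("*.pdf"):
    text = load_text(pdf)
    corpus[pdf.name] = {"type": classify_document(text), "text": text}
```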

2. AI-Powered document analysis

The framework leverages large language models (at least Claude 3.7 Sonnet, Mistral Large, or Llama 3 70B) to:

  • Classify then process program documentation
  • Identify relevant evidence for each rubric dimension
  • Answer structured questions about program elements
  • Generate intermediate analytical outputs that include the supporting text segments

For our example, the first dimension involves custom processing of the documentation to generate a well-documented stakeholder map. Here is the pseudo-code for it:

First, we define what we want to capture about each stakeholder:

STAKEHOLDER INFORMATION TO CAPTURE:
  - Name of Organization
  - Location of Headquarters
  - Is the organization local? (based on HQ location vs project location)
  - Role in Project (must be one of):
    * Donor
    * Implementing Partner
    * Technical Partner
    * Research Partner
    * Evaluation Partner
    * Government
  - Description of their involvement
  - When they were involved in the project
  - Source of this information (page number and exact text from documents)

Then, we give the AI clear instructions on how to analyze the documents:

INSTRUCTION PROMPT TO AI:

"You are analyzing program documents to create a stakeholder map. Key rules:
1. Only use explicitly stated information from the provided documents
2. Do not make assumptions about organizations or their roles
3. If information is missing, mark it as unknown
4. For each stakeholder found, provide:
  - The exact text snippet that mentions them
  - The page number where they are mentioned
  - Clear evidence for their role classification
5. Mark an organization as 'local' only if there is clear evidence their headquarters is in the project implementation country"

The extraction process follows this logic for *each and every document* that is available.
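A sketch of that per-document loop is shown below, assuming the instruction prompt above is extended to request JSON so that every stakeholder row keeps its source quote and page number. The field names, the JSON-output convention, and the model identifier are illustrative assumptions rather than the exact implementation.

```python
import json

import anthropic                     # any LLM client could be substituted

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"   # placeholder model identifier

STAKEHOLDER_PROMPT = """You are analyzing program documents to create a stakeholder map.
Key rules:
1. Only use explicitly stated information from the provided documents
2. Do not make assumptions about organizations or their roles
3. If information is missing, mark it as unknown
4. For each stakeholder found, provide the exact text snippet, the page number,
   and clear evidence for the role classification
5. Mark an organization as 'local' only if there is clear evidence their
   headquarters is in the project implementation country
Return a JSON list of objects with the keys: name, headquarters, is_local,
role, description, involvement_period, source_quote, source_page."""

def extract_stakeholders(document_text: str) -> list[dict]:
    """Run the stakeholder-map extraction over one document and parse the JSON reply."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        system=STAKEHOLDER_PROMPT,
        messages=[{"role": "user", "content": document_text}],
    )
    # In practice the reply may need cleaning (e.g., stripping code fences) before parsing.
    return json.loads(msg.content[0].text)

documents: dict[str, str] = {}       # filename -> extracted text, e.g. from the loading sketch above

# Repeat for each and every available document, keeping the evidence trail per row.
stakeholder_map: list[dict] = []
for doc_name, doc_text in documents.items():
    for row in extract_stakeholders(doc_text):
        row["source_document"] = doc_name
        stakeholder_map.append(row)
```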

This process ensures:

  • Only factual information is captured
  • Every piece of information is traceable to source documents
  • Clear distinction between known and unknown information
  • Consistent classification of stakeholder roles
  • Accurate assessment of "local" status based on headquarters location

3. Human-in-the-Loop Verification

The human element remains crucial for:

  • Defining and refining assessment frameworks
  • Validating AI-generated findings
  • Providing context-specific interpretation
  • Making final assessment decisions
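One lightweight way to implement this checkpoint, offered here as an assumption rather than the exact workflow we used, is to round-trip intermediate outputs through a spreadsheet: the AI-extracted rows go out for expert review, and only the verified rows feed the later scoring steps.

```python
import csv

FIELDS = ["name", "headquarters", "is_local", "role",
          "source_quote", "source_page", "source_document", "reviewer_note"]

def export_for_review(rows: list[dict], path: str = "stakeholders_for_review.csv") -> None:
    """Write AI-extracted rows so the expert can correct or annotate them in a spreadsheet."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

def load_reviewed(path: str = "stakeholders_reviewed.csv") -> list[dict]:
    """Read the expert-verified rows back in; only these feed the scoring modules."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```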

Phase 2: Intermediate Outputs and Analysis

The intermediate analysis phase varies based on the type of assessment being conducted. This builds an audit and reference trail for each rubric and provides semantic data for further analysis.

Here's how it worked in our locally-led development example. For each of the dimensions in the rubric, we used AI to extract and record a list of excerpts from the documentation available. This includes answering the questions. For example:

Local Agenda Setting: "The program team organized multiple design meetings prior to the start led by Country Org X" (p.12, Business Case.pdf)

The human expert reviewed and refined the rubric definitions to ensure that relevant information was extracted.
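A compact sketch of that per-dimension extraction loop is given below; the system prompt wording, the evidence_log structure, and the model identifier are assumptions made for illustration, and only one of the six rubrics is spelled out.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"   # placeholder model identifier

documents: dict[str, str] = {}       # filename -> extracted text (see the loading sketch above)

rubrics = {
    "Local Agenda Setting": {
        "definition": "Priorities and objectives are determined primarily by local actors "
                      "during the design phase, prior to implementation.",
        "questions": ["Who determined the priorities and objectives for the program?",
                      "When were the program priorities and objectives determined and by whom?"],
    },
    # ...the other five dimensions follow the same shape
}

def extract_evidence(name: str, definition: str, questions: list[str],
                     doc_name: str, doc_text: str) -> str:
    """Answer one rubric's questions from one document, quoting page-referenced excerpts."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system=(f"You extract evidence for the dimension '{name}'. Context: {definition} "
                "Quote the exact text and page number for every answer; "
                "write 'not found' when the document is silent."),
        messages=[{"role": "user",
                   "content": f"Questions:\n{numbered}\n\nDocument ({doc_name}):\n{doc_text}"}],
    )
    return msg.content[0].text

# Intermediate output: an evidence log per rubric dimension, across all documents.
evidence_log = {name: [extract_evidence(name, spec["definition"], spec["questions"], d, t)
                       for d, t in documents.items()]
                for name, spec in rubrics.items()}
```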

Phase 3: Final Analysis

We generated multiple outputs to form the analysis of the program.

3.1 Scoring

We fed the full extracted segments into the scoring AI module for each rubric dimension. This produced a low, medium, or high score based on:

  • Evidence quality and quantity
  • Alignment with dimension criteria
  • Contextual factors
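As a sketch of how the scoring module might be called on the verified evidence, under the same assumptions as the earlier snippets (placeholder model identifier, JSON-shaped reply), consider the following:

```python
import json

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"   # placeholder model identifier

def score_dimension(name: str, definition: str, evidence: list[str]) -> dict:
    """Return a low/medium/high score plus an evidence-based justification for one rubric."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=800,
        system=("You are scoring a development program against the dimension "
                f"'{name}'. Definition: {definition} "
                "Score strictly from the evidence provided, weighing evidence quality "
                "and quantity, alignment with the definition, and contextual factors. "
                'Reply as JSON: {"score": "low|medium|high", "justification": "..."}'),
        messages=[{"role": "user", "content": "\n\n".join(evidence)}],
    )
    return json.loads(msg.content[0].text)

# Example usage with the evidence log built in Phase 2:
# scores = {name: score_dimension(name, rubrics[name]["definition"], excerpts)
#           for name, excerpts in evidence_log.items()}
```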

3.2 Narrative Development

The AI narrative module then generated detailed justifications for the scores from the intermediate outputs, analyzing program strengths and weaknesses and providing evidence-based explanations of the findings.

3.3 Contextualized Summary

We generated a comprehensive executive program summary from the locally-led perspective, incorporating key findings across dimensions, critical success factors, and areas for improvement.

Key Learnings so far

Prompt Engineering Impact

Our experimentation has unsurprisingly revealed the significant impact of prompt design on analysis quality:

  • Specific prompting strategies yield more consistent results
  • Balanced positive/negative inquiry produces more nuanced analysis; for example, prompting with "You are a very critical programme reviewer…" rather than "You are a programme reviewer" creates significant differences in extraction and synthesis tasks (see the sketch after this list)
  • Structured question formats improve evidence gathering
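The effect of the reviewer persona can be tested directly by running the same extraction task under both system prompts and comparing the outputs. The harness below is a sketch under the same placeholder assumptions as earlier (model identifier, task wording, document text).

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"   # placeholder model identifier

PERSONAS = {
    "neutral": "You are a programme reviewer.",
    "critical": "You are a very critical programme reviewer.",
}

TASK = ("Extract the evidence in this document about who set the program's priorities, "
        "quoting the exact text and page numbers.")

def run_with_persona(persona: str, document_text: str) -> str:
    """Run the same extraction task under one of the two reviewer personas."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1500,
        system=PERSONAS[persona],
        messages=[{"role": "user", "content": f"{TASK}\n\n{document_text}"}],
    )
    return msg.content[0].text

document_text = "..."                # any program document from the curated folder
for persona in PERSONAS:
    print(f"--- {persona} ---")
    print(run_with_persona(persona, document_text))
```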

Framework Adaptability

While our example focuses on locally-led development assessment, the framework is adaptable to various analysis needs:

  • Program effectiveness evaluation
  • Sustainability assessment
  • Stakeholder engagement analysis
  • Implementation quality review
  • Thematic analysis, such as comparing programs against each other

Future Applications and Implications

This approach opens new possibilities for program analysis:

Speed and Scale

  • Rapid processing of large program portfolios
  • Consistent analysis across multiple contexts
  • Reduced resource requirements for comprehensive assessment

Knowledge Management

  • Systematic capture of program insights
  • Enhanced institutional learning
  • Improved evidence base for decision-making

Quality Improvement

  • Early identification of program strengths and weaknesses
  • More frequent and comprehensive assessments
  • Better-informed program adaptations

Conclusion

The integration of AI capabilities with human expertise offers a promising path forward for program analysis in the development and humanitarian sectors. This approach maintains the crucial role of human judgment while leveraging technology to enhance the scale, consistency, and depth of analysis possible.

By maintaining clear documentation of the analysis process and ensuring traceability of findings, the framework supports both accountability and learning. As AI capabilities continue to evolve, this human-in-the-loop approach provides a foundation for increasingly sophisticated program assessment while ensuring that analysis remains grounded in sector expertise and contextual understanding.

The framework's success in analyzing locally-led development programs demonstrates its potential for broader application across various types of development and humanitarian interventions. Future refinements will likely focus on expanding the range of assessment criteria, improving prompt engineering techniques, and developing more sophisticated methods for synthesizing findings across multiple programs and contexts.

Acknowledgements

This work is being developed in collaboration with Susan Davis, an accomplished international development expert. Susan brings decades of expertise in program evaluation and is particularly focused on driving investments to locally-led development approaches. While working on her book about international development experiences, she is exploring innovative ways to leverage AI for synthesizing learnings from development program evaluations. Her current work spans philanthropic advising, activism for effective development, and strategic consulting for locally-led social impact organizations. Her deep understanding of program evaluation and her commitment to advancing locally-led development have been instrumental in shaping this analytical framework.

Contact

If you wish to apply this tool in your organization or work, reach out to us:

Olivier Mills

olivier@baobabtech.ai

Founder & CEO, Baobab Tech

Susan Davis

washsmd@gmail.com

Philanthropic advisor championing equitable social innovation