This research uses Natural Language Processing (NLP) and machine learning to analyze development finance at the level of individual projects. Instead of relying on self-declared project classifications (such as the OECD’s Rio markers), our approach applies AI-driven topic modeling to extract thematic patterns from millions of project descriptions.
🔍 Step 1: Data Collection & Preprocessing
We analyze more than 5 million project descriptions from the OECD Creditor Reporting System (CRS) dataset, extracting the relevant text fields and cleaning them before modeling.
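As an illustration of what such cleaning can involve, here is a minimal sketch using only the Python standard library. The record keys (`title`, `description`), the `min_chars` threshold, and the specific cleaning rules are assumptions made for this example; they are not the project's actual pipeline or the CRS column names.

```python
import html
import re
import unicodedata


def clean_description(text):
    """Normalize one free-text project description."""
    if text is None:
        return ""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = re.sub(r"<[^>]+>", " ", text)        # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text


def preprocess(records, min_chars=10):
    """Merge title and description, clean them, and skip near-empty texts.

    `records` is a list of dicts with hypothetical keys "title" and
    "description"; every other key is passed through unchanged.
    """
    kept = []
    for rec in records:
        parts = [clean_description(rec.get("title")),
                 clean_description(rec.get("description"))]
        text = ". ".join(p for p in parts if p)
        if len(text) < min_chars:
            continue  # too short to carry a usable topic signal
        kept.append({**rec, "text": text})
    return kept
```

Records are kept one-to-one rather than deduplicated, so that financial amounts attached to projects with identical descriptions are not lost in later aggregation.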
🧠 Step 2: AI-Powered Text Embedding
Using a BERT-based transformer model, we convert project descriptions into dense vector representations, capturing contextual meaning across multiple languages.
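A BERT-style encoder produces one vector per token, so a pooling step is needed to obtain a single vector per description. The sketch below shows masked mean pooling and cosine similarity in plain Python on toy numbers. Mean pooling is an assumption chosen for illustration, since the pooling strategy of the model used here is not stated, and the token vectors are made up rather than real model output.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [value / count for value in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example: three token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

Descriptions with similar meaning should end up with a high cosine similarity between their pooled vectors, which is what the clustering step relies on.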
📊 Step 3: Unsupervised Clustering
The HDBSCAN clustering algorithm groups projects into 406 distinct thematic clusters, giving a finer-grained classification than the predefined OECD sectors.
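HDBSCAN builds its cluster hierarchy on mutual reachability distances rather than raw distances, which pushes isolated points away from dense groups. The sketch below computes only that one ingredient in plain Python; it is not a full HDBSCAN implementation. The toy points, the parameter name `k`, and the convention of excluding the point itself when finding its k-th nearest neighbor are assumptions for illustration.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """Matrix of max(core(a), core(b), dist(a, b)) for every pair a != b."""
    cores = core_distances(points, k)
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = max(cores[i], cores[j],
                                   math.dist(points[i], points[j]))
    return matrix


# Three nearby points and one isolated point (toy 2-D "embeddings").
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
mreach = mutual_reachability(points, k=2)
```

In this toy set, every distance involving the isolated fourth point is at least its own large core distance, which is why density-based methods tend to treat such points as noise rather than forcing them into a cluster.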
🏷 Step 4: Automated Labeling
Each cluster is first labeled with its top class-based TF-IDF (c-TF-IDF) keywords; these labels are then refined with Large Language Models (LLMs) to produce meaningful and interpretable topic descriptions.
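The keyword part of this step can be sketched in a few lines. The example treats each cluster as one long document and weights a term by its within-cluster frequency times log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. This is one common formulation of class-based TF-IDF; the tokenizer, the function names, and the toy documents are assumptions for illustration, and the LLM refinement step is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def class_tfidf(docs_by_cluster):
    """Return {cluster: {term: weight}} using a c-TF-IDF style weighting."""
    # Term counts per cluster, with all of a cluster's documents concatenated.
    counts = {
        cluster: Counter(tok for doc in docs for tok in tokenize(doc))
        for cluster, docs in docs_by_cluster.items()
    }
    # Frequency of each term across all clusters.
    total_per_term = Counter()
    for counter in counts.values():
        total_per_term.update(counter)
    avg_words = sum(total_per_term.values()) / len(counts)

    weights = {}
    for cluster, counter in counts.items():
        size = sum(counter.values()) or 1
        weights[cluster] = {
            term: (n / size) * math.log(1 + avg_words / total_per_term[term])
            for term, n in counter.items()
        }
    return weights


def top_terms(weights, cluster, n=3):
    """Highest-weighted terms of one cluster (ties broken alphabetically)."""
    ranked = sorted(weights[cluster].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


docs = {
    0: ["solar energy project", "solar power grid project"],
    1: ["malaria vaccine project", "vaccine delivery project"],
}
weights = class_tfidf(docs)
```

Words that appear in every cluster, such as "project" in the toy data, are down-weighted relative to words concentrated in one cluster, which is what makes the top terms usable as a first-draft label.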
📈 Step 5: Interactive Visualization
The results are displayed in dynamic, interactive graphs, allowing users to explore:
- 📌 Trends Over Time: See how financing for climate, health, and other topics evolves.
- 🌍 Topics by Donor: Map donor-recipient aid relationships and thematic distributions.
- 📑 Clustered Output Analysis: Contrast AI-classified finance with the OECD’s traditional classifications.
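The views listed above all rest on simple aggregations of project-level finance by cluster. A minimal sketch of those aggregations, assuming each project is a dict with hypothetical keys `year`, `topic` (the assigned cluster label), `marker` (a self-declared flag such as a Rio marker), and `amount`; the figures are invented for the example.

```python
from collections import defaultdict


def funding_by_year_and_topic(projects):
    """Sum amounts per (year, topic) pair."""
    totals = defaultdict(float)
    for p in projects:
        totals[(p["year"], p["topic"])] += p["amount"]
    return dict(totals)


def topic_share_by_year(projects):
    """Share of each year's total funding that goes to each topic."""
    totals = funding_by_year_and_topic(projects)
    year_totals = defaultdict(float)
    for (year, _topic), amount in totals.items():
        year_totals[year] += amount
    return {
        key: amount / year_totals[key[0]]
        for key, amount in totals.items()
        if year_totals[key[0]] > 0
    }


def compare_with_marker(projects):
    """Cross-tabulate funding by cluster topic and self-declared marker."""
    table = defaultdict(float)
    for p in projects:
        table[(p["topic"], p["marker"])] += p["amount"]
    return dict(table)


# Invented toy data, for illustration only.
projects = [
    {"year": 2019, "topic": "climate", "marker": True, "amount": 10.0},
    {"year": 2019, "topic": "health", "marker": False, "amount": 30.0},
    {"year": 2020, "topic": "climate", "marker": False, "amount": 5.0},
    {"year": 2020, "topic": "climate", "marker": True, "amount": 15.0},
]
shares = topic_share_by_year(projects)
crosstab = compare_with_marker(projects)
```

The cross-tabulation is where the two classification approaches can be contrasted: funding that falls in a climate-related cluster but carries no self-declared marker, or the reverse, shows up as off-diagonal cells.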
By combining AI-powered clustering, statistical validation, and interactive visualizations, this project provides a data-driven framework for analyzing global development finance.