We analyzed more than 20,000 cleantech media articles and patents with Natural Language Processing to uncover innovation trends, technology gaps, and market opportunities in sustainable energy. The project combined exploratory analysis, domain-adapted language models, and a retrieval-based question-answering system to help researchers and policymakers navigate the rapidly evolving cleantech landscape.
What We Built
Stage 1: Understanding the Landscape
- Cleaned and analyzed 20,111 media articles and patent documents from 2022-2024
- Used topic modeling (LDA, NMF, BERTopic) to identify major cleantech themes
- Extracted company-technology relationships using Named Entity Recognition
- Compared media coverage against patent filings to spot innovation gaps
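The media-versus-patent comparison can be illustrated with a minimal sketch. It assumes each document has already been assigned a single topic label by one of the topic models, and it compares each topic's share of media coverage with its share of patent filings. The function name, topic labels, and counts are hypothetical and are not project results.

```python
from collections import Counter


def coverage_gaps(media_topics, patent_topics):
    """Return {topic: media_share - patent_share}, largest absolute gap first."""
    if not media_topics or not patent_topics:
        raise ValueError("both corpora need at least one labeled document")
    media_counts = Counter(media_topics)
    patent_counts = Counter(patent_topics)
    media_total = sum(media_counts.values())
    patent_total = sum(patent_counts.values())
    # Counter returns 0 for topics missing from one corpus.
    gaps = {
        topic: media_counts[topic] / media_total - patent_counts[topic] / patent_total
        for topic in set(media_counts) | set(patent_counts)
    }
    return dict(sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical topic assignments, one label per document.
media = ["hydrogen"] * 6 + ["solar"] * 3 + ["batteries"] * 1
patents = ["hydrogen"] * 2 + ["solar"] * 3 + ["batteries"] * 5

gaps = coverage_gaps(media, patents)
for topic, gap in gaps.items():
    print(f"{topic:10s} {gap:+.2f}")
```

A positive gap marks a topic that gets more media attention than patent activity, and a negative gap marks the reverse. Either can point to an innovation gap worth a closer look.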
Stage 2: Custom AI Models
- Trained domain-specific word and sentence embeddings on cleantech texts
- Fine-tuned transformer models (RoBERTa, BERT) for cleantech language
- Compared general-purpose vs. specialized models to measure performance gains
- Produced vector representations that capture cleantech concepts and the relationships between them
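One possible way to compare a general-purpose model with a specialized one is to check how clearly each separates related term pairs from unrelated ones. The sketch below assumes the embeddings are already available as plain lists of floats. The toy 2-d vectors, the term pairs, and the `pair_separation` score are illustrative assumptions, not the metric or data used in the project.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_separation(vectors, related, unrelated):
    """Mean cosine of related pairs minus mean cosine of unrelated pairs."""
    def mean_sim(pairs):
        return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    return mean_sim(related) - mean_sim(unrelated)


# Toy 2-d vectors standing in for each model's embeddings (hypothetical values).
general = {"electrolyzer": [1.0, 0.0], "hydrogen": [0.6, 0.8], "turbine": [0.8, 0.6]}
domain = {"electrolyzer": [1.0, 0.0], "hydrogen": [0.96, 0.28], "turbine": [0.0, 1.0]}

related = [("electrolyzer", "hydrogen")]
unrelated = [("electrolyzer", "turbine")]

general_score = pair_separation(general, related, unrelated)
domain_score = pair_separation(domain, related, unrelated)
print(f"general: {general_score:+.2f}  domain: {domain_score:+.2f}")
```

A higher score means the model places related cleantech terms closer together than unrelated ones. In practice the vectors would come from the trained models and the pairs from a curated evaluation list.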
Stage 3: Intelligent Question Answering
- Built a Retrieval-Augmented Generation (RAG) system for cleantech queries
- Generated 300+ question-answer pairs across different complexity levels
- Implemented semantic search using ChromaDB vector database
- Passed retrieved documents to GPT-3.5-Turbo as context so that answers are grounded in, and traceable to, their sources
- Evaluated system performance across question categories (factual, analytical, comparative)
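The retrieve-then-generate flow can be sketched without external services. The example below swaps the real components for stand-ins: a bag-of-words count vector replaces the all-MiniLM-L6-v2 embeddings, an in-memory dict replaces ChromaDB, and the GPT-3.5-Turbo call is omitted so that only prompt assembly is shown. All function names, document ids, snippets, and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase token counts (not a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, docs, k=2):
    """Return the k (doc_id, text) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(docs.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Assemble a prompt that asks the model to answer from cited sources only."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical snippets standing in for indexed articles and patents.
docs = {
    "media-1": "Green hydrogen electrolyzer projects are expanding in Europe.",
    "patent-7": "A perovskite solar cell with improved thermal stability.",
    "media-3": "Grid-scale battery storage costs continue to fall.",
}

query = "Which projects involve hydrogen electrolyzer technology?"
hits = retrieve(query, docs, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

Keeping the document ids in the prompt is what lets the generated answer point back to specific sources.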
Technical Architecture
- Data Processing: Pandas for data handling, spaCy for NLP preprocessing, NLTK for text refinement
- Analysis & Modeling: BERTopic and Top2Vec for topic discovery, Hugging Face Transformers for fine-tuning, Sentence-Transformers for embeddings
- RAG System: ChromaDB vector store, all-MiniLM-L6-v2 embeddings, OpenAI GPT-3.5-Turbo generation, LangChain for retrieval pipeline
- Infrastructure: Google Colab with GPU acceleration, batch processing for 20K+ documents, version control through Google Drive
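Batch processing of the corpus can be sketched as a small generator that yields fixed-size chunks, so each chunk can be embedded or indexed without holding every intermediate result in memory. The batch size of 512 and the use of integer ids in place of real documents are assumptions for illustration.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Integer ids stand in for the 20,111 documents; 512 is an assumed batch size.
doc_ids = range(20_111)
batches = list(batched(doc_ids, 512))
print(len(batches), len(batches[-1]))
```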
Partner Institutions
Equintel GmbH, ETH Zurich, Google DeepMind