Using NLP to Reveal Thematic Overlaps and Gaps Across 200+ Course Descriptions
March 2026
This project grew out of a proposal I made in my leadership-support role at the Institute for Secondary Education (IS1) at PHBern. In the Secondary I programme, course offerings are organised into modules, each containing several learning opportunities (Lerngelegenheiten, LGs). Every LG has a title and a description text and is assigned to a specific module.
The key question: which topics do the learning opportunities in the field of educational sciences actually cover, and where are there thematic overlaps between modules, or potential gaps? With over 200 texts, answering this manually is hardly feasible. I presented the results of the analysis at the ESW retreat in March 2026.
The analysis identified seven thematic areas, including "School Life, Classroom Management and Collaboration", "Assessment, Counselling and Career Orientation", "Academic Writing and Research Methods" and "Education for Sustainable Development". The results were prepared from three perspectives:
Thematic clusters: Each LG represented as a point in a two-dimensional projection, colour-coded by thematic area. The visualisation makes the thematic structure of the entire offering visible at a glance.
Topic distribution per module: For each module, the representation shows which topics are present and in what proportion — revealing whether a module has a broad or focused profile.
Module similarity: A similarity matrix shows which modules are thematically closest to one another and where module-specific emphases lie.
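The module-similarity perspective boils down to comparing each module's topic profile with every other module's. A minimal sketch, using cosine similarity from scikit-learn on a hypothetical topic-share matrix (the actual topic proportions per module are not reproduced here):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical topic-share matrix: rows = modules, columns = the seven
# thematic areas; each row holds the share of a module's LGs per topic.
topic_shares = np.array([
    [0.50, 0.30, 0.10, 0.10, 0.00, 0.00, 0.00],  # module A: focused profile
    [0.45, 0.35, 0.10, 0.10, 0.00, 0.00, 0.00],  # module B: similar to A
    [0.05, 0.05, 0.20, 0.20, 0.20, 0.15, 0.15],  # module C: broad profile
])

# Pairwise cosine similarity between module profiles; a value near 1 means
# two modules emphasise the same thematic areas in similar proportions.
similarity = cosine_similarity(topic_shares)
print(np.round(similarity, 2))
```

Plotted as a heatmap (e.g. with Plotly), this matrix is exactly the kind of visualisation described above: thematically close modules light up, and module-specific emphases stand out as low-similarity rows.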
Data preparation and text cleaning. The description texts of all ESW learning opportunities across 18 modules were exported and structured. Title and description were combined into a single text per LG. After cleaning, 163 substantive texts remained from an initial set of roughly 230.
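The preparation step can be sketched as follows. The record fields, the word-count threshold, and the sample entries are illustrative assumptions, not the project's exact cleaning rules:

```python
import re

def prepare_texts(records, min_words=10):
    """Combine title and description per LG and drop texts too short to
    carry thematic signal. `min_words` is an illustrative threshold."""
    cleaned = []
    for rec in records:
        text = f"{rec.get('title', '')}. {rec.get('description', '')}"
        text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
        if len(text.split()) >= min_words:         # keep substantive texts only
            cleaned.append({"module": rec.get("module"), "text": text})
    return cleaned

sample = [
    {"module": "M1", "title": "Classroom Management",
     "description": "Strategies for establishing routines, rules and a "
                    "productive learning climate in heterogeneous classes."},
    {"module": "M2", "title": "tbd", "description": ""},  # placeholder entry, dropped
]
texts = prepare_texts(sample)
print(len(texts))  # → 1
```

A filter of this kind is what reduces the initial ~230 entries to the 163 substantive texts: placeholder and near-empty descriptions contribute no thematic signal and would only distort the clusters.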
Sentence embeddings. Each text was converted into a numerical vector using the multilingual model paraphrase-multilingual-mpnet-base-v2. Texts with similar content are positioned close together in vector space, even when they use different terminology.
Clustering with BERTopic. BERTopic was used as the analysis method, combining sentence embeddings, dimensionality reduction (UMAP) and clustering in a single workflow. The initially deployed density-based algorithm HDBSCAN failed on the data: the pedagogical texts use heavily overlapping specialist vocabulary, making clear cluster boundaries difficult to identify. A switch to K-Means with a fixed number of clusters produced more interpretable and stable results.
Cluster labelling. The automatically extracted keywords were translated into accessible titles and short descriptions using the language model Claude Opus 4.
To ensure the results were not perceived as a black box at the ESW retreat, I documented the technical approach step by step in a separate appendix. Participants could trace the data basis of the analysis, understand how the thematic areas were derived and see where the method has its limitations. This transparency was key to the results being taken seriously and discussed constructively.
Python (Google Colab), BERTopic, Sentence-Transformers, UMAP, scikit-learn, Plotly, Claude.