Master Thesis — MSc Data Science

For my Master's thesis, I am building a cross-platform data pipeline that records which content appears on each platform's "discovery" feed, then models which creator choices (text, visuals, timing, format) are associated with higher visibility and faster engagement. The output is a practical posting strategy for marketing teams.

What problem I solve

Organic reach on social media is hard to predict because feeds are ranked algorithmically, not chronologically. Companies post regularly but cannot reliably tell whether a post will be shown to users, even to their own followers. My thesis tackles this by measuring "visibility" directly from the platforms' ranked discovery surfaces and translating the observed patterns into concrete posting guidance.

1) Collecting visibility data like an audit

Platforms: TikTok "For You", Instagram "Explore", LinkedIn "Top" feed.

Every hour for 30 days, I scrape:

Top posts: the top 20 items shown on-screen
Baseline posts: lower-ranked items captured in the same minute (ranks 51–100, or the "Recent" feed on LinkedIn)

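Pairing each Top snapshot with a same-minute baseline reduces selection bias: the models compare posts that reached the Top with posts that did not, rather than looking at successful posts alone. That makes "what made it into the Top" measurable.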
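
As a rough illustration of the snapshot + baseline labelling, the sketch below takes one ranked feed capture and keeps ranks 1–20 as "top" and ranks 51–100 as "baseline", dropping everything in between. The names Observation and label_snapshot, the field layout, and the example feed are assumptions made for illustration, not the thesis code; the LinkedIn "Recent" variant would need its own capture and is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    """One post seen in one hourly snapshot (hypothetical record layout)."""
    platform: str
    post_id: str
    rank: int          # 1 = first item shown on screen
    group: str         # "top" or "baseline"
    captured_at: str   # ISO-8601 UTC timestamp shared by the whole snapshot


def label_snapshot(platform, ranked_post_ids, captured_at,
                   top_n=20, baseline_ranks=(51, 100)):
    """Label one ranked capture as top / baseline; ranks in between are dropped."""
    lo, hi = baseline_ranks
    # captured_at is expected to be timezone-aware.
    stamp = captured_at.astimezone(timezone.utc).isoformat()
    rows = []
    for rank, post_id in enumerate(ranked_post_ids, start=1):
        if rank <= top_n:
            group = "top"
        elif lo <= rank <= hi:
            group = "baseline"
        else:
            continue  # ranks 21-50 and beyond 100 are not part of the design
        rows.append(Observation(platform, post_id, rank, group, stamp))
    return rows


# Made-up feed of 120 post ids, purely for demonstration.
feed = [f"post{i}" for i in range(1, 121)]
rows = label_snapshot("tiktok", feed,
                      datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(len(rows))  # 70 rows: 20 top + 50 baseline
```

Giving both groups the same timestamp is what keeps the comparison "same minute": any time-of-day effect applies equally to top and baseline rows of one snapshot.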

2) Tracking early performance

For each captured post, I revisit it 24h and 72h after capture to collect public engagement counters and compute engagement velocity (normalised by follower count). This lets me compare "exposure" (being ranked high) with "uptake" (how audiences respond after exposure).
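
The text does not spell out the exact velocity formula, so the following is only one plausible definition, stated as an assumption: engagement gained between capture and revisit, per hour, per 1,000 followers. The function name, its parameters, and the example numbers are hypothetical.

```python
def engagement_velocity(count_at_capture, count_at_revisit,
                        hours_elapsed, follower_count):
    """Engagement gained per hour per 1,000 followers (assumed definition).

    Returns None when the follower count is missing or zero, because the
    normalisation is undefined in that case.
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    if not follower_count or follower_count <= 0:
        return None
    gained = count_at_revisit - count_at_capture
    return gained * 1000 / (hours_elapsed * follower_count)


# Made-up numbers: 100 likes at capture, 340 likes 24h later, 5,000 followers.
print(engagement_velocity(100, 340, 24, 5000))  # 2.0
```

Dividing by follower count is what makes a small account and a large account comparable; without it, velocity would mostly reflect audience size.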

3) Turning raw pages into creator-controllable features

I engineer features creators can influence before posting, grouped into:

Text: caption length, hashtag count, emojis, CTA signals, language, embeddings
Visual: brightness, contrast, colourfulness, face presence, text-on-image length
Temporal: hour, weekday, post age
Audio (where relevant): audio present, audio metadata
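
To make the text group concrete, here is a minimal sketch of a few caption features. The emoji character ranges are a rough approximation, the CTA phrase list is invented for illustration, and the substring match is deliberately naive; language detection and embeddings are left out.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")
# Approximate emoji coverage: main pictograph blocks plus older symbol blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Hypothetical call-to-action phrases; a real list would be tuned per platform.
CTA_PHRASES = ("link in bio", "comment", "share", "follow",
               "sign up", "learn more")


def text_features(caption):
    """Return simple, creator-controllable features of a caption."""
    lowered = caption.lower()
    return {
        "caption_length": len(caption),
        "word_count": len(caption.split()),
        "hashtag_count": len(HASHTAG_RE.findall(caption)),
        "emoji_count": len(EMOJI_RE.findall(caption)),
        "has_cta": any(phrase in lowered for phrase in CTA_PHRASES),
    }


# Made-up caption for demonstration.
print(text_features("New drop today 🚀 Follow for more #launch #design"))
```

Every feature here can be computed before a post goes live, which is the point of the grouping: the creator can change any of them.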

4) Modelling and comparison across platforms

I fit interpretable models per platform to predict:

Top inclusion (Top vs baseline)
Within-Top rank (1–20)
Engagement velocity at 24h and 72h after first capture

Primary modelling: Generalised Additive Models (GAMs) for readable, non-linear effects.
Robustness checks: gradient-boosted models.
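
For orientation, a GAM for Top inclusion would typically take the standard logistic additive form below. The notation is an assumption chosen for this summary: Top_i is 1 if post i was captured in the Top and 0 if it came from the baseline, x_ij is feature j of post i, and each f_j is a smooth function estimated from the data.

```latex
% Requires amsmath for \operatorname and \text.
\operatorname{logit} \Pr(\text{Top}_i = 1)
  = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij})
```

Because the features enter additively, each f_j can be plotted on its own, which is what makes the non-linear effects readable.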

Expected outcome

A marketing-facing posting strategy built from measured evidence, not "tips". Concretely, this becomes:

A ranked list of visibility levers, separating those that generalise across platforms from platform-specific ones
Posting guidelines (caption structure, timing windows, creative patterns) aligned with KPIs
A compact "drivers table" that summarises effect direction and strength by platform

Technical stack

Scraping: Selenium 4 (headless Chrome), DOM parsing + network JSON (dual-path extraction)
Infrastructure: Azure VM (Ubuntu), cron orchestration, residential proxies, cookie-based auth (no passwords stored)
Data: SQLite for incremental writes, Parquet for modelling tables
Modelling: R (GAMs), plus gradient-boosted models for validation
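
As a sketch of what "incremental writes" to SQLite could look like, the example below appends each hourly batch and lets a primary key silently skip rows that were already stored, so a re-run of the same cron job does not create duplicates. The table name, column names, and helper functions are assumptions for illustration, not the actual schema; the Parquet export step is not shown.

```python
import sqlite3

# Hypothetical schema: one row per post per snapshot.
SCHEMA = """
CREATE TABLE IF NOT EXISTS observations (
    platform    TEXT NOT NULL,
    post_id     TEXT NOT NULL,
    captured_at TEXT NOT NULL,
    feed_rank   INTEGER NOT NULL,
    feed_group  TEXT NOT NULL,
    PRIMARY KEY (platform, post_id, captured_at)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the SQLite store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def append_observations(conn, rows):
    """Insert rows, skipping duplicates; return the number of new rows."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO observations VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return conn.total_changes - before


# Made-up batch for demonstration.
conn = open_store()
batch = [
    ("tiktok", "p1", "2024-01-01T12:00:00+00:00", 1, "top"),
    ("tiktok", "p2", "2024-01-01T12:00:00+00:00", 51, "baseline"),
]
print(append_observations(conn, batch))  # 2 new rows
print(append_observations(conn, batch))  # 0, the batch is already stored
```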
