oDCM - Opening Lecture

Hannes Datta

Welcome to oDCM!

We're about to start with the first lecture of this class.

If you haven't done so, please

Agenda

  • Part 1 (10.45 to about 11.45)
    • Getting to know each other
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: Python Bootcamp on your laptops (about 12.00 - 13.30/13.45)

This course in a nutshell

  • You will learn how to write code that automatically downloads and structures information from the internet for the purpose of (scientific) analysis.
  • We collect this data with “web scrapers” (programs that work on any web page) and through “APIs” (official data access)
  • Web scraping is the foundation of Google Search (“web spiders”) and ChatGPT (e.g., for training); APIs are at the core of many business models (e.g., Twitter API - back in the day; OpenAI API)
  • I also almost got sued for scraping (more about it later…)

Disclaimer

  • This course is predominantly about web scraping and APIs - I teach a bit of Python, but becoming an expert requires years of practice
  • Mix of students at various levels (e.g., beginners, advanced Python users)
  • You can also extract web data using other software packages (e.g., R)
  • I will record sessions and post content immediately after class (but attendance is strongly encouraged)
  • Consider me your coach, not your distant professor (yes, I respond to WhatsApp messages at +31134668938)
  • Slow me down if you need to

About myself

  • scraping nerd — learned it in 2008 using Visual Basic in Excel
  • started doing my own research with scraped and API-extracted data in 2012 (so, 10+ years of experience)
  • left Germany around your age, now 15+ years in NL
  • Associate Professor at Tilburg University

Key areas of expertise

  • Substantive interests

    • streaming business models (e.g., music, movies)
    • marketing-mix modeling and optimization
    • open science
  • Methodological interests

    • online data collection via APIs and web scraping
    • causal effects with observational data

Teaching activities

Getting to know you

  • What's your background - previous education (e.g., program)?
  • Any experience in Python (or other programming languages)?
  • What are your passions & talents? (+ why I am asking you this…)

Motivation for course

  • started out as a PhD student without data
  • was interested in music, and found a website with data (https://last.fm)
  • no best practices in scraping existed; learnt it all by myself and made many mistakes
  • scraping was undervalued in the academic job market, but it plays a key role in shaping the relevance and rigor of your work
  • now scraping and APIs are a large part of what defines my research

Selection of scraping projects I've undertaken

What is scraping, and what are APIs?

With web scraping, you can capture anything you can view in a web browser

With APIs, you obtain official data from a firm in a programmatic way

  • e.g., as a developer, interact with Instagram, Twitter/X, ChatGPT / OpenAI, AWS, …
  • as a researcher, construct data set from analytics firms
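A minimal sketch of the difference, using only Python's standard library. The HTML fragment, the JSON payload, and all names (SAMPLE_HTML, SAMPLE_JSON, SongParser, scrape_songs, songs_from_api) are made up for illustration and do not come from a real website or API. The scraper has to find the data inside markup meant for human readers, while the API response already arrives as structured data.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment, standing in for what a browser would display.
SAMPLE_HTML = """
<ul>
  <li class="song">Song A</li>
  <li class="song">Song B</li>
  <li class="ad">Buy now!</li>
</ul>
"""

# Hypothetical API response carrying the same information as structured data.
SAMPLE_JSON = '{"songs": [{"title": "Song A"}, {"title": "Song B"}]}'


class SongParser(HTMLParser):
    """Collect the text of every <li class="song"> element."""

    def __init__(self):
        super().__init__()
        self.in_song = False
        self.songs = []

    def handle_starttag(self, tag, attrs):
        # Only list items marked with class="song" hold the data we want.
        if tag == "li" and dict(attrs).get("class") == "song":
            self.in_song = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_song = False

    def handle_data(self, data):
        if self.in_song and data.strip():
            self.songs.append(data.strip())


def scrape_songs(html):
    """Web scraping: extract song titles from raw HTML."""
    parser = SongParser()
    parser.feed(html)
    return parser.songs


def songs_from_api(payload):
    """API: the provider already returns structured JSON."""
    return [song["title"] for song in json.loads(payload)["songs"]]


scraped = scrape_songs(SAMPLE_HTML)
from_api = songs_from_api(SAMPLE_JSON)
print(scraped)
print(from_api)
```

Both routes end with the same list of titles. The scraper depends on the page layout (here, the class name "song") and breaks when that changes, whereas the API code depends only on the documented response format.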

Introducing music-to-scrape.org

  • Mock-up streaming service
  • Developed last year, launched with Guyt et al. (2024)
  • “Safe” and controlled environment to learn scraping and APIs
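A first request against the practice site could look roughly like the sketch below. It uses only Python's standard library. The https:// address, the user-agent string, the one-second pause, and the helper names (build_request, fetch_html) are assumptions made for illustration, not something prescribed by the course or the site.

```python
import time
import urllib.request

# Assumed address of the practice site (the slides only name the domain).
BASE_URL = "https://music-to-scrape.org"

# Hypothetical user-agent string that identifies the scraper to the server.
USER_AGENT = "odcm-student-scraper/0.1 (educational use)"


def build_request(url):
    """Prepare a request that identifies itself via a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, pause_seconds=1.0):
    """Wait briefly, then download the page and return its HTML as text."""
    time.sleep(pause_seconds)  # pause so repeated calls do not hammer the server
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (requires an internet connection):
# html = fetch_html(BASE_URL)
# print(len(html))
```

Splitting request construction from the download keeps the network call in one place, so the pause and the identifying header apply to every page that is fetched.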