oDCM - Opening Lecture

Hannes Datta

Welcome to oDCM!

We're about to start with the first lecture of this class.

If you haven't done so, please

Agenda

  • Part 1 (10.45 to about 11.45)
    • Getting to know each other
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: Python Bootcamp on your laptops (about 12.00 - 13.30/13.45)

This course in a nutshell

  • You will learn how to write computer programs that automatically download and structure information from the internet for analysis.
  • We call these programs “web scrapers” (for any internet pages) and “APIs” (for official data access)
  • Web scraping and APIs are the foundation of Google Search (“web spiders”) and ChatGPT (for training)
  • I also almost got sued doing it (more about it later…)

Disclaimer

  • This is predominantly about web scraping and APIs - but you will learn Python on the way as well
  • You can also extract web data using other software packages (e.g., R)
  • Mix of students at various levels (e.g., beginners, advanced Python users)
  • I will record sessions and post content immediately after class (but attendance is strongly encouraged)
  • Consider me your coach, not your distant professor
  • Slow me down if you need to

About myself

  • scraping nerd — learned it in 2008 using Visual Basic in Excel
  • started doing my own research with scraped and API-extracted data in 2012 (so, 10+ years of experience)
  • left Germany around your age, now 15+ years in NL
  • Associate Professor at Tilburg University

Key areas of expertise

  • Substantive interests

    • streaming business models (e.g., music, movies)
    • marketing-mix modeling and optimization
    • open science
  • Methodological interests

    • online data collection via APIs and web scraping
    • causal effects with observational data


Motivation for course

  • started out as a PhD student without data
  • was interested in music, and found a website with data (https://last.fm)
  • no best practices in scraping; learned it all by myself and made many mistakes
  • scraping was undervalued in the academic job market, yet it plays a key role in shaping the relevance and rigor of your work
  • now scraping and APIs are a large part of what defines my research

Selection of scraping projects I've undertaken

What is scraping, and what are APIs?

With web scraping, you can capture anything you can view in a web browser

With APIs, you obtain official data from a firm in a programmatic way

  • e.g., as a developer, interact with Instagram, Twitter/X, ChatGPT / OpenAI, AWS, …
  • as a researcher, construct data set from analytics firms

Introducing music-to-scrape.org

  • Mock-up streaming service
  • Developed last semester, launched with Guyt et al. (2024)
  • Safe and controlled environment to learn scraping and APIs

Screenshot of Music to scrape

Quick web scraper in Python (I)

  • Let's first import some packages
import requests
  • And then call a particular URL (check it out in your browser!)
url = 'https://music-to-scrape.org/'
webrequest = requests.get(url)

Quick web scraper in Python (II)

  • Finally, let's retrieve the weekly top 15 songs (we use HTML tags and attribute-value pairs for this)
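from bs4 import BeautifulSoup
# name the parser explicitly to avoid a warning and get the same result on every machine
soup = BeautifulSoup(webrequest.text, 'html.parser')
# locate the <section> tag whose "name" attribute equals "weekly_15"
weekly15 = soup.find('section', {'name':'weekly_15'})
for song in weekly15.find_all('h5'):
    print(song.text)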
Tito Puente
DJ Quik
Babylon Disco
Billie Jo Spears
Stevie Ray Vaughan And Double Trouble
Charlie McCoy
Muse
Les Bonapartes
Little Joe & The Thrillers
Spectra Soul
Bare Jr.
Charlie Byrd Trio
Stevie Ray Vaughan And Double Trouble
Johnny Pearson
Chris Farlowe
  • Works with any website, and in principle with anything you can view in a browser (e.g., web apps)
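
To see what "HTML tags and attribute-value pairs" means without depending on the live site, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the `recent_users` section, and the names `SectionHeadings` and `extract_h5` are made up for illustration; the real page is more complex, and BeautifulSoup does this work for you in fewer lines.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified page (the real site's HTML is richer than this)
SAMPLE_HTML = """
<html><body>
  <section name="recent_users"><h5>someone_else</h5></section>
  <section name="weekly_15">
    <h5>Tito Puente</h5>
    <h5>DJ Quik</h5>
  </section>
</body></html>
"""


class SectionHeadings(HTMLParser):
    """Collect the text of <h5> tags inside <section name="...">."""

    def __init__(self, section_name):
        super().__init__()
        self.section_name = section_name
        self.in_section = False  # are we inside the section we want?
        self.in_h5 = False       # are we inside an <h5> of that section?
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs, e.g. [("name", "weekly_15")]
        if tag == "section" and dict(attrs).get("name") == self.section_name:
            self.in_section = True
        elif tag == "h5" and self.in_section:
            self.in_h5 = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "section":
            self.in_section = False
        elif tag == "h5":
            self.in_h5 = False

    def handle_data(self, data):
        # text may arrive in several chunks, so append to the current item
        if self.in_h5:
            self.items[-1] += data


def extract_h5(html, section_name):
    parser = SectionHeadings(section_name)
    parser.feed(html)
    return [item.strip() for item in parser.items]


weekly = extract_h5(SAMPLE_HTML, "weekly_15")
for entry in weekly:
    print(entry)
```

The tag (`section`, `h5`) narrows down the type of element, and the attribute-value pair (`name="weekly_15"`) picks the one section we care about. This sketch does not handle nested sections.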

Quick APIs in Python

  • APIs are official interfaces by firms for programmers to extract or submit data, or obtain access to an algorithm

  • They work like websites (i.e., you can call them with the same snippets as before), but usually you need to pay or at least sign up for the service

# let's get some data from the API of music-to-scrape
api_request = requests.get('https://api.music-to-scrape.org/charts/top-tracks')
  • let's structure the output in the JSON format
api_request_json = api_request.json()
for song in api_request_json.get('chart'):
    print(song.get('name'))
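
An API typically answers in JSON, which Python turns into nested dictionaries and lists. The sketch below shows one way to flatten such a response into rows and write them as CSV for analysis. The sample payload is hypothetical: it assumes a `chart` list whose entries have `name` and `artist` fields, so check the real response before relying on these keys. `chart_to_rows` and `rows_to_csv` are illustrative helper names.

```python
import csv
import io
import json

# Hypothetical payload; the 'artist' field and the values are assumptions
SAMPLE_RESPONSE = """
{"chart": [
  {"name": "Song A", "artist": "Artist 1"},
  {"name": "Song B", "artist": "Artist 2"}
]}
"""


def chart_to_rows(payload):
    """Turn the nested JSON structure into a flat list of dictionaries."""
    rows = []
    for position, song in enumerate(payload.get("chart", []), start=1):
        rows.append({
            "position": position,
            "name": song.get("name"),      # .get() returns None if a key is missing
            "artist": song.get("artist"),
        })
    return rows


def rows_to_csv(rows):
    """Write the rows as CSV text (use open(...) instead of StringIO to save a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["position", "name", "artist"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = chart_to_rows(json.loads(SAMPLE_RESPONSE))
csv_text = rows_to_csv(rows)
print(csv_text)
```

With a live request, `json.loads(SAMPLE_RESPONSE)` would be replaced by the result of `.json()` on the response object.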

Opportunities with web data

  • for businesses
    • “stitch together” different services (e.g., augment functionality of ChatGPT)
    • do market research (e.g., pricing data; see Zyte)
    • initialize recommendation systems (e.g., music metrics)
  • for research
    • discover/document novel phenomena (e.g., new platforms/technologies)
    • improve methods (e.g., get data to try out new methods, such as review data & text analysis, images, videos)
    • improve inferences (i.e., get more accurate results)
    • collect metrics managers care about

Getting inspired...

  • What are cool websites/services you're using often?!

  • What are important issues right now that directly or indirectly affect your lives?

  • As a marketer, how could you use the API of OpenAI to automate/invent something new?

Let's talk about it right now…

Why care (as a marketer...) (I)

Why care (as a marketer...) (II)

Web data versus other marketing data (I)

Why do we need a course on this? Isn't this how research is always done?

Yes, but collecting web data is different from other datasets!

  • data source selection
    • finding the right data source may be difficult as many potential alternatives exist
    • differ in delivery formats (website versus API, compared to CSV/databases)
    • access to data that is not available commercially (or that a firm would not like to share)

Web data versus other marketing data (II)

  • extraction design
    • which information to select (which is available?!)
    • type of variables (sales is rare; review scores are abundant)
    • different stakeholders that could potentially be addressed
    • need to tackle legal and ethical issues

Web data versus other marketing data (III)

  • collecting data at scale!
    • unprecedented: it's fully automatic (but prone to errors)
    • need to put monitoring procedures in place
    • data is not documented, and there is no direct way to ask questions about the data (sampling?! generalizability?)
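
Because automatic collection is prone to errors, a scraper that runs for days should at least log what happened and retry failed requests. Below is a minimal sketch of such a monitoring procedure, assuming a simple fixed number of retries. `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` merely simulates a request that fails once so the example runs without internet access.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")


def fetch_with_retries(fetch, url, max_attempts=3, wait_seconds=0.0):
    """Call fetch(url), log each outcome, and retry on errors.

    Returns the result of fetch(url), or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(url)
            log.info("ok: %s (attempt %d)", url, attempt)
            return result
        except Exception as error:
            log.warning("failed: %s (attempt %d): %s", url, attempt, error)
            time.sleep(wait_seconds)  # be polite: pause before trying again
    log.error("giving up on %s after %d attempts", url, max_attempts)
    return None


# Stand-in for a real request: fails on the first call, succeeds afterwards
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated timeout")
    return "<html>ok</html>"


page = fetch_with_retries(flaky_fetch, "https://music-to-scrape.org/")
print(page)
```

In a real project, you could pass something like `lambda u: requests.get(u, timeout=10).text` as `fetch` and write the log to a file, so you can check afterwards which pages were missed.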

Each project is totally unique - that's why there is no universal “best way” to approach things…

Pragmatic approach to scraping (Guyt et al. 2024)

Framework by Guyt et al. 2024

Detailed guidance by Boegershausen et al. (2022)

Learning Goals of this course

  • Explain how to use web data for creating marketing insight
  • Select web data sources and evaluate their value to inform a specific research context or business problem
  • Design the web data collection while balancing validity, technical feasibility, and exposure to legal/ethical risks
  • Collect data via web scraping and Application Programming Interfaces (APIs) by mixing, extending, and repurposing code snippets
  • Document and archive collected data and make it available for public (re)use

Positioning in the study program

Course structure and grading

  • Weekly modules, structured along the methodological framework

    • On-campus lectures and tutorials
    • Self-study
    • On-campus (sometimes online) coaching sessions
  • Project in which you put into practice your skills (40% of your grade)

  • On-campus computer exam (60% of your grade)

This year's team project

  • Specifics

    • scrape data via web sites or use APIs
    • work through the entire method framework of Boegershausen et al. 2022
    • receive feedback from fellow students and from me/Roshini (your coaches)
    • inspiration for the setup: past projects
    • should address a broad research context
  • Evaluation


Course website

Visit https://odcm.hannesdatta.com!

  • It's open education, so spread the word!
  • The course website is your #1 resource; Canvas is only used for

    • posting important announcements,
    • signing up for teams, and
    • submitting data challenges/projects.
  • Do all students have access to Canvas?

Using AI

  • I encourage the use of AI to help you learn Python and web scraping/APIs efficiently
  • Note though that, on the exam, you've got to be 'good enough' to do it without AI
  • Resources

Common struggles & tips

  • Take time to become acquainted with how the course works (e.g., we mostly use the course website rather than Canvas)
  • Can be tough at first, but you will gain experience rapidly!
  • It can be overwhelming to follow dPrep and oDCM simultaneously, but the skills you learn are important for your course work, Master's thesis & future career!
  • TWO exam moments per academic year - choose wisely!
  • Easing your start-up pain
    • Start preparing early on (the first weeks will be the most challenging!)
    • Have the same group members across dPrep and oDCM
    • Collaborate with each other and try to help one another, also across teams!
    • Be in touch via WhatsApp for Business (add +31 13 466 8938)

My commitment to you

  • Get you up to speed with both web scraping and APIs (and know when to pick which one!)
  • Teach you a bit of Python, but becoming an expert requires years of practice
  • Discuss your own work and ideas, but this requires interaction: talking to me and working hard
  • Use of open software (usable right away, no admin rights required)
  • Bring in your own ideas, don't be afraid to study topics off the mainstream!
  • I value diversity! Be who you want to be. Say what you want to say.

Brain-dead by coding

  • Coding can be extremely frustrating if you're starting out
  • I tend to become semi-“brain-dead” after a day of coding
  • Take breaks! Stop coding. Go for a run. Start again.
  • You will learn from your mistakes
  • Use ChatGPT, cheat sheets and our support section

→ quick feedback loops in first few weeks

Steps of escalation & getting in touch

When you run into trouble, this is your way out!

  1. Use the course's Chatbot (https://odcm.tilburgai.nl)
  2. Try to find the info on the course website or course material
  3. Ask ChatGPT / Google / Stack Overflow
  4. Ask a friend/classmate (form learning groups)
  5. Can it wait? Defer to lecturer or feedback sessions.
  6. If it can't wait: be in touch with me (→ next slide!)

Use of WhatsApp

  • Please use WhatsApp: +31 13 466 8938.

  • Being in touch via email is typically slower

What's in it for you?

  • Investment in research skills

    • e.g., collect own data & shape its creation to create relevant and rigorous research
  • Essential skills for entrepreneurs

    • e.g., have your own company on the basis of web data and APIs
    • be the linking pin (interface between marketing and analytics)
  • Showcase expertise in coding

Success factors

Please tell me what would make this course a success for you

Next steps

  • We will continue with our Python bootcamp after a break
  • Haven't followed the installation guide? Check it out on the course page now! (modules → preparation)

Any questions so far?!