oDCM - Opening Lecture

Hannes Datta


Welcome to oDCM!

We're about to start with the first lecture of this class.

If you haven't done so, please check out the course website and work through the installation guide (modules → preparation).

Agenda

  • Part 1 (12.45 to about 13.45)
    • Getting to know each other
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: Python Bootcamp on your laptops (about 14.00 - 15.30/15.45)

Disclaimer

  • This is not a class that merely teaches you Python (but, if you invest time, you'll learn Python on the way!)
  • You can also extract web data using other software packages (e.g., R)
  • Mix of students at various levels (e.g., beginners, advanced Python users)
  • Course mostly takes place on campus (based on student feedback); attendance is not mandatory but strongly encouraged
  • I will not record any online or offline sessions; slides will always be posted
  • Consider me your coach, not your distant professor
  • Slow me down if you need to

About myself

  • scraping nerd — learned it in 2008 using Visual Basic in Excel
  • started doing my own research with scraped and API-extracted data in 2012 (so, 10+ years experience)
  • left Germany around your age, now many years in NL
  • Associate Professor at Tilburg University

Key areas of expertise

  • Substantive interests

    • streaming business models (e.g., music, movies)
    • marketing-mix modeling and optimization
    • open science
  • Methodological interests

    • online data collection via APIs and web scraping
    • causal effects with observational data

Teaching activities

Getting to know you

  • What's your background - previous education (e.g., program)?
  • Any experience in Python (or other programming languages)?
  • What are your passions & talents? (+ why I am asking you this…)

Motivation for course

  • started out as a PhD student without data
  • was interested in music, and found website with data (https://last.fm)
  • no best practices in scraping; learnt all by myself and made many mistakes
  • scraping is undervalued on the academic job market, but it plays a key role in shaping the relevance and rigor of your work
  • now scraping and APIs are a large part of what defines me…

Selection of scraping projects I've undertaken

What is scraping, and what are APIs?

With web scraping, you can capture anything you can view in a web browser

With APIs, you obtain official data from a firm in a programmatic way

  • e.g., as a developer, interact with Instagram, Twitter/X, ChatGPT / OpenAI, AWS, …
  • as a researcher, construct data sets from analytics firms

Introducing music-to-scrape.org

  • Mock-up streaming service
  • Developed last semester, to be launched with Guyt et al. (2024) (final testing phase!)
  • “Safe” and controlled environment to learn scraping and APIs

Screenshot of Music to scrape

Quick web scraper in Python (I)

  • Let's first import some packages
import requests
  • And then call a particular URL (check it out in your browser!)
url = 'https://music-to-scrape.org/'
webrequest = requests.get(url)
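  • Optionally, a quick sanity check: confirm the request succeeded before parsing the page
print(webrequest.status_code) # 200 means the page was retrieved successfully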

Quick web scraper in Python (II)

  • Finally, let's retrieve the weekly top 15 songs (we use HTML tags and attribute-value pairs for this)
from bs4 import BeautifulSoup
soup = BeautifulSoup(webrequest.text, 'html.parser')
weekly15 = soup.find('section', {'name':'weekly_15'})
for song in weekly15.find_all('h5'): print(song.text)
Tito Puente
DJ Quik
Babylon Disco
Stevie Ray Vaughan And Double Trouble
Billie Jo Spears
Charlie McCoy
Les Bonapartes
Muse
Bare Jr.
Spectra Soul
Little Joe & The Thrillers
Charlie Byrd Trio
Johnny Pearson
Stevie Ray Vaughan And Double Trouble
Chris Farlowe
  • Works with any website, and in fact with anything you can see in a browser (e.g., apps)

Quick APIs in Python

  • APIs are official interfaces by firms for programmers to extract or submit data, or obtain access to an algorithm

  • They work like websites (i.e., you can call them with the same snippets as before), but usually you need to pay or at least sign up for the service

# let's get some data from the API of music-to-scrape
api_request = requests.get('https://api.music-to-scrape.org/charts/top-tracks')
  • Let's structure the output in the JSON format
api_request_json = api_request.json()
for song in api_request_json.get('chart'): print(song.get('name'))
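  • As a minimal sketch (the file name is illustrative; the fields come from the snippet above), you could also archive the raw JSON response for later (re)use
import json
# store the raw chart data so the collection is documented and reproducible
with open('top_tracks.json', 'w') as outfile:
    json.dump(api_request_json, outfile, indent=2)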

Opportunities with web data

  • for businesses
    • “stitch together” different services (e.g., augment functionality of ChatGPT)
    • do market research (e.g., pricing data; see Zyte)
    • initialize recommendation systems (e.g., music metrics)
  • for research
    • discover/document novel phenomena (e.g., new platforms/technologies)
    • improve methods (e.g., get data to try out new methods, such as review data & text analysis, images, videos)
    • improve inferences (i.e., get more accurate results)
    • collect metrics managers care about

Getting inspired...

  • What are cool websites/services you're using often?!

  • What are important societal issues right now that directly or indirectly affect your lives?

  • As a marketer, how could you use the API of OpenAI to automate/invent something new?

Let's talk about it right now…

Why to care (as a marketer...) (I)


Why to care (as a marketer...) (II)


Web data versus other marketing data (I)

Why do we need a course on this? Isn't this how research is always done?

Yes, but collecting web data differs from working with other datasets!

  • data source selection
    • finding the right data source may be difficult, as many potential alternatives exist
    • data sources differ in delivery formats (websites versus APIs, compared to CSV files/databases)
    • access to data that is not available commercially (or that a firm would not like to share)

Web data versus other marketing data (II)

  • extraction design
    • which information to select (which is available?!)
    • type of variables (sales is rare; review scores are abundant)
    • different stakeholders that could potentially be addressed
    • need to tackle legal and ethical issues

Web data versus other marketing data (III)

  • collecting data at scale!
    • unprecedented: it's fully automatic (but prone to errors)
    • need to put monitoring procedures in place (see the sketch below)
    • data is not documented, and there is no direct way to ask questions about the data (sampling?! generalizability?)
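  • A minimal monitoring sketch (illustrative only; the URL list and timing are assumptions): log each request so failures don't go unnoticed
import time
import requests

urls = ['https://music-to-scrape.org/'] # hypothetical list of pages to collect
for url in urls:
    response = requests.get(url)
    # write a timestamped status line so failed requests can be spotted later
    print(time.strftime('%Y-%m-%d %H:%M:%S'), url, response.status_code)
    time.sleep(1) # pause between requests to avoid overloading the server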

Each project is totally unique - that's why there is no universal “best way” to approach things…

Pragmatic approach to scraping (Guyt et al. 2024)

Framework by Guyt et al. 2024

Detailed guidance by Boegershausen et al. (2022)


Learning Goals of this course

  • Explain how to use web data for creating marketing insight
  • Select web data sources and evaluate their value to inform a specific research context or business problem
  • Design the web data collection while balancing validity, technical feasibility, and exposure to legal/ethical risks
  • Collect data via web scraping and Application Programming Interfaces (APIs) by mixing, extending, and repurposing code snippets
  • Document and archive collected data and make it available for public (re)use

Positioning in the study program


Course structure and grading

  • Weekly modules, structured along the methodological framework

    • On-campus lectures and tutorials
    • Self-study
    • On-campus (sometimes online) coaching sessions
  • Project in which you put into practice your skills (50% of your grade)

  • On-campus computer exam (50% of your grade)

  • Bonus points available (max. 0.5 on the final course grade) for contributing to the course or to Tilburg Science Hub using source code (e.g., solutions for assignments, writing a tutorial for a new package, debugging source code, maintaining the course's issue board); you'll need GitHub for this

This year's team project: Retailing platforms (I)

  • We're currently seeing many new retail platforms being invented and launched.
  • You know way more than I do… so… let's hear some!

Goals:

In the context of retail-based platforms,

  • understand and present “data universe”
  • select one or multiple data source(s) and collect the data (scraping, APIs)
  • document the resulting data for public (re)use

This year's team project: Retailing platforms (II)

Find your sweet spot!

  • Read Guyt et al. 2024 and discover exciting data sources (or add your own!)
  • brainstorm about it…

This year's team project: Retailing platforms (III)

  • Specifics

    • scrape data via web sites or use APIs
    • work through the entire method framework
    • receive feedback from students and from me (your coach)
    • inspiration for the setup: past projects
    • legacy project ideas
  • Evaluation


Course website

Visit https://odcm.hannesdatta.com!

  • It's open education, so spread the word!
  • Course website is your #1 resource; Canvas is only used for

    • posting important announcements,
    • signing up for teams, and
    • submitting data challenges/projects.
  • Do all students have access to Canvas?

Common struggles & tips

  • Take time to become acquainted with how the course is set up (e.g., we barely use Canvas)
  • Can be tough at first, but you will gain experience rapidly!
  • It can be overwhelming to follow dPrep and oDCM simultaneously, but the skills you learn are important for your course work, Master Thesis & future career!
  • Usually not possible to take these courses at a later moment - do get in touch with program coordinator for questions about course enrollment
  • Easing your start-up pain
    • Start preparing early on (the first weeks will be the most challenging!)
    • Have the same group members across dPrep and oDCM
    • Collaborate with each other and try to help one another, also across teams!
    • Be in touch via WhatsApp for Business (add +31 13 466 8938)

My commitment to you

  • Get you up to speed with both web scraping and APIs (and know when to pick which one!)
  • Teach you a bit of Python, but, becoming an expert requires years of practice
  • Discuss my own work and ideas with you, but this requires interaction: talking to me, working hard
  • Use of open software (usable right away, no admin rights required)
  • Bring in your own ideas, don't be afraid to study topics off the main stream!
  • I value diversity! Be who you want to be. Say what you want to say.

Brain-dead by coding

  • Coding can be extremely frustrating if you're starting out
  • I tend to become semi-“brain-dead” after a day of coding
  • Take breaks! Stop coding. Go for a run. Start again.
  • You will learn from your mistakes
  • Use ChatGPT, cheat sheets and our support section

→ quick feedback loops in first few weeks

Steps of escalation & getting in touch

When you run into trouble, this is your way out!

  1. Try to find the info on the course website
  2. Ask ChatGPT / Google / Stack Overflow
  3. Ask friend/classmate (form learning groups)
  4. Can it wait? Defer to lecturer or feedback sessions.
  5. If it can't wait: be in touch with me (–> next slide!)

Use of WhatsApp

  • Please use WhatsApp: +31 13 466 8938.


  • Being in touch via email is typically slower

What's in it for you?

  • Investment in research skills

    • e.g., collect your own data & shape its creation to produce relevant and rigorous research
  • Essential skills for entrepreneurs

    • e.g., have your own company on the basis of web data and APIs
    • be the linking pin (interface between marketing and analytics)
  • Showcase expertise in coding

Success factors

Please tell me what would make this course a success for you

Next steps

  • We will continue with our Python bootcamp after a break
  • Haven't followed the installation guide? Check it out on the course page now! (modules –> preparation)

Any questions so far?!