oDCM - Course Summary & Exam Preparation

Hannes Datta

Welcome to the final lecture in oDCM!



If you haven't done so, please explore the exam page & example questions at https://odcm.hannesdatta.com/docs/exam.

Agenda

  • Course summary
  • From here onwards
    • Recognizing limitations
    • Seizing research opportunities
  • Course evaluation
  • Exam preparation
  • Remaining questions

Positioning in the study program

[Figure: positioning of oDCM in the study program]

Lessons learnt #1: Why use web data?

  • Web-data-based research can be far more impactful than research without web data
    • Explore novel phenomena, be timely!
    • Boost ecological validity, get closer to what managers are interested in!
    • Facilitate methodological advancement (e.g., text, video)
    • Better measurement (e.g., control variables)
  • But: challenging!
    • Programming skills to master technical challenges
    • Conceptual thinking to navigate design choices & legal risks

Lessons learnt #2: Staying in balance

[Figure: balancing validity, legal, and technical considerations]

Let's generate a few examples where validity, legal and technical goals conflicted in your projects.

Lessons learnt #3: Data Source Selection

  • Importance of “broadening your horizon” (e.g., assume the perspective of various stakeholders, exceed geographical boundaries, use aggregators rather than primary data providers)
  • Consider alternatives to scraping (i.e., avoid defaulting! e.g., APIs, but also ready-made data sets)
  • Scope the data context (i.e., understand how the data is generated, assess reliability of the data, explore user conversations about the data, etc.)

Lessons learnt #4: Design Challenges (I)

  • Look out for available information, especially the “surprising” kind!
  • Look out for variables (e.g., to compare to other studies), but also to spot differences in availability over time
  • Spend time mapping your navigation path
  • Is real-time data collection required? (e.g., is historical data wrong, are timestamps inaccurate?)
  • Are algorithmic biases present? Can you exert control over them?

Lessons learnt #5: Design Challenges (II)

  • How long does the data collection run?
  • How to sample/seed?
  • To which population do the seeds generalize?
  • At what frequency to extract the data (e.g., once? multiple times?)
  • How to process the data on the fly? (e.g., first store raw JSON, then parse; pandas-based parsing; see the sketch below)
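
A minimal sketch of the “store raw first, parse later” idea, assuming newline-delimited JSON and illustrative file and field names:

```python
import json
import pandas as pd

# During collection: append each raw response as one JSON object per line
# (newline-delimited JSON keeps the file appendable and crash-tolerant).
record = {"product_id": 123, "price": 9.99, "timestamp": "2024-03-01T12:00:00"}
with open("raw_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# After collection: parse the raw file into a pandas DataFrame in one go.
df = pd.read_json("raw_data.jsonl", lines=True)
print(df.head())
```

Storing the raw payload first means a parsing bug never costs you the underlying data; you can always re-parse.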

Lessons learnt #6: Design Challenges (III)

  • Scientific purpose, and run by research institution?
  • Scale and scope? (all data vs. small sample? running time)
  • Location of data provider and users
  • “Go” decision from provider? Technical intrusiveness?
  • Data management & use, commercialization

Lessons learnt #7: Extraction

  • Prototyping is extremely important
    • requests + BeautifulSoup vs. selenium (Chromedriver); see the sketches after this list
    • extraction methodology (e.g., tags, classes, attribute-values) + stability
    • array misalignment (obey the page's hierarchy and how the data is structured!)
    • scheduling, hiding passwords
    • revise navigation paths
    • add comments to code (make it understandable for others, e.g., using ChatGPT)
  • Start documentation from a readme template
    • Generate plots, descriptive stats
    • Think as a “data supplier” rather than narrowly focusing on one (research) question
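
To make the prototyping point concrete: below is a hedged sketch contrasting the two toolkits (music-to-scrape.org is the demo site from class; the h5 selector is an illustrative assumption), followed by a tiny scheduling loop using the third-party `schedule` package:

```python
import requests
from bs4 import BeautifulSoup

# Static pages: requests + BeautifulSoup is lightweight and fast.
html = requests.get("https://music-to-scrape.org").text
soup = BeautifulSoup(html, "html.parser")
static_titles = [h.get_text(strip=True) for h in soup.find_all("h5")]

# JavaScript-rendered pages: fall back to selenium (Chromedriver).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome + a matching driver are installed
driver.get("https://music-to-scrape.org")
dynamic_titles = [el.text for el in driver.find_elements(By.TAG_NAME, "h5")]
driver.quit()
```

```python
import time
import schedule  # third-party: pip install schedule

def collect():
    # Call your scraper here; kept as a stub in this sketch.
    print("running one collection cycle...")

schedule.every(1).hours.do(collect)  # run the collection hourly
while True:
    schedule.run_pending()
    time.sleep(60)
```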

Lessons learnt #8: Web Scraping in Retailing

  • Based on Guyt et al. 2024
  • Research opportunities
  • Hands-on framework: Extraction, Looping, Timing, Infrastructure
  • Challenges & Opportunities for Retail Scraping
    • Overcoming matching challenges and time alignment
    • LLMs for web scraping

Looking ahead: Recognizing Limitations

  • Web data entails modeling challenges not covered in this course (e.g., self-selection, “messy” data)
  • Web data can't give you everything (e.g., you don't see internal data such as clickstream data)
  • Legal and ethical issues not fully explored

Potential Applications

  • Collecting data for Master thesis

    • tell my colleagues you have the skills
    • start now, use later (data collection can take a long time!)
  • PhD and research master students can “invest” in data collections

    • data was crucial to what I study
    • maybe you are a future researcher/PhD student? Start today!

Academic Opportunities from "what we study"

  1. Scout out emerging phenomena
  2. Study phenomena that can't be captured otherwise (i.e., unobtrusively)
  3. Study diverse populations (e.g., moving beyond WEIRD samples, covering more socio-economic backgrounds and geographies)
  4. Generating realistic stimuli for experiments (e.g., brand logos)

Academic Opportunities from "how we study" it

  1. Unleashing real-time data collections (cf. historical)
  2. Conduct & support field experiments with a platform's user base
  3. Use APIs to access algorithms, rather than data (e.g., Google Cloud Vision, OpenAI)
  4. Build own research APIs
  5. Use aggregators & archive.org
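
As a minimal sketch of point 3, assuming an OpenAI-style chat completions endpoint, an API key stored in an environment variable, and an illustrative model name and prompt:

```python
import os
import requests

# Use a hosted model API to run an algorithm over your data
# (here: sentiment classification), rather than to fetch data.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # illustrative model name
        "messages": [{"role": "user",
                      "content": "Classify the sentiment of this review: 'Great product!'"}],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```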

Next steps: Projects & SPA

  • Please hand in so I can make the data package public!

    • Take out any passwords (store them as environment variables instead; see the sketch after this list)
    • Remove any unnecessary files
    • Want to keep your names on the documentation or anonymize them?
    • Don't make statements that are too bold!
    • Consider uploading datasets & documentation yourselves to Zenodo; this counts towards the “quality” of the data package (10% of the assignment grade)
  • Re-read the grading rubric on the course site!

  • You will receive an invite to the self- and peer assessment via email
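
A minimal sketch of the environment-variable approach to hiding passwords (the variable name MY_API_KEY is an illustrative assumption):

```python
import os

# Read the credential from the environment instead of hard-coding it.
api_key = os.environ.get("MY_API_KEY")
if api_key is None:
    raise RuntimeError("Set MY_API_KEY before running the collection.")
```

Set the variable outside your code, e.g., `export MY_API_KEY=...` in your shell, so the secret never ends up in your repository.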

Exam

  • Organization
    • When: 2 April (time TBA)
    • Work a maximum of 3 hours on the exam
  • How?
  • What?
    • let's look at some questions now
    • prep well! Expect new websites/blogs (for scraping) and new endpoints of APIs you already know
    • the exam works with a whitelist of blogs/websites/APIs

Exam tips

  • Understand how selenium and Chrome work, next to the “regular” scraping toolkits (requests, BeautifulSoup)
  • Practice scraping an unknown website (e.g., tutorials at tilburgsciencehub.com, working with music-to-scrape.org, tilburg.ai, any other blog) – the scraper accounts for a large share of the points on the exam!
  • Cover “Fields of Gold” (2022), including the web appendix, sample size calculations, and legal concerns (and reason through the challenges), plus Guyt et al. 2024 (Journal of Retailing)

Next steps: Official course evaluation

  • Course evaluation has been immensely important to this course

    • This edition: developed the Journal of Retailing (2024) paper with a hands-on guide to scraping, complementing Boegershausen et al. (2022); moved the exam to campus
    • Last edition(s): built music-to-scrape.org, hosted coaching sessions mostly on campus, improved Python onboarding
  • Course evaluation has been critical to my career

    • Without my past evaluations, I wouldn't be teaching you today
    • I will look at all comments
  • Scores matter most for demonstrating the importance of this course

  • You will be invited via Evalytics (please evaluate at the end of the week).

Informal feedback

  • how to ease onboarding?
  • how to make sure all software (including Chromedriver) is installed?
  • how to handle “switch” from bs4 to selenium?
  • is this course too hard? how can I make it 'easier'/'smoother'?
    • e.g., what did you find particularly hard?
    • what tips would you give to students before they start this course?
    • more online coaching sessions? (or not?)
  • Do you feel you've got enough support?

Stay in touch!