oDCM - Course Summary & Exam Preparation

Hannes Datta

Welcome to the final lecture in oDCM!



If you haven't done so, please explore the exam page & example questions at https://odcm.hannesdatta.com/docs/exam.

Agenda

  • Course summary
  • From here onwards
    • Recognizing limitations
    • Seizing research opportunities
  • Course evaluation
  • Exam preparation
  • Remaining questions

Positioning in the study program

[Figure: positioning of oDCM in the study program]

Lessons learnt #1: Why use web data?

  • Web data-based research can be far more impactful than research without web data
    • Explore novel phenomena, be timely!
    • Boost ecological validity, get closer to what managers use and are interested in!
    • Facilitate methodological advancement (e.g., text, images, video)
    • Better measurement (e.g., control variables)
  • But: challenging!
    • Programming skills to master technical challenges
    • Conceptual thinking to navigate design choices & legal risks

Lessons learnt #2: Staying in balance

[Figure: balancing validity, legal, and technical goals]

Let's generate a few examples where validity, legal, and technical goals conflicted in your projects.

Lessons learnt #3: Data Source Selection

  • Importance of “broadening the horizon” (e.g., assume the perspective of various stakeholders, exceed geographical boundaries, use aggregators rather than primary data providers)
  • Consider alternatives to scraping (i.e., avoid defaulting to it! e.g., APIs, but also ready-made data sets)
  • Scope the data context (i.e., understand how the data is generated, assess its reliability, explore user conversations about it, etc.)

Lessons learnt #4: Design Challenges (I)

  • Look out for information – especially the surprising bits!
  • Look out for variables (e.g., to compare with other studies), but also to spot differences in availability over time
  • Spend time mapping your navigation path
  • Is real-time data collection required? (e.g., erroneous historical data, inaccurate timestamps)
  • Are algorithmic biases present? Can you exert control over them?

Lessons learnt #5: Design Challenges (II)

  • How long does the data collection run?
  • How to sample/seed?
  • To which population do the seeds generalize?
  • At what frequency to extract the data (e.g., once? multiple times?)
  • How to process the data on the fly? (e.g., first store raw JSON, then parse with pandas; see the sketch below)
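A minimal sketch of this “store raw first, parse later” pattern (file name and fields are placeholders, not from the course materials):

```python
import json
import pandas as pd

# Step 1 (during collection): append each raw response as one JSON line,
# so nothing is lost if the parsing logic changes later.
def store_response(record, path="raw_data.jsonl"):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Step 2 (after collection): parse the raw file into a tidy DataFrame.
def parse_responses(path="raw_data.jsonl"):
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    return pd.json_normalize(records)  # flattens nested JSON into columns

store_response({"artist": "Example Artist", "plays": {"total": 42}})
print(parse_responses().head())
```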

Lessons learnt #6: Design Challenges (III)

  • Scientific purpose, and run by research institution?
  • Scale and scope? (all data vs. small sample? running time)
  • Location of data provider and users
  • “Go” decision from provider? Technical intrusiveness?
  • Data management & use, commercialization

Lessons learnt #7: Extraction

  • Prototyping is extremely important
    • requests + BeautifulSoup vs. selenium (ChromeDriver)
    • extraction methodology (e.g., tags, classes, or other attributes; capturing .get_text() vs. attribute values) + stability
    • array misalignment (obey the hierarchy of the page, i.e., how it is structured; see the sketch after this list)
    • scheduling, hiding passwords
    • revise navigation paths
    • add comments to code (make it understandable for others, e.g., using ChatGPT)
  • Start documentation from a readme template
    • Generate plots, descriptive stats
    • Think as a “data supplier” rather than narrowly focusing on one (research) question
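To illustrate the array-misalignment point above, a minimal sketch (the HTML snippet and class names are made up): looping over parent containers keeps fields aligned even when some are missing, whereas extracting parallel lists silently shifts values.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second product has no price.
html = """
<div class="product"><span class="name">A</span><span class="price">10</span></div>
<div class="product"><span class="name">B</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Misaligned (avoid): two parallel lists of unequal length; pairing
# them by position would assign prices to the wrong products.
names = [n.get_text() for n in soup.find_all(class_="name")]    # 2 items
prices = [p.get_text() for p in soup.find_all(class_="price")]  # 1 item

# Hierarchy-obeying: loop over each container and search *within* it,
# so a missing field stays missing instead of shifting the arrays.
rows = []
for product in soup.find_all(class_="product"):
    price = product.find(class_="price")
    rows.append({
        "name": product.find(class_="name").get_text(),
        "price": price.get_text() if price else None,
    })
print(rows)  # [{'name': 'A', 'price': '10'}, {'name': 'B', 'price': None}]
```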

Lessons learnt #8: Web Scraping in Retailing

  • Based on Guyt et al. 2024
  • Research opportunities
  • Hands-on framework: Extraction, Looping, Timing, Infrastructure
  • Challenges & Opportunities for Retail Scraping
    • Overcoming matching challenges and time alignment
    • LLMs for web scraping

Looking ahead: Recognizing Limitations

  • Web data entails modeling challenges - not covered in this course (e.g., self-selection, endogeneity, “messy” data)
  • Web data can't give you everything (i.e., it looks like internal clickstream data, but it is not)
  • Legal and ethical issues not fully explored

Potential Applications

  • Collecting data for your Master's thesis

    • tell my colleagues you have the skills
    • start now, use later (data collection can take a long time!)
  • PhD and Research Master students can “invest” in data collections

    • data collections have been crucial to what I study
    • maybe you are a future researcher/PhD student? Start today!

Academic Opportunities from "what we study"

  1. Scout out emerging phenomena
  2. Study phenomena that can't be captured otherwise (i.e., unobtrusively)
  3. Study diverse populations (e.g., moving beyond WEIRD samples: more socio-economic backgrounds and geographies)
  4. Generate realistic stimuli for experiments (e.g., brand logos)

Academic Opportunities from "how we study" it

  1. Unleash real-time data collections (cf. historical data)
  2. Conduct & support field experiments with a platform's user base
  3. Use APIs to access algorithms, rather than data (e.g., Google Cloud Vision, OpenAI)
  4. Build your own research APIs
  5. Use aggregators & archive.org (see the sketch below)
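As an illustration of the archive.org route: the Wayback Machine exposes a public “availability” endpoint that returns the closest archived snapshot of a page. A minimal sketch (the queried site is just an example):

```python
import requests

# Ask the Wayback Machine for the snapshot closest to a given date.
resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "music-to-scrape.org", "timestamp": "20240101"},
)
resp.raise_for_status()
snapshot = resp.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"], snapshot["timestamp"])
else:
    print("No archived snapshot found.")
```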

Next steps: Projects & SPA

  • Please hand in so we can make the data package public!

    • Take out any passwords (store them as environment variables instead; see the sketch after this list)
    • Remove any unnecessary files
    • Want to keep your names on the documentation or anonymize them? Choose.
    • Don't make statements that are too bold!
  • Re-read the grading rubric on the course site!

  • You will receive an invite to the self- and peer assessment via email
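A minimal sketch of the environment-variable pattern mentioned above (the variable name MY_API_KEY is a placeholder):

```python
import os

# Read a secret from the environment instead of hard-coding it.
# Set it in your shell before running, e.g.:
#   export MY_API_KEY="..."   (macOS/Linux)
#   set MY_API_KEY=...        (Windows cmd)
api_key = os.environ.get("MY_API_KEY")
if api_key is None:
    raise RuntimeError("Please set the MY_API_KEY environment variable first.")
```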

Exam

  • Organization
    • When: 16 October (time? check Osiris/registration)
    • Work a maximum of 3 hours on the exam
  • How?
    • access to TestVision
    • access to Jupyter Notebook
    • I will make selected resources available on the instruction page; want to submit your own cheat sheet? Do so by Monday morning (not paper-based!)
  • What?
    • let's look at some questions now
    • prepare well! Expect new websites (for scraping) and new API endpoints (of sites you know)
    • the internet is available, but no use of sites/tools (e.g., ChatGPT) other than those indicated

Exam tips

  • Practice scraping an unknown website (e.g., tutorials at tilburgsciencehub.com, working with music-to-scrape.org, tilburg.ai; focus on “category overview pages” & array misalignment) – the scraper will account for a large share of your grade
  • You do not have to use selenium, but you need to understand how it differs from “regular” scraping toolkits (requests, BeautifulSoup); see the sketch below
  • Cover “Fields of Gold” (2022), including the web appendix, sample size calculations, and legal concerns (+ reason through the challenges), plus Guyt et al. 2024 (Journal of Retailing)
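A minimal sketch of that difference: requests fetches the raw HTML exactly as the server sends it, while selenium drives a real (headless) browser that also executes JavaScript. The URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://example.com"  # placeholder

# Static approach: fast, but content rendered by JavaScript is missing.
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
print(soup.title.get_text())

# Browser-based approach: selenium controls Chrome via ChromeDriver,
# so JavaScript-rendered content appears in page_source.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(URL)
print(BeautifulSoup(driver.page_source, "html.parser").title.get_text())
driver.quit()
```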

Next steps: Official course evaluation

  • Course evaluation has been immensely important to this course

    • This edition: onboarded Roshini for extra support, scheduled an extra “computer setup” session before class, had some sessions in the computer lab
    • Last edition: developed Journal of Retailing (2024) paper with hands-on guide to scraping, moved exam to campus, built music-to-scrape.org
  • Course evaluation has been critical to my career

    • Without my past evaluations, I wouldn't be teaching you today
    • I will look at all comments
  • Scores matter most for demonstrating the importance of this course

  • You will be invited via Evalytics (please evaluate at the end of the week).

Informal feedback

  • how can we ease onboarding?
  • how can we make sure all software (including ChromeDriver) is installed?
  • do you feel you've got enough support?

Stay in touch!