oDCM - Course Summary & Exam Preparation

Hannes Datta

Welcome to the final lecture in oDCM!



If you haven't done so, please explore the exam page & example questions at https://odcm.hannesdatta.com/docs/exam.

Agenda

  • Course summary
  • From here onwards
    • Recognizing limitations
    • Seizing research opportunities
  • Course evaluation
  • Exam preparation
  • Remaining questions

Positioning in the study program

[Figure: positioning of oDCM in the study program]

Lessons learnt #1: Why use web data?

  • Web-data-based research can be far more impactful than research without web data
    • Explore novel phenomena, be timely!
    • Boost ecological validity, get closer to what managers are interested in!
    • Facilitate methodological advancement (e.g., text, video)
    • Better measurement (e.g., control variables)
  • But: challenging!
    • Programming skills to master technical challenges
    • Conceptual thinking to navigate design choices & legal risks

Lessons learnt #2: Staying in balance

[Figure: balancing validity, legal, and technical considerations]

Let's generate a few examples where validity, legal and technical goals conflicted in your projects.

Lessons learnt #3: Data Source Selection

  • Importance of “broadening your horizon” (e.g., assume the perspective of various stakeholders, exceed geographical boundaries, use aggregators rather than primary data providers)
  • Consider alternatives to scraping (i.e., avoid defaulting! e.g., APIs, but also ready-made data sets)
  • Scope the data context (i.e., understand how the data is generated, assess reliability of the data, explore user conversations about the data, etc.)

Lessons learnt #4: Design Challenges (I)

  • Look out for available information, especially the “surprising” kind!
  • Look out for variables (e.g., to compare to other studies), but also to spot differences in availability over time
  • Spend time mapping your navigation path
  • Is real-time data collection required? (e.g., is historical data wrong, are timestamps inaccurate?)
  • Are algorithmic biases present? Can you exert control over them?

Lessons learnt #5: Design Challenges (II)

  • How long does the data collection run?
  • How to sample/seed?
  • To which population do the seeds generalize?
  • At what frequency to extract the data (e.g., once? multiple times?)
  • How to process the data on the fly? (e.g., first store raw JSON, then parse; pandas-based parsing; see the sketch below)
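
A minimal sketch of the “store raw first, parse later” idea, assuming newline-delimited JSON and illustrative file and field names:

```python
import json
import pandas as pd

# During collection: append each raw response as one JSON object per line
# (newline-delimited JSON keeps the file appendable and crash-tolerant).
record = {"product_id": 123, "price": 9.99, "timestamp": "2024-03-01T12:00:00"}
with open("raw_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# After collection: parse the raw file into a pandas DataFrame in one go.
df = pd.read_json("raw_data.jsonl", lines=True)
print(df.head())
```

Storing the raw payload first means a parsing bug never costs you the underlying data; you can always re-parse.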

Lessons learnt #6: Design Challenges (III)

  • Scientific purpose, and run by research institution?
  • Scale and scope? (all data vs. small sample? running time)
  • Location of data provider and users
  • “Go” decision from provider? Technical intrusiveness?
  • Data management & use, commercialization

Lessons learnt #7: Extraction

  • Prototyping is extremely important
    • requests + BeautifulSoup vs. selenium (Chromedriver); see the sketches after this list
    • extraction methodology (e.g., tags, classes, attribute-values) + stability
    • array misalignment (obey the page's hierarchy and how the data is structured!)
    • scheduling, hiding passwords
    • revise navigation paths
    • add comments to code (make it understandable for others, e.g., using ChatGPT)
  • Start documentation from a readme template
    • Generate plots, descriptive stats
    • Think as a “data supplier” rather than narrowly focusing on one (research) question
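
To make the prototyping point concrete: below is a hedged sketch contrasting the two toolkits (music-to-scrape.org is the demo site from class; the h5 selector is an illustrative assumption), followed by a tiny scheduling loop using the third-party `schedule` package:

```python
import requests
from bs4 import BeautifulSoup

# Static pages: requests + BeautifulSoup is lightweight and fast.
html = requests.get("https://music-to-scrape.org").text
soup = BeautifulSoup(html, "html.parser")
static_titles = [h.get_text(strip=True) for h in soup.find_all("h5")]

# JavaScript-rendered pages: fall back to selenium (Chromedriver).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome + a matching driver are installed
driver.get("https://music-to-scrape.org")
dynamic_titles = [el.text for el in driver.find_elements(By.TAG_NAME, "h5")]
driver.quit()
```

```python
import time
import schedule  # third-party: pip install schedule

def collect():
    # Call your scraper here; kept as a stub in this sketch.
    print("running one collection cycle...")

schedule.every(1).hours.do(collect)  # run the collection hourly
while True:
    schedule.run_pending()
    time.sleep(60)
```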

Lessons learnt #8: Web Scraping in Retailing

  • Based on Guyt et al. 2024
  • Research opportunities
  • Hands-on framework: Extraction, Looping, Timing, Infrastructure
  • Challenges & Opportunities for Retail Scraping
    • Overcoming matching challenges and time alignment
    • LLMs for web scraping

Looking ahead: Recognizing Limitations

  • Web data entails modeling challenges not covered in this course (e.g., self-selection, “messy” data)
  • Web data can't give you everything (e.g., you don't see internal data such as clickstream data)
  • Legal and ethical issues not fully explored

Potential Applications

  • Collecting data for Master thesis

    • tell my colleagues you have the skills
    • start now, use later (data collection can take a long time!)
  • PhD and research master students can “invest” in data collections

    • data was crucial to what I study
    • maybe you are a future researcher/PhD student? Start today!

Academic Opportunities from "what we study"

  1. Scout out emerging phenomena
  2. Study phenomena that can't be captured otherwise (i.e., unobtrusively)
  3. Study diverse populations (e.g., moving beyond WEIRD samples, covering more socio-economic backgrounds and geographies)
  4. Generating realistic stimuli for experiments (e.g., brand logos)

Academic Opportunities from "how we study" it

  1. Unleashing real-time data collections (cf. historical)
  2. Conduct & support field experiments with a platform's user base
  3. Use APIs to access algorithms, rather than data (e.g., Google Cloud Vision, OpenAI)
  4. Build own research APIs
  5. Use aggregators & archive.org
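
As a minimal sketch of point 3, assuming an OpenAI-style chat completions endpoint, an API key stored in an environment variable, and an illustrative model name and prompt:

```python
import os
import requests

# Use a hosted model API to run an algorithm over your data
# (here: sentiment classification), rather than to fetch data.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # illustrative model name
        "messages": [{"role": "user",
                      "content": "Classify the sentiment of this review: 'Great product!'"}],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```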

Next steps: Projects & SPA

  • Please hand in so I can make the data package public!

    • Take out any passwords (store them as environment variables instead; see the sketch after this list)
    • Remove any unnecessary files
    • Want to keep your names on the documentation or anonymize them?
    • Don't make statements that are too bold!
    • Consider uploading datasets & documentation yourselves to Zenodo; this counts towards the “quality” of the data package (10% of the assignment grade)
  • Re-read the grading rubric on the course site!

  • You will receive an invite to the self- and peer assessment via email
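
A minimal sketch of the environment-variable approach to hiding passwords (the variable name MY_API_KEY is an illustrative assumption):

```python
import os

# Read the credential from the environment instead of hard-coding it.
api_key = os.environ.get("MY_API_KEY")
if api_key is None:
    raise RuntimeError("Set MY_API_KEY before running the collection.")
```

Set the variable outside your code, e.g., `export MY_API_KEY=...` in your shell, so the secret never ends up in your repository.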

Exam

  • Organization
    • When: 2 April (time TBA)
    • Work a maximum of 3 hours on the exam
  • How?
  • What?
    • let's look at some questions now
    • prep well! Expect new websites/blogs (for scraping) and new endpoints of APIs you already know
    • the exam works with a whitelist of blogs/websites/APIs

Exam tips

  • Understand how selenium and Chrome work, next to the “regular” scraping toolkits (requests, BeautifulSoup)
  • Practice scraping an unknown website (e.g., tutorials at tilburgsciencehub.com, working with music-to-scrape.org, tilburg.ai, any other blog) – the scraper accounts for a large share of the points on the exam!
  • Cover “Fields of Gold” (2022), including the web appendix, sample size calculations, and legal concerns (and reason through the challenges), plus Guyt et al. 2024 (Journal of Retailing)

Next steps: Official course evaluation

  • Course evaluation has been immensely important to this course

    • This edition: developed the Journal of Retailing (2024) paper with a hands-on guide to scraping, complementing Boegershausen et al. (2022); moved the exam to campus
    • Last edition(s): built music-to-scrape.org, hosted coaching sessions mostly on campus, improved Python onboarding
  • Course evaluation has been critical to my career

    • Without my past evaluations, I wouldn't be teaching you today
    • I will look at all comments
  • Scores matter most for demonstrating the importance of this course

  • You will be invited via Evalytics (please evaluate at the end of the week).

Informal feedback

  • how to ease onboarding?
  • how to make sure all software (including Chromedriver) is installed?
  • how to handle “switch” from bs4 to selenium?
  • is this course too hard? how can I make it 'easier'/'smoother'?
    • e.g., what did you find particularly hard?
    • what tips would you give to students before they start this course?
    • more online coaching sessions? (or not?)
  • Do you feel you've got enough support?

Stay in touch!