oDCM - Web Data for Dummies (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web data for dummies”).

  • If you haven't done so, open this slide deck (course material –> week 2) and the corresponding Jupyter Notebook
  • Coaching sessions (with Roshini) start after today's tutorial + finalize your team assignment on Canvas
  • You can work in Google Colab today

Agenda

  • Today's lecture
    • Work through the in-class tutorial (“webdata for dummies in class.ipynb”)
    • We'll work on a selection of exercises on web scraping and APIs
    • Address some data selection challenges highlighted in “Fields of Gold”
  • After class
    • Check out today's workplan
    • Complete in-class tutorial and exercises; solutions available online

When doing exercises...

Use your smartphone to indicate your status.

  • Face/screen up: DONE :)
  • Face down: still working.

Thanks.

Framework

  • Focus in today's tutorial: collection design
  • Focus in team project: source selection

What are the differences between web scraping and APIs?

  • official vs. unofficial data access
  • scaling (APIs scale better)
  • web scraping is largely free, APIs are often for-pay
  • APIs are the linchpin of the internet economy

Source & more details: Web Appendix of “Fields of Gold”

Let's get started with web scraping

Which information to extract from a website (challenge #2.1)?

  • We need to decide which information to extract from a site
    • is the information publicly accessible or hidden behind a login wall?
    • can we reliably get the data, even after many iterations on the site?
    • which information do we need to justify that what we measure is measured well (“construct operationalization”)?

Example: Suppose you need to get… data on Spotify's streaming charts - where would you get it from?

Which information to extract from a website (challenge #2.1)?

  • Best practices
    • Explore different types of pages and providers
    • Get “used” to the website, browse a bit (“how to navigate”), become a customer
    • Identify roadblocks such as captchas
    • Explore limits to iterating through a site (e.g., max. 1000 pages)

DO: Exploring music-to-scrape

  • Open music-to-scrape.org
  • Familiarize yourself with the structure of the site
  • Check which data you encounter
  • Suppose the data were stored in a database: how would you structure it?

Finding information in HTML code

  • But… we don't have access to the database. All we see is the website.
  • So, let's “narrow” down on the information of interest
  • For this, we can use various extraction techniques on the HTML source code of the site
    • tags, e.g., <table>, <h1>, <div>
    • attributes, e.g., id <table id="example-table">
    • “special” attributes such as classes, e.g., <table class="striped-table">
    • attribute-value pairs, e.g., <table some_field_name = "123">

Frequently, you need to combine several of these methods to extract information.
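
To make this concrete, below is a minimal sketch (using a made-up HTML snippet, not music-to-scrape's actual source code) of how each of these techniques maps to a BeautifulSoup query:

from bs4 import BeautifulSoup

# a made-up HTML snippet to illustrate tags, attributes, classes, and attribute-value pairs
html = '''
<div id="main">
  <h1>Artist overview</h1>
  <table id="example-table" class="striped-table" some_field_name="123">
    <tr><td>Song A</td></tr>
  </table>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').get_text())                        # by tag
print(soup.find(id='example-table')['class'])            # by (id) attribute
print(soup.find(class_='striped-table').name)            # by class
print(soup.find(attrs={'some_field_name': '123'}).name)  # by attribute-value pair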

DO: Identifying information on music-to-scrape.org

CSS selectors and XPATHS (I)

  • When following tutorials on the web, some coders advocate for the use of CSS selectors and XPATHS to extract information.
  • XPATH
    • Think of it as the “path” to the specific data point
    • Example for “artist name”: /html/body/div[3]/section[1]/div/div[2]/h2
    • Likely to break very easily (say, when something changes earlier in the page, before the particular element)

CSS selectors and XPATHS (II)

  • CSS selector
    • Example for extracting “artist name”: .artist_info_title
    • Works here, but can be highly dependent on the exact tag, class name, and position (sometimes too detailed), and can also break
  • So, here, we are sticking mostly to classes, attributes, and attribute-value pairs.
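
For comparison, here is a minimal sketch that grabs the artist name once via the CSS selector from the example above and once via a class-based lookup; it assumes the artist page used later in this tutorial and the artist_info_title class shown above:

import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
soup = BeautifulSoup(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, 'html.parser')

print(soup.select_one('.artist_info_title').get_text())  # CSS selector
print(soup.find(class_='artist_info_title').get_text())  # class-based lookup (what we mostly use)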

Preview: Technical web scraping setup (I)

  • We will use the following libraries:
    • requests: downloads data from the web or transmits data (“headless”)
    • BeautifulSoup: structures HTML data so we can query it
    • selenium + chromedriver: simulates a browser (chrome), can scroll, click, view the site, but can also be headless; also structures HTML so we can query it
    • json: structure and query JSON data

Preview: Technical web scraping setup (II)

  • For basic, “static” websites, we use requests + BeautifulSoup (speed + ease!)
  • For dynamic websites, we use selenium (see the advanced tutorial), or selenium (retrieve data) + BeautifulSoup (structure/query)
  • For APIs, we exclusively use requests
  • For data storage, we use json (preferred) or CSV files (flat files with rows & columns)

–> Today: requests + BeautifulSoup

Getting the website into Python

import requests # let's load the requests library

# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the website which browser we (claim to) use
web_request = requests.get(url, headers = header)

# return the source code from the request object
web_request_source_code = web_request.text

BeautifulSoup 101

  • Why? Query the HTML code!
  • Think of it as a pipeline: download data –> pump to BeautifulSoup –> query data
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'

header = {'User-agent': 'Mozilla/5.0'} 

web_request = requests.get(url, headers = header)

soup = BeautifulSoup(web_request.text, 'html.parser')

print(soup.find('h2').get_text())
Ya Boy
  • Change the snippet to show the location and number of plays!
  • Tip: use .find(class_ = 'classname')

BeautifulSoup 101 (solution)

print(soup.find(class_ = 'about_artist').get_text())

Location:
United States
Number of plays:
36

DO: BeautifulSoup exercises

  • So far, we've just 'lump-extracted' all of the text
  • Next, let us refine our collection by getting…

    • the exact location (–> stored in location), the exact number of plays (–> stored in plays), and the total number of songs in the top 10.
  • Tips

    • .find(class_='class-name') for classes
    • you may need to use .find_all()
    • len() for counting
    • remember to write code like “an onion” (build it up layer by layer)

DO: BeautifulSoup exercises (solution)

location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()

plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

song_table = soup.find(class_ = 'top-songs')

number_of_songs = len(song_table.find_all('tr'))-1

Wrapping code in a function (I)

  • Functions are extremely useful to “reuse” code over and over again
    • Functions have a name (say, def function_name)
    • …and arguments (say, def function_name(argument1, argument2))
  • We can now wrap the code above in a function
    • def download_data(url)
    • the name of the function is download_data, and it requires url as input

Wrapping code in a function (II)

import requests
from bs4 import BeautifulSoup

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)

  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()

  print(f'Artist name: {artist_name}.')

# execute the function
download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.

DO: Wrap code into a function

  1. Adapt the function to also extract the other attributes (i.e., location, number of songs, etc.)
  2. Write a loop to extract data using this function for the following artist IDs.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

Tips:

  • Status messages with variables: print(f'Done retrieving {url}')

Solution

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)
  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={id}')
Artist name: Ya Boy from United States with 36 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 42 song plays.
Artist name: Cabas from  with 352 song plays.

Storing results into JSON data

  • JSON is the most flexible format; it supports “hierarchical” data
  • We demonstrate how to create a JSON object
  • “Full” JSON objects, vs. new-line separated JSON objects
    • Full: the entire FILE (e.g., data.json) is ONE giant JSON object
    • New line separation: each line in your file has one JSON object
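
A minimal sketch contrasting the two formats (the file names are made up; the records mirror the artist data from earlier):

import json

artists = [{'artist': 'Ya Boy', 'plays': 36},
           {'artist': 'Cabas', 'plays': 352}]

# "Full" JSON: the entire file is one JSON object (here, a list of records)
with open('full.json', 'w') as f:
  json.dump(artists, f)

# New-line separated JSON: one JSON object per line (easy to append to during a collection)
with open('newline.json', 'a') as f:
  for artist in artists:
    f.write(json.dumps(artist) + '\n')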

Storing results into JSON data

Change the code so it stores artist name, location and plays in the JSON object.


def download_data(url):
  soup = BeautifulSoup(requests.get(url).text, 'html.parser')
  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  out_data = {'artist': 'artist name',
            'location': 'store location here',
            'plays': 'store plays here'}

  return(out_data)

Saving JSON data

The final step is to save JSON data in a file.

We can do this with the JSON library.

import json

out_data = {'artist': 'artist name',
          'location': 'store location here',
          'plays': 'store plays here'}

to_json = json.dumps(out_data)

f=open('filename.json','a')
f.write(to_json+'\n')
f.close()

How to sample? (Challenge #2.2)

  • In the demos above, we just used a hard-coded list of URLs
  • In practice, getting to that list of books/products/songs/artists/users (generally speaking: “seeds”) is a web scraper in itself!
  • Examples
    • scrape all artists featured on the homepage
    • then loop through them and extract information for each artist (e.g., total number of plays); see the sketch after this list
    • scrape user names from reddit.com; then, use the reddit API to get user metadata
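
Below is a sketch of the first example (collecting seeds from the music-to-scrape homepage). The assumption that artist links contain artist-id= in their href is ours; verify it against the site's actual HTML:

import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
homepage = BeautifulSoup(requests.get('https://music-to-scrape.org', headers=header).text, 'html.parser')

# assumption: artists on the homepage are linked via URLs containing 'artist-id='
seeds = []
for link in homepage.find_all('a', href=True):
  if 'artist-id=' in link['href']:
    seeds.append(link['href'].split('artist-id=')[-1])

print(set(seeds))  # unique artist IDs to feed into the scraper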

How to sample? (Challenge #2.2)

  • Many challenges in sampling
    • sample size? generalizability? panel attrition?
    • all of the firm's data? or just a little bit? are subjects vulnerable (say, kids)?
    • can we actually “get” data on all subjects? can we match data?
  • Some solutions and best practices discussed in “Fields of Gold”
  • Let's look at the tables in the paper and try to find out…

Writing a complete web scraper

  • Have a list of seeds to start up the data collection (here, artist IDs)
  • loop through the list of URLs
    • store the raw data for diagnostic purposes
    • extract the relevant data into a new-line separated JSON file
  • schedule
  • infrastructure
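
Putting these steps together, here is a minimal sketch of such a scraper (file names are made up; scheduling and infrastructure, e.g., a cron job or a cloud instance, are not shown):

import json
import time
import requests
from bs4 import BeautifulSoup

seeds = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']  # seeds: artist IDs

for artist_id in seeds:
  url = f'https://music-to-scrape.org/artist?artist-id={artist_id}'
  web_request = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

  # store the raw source code for diagnostic purposes
  with open(f'raw_{artist_id}.html', 'w', encoding='utf-8') as f:
    f.write(web_request.text)

  # extract the relevant data and append it to a new-line separated JSON file
  soup = BeautifulSoup(web_request.text, 'html.parser')
  out_data = {'artist_id': artist_id,
              'artist': soup.find('h2').get_text(),
              'retrieved_at': int(time.time())}
  with open('artists.json', 'a') as f:
    f.write(json.dumps(out_data) + '\n')

  time.sleep(1)  # be polite to the server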

Wrapping up web scraping

  • Get content from a website
  • Since we don't want “everything”, we need to “find” relevant elements on the site using
    • classes
    • attributes
    • (attribute-value pairs)
  • We first build a prototype; then gradually improve it by
    • modularizing code as much as possible (using functions and loops)
    • ensuring code runs top-down

APIs

  • Standard way for exchanging data, functions or algorithms
  • Which APIs did you already encounter/explore?
  • Music-to-scrape is super easy to use; other APIs require more advanced authentication procedures
  • Structure corresponds to web scraping
    • have list of seeds (e.g., artist IDs; this was “URLs” for web scraping earlier)
    • store data in new-line separated JSON files (if necessary: convert back to CSV)

Let's explore the documentation and a first endpoint

–> we can use the browser to retrieve data from (simple) APIs

Extensions:

  • time data retrieval (e.g., every five seconds; discuss multiple strategies)
  • store featured artists in a list (or file)
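
One way to sketch the timed retrieval is shown below; the /featured-artists endpoint name is hypothetical and should be replaced by the endpoint listed in the API documentation:

import time
import requests

endpoint = 'https://api.music-to-scrape.org/featured-artists'  # hypothetical endpoint name

featured = []
for _ in range(5):                   # five retrievals as a demo
  response = requests.get(endpoint)
  featured.append(response.json())   # keep the raw JSON response in a list
  time.sleep(5)                      # wait five seconds between calls

print(featured)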

DO: Let's retrieve some artist meta data

Get some artist meta data for an artist of your choice!

1.) Find an artist ID on the site or in the previous code.

2.) Try to make a web request (in your browser) to the following URL:

  • Endpoint: https://api.music-to-scrape.org/artist/info
  • Requires parameter: artistid
  • Combined URL: https://api.music-to-scrape.org/artist/info?artistid={ENTER ARTIST ID HERE}

3.) Does it work? Then write Python code to retrieve the data.

4.) Finally, wrap your code in a function so you can later retrieve data for multiple artists.
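
One possible solution for steps 3 and 4, as a sketch (the endpoint and parameter are the ones documented above; the artist ID is one used earlier in this tutorial):

import requests

def get_artist_info(artist_id):
  url = f'https://api.music-to-scrape.org/artist/info?artistid={artist_id}'
  response = requests.get(url)
  return response.json()  # the endpoint returns JSON, which requests parses for us

# usage example
print(get_artist_info('ARICCN811C8A41750F'))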

At which frequency to extract the data? (Challenge #2.3)

  • Here, we extracted data every few seconds.
  • But, extraction frequency vastly differs by project
  • Considerations
    • archival vs. live data?
    • at which frequency does your phenomenon occur?
    • what's the refresh rate of the data source?
    • any excessive burden on servers caused by the frequency of extraction?

At which frequency to extract the data? (Challenge #2.3)

  • Some solutions
    • explore the gains of live data collection
    • adhere to best practices (say, 1 req per second)
    • randomize extraction order
    • use automatic schedulers for consistency
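
For instance, a small sketch that combines two of these practices, reusing the download_data() function and artist IDs from earlier: it randomizes the extraction order and stays at roughly one request per second:

import random
import time

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

random.shuffle(artist_ids)  # randomize the extraction order
for artist_id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={artist_id}')  # defined earlier
  time.sleep(1)             # stay at roughly 1 request per second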

Processing data during the collection (challenge #2.4)

  • Processing can have various degrees
    • just “save” the raw data
    • extract only necessary information
    • choose data format for saving (say, JSON vs. CSV)
  • Some selected challenges
    • GDPR vs value in retaining raw data
    • Anonymization or pseudonymization required?

Processing data during the collection (challenge #2.4)

  • Solutions
    • Retain raw when possible
    • Parse minimal amount of data on the fly
    • Remove sensitive info
    • Ensure proper encoding
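
A small sketch of these practices on a single record (the field names and file names are made up for illustration):

import json

raw_html = '<html>(raw page source as retrieved)</html>'  # placeholder for the raw source code
parsed = {'artist': 'Ya Boy', 'plays': 36, 'visitor_email': 'someone@example.com'}  # made-up parse

# retain the raw data when possible, with explicit UTF-8 encoding
with open('raw_page.html', 'w', encoding='utf-8') as f:
  f.write(raw_html)

# parse only the fields you need and remove sensitive information
parsed.pop('visitor_email', None)

# proper encoding also when appending to the new-line separated JSON file
with open('processed.json', 'a', encoding='utf-8') as f:
  f.write(json.dumps(parsed, ensure_ascii=False) + '\n')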

Optional (if time left): exercises with music-to-scrape

Please work on exercise 2.4 (see tutorial).

  1. collect (longer) list of featured artists
  2. for each of these featured artists, collect meta data
  3. store data in new-line separated JSON file

Next steps in this class

  • Work on your team project (see workplan on the course website) + make team allocation definite
  • Complete this tutorial at home
  • Get engaged - see the “bigger” picture of web scraping and APIs
    • discuss ideas & business opportunities (e.g., OpenAI's developer platform?)
    • use our course chatbot at odcm.tilburgai.nl
    • be in touch on WhatsApp for any issues/bugs/etc.