oDCM - Web Data for Dummies (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web data for dummies”).

  • If you haven't done so, open this slide deck (course material –> week 2) and the corresponding Jupyter Notebook
  • Coaching sessions (with Roshini) start after today's tutorial; also finalize the team assignment on Canvas
  • You can work in Google Colab today

Agenda

  • Today's lecture
    • Work through the in-class tutorial (“webdata for dummies in class.ipynb”)
    • We'll work on a selection of exercises on web scraping and APIs
    • Address some data selection challenges highlighted in “Fields of Gold”
  • After class
    • Check out today's workplan
    • Complete in-class tutorial and exercises; solutions available online

When doing exercises...

Use your smartphone to indicate your status.

  • Face/screen up: DONE :)
  • Face down: still working.

Thanks.

This room

Meet Windows! :)

  • Able to find Anaconda prompt?
  • Able to start up Jupyter Notebook?
  • Try RStudio as well (!) :)

Framework

  • Focus in today's tutorial: collection design
  • Focus in team project: source selection

What are differences between web scraping and APIs?

  • official vs. unofficial data access
  • scaling (APIs scale more)
  • web scraping is largely free; APIs are often paid
  • APIs are the linchpin of the internet economy

Source & more details: Web Appendix of “Fields of Gold”

Consider your options: scraping vs. APIs

  • Many teams “default” to scraping - it seems easier (“you can see what you get”) and you may have heard about it before
  • Indeed, APIs can be tricky (importantly: getting your first connection sorted out!)
  • But: when that is set - things often go smoothly!
  • Suggestion: explore APIs broadly; maybe even some AI-based APIs you can get access to (e.g., OpenAI's API)

Web scraping

  • Who's ever done it?!
  • Which sites?
  • Which data?

We get started with web scraping

Which information to extract from a website (challenge #2.1)?

  • We need to decide which information to extract from a site
    • is the information publicly accessible or hidden behind a login wall?
    • can we reliably get the data, even after many iterations on the site?
    • which information do we need to justify that what we measure is measured well (“construct operationalization”)?

Example: Suppose you need to get… data on Spotify's streaming charts - where would you get it from?

Which information to extract from a website (challenge #2.1)?

  • Best practices
    • Explore different types of pages and providers
    • Get “used” to the website, browse a bit (“how to navigate”), become a customer
    • Identify roadblocks such as captchas
    • Explore limits to iterating through a site (e.g., max. 1000 pages)

DO: Exploring music-to-scrape

  • Open music-to-scrape.org
  • Familiarize yourself with the structure of the site
  • Check which data you encounter
  • Suppose the data were stored in a database - how would you structure it?

Finding information in HTML code

  • But… we don't have access to the database. All we see is the website.
  • So, let's “narrow” down on the information of interest
  • For this, we can use various extraction techniques on the HTML source code of the site
    • tags, e.g., <table>, <h1>, <div>
    • attribute-value pairs, e.g., id is equal to “example-table” → <table id="example-table">
    • “special” attribute-value pairs such as classes, e.g., <table class="striped-table">

Frequently, you need to combine several of these methods to extract information.
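
As a quick preview of how such a combination looks in code (using the BeautifulSoup library introduced later; the HTML string is a made-up miniature example):

from bs4 import BeautifulSoup  # introduced below; used here only for a quick preview

html = '<div><table id="example-table" class="striped-table"><tr><td>Song A</td></tr></table></div>'
soup = BeautifulSoup(html, 'html.parser')

# combine methods: first narrow down by class, then find tags within that element
table = soup.find(class_='striped-table')
cells = table.find_all('td')
print([cell.get_text() for cell in cells])  # ['Song A']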

DO: Identifying information on music-to-scrape.org

CSS selectors and XPATHS (I)

  • When following tutorials on the web, some coders advocate for the use of CSS selectors and XPATHS to extract information.
  • XPATH
    • Think of it as the “path” to the specific data point
    • Example for “artist name”: /html/body/div[3]/section[1]/div/div[2]/h2
    • Likely to break very easily (say, if something changes before the particular element)

CSS selectors and XPATHS (II)

  • CSS selector
    • Example for extracting “artist name”: .artist_info_title
    • Works here, but can be highly dependent on the page structure (say, tag & class name + position - sometimes too detailed) and can also break
  • So, here, we are sticking mostly to tags, classes, and attribute-value pairs, defining our “way” to a particular element manually

Preview: Technical web scraping setup (I)

  • We will use the following libraries (see the installation note below):
    • requests: downloads data from the web or transmits data (“headless”)
    • BeautifulSoup: structures HTML data so we can query it
    • later: selenium + chromedriver: simulates a browser (chrome), can scroll, click, view the site, but can also be headless; also structures HTML so we can query it
    • json: structure and query JSON data
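
In case some of these are not yet installed in your (Anaconda) environment, the snippet below should cover them; json and time ship with Python itself. This assumes a Jupyter/Colab cell.

# run once in a Jupyter/Colab cell (drop the leading ! on the Anaconda prompt)
!pip install requests beautifulsoup4 selenium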

Preview: Technical web scraping setup (II)

Getting the data

  • Scraping
    • Basic, “static” websites: requests + BeautifulSoup (speed + ease!)
    • Dynamic websites: selenium (advanced tutorial), OR: selenium (retrieve data) + BeautifulSoup (structure/query)
  • APIs
    • Only requests

Storing the data

  • use json (preferred) or CSV files (flat files with rows & columns)


Today: requests + BeautifulSoup

Preview: Technical web scraping setup (III)

A few notes on my code:

  • You can copy-paste from the slides to your own Jupyter Notebooks
  • You can see the code that I write live, at https://tiu.nu/livecoding
  • Variable names can be arbitrarily set (but choose names that make sense to you)
  • We're picking up pace; we build upon programming concepts introduced last week (variable assignment, variable types, functions, functions with return values, and loops)

Getting the website into Python

import requests # let's load the requests library

# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # with the user agent, we tell the website which browser we (pretend to) use
web_request = requests.get(url, headers = header)

# return the source code from the request object
web_request_source_code = web_request.text

BeautifulSoup 101

  • Why? Query the HTML code!
  • Think of it as a pipeline: download data –> pump to BeautifulSoup –> query data
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'

header = {'User-agent': 'Mozilla/5.0'} 

web_request = requests.get(url, headers = header)

soup = BeautifulSoup(web_request.text, 'html.parser')

print(soup.find('h2').get_text())
Ya Boy
  • Change the snippet to show the location and number of plays!
  • Tip: use .find(class_ = 'classname')

BeautifulSoup 101 (solution)

print(soup.find(class_ = 'about_artist').get_text())

Location:
United States
Number of plays:
11

DO: BeautifulSoup exercises

  • So far, we've just 'lump-extracted' all of the text
  • Next, let us refine our collection by getting…

    • the exact location (–> stored in location), the exact number of plays (–> stored in plays), and the total number of songs in the top 10.
  • Tips

    • .find(class_='class-name') for classes
    • you may need to use .find_all()
    • len() for counting
    • remember to write code like “an onion”

DO: BeautifulSoup exercises (solution)

location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()

plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

song_table = soup.find(class_ = 'top-songs')

number_of_songs = len(song_table.find_all('tr'))-1

Wrapping code in a function (I)

  • Functions are extremely useful to “reuse” code over and over again
    • Functions have a name (say, def functionname)
    • …and arguments (say, def functionname(argument1, argument2))
  • We can now wrap the code above in a function
    • def download_data(url)
    • the name of the function is download_data, and it requires url as input

Wrapping code in a function (II)

import requests
from bs4 import BeautifulSoup

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)

  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()

  print(f'Artist name: {artist_name}.')

# execute the function
download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.

DO: Wrap code into a function

  1. Adapt the function to also extract the other attributes (i.e., location, number of songs, etc.)
  2. Write a loop to extract data using this function for the following artist IDs.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

Tips:

  • Status messages with variables:
print(f'Done retrieving {url}')

Solution

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)
  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={id}')
Artist name: Ya Boy from United States with 11 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 8 song plays.
Artist name: Cabas from  with 91 song plays.

Storing results into JSON data

  • JSON is the most flexible format; it supports “hierarchical data”
  • Create an empty object with obj = {}
  • New-line separated JSON objects
    • New line separation: each line in your file has one JSON object
    • compare to “full” JSON objects: one entire JSON object per file

Storing results into JSON data

Change the code so it stores artist name, location and plays in the JSON object.


def download_data(url):
  soup = BeautifulSoup(requests.get(url).text, 'html.parser')
  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  out_data = {'artist': 'artist name',
            'location': 'store location here',
            'plays': 'store plays here'}

  return(out_data)
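
A possible way to complete the snippet (a sketch - the point is to store the extracted variables rather than the placeholder strings):

  # inside download_data(), after extracting artist_name, location, and plays:
  out_data = {'artist': artist_name,
              'location': location,
              'plays': plays}

  return(out_data)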

Saving JSON data

The final step is to save JSON data in a file.

We can do this with the json.dumps function from the json library

import json

out_data = {'artist': 'artist name',
          'location': 'store location here',
          'plays': 'store plays here'}

# convert dict to "string" that we can save
to_json = json.dumps(out_data)

f=open('filename.json','a')
f.write(to_json+'\n')
f.close()
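
To check what ended up in the file, you can read a new-line separated JSON file back in line by line (a minimal sketch):

import json

records = []
with open('filename.json') as f:
    for line in f:
        records.append(json.loads(line))  # one JSON object per line

print(records)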

How to sample? (Challenge #2.2)

  • In the demos above, we just used a hard-coded list of URLs
  • In practice, getting to that list of books/products/songs/artists/users (generally speaking: “seeds”) is a web scraper in itself!
  • Examples
    • scrape all artists featured on the homepage (see the seed-collector sketch after this list)
    • then loop through them and extract information for each artist (e.g., the total number of plays)
    • scrape user names from reddit.com; then, use the reddit API to get user metadata
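
A minimal sketch of such a seed collector for music-to-scrape (assuming artist links on the homepage contain artist-id= in their href, as in the URLs used above):

import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
homepage = BeautifulSoup(requests.get('https://music-to-scrape.org', headers=header).text, 'html.parser')

# collect all links pointing to an artist page; the ID is the part after 'artist-id='
artist_ids = []
for link in homepage.find_all('a'):
    href = link.get('href', '')
    if 'artist-id=' in href:
        artist_ids.append(href.split('artist-id=')[-1])

print(set(artist_ids))  # our seeds for the next stage of the collection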

How to sample? (Challenge #2.2)

  • Many challenges in sampling
    • sample size? generalizability? panel attrition?
    • all of the firm's data? or just a little bit? are subjects vulnerable (say, kids)?
    • can we actually “get” data on all subjects? can we match data?
  • Some solutions and best practices discussed in “Fields of Gold”
  • Let's look at the tables in the paper and try to find out…

Writing a complete web scraper

  • Have a list of seeds to start up the data collection (here: artist IDs)
  • Loop through the list of URLs
    • store raw data for diagnostic purposes
    • extract the relevant data and append it to a (new-line separated) JSON file
  • schedule
  • infrastructure

For details/explanation, see Guyt et al. 2024.
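
Putting these steps together, a bare-bones version could look as follows (a sketch; file names are arbitrary):

import requests, json, time
from bs4 import BeautifulSoup

seeds = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
header = {'User-agent': 'Mozilla/5.0'}

for artist_id in seeds:
    url = f'https://music-to-scrape.org/artist?artist-id={artist_id}'
    web_request = requests.get(url, headers=header)

    # 1) store raw data for diagnostic purposes
    with open(f'raw_{artist_id}.html', 'w', encoding='utf-8') as raw_file:
        raw_file.write(web_request.text)

    # 2) extract the relevant data and append it to a new-line separated JSON file
    soup = BeautifulSoup(web_request.text, 'html.parser')
    out_data = {'artist_id': artist_id,
                'artist': soup.find('h2').get_text(),
                'location': soup.find(class_='about_artist').find_all('p')[0].get_text(),
                'plays': soup.find(class_='about_artist').find_all('p')[1].get_text()}
    with open('artists.json', 'a', encoding='utf-8') as f:
        f.write(json.dumps(out_data) + '\n')

    time.sleep(1)  # be nice to the server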

Wrapping up web scraping

  • Get content from a website
  • Since we don't want “everything”, we need to “find” relevant elements on the site using
    • tags
    • classes
    • attribute-value pairs
  • We first build a prototype; then gradually improve it by
    • modularizing code as much as possible (using functions and loops)
    • ensuring code runs top-down

APIs

  • Standard way for exchanging data, functions or algorithms
  • Which APIs did you already encounter/explore?
  • Music to scrape is super easy to use - other APIs require advanced authentication procedures (→ important for the project - run checks early)
  • Structure corresponds to web scraping
    • have list of seeds (e.g., artist IDs; this was “URLs” for web scraping earlier)
    • store data in new-line separated JSON files (if necessary: convert back to CSV)

Let's explore the documentation and a first endpoint

  • Pretty much all APIs have documentation, see here for ours

  • Let's explore a first endpoint in our browser: https://api.music-to-scrape.org/artists/featured

  • DO: Describe what you see; what does that mean? How does the data link to other sections of the website? Do you “get” the logic?

Retrieving data in Python

import requests
con = requests.get('https://api.music-to-scrape.org/artists/featured')

# convert to json
obj = con.json()

obj
{'artists': [{'artist': 'Fred Merpol', 'artist_id': 'ARKIQSL1241B9C90C8'}, {'artist': 'Off Broadway', 'artist_id': 'AR4IYQR1187B98F8F3'}, {'artist': 'A Challenge Of Honour', 'artist_id': 'ARL1QL91187B994B08'}, {'artist': 'Ya Boy', 'artist_id': 'ARICCN811C8A41750F'}, {'artist': 'Milo', 'artist_id': 'ARJ8ZIQ1187FB3FB5A'}]}

Working with JSON objects

  • Looking at them (obj)
  • Giving them arbitrary names (obj, anothername, …)
  • Accessing “nodes” (obj['artists'] or obj.get('artists'))
  • Accessing items in a list (obj['artists'][0]): multiple objects are in that list!
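
For instance, with the obj retrieved above (and assuming the same response structure):

first_artist = obj['artists'][0]        # the first artist object in the list
print(first_artist['artist'])           # its name, e.g., 'Fred Merpol'
print(first_artist.get('artist_id'))    # the corresponding ID, using .get()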

DO: Application 1 (loops)

Remember loops from last week's bootcamp? We can “iterate” through result objects.

for i in obj['artists']:
  print(i.get('artist'))
Fred Merpol
Off Broadway
A Challenge Of Honour
Ya Boy
Milo

DO: Can you also print out the artist IDs?
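
One possible answer, reusing the same loop (the artist_id key is visible in the JSON output above):

for i in obj['artists']:
  print(i.get('artist'), i.get('artist_id'))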

DO: Turning code into a function

  • Please write a function to execute the data collection of featured artists.
  • Call this function getdata()

Starting code:

def getdata():
  # YOUR CODE HERE

  # return some data

  # return()

Solution:

import requests

def getdata():
  con = requests.get('https://api.music-to-scrape.org/artists/featured')
  obj = con.json()
  return(obj)

Let's call the function!

getdata()
{'artists': [{'artist': 'Grant Geissman', 'artist_id': 'ARIN12F1187FB3E92C'}, {'artist': 'Fred Merpol', 'artist_id': 'ARKIQSL1241B9C90C8'}, {'artist': 'Three-6 Mafia', 'artist_id': 'ARY55LO1187B9A3F17'}, {'artist': 'The Honeydogs', 'artist_id': 'ARSWORN1187B991A7B'}, {'artist': 'Terry Muska', 'artist_id': 'ARMORUX11F50C4EEBF'}]}

Timing the data collection

1) This makes your computer sleep for 5 seconds

import time
time.sleep(5) # sleeps 5 seconds

2) This makes your computer go on forever…

while True:
  # command here

3) This makes your computer repeat something a fixed number of times (here: 5 times)…

counter = 0
while counter < 5:
  counter = counter + 1
  print("Hello")

DO: Please execute your data collection every second, at max. 5 times.

Solution

import time

i = 0
while i < 5:
  print(getdata())
  time.sleep(1)
  i = i + 1

At which frequency to extract the data? (Challenge #2.3)

  • Here, we extracted data every few seconds.
  • But, extraction frequency vastly differs by project
  • Considerations
    • archival vs. live data?
    • at which frequency does your phenomenon occur?
    • what's the refresh rate of the data source?
    • any excessive burden on servers caused by the frequency of extraction?

At which frequency to extract the data? (Challenge #2.3)

  • Some solutions
    • explore gains of live data collections
    • adhere to best practices (say, 1 request per second)
    • randomize extraction order
    • use automatic schedulers for consistency
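
For illustration, randomizing the extraction order and pacing requests could look like this (a sketch reusing download_data() and the artist IDs from earlier):

import random, time

seeds = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
random.shuffle(seeds)  # randomize extraction order

for artist_id in seeds:
    download_data(f'https://music-to-scrape.org/artist?artist-id={artist_id}')
    time.sleep(1)  # best practice: roughly 1 request per second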

Extensions: data storage

Two options:

  1. store internally in a list ([...]), then write to file at the end of your script
  2. directly write to a file (f = open(), f.write(), f.close()) (“parsing on the fly”)
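
A sketch of both options, assuming getdata() from before returns one JSON-serializable object per call:

import json

# Option 1: collect results in a list, write everything to file at the end
results = []
for i in range(5):
    results.append(getdata())
with open('results_full.json', 'w') as f:
    f.write(json.dumps(results))

# Option 2: write each result to file right away ("parsing on the fly")
f = open('results_lines.json', 'a')
for i in range(5):
    f.write(json.dumps(getdata()) + '\n')
f.close()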

Processing data during the collection (challenge #2.4)

  • Processing can have various degrees
    • just “save” the raw data
    • extract only necessary information
    • choose data format for saving (say, JSON vs. CSV)
  • Some selected challenges
    • GDPR vs. value in retaining raw data
    • Anonymization or pseudonymization required?

Processing data during the collection (challenge #2.4)

  • Solutions
    • Retain raw when possible
    • Parse minimal amount of data on the fly
    • Remove sensitive info
    • Ensure proper encoding
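
For illustration only - the record and its user_email field are made up here - minimal parsing, removing sensitive information, and explicit encoding could look like this:

import json

raw_record = {'artist': 'Ya Boy', 'plays': 11, 'user_email': 'fan@example.com'}  # hypothetical raw record

# parse only what you need and drop sensitive fields
clean_record = {'artist': raw_record['artist'], 'plays': raw_record['plays']}

with open('clean.json', 'a', encoding='utf-8') as f:  # ensure proper encoding
    f.write(json.dumps(clean_record) + '\n')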

Optional I (if time left): Artist meta data

Get some artist meta data for an artist of your choice!

1.) Find an artist ID on the site or in previous code

2.) Try to make a web request (in your browser) to the following URL:

  • Endpoint: https://api.music-to-scrape.org/artist/info
  • Requires parameter: artistid
  • Combined URL: https://api.music-to-scrape.org/artist/info?artistid={ENTER ARTIST ID HERE}

3.) Does it work? Then write Python code to retrieve the data.

4.) Finally, wrap your code in a function so you can later retrieve data for multiple artists.
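
A possible solution sketch, using the endpoint and parameter from step 2 (the function name is just a suggestion; the artist ID is one used earlier in this tutorial):

import requests

def get_artist_info(artist_id):
    con = requests.get(f'https://api.music-to-scrape.org/artist/info?artistid={artist_id}')
    return(con.json())

print(get_artist_info('ARICCN811C8A41750F'))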

Optional II (if time left): exercises with music-to-scrape

Please work on exercise 2.4 (see tutorial).

  1. collect (longer) list of featured artists
  2. for each of these featured artists, collect meta data
  3. store data in new-line separated JSON file
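
A rough outline of how the pieces could fit together (a sketch reusing getdata() and a get_artist_info() helper like the one above; it assumes the featured-artists response has the structure shown earlier):

import json, time

# 1) collect a (longer) list of featured artists by calling the endpoint a few times
featured = []
for i in range(5):
    featured.extend(getdata()['artists'])
    time.sleep(1)

# 2) for each featured artist, collect meta data, and 3) store it as new-line separated JSON
with open('artist_meta.json', 'a', encoding='utf-8') as f:
    for artist in featured:
        meta = get_artist_info(artist['artist_id'])
        f.write(json.dumps(meta) + '\n')
        time.sleep(1)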

Questions

Next steps in this class

  • Complete this tutorial at home - focus equally on scraping and APIs
  • Get engaged - see the “bigger” picture of web scraping and APIs
    • discuss ideas & business opportunities (e.g., OpenAI's developer platform?)
    • use our course chatbot at odcm.tilburgai.nl
    • get in touch with Roshini (email) or me (on WhatsApp) for any issues/bugs/etc.
  • Coaching session: Work on your team project (see workplan on the course website) + make team allocation definite (with Roshini)