oDCM - Web Data for Dummies (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web data for dummies”).

  • If you haven't done so, open this slide deck (course material –> week 2) and the corresponding Jupyter Notebook
  • Coaching sessions (with Roshini) start after today's tutorial + finalize your team assignment on Canvas
  • You can work in Google Colab today

Agenda

  • Today's lecture
    • Work through the in-class tutorial (“webdata for dummies in class.ipynb”)
    • We'll work on a selection of exercises on web scraping and APIs
    • Address some data selection challenges highlighted in “Fields of Gold”
  • After class
    • Check out today's workplan
    • Complete in-class tutorial and exercises; solutions available online

When doing exercises...

Use your smartphone to indicate your status.

  • Face/screen up: DONE :)
  • Face down: still working.

Thanks.

Framework

  • Focus in today's tutorial: collection design
  • Focus in team project: source selection

What are the differences between web scraping and APIs?

  • official vs. unofficial data access
  • scaling (APIs scale better)
  • web scraping is largely free, APIs are often for-pay
  • APIs are the linchpin of the internet economy

Source & more details: Web Appendix of “Fields of Gold”

Let's get started with web scraping

Which information to extract from a website (challenge #2.1)?

  • We need to decide which information to extract from a site
    • is the information publicly accessible or hidden behind a login wall?
    • can we reliably get the data, even after many iterations on the site?
    • which information do we need to justify that what we measure is measured well (“construct operationalization”)?

Example: Suppose you need to get… data on Spotify's streaming charts - where would you get it from?

Which information to extract from a website (challenge #2.1)?

  • Best practices
    • Explore different types of pages and providers
    • Get “used” to the website, browse a bit (“how to navigate”), become a customer
    • Identify roadblocks such as captchas
    • Explore limits to iterating through a site (e.g., max. 1000 pages)

DO: Exploring music-to-scrape

  • Open music-to-scrape.org
  • Familiarize yourself with the structure of the site
  • Check which data you encounter
  • Suppose the data were stored in a database: how would you structure it?

Finding information in HTML code

  • But… we don't have access to the database. All we see is the website.
  • So, let's “narrow” down on the information of interest
  • For this, we can use various extraction techniques on the HTML source code of the site
    • tags, e.g., <table>, <h1>, <div>
    • attributes, e.g., id <table id="example-table">
    • “special” attributes such as classes, e.g., <table class="striped-table">
    • attribute-value pairs, e.g., <table some_field_name = "123">

Frequently, you need to combine several of these methods to extract information.
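
To make this concrete, below is a minimal sketch (using a made-up HTML snippet, not music-to-scrape's actual source code) of how each of these techniques maps to a BeautifulSoup query:

from bs4 import BeautifulSoup

# a made-up HTML snippet to illustrate tags, attributes, classes, and attribute-value pairs
html = '''
<div id="main">
  <h1>Artist overview</h1>
  <table id="example-table" class="striped-table" some_field_name="123">
    <tr><td>Song A</td></tr>
  </table>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').get_text())                        # by tag
print(soup.find(id='example-table')['class'])            # by (id) attribute
print(soup.find(class_='striped-table').name)            # by class
print(soup.find(attrs={'some_field_name': '123'}).name)  # by attribute-value pair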

DO: Identifying information on music-to-scrape.org

CSS selectors and XPATHS (I)

  • When following tutorials on the web, some coders advocate for the use of CSS selectors and XPATHS to extract information.
  • XPATH
    • Think of it as the “path” to the specific data point
    • Example for “artist name”: /html/body/div[3]/section[1]/div/div[2]/h2
    • Likely to break very easily (say, when something changes earlier in the page, before the particular element)

CSS selectors and XPATHS (II)

  • CSS selector
    • Example for extracting “artist name”: .artist_info_title
    • Works here, but can be highly dependent on the exact tag, class name, and position (sometimes too detailed), and can also break
  • So, here, we are sticking mostly to classes, attributes, and attribute-value pairs.
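
For comparison, here is a minimal sketch that grabs the artist name once via the CSS selector from the example above and once via a class-based lookup; it assumes the artist page used later in this tutorial and the artist_info_title class shown above:

import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
soup = BeautifulSoup(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, 'html.parser')

print(soup.select_one('.artist_info_title').get_text())  # CSS selector
print(soup.find(class_='artist_info_title').get_text())  # class-based lookup (what we mostly use)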

Preview: Technical web scraping setup (I)

  • We will use the following libraries:
    • requests: downloads data from the web or transmits data (“headless”)
    • BeautifulSoup: structures HTML data so we can query it
    • selenium + chromedriver: simulates a browser (chrome), can scroll, click, view the site, but can also be headless; also structures HTML so we can query it
    • json: structure and query JSON data

Preview: Technical web scraping setup (II)

  • For basic, “static” websites, we use requests + BeautifulSoup (speed + ease!)
  • For dynamic websites, we use selenium (see the advanced tutorial), or selenium (retrieve data) + BeautifulSoup (structure/query)
  • For APIs, we exclusively use requests
  • For data storage, we use json (preferred) or CSV files (flat files with rows & columns)

–> Today: requests + BeautifulSoup

Getting the website into Python

import requests # let's load the requests library

# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the website which browser we (claim to) use
web_request = requests.get(url, headers = header)

# return the source code from the request object
web_request_source_code = web_request.text

BeautifulSoup 101

  • Why? Query the HTML code!
  • Think of it as a pipeline: download data –> pump to BeautifulSoup –> query data
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'

header = {'User-agent': 'Mozilla/5.0'} 

web_request = requests.get(url, headers = header)

soup = BeautifulSoup(web_request.text, 'html.parser')

print(soup.find('h2').get_text())
Ya Boy
  • Change the snippet to show the location and number of plays!
  • Tip: use .find(class_ = 'classname')

BeautifulSoup 101 (solution)

print(soup.find(class_ = 'about_artist').get_text())

Location:
United States
Number of plays:
36

DO: BeautifulSoup exercises

  • So far, we've just 'lump-extracted' all of the text
  • Next, let us refine our collection by getting…

    • the exact location (–> stored in location), the exact number of plays (–> stored in plays), and the total number of songs in the top 10.
  • Tips

    • .find(class_='class-name') for classes
    • you may need to use .find_all()
    • len() for counting
    • remember to write code like “an onion” (build it up layer by layer)

DO: BeautifulSoup exercises (solution)

location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()

plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

song_table = soup.find(class_ = 'top-songs')

number_of_songs = len(song_table.find_all('tr'))-1

Wrapping code in a function (I)

  • Functions are extremely useful to “reuse” code over and over again
    • Functions have a name (say, def function_name)
    • …and arguments (say, def function_name(argument1, argument2))
  • We can now wrap the code above in a function
    • def download_data(url)
    • the name of the function is download_data, and it requires url as input

Wrapping code in a function (II)

import requests
from bs4 import BeautifulSoup

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)

  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()

  print(f'Artist name: {artist_name}.')

# execute the function
download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.

DO: Wrap code into a function

  1. Adapt the function to also extract the other attributes (i.e., location, number of songs, etc.)
  2. Write a loop to extract data using this function for the following artist IDs.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

Tips:

  • Status messages with variables: print(f'Done retrieving {url}')

Solution

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 

  web_request = requests.get(url, headers = header)
  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={id}')
Artist name: Ya Boy from United States with 36 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 42 song plays.
Artist name: Cabas from  with 352 song plays.

Storing results into JSON data

  • JSON is the most flexible format; it supports “hierarchical” data
  • We demonstrate how to create a JSON object
  • “Full” JSON objects, vs. new-line separated JSON objects
    • Full: the entire FILE (e.g., data.json) is ONE giant JSON object
    • New line separation: each line in your file has one JSON object
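
A minimal sketch contrasting the two formats (the file names are made up; the records mirror the artist data from earlier):

import json

artists = [{'artist': 'Ya Boy', 'plays': 36},
           {'artist': 'Cabas', 'plays': 352}]

# "Full" JSON: the entire file is one JSON object (here, a list of records)
with open('full.json', 'w') as f:
  json.dump(artists, f)

# New-line separated JSON: one JSON object per line (easy to append to during a collection)
with open('newline.json', 'a') as f:
  for artist in artists:
    f.write(json.dumps(artist) + '\n')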

Storing results into JSON data

Change the code so it stores artist name, location and plays in the JSON object.


def download_data(url):
  soup = BeautifulSoup(requests.get(url).text, 'html.parser')
  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  out_data = {'artist': 'artist name',
            'location': 'store location here',
            'plays': 'store plays here'}

  return(out_data)

Saving JSON data

The final step is to save JSON data in a file.

We can do this with the JSON library.

import json

out_data = {'artist': 'artist name',
          'location': 'store location here',
          'plays': 'store plays here'}

to_json = json.dumps(out_data)

f=open('filename.json','a')
f.write(to_json+'\n')
f.close()

How to sample? (Challenge #2.2)

  • In the demos above, we just used a hard-coded list of URLs
  • In practice, getting to that list of books/products/songs/artists/users (generally speaking: “seeds”) is a web scraper in itself!
  • Examples
    • scrape all artists featured on the homepage
    • then loop through them and extract information for each artist (e.g., total number of plays); see the sketch after this list
    • scrape user names from reddit.com; then, use the reddit API to get user metadata
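
Below is a sketch of the first example (collecting seeds from the music-to-scrape homepage). The assumption that artist links contain artist-id= in their href is ours; verify it against the site's actual HTML:

import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
homepage = BeautifulSoup(requests.get('https://music-to-scrape.org', headers=header).text, 'html.parser')

# assumption: artists on the homepage are linked via URLs containing 'artist-id='
seeds = []
for link in homepage.find_all('a', href=True):
  if 'artist-id=' in link['href']:
    seeds.append(link['href'].split('artist-id=')[-1])

print(set(seeds))  # unique artist IDs to feed into the scraper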

How to sample? (Challenge #2.2)

  • Many challenges in sampling
    • sample size? generalizability? panel attrition?
    • all of the firm's data? or just a little bit? are subjects vulnerable (say, kids)?
    • can we actually “get” data on all subjects? can we match data?
  • Some solutions and best practices discussed in “Fields of Gold”
  • Let's look at the tables in the paper and try to find out…

Writing a complete web scraper

  • Have a list of seeds to start up the data collection (here, artist IDs)
  • loop through the list of URLs
    • store the raw data for diagnostic purposes
    • extract the relevant data into a new-line separated JSON file
  • schedule
  • infrastructure
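
Putting these steps together, here is a minimal sketch of such a scraper (file names are made up; scheduling and infrastructure, e.g., a cron job or a cloud instance, are not shown):

import json
import time
import requests
from bs4 import BeautifulSoup

seeds = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']  # seeds: artist IDs

for artist_id in seeds:
  url = f'https://music-to-scrape.org/artist?artist-id={artist_id}'
  web_request = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

  # store the raw source code for diagnostic purposes
  with open(f'raw_{artist_id}.html', 'w', encoding='utf-8') as f:
    f.write(web_request.text)

  # extract the relevant data and append it to a new-line separated JSON file
  soup = BeautifulSoup(web_request.text, 'html.parser')
  out_data = {'artist_id': artist_id,
              'artist': soup.find('h2').get_text(),
              'retrieved_at': int(time.time())}
  with open('artists.json', 'a') as f:
    f.write(json.dumps(out_data) + '\n')

  time.sleep(1)  # be polite to the server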

Wrapping up web scraping

  • Get content from a website
  • Since we don't want “everything”, we need to “find” relevant elements on the site using
    • classes
    • attributes
    • (attribute-value pairs)
  • We first build a prototype; then gradually improve it by
    • modularizing code as much as possible (using functions and loops)
    • ensuring code runs top-down

APIs

  • Standard way for exchanging data, functions or algorithms
  • Which APIs did you already encounter/explore?
  • Music-to-scrape is super easy to use; other APIs require more advanced authentication procedures
  • Structure corresponds to web scraping
    • have list of seeds (e.g., artist IDs; this was “URLs” for web scraping earlier)
    • store data in new-line separated JSON files (if necessary: convert back to CSV)

Let's explore the documentation and a first endpoint

–> we can use the browser to retrieve data from (simple) APIs

Extensions:

  • time data retrieval (e.g., every five seconds; discuss multiple strategies)
  • store featured artists in a list (or file)
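
One way to sketch the timed retrieval is shown below; the /featured-artists endpoint name is hypothetical and should be replaced by the endpoint listed in the API documentation:

import time
import requests

endpoint = 'https://api.music-to-scrape.org/featured-artists'  # hypothetical endpoint name

featured = []
for _ in range(5):                   # five retrievals as a demo
  response = requests.get(endpoint)
  featured.append(response.json())   # keep the raw JSON response in a list
  time.sleep(5)                      # wait five seconds between calls

print(featured)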

DO: Let's retrieve some artist meta data

Get some artist meta data for an artist of your choice!

1.) Find an artist ID on the site or in the previous code.

2.) Try to make a web request (in your browser) to the following URL:

  • Endpoint: https://api.music-to-scrape.org/artist/info
  • Requires parameter: artistid
  • Combined URL: https://api.music-to-scrape.org/artist/info?artistid={ENTER ARTIST ID HERE}

3.) Does it work? Then write Python code to retrieve the data.

4.) Finally, wrap your code in a function so you can later retrieve data for multiple artists.
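
One possible solution for steps 3 and 4, as a sketch (the endpoint and parameter are the ones documented above; the artist ID is one used earlier in this tutorial):

import requests

def get_artist_info(artist_id):
  url = f'https://api.music-to-scrape.org/artist/info?artistid={artist_id}'
  response = requests.get(url)
  return response.json()  # the endpoint returns JSON, which requests parses for us

# usage example
print(get_artist_info('ARICCN811C8A41750F'))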

At which frequency to extract the data? (Challenge #2.3)

  • Here, we extracted data every few seconds.
  • But, extraction frequency vastly differs by project
  • Considerations
    • archival vs. live data?
    • at which frequency does your phenomenon occur?
    • what's the refresh rate of the data source?
    • any excessive burden on servers caused by the frequency of extraction?

At which frequency to extract the data? (Challenge #2.3)

  • Some solutions
    • explore the gains of live data collection
    • adhere to best practices (say, 1 req per second)
    • randomize extraction order
    • use automatic schedulers for consistency
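
For instance, a small sketch that combines two of these practices, reusing the download_data() function and artist IDs from earlier: it randomizes the extraction order and stays at roughly one request per second:

import random
import time

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

random.shuffle(artist_ids)  # randomize the extraction order
for artist_id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={artist_id}')  # defined earlier
  time.sleep(1)             # stay at roughly 1 request per second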

Processing data during the collection (challenge #2.4)

  • Processing can have various degrees
    • just “save” the raw data
    • extract only necessary information
    • choose data format for saving (say, JSON vs. CSV)
  • Some selected challenges
    • GDPR vs value in retaining raw data
    • Anonymization or pseudonymization required?

Processing data during the collection (challenge #2.4)

  • Solutions
    • Retain raw when possible
    • Parse minimal amount of data on the fly
    • Remove sensitive info
    • Ensure proper encoding
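
A small sketch of these practices on a single record (the field names and file names are made up for illustration):

import json

raw_html = '<html>(raw page source as retrieved)</html>'  # placeholder for the raw source code
parsed = {'artist': 'Ya Boy', 'plays': 36, 'visitor_email': 'someone@example.com'}  # made-up parse

# retain the raw data when possible, with explicit UTF-8 encoding
with open('raw_page.html', 'w', encoding='utf-8') as f:
  f.write(raw_html)

# parse only the fields you need and remove sensitive information
parsed.pop('visitor_email', None)

# proper encoding also when appending to the new-line separated JSON file
with open('processed.json', 'a', encoding='utf-8') as f:
  f.write(json.dumps(parsed, ensure_ascii=False) + '\n')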

Optional (if time left): exercises with music-to-scrape

Please work on exercise 2.4 (see tutorial).

  1. collect (longer) list of featured artists
  2. for each of these featured artists, collect meta data
  3. store data in new-line separated JSON file

Next steps in this class

  • Work on your team project (see workplan on the course website) + make team allocation definite
  • Complete this tutorial at home
  • Get engaged - see the “bigger” picture of web scraping and APIs
    • discuss ideas & business opportunities (e.g., OpenAI's developer platform?)
    • use our course chatbot at odcm.tilburgai.nl
    • be in touch on WhatsApp for any issues/bugs/etc.