oDCM - Web Data for Dummies (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web data for dummies”).

Agenda

  • In-class
    • Work through the in-class tutorial (“webdata for dummies in class.ipynb”)
    • We'll work on a selection of exercises on web scraping and APIs
    • Address some data selection challenges highlighted in “Fields of Gold”
  • After class
    • Work on team activity #1
    • Complete tutorial and exercises + be in touch for feedback during coaching hours

Framework

  • Focus in today's tutorial: collection design
  • Focus in team project: source selection

What are the differences between web scraping and APIs?

  • official vs. unofficial data access
  • scaling (APIs scale better)
  • web scraping is largely free; many APIs are paid
  • APIs are the linchpin of the internet economy
  • Learn about other differences and commonalities in the Web Appendix of “Fields of Gold”.

We get started with web scraping

Which information to extract from a website (challenge #2.1)?

  • We need to decide which information to extract from a site
    • is the information publicly accessible or hidden behind a login wall?
    • can we reliably get the data, even after many iterations on the site?
    • which information do we need to ensure that what we aim to measure is measured well (“construct operationalization”)?

Example: Suppose you need data on Spotify's streaming charts – where would you get it from?

Which information to extract from a website (challenge #2.1)?

  • Best practices
    • Explore different types of pages and providers
    • Get “used” to the website, browse a bit (“how to navigate”), become a customer
    • Identify roadblocks such as captchas
    • Explore limits to iterating through a site (e.g., max. 1000 pages)

DO: Exploring music-to-scrape

  • Open music-to-scrape.org (fallback option: books.toscrape.com)
  • Familiarize yourself with the structure of the site
  • Check which data you encounter
  • Suppose the data were stored in a database – how would you structure it?

Finding information in HTML code

  • But… we don't have access to the database. All we see is the website.
  • So, let's “narrow down” the information of interest
  • For this, we can use various extraction techniques on the HTML source code of the site
    • tags, e.g., <table>, <h1>, <div>
    • attributes, e.g., id <table id="example-table">
    • “special” attributes such as classes, e.g., <table class="striped-table">
    • attribute-value pairs, e.g., <table some_field_name = "123">

Frequently, you need to combine several of these methods to extract information.
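To make this concrete, here is a minimal sketch that applies all four techniques to a made-up HTML snippet (the snippet itself is illustrative, not taken from music-to-scrape):

from bs4 import BeautifulSoup

# a made-up snippet that contains all four "hooks" mentioned above
html = '''
<h1>Charts</h1>
<table id="example-table" class="striped-table" some_field_name="123">
  <tr><td>Song A</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').get_text())                        # by tag
print(soup.find(id='example-table').name)                # by id attribute
print(soup.find(class_='striped-table').name)            # by class
print(soup.find(attrs={'some_field_name': '123'}).name)  # by attribute-value pair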

DO: Identifying information on music-to-scrape.org

DO: Identifying information on books.toscrape.com (fall-back option)

CSS selectors and XPATHS (I)

  • In other tutorials, you may see coders advocate the use of CSS selectors and XPaths to extract information.
  • XPATH
    • Think of it as the “path” to the specific data point
    • Example for “artist name”: /html/body/div[3]/section[1]/div/div[2]/h2
    • Likely to break very easily (say, when something changes earlier in the page, before the particular element)

CSS selectors and XPATHS (II)

  • CSS selector
    • Example for extracting “artist name”: .artist_info_title
    • Works here, but can be highly dependent on the page structure (say, tag and class name plus position) – so it can also break
  • So, here, we stick mostly to classes, attributes, and attribute-value pairs (see the sketch below)
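For comparison, a minimal sketch of both approaches on the artist page, assuming the artist name indeed carries the artist_info_title class from the example above:

import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
soup = BeautifulSoup(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, 'html.parser')

# CSS selector: concise, but tied to how the page is styled
print(soup.select_one('.artist_info_title').get_text())

# class-based find: the style we mostly use in this tutorial
print(soup.find(class_='artist_info_title').get_text())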

Getting the website into Python

import requests

# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser we (claim to) use
web_request = requests.get(url, headers = header)

# return the source code from the request object
web_request_source_code = web_request.text

BeautifulSoup 101

  • Why? Query the HTML code!
  • Think of it as a pipeline: download data –> pump to BeautifulSoup –> query data
import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} 
web_request = requests.get(url, headers = header)

soup = BeautifulSoup(web_request.text, 'html.parser')
print(soup.find('h2').get_text())
Ya Boy
  • Change the snippet to show the location and number of plays!
  • Tip: use .find(class_ = 'classname')

BeautifulSoup 101 (solution)

print(soup.find(class_ = 'about_artist').get_text())

Location:
United States
Number of plays:
9

DO: BeautifulSoup exercises

  • So far, we've just 'lump-extracted' all of the text
  • Next, let us refine our collection by getting…

    • the exact location (–> stored in location), the exact number of plays (–> stored in plays), and the total number of songs in the top 10.
  • Tips

    • .find(class_='class-name') for classes
    • you may need to use .find_all()
    • len() for counting
    • remember to write code like “an onion”

DO: BeautifulSoup exercises (solution)

# location and plays are the first two <p> elements in the 'about_artist' section
location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

# count the rows in the top-songs table (minus one for the header row)
song_table = soup.find(class_ = 'top-songs')
number_of_songs = len(song_table.find_all('tr')) - 1

Wrapping code in a function (I)

  • Functions are extremely useful to “reuse” code over and over again
    • Functions have a name (say, def function_name)
    • …and arguments (say, def function_name(argument1, argument2))
  • We can now wrap the code above in a function
    • def download_data(url)
    • the name of the function is download_data, and it requires url as input

Wrapping code in a function (II)

import requests
from bs4 import BeautifulSoup

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 
  web_request = requests.get(url, headers = header)
  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()

  print(f'Artist name: {artist_name}.')


download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.

DO: Wrap code into a function

  1. Adapt the function to also extract the other attributes (i.e., location, number of songs, etc.)
  2. Write a loop to extract data using this function for the following artist IDs.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

Tips:

  • Status messages with variables: print(f'Done retrieving {url}')

Solution

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']

def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'} 
  web_request = requests.get(url, headers = header)
  soup = BeautifulSoup(web_request.text, 'html.parser')

  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for artist_id in artist_ids:
  download_data(f'https://music-to-scrape.org/artist?artist-id={artist_id}')
Artist name: Ya Boy from United States with 9 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 7 song plays.
Artist name: Cabas from  with 98 song plays.

Wrapping results into JSON data

  • JSON is the most flexible format: it supports “hierarchical” data
  • Below, we demonstrate how to create such an object
  • “Full” JSON objects vs. new-line separated JSON objects
    • Full: the entire FILE (e.g., data.json) is ONE giant JSON object
    • New-line separation: each line in your file holds one JSON object
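A minimal sketch of the difference (file names are illustrative):

import json

records = [{'artist': 'A'}, {'artist': 'B'}]

# "full" JSON: the entire file is one JSON object (here: a list)
with open('full.json', 'w') as f:
  json.dump(records, f)

# new-line separated JSON: one JSON object per line
with open('newline.json', 'w') as f:
  for record in records:
    f.write(json.dumps(record) + '\n')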

Wrapping results into JSON data

Change the code so that it stores the artist name, location, and plays in the JSON object.


def download_data(url):
  header = {'User-agent': 'Mozilla/5.0'}
  soup = BeautifulSoup(requests.get(url, headers = header).text, 'html.parser')
  artist_name = soup.find('h2').get_text()
  location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
  plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()

  out_data = {'artist': 'artist name',
            'location': 'store location here',
            'plays': 'store plays here'}

  return out_data

Saving JSON data

The final step is to save the JSON data in a file.

We can do this with Python's json module.

import json

out_data = {'artist': 'artist name',
          'location': 'store location here',
          'plays': 'store plays here'}

to_json = json.dumps(out_data)

# open in append mode ('a') and write one JSON object per line
with open('filename.json', 'a') as f:
  f.write(to_json + '\n')

How to sample? (Challenge #2.2)

  • In the demos above, we just used a hard-coded list of URLs
  • In practice, getting to that list of books/products/songs/artists/users (generally speaking: “seeds”) is a web scraper in itself!
  • Examples
    • scrape all artists featured on the homepage
    • then loop through them and extract information for each artist (e.g., the total number of plays)
    • scrape user names from reddit.com; then, use the Reddit API to get user metadata

How to sample? (Challenge #2.2)

  • Many challenges in sampling
    • sample size? generalizability? panel attrition?
    • all of the firm's data? or just a little bit? are subjects vulnerable (say, kids)?
    • can we actually “get” data on all subjects? can we match data?
  • Some solutions and best practices discussed in “Fields of Gold”
  • Let's look at the tables in the paper and try to find out…

Writing a complete web scraper

  • Have a list of seeds to start up the data collection (here: artist IDs)
  • Loop through the list of URLs
    • store the raw data for diagnostic purposes
    • extract the relevant data into a new-line separated JSON file (keeping the raw data alongside)
  • Schedule the collection (e.g., to run at fixed intervals)
  • Decide on the infrastructure to run it on
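Put together, a minimal sketch of such a scraper (the file names are up to you; here, one raw HTML file per artist, plus one new-line separated JSON file):

import json
import time
import requests
from bs4 import BeautifulSoup

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']  # seeds

for artist_id in artist_ids:
  url = f'https://music-to-scrape.org/artist?artist-id={artist_id}'
  web_request = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

  # store the raw source code for diagnostic purposes
  with open(f'raw_{artist_id}.html', 'w', encoding='utf-8') as f:
    f.write(web_request.text)

  # extract the relevant data and append it to a new-line separated JSON file
  soup = BeautifulSoup(web_request.text, 'html.parser')
  out_data = {'artist': soup.find('h2').get_text(), 'timestamp': int(time.time())}
  with open('artists.json', 'a') as f:
    f.write(json.dumps(out_data) + '\n')

  time.sleep(1)  # be nice to the server: at most one request per second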

Wrapping up web scraping

  • Get content from a website
  • Since we don't want “everything”, we need to “find” relevant elements on the site using
    • classes
    • attributes
    • (attribute-value pairs)
  • We first build a prototype; then gradually improve it by
    • modularizing code as much as possible (using functions and loops)
    • ensuring code runs top-down

APIs

  • Standard way for exchanging data, functions or algorithms
  • Which APIs did you already encounter/explore?
  • Music-to-scrape is super easy to use – other APIs require more advanced authentication procedures
  • Structure corresponds to web scraping
    • have list of seeds (e.g., artist IDs; this was “URLs” for web scraping earlier)
    • store data in new-line separated JSON files (if necessary: convert back to CSV)

Let's explore the documentation and a first endpoint

–> we can use the browser to retrieve data from (simple) APIs

Extensions:

  • time the data retrieval (e.g., every five seconds; multiple strategies exist)
  • store the featured artists in a list (or file)
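In Python, such a first retrieval could look as follows; note that the endpoint URL below is an assumption – copy the exact path from the API documentation:

import time
import requests

# illustrative endpoint – verify the exact path in the documentation
url = 'https://api.music-to-scrape.org/artists/featured'

featured = []  # store the featured artists in a list

for _ in range(3):  # retrieve a few times...
  response = requests.get(url)
  featured.append(response.json())
  time.sleep(5)     # ...every five seconds

print(featured)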

Let's retrieve some artist metadata

DO:

  • Get some data for an artist of your choice – you can find artist IDs on the site or in the previous code
  • Wrap the code in a function so you can retrieve data for multiple artists (a possible starting point follows below)
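A sketch of such a starting point; the endpoint and parameter name below are assumptions, so verify them in the API documentation:

import requests

def get_artist_meta(artist_id):
  # illustrative endpoint and parameter name – check the API docs
  url = f'https://api.music-to-scrape.org/artist/info?artistid={artist_id}'
  return requests.get(url).json()

for artist_id in ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD']:
  print(get_artist_meta(artist_id))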

At which frequency to extract the data? (Challenge #2.3)

  • Here, we extracted data every few seconds.
  • But extraction frequency differs vastly across projects
  • Considerations
    • archival vs. live data?
    • at which frequency does your phenomenon occur?
    • what's the refresh rate of the data source?
    • any excessive burden on servers caused by the frequency of extraction?

At which frequency to extract the data? (Challenge #2.3)

  • Some solutions
    • explore gains of live data collections
    • adhere to best practices (say, one request per second)
    • randomize the extraction order (see the sketch below)
    • use automatic schedulers for consistency
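A minimal sketch combining the last three points (the scheduler itself, say cron or Windows Task Scheduler, would then run this script at fixed times):

import random
import time
import requests

seeds = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
random.shuffle(seeds)  # randomize the extraction order

for artist_id in seeds:
  requests.get(f'https://music-to-scrape.org/artist?artist-id={artist_id}',
               headers={'User-agent': 'Mozilla/5.0'})
  time.sleep(1)  # best practice: at most one request per second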

Exercises with music-to-scrape

Please work on exercise 2.4.

  1. collect a (longer) list of featured artists
  2. for each of these featured artists, collect metadata
  3. store the data in a new-line separated JSON file

Processing data during the collection (challenge #2.4)

  • Processing can have various degrees
    • just “save” the raw data
    • extract only necessary information
    • choose data format for saving (say, JSON vs. CSV)
  • Some selected challenges
    • GDPR vs. the value of retaining raw data
    • Anonymization or pseudonymization required?

Processing data during the collection (challenge #2.4)

  • Solutions
    • Retain the raw data when possible
    • Parse a minimal amount of data on the fly
    • Remove sensitive information
    • Ensure proper encoding (see the sketch below)
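A minimal sketch of the last two solutions, using a made-up record with a username field: pseudonymize the username by hashing it, and write the file with an explicit encoding:

import hashlib
import json

record = {'username': 'some_user', 'plays': 9}  # made-up example record

# pseudonymization: replace the raw username by a one-way hash
record['username'] = hashlib.sha256(record['username'].encode('utf-8')).hexdigest()

# ensure proper encoding when writing to disk
with open('processed.json', 'a', encoding='utf-8') as f:
  f.write(json.dumps(record) + '\n')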

Next steps in this class

  • Work on team activity #1
    • coaching session following today's tutorial
    • make team allocation definite (see Canvas!)
  • If you haven't done so…
    • go through the self-study material of this week
  • Get engaged!
    • discuss ideas, post solutions, identify business opportunities, etc.
    • try fiddling with an AI API, or use ChatGPT to generate starter code
    • be in touch on WhatsApp for any issues/bugs/etc.