oDCM - Web Scraping 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web scraping 101”).

Before we start (I): live coding

  • feel comfortable w/ code?
    • then write code with me and try it out!
  • struggle?
    • just watch me code.
    • focus on understanding concepts & take notes
    • spend more time on tutorial after class
    • keep asking questions!
  • “bored”/too advanced?
    • flip through slides and work on advanced exercises
    • ask advanced questions when I walk around (1:1)

Before we start (II)

  • Recap from last week (.find(), .find_all())
  • Part 1: BeautifulSoup (static websites; works in Colab)
  • Part 2: selenium (dynamic websites; only works on your laptops)


  • Today's coaching session: decide which website or API to use (challenges #1.1-#1.3) & start coding your collections

Recap

  • We focused on which information to extract (Challenge #2.1)
    • e.g., explore a website, navigate broadly to find information that's interesting, etc.
  • We focused on how to extract that information
    • BeautifulSoup: .find() and .find_all() functions
    • Select elements by (a combination of) tags and attributes (e.g., classes, or other attribute-value pairs)

Recap (Framework)

DO: Recap from last week's tutorial

Extend the code snippet below to extract the release year of the song.

import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'
request = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
song_page = BeautifulSoup(request.text, 'html.parser')
about = song_page.find(class_='about_artist')
title = about.find('p')

DO: Solution

  • Use the browser's inspect mode to find the relevant attributes
  • Develop the code step by step, “onion”-like, from the outside in
about.find_all('p')
[<p>China</p>, <p>Tito Puente</p>, <p>1972</p>, <p>54</p>]
release_year = about.find_all('p')[2].get_text()

Recommendations:

  • Practice even more, e.g., by extracting information you haven't extracted before, from this and other sections of the website.
  • Always build code gradually, never from scratch

Part 1: `BeautifulSoup` (static websites; works in Colab)

We use BeautifulSoup to address the remaining challenges of the “design” phase of collecting data.

  • How to sample from a website? (Challenge #2.2)
    • Recall that we do not have access to a firm's database (so we can't sample from a population of, say, users)
    • With web scraping, we always need a “starting point” for a data collection
  • Examples of seeds/samples
    • list of recently active users (music-to-scrape.org)
    • homepage of Twitch (for current live streams)

Using your own project ideas, tell us how you could sample from the site (see Table 3 in “Fields of Gold”).

Sampling users (I)

Let's sample users from the main website of music-to-scrape.org

Can you propose a strategy to capture them? Any attributes/classes you see?

Sampling users (II)

DO: Please use the code snippet below to extract the 'href' attribute of all links on the website.

  • Links are identifiable by the “a” tag.
  • The relevant attribute is called “href”
  • Build your code consecutively!
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org'

res = requests.get(url)
res.encoding = res.apparent_encoding

homepage = BeautifulSoup(res.text, 'html.parser')

# continue here

Sampling users (III): Solution

for link in homepage.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])
/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
https://odcm.hannesdatta.com
https://doi.org/10.1016/j.jretai.2024.02.002
https://web-scraping.org
https://github.com/tilburgsciencehub/music-to-scrape
https://tilburgsciencehub.com
/about
tutorial_scraping
song?song-id=SOCOMAW12AB017EE59
song?song-id=SOBMAUD12A6D4F9181
song?song-id=SOFCKRY12AAF3B2FAF
song?song-id=SOKHBNW12A8AE48A58
song?song-id=SOVLTPF12AC46866D7
song?song-id=SOHTWEG12AB018D11C
song?song-id=SOGRYWN12AB018C335
song?song-id=SOGMROZ12A679D8AE9
song?song-id=SOKFEUT12A6D4FC34C
song?song-id=SOMKWCF12A8C142BCD
song?song-id=SOJKNYV12A8C133E9C
song?song-id=SOAYONI12A6D4F85C8
song?song-id=SOCLLWU12A8AE47C66
song?song-id=SOOZDSM12A8C13D39D
song?song-id=SOEKLNK12A58A7839A
song?song-id=SODKSMV12A6D4F6922
song?song-id=SOJRGVK12CF5CFC527
song?song-id=SOJFLGV12A8C141AB3
song?song-id=SOCBWVV12A8C13605F
song?song-id=SOJZWRA12AB018D029
song?song-id=SOSAMRR12AB018203B
song?song-id=SOLGUGY12AB01897BE
song?song-id=SOEOJJA12AB018FCCF
song?song-id=SOOPVJI12AB0183957
song?song-id=SOWJRTX12AB0183C28
artist?artist-id=ARWBL9E1187FB4E695
artist?artist-id=ARY55LO1187B9A3F17
artist?artist-id=ARMBTFC1187FB56343
artist?artist-id=ARR2NH51187B98CE4C
artist?artist-id=ARA2ZTN1187B98E3ED
artist?artist-id=AREFUMW11F4C844D2B
artist?artist-id=ARMORUX11F50C4EEBF
artist?artist-id=ARN7OQ21187FB5A6B3
user?username=Galaxy04
user?username=StarPanda93
user?username=Stealth20
user?username=StarCoder49
user?username=Geek73
user?username=Panda38
/tutorial_scraping
/tutorial_api
https://api.music-to-scrape.org/docs
/about
/privacy_terms
https://www.linkedin.com/company/tilburgsciencehub
https://github.com/tilburgsciencehub/music-to-scrape
https://twitter.com/tilburgscience

Narrowing down (I)

  • But… are all of these links really relevant?
  • Recall: what “seeds” do we need to gather?

Narrowing down (II)

  • Let us explore the site structure a bit more
  • Particularly, can we identify other ways to navigate on the site?
  • Remember, your strategy can be a “multi-step” strategy (first to A, then within A to B)!
  • Let's open the inspect mode of our browser and come up with an updated strategy.

Narrowing down (III)

  • The relevant links all reside in the recent_users section, i.e., a <section> tag with the attribute name="recent_users"
relevant_section = homepage.find('section',attrs={'name':'recent_users'})

DO: Can you come up with a way to loop through all of the links WITHIN relevant_section and store them?

Narrowing down (IV): Solution

users = []
for link in relevant_section.find_all("a"):
  if ('href' in link.attrs):
      users.append(link.attrs['href'])
users
['user?username=Galaxy04', 'user?username=StarPanda93', 'user?username=Stealth20', 'user?username=StarCoder49', 'user?username=Geek73', 'user?username=Panda38']
  • Let's now take a look at these links more closely.

Narrowing down (V): More extensions

  • Notice that the links are relative links, so we can't use them for looping directly (e.g., try pasting one into your browser - it won't work!)
  • So, let's turn them into absolute links, simply by concatenating (= combining) strings.
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls
['https://music-to-scrape.org/user?username=Galaxy04', 'https://music-to-scrape.org/user?username=StarPanda93', 'https://music-to-scrape.org/user?username=Stealth20', 'https://music-to-scrape.org/user?username=StarCoder49', 'https://music-to-scrape.org/user?username=Geek73', 'https://music-to-scrape.org/user?username=Panda38']
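As an aside, instead of concatenating strings by hand, Python's standard library offers urllib.parse.urljoin, which also resolves root-relative links (e.g., /about) and leaves absolute links untouched. A minimal sketch, reusing relevant_section from above:

from urllib.parse import urljoin

base_url = 'https://music-to-scrape.org/'

urls = []
for link in relevant_section.find_all('a'):
    if 'href' in link.attrs:
        # urljoin handles 'user?username=...', '/about', and full URLs correctly
        urls.append(urljoin(base_url, link.attrs['href']))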

Narrowing down (VI): More extensions: Functions

DO: Write a function that wraps the code above and returns the (absolute) URLs of all users listed on the homepage.

Do you remember why we like functions so much?

Narrowing down (VII): Solution

import requests
from bs4 import BeautifulSoup

def get_users():
    url = 'https://music-to-scrape.org/'

    res = requests.get(url)
    res.encoding = res.apparent_encoding

    soup = BeautifulSoup(res.text, 'html.parser')

    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return links # return all collected links

# let's try it out
# users = get_users() 
  • Functions allow us to re-use code; they prevent errors and help us structure our code.

JSON data (I)

  • For now, the data is just in memory.
  • But, we can also save it!
  • For this, we store the data as JSON objects/dictionaries
  • Use json.dumps to convert a dictionary to JSON text (and then save it)
  • Use json.loads to convert JSON text back to a dictionary (and then use it)

JSON data (II): Example

import json

users = get_users()
# build JSON dictionary
f = open('users.json','w')
for user in users:
  obj = {'url': user,
         'username': user.replace('https://music-to-scrape.org/user?username=','')}
  f.write(json.dumps(obj))
  f.write('\n')
f.close()
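To use the stored data again later, read the file back line by line and convert each line with json.loads; a minimal sketch, assuming users.json was written as above:

import json

users_restored = []
with open('users.json', 'r') as f:
    for line in f:
        users_restored.append(json.loads(line)) # text -> dictionary
users_restored[0]['username'] # e.g., 'Galaxy04'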

Preventing array misalignment (I)

  • When extracting web data, it's important to extract information in its “original” (nested) structure
  • For example, hover over the user names and observe the song names that pop up.
  • If you were to extract this information separately, you wouldn't know which song relates to which user.

Preventing array misalignment (II): Demo

links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())

len(links)
6
len(songs)
3

Preventing array misalignment (III): DO

DO: How can you capture the song data such that we can always link a particular song to a particular user (i.e., prevent array misalignment)?

Tip: Observe how we can first iterate through users, THEN extract links (similar to how we have done it with relevant_section).

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users: # iterate through each user
  obj = {'link':user.find('a').attrs['href']  }
  data.append(obj)
data
[{'link': 'user?username=Galaxy04'}, {'link': 'user?username=StarPanda93'}, {'link': 'user?username=Stealth20'}, {'link': 'user?username=StarCoder49'}, {'link': 'user?username=Geek73'}, {'link': 'user?username=Panda38'}]

Preventing array misalignment (IV): Solution

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:
  if user.find('span') is not None:
    song_name=user.find('span').get_text()
  else:
    song_name='NA'
  obj = {'link':user.find('a').attrs['href'],
         'song_name': song_name}

  data.append(obj)
data
[{'link': 'user?username=Galaxy04', 'song_name': 'NA'}, {'link': 'user?username=StarPanda93', 'song_name': 'NA'}, {'link': 'user?username=Stealth20', 'song_name': "The Fabulous Thunderbirds - Rainin' In My Heart"}, {'link': 'user?username=StarCoder49', 'song_name': "Jacques Dutronc - L'Homme De Paille"}, {'link': 'user?username=Geek73', 'song_name': 'NA'}, {'link': 'user?username=Panda38', 'song_name': 'Cancer Bats - Sabotage'}]

Let's take a step back...

  • What we've learnt so far…

    • Extract individual information from a website, e.g., the homepage of music-to-scrape (ch. #2.1)
    • Collect all links to user profiles from the homepage (ch. #2.2)
  • What's missing

    • VISIT each of the profile pages
    • Loop through them
  • We can then tie things together in a scraper

    • “collect” all user seeds (scraper 1)
    • then collect all consumption data from individual user pages (scraper 2)

Navigating on a website (I)

  • Two strategies
    • Understand how links are built (pre-building them; strategy 1)
    • Understand how to “click” to the next/previous page (consecutively building them; strategy 2)
urls = []
counter = 37
while counter >= 0:
  urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
  counter = counter - 1
urls
['https://music-to-scrape.org/user?username=StarCoder49&week=37', 'https://music-to-scrape.org/user?username=StarCoder49&week=36', 'https://music-to-scrape.org/user?username=StarCoder49&week=35', 'https://music-to-scrape.org/user?username=StarCoder49&week=34', 'https://music-to-scrape.org/user?username=StarCoder49&week=33', 'https://music-to-scrape.org/user?username=StarCoder49&week=32', 'https://music-to-scrape.org/user?username=StarCoder49&week=31', 'https://music-to-scrape.org/user?username=StarCoder49&week=30', 'https://music-to-scrape.org/user?username=StarCoder49&week=29', 'https://music-to-scrape.org/user?username=StarCoder49&week=28', 'https://music-to-scrape.org/user?username=StarCoder49&week=27', 'https://music-to-scrape.org/user?username=StarCoder49&week=26', 'https://music-to-scrape.org/user?username=StarCoder49&week=25', 'https://music-to-scrape.org/user?username=StarCoder49&week=24', 'https://music-to-scrape.org/user?username=StarCoder49&week=23', 'https://music-to-scrape.org/user?username=StarCoder49&week=22', 'https://music-to-scrape.org/user?username=StarCoder49&week=21', 'https://music-to-scrape.org/user?username=StarCoder49&week=20', 'https://music-to-scrape.org/user?username=StarCoder49&week=19', 'https://music-to-scrape.org/user?username=StarCoder49&week=18', 'https://music-to-scrape.org/user?username=StarCoder49&week=17', 'https://music-to-scrape.org/user?username=StarCoder49&week=16', 'https://music-to-scrape.org/user?username=StarCoder49&week=15', 'https://music-to-scrape.org/user?username=StarCoder49&week=14', 'https://music-to-scrape.org/user?username=StarCoder49&week=13', 'https://music-to-scrape.org/user?username=StarCoder49&week=12', 'https://music-to-scrape.org/user?username=StarCoder49&week=11', 'https://music-to-scrape.org/user?username=StarCoder49&week=10', 'https://music-to-scrape.org/user?username=StarCoder49&week=9', 'https://music-to-scrape.org/user?username=StarCoder49&week=8', 'https://music-to-scrape.org/user?username=StarCoder49&week=7', 'https://music-to-scrape.org/user?username=StarCoder49&week=6', 'https://music-to-scrape.org/user?username=StarCoder49&week=5', 'https://music-to-scrape.org/user?username=StarCoder49&week=4', 'https://music-to-scrape.org/user?username=StarCoder49&week=3', 'https://music-to-scrape.org/user?username=StarCoder49&week=2', 'https://music-to-scrape.org/user?username=StarCoder49&week=1', 'https://music-to-scrape.org/user?username=StarCoder49&week=0']

Navigating on a website (II): DO (strategy 2)

DO: Run the code below. How can you extend it to extract the LINK of the previous button?

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
userpage = BeautifulSoup(res.text, 'html.parser')
button=userpage.find(class_='page-link', attrs={'type':'previous_page'})

Navigating on a website (III): Solution

button.attrs['href']
'user?username=StarCoder49&week=35'

DISCUSS: When do we use strategy 1 (pre-built) vs. strategy 2 (do it on the fly)?

Tying things together

  • We now have a function get_users() to retrieve user names
  • We also know how to “find” the next link to visit (a user's previous page)
  • How can we now visit ALL pages, for ALL users?
  • Let's take a look at some pseudo code!
users = get_users()

consumption_data = []

for user in users:
  url = user # get_users() returns the profile URLs

  while url is not None:
    # scrape information from URL # challenge #2.1, #2.4
    # determine "previous page"
    time.sleep(1) # challenge #2.3

    # if previous page exists: rerun loop with that URL
    # if previous page does not exist: stop while loop, go to next user
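Below is a minimal sketch of how this pseudo code could be filled in, reusing get_users() and the “previous page” button from earlier; what exactly you parse on each page (challenge #2.1) is left as a placeholder.

import time
import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
consumption_data = []

for url in get_users():
    while url is not None:
        res = requests.get(url, headers=header)
        res.encoding = res.apparent_encoding
        userpage = BeautifulSoup(res.text, 'html.parser')

        # challenge #2.1/#2.4: parse what you need from userpage here
        consumption_data.append({'url': url})

        time.sleep(1) # challenge #2.3: pause between requests

        # follow the "previous page" button, if it exists
        button = userpage.find(class_='page-link', attrs={'type': 'previous_page'})
        if button is not None and 'href' in button.attrs:
            url = 'https://music-to-scrape.org/' + button.attrs['href']
        else:
            url = None # no previous page: continue with the next user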

Challenge #2.3: At which frequency to extract the data?

  • We can decide how often to capture information from a site
    • e.g., once, every 5 minutes, every day
  • Weigh the potential gains of extracting data multiple times (e.g., observing changes over time)
  • Mind extraction limits & the technically feasible sample size!
  • More issues are explained in Table 3, challenge #2.3 (a minimal scheduling sketch follows below)
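A minimal sketch of repeating a collection at a fixed interval; collect_snapshot() is a hypothetical placeholder for one full round of your scraper:

import time

def collect_snapshot():
    # hypothetical placeholder: one full round of your data collection,
    # e.g., calling get_users() and visiting each profile
    pass

n_rounds = 3       # e.g., three snapshots in total
interval = 5 * 60  # e.g., every 5 minutes (in seconds)

for _ in range(n_rounds):
    collect_snapshot()
    time.sleep(interval)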

Challenge #2.4: How to process data during collection

  • The value of retaining raw data: store the raw HTML first, so you can always re-parse it later (see the sketch below)
# pseudo code!
product_data = []
for s in seeds:
  # store raw html code here
  store_to_file(s, 'website.html')
  # continue w/ parsing some information
  product = get_product_info(s)
  product_data.append(product)
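A concrete sketch of the same idea for the music-to-scrape user pages; the per-seed file name is an assumption made for illustration:

import requests

header = {'User-agent': 'Mozilla/5.0'}

for counter, url in enumerate(get_users()):
    res = requests.get(url, headers=header)

    # store the raw HTML first, so you can always re-parse it later
    with open(f'raw_page_{counter}.html', 'w', encoding='utf-8') as raw_file:
        raw_file.write(res.text)

    # then continue with parsing some information from res.text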

Challenge #2.4: How to process data during collection

  • Parsing data on the fly (vs. after the collection); see the sketch below the pseudo code
# pseudo code!

product_data = []
for s in seeds:
  product = get_product_info(s) 
  product_updated = processing(product)
  time.sleep(1) 
  product_data.append(product_updated) 
  # write data to file here (parsing on the fly!!!)

# write data to file here (after the collection - avoid!)
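A minimal sketch of parsing on the fly, writing each record to a JSON-lines file inside the loop instead of keeping everything in memory; parse_user_page() is a hypothetical placeholder for your own parsing code:

import json
import time

with open('consumption.json', 'w') as f:
    for url in get_users():
        record = parse_user_page(url) # hypothetical: your own parsing function
        f.write(json.dumps(record))   # one JSON object per line
        f.write('\n')
        time.sleep(1)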

Part 2: Scraping more advanced websites

  • So far, we have done this

    • requests (to get) –> beautifulsoup (to extract information, “parse”)
  • DO: Run the snippet below and open amazon.html

import requests

header = {'User-agent': 'Mozilla/5.0'}
f = open('amazon.html', 'w', encoding = 'utf-8')
f.write(requests.get('https://amazon.com', headers = header).text)
f.close()

Can you explain what happened?

Alternative ways to make connections

  • Many dynamic sites require what I call “simulated browsing”

  • Try this:

!pip install webdriver_manager --upgrade
!pip install selenium --upgrade

# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

url = "https://music-to-scrape.org/"
driver.get(url)
  • See your browser opening up?
  • Beware of rerunning this code: a new browser instance will start each time (see below for how to close it again)!
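When you are done (or before rerunning the cell), you can close the browser again using Selenium's standard API:

driver.quit() # closes the browser window and ends the WebDriver session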

Continuing with beautifulsoup

  • We can pass the site's source code (page_source) to BeautifulSoup and proceed as always
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

cards = soup.find_all(class_='card-body')

counter = 0
for card in cards:
    counter = counter + 1
    print('Card ' + str(counter) + ': ' + card.get_text())

Clicking with selenium

url = "https://music-to-scrape.org/"
driver.get(url)
time.sleep(3) # wait for 3 seconds

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

try:
    cookie_button = driver.find_element(By.ID, "accept-cookies")
    cookie_button.click()
except:
    print('No cookie button found (anymore)!')

Scrolling with selenium

scroll_pause_time = 2
for _ in range(3):  # Scroll down 3 times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
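If you don't know in advance how often to scroll (e.g., on a page with “infinite scroll”), a common pattern is to keep scrolling until the page height stops changing; a minimal sketch:

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break # no new content was loaded; stop scrolling
    last_height = new_height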

Different ways to open a site

  • Static/easy site: use requests
    • requests –> BeautifulSoup –> .find()
  • Dynamic/difficult site: also use selenium
    • selenium –> wait –> BeautifulSoup –> .find() (see the recap below)
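Putting the dynamic pipeline together in one place (a compact recap of the code shown above):

import time
from bs4 import BeautifulSoup

driver.get('https://music-to-scrape.org/') # selenium: open the page
time.sleep(3)                              # wait for the page to load
soup = BeautifulSoup(driver.page_source, 'html.parser') # hand over to BeautifulSoup
relevant_section = soup.find('section', attrs={'name': 'recent_users'}) # .find() as usual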

Next steps

  • Focus on self-study material
  • Today's coaching session…!
    • sign-off on extraction design (challenges #1.1-1.3)
    • working on prototyping the data extraction
    • consider selenium and APIs (see next tutorials)

Thanks!

  • Any questions?
  • Get in touch via WhatsApp for feedback.