oDCM - Web Scraping 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web scraping 101”).

Before we start

  • Recap coaching session for team activity #1
    • explore broadly! (“universe discovery!”) –> check challenge #1.1
    • compare sources thoroughly –> e.g., comparison table + challenge #1.2
    • use overview tables for comparison, illustrations
  • Preview of today's coaching session: decide which website or API to use (challenges #1.1-#1.3)
  • Any questions so far?

A note on my (live) tutorials (I)

  • Core material is the (Jupyter Notebook) tutorial of the week
  • It's written and illustrated with screenshots – it gives you all the necessary information, and you can work through it at your own pace after class.
  • IN CLASS, my ambition is to preview selected issues
  • Recordings of these sessions posted on Canvas

A note on my (live) tutorials (II)

  • feel comfortable w/ code?
    • then write code with me and try it out!
  • struggle?
    • just see me code.
    • focus on understanding concepts & take notes
    • spend more time on tutorial after class
    • keep asking questions!
  • “bored”/too advanced?
    • flip through slides and work on advanced exercises
    • ask advanced questions when I walk around (1:1)
    • today's “klaaropdracht” (“when you're done…” - see end of slide deck)

Framework

Today: zooming in more on collection design

Recap from last week's tutorial

  • We focused on which information to extract (Challenge #2.1)
    • e.g., explore a website, navigate broadly to find information that's interesting, etc.
  • We focused on how to extract that information
    • BeautifulSoup: .find() and .find_all() functions
    • Use (a combination of) tags, classes, attributes, attribute-value pairs

DO: Recap from last week's tutorial

Extend the code snippet below to extract the release year of the song.

import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'
request = requests.get(url, headers =  {'User-agent': 'Mozilla/5.0'})
song_page = BeautifulSoup(request.text)
about = song_page.find(class_='about_artist')
title = about.find('p')

DO: Solution

  • Use the browser's inspect mode to find the relevant attributes
  • Develop the code step by step, building it up layer by layer ("onion"-like)
about.find_all('p')
[<p>China</p>, <p>Tito Puente</p>, <p>1972</p>, <p>54</p>]
year = about.find_all('p')[2].get_text() # '1972' (the release year)

Recommendations:

  • Practice even more, e.g., with new information that you haven't extracted before from this and other sections of the website.
  • Always build code gradually, never from scratch

DO: After-class exercise from last week

  • Any issues you've run into?
  • Solutions posted on the site!

Today's focus

Remaining challenges for the “design” phase of collecting data.

  • How to sample from a website? (Challenge #2.2)
    • Recall that we do not have access to a firm's database (so we can't sample from a population of, say, users)
    • With web scraping, we always need a “starting point” for a data collection
  • Examples of seeds/samples
    • list of recently active users (music-to-scrape.org)
    • homepage of Twitch (for current live streams)

DO: Starting up a data collection for your own project

Using your own project idea, tell us how you could sample from the site.



Tips:

  • Make use of the best practices for Challenge #2.2 in Table 3

Sampling users

For today's tutorial, we will be sampling users from the homepage of music-to-scrape.org.

Can you propose a strategy to capture them? Any attributes/classes you see?

DO: Extract all links

Please use the code snippet below to extract the 'href' attribute of all links on the website.

  • Links are identifiable by the “a” tag.
  • The relevant attribute is called “href”
  • Build your code consecutively!
url = 'https://music-to-scrape.org'

res = requests.get(url)
res.encoding = res.apparent_encoding

homepage = BeautifulSoup(res.text)

# continue here

Solution: Extract all links

for link in homepage.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])
/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
/about
song?song-id=SOZYFSB12A8C1393CF
song?song-id=SOJDKQV12A6D4FAF0E
song?song-id=SONIWYZ12A58A7CB59
song?song-id=SOWHFID12A8C133F6B
song?song-id=SOGGUFB12A6D4F8977
song?song-id=SODDTOV12A6D4F9167
song?song-id=SOHDAZP12AB0185D7F
song?song-id=SOIVRCS12A6D4FDA2C
song?song-id=SOQZISQ12AC4689482
song?song-id=SOOIZES12AB018A80F
song?song-id=SOJZTXJ12AB01845FB
song?song-id=SOPZASO12A6D4F6A79
song?song-id=SOODIJF12A8C13FDBB
song?song-id=SOQXGVE12CF5F86D20
song?song-id=SOGXHEG12AB018653E
song?song-id=SOLNOMM12A8C132AD8
song?song-id=SOYYBGR12A8C140F1A
song?song-id=SOBVAPJ12AB018739D
song?song-id=SOWHXYS12AC9E177F0
song?song-id=SOQUEQP12A8C1397DE
song?song-id=SOBHHIZ12AB01841E0
song?song-id=SOAOXXE12AB0182517
song?song-id=SOJLUMZ12AB0186CA1
song?song-id=SOACUYZ12CF54662F1
song?song-id=SOGXWRE12AC468BE24
artist?artist-id=ARKGWMO11F50C4813F
artist?artist-id=ARYFAT91187B99FEF5
artist?artist-id=AR2TT8P1187B9B624D
artist?artist-id=ARICCN811C8A41750F
artist?artist-id=AR00A6H1187FB5402A
artist?artist-id=ARIN12F1187FB3E92C
artist?artist-id=ARN7OQ21187FB5A6B3
artist?artist-id=ARY55LO1187B9A3F17
user?username=Geek61
user?username=MoonRocket50
user?username=Pixel48
user?username=SonicShadow61
user?username=Coder26
user?username=Vector59
/tutorial_scraping
/tutorial_api
https://api.music-to-scrape.org/docs
/about
/privacy_terms
https://www.linkedin.com/company/tilburgsciencehub
https://github.com/tilburgsciencehub/music-to-scrape
https://twitter.com/tilburgscience

Solution: Extract all links

  • But… are all of these links really relevant?
  • Recall: what “seeds” do we need to gather?

Narrowing down our extraction

  • Let us explore the site structure a bit more
  • Particularly, can we identify other ways to navigate on the site?
  • Remember, your strategy can be a “multi-step” strategy (first to A, then within A to B)!
  • Let's open the inspect mode of our browser and come up with an updated strategy.

DO: Narrowing down our extraction

  • The relevant links all reside in the recent_users section, i.e., a <section> tag with the attribute name="recent_users"
relevant_section = homepage.find('section',attrs={'name':'recent_users'})

DO: Can you come up with a way to loop through all of the links WITHIN relevant_section and store them?

Solution: Narrowing down our extraction

users = []
for link in relevant_section.find_all("a"):
  if ('href' in link.attrs):
      users.append(link.attrs['href'])
users
['user?username=Geek61', 'user?username=MoonRocket50', 'user?username=Pixel48', 'user?username=SonicShadow61', 'user?username=Coder26', 'user?username=Vector59']
  • Let's now take a look at these links more closely.

Modifying the links

  • Notice that the links are relative links, so we can't use them for looping directly (e.g., try pasting one into your browser - it won't work!)
  • So, let's turn them into absolute links, simply by concatenating (= combining) strings (see also the urljoin sketch after the output below).
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls
['https://music-to-scrape.org/user?username=Geek61', 'https://music-to-scrape.org/user?username=MoonRocket50', 'https://music-to-scrape.org/user?username=Pixel48', 'https://music-to-scrape.org/user?username=SonicShadow61', 'https://music-to-scrape.org/user?username=Coder26', 'https://music-to-scrape.org/user?username=Vector59']
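As an aside, instead of concatenating strings yourself, you can let Python resolve relative links. A minimal sketch using urljoin from the standard library's urllib.parse (same relevant_section as above):

from urllib.parse import urljoin

base_url = 'https://music-to-scrape.org/'
urls = [urljoin(base_url, link.attrs['href'])   # resolves relative hrefs against the base URL
        for link in relevant_section.find_all('a')
        if 'href' in link.attrs]
urls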

DO: Remember functions?

Write a function that returns the (absolute) links to all recently active users on the homepage.

Do you remember why we like functions so much?

Solution

import requests
from bs4 import BeautifulSoup

def get_users():
    url = 'https://music-to-scrape.org/'

    res = requests.get(url)
    res.encoding = res.apparent_encoding

    soup = BeautifulSoup(res.text)

    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return links # return all collected links

# let's try it out
# users = get_users() 
  • Functions allow us to re-use code; they prevent errors and help us structure our code.

JSON data

  • For now, the data is just in memory.
  • But, we can also save it!
  • For this, we store the data as JSON objects (dictionaries)
  • Use json.dumps to convert a dictionary to a JSON string (and then save it)
  • Use json.loads to convert a JSON string back to a dictionary (and then use it); see the read-back sketch after the example

Example: JSON data

import json

users = get_users()
# build JSON dictionary
f = open('users.json','w')
for user in users:
  obj = {'url': user,
         'username': user.replace('https://music-to-scrape.org/user?username=','')}
  f.write(json.dumps(obj))
  f.write('\n')
f.close()
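
To read the saved data back in later, a minimal sketch that parses the newline-delimited users.json file written above with json.loads:

import json

users_restored = []
with open('users.json', 'r') as infile:
    for line in infile:
        users_restored.append(json.loads(line))  # each line holds one JSON object
users_restored[0]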

Preventing array misalignment

  • When extracting web data, it's important to extract information in its “original” (nested) structure
  • For example, hover over the user names and observe the song names that pop up.
  • If you were to extract this information separately, you wouldn't be able to know which one relates to which user.

Demonstration

links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())

len(links)
6
len(songs)
4

DO: Preventing array misalignment

Which solutions do you see – by iterating differently through the source code – to prevent array misalignment?

Tip: Observe how we can first iterate through users, THEN extract links (similar to how we have done it with relevant_section).

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users: # iterate through each user
  obj = {'link':user.find('a').attrs['href']  }
  data.append(obj)
data
[{'link': 'user?username=Geek61'}, {'link': 'user?username=MoonRocket50'}, {'link': 'user?username=Pixel48'}, {'link': 'user?username=SonicShadow61'}, {'link': 'user?username=Coder26'}, {'link': 'user?username=Vector59'}]

Solution

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:
  if user.find('span') is not None:
    song_name=user.find('span').get_text()
  else:
    song_name='NA'
  obj = {'link':user.find('a').attrs['href'],
         'song_name': song_name}

  data.append(obj)
data
[{'link': 'user?username=Geek61', 'song_name': 'Bad Company - Valerie (LP Version)'}, {'link': 'user?username=MoonRocket50', 'song_name': 'NA'}, {'link': 'user?username=Pixel48', 'song_name': 'Lara & Reyes - Amor De Lejos'}, {'link': 'user?username=SonicShadow61', 'song_name': 'NA'}, {'link': 'user?username=Coder26', 'song_name': 'Erik Berglund - All I Ask Of You'}, {'link': 'user?username=Vector59', 'song_name': 'Roger Williams - Cool Water'}]

Let's take a step back...

  • What we've learnt so far…

    • Extract individual information from a website, e.g., the homepage of music-to-scrape (ch. #2.1)
    • Collect all links to user profiles from the homepage (ch. #2.2)
  • What's missing

    • VISIT each of the profile pages
    • Loop through them
  • We can then tie things together in a scraper

    • “collect” all user seeds (scraper 1)
    • then collect all consumption data from individual user pages (scraper 2)

Navigating on a website

  • Two strategies
    • Understand how links are built (pre-building them; strategy 1)
    • Understand how to “click” to the next/previous page (consecutively building them; strategy 2)
urls = []
counter = 37
while counter >= 0:
  urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
  counter = counter - 1
urls
['https://music-to-scrape.org/user?username=StarCoder49&week=37', 'https://music-to-scrape.org/user?username=StarCoder49&week=36', 'https://music-to-scrape.org/user?username=StarCoder49&week=35', 'https://music-to-scrape.org/user?username=StarCoder49&week=34', 'https://music-to-scrape.org/user?username=StarCoder49&week=33', 'https://music-to-scrape.org/user?username=StarCoder49&week=32', 'https://music-to-scrape.org/user?username=StarCoder49&week=31', 'https://music-to-scrape.org/user?username=StarCoder49&week=30', 'https://music-to-scrape.org/user?username=StarCoder49&week=29', 'https://music-to-scrape.org/user?username=StarCoder49&week=28', 'https://music-to-scrape.org/user?username=StarCoder49&week=27', 'https://music-to-scrape.org/user?username=StarCoder49&week=26', 'https://music-to-scrape.org/user?username=StarCoder49&week=25', 'https://music-to-scrape.org/user?username=StarCoder49&week=24', 'https://music-to-scrape.org/user?username=StarCoder49&week=23', 'https://music-to-scrape.org/user?username=StarCoder49&week=22', 'https://music-to-scrape.org/user?username=StarCoder49&week=21', 'https://music-to-scrape.org/user?username=StarCoder49&week=20', 'https://music-to-scrape.org/user?username=StarCoder49&week=19', 'https://music-to-scrape.org/user?username=StarCoder49&week=18', 'https://music-to-scrape.org/user?username=StarCoder49&week=17', 'https://music-to-scrape.org/user?username=StarCoder49&week=16', 'https://music-to-scrape.org/user?username=StarCoder49&week=15', 'https://music-to-scrape.org/user?username=StarCoder49&week=14', 'https://music-to-scrape.org/user?username=StarCoder49&week=13', 'https://music-to-scrape.org/user?username=StarCoder49&week=12', 'https://music-to-scrape.org/user?username=StarCoder49&week=11', 'https://music-to-scrape.org/user?username=StarCoder49&week=10', 'https://music-to-scrape.org/user?username=StarCoder49&week=9', 'https://music-to-scrape.org/user?username=StarCoder49&week=8', 'https://music-to-scrape.org/user?username=StarCoder49&week=7', 'https://music-to-scrape.org/user?username=StarCoder49&week=6', 'https://music-to-scrape.org/user?username=StarCoder49&week=5', 'https://music-to-scrape.org/user?username=StarCoder49&week=4', 'https://music-to-scrape.org/user?username=StarCoder49&week=3', 'https://music-to-scrape.org/user?username=StarCoder49&week=2', 'https://music-to-scrape.org/user?username=StarCoder49&week=1', 'https://music-to-scrape.org/user?username=StarCoder49&week=0']

DO: Navigating on a Website (strategy 2)

Run the code below. How would you extend the code to extract the LINK of the previous button?

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
userpage = BeautifulSoup(res.text)
button=userpage.find(class_='page-link', attrs={'type':'previous_page'})

Solution

button.attrs['href']
'user?username=StarCoder49&week=35'

DISCUSS: When do we use strategy 1 (pre-built) vs. strategy 2 (do it on the fly)?

Let's tie things together

  • We now have a function get_users() to retrieve user names
  • We also know how to “find” the next link to visit (a user's previous page)
  • How can we now visit ALL pages, for ALL users?
  • Let's take a look at some pseudo code (a fleshed-out sketch follows after it)!
users = get_users()

consumption_data = []

for user in users:
  url = user # get_users() returns full user URLs

  while url is not None:
    # scrape information from URL # challenge #2.1, #2.4
    # determine "previous page"
    time.sleep(1) # challenge #2.3

    # if previous page exists: rerun loop on the next URL
    # if previous page does not exist: stop while loop, go to the next user
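
As a rough illustration (not the only way to do it), the pseudo code could translate into the sketch below. It re-uses get_users() and the 'previous' button selector from earlier; storing the raw HTML for later parsing is just one option (challenge #2.4), and looping over all users and weeks will take a while.

import time
import requests
from bs4 import BeautifulSoup

def scrape_user(start_url):
    # walk backwards through a user's weekly pages via the "previous" button
    pages = []
    url = start_url
    while url is not None:
        res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(res.text)

        pages.append({'url': url, 'html': res.text})  # keep the raw page; parse later (challenge #2.4)

        button = soup.find(class_='page-link', attrs={'type': 'previous_page'})
        if button is not None and 'href' in button.attrs:
            url = f"https://music-to-scrape.org/{button.attrs['href']}"
        else:
            url = None  # no previous page: stop and move on to the next user

        time.sleep(1)  # be polite (challenge #2.3)
    return pages

consumption_data = []
for user_url in get_users():
    consumption_data.extend(scrape_user(user_url))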

Challenge #2.3: At which frequency to extract the data?

  • We can decide how often to capture information from a site
    • e.g., once, every 5 minutes, every day
  • There are potential gains in extracting data multiple times (e.g., tracking changes over time)
  • Mind extraction limits & the technically feasible sample size!
  • More issues are explained in Table 3, challenge #2.3 (a simple repeated-collection sketch follows below)
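
A minimal sketch of a repeated collection, assuming you simply want to re-run the seed collection a fixed number of times (in practice you would more likely schedule the script externally, e.g., via cron or Task Scheduler):

import time

for run in range(24):        # e.g., collect 24 times in total...
    users = get_users()      # ...re-sampling the seeds on every run
    # ...visit and store the user pages here (see the pseudo code above)...
    time.sleep(60 * 60)      # ...waiting one hour between runs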

Challenge #2.4: How to process data during collection

  • The value of retaining raw data (a sketch of what store_to_file() could look like follows after the pseudo code)
# pseudo code!
product_data = []
for s in seeds:
  # store raw html code here
  store_to_file(s, 'website.html')
  # continue w/ parsing some information
  product = get_product_info(s)
  product_data.append(product)
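
store_to_file() above is pseudo code. A minimal sketch of what it could look like (the signature is an assumption: here it takes a URL plus a file-name prefix and stores a timestamped HTML file):

import time
import requests

def store_to_file(url, prefix='page'):
    # download a page and keep its raw HTML so it can always be re-parsed later
    html = requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text
    filename = f'{prefix}_{int(time.time())}.html'  # timestamped file name to avoid overwriting
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    return html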

Challenge #2.4: How to process data during collection

  • Parsing data on the fly (vs. after the collection); see the sketch after the pseudo code
# pseudo code!

product_data = []
for s in seeds:
  product = get_product_info(s) 
  product_updated = processing(product)
  time.sleep(1) 
  product_data.append(product_updated) 
  # write data to file here (parsing on the fly!!!)

# write data to file here (after the collection - avoid!)
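
A minimal sketch of "parsing on the fly": append one JSON line per product inside the loop, so a crash never costs you more than the current page (get_product_info() and processing() remain hypothetical helpers from the pseudo code):

import json
import time

with open('products.json', 'w') as outfile:
    for s in seeds:
        product = processing(get_product_info(s))  # hypothetical parsing functions from the pseudo code
        outfile.write(json.dumps(product) + '\n')  # write each record immediately (parsing on the fly)
        time.sleep(1)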

Extract consumption data

We still need to extract the consumption data from a user's profile page.

Let's start with the snippet below.

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=10'
soup = BeautifulSoup(requests.get(url).text)

Can you propose:

  • how to find the table?
  • how to iterate through each row?
  • how to save the information as a JSON dictionary? (if you get stuck, a rough sketch follows below)
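
If you get stuck, the rough sketch below shows one possible approach. It assumes the listening history is rendered as an HTML <table> with one <tr> per play; verify the actual tags and classes in your browser's inspect mode first.

import json

table = soup.find('table')                  # assumption: the listening history is an HTML <table>
rows = table.find_all('tr') if table is not None else []

plays = []
for row in rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:                               # skip rows without <td> cells (e.g., header rows)
        plays.append({'columns': cells})    # one dictionary per row

with open('consumption.json', 'w') as outfile:
    for play in plays:
        outfile.write(json.dumps(play) + '\n')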

Scraping more advanced websites

  • So far, we have done this

    • requests (to get) –> beautifulsoup (to extract information, “parse”)
  • DO: Run the snippet below and open amazon.html

import requests

header = {'User-agent': 'Mozilla/5.0'}

f = open('amazon.html', 'w', encoding = 'utf-8')
f.write(requests.get('https://amazon.com', headers = header).text)
f.close()

Can you explain what happened?

Alternative ways to make connections

  • Many dynamic sites require what I call “simulated browsing”

  • Try this:

!pip install webdriver_manager
!pip install selenium

# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

url = "https://amazon.com/" # remember to solve captchas
driver.get(url)
  • See your browser opening up?
  • Beware of rerunning code - a new instance will run each time!

Continuing with beautifulsoup

  • We can convert the site's source code (page_source) to BeautifulSoup, and proceed as always
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source)

for el in soup.find_all(class_ = 'a-section'):
    title = el.find('h2')
    if title is not None: print(title.text)

Different ways to open a site

  • Static/easy site: use requests
    • requests –> BeautifulSoup –> .find()
  • Dynamic/difficult site: also use selenium
    • selenium –> wait –> BeautifulSoup –> .find() (see the sketch below)
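
The "wait" step can be as simple as time.sleep(), or an explicit wait for a specific element. A minimal sketch of the dynamic-site flow, re-using the driver opened earlier (the h2 tag is just an example of what to parse):

import time
from bs4 import BeautifulSoup

driver.get('https://amazon.com/')     # re-use the driver opened earlier
time.sleep(5)                         # crude "wait": give dynamic content time to render
soup = BeautifulSoup(driver.page_source)

for title in soup.find_all('h2'):     # then parse as usual with .find() / .find_all()
    print(title.get_text(strip=True))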

Next steps

  • Focus on self-study material
  • Today's coaching session…!
    • sign-off on extraction design (challenges #1.1-1.3)
    • working on prototyping the data extraction
    • consider selenium and APIs (see next tutorials)

Thanks!

  • Any questions?
  • Get in touch via WhatsApp for feedback.

Today's "klaaropdracht" (when you're totally advanced already)

Use the code snippets provided on Selenium and BeautifulSoup to collect a list of at least 1,000 products from any of the product pages on Bol.com.

Coaching session


  • Please actively solicit my feedback!