oDCM - Web Scraping 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“web scraping 101”).

Before we start

  • Recap coaching session for team activity #1
    • explore broadly! (“universe discovery!”) –> check challenge #1.1
    • compare sources thoroughly –> e.g., comparison table + challenge #1.2
    • use overview tables for comparison, illustrations
  • Preview of today's coaching session: decide which website or API to use (challenges #1.1-#1.3)
  • Any questions so far?

A note on my (live) tutorials (I)

  • Core material is the (Jupyter Notebook) tutorial of the week
  • It's written and illustrated with screenshots – it gives you all the necessary information, and you can work through it at your own pace after class.
  • IN CLASS, my ambition is to preview selected issues
  • Recordings of these sessions posted on Canvas

A note on my (live) tutorials (II)

  • feel comfortable w/ code?
    • then write code with me and try it out!
  • struggle?
    • just see me code.
    • focus on understanding concepts & take notes
    • spend more time on tutorial after class
    • keep asking questions!
  • “bored”/too advanced?
    • flip through slides and work on advanced exercises
    • ask advanced questions when I walk around (1:1)
    • today's “klaaropdracht” (“when you're done…” - see end of slide deck)

Framework

Today: zooming in more on collection design

Recap from last week's tutorial

  • We focused on which information to extract (Challenge #2.1)
    • e.g., explore a website, navigate broadly to find information that's interesting, etc.
  • We focused on how to extract that information
    • BeautifulSoup: .find() and .find_all() functions
    • Use (a combination of) tags, classes, attributes, attribute-value pairs

DO: Recap from last week's tutorial

Extend the code snippet below to extract the release year of the song.

import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'
request = requests.get(url, headers =  {'User-agent': 'Mozilla/5.0'})
song_page = BeautifulSoup(request.text)
about = song_page.find(class_='about_artist')
title = about.find('p')

DO: Solution

  • Use the browser's inspect mode to find the relevant attributes
  • Develop the code step by step, building it up layer by layer ("onion"-like)
about.find_all('p')
[<p>China</p>, <p>Tito Puente</p>, <p>1972</p>, <p>54</p>]
year = about.find_all('p')[2].get_text() # '1972' (the release year)

Recommendations:

  • Practice even more, e.g., with new information that you haven't extracted before from this and other sections of the website.
  • Always build code gradually, never from scratch

DO: After-class exercise from last week

  • Any issues you've run into?
  • Solutions posted on the site!

Today's focus

Remaining challenges for the “design” phase of collecting data.

  • How to sample from a website? (Challenge #2.2)
    • Recall that we do not have access to a firm's database (so we can't sample from a population of, say, users)
    • With web scraping, we always need a “starting point” for a data collection
  • Examples of seeds/samples
    • list of recently active users (music-to-scrape.org)
    • homepage of Twitch (for current live streams)

DO: Starting up a data collection for your own project

Using your own project idea, tell us how you could sample from the site.



Tips:

  • Make use of the best practices for Challenge #2.2 in Table 3

Sampling users

For today's tutorial, we will be sampling users from the homepage of music-to-scrape.org.

Can you propose a strategy to capture them? Any attributes/classes you see?

DO: Extract all links

Please use the code snippet below to extract the 'href' attribute of all links on the website.

  • Links are identifiable by the “a” tag.
  • The relevant attribute is called “href”
  • Build your code consecutively!
url = 'https://music-to-scrape.org'

res = requests.get(url)
res.encoding = res.apparent_encoding

homepage = BeautifulSoup(res.text)

# continue here

Solution: Extract all links

for link in homepage.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])
/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
/about
song?song-id=SOZYFSB12A8C1393CF
song?song-id=SOJDKQV12A6D4FAF0E
song?song-id=SONIWYZ12A58A7CB59
song?song-id=SOWHFID12A8C133F6B
song?song-id=SOGGUFB12A6D4F8977
song?song-id=SODDTOV12A6D4F9167
song?song-id=SOHDAZP12AB0185D7F
song?song-id=SOIVRCS12A6D4FDA2C
song?song-id=SOQZISQ12AC4689482
song?song-id=SOOIZES12AB018A80F
song?song-id=SOJZTXJ12AB01845FB
song?song-id=SOPZASO12A6D4F6A79
song?song-id=SOODIJF12A8C13FDBB
song?song-id=SOQXGVE12CF5F86D20
song?song-id=SOGXHEG12AB018653E
song?song-id=SOLNOMM12A8C132AD8
song?song-id=SOYYBGR12A8C140F1A
song?song-id=SOBVAPJ12AB018739D
song?song-id=SOWHXYS12AC9E177F0
song?song-id=SOQUEQP12A8C1397DE
song?song-id=SOBHHIZ12AB01841E0
song?song-id=SOAOXXE12AB0182517
song?song-id=SOJLUMZ12AB0186CA1
song?song-id=SOACUYZ12CF54662F1
song?song-id=SOGXWRE12AC468BE24
artist?artist-id=ARKGWMO11F50C4813F
artist?artist-id=ARYFAT91187B99FEF5
artist?artist-id=AR2TT8P1187B9B624D
artist?artist-id=ARICCN811C8A41750F
artist?artist-id=AR00A6H1187FB5402A
artist?artist-id=ARIN12F1187FB3E92C
artist?artist-id=ARN7OQ21187FB5A6B3
artist?artist-id=ARY55LO1187B9A3F17
user?username=Geek61
user?username=MoonRocket50
user?username=Pixel48
user?username=SonicShadow61
user?username=Coder26
user?username=Vector59
/tutorial_scraping
/tutorial_api
https://api.music-to-scrape.org/docs
/about
/privacy_terms
https://www.linkedin.com/company/tilburgsciencehub
https://github.com/tilburgsciencehub/music-to-scrape
https://twitter.com/tilburgscience

Solution: Extract all links

  • But… are all of these links really relevant?
  • Recall: what “seeds” do we need to gather?

Narrowing down our extraction

  • Let us explore the site structure a bit more
  • Particularly, can we identify other ways to navigate on the site?
  • Remember, your strategy can be a “multi-step” strategy (first to A, then within A to B)!
  • Let's open the inspect mode of our browser and come up with an updated strategy.

DO: Narrowing down our extraction

  • The relevant links all reside in the recent_users section, i.e., a <section> tag with the attribute name="recent_users"
relevant_section = homepage.find('section',attrs={'name':'recent_users'})

DO: Can you come up with a way to loop through all of the links WITHIN relevant_section and store them?

Solution: Narrowing down our extraction

users = []
for link in relevant_section.find_all("a"):
  if ('href' in link.attrs):
      users.append(link.attrs['href'])
users
['user?username=Geek61', 'user?username=MoonRocket50', 'user?username=Pixel48', 'user?username=SonicShadow61', 'user?username=Coder26', 'user?username=Vector59']
  • Let's now take a look at these links more closely.

Modifying the links

  • Notice that the links are relative links, so we can't use them for looping directly (e.g., try pasting one into your browser - it won't work!)
  • So, let's turn them into absolute links, simply by concatenating (= combining) strings (see also the urljoin sketch after the output below).
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls
['https://music-to-scrape.org/user?username=Geek61', 'https://music-to-scrape.org/user?username=MoonRocket50', 'https://music-to-scrape.org/user?username=Pixel48', 'https://music-to-scrape.org/user?username=SonicShadow61', 'https://music-to-scrape.org/user?username=Coder26', 'https://music-to-scrape.org/user?username=Vector59']
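As an aside, instead of concatenating strings yourself, you can let Python resolve relative links. A minimal sketch using urljoin from the standard library's urllib.parse (same relevant_section as above):

from urllib.parse import urljoin

base_url = 'https://music-to-scrape.org/'
urls = [urljoin(base_url, link.attrs['href'])   # resolves relative hrefs against the base URL
        for link in relevant_section.find_all('a')
        if 'href' in link.attrs]
urls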

DO: Remember functions?

Write a function that returns the (absolute) links to all recently active users on the homepage.

Do you remember why we like functions so much?

Solution

import requests
from bs4 import BeautifulSoup

def get_users():
    url = 'https://music-to-scrape.org/'

    res = requests.get(url)
    res.encoding = res.apparent_encoding

    soup = BeautifulSoup(res.text)

    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return links # return all collected links

# let's try it out
# users = get_users() 
  • Functions allow us to re-use code; they prevent errors and help us structure our code.

JSON data

  • For now, the data is just in memory.
  • But, we can also save it!
  • For this, we store the data as JSON objects (dictionaries)
  • Use json.dumps to convert a dictionary to a JSON string (and then save it)
  • Use json.loads to convert a JSON string back to a dictionary (and then use it); see the read-back sketch after the example

Example: JSON data

import json

users = get_users()
# build JSON dictionary
f = open('users.json','w')
for user in users:
  obj = {'url': user,
         'username': user.replace('https://music-to-scrape.org/user?username=','')}
  f.write(json.dumps(obj))
  f.write('\n')
f.close()
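
To read the saved data back in later, a minimal sketch that parses the newline-delimited users.json file written above with json.loads:

import json

users_restored = []
with open('users.json', 'r') as infile:
    for line in infile:
        users_restored.append(json.loads(line))  # each line holds one JSON object
users_restored[0]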

Preventing array misalignment

  • When extracting web data, it's important to extract information in its “original” (nested) structure
  • For example, hover over the user names and observe the song names that pop up.
  • If you were to extract this information separately, you wouldn't be able to know which one relates to which user.

Demonstration

links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())

len(links)
6
len(songs)
4

DO: Preventing array misalignment

Which solutions do you see – by iterating differently through the source code – to prevent array misalignment?

Tip: Observe how we can first iterate through users, THEN extract links (similar to how we have done it with relevant_section).

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users: # iterate through each user
  obj = {'link':user.find('a').attrs['href']  }
  data.append(obj)
data
[{'link': 'user?username=Geek61'}, {'link': 'user?username=MoonRocket50'}, {'link': 'user?username=Pixel48'}, {'link': 'user?username=SonicShadow61'}, {'link': 'user?username=Coder26'}, {'link': 'user?username=Vector59'}]

Solution

users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:
  if user.find('span') is not None:
    song_name=user.find('span').get_text()
  else:
    song_name='NA'
  obj = {'link':user.find('a').attrs['href'],
         'song_name': song_name}

  data.append(obj)
data
[{'link': 'user?username=Geek61', 'song_name': 'Bad Company - Valerie (LP Version)'}, {'link': 'user?username=MoonRocket50', 'song_name': 'NA'}, {'link': 'user?username=Pixel48', 'song_name': 'Lara & Reyes - Amor De Lejos'}, {'link': 'user?username=SonicShadow61', 'song_name': 'NA'}, {'link': 'user?username=Coder26', 'song_name': 'Erik Berglund - All I Ask Of You'}, {'link': 'user?username=Vector59', 'song_name': 'Roger Williams - Cool Water'}]

Let's take a step back...

  • What we've learnt so far…

    • Extract individual information from a website, e.g., the homepage of music-to-scrape (ch. #2.1)
    • Collect all links to user profiles from the homepage (ch. #2.2)
  • What's missing

    • VISIT each of the profile pages
    • Loop through them
  • We can then tie things together in a scraper

    • “collect” all user seeds (scraper 1)
    • then collect all consumption data from individual user pages (scraper 2)

Navigating on a website

  • Two strategies
    • Understand how links are built (pre-building them; strategy 1)
    • Understand how to “click” to the next/previous page (consecutively building them; strategy 2)
urls = []
counter = 37
while counter >= 0:
  urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
  counter = counter - 1
urls
['https://music-to-scrape.org/user?username=StarCoder49&week=37', 'https://music-to-scrape.org/user?username=StarCoder49&week=36', 'https://music-to-scrape.org/user?username=StarCoder49&week=35', 'https://music-to-scrape.org/user?username=StarCoder49&week=34', 'https://music-to-scrape.org/user?username=StarCoder49&week=33', 'https://music-to-scrape.org/user?username=StarCoder49&week=32', 'https://music-to-scrape.org/user?username=StarCoder49&week=31', 'https://music-to-scrape.org/user?username=StarCoder49&week=30', 'https://music-to-scrape.org/user?username=StarCoder49&week=29', 'https://music-to-scrape.org/user?username=StarCoder49&week=28', 'https://music-to-scrape.org/user?username=StarCoder49&week=27', 'https://music-to-scrape.org/user?username=StarCoder49&week=26', 'https://music-to-scrape.org/user?username=StarCoder49&week=25', 'https://music-to-scrape.org/user?username=StarCoder49&week=24', 'https://music-to-scrape.org/user?username=StarCoder49&week=23', 'https://music-to-scrape.org/user?username=StarCoder49&week=22', 'https://music-to-scrape.org/user?username=StarCoder49&week=21', 'https://music-to-scrape.org/user?username=StarCoder49&week=20', 'https://music-to-scrape.org/user?username=StarCoder49&week=19', 'https://music-to-scrape.org/user?username=StarCoder49&week=18', 'https://music-to-scrape.org/user?username=StarCoder49&week=17', 'https://music-to-scrape.org/user?username=StarCoder49&week=16', 'https://music-to-scrape.org/user?username=StarCoder49&week=15', 'https://music-to-scrape.org/user?username=StarCoder49&week=14', 'https://music-to-scrape.org/user?username=StarCoder49&week=13', 'https://music-to-scrape.org/user?username=StarCoder49&week=12', 'https://music-to-scrape.org/user?username=StarCoder49&week=11', 'https://music-to-scrape.org/user?username=StarCoder49&week=10', 'https://music-to-scrape.org/user?username=StarCoder49&week=9', 'https://music-to-scrape.org/user?username=StarCoder49&week=8', 'https://music-to-scrape.org/user?username=StarCoder49&week=7', 'https://music-to-scrape.org/user?username=StarCoder49&week=6', 'https://music-to-scrape.org/user?username=StarCoder49&week=5', 'https://music-to-scrape.org/user?username=StarCoder49&week=4', 'https://music-to-scrape.org/user?username=StarCoder49&week=3', 'https://music-to-scrape.org/user?username=StarCoder49&week=2', 'https://music-to-scrape.org/user?username=StarCoder49&week=1', 'https://music-to-scrape.org/user?username=StarCoder49&week=0']

DO: Navigating on a Website (strategy 2)

Run the code below. How would you extend the code to extract the LINK of the previous button?

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
userpage = BeautifulSoup(res.text)
button=userpage.find(class_='page-link', attrs={'type':'previous_page'})

Solution

button.attrs['href']
'user?username=StarCoder49&week=35'

DISCUSS: When do we use strategy 1 (pre-built) vs. strategy 2 (do it on the fly)?

Let's tie things together

  • We now have a function get_users() to retrieve user names
  • We also know how to “find” the next link to visit (a user's previous page)
  • How can we now visit ALL pages, for ALL users?
  • Let's take a look at some pseudo code (a fleshed-out sketch follows after it)!
users = get_users()

consumption_data = []

for user in users:
  url = user # get_users() returns full user URLs

  while url is not None:
    # scrape information from URL # challenge #2.1, #2.4
    # determine "previous page"
    time.sleep(1) # challenge #2.3

    # if previous page exists: rerun loop on the next URL
    # if previous page does not exist: stop while loop, go to the next user
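
As a rough illustration (not the only way to do it), the pseudo code could translate into the sketch below. It re-uses get_users() and the 'previous' button selector from earlier; storing the raw HTML for later parsing is just one option (challenge #2.4), and looping over all users and weeks will take a while.

import time
import requests
from bs4 import BeautifulSoup

def scrape_user(start_url):
    # walk backwards through a user's weekly pages via the "previous" button
    pages = []
    url = start_url
    while url is not None:
        res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(res.text)

        pages.append({'url': url, 'html': res.text})  # keep the raw page; parse later (challenge #2.4)

        button = soup.find(class_='page-link', attrs={'type': 'previous_page'})
        if button is not None and 'href' in button.attrs:
            url = f"https://music-to-scrape.org/{button.attrs['href']}"
        else:
            url = None  # no previous page: stop and move on to the next user

        time.sleep(1)  # be polite (challenge #2.3)
    return pages

consumption_data = []
for user_url in get_users():
    consumption_data.extend(scrape_user(user_url))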

Challenge #2.3: At which frequency to extract the data?

  • We can decide how often to capture information from a site
    • e.g., once, every 5 minutes, every day
  • There are potential gains in extracting data multiple times (e.g., tracking changes over time)
  • Mind extraction limits & the technically feasible sample size!
  • More issues are explained in Table 3, challenge #2.3 (a simple repeated-collection sketch follows below)
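
A minimal sketch of a repeated collection, assuming you simply want to re-run the seed collection a fixed number of times (in practice you would more likely schedule the script externally, e.g., via cron or Task Scheduler):

import time

for run in range(24):        # e.g., collect 24 times in total...
    users = get_users()      # ...re-sampling the seeds on every run
    # ...visit and store the user pages here (see the pseudo code above)...
    time.sleep(60 * 60)      # ...waiting one hour between runs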

Challenge #2.4: How to process data during collection

  • The value of retaining raw data (a sketch of what store_to_file() could look like follows after the pseudo code)
# pseudo code!
product_data = []
for s in seeds:
  # store raw html code here
  store_to_file(s, 'website.html')
  # continue w/ parsing some information
  product = get_product_info(s)
  product_data.append(product)
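
store_to_file() above is pseudo code. A minimal sketch of what it could look like (the signature is an assumption: here it takes a URL plus a file-name prefix and stores a timestamped HTML file):

import time
import requests

def store_to_file(url, prefix='page'):
    # download a page and keep its raw HTML so it can always be re-parsed later
    html = requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text
    filename = f'{prefix}_{int(time.time())}.html'  # timestamped file name to avoid overwriting
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    return html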

Challenge #2.4: How to process data during collection

  • Parsing data on the fly (vs. after the collection); see the sketch after the pseudo code
# pseudo code!

product_data = []
for s in seeds:
  product = get_product_info(s) 
  product_updated = processing(product)
  time.sleep(1) 
  product_data.append(product_updated) 
  # write data to file here (parsing on the fly!!!)

# write data to file here (after the collection - avoid!)
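
A minimal sketch of "parsing on the fly": append one JSON line per product inside the loop, so a crash never costs you more than the current page (get_product_info() and processing() remain hypothetical helpers from the pseudo code):

import json
import time

with open('products.json', 'w') as outfile:
    for s in seeds:
        product = processing(get_product_info(s))  # hypothetical parsing functions from the pseudo code
        outfile.write(json.dumps(product) + '\n')  # write each record immediately (parsing on the fly)
        time.sleep(1)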

Extract consumption data

We still need to extract the consumption data from a user's profile page.

Let's start with the snippet below.

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=10'
soup = BeautifulSoup(requests.get(url).text)

Can you propose:

  • how to find the table?
  • how to iterate through each row?
  • how to save the information as a JSON dictionary? (if you get stuck, a rough sketch follows below)
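
If you get stuck, the rough sketch below shows one possible approach. It assumes the listening history is rendered as an HTML <table> with one <tr> per play; verify the actual tags and classes in your browser's inspect mode first.

import json

table = soup.find('table')                  # assumption: the listening history is an HTML <table>
rows = table.find_all('tr') if table is not None else []

plays = []
for row in rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:                               # skip rows without <td> cells (e.g., header rows)
        plays.append({'columns': cells})    # one dictionary per row

with open('consumption.json', 'w') as outfile:
    for play in plays:
        outfile.write(json.dumps(play) + '\n')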

Scraping more advanced websites

  • So far, we have done this

    • requests (to get) –> beautifulsoup (to extract information, “parse”)
  • DO: Run the snippet below and open amazon.html

import requests

header = {'User-agent': 'Mozilla/5.0'}

f = open('amazon.html', 'w', encoding = 'utf-8')
f.write(requests.get('https://amazon.com', headers = header).text)
f.close()

Can you explain what happened?

Alternative ways to make connections

  • Many dynamic sites require what I call “simulated browsing”

  • Try this:

!pip install webdriver_manager
!pip install selenium

# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

url = "https://amazon.com/" # remember to solve captchas
driver.get(url)
  • See your browser opening up?
  • Beware of rerunning code - a new instance will run each time!

Continuing with beautifulsoup

  • We can convert the site's source code (page_source) to BeautifulSoup, and proceed as always
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source)

for el in soup.find_all(class_ = 'a-section'):
    title = el.find('h2')
    if title is not None: print(title.text)

Different ways to open a site

  • Static/easy site: use requests
    • requests –> BeautifulSoup –> .find()
  • Dynamic/difficult site: also use selenium
    • selenium –> wait –> BeautifulSoup –> .find() (see the sketch below)
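
The "wait" step can be as simple as time.sleep(), or an explicit wait for a specific element. A minimal sketch of the dynamic-site flow, re-using the driver opened earlier (the h2 tag is just an example of what to parse):

import time
from bs4 import BeautifulSoup

driver.get('https://amazon.com/')     # re-use the driver opened earlier
time.sleep(5)                         # crude "wait": give dynamic content time to render
soup = BeautifulSoup(driver.page_source)

for title in soup.find_all('h2'):     # then parse as usual with .find() / .find_all()
    print(title.get_text(strip=True))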

Next steps

  • Focus on self-study material
  • Today's coaching session…!
    • sign-off on extraction design (challenges #1.1-1.3)
    • working on prototyping the data extraction
    • consider selenium and APIs (see next tutorials)

Thanks!

  • Any questions?
  • Get in touch via WhatsApp for feedback.

Today's "klaaropdracht" (when you're totally advanced already)

Use the code snippets provided on Selenium and BeautifulSoup to collect a list of at least 1,000 products from any of the product pages on Bol.com.

Coaching session


  • Please actively solicit my feedback!