# Web Scraping 101

*By the end of this tutorial, you’ll scrape data from multiple web pages and export it to JSON and CSV for analysis. Set aside a few hours and take breaks to stay sharp.*

*New to web scraping? Start with the ["Webdata for Dummies" tutorial](https://odcm.hannesdatta.com/docs/modules/week2/webdata-for-dummies/).*

*Enjoy!*

--- 

## Learning Objectives

Our main goal is to compile a panel data set of music consumption data for (simulated) users of music-to-scrape.org, a platform developed for practicing web scraping skills.


* **Generating seeds**: Extract multiple elements with `.find_all()`, avoiding array misalignment  
* **Navigating websites**: Visit pages via URLs and use loops for bulk data collection  
* **Optimizing extraction**: Add timers, modularize code, and save data with metadata in CSV/JSON  
* **Scraping advanced sites**: Headless requests vs. browser emulation (`requests` vs. `selenium`)  
--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Generating seeds ("sampling")


__Importance__  

So far, we’ve parsed data from individual *artist pages* (e.g., featured artists' names), but we haven't explored individual user behavior yet. User-level data is often central to web scraping, like sampling tweets from Twitter/X or tracking movie-watching patterns on trakt.tv.  

To build a *panel dataset* (multiple users observed over time), we must first decide __which users to track__. This requires generating a *sample of users* (or books, movies, series, games—depending on the platform).  

In web scraping, a "seed" is the starting point for data collection—without it, there’s no data. For example, before crawling all users at [music-to-scrape.org](https://music-to-scrape.org), we need a *list of users*. Obtaining every username is nearly impossible, so we can:  

1. Visit the [music-to-scrape.org](https://music-to-scrape.org) homepage to find recently active users.  
2. Go to each user's profile and scrape their data (as done in the *Webdata for Dummies* tutorial).  

The homepage provides navigation to user profiles by clicking usernames or avatars (see red boxes below).  

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-users.png" align="left" width="80%"/>  

### 1.1 Collecting Links as Seeds  

Open the [website](https://music-to-scrape.org) and inspect the HTML using Chrome or Firefox (right-click → *Inspect*). Hover over elements and select a user avatar.  

Notice that each user has a clickable `<a>` tag with an `href` pointing to their profile page.  

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-inspect-link.png" align="left" width="60%"/>  

But, how could we tell a computer to capture the links to the various user pages?

One simple way is to select *elements by their tags*, for example, all links (`<a>` tags).

<div class="alert alert-block alert-info"><b>How to extract multiple elements at once?</b>
    <br>
    
- By working through other tutorials, you may already be familiar with the <code>.find()</code> function of BeautifulSoup. The <code>.find()</code> function returns the <b>first element</b> that matches your particular "search query". <br>
- If you want to extract <b>all elements</b> that match a particular search pattern (say, a class name), you can use BeautifulSoup's <code>.find_all()</code> function.<br>
- Note that the result of the <code>.find_all()</code> function is a list of elements <b>that you need to iterate through.</b>

</div>


__Exercise 1.1__  

Run the code below to extract all `<a>` tags and print their `href` values. Don’t worry about understanding the code yet—we’ll break it down step by step soon!  

Look closely at the extracted links. Not all are relevant for user profiles.  

**Task:** Make a list of links *not* pointing to user pages. Which links are these, and why do they appear?  

In [None]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a GET request to the music-to-scrape.org homepage (see Webdata for Dummies tutorial)
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'https://music-to-scrape.org'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

# print the href attribute of every <a> tag on the page
for link in soup.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])

**Your answer**

...

__Solution__

The links we want to ignore are...

* The links to the about or privacy pages
* Any link pointing to the most popular songs or artists
* Any social media links, etc.

These links appear on the page because visitors use them to navigate the site; they do not point to user profiles.

### 1.2 Collecting Targeted Information  

__Importance__  

When scraping data, extracting everything by a general tag often returns irrelevant information. To get more targeted results, we need to be more specific in how we select elements—in this case, links are just one example. __The goal is to extract only the data we care about and ignore the rest.__  

To illustrate this, let’s inspect the "recently active users" section again. __Open your browser’s inspect tool and hover over that section.__  

You’ll see that the user links are inside a `<section>` with the attribute `name="recent_users"`. This structure helps us focus on the relevant content, while unrelated elements like "about" or "privacy" links are excluded.  

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-section.png" align="left" width="60%"/>  

We can now focus our scraper to target only the `<a>` tags *inside* the `<section>` with the attribute `name="recent_users"`. This way, we filter out irrelevant links and get exactly what we need.  

__Let’s try it out!__  

We’ll still use `.find_all()` to capture matching elements on the page. However, instead of directly extracting `<a>` tags, we’ll first select the specific section containing the relevant links and then collect the `<a>` tags within it.  

Run the code below to see how it works. First, we grab elements from the `recent_users` section, then extract all `<a>` tags from that section.  

In [None]:
import requests
from bs4 import BeautifulSoup

# make request
url = 'https://music-to-scrape.org'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'])
users

As expected, we retrieve links to the profile pages of up to six users. You can now use the `users` object to look at the link for the first, second, third, ... user.

In [None]:
users[0] # returns the link to the user page of the 1st user

Note that each item in the list still contains more than the user name (e.g., `user?username=...`). Remember, we extracted the __links__ to the profile pages, not just the user names.

If we want to keep only the usernames, we can modify our extraction code slightly, for example by using Python's `split()` method.


In [None]:
users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'].split('=')[1])
users

Need explanation on this code? Just copy-paste it to ChatGPT and ask for an explanation, e.g., using this prompt:

> I struggle to understand this piece of Python code in the context of web scraping. 
> Can you please explain it, paying attention to the complicated line inside the loop (`users.append(...)`)?

Pretty cool, right? So let's proceed with some exercises.

#### Exercise 1.2  

1. Modify the loop (`for link in relevant_section...`) to extract *absolute URLs* instead of relative ones. Combine the website's base URL (`https://music-to-scrape.org/`) with the extracted string (e.g., `user?username=GalaxyShadow34`). The final URL should look like: `https://music-to-scrape.org/user?username=GalaxyShadow34`.  

2. Wrap your code from step 1 in a function called `get_users()`. This function should return a list of user profile links. We’ll use it later to repeatedly collect user names (seeds) from this page.  

3. Run your `get_users()` function inside a `while` loop that executes every 2 seconds for 15 seconds. Write all collected URLs to a new-line-separated JSON file named `seeds.json`.  

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1 
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls

In [None]:
# Question 2
import requests
from bs4 import BeautifulSoup

def get_users():
    url = 'https://music-to-scrape.org/'
  
    res = requests.get(url)
    res.encoding = res.apparent_encoding
    
    soup = BeautifulSoup(res.text)
    
    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links) # to return all links

get_users()

In [None]:
# Question 3
import time
import json

# Define the duration in seconds (15 seconds for testing)
duration = 15

# Calculate the end time
end_time = time.time() + duration

f = open('seeds.json','a')

# Run the loop until the current time reaches the end time
while time.time() < end_time:
    for user in get_users():
        f.write(json.dumps(user)+'\n')
    time.sleep(2)  # Sleep for a few seconds between each execution
f.close()


<div class="alert alert-block alert-info"><b>Working with JSON data in Python</b>
    <br>
    In Python, we often need to work with JSON data, which is a common format for exchanging information. 
    
- To make a string (such as one read from a file) queryable as JSON, we use the <code>json.loads()</code> function.
  The <code>json.loads()</code> function takes a JSON-formatted string and converts it into a Python data structure, such as a dictionary or a list, so you can easily access its contents.
- If you want to save a Python data structure as a JSON file, you can use the <code>json.dumps()</code> function.
        The <code>json.dumps()</code> function takes a Python object, like a dictionary or a list, and converts it into a JSON-formatted string that you can save to a text file for later use.

</div>
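To make the two functions concrete, here is a minimal sketch of a round trip from a Python dictionary to a JSON-formatted string and back. The record is a made-up example (the timestamp is arbitrary), not data scraped from the site.

```python
import json

# A hypothetical record, similar in shape to what we could store per user
record = {'url': 'https://music-to-scrape.org/user?username=GalaxyShadow34',
          'timestamp': 1700000000}

# Python object -> JSON-formatted string (one line, ready to be written to a file)
line = json.dumps(record)
print(line)

# JSON-formatted string -> Python object (here: a dictionary)
restored = json.loads(line)
print(restored['url'])
```

Writing one such string per line is what makes a file "new-line separated": each line can later be read and converted back on its own.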

### 1.3 Preventing Array Misalignment

So far, we have only extracted *one* piece of information (the URL) from the list of recently active users. But, what if we want to also extract the names of recently consumed songs? For example, you can view this song by hovering over the user profile pictures on the landing page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-hover.png" align="left" width="30%"/>


Closely inspecting the source also shows you this information!


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-song-tag.png" align="left" width="60%"/>


A simple solution may be to just use the `.find_all()` command from BeautifulSoup, extracting all tags called `span`.

__Example__:


In [None]:
# Run this code now
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

# getting links
links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())


# links for each user
print(links)

# recent songs for each user
print(songs)

While this approach seems easily implemented, it is __highly error-prone and needs to be avoided.__ 

So... what happened?

The lengths of these two objects, `links` and `songs`, differ! Didn't spot it? Then see for yourself!


In [None]:
print(len(links))
print(len(songs))

While a link is retrieved for each user, we can only retrieve song information for a subset of users. Ultimately, we won't be able to tell WHICH song belongs to WHICH user. This is what we call a misalignment of the arrays that hold the necessary data.

<div class="alert alert-block alert-info"><b>What's an array misalignment?</b>
    <br>
    
<ul>
<li>
When extracting information from the web, we sometimes are prone to "ripping apart" the website's original structure by putting data points into individual arrays (e.g., lists such as one list for user names and another for their recently consumed songs). </li>
<li>In so doing, we violate the data's original structure: we should store information on users, and <b>each user</b> has a user name/link and song.</li>
    <li>The <b>correct way of organizing the data</b> is to create a list of users (e.g., in a dictionary) and then store each attribute (e.g., the song, etc.) <b>within</b> these objects. <b>Only if we store data this way</b> can we be sure to store everything correctly. </li>
<br>
<li>When we do not adhere to this practice, we run the risk of "array misalignment". For example, if only ONE data point were missing for a user, then the (independent) user names array (say, with 6 items) wouldn't be "1:1 aligned" with the song array (say, with only 2-5 items).</li>
</ul>

</div>
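The problem can be reproduced without visiting any website. The sketch below uses made-up data (three hypothetical users, one of whom shows no song) to contrast two independent lists with a list of dictionaries.

```python
# Hypothetical data: three users, but only two of them show a song
links = ['user?username=UserA', 'user?username=UserB', 'user?username=UserC']
songs = ['Song One', 'Song Three']  # UserB has no song listed

# Pairing the two lists by position silently attaches the wrong song to UserB
# and drops UserC altogether
misaligned = list(zip(links, songs))
print(misaligned)

# Storing one dictionary per user keeps every attribute tied to its user
user_data = [
    {'url': 'user?username=UserA', 'song_name': 'Song One'},
    {'url': 'user?username=UserB', 'song_name': 'NA'},
    {'url': 'user?username=UserC', 'song_name': 'Song Three'},
]
for user in user_data:
    print(user['url'], '->', user['song_name'])
```

Note that the misaligned version raises no error at all, which is exactly why this mistake is easy to miss.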

__So, how to do it correctly?__

Similar to how we first "zoomed in" on the recently active user section earlier, we will *first* zoom in on each __user__, and then, *within each user*, extract the required information.

Subsequently, we will store the information in a list of dictionaries, where each element of the list corresponds to a user. This data structure also allows some song names to be missing. After all, whether or not a song is listed is now tied to one particular user. 

__See the example below.__ Pay attention to how we capture the "unavailability" of a song name by checking whether the `span` element exists (an `if`/`else` check); a `try` and `except` clause would be an alternative.

In [None]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Define the URL you want to scrape
url = 'https://music-to-scrape.org/'

# Send an HTTP GET request to the URL and store the response
res = requests.get(url, headers=user_agent)

# Set the encoding of the response to the apparent encoding
res.encoding = res.apparent_encoding

# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(res.text)

# Find the HTML section with the attribute 'name' equal to 'recent_users'
relevant_section = soup.find('section', attrs={'name': 'recent_users'})

# Identify individual users within the relevant section
users = relevant_section.find_all(class_='mobile-user-margin')

# Initialize a list to store user data
user_data = []

# Loop through each user in the list of users
for user in users:
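    # Look up the first anchor tag of this user and extract its 'href' attribute;
    # fall back to 'NA' so that no link is carried over from the previous user
    link_tag = user.find('a')
    if link_tag is not None and 'href' in link_tag.attrs:
        extracted_link = link_tag.attrs['href']
    else:
        extracted_link = 'NA'
    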
    # Check if the user has a 'span' element
    if user.find('span') is not None:
        # Get the text content of the 'span' element, which represents song names
        song_name = user.find('span').get_text()
    else:
        # If there is no 'span' element, set the song_name to 'NA'
        song_name = 'NA'
    
    # Create a dictionary object with the extracted data
    obj = {'url': extracted_link, 'song_name': song_name}
    
    # Append the dictionary to the user_data list
    user_data.append(obj)

# user_data now contains a list of dictionaries, each representing user information with a URL and song name
user_data

<div class="alert alert-block alert-info"><b>Handling Errors with <code>try</code> and <code>except</code> in Python</b>
    <br>
    
- In Python, we have a useful way to deal with potential errors or exceptions in our code. We use a construct called a <code>try</code> and <code>except</code> clause.
  - The <code>try</code> block is where you place the code that might potentially cause an error. For example, if you're trying to find an element on a website, you can put this code inside the <code>try</code> block.
  - If the code inside the <code>try</code> block encounters an error, instead of crashing your program, Python will jump to the <code>except</code> block. This is incredibly useful for handling situations where, for instance, the element you're trying to find on a website isn't available.
  - Inside the <code>except</code> block, you can define what action to take when an error occurs. In our example, you could set the missing data point to "NA" so that you know it wasn't available.
- However, it's crucial to use the <code>try</code> and <code>except</code> construct sparingly. You don't want to skip the entire process for a user just because one data point isn't available. Instead, use it selectively to handle specific errors and ensure your program continues running smoothly.
</div>
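Here is a minimal, offline sketch of the pattern described in the box. To keep it runnable without a web request, it parses a small made-up snippet with Python's built-in `xml.etree.ElementTree` as a stand-in for BeautifulSoup; the snippet and the user names are assumptions for illustration. The idea carries over: BeautifulSoup's `.find()` also returns `None` when nothing matches, so calling `.get_text()` on the result raises an `AttributeError`.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the second user has no <span> with a song name
snippet = (
    '<section>'
    '<div><a href="user?username=UserA">UserA</a><span>Song One</span></div>'
    '<div><a href="user?username=UserB">UserB</a></div>'
    '</section>'
)
root = ET.fromstring(snippet)

user_data = []
for user in root.findall('div'):
    url = user.find('a').get('href')
    try:
        # user.find('span') is None for UserB, so accessing .text raises an error
        song_name = user.find('span').text
    except AttributeError:
        # only the missing data point is set to 'NA'; the user itself is kept
        song_name = 'NA'
    user_data.append({'url': url, 'song_name': song_name})

print(user_data)
```

Catching the specific `AttributeError` (rather than every possible error) keeps unrelated problems visible.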

## 2. Navigating and Extracting User Profile Data  

__Importance__  

So far, we’ve learned how to extract seeds (users) from a single page—the homepage.  

What’s next?  

[`music-to-scrape.org`](https://music-to-scrape.org) holds user consumption data across multiple pages (one per week). Our goal is to navigate each user’s profile and save the names of all songs, artists, and timestamps (date/time) by visiting these pages one by one.  

__Let's try it out__

Open [the website](https://music-to-scrape.org/user?username=StarCoder49&week=36), and click on the "previous" button at the top of the page. Do you understand how you will be able to "loop" through the site?

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-user-page.png" align="left" width="90%"/>

### 2.1 Extract Consumption Data From User Profile Pages

We’ve identified the seeds (usernames) and target pages (weeks 37–0) but haven’t extracted any consumption data yet (e.g., songs a user listened to).  

Use what you’ve learned (e.g., from *Webdata for Dummies*) to iterate through the table and collect this data.  

__Try it out__  
  

It’s helpful to prototype before assembling a full working script.  

Let’s start by downloading the first page of a user and storing it in a variable called `soup`.

In [None]:
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=6'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)


We can now try a few commands to access information on the site. Keep your browser's inspect tool open on the side. You will probably notice that the table is quite easy to capture: it has its own tag, called `table`.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-table.png" align="left" width="90%"/>

In [None]:
table = soup.find('table')
table

See? This one worked quite well! Inspecting the table a bit more, you can get at the individual rows using the `tr` tag. Again, use your browser's inspect tool to spot it!

In [None]:
table.find('tr')

This is just the first row. Using `.find_all()`, instead, will give you a list of all rows.

In [None]:
rows = table.find_all('tr')
rows

We can also check whether the number of rows is equal to what we would expect from looking at the website. Using the `len` function for this yields...

In [None]:
len(rows)

Looks about right? Yes! So, let's now try to extract, for one row, the name of the song and artist, corresponding to the first and second column of the table.

Let's first select one row for prototyping. We take row 2 (which is the first row after the table header).

In [None]:
one_row = rows[1]

In [None]:
one_row

In [None]:
one_row.find_all('td')[0].get_text() # for song name

In [None]:
one_row.find_all('td')[1].get_text() # for artist name, corresponding to the second "column"


We can now put everything together in one script.

In [None]:
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=6'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

for row in rows:
    #print(row)
    data = row.find_all('td')
    
    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        
        print(f'Song "{song_name}" by "{artist_name}"')

__Exercise 2.1__

1. Rather than printing the data to the screen, store it in a list of dictionaries, containing the following data points:
    - song
    - artist
    - date
    - username
    - and time of data extraction.
2. Wrap your code in a function that returns the list of dictionaries from step 1.

__Solution__

In [None]:
# Q1:
import time

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

json_data=[]

for row in rows:
    data = row.find_all('td')

    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        date=data[2].get_text()
        timestamp=data[3].get_text()
        json_data.append({'song_name': song_name,
                          'artist_name': artist_name,
                          'date': date,
                          'time': timestamp,
                          'timestamp_of_extraction': int(time.time()),
                          'username': url.split('username=')[1].split('&')[0]})
json_data

In [None]:
#Q2

def get_consumption_history(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    
    table = soup.find('table')
    
    rows = table.find_all('tr')
    
    json_data=[]
    for row in rows:
        data = row.find_all('td')
    
        if len(data)>0:
            song_name=data[0].get_text()
            artist_name=data[1].get_text()
            date=data[2].get_text()
            timestamp=data[3].get_text()
            json_data.append({'song_name': song_name,
                              'artist_name': artist_name,
                              'date': date,
                              'time': timestamp,
                              'timestamp_of_extraction': int(time.time()),
                              'username': url.split('username=')[1].split('&')[0]})
    return(json_data)

In [None]:
# try running the function
get_consumption_history('https://music-to-scrape.org/user?username=StarCoder49&week=6')


In [None]:
# Check whether it also works for different weeks
get_consumption_history('https://music-to-scrape.org/user?username=StarCoder49&week=4')
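String splitting works for simple URLs, but it gets fragile once a URL carries more than one parameter (such as `&week=36`). As a sketch of a sturdier alternative, Python's standard library can parse the query string for us. The variable names below are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'

# Split the URL into its components and parse the part after the '?'
params = parse_qs(urlparse(url).query)
print(params)

# Each parameter maps to a list of values; we take the first one
username = params['username'][0]
week = params.get('week', ['NA'])[0]  # 'NA' if the URL has no week parameter
print(username, week)
```

This way, the username is extracted correctly regardless of the order or number of parameters in the URL.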

### 2.2. Loop through all weeks for each user


__Importance__

Alright - what have we achieved so far?

- In section 1, we've built a function to retrieve user names of currently active users. We call this the stage of our project in which we collect "seeds".
- In section 2.1, we've managed to extract a user's consumption history from a table displayed on the user's profile page.

What's missing, though, is __ALL of a user's consumption data__, i.e., from __ALL possible weeks__.

For this, we're making use of the "previous page" button.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mits-previous-button.png" align="left" width="30%"/>

__Let's try it out__

Open the user's profile page at https://music-to-scrape.org/user?username=StarCoder49. __Click on the previous button__ a few times, and observe how the URL in your browser bar is changing. 

For example:

- `https://music-to-scrape.org/user?username=StarCoder49`
- `https://music-to-scrape.org/user?username=StarCoder49&week=37`
- `https://music-to-scrape.org/user?username=StarCoder49&week=36`
- `https://music-to-scrape.org/user?username=StarCoder49&week=35`
- ...

Can you guess the next one...?

A general solution is to look up whether there is a `previous` button on the page (see HTML code below). We can then either "grab" the URL and visit it, or - instead - "click" on the button.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-previous-page.png" align="left" width=60% style="border: 1px solid black" />

So, let's write a snippet that "captures" the link of the previous page button! We always proceed in small steps.

In [None]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://music-to-scrape.org/user?username=StarCoder49'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

In [None]:
# Step 2: Trying to locate the previous button, using a combination of class names and attribute-value pairs.
soup.find(class_='page-link', attrs={'type':'previous_page'})

In [None]:
# Step 3: Trying to extract the `href` attribute
soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']

In [None]:
# Step 4: Storing "previous page" link
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
previous_page_link # print it

At each iteration, we can observe how we're getting closer to the information we need.

Now, we only need to combine the base URL (`https://music-to-scrape.org/`) with the extracted relative link.

In [None]:
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
f'https://music-to-scrape.org/{previous_page_link}'
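As an alternative to gluing strings together, the standard library's `urljoin()` can combine a base URL with a relative link. The sketch below uses two made-up link strings (with and without a leading slash) to show that both resolve to the same absolute URL; which form the website actually uses is something to check in the inspect tool.

```python
from urllib.parse import urljoin

base_url = 'https://music-to-scrape.org/'

# A relative link without a leading slash ...
without_slash = urljoin(base_url, 'user?username=StarCoder49&week=37')
print(without_slash)

# ... and one with a leading slash (an f-string would produce a double slash here)
with_slash = urljoin(base_url, '/user?username=StarCoder49&week=37')
print(with_slash)
```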

__Exercise 2.2__

Please first load the snippet below, which has wrapped the "previous page" capturing in a function. Observe the use of `try` and `except`, which accounts for the last page NOT having a previous page button.

In [None]:
def previous_page(soup):
    try:
        previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
        return(f'https://music-to-scrape.org/{previous_page_link}')
    except (AttributeError, KeyError):  # no previous-page button (or no href) on this page
        return('no previous page')

Let's try out this function on the source code of the website.

In [None]:
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49').text)
previous_page(soup)

See, it worked! Now, proceed with the exercises.


1. Make a web request to 'https://music-to-scrape.org/user?username=StarCoder49&week=36', pass the (souped) object to the `previous_page()` function, and observe the output. Then, use 'https://music-to-scrape.org/user?username=StarCoder49&week=0'. Is that what you expected? 

2. Write a while loop that continuously visits all pages for the user `StarCoder49`, by extracting previous page URLs from each page and continuing the data collection until there is no previous page to fetch. Start with week 10 to minimize server load.

In [None]:
# write your code here

__Solution__

In [None]:
# Question 1
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49&week=36').text)
previous_page(soup)


In [None]:
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49&week=0').text)
previous_page(soup)
# returns "no previous page"

In [None]:
# Question 2
import time

urls = []

# define first URL to start from (week 10, to minimize server load)
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=10'

while True:
    print(f'Opening {url} and checking for previous page...')
    urls.append(url)  # keep track of all visited pages
    soup = BeautifulSoup(requests.get(url).text)
    previous_url = previous_page(soup)
    if 'no previous page' in previous_url: break
    url = previous_url
    time.sleep(1)  # pause briefly between requests
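When prototyping a loop like this, it can help to test the logic offline first. The sketch below replaces the website with a small made-up dictionary (`fake_site`) and a hypothetical helper (`fake_previous_page()`) that mimics the return values of `previous_page()`. It also adds a safety cap (`max_pages`, an assumption, not part of the exercise) so that a bug can never cause an endless loop.

```python
# Hypothetical stand-in for the website: each page "links" to its previous page
fake_site = {
    'user?week=3': 'user?week=2',
    'user?week=2': 'user?week=1',
    'user?week=1': 'user?week=0',
    'user?week=0': None,  # the oldest page has no "previous" button
}

def fake_previous_page(url):
    # Mimics previous_page(): returns a URL or the string 'no previous page'
    previous = fake_site.get(url)
    return previous if previous is not None else 'no previous page'

visited = []
url = 'user?week=3'
max_pages = 10  # safety cap on the number of pages to visit

while len(visited) < max_pages:
    visited.append(url)
    previous_url = fake_previous_page(url)
    if previous_url == 'no previous page':
        break
    url = previous_url

print(visited)
```

Once the loop behaves as intended, the stand-in can be swapped for real requests, with a pause between them.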
    

------------
So... seems like we're almost there!

The only thing that's missing is to actually also extract the song consumption data from each of the user profile pages.

We turn towards this issue next.

## 3. Improving Extraction Design  

### 3.1 Timers  

__Importance__  

Notice the use of `time.sleep` earlier? Sending too many requests at once can overload a server and get your IP blocked. Pausing between requests is essential to avoid this.  

__Try it out__  

In Python, use the `time` module to pause execution. For example, after `time.sleep(2)`, the print statement runs only after a 2-second delay:  

In [None]:
# run this cell again to see the timer in action yourself!
import time
pause = 2
time.sleep(pause)
print(f"I'll be printed to the console after {pause} seconds!")

__Exercise 3.1__

Modify the code above to sleep for 2 minutes. Go grab a coffee in-between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button!)

In [None]:
# your answer goes here!

**Solution**  

In [None]:
time.sleep(2*60)
print("Done!")
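To check how long a pause really took, you can record the time before and after sleeping. This is a small sketch with a short pause for demonstration purposes; the variable names are only illustrative.

```python
import time

pause = 0.5  # kept short for the demo; use 2 * 60 for two minutes

start = time.time()
time.sleep(pause)
elapsed = time.time() - start

print(f'Asked for {pause} seconds, actually waited {elapsed:.2f} seconds.')
```

The measured time is usually slightly longer than requested, because running the code itself takes a little time, too.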

### 3.2 Modularization  

**Importance**  

In scraping, many tasks must be repeated, like extracting the consumption table each time we open a new user page on *music-to-scrape.org*.  

To make this easier, we’ll modularize our code into functions. This improves readability, reusability, and allows us to call the same code whenever needed. Need a refresher? Please revisit the [Python Bootcamp](https://odcm.hannesdatta.com/docs/modules/week1/pythonbootcamp/).  

**Try it out**  

Let’s complete our scraper by combining everything we’ve learned.  

Re-run the `get_users` function from Exercise 1.2 (2), or the cell below (which is a copy from above). Then, continue with the exercises.  

In [None]:
import requests
from bs4 import BeautifulSoup

def get_users():
    url = 'https://music-to-scrape.org/'
  
    res = requests.get(url)
    res.encoding = res.apparent_encoding
    
    soup = BeautifulSoup(res.text)
    
    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links) # to return all links

get_users()

__Exercise 3.2__

Execute the function `get_users()` for a few minutes to collect a list of usernames. Store the user names in a JSON file (new-line separated), along with the timestamp of data retrieval `int(time.time())`.


In [None]:
# your answer here

__Solution__

In [None]:
import time
import json

duration = 15 # for testing, just 15 seconds

# Calculate the end time
end_time = time.time() + duration

f = open('seeds.json','w') # start a new file with seeds, so, use `w` (write new file) instead of `a` (append to existing file)

# Run the loop until the current time reaches the end time
while time.time() < end_time:
    print('Scraping user names...')
    for user in get_users():
        new_user = {'url': user,
                    'timestamp': int(time.time())}
        f.write(json.dumps(new_user)+'\n')
    time.sleep(2)  # Sleep for a few seconds between each execution
f.close()
print('Done.')

In [None]:
# verify whether you can open the data

import json
f = open('seeds.json','r',encoding = 'utf-8')
data = f.readlines()
for item in data:
    print(json.loads(item))
f.close()
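
Because `get_users()` is called repeatedly, the same user can appear in `seeds.json` more than once. The sketch below shows one way to drop duplicates before visiting profile pages, keeping the first observation per URL. The function name `unique_seeds` and the sample records are assumptions for illustration; the input follows the same new-line separated JSON layout as the file written above.

```python
import json

def unique_seeds(lines):
    """Keep the first seed per URL from new-line separated JSON records."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        seed = json.loads(line)
        if seed['url'] not in seen:
            seen.add(seed['url'])
            unique.append(seed)
    return unique

# made-up sample records in the same layout as seeds.json
sample = [
    '{"url": "https://music-to-scrape.org/user?username=A", "timestamp": 1}\n',
    '{"url": "https://music-to-scrape.org/user?username=B", "timestamp": 1}\n',
    '{"url": "https://music-to-scrape.org/user?username=A", "timestamp": 3}\n',
]
print(unique_seeds(sample))
```

With the real file, the lines returned by `open('seeds.json').readlines()` can be passed to the function in the same way.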

__Exercise 3.3__

Now, let's write some code that loads `seeds.json` and visits each user's __first profile page__ to extract consumption data. Remember to build in a little timer (e.g., waiting for 2 seconds or so). The prototype/starting code below stops automatically after 5 iterations to minimize server load. When you think you're done, try deactivating the prototyping condition by putting the comment character `#` in front of it!


In [None]:
# start from the code below

import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show the URL for which consumption data needs to be captured
    print(obj['url'])
    
    # wait for two seconds before moving on to the next user
    time.sleep(2)

print('Done!')

<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
    <ul>
        <li>
            Use the function <code>get_consumption_history(url)</code> from Exercise 2.3 above!
        </li>
    </ul>
</div>


__Solution__

In [None]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # URL for which consumption data needs to be captured
    url = obj['url']

    print(f'Extracting information for {url}...')
    
    output_file = open('output_data.json','a')

    songs = get_consumption_history(url)

    for song in songs:
        output_file.write(json.dumps(song))
        output_file.write('\n')

    output_file.close()
    
    time.sleep(2)

print('Done!')

<div class="alert alert-block alert-info"><b>Tip: Understanding the Difference Between <code>'a'</code> and <code>'w'</code> When Writing Files in Python</b>
    <br>
    
- When working with files in Python, it's essential to know the difference between <code>'a'</code> and <code>'w'</code>  when opening them.
- <code>'a'</code> stands for "append" mode. When you open a file with <code>'a'</code> , Python will let you add data to the end of the existing file without erasing its contents. This is useful when you want to add new information to a file without losing what's already there. It's like adding new lines to the end of an ongoing document.
- <code>'w'</code>  stands for "write" mode. When you open a file with <code>'w'</code> , Python will create a new file or overwrite an existing one. This means that if the file already has data in it, using <code>'w'</code>  will erase all the existing content and start fresh. It's like creating a new document or wiping out the old one.
- Remember, when scraping data or working with files, it's generally safer to use <code>'a'</code>. This way, you won't accidentally delete valuable data. Using <code>'w'</code>  should be done with caution, and only when you intentionally want to start with a clean slate or create a new file altogether.
</div>
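
To see the difference between the two modes in action, here is a small sketch that writes to a throwaway file in a temporary folder, so no real data is touched. The file name `demo.json` is an arbitrary choice for illustration.

```python
import os
import tempfile

# work in a temporary folder so no real data is touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.json')

with open(path, 'w') as f:   # 'w' creates the file (or empties an existing one)
    f.write('first\n')

with open(path, 'a') as f:   # 'a' adds to the end of the file
    f.write('second\n')

with open(path) as f:
    after_append = f.read().splitlines()

with open(path, 'w') as f:   # 'w' again: the previous content is gone
    f.write('third\n')

with open(path) as f:
    after_write = f.read().splitlines()

print(after_append)  # ['first', 'second']
print(after_write)   # ['third']
```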

Finally, we can re-open the extracted data in Python to see whether what we retrieved seems complete.

Make sure you have the `pandas` package installed by running the next cell.

In [None]:
!pip install pandas

Now, we can load the data.

In [None]:
# inspect data in pandas
import pandas as pd
pd.read_json('output_data.json', lines=True)
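
The extracted records can also be exported to CSV. Below is a minimal sketch that uses only the standard library: it reads new-line separated JSON and writes one row per record. The helper name `json_lines_to_csv` and the sample records are assumptions for illustration, and the column names are simply taken from the first record.

```python
import csv
import io
import json

def json_lines_to_csv(lines, out_file):
    """Write new-line separated JSON records to an open CSV file; return the row count."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0
    # use the keys of the first record as column names
    writer = csv.DictWriter(out_file, fieldnames=list(records[0].keys()),
                            extrasaction='ignore')
    writer.writeheader()
    writer.writerows(records)
    return len(records)

# made-up records; with real data, pass open('output_data.json').readlines()
sample = [
    '{"song_name": "Song A", "artist_name": "Artist X", "username": "user1"}\n',
    '{"song_name": "Song B", "artist_name": "Artist Y", "username": "user1"}\n',
]
buffer = io.StringIO()  # stands in for open('output_data.csv', 'w', newline='')
n_rows = json_lines_to_csv(sample, buffer)
print(n_rows)
print(buffer.getvalue())
```

With pandas, `pd.read_json('output_data.json', lines=True).to_csv('output_data.csv', index=False)` is a one-line alternative.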

### 3.3 Summary

At the beginning of this tutorial, we promised to write multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

<div class="alert alert-block alert-info"><b>Limitations of BeautifulSoup and the Advantages of Selenium</b>
    <br>
    
- While BeautifulSoup is a powerful tool for parsing and navigating HTML documents, it has some limitations when it comes to interacting with websites:
  - BeautifulSoup is a static parser, meaning it can't interact with dynamic web content that loads or changes after the initial page load. This makes it less suitable for websites that heavily rely on, say, JavaScript to update their content. For example, this is relevant for Twitter or Instagram.
  - BeautifulSoup can't handle user interactions such as clicking buttons, filling out forms, or navigating through complex web applications.
- When you need to scrape data from very modern and interactive websites, consider using a tool like Selenium. Selenium is a web automation framework that allows you to control a web browser programmatically.
  - With Selenium, you can automate interactions with websites, simulate user actions, and retrieve data from pages that rely heavily on JavaScript.
  - It's an excellent choice for scraping data from dynamic websites, conducting web testing, and performing tasks that require a more interactive approach.
- Keep in mind that while BeautifulSoup is great for many scraping tasks, knowing when to use Selenium can open up new possibilities and make your web scraping efforts more effective.

</div>


## 4. A Primer on Scraping Advanced, Dynamic Websites with Selenium  

So far, you’ve used the `requests` library to retrieve web data. While this works for simpler sites, it often fails on modern, dynamic websites like Twitch, Twitter, or Instagram, where content is loaded dynamically through JavaScript.  

A powerful solution is to use the `selenium` library, which allows you to control a web browser programmatically. With `selenium`, you can simulate user actions such as clicking buttons, scrolling, or filling out forms—making it possible to access dynamic content that `requests` can't handle.  

In this section, we’ll focus on how to *open* and *navigate* websites using `selenium`. Once you’re on the site and the content has fully loaded, you can continue using `BeautifulSoup` to parse and extract the data you need. 

This combination gives you the best of both worlds: `selenium` for interaction and `BeautifulSoup` for efficient data extraction.  

__Let’s get started__  

We’ll begin with setting up `selenium` and writing a simple script to open a website and retrieve some data.


### 4.1 Making a connection to a website using Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>


In [None]:
!pip install webdriver_manager --upgrade
!pip install selenium --upgrade


In [None]:
# Using selenium 4 - ensure you have Chrome installed (and wait a bit for Chrome to show up!)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

url = "https://music-to-scrape.org/"
driver.get(url)

If everything went smoothly, your computer opened a new Chrome window and loaded `music-to-scrape.org`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open up.
    
Whenever the code switches pages, just open that page manually in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>

From now on, you can use `driver.get('https://google.com')` to point the browser to different websites (i.e., you don't need to set up the driver over and over again, unless you open up a new instance of Jupyter Notebook).

### 4.2 Using BeautifulSoup with Selenium


We can now also try to extract information. Note that we're converting the source code of the site to a `BeautifulSoup` object (because you may have learnt how to use `BeautifulSoup` earlier).

In [None]:
# we also need the time package to wait a few seconds until the page is loaded
import time
url = "https://music-to-scrape.org/"
driver.get(url)
time.sleep(3)

Rather than using the source code obtained with the `requests` library, we can now convert the page source rendered by Selenium (`driver.page_source`) into a BeautifulSoup object.

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

...and start experimenting with querying the site (such as retrieving the text of all cards).

In [None]:
cards = soup.find_all(class_='card-body')

# print the text of each card, numbered from 1
for counter, card in enumerate(cards, start=1):
    print('Card ' + str(counter) + ': ' + card.get_text())


### 4.3 Clicking and Scrolling with Selenium

__Importance__

For more dynamic websites, we may have to click on certain elements (rather than extracting some URL).

<div class="alert alert-block alert-info"><b>Extracting elements using Selenium, not BeautifulSoup</b> 

Selenium is really great for navigating dynamic websites. There are two ways in which you can use it for querying sites:
    
<ul>
    <li>pass the "selenium" source code (<code>driver.page_source</code>) to BeautifulSoup, and then use BeautifulSoup commands, or </li>
    <li>directly use selenium (and its own query language) to extract elements.</li>
</ul>
    
In the next few examples, we are using selenium's "internal" query language, which you can identify easily because it is a method of the `driver` object and because it has a different name (`find_element`, instead of `find` or `find_all`).
    
Want to know more about selenium's built-in query language? Check out the "Advanced Web Scraping Tutorial", or dig up some extra material from the web. Knowing both BeautifulSoup and Selenium makes you most productive!
  
</div>

__Try it out__

If you haven't done so, rerun the installation code for `selenium` from above. Then, proceed by running the following cell and observe what happens in your browser:

1. **Click a button** to accept the cookie banner.
2. **Scroll down** the page to reveal hidden elements.  


In [None]:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException

url = "https://music-to-scrape.org/"
driver.get(url)
time.sleep(3) # wait for 3 seconds

# Click the button that accepts the cookie banner (if it is shown)
try:
    cookie_button = driver.find_element(By.ID, "accept-cookies")
    cookie_button.click()
except (NoSuchElementException, ElementNotInteractableException):
    print('No cookie button found (anymore)!')
    
    
# Scroll down the page
scroll_pause_time = 2
for _ in range(3):  # Scroll down 3 times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)


Clicking and scrolling are essential steps when working with dynamic websites. With `Selenium`, you can use `driver.find_element()` to interact with various elements, such as buttons and scrollable sections. These interactions are often necessary to fully load content that would otherwise remain hidden. Unlike `requests`, `Selenium` gives you full control over the webpage, but it comes with trade-offs—it’s slower and can sometimes be buggy, especially on headless systems.  
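
The cell above scrolls a fixed number of times. On pages that keep loading new content, a common pattern is to scroll until the page height stops growing. The sketch below illustrates that idea under stated assumptions: `scroll_to_bottom`, `max_rounds`, and the `FakeDriver` stand-in are hypothetical names. The function relies only on `driver.execute_script()`, so it should also accept the `driver` object created earlier.

```python
import time

def scroll_to_bottom(driver, pause=2, max_rounds=10):
    """Scroll down until the page height stops growing (or max_rounds is hit).

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a Selenium driver, so the logic can be tried without a browser."""
    def __init__(self, heights):
        self.heights = heights  # page height after 0, 1, 2, ... scrolls
        self.scrolls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            index = min(self.scrolls, len(self.heights) - 1)
            return self.heights[index]
        self.scrolls += 1  # any other script counts as one scroll


fake = FakeDriver(heights=[1000, 2000, 3000, 3000])
print(scroll_to_bottom(fake, pause=0))
```

The `max_rounds` guard keeps the loop from running forever on pages with endless feeds.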




__Exercise 4.1__

Please write code snippets to extract the following pieces of information. Do you choose `requests` or `selenium`?

1. The titles of all `<h2>` tags from `https://odcm.hannesdatta.com/docs/course/`
2. The titles of all available TV series from `https://www.bol.com/nl/nl/l/series/3133/30291/` (about 24)

Hint for question 2: the product titles are stored in links, which you can select as follows.

```
soup.find_all('a', class_='product-title')
```

If you use `selenium`, remember to wait a few seconds until the page has loaded.

```
import time
url = "https://twitch.tv/" # some example URL
driver.get(url)
time.sleep(3)
```

In [None]:
# write your solution here

In [None]:
# Solution to question 1:
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser we identify as
request = requests.get('https://odcm.hannesdatta.com/docs/course/', headers = header)
request.encoding = request.apparent_encoding # use the encoding detected from the page content
soup = BeautifulSoup(request.text, 'html.parser')
for title in soup.find_all('h2'):
    print(title.get_text())

In [None]:
# Solution to question 2:
driver.get('https://www.bol.com/nl/nl/l/series/3133/30291/')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
# the exercise asks for the titles, so we store the link text rather than the href
titles = []
for link in soup.find_all('a', class_='product-title'):
    titles.append(link.get_text(strip=True))
titles


__Wrapping Up__  
Wow – you’ve just learned another way to open and interact with websites using `Selenium`. This powerful tool allows you to work with highly dynamic sites and helps you avoid blocks that sometimes happen with `requests`. However, keep in mind that `Selenium` is slower and less efficient for scaling up large data collections.  





If you want to go deeper, here are some excellent resources for additional study:  
- **Selenium Documentation**: [https://www.selenium.dev/documentation/](https://www.selenium.dev/documentation/)  
- **Selenium with Python**: [https://selenium-python.readthedocs.io/](https://selenium-python.readthedocs.io/)  
- **Stack Overflow** for troubleshooting common issues.  

<div class="alert alert-block alert-info"><b>Awesome stuff with Selenium</b> 

Selenium is your best shot at navigating a dynamic website. It can do amazing things, such as 
    
<ul>
    <li>"clicking" on buttons,</li>
    <li>scrolling through a site,</li>
    <li>hovering over items and capturing information from popups,</li>
    <li>starting to play a stream,</li>
    <li>typing text and submitting it in the chat, and</li>
    <li>so much more...!</li>
</ul>
    
Note though that we won't cover the advanced functionality of Selenium in this tutorial, but the optional "Web data advanced" tutorial holds the necessary information.
   
</div>



## After-class exercises


### Exercise 1

Can you extend the code written in 3.2 to extract data from ALL of a user's profile pages?

### Exercise 2

Please port your data collection into two Python scripts. The first, `collect_seeds.py`, collects seeds for 5 minutes. You can use a task scheduler to launch this script every 15 minutes and keep it running for a few hours.

Building on exercise 1 above, write a second script, called `collect_user_data.py`, which you run once (after you've finalized collecting seeds). This script collects all of the required data for all users.

__Solution__

Let us first modify the `get_consumption_history()` function so that it also tells us whether there is a `previous page`.

In [None]:
# Question 1

def get_consumption_history(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, 'html.parser')
    
    table = soup.find('table')
    
    rows = table.find_all('tr')
    
    json_data=[]
    for row in rows:
        data = row.find_all('td')
    
        if len(data)>0:
            song_name=data[0].get_text()
            artist_name=data[1].get_text()
            date=data[2].get_text()
            timestamp=data[3].get_text()
            json_data.append({'song_name': song_name,
                              'artist_name': artist_name,
                              'date': date,
                              'time': timestamp,
                              'timestamp_of_extraction': int(time.time()),
                              'username': url.split('=')[1].split('&')[0]}) # drop any further URL parameters

    url_of_previous_page = previous_page(soup)
        
    return({'songs': json_data, 'previous_page': url_of_previous_page})
    

In [None]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # URL of the user's first profile page
    url = obj['url']

    while 'no previous page' not in url:
        print(f'Extracting information for {url}...')
    
        output_file = open('output_data.json','a')
    
        songs = get_consumption_history(url)
        
        for song in songs['songs']:
            output_file.write(json.dumps(song))
            output_file.write('\n')
        output_file.close()
        
        url = songs['previous_page']
        time.sleep(2)
    
print('Done!')
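
The `while` loop above keeps following the "previous page" link until none is left. Below is a minimal sketch of the same pattern in isolation, with a guard against endless loops. The names `collect_all_pages`, `fetch_page`, `max_pages`, and the fake pages are hypothetical. `fetch_page(url)` is assumed to return a dictionary with the keys `songs` and `previous_page`, like `get_consumption_history()` above, and the end of the history is assumed to be signalled by the text `no previous page`.

```python
import time

def collect_all_pages(start_url, fetch_page, pause=2, max_pages=50):
    """Follow 'previous page' links, starting at start_url, and collect all songs."""
    all_songs = []
    visited = set()
    url = start_url
    while 'no previous page' not in url and len(visited) < max_pages:
        if url in visited:
            break  # safety net: never request the same page twice
        visited.add(url)
        page = fetch_page(url)
        all_songs.extend(page['songs'])
        url = page['previous_page']
        time.sleep(pause)  # be gentle on the server
    return all_songs

# made-up pages that mimic the return value of get_consumption_history()
fake_pages = {
    'page3': {'songs': ['c1', 'c2'], 'previous_page': 'page2'},
    'page2': {'songs': ['b1'], 'previous_page': 'page1'},
    'page1': {'songs': ['a1'], 'previous_page': 'no previous page'},
}
print(collect_all_pages('page3', fake_pages.get, pause=0))
```

Passing the fetch function as an argument keeps the loop easy to try out without sending any requests.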