Hannes Datta
We're about to start with today's tutorial (“web scraping 101”).
Today's agenda: beautifulSoup (static websites; works in Colab), selenium (dynamic websites; only works on your laptops), and the .find() and .find_all() functions.
DO: Extend the code snippet below to extract the release year of the song.
import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'
request = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
song_page = BeautifulSoup(request.text)
about = song_page.find(class_='about_artist')
title = about.find('p')
about.find_all('p')
[<p>China</p>, <p>Tito Puente</p>, <p>1972</p>, <p>54</p>]
plays = about.find_all('p')[3].get_text()
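The release year can be grabbed the same way; a minimal sketch, assuming the order of the <p> tags matches the output shown above (year in the third position):
# assuming the third <p> (index 2) holds the release year, as in the output above
release_year = about.find_all('p')[2].get_text()
print(release_year) # '1972'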
Recommendations:
We use beautifulSoup to address the remaining challenges of the “design” phase of collecting data.
Using your own project ideas, tell us how you could sample from the site (see Table 3 in “Fields of Gold”).
Let's sample users from the main website of music-to-scrape.org
Can you propose a strategy to capture them? Any attributes/classes you see?
DO: Please use the code snippet below to extract the 'href' attribute of all links on the website.
url = 'https://music-to-scrape.org'
res = requests.get(url)
res.encoding = res.apparent_encoding
homepage = BeautifulSoup(res.text)
# continue here
for link in homepage.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs["href"])
/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
https://odcm.hannesdatta.com
https://doi.org/10.1016/j.jretai.2024.02.002
https://web-scraping.org
https://github.com/tilburgsciencehub/music-to-scrape
https://tilburgsciencehub.com
/about
tutorial_scraping
song?song-id=SOCOMAW12AB017EE59
song?song-id=SOBMAUD12A6D4F9181
song?song-id=SOFCKRY12AAF3B2FAF
song?song-id=SOKHBNW12A8AE48A58
song?song-id=SOVLTPF12AC46866D7
song?song-id=SOHTWEG12AB018D11C
song?song-id=SOGRYWN12AB018C335
song?song-id=SOGMROZ12A679D8AE9
song?song-id=SOKFEUT12A6D4FC34C
song?song-id=SOMKWCF12A8C142BCD
song?song-id=SOJKNYV12A8C133E9C
song?song-id=SOAYONI12A6D4F85C8
song?song-id=SOCLLWU12A8AE47C66
song?song-id=SOOZDSM12A8C13D39D
song?song-id=SOEKLNK12A58A7839A
song?song-id=SODKSMV12A6D4F6922
song?song-id=SOJRGVK12CF5CFC527
song?song-id=SOJFLGV12A8C141AB3
song?song-id=SOCBWVV12A8C13605F
song?song-id=SOJZWRA12AB018D029
song?song-id=SOSAMRR12AB018203B
song?song-id=SOLGUGY12AB01897BE
song?song-id=SOEOJJA12AB018FCCF
song?song-id=SOOPVJI12AB0183957
song?song-id=SOWJRTX12AB0183C28
artist?artist-id=ARWBL9E1187FB4E695
artist?artist-id=ARY55LO1187B9A3F17
artist?artist-id=ARMBTFC1187FB56343
artist?artist-id=ARR2NH51187B98CE4C
artist?artist-id=ARA2ZTN1187B98E3ED
artist?artist-id=AREFUMW11F4C844D2B
artist?artist-id=ARMORUX11F50C4EEBF
artist?artist-id=ARN7OQ21187FB5A6B3
user?username=Galaxy04
user?username=StarPanda93
user?username=Stealth20
user?username=StarCoder49
user?username=Geek73
user?username=Panda38
/tutorial_scraping
/tutorial_api
https://api.music-to-scrape.org/docs
/about
/privacy_terms
https://www.linkedin.com/company/tilburgsciencehub
https://github.com/tilburgsciencehub/music-to-scrape
https://twitter.com/tilburgscience
The recently active users are listed in the recent_users section, starting with <section name="recent_users">.
relevant_section = homepage.find('section', attrs={'name': 'recent_users'})
DO: Can you come up with a way to loop through all of the links WITHIN relevant_section and store them?
users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        users.append(link.attrs['href'])
users
['user?username=Galaxy04', 'user?username=StarPanda93', 'user?username=Stealth20', 'user?username=StarCoder49', 'user?username=Geek73', 'user?username=Panda38']
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls
['https://music-to-scrape.org/user?username=Galaxy04', 'https://music-to-scrape.org/user?username=StarPanda93', 'https://music-to-scrape.org/user?username=Stealth20', 'https://music-to-scrape.org/user?username=StarCoder49', 'https://music-to-scrape.org/user?username=Geek73', 'https://music-to-scrape.org/user?username=Panda38']
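As a side note, the standard library's urljoin can build these absolute URLs as well and takes care of leading slashes; a minimal sketch using the same relevant_section as above:
from urllib.parse import urljoin

urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        # urljoin resolves a relative link against the base URL
        urls.append(urljoin('https://music-to-scrape.org/', link.attrs['href']))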
DO: Write a function that wraps the code above and returns (return()) the list of user page URLs. Do you remember why we like functions so much?
Narrowing down (VII): Solution
import requests
from bs4 import BeautifulSoup
def get_users():
    url = 'https://music-to-scrape.org/'
    res = requests.get(url)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    relevant_section = soup.find('section', attrs={'name':'recent_users'})
    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs:
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links) # to return all links
# let's try it out
# users = get_users()
Use json.dumps to convert JSON to text (and then save it), and json.loads to convert text to JSON (and then use it).
import json
users = get_users()
# build JSON dictionary
f = open('users.json', 'w')
for user in users:
    obj = {'url': user,
           'username': user.replace('https://music-to-scrape.org/user?username=','')}
    f.write(json.dumps(obj))
    f.write('\n')
f.close()
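To check the result, you can read the newline-delimited file back with json.loads; a minimal sketch, assuming users.json was written as above:
# read each line back into a dictionary
users_loaded = []
with open('users.json', 'r') as infile:
    for line in infile:
        users_loaded.append(json.loads(line))
print(users_loaded[0])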
links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')
# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())
len(links)
6
len(songs)
3
DO: How can you capture the song data such that we can always link a particular song to a particular user (i.e., to prevent array misalignment)?
Tip: Observe how we can first iterate through users, THEN extract links (similar to how we have done it with relevant_section).
users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users: # iterate through each user
    obj = {'link': user.find('a').attrs['href']}
    data.append(obj)
data
[{'link': 'user?username=Galaxy04'}, {'link': 'user?username=StarPanda93'}, {'link': 'user?username=Stealth20'}, {'link': 'user?username=StarCoder49'}, {'link': 'user?username=Geek73'}, {'link': 'user?username=Panda38'}]
users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:
    if user.find('span') is not None:
        song_name = user.find('span').get_text()
    else:
        song_name = 'NA'
    obj = {'link': user.find('a').attrs['href'],
           'song_name': song_name}
    data.append(obj)
data
[{'link': 'user?username=Galaxy04', 'song_name': 'NA'}, {'link': 'user?username=StarPanda93', 'song_name': 'NA'}, {'link': 'user?username=Stealth20', 'song_name': "The Fabulous Thunderbirds - Rainin' In My Heart"}, {'link': 'user?username=StarCoder49', 'song_name': "Jacques Dutronc - L'Homme De Paille"}, {'link': 'user?username=Geek73', 'song_name': 'NA'}, {'link': 'user?username=Panda38', 'song_name': 'Cancer Bats - Sabotage'}]
What we've learnt so far…
What's missing
We can then tie things together in a scraper
urls = []
counter = 37
while counter >= 0:
    urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
    counter = counter - 1
urls
['https://music-to-scrape.org/user?username=StarCoder49&week=37', 'https://music-to-scrape.org/user?username=StarCoder49&week=36', 'https://music-to-scrape.org/user?username=StarCoder49&week=35', 'https://music-to-scrape.org/user?username=StarCoder49&week=34', 'https://music-to-scrape.org/user?username=StarCoder49&week=33', 'https://music-to-scrape.org/user?username=StarCoder49&week=32', 'https://music-to-scrape.org/user?username=StarCoder49&week=31', 'https://music-to-scrape.org/user?username=StarCoder49&week=30', 'https://music-to-scrape.org/user?username=StarCoder49&week=29', 'https://music-to-scrape.org/user?username=StarCoder49&week=28', 'https://music-to-scrape.org/user?username=StarCoder49&week=27', 'https://music-to-scrape.org/user?username=StarCoder49&week=26', 'https://music-to-scrape.org/user?username=StarCoder49&week=25', 'https://music-to-scrape.org/user?username=StarCoder49&week=24', 'https://music-to-scrape.org/user?username=StarCoder49&week=23', 'https://music-to-scrape.org/user?username=StarCoder49&week=22', 'https://music-to-scrape.org/user?username=StarCoder49&week=21', 'https://music-to-scrape.org/user?username=StarCoder49&week=20', 'https://music-to-scrape.org/user?username=StarCoder49&week=19', 'https://music-to-scrape.org/user?username=StarCoder49&week=18', 'https://music-to-scrape.org/user?username=StarCoder49&week=17', 'https://music-to-scrape.org/user?username=StarCoder49&week=16', 'https://music-to-scrape.org/user?username=StarCoder49&week=15', 'https://music-to-scrape.org/user?username=StarCoder49&week=14', 'https://music-to-scrape.org/user?username=StarCoder49&week=13', 'https://music-to-scrape.org/user?username=StarCoder49&week=12', 'https://music-to-scrape.org/user?username=StarCoder49&week=11', 'https://music-to-scrape.org/user?username=StarCoder49&week=10', 'https://music-to-scrape.org/user?username=StarCoder49&week=9', 'https://music-to-scrape.org/user?username=StarCoder49&week=8', 'https://music-to-scrape.org/user?username=StarCoder49&week=7', 'https://music-to-scrape.org/user?username=StarCoder49&week=6', 'https://music-to-scrape.org/user?username=StarCoder49&week=5', 'https://music-to-scrape.org/user?username=StarCoder49&week=4', 'https://music-to-scrape.org/user?username=StarCoder49&week=3', 'https://music-to-scrape.org/user?username=StarCoder49&week=2', 'https://music-to-scrape.org/user?username=StarCoder49&week=1', 'https://music-to-scrape.org/user?username=StarCoder49&week=0']
DO: Run the code below. How would you extend the code to extract the LINK of the previous button?
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
userpage = BeautifulSoup(res.text)
button=userpage.find(class_='page-link', attrs={'type':'previous_page'})
button.attrs['href']
'user?username=StarCoder49&week=35'
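Keep in mind that .find() returns None when the button is absent (e.g., once you reach the last available week), so it is convenient to wrap this lookup in a small helper; get_previous_link below is a hypothetical name, not part of the tutorial site:
# hypothetical helper: return the absolute URL of the previous page, or None if the button is missing
def get_previous_link(page):
    button = page.find(class_='page-link', attrs={'type': 'previous_page'})
    if button is not None and 'href' in button.attrs:
        return 'https://music-to-scrape.org/' + button.attrs['href']
    return None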
DISCUSS: When do we use strategy 1 (pre-built) vs. strategy 2 (do it on the fly)?
Use get_users() to retrieve the user names:
users = get_users()
consumption_data = []
for user in users:
    url = user['url']
    while url is not None:
        # scrape information from URL # challenge #2.1, #2.4
        # determine "previous page"
        time.sleep(1) # challenge #2.3
        # if previous page exists: rerun loop on next URL
        # if previous page does not exist: stop while loop, go to next user
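A minimal sketch of how this pseudo code could look once filled in, assuming get_users() from before (which returns plain user-page URLs) and the previous-button lookup shown earlier; the actual per-page extraction is left as a placeholder:
import time

consumption_data = []
for url in get_users():
    while url is not None:
        res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        page = BeautifulSoup(res.text)
        # ... extract the consumption data you need from `page` here ...
        consumption_data.append({'url': url}) # placeholder: store at least the visited URL
        time.sleep(1) # be nice to the server
        # follow the "previous" button; stop when it no longer exists
        button = page.find(class_='page-link', attrs={'type': 'previous_page'})
        if button is not None and 'href' in button.attrs:
            url = 'https://music-to-scrape.org/' + button.attrs['href']
        else:
            url = None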
# pseudo code!
product_data = []
for s in seeds:
    # store raw html code here
    store_to_file(s, 'website.html')
    # continue w/ parsing some information
    product = get_product_info(s)
    product_data.append(product)
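For illustration, one hedged way to implement the "store raw HTML first" step; store_to_file is the hypothetical name from the pseudo code, and the seed list and file naming are assumptions:
import time

def store_to_file(url, filename):
    # save the raw HTML so you can re-parse it later without re-scraping
    res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(res.text)

seeds = ['https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'] # example seed
for i, s in enumerate(seeds):
    store_to_file(s, f'website_{i}.html') # one file per seed
    time.sleep(1)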
# pseudo code!
product_data = []
for s in seeds:
    product = get_product_info(s)
    product_updated = processing(product)
    time.sleep(1)
    product_data.append(product_updated)
    # write data to file here (parsing on the fly!!!)
# write data to file here (after the collection - avoid!)
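A hedged sketch of the "write on the fly" idea, appending one JSON line per product as soon as it has been parsed; get_product_info and processing remain hypothetical placeholders from the pseudo code:
import json, time

with open('products.json', 'a') as f: # append mode: the file grows while the scraper runs
    for s in seeds:
        product = get_product_info(s) # hypothetical parser
        product_updated = processing(product) # hypothetical post-processing
        f.write(json.dumps(product_updated))
        f.write('\n')
        time.sleep(1)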
So far, we have done this: requests (to get) –> beautifulsoup (to extract information, “parse”).
DO: Run the snippet below and open amazon.html
import requests
header = {'User-agent': 'Mozilla/5.0'}
f = open('amazon.html', 'w', encoding = 'utf-8')
f.write(requests.get('https://amazon.com', headers = header).text)
f.close()
Can you explain what happened?
Many dynamic sites require what I call “simulated browsing”
Try this:
!pip install webdriver_manager --upgrade
!pip install selenium --upgrade
# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
url = "https://music-to-scrape.org/"
driver.get(url)
Pass the page's source code (page_source) to BeautifulSoup, and proceed as always.
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source)
cards = soup.find_all(class_='card-body')
counter = 0
for card in cards:
    counter = counter + 1
    print('Card ' + str(counter) + ': ' + card.get_text())
import time

url = "https://music-to-scrape.org/"
driver.get(url)
time.sleep(3) # wait for 3 seconds
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
try:
    cookie_button = driver.find_element(By.ID, "accept-cookies")
    cookie_button.click()
except:
    print('No cookie button found (anymore)!')
scroll_pause_time = 2
for _ in range(3): # Scroll down 3 times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
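After scrolling, any newly loaded content is part of driver.page_source, so you can simply re-parse it; a short sketch reusing the card-body class from above:
# re-parse the (now possibly longer) page source after scrolling
soup = BeautifulSoup(driver.page_source)
print(len(soup.find_all(class_='card-body'))) # number of cards visible now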
Static websites: requests –> BeautifulSoup –> .find()
Dynamic websites: selenium –> wait –> BeautifulSoup –> .find()
Up next: selenium and APIs (see next tutorials)