Hannes Datta
We're about to start with today's tutorial (“web scraping 101”).
Today: zooming in more on collection design
The .find() and .find_all() functions.
DO: Extend the code snippet below to extract the release year of the song.
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/song?song-id=SOJZTXJ12AB01845FB'
request = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
song_page = BeautifulSoup(request.text)

about = song_page.find(class_='about_artist')
title = about.find('p')   # .find() returns only the first matching element
about.find_all('p')       # .find_all() returns a list of all matches
[<p>China</p>, <p>Tito Puente</p>, <p>1972</p>, <p>193</p>]
plays = about.find_all('p')[3].get_text()  # fourth <p>: the number of plays
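Judging from the output above, the release year sits in the third <p> element (index 2), so the DO could be solved along these lines:

year = about.find_all('p')[2].get_text()  # third <p>: the release year ('1972')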
Recommendations:
Remaining challenges for the “design” phase of collecting data.
DO: Using your own project ideas, tell us how you could sample from the site (see Table 3 in "Fields of Gold").
For today's tutorial, we will be sampling users from the main website of music-to-scrape.org
Can you propose a strategy to capture them? Any attributes/classes you see?
Please use the code snippet below to extract the 'href' attribute of all links on the website.
url = 'https://music-to-scrape.org'
res = requests.get(url)
res.encoding = res.apparent_encoding
homepage = BeautifulSoup(res.text)

# continue here: print the href attribute of each link
for link in homepage.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs["href"])
/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
https://odcm.hannesdatta.com
https://doi.org/10.1016/j.jretai.2024.02.002
https://web-scraping.org
https://github.com/tilburgsciencehub/music-to-scrape
https://tilburgsciencehub.com
/about
tutorial_scraping
song?song-id=SONNUHH12A8C133376
song?song-id=SOLZZPQ12A8C130C70
song?song-id=SOYQOFI12A6D4F76E1
song?song-id=SOJSJTR12AB0188131
song?song-id=SOXCCHE12AAFF43D5D
song?song-id=SOGCFKE12AB01843C0
song?song-id=SOIQYOV12A8C143ED6
song?song-id=SOBYXQS12A8C13FAF6
song?song-id=SOSEJIQ12AB017E7EB
song?song-id=SOBUHQT12AB0186C07
song?song-id=SOJZTXJ12AB01845FB
song?song-id=SOPZASO12A6D4F6A79
song?song-id=SOODIJF12A8C13FDBB
song?song-id=SOGXHEG12AB018653E
song?song-id=SOQXGVE12CF5F86D20
song?song-id=SOLNOMM12A8C132AD8
song?song-id=SOBVAPJ12AB018739D
song?song-id=SOYYBGR12A8C140F1A
song?song-id=SOBHHIZ12AB01841E0
song?song-id=SOQUEQP12A8C1397DE
song?song-id=SOWHXYS12AC9E177F0
song?song-id=SOAOXXE12AB0182517
song?song-id=SOACUYZ12CF54662F1
song?song-id=SOJLUMZ12AB0186CA1
song?song-id=SOGXWRE12AC468BE24
artist?artist-id=ARLYPKH1241B9C7185
artist?artist-id=ARKIQSL1241B9C90C8
artist?artist-id=AR0U44O1187B99007C
artist?artist-id=ARQ76LG1187B9ACD84
artist?artist-id=ARN7OQ21187FB5A6B3
artist?artist-id=AR00A6H1187FB5402A
artist?artist-id=ARMBTFC1187FB56343
artist?artist-id=ARQQFIQ1187B99DB43
user?username=Shadow49
user?username=Star47
user?username=Coder43
user?username=SonicShadow61
user?username=CoderGeek94
user?username=StealthWizard02
/tutorial_scraping
/tutorial_api
https://api.music-to-scrape.org/docs
/about
/privacy_terms
https://www.linkedin.com/company/tilburgsciencehub
https://github.com/tilburgsciencehub/music-to-scrape
https://twitter.com/tilburgscience
The users are listed in the recent_users section, starting with <section name="recent_users">:

relevant_section = homepage.find('section', attrs={'name': 'recent_users'})
DO: Can you come up with a way to loop through all of the links WITHIN relevant_section and store them?
users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        users.append(link.attrs['href'])
users
['user?username=Shadow49', 'user?username=Star47', 'user?username=Coder43', 'user?username=SonicShadow61', 'user?username=CoderGeek94', 'user?username=StealthWizard02']
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls
['https://music-to-scrape.org/user?username=Shadow49', 'https://music-to-scrape.org/user?username=Star47', 'https://music-to-scrape.org/user?username=Coder43', 'https://music-to-scrape.org/user?username=SonicShadow61', 'https://music-to-scrape.org/user?username=CoderGeek94', 'https://music-to-scrape.org/user?username=StealthWizard02']
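Prefixing the base URL by hand works here because all extracted links are relative. A more robust alternative (my suggestion, not part of the original snippet) is urljoin from the standard library, which leaves absolute links untouched:

from urllib.parse import urljoin

urls = [urljoin('https://music-to-scrape.org/', link.attrs['href'])
        for link in relevant_section.find_all('a') if 'href' in link.attrs]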
DO: Write a function that retrieves all user links from the homepage and hands them back via return(). Do you remember why we like functions so much?
import requests
from bs4 import BeautifulSoup
def get_users():
    url = 'https://music-to-scrape.org/'
    res = requests.get(url)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    relevant_section = soup.find('section', attrs={'name': 'recent_users'})
    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs:
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links)  # to return all links

# let's try it out
users = get_users()
json.dumps to convert JSON to text (and then save it); json.loads to convert text to JSON (and then use it).

import json
users = get_users()

# build a JSON object per user and write it to users.json, one object per line
f = open('users.json', 'w')
for user in users:
    obj = {'url': user,
           'username': user.replace('https://music-to-scrape.org/user?username=', '')}
    f.write(json.dumps(obj))
    f.write('\n')
f.close()
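To illustrate the json.loads direction mentioned above, a minimal sketch for reading the file back in (one JSON object per line):

import json

users_back = []
with open('users.json') as infile:
    for line in infile:
        users_back.append(json.loads(line))  # text -> dictionary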
links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs:
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())
len(links)
6
len(songs)
4
DO: How can you capture the song data such that we can always link a particular song to a particular user (i.e., prevent array misalignment)?
Tip: Observe how we can first iterate through users, THEN extract links (similar to how we have done it with relevant_section).
users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:  # iterate through each user
    obj = {'link': user.find('a').attrs['href']}
    data.append(obj)
data
[{'link': 'user?username=Shadow49'}, {'link': 'user?username=Star47'}, {'link': 'user?username=Coder43'}, {'link': 'user?username=SonicShadow61'}, {'link': 'user?username=CoderGeek94'}, {'link': 'user?username=StealthWizard02'}]
users = relevant_section.find_all(class_='mobile-user-margin')
data = []
for user in users:
    if user.find('span') is not None:
        song_name = user.find('span').get_text()
    else:
        song_name = 'NA'
    obj = {'link': user.find('a').attrs['href'],
           'song_name': song_name}
    data.append(obj)
data
[{'link': 'user?username=Shadow49', 'song_name': 'John Stewart - Price Of The Fire'}, {'link': 'user?username=Star47', 'song_name': 'NA'}, {'link': 'user?username=Coder43', 'song_name': 'Francis Dunnery - Too Much Saturn'}, {'link': 'user?username=SonicShadow61', 'song_name': 'NA'}, {'link': 'user?username=CoderGeek94', 'song_name': 'Lonnie Johnson - Raining On The Cold_ Cold Ground'}, {'link': 'user?username=StealthWizard02', 'song_name': 'Pitch Black - Harmonia'}]
What we've learnt so far…
What's missing
We can then tie things together in a scraper
urls = []
counter = 37
while counter >= 0:
    urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
    counter = counter - 1
urls
['https://music-to-scrape.org/user?username=StarCoder49&week=37', 'https://music-to-scrape.org/user?username=StarCoder49&week=36', 'https://music-to-scrape.org/user?username=StarCoder49&week=35', 'https://music-to-scrape.org/user?username=StarCoder49&week=34', 'https://music-to-scrape.org/user?username=StarCoder49&week=33', 'https://music-to-scrape.org/user?username=StarCoder49&week=32', 'https://music-to-scrape.org/user?username=StarCoder49&week=31', 'https://music-to-scrape.org/user?username=StarCoder49&week=30', 'https://music-to-scrape.org/user?username=StarCoder49&week=29', 'https://music-to-scrape.org/user?username=StarCoder49&week=28', 'https://music-to-scrape.org/user?username=StarCoder49&week=27', 'https://music-to-scrape.org/user?username=StarCoder49&week=26', 'https://music-to-scrape.org/user?username=StarCoder49&week=25', 'https://music-to-scrape.org/user?username=StarCoder49&week=24', 'https://music-to-scrape.org/user?username=StarCoder49&week=23', 'https://music-to-scrape.org/user?username=StarCoder49&week=22', 'https://music-to-scrape.org/user?username=StarCoder49&week=21', 'https://music-to-scrape.org/user?username=StarCoder49&week=20', 'https://music-to-scrape.org/user?username=StarCoder49&week=19', 'https://music-to-scrape.org/user?username=StarCoder49&week=18', 'https://music-to-scrape.org/user?username=StarCoder49&week=17', 'https://music-to-scrape.org/user?username=StarCoder49&week=16', 'https://music-to-scrape.org/user?username=StarCoder49&week=15', 'https://music-to-scrape.org/user?username=StarCoder49&week=14', 'https://music-to-scrape.org/user?username=StarCoder49&week=13', 'https://music-to-scrape.org/user?username=StarCoder49&week=12', 'https://music-to-scrape.org/user?username=StarCoder49&week=11', 'https://music-to-scrape.org/user?username=StarCoder49&week=10', 'https://music-to-scrape.org/user?username=StarCoder49&week=9', 'https://music-to-scrape.org/user?username=StarCoder49&week=8', 'https://music-to-scrape.org/user?username=StarCoder49&week=7', 'https://music-to-scrape.org/user?username=StarCoder49&week=6', 'https://music-to-scrape.org/user?username=StarCoder49&week=5', 'https://music-to-scrape.org/user?username=StarCoder49&week=4', 'https://music-to-scrape.org/user?username=StarCoder49&week=3', 'https://music-to-scrape.org/user?username=StarCoder49&week=2', 'https://music-to-scrape.org/user?username=StarCoder49&week=1', 'https://music-to-scrape.org/user?username=StarCoder49&week=0']
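A more compact alternative (my suggestion, not part of the original slide) builds the same list with a comprehension over range():

urls = [f'https://music-to-scrape.org/user?username=StarCoder49&week={week}'
        for week in range(37, -1, -1)]  # counts 37, 36, ..., 0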
DO: Run the code below. How would you extend it to extract the link of the previous button?
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
userpage = BeautifulSoup(res.text)
button=userpage.find(class_='page-link', attrs={'type':'previous_page'})
button.attrs['href']
'user?username=StarCoder49&week=35'
DISCUSS: When do we use strategy 1 (pre-built) vs. strategy 2 (do it on the fly)?
We can reuse get_users() to retrieve the user names:

users = get_users()
# pseudo code!
consumption_data = []
for user in users:
    url = user['url']
    while url is not None:
        # scrape information from URL            (challenge #2.1, #2.4)
        # determine "previous page"
        time.sleep(1)                            # challenge #2.3
        # if previous page exists: rerun loop with the next URL
        # if previous page does not exist: stop the while loop, go to the next user
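A concrete version of this pagination loop, assuming every week page carries the same 'previous' button we located above (the function name and the raw-page storage are my own sketch, not the original solution):

import time

import requests
from bs4 import BeautifulSoup

def scrape_user(start_url):
    # follow the 'previous' button until it disappears
    pages = []
    url = start_url
    while url is not None:
        res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(res.text)
        pages.append(res.text)  # keep the raw page; parse later or on the fly
        button = soup.find(class_='page-link', attrs={'type': 'previous_page'})
        if button is not None and 'href' in button.attrs:
            url = f"https://music-to-scrape.org/{button.attrs['href']}"
        else:
            url = None  # no previous page: stop
        time.sleep(1)   # be nice to the server
    return pages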
# pseudo code!
product_data = []
for s in seeds:
    # store raw html code here
    store_to_file(s, 'website.html')
    # continue w/ parsing some information
    product = get_product_info(s)
    product_data.append(product)
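One way to fill in store_to_file from the pseudocode above; note that this version takes the raw HTML rather than the seed, and the timestamped file name is my own design assumption:

from datetime import datetime

def store_to_file(html, prefix):
    # HYPOTHETICAL helper: timestamped name so earlier snapshots are never overwritten
    filename = f"{prefix}_{datetime.now().strftime('%Y%m%d-%H%M%S')}.html"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)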
# pseudo code!
product_data = []
for s in seeds:
    product = get_product_info(s)
    product_updated = processing(product)
    time.sleep(1)
    product_data.append(product_updated)
    # write data to file here (parsing on the fly!!!)
# write data to file here (after the collection - avoid!)
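A minimal sketch of the parse-on-the-fly variant with JSON lines (seeds and get_product_info are placeholders carried over from the pseudocode above):

import json
import time

with open('products.json', 'a') as outfile:  # 'a' appends, so a crash loses at most one record
    for s in seeds:
        product = get_product_info(s)        # placeholder from the pseudocode
        outfile.write(json.dumps(product))
        outfile.write('\n')                  # one JSON object per line
        time.sleep(1)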
We still need to extract the consumption data from a user's profile page.
Let's start with the snippet below.
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=10'
soup = BeautifulSoup(requests.get(url).text)
Can you propose a way to extract the consumption data from this page?
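One possible starting point, under the unverified assumption that the listening history is rendered as an HTML table; inspect the page to confirm the actual tags and classes before relying on this:

# ASSUMPTION: consumption data may sit in <table> rows -- verify in the page source
plays = []
for row in soup.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:                # skip header rows without <td> cells
        plays.append(cells)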
So far, we have done this: requests (to get) -> BeautifulSoup (to extract information, "parse").
DO: Run the snippet below and open amazon.html.
import requests
header = {'User-agent': 'Mozilla/5.0'}
f = open('amazon.html', 'w', encoding='utf-8')
f.write(requests.get('https://amazon.com', headers=header).text)
f.close()
Can you explain what happened?
Many dynamic sites require what I call “simulated browsing”
Try this:
!pip install webdriver_manager
!pip install selenium
# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
url = "https://amazon.com/" # remember to solve captchas
driver.get(url)
Then, pass the page source (driver.page_source) to BeautifulSoup, and proceed as always:

from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source)
for el in soup.find_all(class_='a-section'):
    title = el.find('h2')
    if title is not None:
        print(title.text)
Summary of strategies:
- requests -> BeautifulSoup -> .find()
- selenium -> wait -> BeautifulSoup -> .find()
- selenium and APIs (see next tutorials)

DO (after class): Use the code snippets provided on Selenium and BeautifulSoup to collect a list of at least 1,000 products from any of the product pages of Bol.com.
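A starting skeleton for this exercise; the URL pattern and the title selector below are unverified assumptions about Bol.com's page structure, so inspect the live page and adjust (and mind the site's terms of use):

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

products = []
page = 1
while len(products) < 1000:
    # HYPOTHETICAL url pattern -- check how Bol.com paginates its category pages
    driver.get(f'https://www.bol.com/nl/nl/l/laptops/4770/?page={page}')
    time.sleep(2)  # give dynamic content a moment to load
    soup = BeautifulSoup(driver.page_source)
    # HYPOTHETICAL selector -- inspect the page to find the real product-title tag
    titles = [el.get_text(strip=True) for el in soup.find_all('h2')]
    if not titles:
        break  # nothing found on this page: stop rather than loop forever
    products.extend(titles)
    page += 1

print(len(products))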