Hannes Datta
We're about to start with today's tutorial (“web data for dummies”).
Use your smartphone to indicate your status.
Thanks.
Meet Windows! :)
Source & more details: Web Appendix of “Fields of Gold”
Example: Suppose you need to get… data on Spotify's streaming charts - where would you get it from?
Several ways exist to point at the information you need:
- Tags, e.g., <table>, <h1>, <div>
- IDs, e.g., <table id="example-table">
- Classes, e.g., <table class="striped-table">
- XPaths, e.g., /html/body/div[3]/section[1]/div/div[2]/h2
- CSS selectors, e.g., .artist_info_title
Frequently, you need to combine several of these methods to extract information.
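To see these methods side by side, here is a small sketch using the BeautifulSoup library; the HTML snippet is made up for illustration and only mimics the patterns above:

```python
from bs4 import BeautifulSoup

# made-up HTML combining the selection methods above:
# tags, ids, classes, and CSS selectors
html = """
<div>
  <h1 class="artist_info_title">Ya Boy</h1>
  <table id="example-table" class="striped-table">
    <tr><td>Song A</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').get_text())                        # by tag
print(soup.find(id='example-table').name)                # by id
print(soup.find(class_='striped-table').name)            # by class
print(soup.select_one('.artist_info_title').get_text())  # by CSS selector
```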
Libraries we use:
- requests: downloads data from the web or transmits data ("headless")
- BeautifulSoup: structures HTML data so we can query it
- selenium + chromedriver: simulates a browser (Chrome); can scroll, click, and view the site, but can also run headless; also structures HTML so we can query it
- json: structures and queries JSON data

Getting the data:
- requests + BeautifulSoup (speed + ease!)
- selenium (retrieve data) + BeautifulSoup (structure/query)
- requests

Storing the data:
- json (preferred) or CSV files (flat files with rows & columns)

Today: requests + BeautifulSoup
A few notes on my code:
import requests # let's load the requests library
# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser we (claim to) use
web_request = requests.get(url, headers = header)
# return the source code from the request object
web_request_source_code = web_request.text
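Before using the response, it can help to check whether the request actually succeeded. A minimal sketch, where fetch_html is a hypothetical helper (not part of the tutorial's code):

```python
import requests

def fetch_html(url):
    # hypothetical helper: return the page source on success, None otherwise
    header = {'User-agent': 'Mozilla/5.0'}
    web_request = requests.get(url, headers=header)
    # status code 200 means the server returned the page successfully;
    # values like 404 (not found) or 503 (unavailable) signal a problem
    if web_request.status_code == 200:
        return web_request.text
    return None
```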
import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'}
web_request = requests.get(url, headers = header)
soup = BeautifulSoup(web_request.text, 'html.parser')
print(soup.find('h2').get_text())
Ya Boy
To select elements by their class, use .find(class_ = 'classname'):
print(soup.find(class_ = 'about_artist').get_text())
Location:
United States
Number of plays:
11
Next, let us refine our collection by getting the artist's location (stored in location), the exact number of plays (stored in plays), and the total number of songs in the top 10.

Tips:
- Use .find(class_='class-name') for classes
- Use .find_all() and len() for counting

location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
song_table = soup.find(class_ = 'top-songs')
number_of_songs = len(song_table.find_all('tr'))-1
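To see why we subtract 1, here is the same counting logic on a small, made-up top-songs table (the real page's markup may differ): the first row is the header, not a song.

```python
from bs4 import BeautifulSoup

# made-up table in the style of the top-songs section
html = """
<table class="top-songs">
  <tr><th>Song</th><th>Plays</th></tr>
  <tr><td>Song A</td><td>5</td></tr>
  <tr><td>Song B</td><td>3</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
song_table = soup.find(class_='top-songs')
# three <tr> rows minus the header row = two songs
number_of_songs = len(song_table.find_all('tr')) - 1
print(number_of_songs)  # 2
```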
Functions are defined using def, followed by the function name and its arguments:

def FUNCTIONNAME(argument1, argument2)

For example:

def download_data(url)

This function is called download_data, and it requires url as input.

import requests
from bs4 import BeautifulSoup
def download_data(url):
    header = {'User-agent': 'Mozilla/5.0'}
    web_request = requests.get(url, headers = header)
    soup = BeautifulSoup(web_request.text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    print(f'Artist name: {artist_name}.')
# execute the function
download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
Tips:
- Print your progress in each iteration, e.g., print(f'Done retrieving {url}')
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
def download_data(url):
    header = {'User-agent': 'Mozilla/5.0'}
    web_request = requests.get(url, headers = header)
    soup = BeautifulSoup(web_request.text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
    plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
    print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for id in artist_ids:
    download_data(f'https://music-to-scrape.org/artist?artist-id={id}')
Artist name: Ya Boy from United States with 11 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 8 song plays.
Artist name: Cabas from with 91 song plays.
DO: Start from an empty dictionary (obj = {}) and change the code so it stores artist name, location, and plays in the JSON object.
def download_data(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
    plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
    out_data = {'artist': 'artist name',
                'location': 'store location here',
                'plays': 'store plays here'}
    return(out_data)
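For reference, one way the completed function could look once the placeholder strings are replaced by the scraped variables (it is only defined here, not run, since calling it needs a live connection to the site):

```python
import requests
from bs4 import BeautifulSoup

def download_data(url):
    # parse the page, then store the scraped values in a dictionary
    header = {'User-agent': 'Mozilla/5.0'}
    soup = BeautifulSoup(requests.get(url, headers=header).text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    location = soup.find(class_='about_artist').find_all('p')[0].get_text()
    plays = soup.find(class_='about_artist').find_all('p')[1].get_text()
    out_data = {'artist': artist_name,
                'location': location,
                'plays': plays}
    return(out_data)
```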
The final step is to save JSON data in a file.
We can do this with the json.dumps function from the json library.
import json
out_data = {'artist': 'artist name',
'location': 'store location here',
'plays': 'store plays here'}
# convert dict to "string" that we can save
to_json = json.dumps(out_data)
f=open('filename.json','a')
f.write(to_json+'\n')
f.close()
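To verify what ended up in the file, you can read it back line by line with json.loads, the inverse of json.dumps. This sketch writes two made-up records first so it runs on its own:

```python
import json

# two made-up records, written in the same newline-delimited style as above
records = [{'artist': 'Ya Boy', 'plays': 11}, {'artist': 'Milo', 'plays': 8}]

f = open('filename.json', 'w')  # 'w' overwrites; the slide used 'a' (append)
for rec in records:
    f.write(json.dumps(rec) + '\n')
f.close()

# read the file back: one json.loads() call per line
loaded = [json.loads(line) for line in open('filename.json')]
print(loaded[1]['artist'])  # Milo
```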
For details/explanation, see Guyt et al. 2024.
Pretty much all APIs have documentation; see here for ours.
Let's explore a first endpoint in our browser: https://api.music-to-scrape.org/artists/featured
DO: Describe what you see; what does that mean? How does the data link to other sections of the website? Do you “get” the logic?
import requests
con = requests.get('https://api.music-to-scrape.org/artists/featured')
# convert to json
obj = con.json()
obj
{'artists': [{'artist': 'Fred Merpol', 'artist_id': 'ARKIQSL1241B9C90C8'}, {'artist': 'Off Broadway', 'artist_id': 'AR4IYQR1187B98F8F3'}, {'artist': 'A Challenge Of Honour', 'artist_id': 'ARL1QL91187B994B08'}, {'artist': 'Ya Boy', 'artist_id': 'ARICCN811C8A41750F'}, {'artist': 'Milo', 'artist_id': 'ARJ8ZIQ1187FB3FB5A'}]}
With the resulting object, you can:
- Inspect it (obj)
- Rename it (obj → anothername)
- Access attributes (obj['artists'] or obj.get('artists'))
- Access list elements (obj['artists'][0]): multiple objects are in that list!

Remember loops from last week's bootcamp? We can "iterate" through result objects.
for i in obj['artists']:
    print(i.get('artist'))
Fred Merpol
Off Broadway
A Challenge Of Honour
Ya Boy
Milo
DO: Can you also print out the artist IDs?
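One possible answer; to keep it runnable without a network connection, this sketch reuses a hard-coded excerpt of the featured-artists response shown above:

```python
# hard-coded sample of the /artists/featured response (from the slide above)
obj = {'artists': [
    {'artist': 'Fred Merpol', 'artist_id': 'ARKIQSL1241B9C90C8'},
    {'artist': 'Off Broadway', 'artist_id': 'AR4IYQR1187B98F8F3'},
]}

# same loop as before, but printing the ID instead of the name
for i in obj['artists']:
    print(i.get('artist_id'))
```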
DO: Wrap the API call in a function called getdata().
Starting code:
def getdata():
    # YOUR CODE HERE
    # return some data
    # return()
import requests
def getdata():
    con = requests.get('https://api.music-to-scrape.org/artists/featured')
    obj = con.json()
    return(obj)
Let's call the function!
{'artists': [{'artist': 'Grant Geissman', 'artist_id': 'ARIN12F1187FB3E92C'}, {'artist': 'Fred Merpol', 'artist_id': 'ARKIQSL1241B9C90C8'}, {'artist': 'Three-6 Mafia', 'artist_id': 'ARY55LO1187B9A3F17'}, {'artist': 'The Honeydogs', 'artist_id': 'ARSWORN1187B991A7B'}, {'artist': 'Terry Muska', 'artist_id': 'ARMORUX11F50C4EEBF'}]}
1) This makes your computer sleep for 5 seconds
import time
time.sleep(5) # sleeps 5 seconds
2) This makes your computer go on forever…
while True:
    # command here
3) This makes your computer repeat a command a fixed number of times…
counter = 0
while counter < 5:
    counter = counter + 1
    print("Hello")
DO: Please execute your data collection every second, at max. 5 times.
import time
i = 0
while i < 5:
    print(getdata())
    time.sleep(1)
    i = i + 1
Two options to store your data:
- Store all results in a list ([...]), then write them to a file at the end of your script
- Write each result to a file right away (f = open(), f.write(), f.close()) ("parsing on the fly")

DO: Get some artist meta data for an artist of your choice!
1.) Find an artist ID on the site or in previous code
2.) Try to make a web request (in your browser) to the following URL:
https://api.music-to-scrape.org/artist/info
This endpoint requires an extra parameter, artistid:
https://api.music-to-scrape.org/artist/info?artistid={ENTER ARTIST ID HERE}
3.) Does it work? Then write Python code to retrieve the data.
4.) Finally, wrap your code in a function so you can later retrieve data for multiple artists.
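Combining steps 3.) and 4.) with the "parsing on the fly" storage option could look roughly like this; get_artist_info is a hypothetical stub standing in for your own retrieval code:

```python
import json

def get_artist_info(artist_id):
    # stand-in for a real API call (hypothetical); in your own code, this
    # would request https://api.music-to-scrape.org/artist/info?artistid=...
    return {'artist_id': artist_id, 'name': 'some artist'}

artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD']

f = open('artist_info.json', 'a')  # append mode: "parsing on the fly"
for artist_id in artist_ids:
    record = get_artist_info(artist_id)
    f.write(json.dumps(record) + '\n')  # one JSON object per line
f.close()
```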
Please work on exercise 2.4 (see tutorial).