Hannes Datta
We're about to start with today's tutorial (“web data for dummies”).
Use your smartphone to indicate your status.
Thanks.
Source & more details: Web Appendix of “Fields of Gold”
Example: Suppose you need to get… data on Spotify's streaming charts - where would you get it from?
Tags, such as <table>, <h1>, <div>
Attributes, such as:
<table id="example-table">
<table class="striped-table">
<table some_field_name="123">
Frequently, you need to combine several of these methods to extract information.
/html/body/div[3]/section[1]/div/div[2]/h2
.artist_info_title
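As a minimal illustration of the class selector above, here is a sketch of querying a snippet of HTML with BeautifulSoup (the HTML is made up for the example; note that BeautifulSoup itself evaluates CSS selectors but not XPath expressions — for XPath you would need lxml or selenium):

```python
from bs4 import BeautifulSoup

# a made-up HTML snippet that mimics the class name from the selector above
html = '<div class="artist"><h2 class="artist_info_title">Ya Boy</h2></div>'
soup = BeautifulSoup(html, 'html.parser')

# .select_one() takes a CSS selector; '.artist_info_title' matches by class
title = soup.select_one('.artist_info_title').get_text()
print(title)
```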
requests: downloads data from the web or transmits data ("headless")
BeautifulSoup: structures HTML data so we can query it
selenium + chromedriver: simulates a browser (Chrome); can scroll, click, and view the site, but can also be headless; also structures HTML so we can query it
json: structures and queries JSON data
Typical combinations:
requests + BeautifulSoup (speed + ease!)
selenium (retrieve data) + BeautifulSoup (structure/query)
requests + json
Storing data: JSON (preferred) or CSV files (flat files with rows & columns)
Today: requests + BeautifulSoup
import requests # let's load the requests library
# make a get request to the website
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'} # the user-agent header tells the website which browser we (claim to) use
web_request = requests.get(url, headers = header)
# return the source code from the request object
web_request_source_code = web_request.text
import requests
from bs4 import BeautifulSoup
url = 'https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F'
header = {'User-agent': 'Mozilla/5.0'}
web_request = requests.get(url, headers = header)
soup = BeautifulSoup(web_request.text, 'html.parser')
print(soup.find('h2').get_text())
Ya Boy
.find(class_ = 'classname')
print(soup.find(class_ = 'about_artist').get_text())
Location:
United States
Number of plays:
36
Next, let us refine our collection by getting… the artist's location (–> stored in location), the exact number of plays (–> stored in plays), and the total number of songs in the top 10.
Tips:
Use .find(class_='class-name') for classes, .find_all() for lists of matches, and len() for counting.
location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
song_table = soup.find(class_ = 'top-songs')
number_of_songs = len(song_table.find_all('tr'))-1
Functions start with def, followed by the function name (def functionname) and its arguments (def FUNCTIONNAME(argument1, argument2)).
Example: def download_data(url)
The function is called download_data, and it requires url as input.
import requests
from bs4 import BeautifulSoup
def download_data(url):
    header = {'User-agent': 'Mozilla/5.0'}
    web_request = requests.get(url, headers=header)
    soup = BeautifulSoup(web_request.text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    print(f'Artist name: {artist_name}.')

# execute the function
download_data('https://music-to-scrape.org/artist?artist-id=ARICCN811C8A41750F')
Artist name: Ya Boy.
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
Tips:
print(f'Done retrieving {url}')
artist_ids = ['ARICCN811C8A41750F', 'AR1GW0U1187B9B29FD', 'ARZ3U0K1187B999BF4']
def download_data(url):
    header = {'User-agent': 'Mozilla/5.0'}
    web_request = requests.get(url, headers=header)
    soup = BeautifulSoup(web_request.text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
    plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
    print(f'Artist name: {artist_name} from {location} with {plays} song plays.')

for id in artist_ids:
    download_data(f'https://music-to-scrape.org/artist?artist-id={id}')
Artist name: Ya Boy from United States with 36 song plays.
Artist name: Prince With 94 East from Minneapolis MN with 42 song plays.
Artist name: Cabas from with 352 song plays.
Avoid that the output file (data.json) is ONE giant JSON object.
Change the code so it stores artist name, location and plays in the JSON object.
def download_data(url):
    header = {'User-agent': 'Mozilla/5.0'}
    soup = BeautifulSoup(requests.get(url, headers=header).text, 'html.parser')
    artist_name = soup.find('h2').get_text()
    location = soup.find(class_ = 'about_artist').find_all('p')[0].get_text()
    plays = soup.find(class_ = 'about_artist').find_all('p')[1].get_text()
    out_data = {'artist': 'artist name',
                'location': 'store location here',
                'plays': 'store plays here'}
    return out_data
The final step is to save JSON data in a file.
We can do this with the JSON library.
import json
out_data = {'artist': 'artist name',
'location': 'store location here',
'plays': 'store plays here'}
to_json = json.dumps(out_data)
with open('filename.json', 'a') as f:
    f.write(to_json + '\n')
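Because each record is written on its own line, the file ends up in "JSON lines" format: to read it back, parse each line separately rather than the whole file at once. A small sketch, reusing the filename and placeholder record from the example above:

```python
import json

record = {'artist': 'artist name',
          'location': 'store location here',
          'plays': 'store plays here'}

# append one JSON object per line, as in the snippet above
with open('filename.json', 'a') as f:
    f.write(json.dumps(record) + '\n')

# read the file back: json.loads() on each line, not on the whole file
with open('filename.json') as f:
    records = [json.loads(line) for line in f]

print(records[-1]['artist'])
```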
Let's view the documentation
Let's explore a first endpoint: https://api.music-to-scrape.org/artists/featured
–> we can use the browser to retrieve data from (simple) APIs
Extensions:
Get some artist meta data for an artist of your choice!
1.) Find an artist ID on the site or in previous code
2.) Try to make a web request (in your browser) to the following URL:
https://api.music-to-scrape.org/artist/info
This endpoint requires the parameter artistid:
https://api.music-to-scrape.org/artist/info?artistid={ENTER ARTIST ID HERE}
3.) Does it work? Then write Python code to retrieve the data.
4.) Finally, wrap your code in a function so you can later retrieve data for multiple artists.
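One way such a function could look is sketched below. The endpoint and the artistid parameter are taken from the URL above; the exact fields in the response are not shown here, so treat the return value as a plain parsed-JSON dictionary:

```python
import requests

def build_info_url(artist_id):
    # build the endpoint URL with the artistid query parameter
    return f'https://api.music-to-scrape.org/artist/info?artistid={artist_id}'

def get_artist_info(artist_id):
    # request the endpoint and parse the JSON response
    response = requests.get(build_info_url(artist_id))
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()
```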
Please work on exercise 2.4 (see tutorial).