oDCM - APIs 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“APIs 101”).

Before we start (I)

  • Coaching sessions
    • we try to divide attention/time 'equally' across teams; but: help us, this is also your responsibility
    • it is normal to face complicated issues that cannot be solved on the spot (not even by myself). Please address these issues to us in writing (email: Roshini; WhatsApp: Hannes).
  • Team projects
    • updated material on 'web scraping advanced' (with Selenium); it is now easier and more targeted to follow
    • presentations during next week's class; both Roshini and I will be present

Before we start (II)

  • Extra Q&A session upon your request for oDCM & dPrep
    • Monday, 23 September, 10.45-12.30 (CubeZ 219).
  • Next week: tutorials take place in computer labs (GZ 106, see schedule)

  • Any questions at this stage?

What are APIs?

  • APIs are connectors within and between platforms and databases that use the web to talk to each other
    • backend to frontend: e.g., database <-> Reddit API <-> website
    • backend to backend: e.g., Twitter <-> Twitter API <-> some trading software
  • Initially made available for developers, later for researchers (e.g., OpenAI's API)
  • Available usually only after registration and approval

Technical differences between APIs and web scraping

  • Web scraping
    • not always 100% clear whether allowed
    • have to structure (unstructured) HTML code
  • API
    • good if you have permission (but, be realistic about getting access!)
    • you obtain structured JSON data
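
To see the difference in practice, a small sketch contrasting the two (the scraping half is illustrative only; the point is that the structuring step disappears once you use an API):

import requests
from bs4 import BeautifulSoup

# web scraping: you receive raw HTML and must structure it yourself
html = requests.get("https://music-to-scrape.org").text
soup = BeautifulSoup(html, 'html.parser') # parse first, then hunt for the right tags

# API: you receive structured JSON, ready to use
data = requests.get("https://api.music-to-scrape.org/users").json()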

Discovering an API's documentation

  • Probably the most difficult part of working with APIs is getting authenticated
    • recall: you need the platform's permission (the API we use today, though, doesn't require authentication)
  • Find an endpoint (comparable to a page when scraping)
  • Let's explore the documentation of… OpenAI!
    • which endpoints are available?
    • what could you do with them?

Do: Zooming in on Music-to-scrape's API

Browse the documentation of music-to-scrape's API.

1) Have a look at the various endpoints that are available. At what unit of analysis do they allow you to collect data?

2) Why would a platform offer an API in the first place?

Do: Making a connection with an API

Run the snippet below.

import requests
url = "https://api.music-to-scrape.org/users" # the API endpoint that lists users
response = requests.get(url) # send a GET request to the endpoint
request = response.json() # parse the JSON response into a Python dictionary
print(request)
  1. Can you explain what happens when you run this cell?
  2. Try adding the limit parameter (?limit=20) and run again. What happens?
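
For question 2, a minimal sketch; the limit parameter caps the number of users the endpoint returns:

import requests
# append the limit parameter to the URL's query string
url = "https://api.music-to-scrape.org/users?limit=20"
request = requests.get(url).json()
print(request) # should now contain up to 20 users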

How can we make sense of JSON data?

  • JSON data is structured in a hierarchy
    • use jsonviewer.stack.hu to make sense of it
    • alternatively, write to .json file and view in VS Code (install JSON plugin first)
    • many other possibilities exist (e.g., in-browser) - check what you like best!
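
For a quick impression directly in Python, a minimal sketch using the built-in json module (reusing the request dictionary from before):

import json
# indent=2 renders the JSON hierarchy with visual nesting
print(json.dumps(request, indent=2))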

Do: Run this snippet to write JSON output to a file! Inspect it using JSONviewer or VS Code!

import json
f = open('users.json', 'w', encoding='utf-8') # open (or create) the output file
f.write(json.dumps(request)) # serialize the dictionary to a JSON string and write it
f.close() # close the file so the content is flushed to disk

Getting data from a JSON dictionary

  • Let's try to only extract a list of user names, rather than the entire dictionary.
  • For this, we need to know how to access nodes of a dictionary.
    • ['name_of_attribute'] or
    • .get('name_of_attribute')
  • Do: Let's try it out
request['limit']

# or: 

request.get('limit')
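
The two access methods differ when an attribute does not exist; a quick sketch (the key 'does_not_exist' is made up for illustration):

request.get('does_not_exist') # returns None, no error
# request['does_not_exist'] # would raise a KeyError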

Do: Getting user names

  1. How would you have to modify the snippet below to access the list of user names?
  2. Can you write a loop that extracts user names and puts them into an array of only user names (not countries, etc.)?
request['limit'] # this one worked for extracting the value for limit...

Solution

#Q1: 
request['data']

#Q2:
user_names = [] # initialize empty array
for user in request['data']: # start loop
  user_names.append(user['username']) # append user name to the list of user_names (initialized above)
user_names # inspect the result

Do: Anonymizing data

Suppose we do not want to extract the ages and country names for the users…

Can you come up with a way to anonymize them (e.g., overwrite them with NA), while keeping the rest of the dictionary intact?

Solution:

new_dic = []

for user in request['data']:
  obj = user.copy() # copy, so the original dictionary stays intact
  obj['age'] = 'NA' # overwrite age
  obj['country'] = 'NA' # overwrite country (assuming the field is named 'country' in the API output)
  new_dic.append(obj)
new_dic

Difference between `.append()` and `.extend()`

  • .append() adds one item at a time to an existing list
users = [] # empty list
users.append('another user')
users.append('yet another user')
users
  • .extend() adds multiple items to an existing list
users = []
new_users = ['another user','yet another user']
users.extend(new_users)
users
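
A common pitfall: passing a list to .append() adds it as one nested item, not as separate elements:

users = []
new_users = ['another user', 'yet another user']
users.append(new_users)
users # [['another user', 'yet another user']] - one nested list, probably not what you want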

Understanding parameters in API calls

  • APIs are useful because you can modify their results
    • say, you want to see the next batch of users (when listing users), or
    • search for a particular thing (when searching via the API)
  • Many ways to “submit” parameters to an API
    • directly in URL, e.g., myapi.com/search/cats_and_dogs [not the case for music to scrape]
    • w/ parameters in URL, e.g., myapi.com/search/?query=cats_and_dogs
    • in the header of the request (think of it like the address [=header] on an envelope [=data])

The API documentation will tell you what is required!
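
A minimal sketch of the header route (the X-API-KEY header name and key value are hypothetical; music-to-scrape itself doesn't require this, so check your target API's documentation for the exact header):

import requests
# hypothetical example: many APIs expect credentials in a request header
headers = {'X-API-KEY': 'your-api-key-here'} # header name/value are made up for illustration
response = requests.get('https://api.music-to-scrape.org/users', headers=headers)
print(response.json())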

Example: submitting parameters via the `params` argument

  • params must be a dictionary with the parameter names and corresponding values; requests then encodes them into the request for you
import requests
url = "https://api.music-to-scrape.org/users"
response = requests.get(url, params = {'limit': 15})
request = response.json()
print(request)

Can you speculate about the benefits of submitting parameters via the params argument rather than hand-coding them into the URL?

Do: Iterating through API endpoints

Remember iterating through pages on a website to “view” data? APIs use the same concept!

  1. Try to add the offset parameter to the snippet below. What happens when you set it to 1? What happens when you set it to 5?
  2. To which value would you have to set it to see the next batch of users?
# start with this code
import requests
url = "https://api.music-to-scrape.org/users?limit=10"
response = requests.get(url)
request = response.json()
print(request)

Solution:

# q1: setting the offset parameter
import requests
requests.get("https://api.music-to-scrape.org/users?limit=10").json()
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=1").json()
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=5").json()

# q2: 
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=10").json()

Do: Let's wrap your code in a function

  1. Construct a function, called get_users(), with parameters limit and offset, returning the dictionary of users from the API endpoint /users.

Solution:

import requests

def get_users(limit, offset):
  obj = requests.get(f"https://api.music-to-scrape.org/users?limit={limit}&offset={offset}").json()
  return obj['data']

get_users(10,1)

Looping through all result pages of an endpoint

  • Let's try to construct a loop that repeatedly fetches user names.
  • Let us first use a for loop.
for x in range(6):
  print(x)

Do: Modify the snippet below so that it calls get_users() 10 times, incrementing the offset by 10 at each iteration.

Solution:

offset=0
for x in range(10):
  print(get_users(limit=10, offset=offset))
  offset=offset+10

For loop vs. while loop

  • Difference between for loops (you usually know beforehand when to stop) and while loops (the ending point can change, say when “there is no new data coming in”)
# for loop
for x in range(6):
  print(x)

# while loop
cntr = 0
while cntr < 6:
  print(cntr)
  cntr = cntr + 1
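
A sketch of the “ending point can change” case: keep requesting batches until the API returns no new data (reusing get_users() from earlier; the emptiness check assumes the endpoint returns an empty list once past the last user):

offset = 0
all_users = []
while True:
  batch = get_users(limit=10, offset=offset)
  if not batch: # no new data coming in: stop
    break
  all_users.extend(batch) # add this batch's users to the overall list
  offset = offset + 10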

Tying everything together

We can now combine our learnings to build a script that extracts 100 user names and metadata into a newline-separated JSON file.

  • Implement it step-by-step
  • Also load packages!
  • Save as .py file and test whether it runs from command prompt/terminal

Solution:

# will develop in class
import requests
import json

cntr = 0
f = open('output.json', 'w', encoding='utf-8')
while cntr < 100: # 10 batches of 10 users = 100 users
  f.write(json.dumps(get_users(limit=10, offset=cntr)))
  f.write('\n') # newline-separated JSON: one batch per line
  cntr = cntr + 10
f.close()

Do: Onwards to more endpoints

The tutorial proceeds by introducing a series of additional endpoints.

  • user/plays - get a user's total number of plays
  • charts/top-artists - see a list of top-performing artists for this week (and previous weeks)
  1. Try fetching data from user/plays, following the guidelines in the documentation. Do you succeed?
  2. Now do the same for charts/top-artists. Do you get some output?
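
A minimal sketch for question 1 (the username parameter name and its value are assumptions; verify the exact parameter in the documentation):

import requests
# hypothetical call: pass a username you saw in the /users output
response = requests.get("https://api.music-to-scrape.org/user/plays", params={'username': 'some_username'})
print(response.json())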

Let's extract several of these weekly charts

# code in class
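
A possible starting point, as a hedged sketch: the week parameter below is hypothetical, and the documentation specifies how to actually request previous weeks.

import requests
# fetch the current chart plus a few previous weeks ('week' is a made-up parameter name)
for week in range(5):
  chart = requests.get("https://api.music-to-scrape.org/charts/top-artists", params={'week': week}).json()
  print(chart)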

Reflection

  • wow - no “structuring” of HTML “soup” (THAT's the key difference compared to web scraping)
  • Spend time trying to get your first data from an API – authentication will be an issue!
  • Time left? Let's consider some legal issues
  • After this class: coaching session (check your workplan)
  • Next week
    • presentations about your projects
    • afterwards: coaching (get your prototype up and running!)