oDCM - APIs 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“API 101”).

Before we start

  • Exam dates fixed
    • Tuesday, 2 April (main sit)
    • Thursday, 27 June (resit)
  • Tech development & extra material
    • Note there are “advanced” tutorials for extra code snippets (e.g., Selenium)
  • Team projects
    • enthusiastic about many of your ideas
    • feedback on Canvas
    • continue building your prototypes today!
  • Any issues at this stage?

Agenda

  • In-class: https://api.music-to-scrape.org/docs
  • Coaching session (challenges #2.1-#2.4)
    • which information to retrieve?
    • for which seeds?
    • at what frequency?
    • how to process the data during the collection
  • After class
    • complete tutorial on your own; work on team projects

What are APIs?

  • APIs are connectors within and between platforms and databases, using the web to talk to each other
    • backend to frontend: e.g., database <-> Reddit API <-> website
    • backend to backend: e.g., Twitter <-> Twitter API <-> some trading software
  • Initially made available for developers, later for researchers (e.g., OpenAI's API)
  • Available usually only after registration and approval

Technical differences between APIs and web scraping

  • Web scraping
    • not always 100% clear whether allowed
    • have to structure (unstructured) HTML code
  • API
    • good if you have permission (but, be realistic about getting access!)
    • structured JSON data

Discovering an API's documentation

  • Probably the most difficult part w/ APIs is to get authenticated
    • recall: need platform's permission (but this API doesn't require authentication)
  • Find an endpoint (aka page when scraping)
  • Let's explore the documentation of… OpenAI!
    • which endpoints are available?
    • what could you do with them?

Do: Zooming in on Music-to-scrape's API

Browse the documentation of music-to-scrape's API.

1) Have a look at the various endpoints that are available. At what unit of analysis do they allow you to collect data?

2) Why would a platform offer an API in the first place?

Do: Making a connection with an API

Run the snippet below.

import requests
url = "https://api.music-to-scrape.org/users"
response = requests.get(url) # send a GET request to the endpoint
request = response.json() # parse the JSON response into a Python dictionary
print(request)
  1. Can you explain what happens when you run this cell?
  2. Try adding the limit parameter (?limit=20) and run again. What happens?

How can we make sense of JSON data?

  • JSON data is structured in a hierarchy
    • use JSONviewer.stack.hu to make sense of it
    • alternatively, write to .json file and view in VS Code
    • many other possibilities exist (e.g., in-browser) - check what you like best!
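A quick way to see the hierarchy without leaving Python is to pretty-print the dictionary. The sketch below uses a small hand-made dictionary (sample); its structure and field names are assumptions for illustration and may differ from what the API actually returns.

```python
import json

# Hypothetical, hand-made stand-in for an API response (field names are assumptions)
sample = {
    "limit": 2,
    "offset": 0,
    "data": [
        {"username": "user_a", "country": "NL", "age": 30},
        {"username": "user_b", "country": "DE", "age": 25},
    ],
}

# indent=2 puts every nesting level on its own indented line
pretty = json.dumps(sample, indent=2)
print(pretty)
```

Each extra level of indentation in the output corresponds to one level deeper in the hierarchy.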

DO: Run this snippet to write JSON output to a file! Inspect it using JSONviewer or VS Code!

import json
with open('users.json', 'w', encoding='utf-8') as f: # the file is closed automatically
    f.write(json.dumps(request))

Getting data from a JSON dictionary

  • Let's try to only extract a list of user names, rather than the entire dictionary.
  • For this, we need to know how to access nodes of a dictionary.
    • ['name_of_attribute']
    • .get('name_of_attribute')
  • Do: Let's try it out
request['limit']

# or: 

request.get('limit')
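The two notations behave the same when the attribute exists, but differ when it is missing. A minimal sketch, using a hand-made dictionary (sample) as a stand-in for the API response; the key "total" is made up to illustrate a missing attribute.

```python
# Hypothetical stand-in for the parsed API response (structure is an assumption)
sample = {"limit": 10, "offset": 0, "data": []}

# Both notations return the same value when the key exists
limit_a = sample["limit"]
limit_b = sample.get("limit")

# They differ when the key is missing
missing = sample.get("total")       # returns None, no error
fallback = sample.get("total", 0)   # returns the default you pass in

try:
    sample["total"]                 # square brackets raise a KeyError
    raised = False
except KeyError:
    raised = True

print(limit_a, limit_b, missing, fallback, raised)
```

In a long-running data collection, .get() can be the safer choice, because a single missing attribute will not stop the script.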

Do: Getting user names

  1. How would you have to modify the snippet below to access the list of user names?
  2. Can you write a loop that extracts the user names and stores them in a list containing only user names (not countries, etc.)?
request['limit'] # this one worked for extracting the value for limit...
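One possible approach, sketched on a hand-made dictionary rather than the live API. The keys "data" and "username" are assumptions, so check them against your own output before reusing the loop.

```python
# Hypothetical response; the keys "data" and "username" are assumptions
sample = {
    "limit": 2,
    "offset": 0,
    "data": [
        {"username": "user_a", "country": "NL", "age": 30},
        {"username": "user_b", "country": "DE", "age": 25},
    ],
}

usernames = []
for user in sample["data"]:              # each item is one user dictionary
    usernames.append(user["username"])   # keep only the name

print(usernames)
```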

Do: Anonymizing data

Suppose we do not want to extract the ages and country names for the users…

Can you come up with a way to anonymize them (e.g., overwrite them with NA), while keeping the rest of the dictionary intact?
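A minimal sketch of one way to do this, again on a hand-made dictionary. The field names ("data", "age", "country") are assumptions, and working on a deep copy is a design choice so that the original data stays available for comparison.

```python
import copy

# Hypothetical response; field names are assumptions
sample = {
    "limit": 2,
    "offset": 0,
    "data": [
        {"username": "user_a", "country": "NL", "age": 30},
        {"username": "user_b", "country": "DE", "age": 25},
    ],
}

anonymized = copy.deepcopy(sample)   # copy first, so sample itself is not changed
for user in anonymized["data"]:
    user["age"] = "NA"               # overwrite the sensitive values
    user["country"] = "NA"

print(anonymized)
```

If you overwrite the values directly in the original dictionary instead, the raw ages and countries are gone for the rest of the session, which may be exactly what you want when privacy matters.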

Difference between `.append()` and `.extend()`

  • .append() adds one item at a time to an existing list
users = [] # empty list
users.append('another user')
users.append('yet another user')
users
  • .extend() adds multiple items to an existing list
users = []
new_users = ['another user','yet another user']
users.extend(new_users)
users
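The difference matters most when the thing you add is itself a list, for example a batch of users from one API call. A short sketch with made-up user names:

```python
batch = ['user_c', 'user_d']   # e.g., one batch of names from an API call

appended = ['user_a']
appended.append(batch)         # the whole list becomes ONE (nested) item

extended = ['user_a']
extended.extend(batch)         # each element of batch is added separately

print(appended)   # ['user_a', ['user_c', 'user_d']]
print(extended)   # ['user_a', 'user_c', 'user_d']
```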

Understanding parameters in API calls

  • APIs are useful because you can modify their search results
    • say, you want to see the next batch of users (when listing users), or
    • search for a particular thing (when searching via the API)
  • Many ways to “submit” parameters to an API
    • directly in URL, e.g., myapi.com/search/cats_and_dogs [not the case for music to scrape]
    • w/ parameters in URL, e.g., myapi.com/search/?query=cats_and_dogs
    • in header of request (think of it like the address [=header] on an envelope [=data])



The API documentation will tell you what is required!
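To see where a parameter ends up in each of the three cases, the sketch below only builds the pieces of a request, without sending anything. myapi.com is the placeholder from the list above, and the header name X-Query is made up for illustration.

```python
from urllib.parse import urlencode

base = "https://myapi.com/search"   # placeholder API, not a real service
query = "cats_and_dogs"

# 1) directly in the URL path
url_path = f"{base}/{query}"

# 2) as a query string, appended after a "?"
url_query = f"{base}/?{urlencode({'query': query})}"

# 3) in the request header: the URL stays clean, the value travels separately
url_header = base
headers = {"X-Query": query}        # hypothetical header name

print(url_path)
print(url_query)
print(url_header, headers)
```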

Example: submitting parameters via the `params` argument

  • params must be a dictionary with the parameter names and corresponding values
  • requests does not put them in the header; it appends them to the URL as a query string (e.g., ?limit=15)
import requests
url = "https://api.music-to-scrape.org/users"
response = requests.get(url, params = {'limit': 15})
request = response.json()
print(request)

Can you speculate about the benefits of submitting parameters via params rather than typing them into the URL yourself?
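One practical benefit of passing a dictionary is that special characters get encoded for you. The sketch below uses urlencode from Python's standard library to show the kind of query string a dictionary turns into; requests performs a comparable encoding step. The query parameter and myapi.com are made-up examples, not part of the music-to-scrape API.

```python
from urllib.parse import urlencode

params = {"query": "cats & dogs", "limit": 15}   # hypothetical parameters

query_string = urlencode(params)                 # spaces and "&" are escaped
url = "https://myapi.com/search/?" + query_string

print(url)
```

Typing "cats & dogs" into the URL by hand would break it, because an unescaped & starts a new parameter.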

Do: Iterating through API endpoints

  • Remember iterating through pages on a website to “view” data? APIs know the same concept!
  1. Try to add the offset parameter to the snippet below. What happens when you set it to 1? What happens when you set it to 5?
  2. To which value would you have to set it to see the next batch of users?
# start with this code
import requests
url = "https://api.music-to-scrape.org/users?limit=10"
response = requests.get(url)
request = response.json()
print(request)

Do: Let's wrap your code in a function

  1. Construct a function, called get_users(), with parameters limit and offset, returning the dictionary of users from the API endpoint /users.
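A minimal sketch of what such a function could look like. The optional fetch argument is not part of the exercise; it is an addition here so that the function can be tried offline with a fake fetcher. The timeout of 10 seconds is likewise an arbitrary choice.

```python
def get_users(limit=10, offset=0, fetch=None):
    """Return the parsed JSON from /users for one batch of results."""
    if fetch is None:
        import requests  # only needed when actually calling the API

        def fetch(url, params):
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()   # stop early on HTTP errors
            return response.json()

    url = "https://api.music-to-scrape.org/users"
    return fetch(url, {"limit": limit, "offset": offset})
```

Calling get_users(limit=10, offset=20) would then request the batch of users starting at position 20.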

Looping through all result pages of an endpoint

  • Let's try to construct a loop that repeatedly fetches user names.
  • Let us first use a for loop.
for x in range(6):
  print(x)

Do: Modify the snippet above so that it calls get_users() 10 times, incrementing the offset by 10 at each iteration.
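The offset arithmetic can be sketched as follows. The get_users() defined here is an offline stand-in that just echoes its arguments, so the loop runs without the API; swap in your own function that calls /users.

```python
def get_users(limit, offset):
    # Offline stand-in (assumption): returns an empty batch and echoes the arguments
    return {"limit": limit, "offset": offset, "data": []}

limit = 10
pages = []
for i in range(10):         # 10 iterations: i = 0, 1, ..., 9
    offset = i * limit      # 0, 10, 20, ..., 90
    pages.append(get_users(limit=limit, offset=offset))

print([page["offset"] for page in pages])
```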

For loop vs. while loop

  • Difference between for loops (you usually know beforehand when to stop) and while loops (the ending point can change, say when “there is no new data coming in”)
# for loop
for x in range(6):
  print(x)
# while loop
cntr = 0
while cntr < 6:
  print(cntr)
  cntr = cntr+1
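A while loop that stops when “there is no new data coming in” could be sketched like this. The get_users() below is an offline stand-in that pretends the platform has 25 users, and the sketch assumes that an offset beyond the last user yields an empty list; the real API may behave differently (e.g., return an error).

```python
# Offline stand-in for the API: pretends there are 25 users in total
ALL_USERS = [f"user_{i}" for i in range(25)]

def get_users(limit, offset):
    return {"limit": limit, "offset": offset, "data": ALL_USERS[offset:offset + limit]}

limit = 10
offset = 0
collected = []
while True:
    batch = get_users(limit=limit, offset=offset)["data"]
    if not batch:            # empty list: no new data coming in, so stop
        break
    collected.extend(batch)  # add every user of this batch to the result
    offset += limit          # move on to the next batch

print(len(collected), offset)
```

Unlike the for loop, this version does not need to know the number of batches beforehand.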

Tying everything together

We can now combine what we have learned to build a function that writes all user names to a newline-separated JSON file.

  • Implement it step-by-step
  • Also load packages!
  • Save as .py file and test whether it runs from command prompt/terminal

Solution:

# will develop in class
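While the full solution is developed in class, the file-writing step can already be sketched offline. The function name save_users(), the file name, and the hand-made batches are assumptions for illustration; the keys "data" and "username" should be checked against the real API output.

```python
import json
import os
import tempfile

def save_users(pages, filename):
    """Write one JSON object per line (newline-separated JSON)."""
    with open(filename, "w", encoding="utf-8") as f:
        for page in pages:
            for user in page["data"]:
                f.write(json.dumps({"username": user["username"]}) + "\n")

# Hypothetical batches, standing in for what repeated API calls might return
pages = [
    {"data": [{"username": "user_a", "age": 30}]},
    {"data": [{"username": "user_b", "age": 25}, {"username": "user_c", "age": 41}]},
]

out_file = os.path.join(tempfile.gettempdir(), "users_sketch.json")
save_users(pages, out_file)

# Read the file back to check: each line is one valid JSON object
with open(out_file, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Writing one object per line means a crashed collection still leaves a readable file, because every completed line is valid JSON on its own.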

Do: Onwards to more endpoints

The tutorial proceeds by introducing a series of additional endpoints.

  • user/plays - get a user's total number of plays
  • charts/top-artists - see a list of top-performing artists for this week (and previous weeks)
  1. Try fetching data from user/plays, following the guidelines in the documentation. Do you succeed?
  2. Now do the same for charts/top-artists. Do you get some output?

Let's extract several of these weekly charts

# code in class

Reflection

  • wow - no “structuring” of HTML “soup” (THAT's the key difference compared to web scraping)
  • Spend time trying to get your first data from an API – authentication will be an issue!
  • Time left? Let's consider some legal issues
  • After this class: coaching session (focus on activity #3)
  • Next week
    • presentations about your projects
    • afterwards: coaching (get your prototype up and running!)

Coaching session

  • Coaching session (challenges #2.1-#2.4)
    • which information to retrieve?
    • for which seeds?
    • at what frequency?
    • how to process the data during the collection
  • Work on your prototypes
    • Selenium vs. requests?
    • Getting authenticated with the API?