oDCM - APIs 101 (Tutorial)

Hannes Datta

Welcome to oDCM!

We're about to start with today's tutorial (“API 101”).

If you haven't done so, open slides at https://odcm.hannesdatta.com/docs/modules/week4/ (I have just updated them)
Today's (live) code is available at https://tiu.nu/livecoding (refresh is needed when I update my code)
You can work in Google Colab throughout this class

Before we get started

Recall: Exam dates have been set
- dPrep: 3 April 2025 (resit 27 June)
- oDCM: 1 April 2025 (resit 26 June)
Mac users: Please familiarize yourselves with R/RStudio on Windows (i.e., try not to use your laptops today)
Presentations during next week's class

Today's session

Deep-dive on APIs

Part 1: API of music-to-scrape & working with JSON data
Part 2: Parameters in API calls
Part 3: Looping: for and while

Plus: as usual, focus on modularization (writing functions)!

Recap: What are APIs?

APIs are connectors within and between platforms and databases, using the web to talk to each other
- backend to frontend: e.g., database <-> Reddit API <-> website
- backend to backend: e.g., Twitter <-> Twitter API <-> some trading software
Initially made available for developers, later for researchers (e.g., OpenAI's API)
Available usually only after registration and approval

Recap: Technical differences between APIs and web scraping

Web scraping
- not always 100% clear whether allowed
- have to structure (unstructured) HTML code
API
- good if you have permission (but, be realistic about getting access!)
- you obtain structured JSON data
- business model!

Recap: Discovering an API documentation

Probably the most difficult part w/ APIs is to get authenticated
- recall: need platform's permission (but this API doesn't require authentication)
Find an endpoint (aka page when scraping)
Let's explore the documentation of… OpenAI!
- which endpoints are available?
- what could you do with them?

Part 1: API of music-to-scrape & working with JSON data

DO: Browse the documentation of music-to-scrape's API.

Then, run the snippet below.

import requests
url = "https://api.music-to-scrape.org/users"
response = requests.get(url)
request = response.json()
print(request)

Can you explain what happens when you run this cell?
Try adding the limit parameter (?limit=20) and run again. What happens?

Part 1: How can we make sense of JSON data?

JSON data is structured in a hierarchy

DO:

Run the snippet below (which saves data to a JSON file)
Open the snippet in VS Code
Install the JSON extension (I will show you how)
Reformat
- use VS Code (install the JSON plugin first!)
- alternatively, write to .json file and view in VS Code (install JSON plugin first)
- many other possibilities exist (e.g., in-browser) - check what you like best!

import json
f=open('users.json', 'w', encoding='utf-8')
f.write(json.dumps(request))
f.close()

Part 1: Getting data from a JSON dictionary

Let's try to only extract a list of user names, rather than the entire dictionary.
For this, we need to know how to access nodes of a dictionary.
- ['name_of_attribute'] or
- .get('name_of_attribute')
Do: Let's try it out

request['limit']

# or: 

request.get('limit')

Part 1: Getting user names

How would you have to modify the snippet below to access the list of user names?
Can you write a loop that extracts user names and puts them to an array of only user names (not countries, etc.)?

request['limit'] # this one worked for extracting the value for limit...

Solution

#Q1: 
request['data']

#Q2:
user_names = [] # initialize empty array
for user in request['data']: # start loop
  user_names.append(user['username']) # append user name to the list of user_names (initialized above)
user_names # inspect the result

Part 1: Anonymizing data

DO: Suppose we do not want to extract the ages and country names for the users…

Can you come up with a way to anonymize them (e.g., overwrite them with NA), while keeping the rest of the dictionary intact?

Solution:

new_dic = []

for user in request['data']:
  obj = user
  obj['username']='NA'
  obj['age']='NA'
  new_dic.append(obj)
new_dic

Part 1: Difference between `.append()` and `.extend()`

.append() adds one item at a time to an existing list

users = [] # empty list
users.append('another user')
users.append('yet another user')
users

extend() adds multiple items to an existing list

users = []
new_users = ['another user','yet another user']
users.extend(new_users)
users

Part 2: Understanding parameters in API calls

APIs are useful because you can modify its search results
- say, you want to see the next batch of users (when listing users), or
- search for a particular thing (when searching via the API)
Many ways to “submit” parameters to an API
- directly in URL, e.g., myapi.com/search/cats_and_dogs [not the case for music to scrape]
- w/ parameters in URL, e.g., myapi.com/search/?query=cats_and_dogs
- in header of request (think of it like the address [=header] on an envelop [=data])

The API documentation will tell you what is required!

Part 2: Submitting parameters in the header via the `params` argument

Example

params requires to be a dictionary with the parameter names and corresponding values

import requests
url = "https://api.music-to-scrape.org/users"
response = requests.get(url, params = {'limit': 15})
request = response.json()
print(request)

Can you speculate about the benefits of submitting parameters in the header (params) rather than in the URL?

Part 2: Iterating through API endpoints

Try to add the offset parameter to the snippet below. What happens when you set it to 1? What happens when you set it to 5?
To which value would you have to set it to see the next batch of users?

Tip: Remember iterating through pages on a website to “view” data? APIs know the same concept!

# start with this code
import requests
url = "https://api.music-to-scrape.org/users?limit=10"
response = requests.get(url)
request = response.json()
print(request)

Solution:

# q1: setting the offset parameter
import requests
requests.get("https://api.music-to-scrape.org/users?limit=10").json()
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=1").json()
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=5").json()

# q2: 
requests.get("https://api.music-to-scrape.org/users?limit=10&offset=10").json()

Part 2: Let's wrap your code in a function

Construct a function, called get_users(), with parameters limit and offset, returning the dictionary of users from the API endpoint /users.

Solution:

import requests
  def get_users(limit, offset):
    obj = requests.get(f"https://api.music-to-scrape.org/users?limit={limit}&offset={offset}").json()
    return(obj['data'])

get_users(10,1)

Part 3: Looping through all result pages of an endpoint

Let's try to construct a loop that repeatedly fetches user names.
Let us first use a for loop.

for x in range(6):
  print(x)

Do: Modify the snippet below so that it calls get_users() 10 times, incrementing the offset by 10 at each iteration.

Solution:

offset=0
for x in range(10):
  print(get_users(limit=10, offset=offset))
  offset=offset+10

Part 3: For loop vs. while loop

Difference between for for loops (you usually know beforehand when to stop), and while loops (the ending point can change, say when “there is no new data coming in”)

# for loop
for x in range(6):
  print(x)

cntr = 0
while cntr < 6:
  print(cntr)
  cntr = cntr+1

Part 3: Tying everything together

We can now combine our learnings to build a function that extracts 100 user names and meta data to new-line separated JSON files.

Implement it step-by-step
Also load packages!
Save as .py file and test whether it runs from command prompt/terminal

Solution:

# will develop in class
import requests
import json

cntr = 0
f=open('output.json', 'w')
while cntr <= 50:
  f.write(json.dumps(get_users(limit=10, offset=cntr)))
  f.write('\n')
  cntr = cntr+10
f.close()

Part 3: Onwards to more endpoints

The tutorial proceeds by introducing a series of additional endpoints.

user/plays - get a user's total number of plays
charts/top-artists - see a list of top-performing artist for this week (and previous weeks)

Try fetching data from user/plays, following the guidelines in the documentation. Do you succeed?
Now do the same for charts/top-artists. Do you get some output?

Let's extract several of these weekly charts

# code in class

Reflection & next steps

wow - no “structuring” of HTML “soup” (THAT's the key difference compared to web scraping)
Spend time trying to get your first data from an API – authentication will be an issue!
Time left? Let's consider some legal issues
After this class: coaching session (check your work plan)
Next week
- presentations about your projects
- afterwards: coaching session (get your prototype up and running!!!)