# APIs 101 (oDCM)

*The focus in this tutorial lies on parameters (i.e., telling the API what you really want), and pagination (i.e., looping through multiple "pages").*


--- 

## Learning Objectives

* Send HTTP requests to a web API, and retrieve JSON responses
* Use parameters to modify the results of an API call
* Iterate over multiple pages of JSON responses 
* Extract and store results of an API request in lists of dictionaries, and files

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## Background


This tutorial uses [music-to-scrape.org](https://music-to-scrape.org), a fictitious music streaming service you encountered earlier in this course. Music-to-scrape also has an [API](https://api.music-to-scrape.org/docs) that offers convenient access to its underlying data.

<div class="alert alert-block alert-info"><b>Tip</b> 
As with any other API, it is best to check out the documentation first! You can typically find API documentation by searching for "service name + developer API" or "service name + API documentation".
</div>



## 1. Getting to know the API's `/users` endpoint

### 1.1 Make an API request


We'll be using the API's `/users` endpoint to generate a list of user names available at the website. Familiarize yourself with the [endpoint's documentation](https://api.music-to-scrape.org/docs#operation/list_users_users_get).

__Let's try it out__

Run the cell below!


In [None]:
# Import the 'requests' library, which is used to make HTTP requests.
import requests

# Define the URL of the API endpoint you want to request data from.
url = "https://api.music-to-scrape.org/users"

# Send an HTTP GET request to the specified URL and store the response in 'response'.
response = requests.get(url)

# Parse the JSON content of the response into a Python dictionary and store it in 'request'.
request = response.json()

# Print the 'request' dictionary, which contains the retrieved data in JSON format.
print(request)


#### Exercises 1.1

1. Iterate through the request data, which contains user information. For each user, print their username and age to the screen.


In [None]:
# start your code here
# for user in request['data']: 
# ...

#### Solution

In [None]:
# Iterate through each 'user' dictionary in the 'data' list of the 'request' dictionary.
for user in request['data']:
    # Extract the 'username' and 'age' fields from the current 'user' dictionary.
    username = user['username']
    age = user['age']
    
    # Print the username and age using a formatted string.
    print(f'Username: {username}, age: {age}')

### 1.2 Use parameters to modify what you get as an API output

__Importance__

Up to this point, we've retrieved a list of 10 usernames, along with a bit of demographic information from these users (e.g., country of origin & age).

As a next step, you will learn how to *modify* the API request. Why is this useful? With APIs, you usually want to specify exactly *what data you want*. For example, using the Twitter API, you may want to search for a particular set of hashtags. Similarly, using the Instagram API, you may want to retrieve the number of followers of a *particular* user.

Whatever your goal is, you need to understand how to *customize* an API request. 


__Let's try it out__

Let's first open the API endpoint from earlier in your browser. Just click on the next link!

https://api.music-to-scrape.org/users

Now, modify the URL in your browser, adding `?limit=25` to it.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week4/apis101/images/api-users.png" width=50% align="left"  style="border: 1px solid black"/>

If everything went alright, you should now see a list of 25 users in your browser!
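
The same URL can also be assembled in Python instead of typing it by hand. Below is a minimal sketch using only the standard library; the helper name `build_url` and the constant `BASE_URL` are made up for this illustration and are not part of the API. (The `requests` library can do the same for you if you pass a dictionary via its `params` argument, e.g., `requests.get(url, params={'limit': 25})`.)

```python
from urllib.parse import urlencode

# Base address of the API (taken from the URLs used in this tutorial)
BASE_URL = "https://api.music-to-scrape.org"

def build_url(endpoint, **params):
    """Return the full URL for an endpoint, with optional query parameters."""
    url = f"{BASE_URL}/{endpoint}"
    if params:
        # urlencode joins the parameters with '&' and escapes special characters
        url = f"{url}?{urlencode(params)}"
    return url

print(build_url("users"))             # no parameters
print(build_url("users", limit=25))   # one parameter, added after '?'
```

Building URLs this way becomes handy once you need more than one parameter, because you no longer have to keep track of the `?` and `&` separators yourself.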



#### Exercises 1.2
1. Adapt the code cell above -- the one making requests -- to retrieve information from 25 users via the API directly here in Jupyter Notebook.
2. Write a function `get_users()` that takes the limit as an input parameter and returns the data as a list of dictionaries (tip: use your answer to question 1 as a starting point!).
3. Modify the function to discard users' ages and country of origin (remove the information).

__Tips:__

- Rather than writing `request['data']`, you can also use the function `get()` to retrieve the particular attribute: `request.get('data')`.
- How do we know the users are available in the `data` node? You can simply look at the output of the API call (i.e., by looking at "what" is in the object `request`).
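
To see how the two ways of accessing an attribute differ, here is a small sketch. The dictionary below is a made-up stand-in for an API response (not real API output), and the key `total` is purely hypothetical, chosen to show what happens when a key is missing.

```python
# A small stand-in for an API response (made-up values, not real API output)
request = {"data": [{"username": "example_user"}]}

# If the key exists, both styles return the same list
print(request["data"])
print(request.get("data"))

# If the key is missing, .get() returns None (or a default you choose) ...
print(request.get("total"))
print(request.get("total", 0))

# ... whereas square brackets raise a KeyError
try:
    request["total"]
except KeyError as error:
    print(f"KeyError: {error}")
```

Using `.get()` is convenient when you are not sure whether a field is always present in the response.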


In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Q1: Fetching User Data from the API

# Import the 'requests' library, which is used to make HTTP requests.
import requests

# Define the URL of the API endpoint with a 'limit' parameter set to 25 to retrieve 25 users.
url = "https://api.music-to-scrape.org/users?limit=25"

# Send an HTTP GET request to the specified URL and store the response in 'response'.
response = requests.get(url)

# Parse the JSON content of the response into a Python dictionary and store it in 'request'.
request = response.json()

# Extract the 'data' field from the 'request' dictionary, which contains user information.
users = request['data']

# The 'users' variable now contains a list of user data.
# You can further work with this data as needed.
users


In [None]:
# Q2: A Function to Fetch User Data from the API

# Import the 'requests' library, which is used to make HTTP requests.
import requests

# Define a function named 'get_users' that takes a 'limit' parameter.
def get_users(limit):
    # Construct the URL of the API endpoint with the specified 'limit'.
    url = f"https://api.music-to-scrape.org/users?limit={limit}"
    
    # Send an HTTP GET request to the constructed URL and store the response in 'response'.
    response = requests.get(url)
    
    # Parse the JSON content of the response into a Python dictionary and store it in 'request'.
    request = response.json()
    
    # Extract the 'data' field from the 'request' dictionary, which contains user information.
    users = request['data']

    # Return the 'users' list containing user data to the caller.
    return users

# Call the 'get_users' function with a 'limit' of 25 to retrieve 25 users.
get_users(25)


In [None]:
# Q3: Anonymizing User Data in a Function

# Import the 'requests' library, which is used to make HTTP requests.
import requests

# Define a function named 'get_users' that takes a 'limit' parameter.
def get_users(limit):
    # Construct the URL of the API endpoint with the specified 'limit'.
    url = f"https://api.music-to-scrape.org/users?limit={limit}"
    
    # Send an HTTP GET request to the constructed URL and store the response in 'response'.
    response = requests.get(url)
    
    # Parse the JSON content of the response into a Python dictionary and store it in 'request'.
    request = response.json()
    
    # Extract the 'data' field from the 'request' dictionary, which contains user information.
    users = request['data']

    # Create an empty list 'return_users' to store modified user data.
    return_users = []
    
    # Iterate through each 'user' in the 'users' list.
    for user in users:
        # Anonymize certain user data by setting 'age' and 'country' to 'anonymized'.
        user['age'] = 'anonymized'
        user['country'] = 'anonymized'
        
        # Append the modified 'user' to the 'return_users' list.
        return_users.append(user)
    
    # Return the 'return_users' list containing anonymized user data to the caller.
    return return_users

# Call the 'get_users' function with a 'limit' of 25 to retrieve 25 users and anonymize their data.
get_users(25)
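
The solution above overwrites the two fields with the string `'anonymized'`. If you would rather remove the fields entirely, one possible approach is sketched below. The helper name `drop_fields` and the sample user are made up for this illustration; the function works on any list of dictionaries, so no API call is needed to try it.

```python
def drop_fields(users, fields=("age", "country")):
    """Return copies of the user dictionaries without the given fields."""
    cleaned = []
    for user in users:
        # Keep only the key-value pairs whose key is not in 'fields'
        cleaned.append({key: value for key, value in user.items() if key not in fields})
    return cleaned

# Made-up example user (not real API output)
sample_users = [{"username": "example_user", "age": 30, "country": "NL"}]

print(drop_fields(sample_users))  # [{'username': 'example_user'}]
```

Because the function builds new dictionaries, the original list stays untouched, which can be useful if you still need the full data elsewhere.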


### 1.3 Pagination/iteration

__Importance__

Transferring data is costly, both in money and in *time*. APIs are therefore typically frugal in how much data they return. Ideally, each call returns only the specific data points that were asked for. On music-to-scrape.org, for example, that would be a handful of usernames at most. This saves the website owner bandwidth costs and keeps the site responsive to user input (such as navigating the site or searching for information).

However, we are frequently interested in obtaining *everything* when using APIs for research purposes (such as *all* available user names, to then build a panel data set of consumption behavior)...

We think you see where we're going with this... 

__Let's try it out__

So, let's grab more usernames. The API output, unfortunately, only shows at maximum 100 of these names. To retrieve the remaining usernames, you need to iterate through several of these API calls. The API divides the data into smaller subsets that can be accessed on various pages, rather than returning all output at once. 

Let's retrieve the first batch of usernames. Looking at the API documentation, we learn that there is an `offset` parameter to pass. Because it is the 2nd parameter we're adding, it is preceded by a `&` rather than a `?`. 

In [None]:
response = requests.get("https://api.music-to-scrape.org/users?limit=10&offset=0").json()
# first 10 user names
response['data']

In [None]:
# let's get data for the second batch of 10 user names

response = requests.get("https://api.music-to-scrape.org/users?limit=10&offset=10").json()
response['data']

In this particular application, the first API call displays 10 user names (see `limit` in the URL), and the 2nd page lists the next 10 usernames (after "skipping" (=`offset`) 10 users). 
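
The relationship between `limit` and `offset` can be written down as a small calculation: page 1 starts at offset 0, page 2 at offset `limit`, page 3 at offset `2 * limit`, and so on. A minimal sketch (the helper name `page_offsets` is made up for this illustration):

```python
def page_offsets(total, limit):
    """Return the offset of every page needed to cover `total` items."""
    # range(start, stop, step) counts from 0 up to (but excluding) 'total' in steps of 'limit'
    return list(range(0, total, limit))

print(page_offsets(total=50, limit=10))  # [0, 10, 20, 30, 40]

# The offsets can then be plugged into the URL of each request
for offset in page_offsets(total=30, limit=10):
    print(f"https://api.music-to-scrape.org/users?limit=10&offset={offset}")
```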



<div class="alert alert-block alert-info"><b>Tips:</b> 
 <br>
<li>You can adjust the number of results on each page with the <code>limit</code> parameter.</li>
<li>In practice, almost every API on the web limits the results of an API call (<code>100</code> is also a common cap).</li>
    
</div>

--- 
#### Exercises 1.3



1. Adapt the function `get_users()` (see question 3 of exercise 1.2), such that it accepts two arguments: `limit` (as before), and `offset` (which you will have to add). Set the default value for `limit` to `25`. Run the function.
2. Write a loop that retrieves the information of the first 500 users of the platform and then stops.


In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Q1

# Import the 'requests' library, which is used to make HTTP requests.
import requests

# Define a function named 'get_users' that takes two parameters: 'limit' (default: 25) and 'offset' (default: 0).
def get_users(limit=25, offset=0):
    # Construct the API URL with the provided 'limit' and 'offset' values.
    url = f"https://api.music-to-scrape.org/users?limit={limit}&offset={offset}"
    
    # Send an HTTP GET request to the API and store the response in the 'response' variable.
    response = requests.get(url)
    
    # Parse the JSON content of the response into a Python dictionary and store it in 'request'.
    request = response.json()
    
    # Extract the 'data' field from the 'request' dictionary, which contains user information.
    users = request['data']
    
    # Create an empty list 'return_users' to store modified user data.
    return_users = []
    
    # Iterate through each 'user' in the 'users' list obtained from the API response.
    for user in users:
        # Anonymize certain user data by setting 'age' and 'country' to 'anonymized'.
        user['age'] = 'anonymized'
        user['country'] = 'anonymized'
        
        # Append the modified 'user' to the 'return_users' list.
        return_users.append(user)
    
    # Return the list of modified user data.
    return return_users

# Call the 'get_users' function twice with different 'offset' values.
get_users(limit=25, offset=0)   # Retrieve the first 25 users.
get_users(limit=25, offset=25)  # Retrieve the next 25 users.


In [None]:
# Q2

# Initialize a counter 'cntr' to keep track of the number of retrieved users.
cntr = 0

# Create an empty list 'all_usernames' to store all the retrieved usernames.
all_usernames = []

# Continue the loop as long as 'cntr' is less than 500.
while cntr < 500:
    # Call the 'get_users' function to retrieve a batch of 25 users starting from 'cntr'.
    new_users = get_users(limit=25, offset=cntr)
    
    # Extend the 'all_usernames' list with the newly retrieved users.
    all_usernames.extend(new_users)
    
    # Increment the 'cntr' by 25 to move to the next batch of users.
    cntr = cntr + 25

# After the loop completes, check the length of 'all_usernames'.
len(all_usernames)

In [None]:
# peek at a few usernames
all_usernames[0:50]
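
The loop above stops after a fixed number of users. When you do not know in advance how many items exist, a common alternative is to keep requesting pages until one comes back empty. The sketch below illustrates that idea under a clearly stated assumption: that requesting an offset beyond the last item returns an empty list (please verify this for the API you work with). The names `fetch_all`, `fake_get_users`, and `FAKE_USERS` are made up; the fake function stands in for `get_users()` so that the example runs without a network connection.

```python
def fetch_all(fetch_page, limit=25, max_items=None):
    """Collect items page by page until a page comes back empty."""
    items = []
    offset = 0
    while max_items is None or len(items) < max_items:
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            # An empty page means there is nothing left to retrieve
            break
        items.extend(page)
        offset += len(page)
    # Trim the result in case the last page overshot 'max_items'
    return items if max_items is None else items[:max_items]

# Stand-in for get_users(): serves slices of 60 made-up users
FAKE_USERS = [{"username": f"user{i}"} for i in range(60)]

def fake_get_users(limit, offset):
    return FAKE_USERS[offset:offset + limit]

print(len(fetch_all(fake_get_users)))                # 60
print(len(fetch_all(fake_get_users, max_items=40)))  # 40
```

The `max_items` argument acts as a safety net, so the loop cannot run forever if the API keeps returning data.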

### 1.4 Saving information in a JSON file

__Importance__

When we work with APIs, it's crucial to handle the data we get from them efficiently, especially when we're conducting research and need to save that data for later analysis. Unlike just viewing data on a website, research often requires us to collect and store large amounts of information for deeper examination.

JSON, which stands for JavaScript Object Notation, is our preferred format for this task. It's a lightweight way of representing data that's easy for both humans and computers to understand. JSON is like a universal language for structured data storage, and here's why it's so valuable when dealing with API data:

1. **Preserving Data Structure:** JSON can handle complex data structures, which means it's perfect for storing various types of information, whether it's usernames, user profiles, or any other kind of data you want to gather.

2. **User-Friendly:** JSON files are designed to be easy for humans to read and edit. You can open them in a text editor or use programming languages to work with them effortlessly.

3. **Wide Compatibility:** JSON plays well with most programming languages and data analysis tools, making it a versatile choice for storing data that you want to work with later.

In simpler terms, think of JSON as the perfect container for your collected data. It keeps things organized, accessible, and ready for in-depth study when you need it. 

__Let's try it out__

Run the next cell to see how to save API data as JSON using Python.

In [None]:
import json

# Open a file named 'usernames.json' in write mode ('w').
# This will create the file if it doesn't exist or overwrite it if it does.
f = open('usernames.json', 'w')

# Iterate through each username in the 'all_usernames' list.
for username in all_usernames:
    # Convert the 'username' dictionary to a JSON-formatted string and write it to the file.
    f.write(json.dumps(username))
    
    # Write a newline character ('\n') to separate each JSON object on a new line.
    f.write('\n')

# Close the file to ensure that changes are saved and resources are released.
f.close()
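
A file written this way holds one JSON object per line (a layout often called "JSON Lines"), so it can be read back line by line. Below is a minimal sketch of a round trip. It uses made-up records and a temporary demo file (the file name `usernames_demo.json` is hypothetical) so that your `usernames.json` is not touched.

```python
import json
import os
import tempfile

# Made-up records standing in for the user dictionaries
records = [{"username": "example_user_1"}, {"username": "example_user_2"}]

# Write to a temporary location so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "usernames_demo.json")

# 'with' closes the file automatically, even if an error occurs
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back: each non-empty line is parsed into a dictionary
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```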


#### Exercises 1.4

1. Before writing the user information to the `.json` file, add a new field `source_website` to each username and set it to `music-to-scrape.org`. Also add a variable called `timestamp`, which you set to your computer's current date and time using the `time.time()` function (in unix time). This will help you keep track of where the data originated from. Make sure the modified data is saved in the JSON file.


In [None]:
# your answer here
import time
import json

# ...

#### Solution

In [None]:
# Import the 'json' library for working with JSON data and the 'time' library to record timestamps.
import json
import time

# Open a file named 'usernames.json' in write mode ('w').
# This will create the file if it doesn't exist or overwrite it if it does.
f = open('usernames.json', 'w')

# Iterate through each 'username' dictionary in the 'all_usernames' list.
for username in all_usernames:
    # Add additional fields to each 'username' dictionary.
    
    # Add 'source_website' field and set it to 'music-to-scrape.org' to track the data source.
    username['source_website'] = 'music-to-scrape.org'
    
    # Add 'timestamp' field and set it to the current Unix timestamp for record-keeping.
    username['timestamp'] = int(time.time())

    # Convert the modified 'username' dictionary to a JSON-formatted string and write it to the file.
    f.write(json.dumps(username))
    
    # Write a newline character ('\n') to separate each JSON object on a new line in the file.
    f.write('\n')

# Close the file to ensure that changes are saved and resources are released.
f.close()
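
A Unix timestamp counts the seconds since January 1, 1970 (UTC), which is compact but hard to read. If you later want to turn the stored value back into a readable date, a sketch along these lines may help (the variable names are made up for this illustration):

```python
import time
from datetime import datetime, timezone

# Current time as a Unix timestamp (whole seconds)
timestamp = int(time.time())

# Convert the timestamp back into a readable date and time (in UTC)
readable = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(timestamp, "->", readable.isoformat())

# Timestamp 0 corresponds to the start of the Unix epoch
print(datetime.fromtimestamp(0, tz=timezone.utc).isoformat())  # 1970-01-01T00:00:00+00:00
```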


### 1.5 Wrap-up

To sum up, we have seen how *parameters* can be a powerful tool when working with APIs. They allow you to tailor your request to be more specific or loop through multiple pages. Finally, we have recapped how to save information to a JSON file.

Note that in an API documentation, you typically find more information about the available parameters and the values they can take on. The documentation also contains information on what *other* endpoints are available. For example, the music-to-scrape [API documentation](https://api.music-to-scrape.org/docs) includes a section on retrieving users' demographic information or their listening history. 

However, bear in mind that each API is unique. So, it's crucial to study each API's documentation carefully, including its terms of use.

--- 
## 2. Getting to know more endpoints

Getting information from the "first" endpoint is always the hardest. The good news is that we've already done that! With our knowledge about how to assemble the URL of the endpoint (i.e., `https://api.music-to-scrape.org/` and the endpoint name, e.g., `users`), we can now proceed by getting to know more endpoints.

In the next subsections, we briefly introduce two of them and prompt you to develop code to get information from each.

### 2.1 `user/plays`

This endpoint is simple: it merely shows, for a particular user, how many songs have been listened to on the platform. [Check out the documentation here!](https://api.music-to-scrape.org/docs#operation/get_total_plays_for_username_user_plays_get).



#### Exercises 2.1

1. Write a code snippet that extracts the number of song plays for the user `StarCoder49`. Try to be as brief as possible (i.e., use as little code as possible).
2. Iterate through the list of user names retrieved earlier (`all_usernames` - restrict yourself to the first 10) and add the number of plays to the dictionary. __Pause your loop at every iteration for .2 seconds to minimize server load using the `time.sleep(.2)` function.__

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Q1: Fetching User Play Data

# Import the 'requests' library, which is used to make HTTP requests to the API.
import requests

# Send an HTTP GET request to the API endpoint to retrieve play data for a specific user.
# The URL includes the username 'StarCoder49' as a query parameter.
# The '.json()' method parses the JSON response into a Python dictionary.
play_data = requests.get('https://api.music-to-scrape.org/user/plays?username=StarCoder49').json()

# The 'play_data' variable now contains the play data for the user 'StarCoder49'.
# You can further work with this data as needed.
play_data

In [None]:
# Q2: Updating User Data with Play Counts

# Import the 'requests' library for making HTTP requests and the 'time' library for adding a delay.
import requests
import time

# Create an empty list 'all_users_updated' to store user data with play counts.
all_users_updated = []

# Iterate through a subset of 'all_usernames' (from the 1st to the 10th user).
for user in all_usernames[0:10]:
    # Extract the 'username' from the current 'user' dictionary.
    username = user['username']
    
    # Send an HTTP GET request to the API endpoint to retrieve play data for the 'username'.
    play_data = requests.get(f'https://api.music-to-scrape.org/user/plays?username={username}').json()
    
    # Add a 'plays' field to the 'user' dictionary and store the number of plays from the response.
    user['plays'] = play_data['plays']
    
    # Append the updated 'user' dictionary to the 'all_users_updated' list.
    all_users_updated.append(user)

    # Add a time delay of 0.2 seconds (200 milliseconds) to avoid overloading the API.
    time.sleep(0.2)


In [None]:
all_users_updated
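
The pattern in the solution (one request per user, followed by a short pause) can be wrapped in a reusable function. The sketch below is one possible way to do that; `add_play_counts` and `fake_fetch_plays` are made-up names, and the fake function returns invented numbers in place of a real API call so that the example runs offline.

```python
import time

def add_play_counts(users, fetch_plays, delay=0.2):
    """Return copies of `users` with a 'plays' field, pausing between requests."""
    updated = []
    for user in users:
        enriched = dict(user)  # copy, so the original dictionary is not modified
        enriched["plays"] = fetch_plays(user["username"])
        updated.append(enriched)
        # Pause to keep the load on the server low
        time.sleep(delay)
    return updated

# Stand-in for the API call: returns a made-up number instead of real play counts
def fake_fetch_plays(username):
    return len(username)

sample = [{"username": "abc"}, {"username": "abcde"}]
print(add_play_counts(sample, fake_fetch_plays, delay=0.01))
```

In real use, `fetch_plays` would be a function that requests the `user/plays` endpoint for the given username and returns the number of plays.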

### 2.2 `charts/top-artists`

This endpoint provides the weekly charts - compiled at the artist level. See also [here for the documentation](https://api.music-to-scrape.org/docs#operation/chart_get_top_artists_charts_top_artists_get).

#### Exercises 2.2

1. Please try to extract information from this endpoint for the current week, producing the __top 50__ most listened to artists of the week.
2. Try to retrieve these charts for the most recent 4 weeks. 

#### Solutions

In [None]:
# Q1
requests.get('https://api.music-to-scrape.org/charts/top-artists?limit=50').json()

In [None]:
# Q2: Fetching Top Artist Charts Data Over Multiple Iterations

# Initialize 'unix' variable to 0 and 'counter' to 1.
unix = 0
counter = 1

# Create an empty list 'charts' to store top artist charts data.
charts = []

# Continue the loop as long as 'counter' is less than or equal to 5.
while counter <= 5:
    # Print the current iteration number.
    print(f'Iteration {counter}...')
    
    # Check if 'unix' is 0; if so, request top artist charts data without a timestamp.
    if unix == 0:
        data = requests.get('https://api.music-to-scrape.org/charts/top-artists?limit=50').json()
    else:
        # If 'unix' is not 0, include the 'unixtimestamp' parameter in the request.
        data = requests.get(f'https://api.music-to-scrape.org/charts/top-artists?limit=50&unixtimestamp={unix}').json()

    # Append the retrieved charts data to the 'charts' list.
    charts.append(data)

    # Update 'unix' to the 'unix_end' value from the retrieved data, incremented by 1.
    unix = data['unix_end'] + 1
    
    # Increment 'counter' to track the number of iterations.
    counter = counter + 1

    # Add a time delay of 0.5 seconds (500 milliseconds) between iterations.
    time.sleep(0.5)


In [None]:
# look at the data
charts
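
The solution above takes the next timestamp from the `unix_end` value of the previous response. A different way to think about it, sketched below, is to compute the timestamps yourself, one week apart. This rests on an assumption that is not confirmed here: that the endpoint returns the chart for the week containing the given `unixtimestamp`. Please check the documentation before relying on it. The helper name `weekly_timestamps` is made up for this illustration.

```python
import time

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def weekly_timestamps(weeks, start=None):
    """Return `weeks` Unix timestamps, one week apart, newest first."""
    if start is None:
        start = int(time.time())
    return [start - i * SECONDS_PER_WEEK for i in range(weeks)]

# A fixed start value is used here so that the output is reproducible
for ts in weekly_timestamps(4, start=1_700_000_000):
    print(f"https://api.music-to-scrape.org/charts/top-artists?limit=50&unixtimestamp={ts}")
```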

---

## 3. After-class exercise: Building an API data extraction

Up to now, we've written functions that each carry out a separate task: getting usernames, retrieving a user's number of plays, or retrieving the weekly charts.

However, when we use APIs for research, we are not so much interested in the results of "single-shot" API requests, but we would like to obtain a *copy* of the entire data so that we can analyze it later.

So, the purpose of this section is to "stitch" together individual API requests. For now, we assume that we are interested in studying how music consumption behavior differs across a representative set of users of music-to-scrape.

For this purpose, we
- obtain a list of 100 users who are registered at the site, 
- subsequently get their total number of plays on the platform, and
- store the information in a CSV (!) file (which can be used for analysis).

Before starting, check out the snippet below that converts a list of JSON dictionaries to a CSV file.


In [None]:
import pandas as pd

# Create a list of dictionaries with data about artists and their plays.
data = [
    {'Artist': 'Adele', 'Genre': 'Pop', 'Plays': 20000000},
    {'Artist': 'Ed Sheeran', 'Genre': 'Pop', 'Plays': 18000000},
    {'Artist': 'Taylor Swift', 'Genre': 'Country', 'Plays': 15000000},
    {'Artist': 'Drake', 'Genre': 'Hip-Hop', 'Plays': 22000000},
    {'Artist': 'Beyoncé', 'Genre': 'R&B', 'Plays': 16000000}
]

# Convert the list of dictionaries into a pandas DataFrame.
df = pd.DataFrame(data)

# Specify the name of the CSV file where you want to save the data.
csv_file = 'artists_and_plays.csv'

# Use the to_csv method to save the DataFrame as a CSV file.
df.to_csv(csv_file, index=False)

print(f'Data about artists and plays has been saved to {csv_file}.')
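
If pandas is not available, Python's built-in `csv` module offers an alternative. The sketch below writes two of the example rows from the snippet above; it writes into an in-memory text buffer (`io.StringIO`) rather than a file, purely so that the result can be printed directly.

```python
import csv
import io

# Two of the example rows from the snippet above
data = [
    {"Artist": "Adele", "Genre": "Pop", "Plays": 20000000},
    {"Artist": "Drake", "Genre": "Hip-Hop", "Plays": 22000000},
]

# An in-memory buffer that behaves like an opened text file
buffer = io.StringIO()

# DictWriter maps dictionary keys to CSV columns in the order given by 'fieldnames'
writer = csv.DictWriter(buffer, fieldnames=["Artist", "Genre", "Plays"], lineterminator="\n")
writer.writeheader()
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to an actual file instead, you could replace the buffer with `open('some_file.csv', 'w', newline='')` (the file name here is just a placeholder).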


#### Solution


In [None]:
# Q1: Fetching User Data with Pagination

# Import the 'requests' library for making HTTP requests and the 'time' library for adding a delay.
import requests
import time

# Create an empty list 'all_users' to store user data.
all_users = []

# Initialize 'offset' to 0 to start fetching data from the beginning.
offset = 0

# Continue fetching data in a loop until the total number of retrieved users reaches 100.
while len(all_users) < 100:
    # Print a message indicating the current offset being used for the request.
    print(f'Getting info with offset {offset}...')
    
    # Send an HTTP GET request to the API endpoint with the current 'offset' value.
    # The response is parsed as JSON data.
    new_users = requests.get(f'https://api.music-to-scrape.org/users?offset={offset}').json()
    
    # Extend the 'all_users' list with the user data from the current response.
    all_users.extend(new_users['data'])
    
    # Update 'offset' by adding the number of retrieved users in the current response.
    offset = offset + len(new_users['data'])
    
    # Add a time delay of 0.5 seconds (500 milliseconds) between requests to avoid overloading the API.
    time.sleep(0.5)


In [None]:
# Q2: Updating User Data with Play Counts

# Create an empty list 'all_users_updated' to store user data with play counts.
all_users_updated = []

# Iterate through each 'user' in the 'all_users' list.
for user in all_users:
    # Send an HTTP GET request to the API to retrieve play data for the current 'user'.
    # Note: the username is accessed via user["username"], because 'user' is a dictionary.
    plays = requests.get(f'https://api.music-to-scrape.org/user/plays?username={user["username"]}').json()
    
    # Add a 'plays' field to the 'user' dictionary and store the play counts from the response.
    user['plays'] = plays['plays']
    
    # Append the updated 'user' dictionary to the 'all_users_updated' list.
    all_users_updated.append(user)
    
    # Add a time delay of 0.2 seconds (200 milliseconds) to avoid overloading the API.
    time.sleep(0.2)


In [None]:
# Q3: Saving User Data to a CSV File

# Import the 'pandas' library for working with DataFrames.
import pandas as pd

# Convert the list of dictionaries ('all_users_updated') into a pandas DataFrame.
# Each dictionary represents user data, and the DataFrame will have columns based on dictionary keys.
df = pd.DataFrame(all_users_updated)

# Specify the name of the CSV file where you want to save the user data.
csv_file_name = 'user_data.csv'

# Use the 'to_csv' method to save the DataFrame as a CSV file.
# Setting 'index=False' excludes the DataFrame index from the CSV output.
df.to_csv(csv_file_name, index=False)

# Print a confirmation message indicating that the user data has been saved to the CSV file.
print(f'User data has been saved to "{csv_file_name}".')


## 4. Wrap-up

Good job - you've made it!

After working on this set of exercises, you should be able to further explore the API of music-to-scrape on your own. 

In particular, we can use our new skills to extract data, store them in JSON or CSV files, and then analyze them (but, recall, analyzing web scraped/API data is not part of this course!).

So...
- which user consumes most?
- which artist is the most popular one?
- which tracks have been trending recently...?

Think about what data is required to answer such questions, and then try to extract it.

Keep in mind that this tutorial has only scratched the surface of what's possible with APIs. In real-life scenarios, authentication is often required, and each API may have its own authentication methods. While we didn't cover authentication here, you do have a solid foundation for working with APIs at this stage.

So, go ahead and have fun exploring APIs, extracting data, and unleashing the power of data analysis!