# Python Bootcamp (in-class)

*1 February 2024*

## Learning Objectives

Students will be able to:
* Locally launch Jupyter Notebook and Visual Studio Code
* Know when it's helpful to use Google Colab in the cloud
* Understand basic programming concepts and their applications to collecting web data

<div class="alert alert-block alert-info"><b>Support Needed?</b>
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>

------

## 1. Welcome to Python!

Not installed yet? Follow the [installation instructions here!](https://tilburgsciencehub.com/get/python).


### 1.1 Why we use Python...
1. Multi-purpose (web server, web scraping, automating, machine learning)
2. High-level, relatively easy to learn
3. Great documentation
4. Open source / free
5. Widely used in business/data science
6. Platform-independent

### 1.2 Differences between Python, Anaconda, Jupyter Notebook, and Google Colab

- There are many Python versions floating around on the web...
    - Python vs. "Notebooks" (e.g., running code)
    - (Self-)installed Python distribution vs. Anaconda (e.g., getting it "right")
    - Local vs. cloud setups (e.g., installing packages vs. flexibility to run quickly)

__&rarr; For now, use Jupyter Notebook (via Anaconda) on your computers, or Google Colab in the cloud__.

-------------

## 2. Launching Python and getting to know the interface

### 2.1 Launching Jupyter Notebook locally

- Anaconda Navigator
- Command prompt/terminal (Mac users: once Anaconda is installed, your terminal *is* in Anaconda mode; Windows users: start Anaconda prompt explicitly)
- Why Jupyter Notebook runs in your browser (and *also* in your terminal)
- Finding and opening `.ipynb` files
- Closing Jupyter Notebook

<div class="alert alert-block alert-info"><b>Tip:</b>
The terms command prompt (Windows) and Terminal (Mac, Linux) are used interchangeably.
</div>


__Exercise 2.1__

Download this notebook from the course website as a `.ipynb` file, save it on your computer, launch Jupyter Notebook on your computer, and open the file.


### 2.2 Getting to know the Jupyter Notebook interface
- Code vs. markdown cells
- Running cells
- [Markdown formatting](https://www.markdownguide.org/cheat-sheet/)


__Exercise 2.2__

- Create a new Jupyter Notebook and create some content.
    - First, add a markdown cell, in which you format "Exercises" as a first-order title using markdown (using `#`), followed by your name and email address as regular text.
    - Second, add a code cell, in which you type `message = "Hello world"`.
    - Third, add another code cell, in which you type `print(message)`.
- Then, run all cells.
- Save the notebook as `my_exercise.ipynb`.

### 2.3 Launching Google Colab

- Discuss benefits and drawbacks of cloud-based Jupyter Notebooks (e.g., use of `selenium` and `Chrome`)
- [Launching Google Colab](https://colab.research.google.com)
- Connecting Google Colab to your Google Drive; sharing and collaborating on code

__Exercise 2.3__
- Import `my_exercise.ipynb` in Google Colab, run it, and create a sharing link. Invite the student next to you.

### 2.4 Launching a Python editor (e.g., Visual Studio Code)
- Source code (vs. code cells)
- Comments (vs. markdown cells)
- File extensions (`.ipynb` vs. `.py`)
- Alternative editors (e.g., PyCharm, Spyder, ...)
- Launching `.py` files "in production" from the terminal: `python your_sourcecode.py`

__Exercise 2.4__

- Create the following `my_code.py` file, with the following content:
    
    ```
    message = "Hello world"
    print(message)
    ```

- Run the file from the terminal by typing `python my_code.py`.

### 2.5 Installing packages and using help on the web

- Installing packages __via the terminal__: `pip install <packagename>`
- Importing packages into Python __via Python__: `import <packagename>`

__IMPORTANT:__ 
- `pip` commands run in the terminal - not IN Python.
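
Once a package has been installed via the terminal, you can check from *within* Python whether it is available. The snippet below is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util

def is_installed(package_name):
    # find_spec returns None when Python cannot locate the package
    return importlib.util.find_spec(package_name) is not None

print(is_installed("json"))    # json ships with Python itself
print(is_installed("pandas"))  # depends on whether `pip install pandas` was run in the terminal
```

If the check fails for a package you have just installed, the terminal and the notebook may be using different Python installations.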

__Exercise 2.5__

Pandas is a really popular package for working with data in Python. Can you quickly search the web for *how to install it*, and then actually install it on your computer and test whether it runs?


<div class="alert alert-block alert-info"><b>Tip:</b>
I frequently use <a href="https://chat.openai.com">ChatGPT</a> when I'm stuck with Python. Yet, it doesn't get everything right immediately, so using it requires extra care (and experience!). I also sometimes cross-check with alternative websites where code was actually tested. For example, you could try <a href="https://stackoverflow.com/questions/">Stack Overflow</a> for debugging and finding reliable answers. Here in Tilburg, we're also developing <a href="https://tilburgsciencehub.com">Tilburg Science Hub</a>, which has many good code snippets to use.
</div>



### 2.6 Why Jupyter Notebook sucks (but still we use it...)

- Danger of point-and-click (and benefits of top-down execution)
- High overhead (vs. lean `.py` files)
- Support of advanced tutorials and packages (limited `selenium` support)

<div class="alert alert-block alert-info"><b>Tip:</b>
You can mimic top-down execution in Jupyter Notebook by restarting the kernel (Kernel --> Restart), and executing your cells (Cells --> Run all).
</div>

Ultimately, the benefits of Jupyter Notebook for education outweigh the drawbacks.

---------

## 3. Coding concepts for web data



### 3.1 The web data workflow

#### Planning, designing and executing the data collection (Boegershausen et al. 2022)

1. Select data sources
2. Design data collection
    - Import data from the web into Python ("how to import web data into Python?")
    - Select relevant data from raw HTML files or the output of APIs ("how to select and filter data in Python?")
    - Store data in tables or databases ("how to store data using Python?")
3. Execute data collection
    - Schedule the data extraction and monitor its health ("how to schedule Python scripts?")
    

<div class="alert alert-block alert-info"><b>Tip:</b>
Curious about the "real" web data workflow (which is much more comprehensive than what is shown here)? Start getting familiar with it early on by reading <a href="https://journals.sagepub.com/doi/10.1177/00222429221100750">"Fields of Gold"</a> (you've got to know this paper inside out by the end of this course...).
</div>

#### Implementation specifics (Guyt et al. 2024)

Usually, web scrapers consist of four "shells" that are gradually developed to complete a web scraping project. Next to selecting data sources, researchers
1. *extract*: Build computer code to extract the data from the website
2. *loop*: Build a loop to extract data for multiple products, users, etc.
3. *schedule*: Schedule the data collection, e.g., to run every day or week
4. *infrastructure*: Run scrapers on local (i.e., your own) or remote (e.g., in the cloud) infrastructure.


### 3.2 Variable types: Strings and Numbers
- strings (`message = 'Hello world!'` and `message2 = "This is a tutorial!"`) vs. numbers (`age = 25`)
- joining/concatenating strings (`message + message2`)
- calculating with numbers (`age + 1`)
- joining strings and numbers (+ conversion) (`message + str(age)`)
- printing numbers and strings
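
The bullet points above can be tried out in a single cell. This is a small sketch that reuses the example values from the list; the extra space added between the strings is a choice made here for readability.

```python
message = 'Hello world!'
message2 = "This is a tutorial!"
age = 25

print(message + ' ' + message2)   # joining two strings (with a space in between)
print(age + 1)                    # calculating with a number
print(message + ' ' + str(age))   # str() converts the number to text before joining

# message + age would raise a TypeError: Python does not join text and numbers implicitly
```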

__Exercise 3.1__

Please write some Python code that stores your name in a variable called `name`, and your age in a variable called `age`. Then, print the following to the screen: "My name is `<YOUR NAME>` and I am `<AGE>` years old.".

__Solution__

In [1]:
name = "Jeroen"
age = 25
message = f"My name is {name} and I am {age} years old."
# alternatively: "My name is " + name + " and I am " +str(age) + " years old."
# You can also use the str() function to convert numbers to printable text!

print(message)

My name is Jeroen and I am 25 years old.


### 3.3 Crawling data using the `reddit.com` API



Obviously, this class is about retrieving data from the Internet. As a toy example, let us use the social network Reddit. Reddit has an amazingly accessible API that we can use to extract some data.

<div class="alert alert-block alert-info"><b>APIs vs. web scraping</b><br>
APIs are official interfaces by firms to offer data, code or functionality to their users. Think about the Twitter/X API that can be used by stockbrokers to train trading algorithms on Twitter/X data. Researchers can use APIs to officially tap into data sources of firms or organizations.
Web scraping, in turn, refers to extracting information from *any* website or app on the internet. Typically, web scraping is limited to publicly available information (i.e., information that you could also see in your browser).
</div>

Do not worry about the exact details of the code below; we are going to cover those in later lectures. Right now, please pay attention to the bigger picture: Retrieving data from a particular URL/endpoint (here, `reddit.com/r/university/about.json`), and retrieving particular information from that data (here, `display_name` and `title`).

__Exercise 3.2__

- Open [Reddit.com](https://reddit.com) and find the "University" subreddit - browse it.
- Copy the code below to Jupyter Notebook, and execute it.
- Then, change the subreddit to one of your choice in the source code (i.e., search for something you find interesting on [Reddit.com](https://reddit.com)), and rerun the cell.


In [2]:
# load packages
import requests # for making http requests
import json # for dealing with JSON data

# define a few variables
subreddit = 'University'
url = 'https://www.reddit.com/r/' + subreddit + '/about.json'

# make web request with a header ("to identify ourselves"), convert to JSON data so we can query the object
content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()

# printing a status message
print('Getting data from...', url)

# printing some data
print('Subreddit name:', content['data']['display_name'])
print('Subreddit title:', content['data']['title'])

Getting data from... https://www.reddit.com/r/University/about.json
Subreddit name: University
Subreddit title: University: Academic and real-world news for students, faculty, and academics
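
The expression `content['data']['display_name']` may look cryptic at first. It simply looks up a value in a dictionary that sits inside another dictionary. The sketch below works without an internet connection: it uses a hand-made stand-in for the API response that only contains the two values printed above (a real response holds many more fields), and the variable name `inner` is chosen just for this illustration.

```python
# hand-made stand-in for the parsed API response; a real response has many more fields
content = {
    'data': {
        'display_name': 'University',
        'title': 'University: Academic and real-world news for students, faculty, and academics'
    }
}

inner = content['data']          # first step: the dictionary stored under 'data'
print(inner['display_name'])     # second step: one value inside it
print(content['data']['title'])  # both steps chained in one expression
```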


### 3.4 Reusing code with functions

- Writing functions can drastically simplify code execution (avoiding copy-paste errors)
- Functions start with `def`, and can (but need not) have inputs ("arguments") and outputs ("returns")
- Functions require a "hierarchy", as visualized with indents (same number of spaces or a tab)

    
<div class="alert alert-block alert-info"><b>Tip:</b>
Most novices don't get the indents right. Especially when copy-pasting code from the web, you end up with an inconsistent number of spaces and/or tabs, while Python requires you to always use the same number of spaces or tabs. Be aware of this bottleneck when coding! Check out <a href="https://www.youtube.com/watch?v=YL-CimZO_FE" target="_blank">this YouTube video for some easy explanation about indentation!</a>
</div>
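
The points above (`def`, arguments, returns, and indentation) can be sketched with two tiny functions. Both function names are invented for this example and do not require any packages.

```python
def make_greeting(name):
    # everything indented by four spaces belongs to the function
    greeting = 'Hello, ' + name + '!'
    return greeting  # hands the result back to the caller

def print_separator():
    # no arguments and no return value: this function only prints
    print('----------')

print_separator()
print(make_greeting('Tilburg'))
```

Note that the last two lines are not indented, so they are not part of any function; they *call* the functions defined above.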

__Exercise 3.3__

- Copy the following function into your Jupyter Notebook, and run the cell.

In [3]:
def get_reddit_data(subreddit):
    url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
    content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()
    print('Getting data from...', url)
    print('Subreddit name:', content['data']['display_name'])
    print('Subreddit title:', content['data']['title'])

- Then, run the function, by writing `get_reddit_data("university")` in a new cell, and running that cell. Does it work?
- Finally, call this function for three subreddits of your choice (and write the code for it & run it!).
- What are the inputs and outputs of the `get_reddit_data` function?

__Solution__

In [6]:
get_reddit_data("university")
get_reddit_data("skateboarding")

# the inputs are: subreddit name; no outputs are RETURNED as data, but information is printed to the screen.

Getting data from... https://www.reddit.com/r/university/about.json
Subreddit name: University
Subreddit title: University: academic and real-world news for students, faculty, and academics
Getting data from... https://www.reddit.com/r/skateboarding/about.json
Subreddit name: skateboarding
Subreddit title: Skateboarding


### 3.5 Returning and saving data

At this stage, you have learned to write functions that print information to the screen. In many circumstances, though, it is better to write functions that return data. For example, imagine that our function above returned data (e.g., the name of the subreddit) that we could later write to a data file and analyze.

To be able to return data rather than merely printing information to the screen, we use the `return` command at the end of the function.

For example, we can modify the function above to return a subreddit's number of active users (`active_user_count`).



In [12]:
def get_reddit_info(subreddit):
    url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
    content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()
    return content['data']['active_user_count']

In [13]:
get_reddit_info("skateboarding")

105

Observe that, rather than having a lot of text printed to the screen, the only data returned is the number of active users.

We can also store this information in a new variable, e.g.,

In [14]:
info = get_reddit_info("skateboarding")
print(info)

105


<div class="alert alert-block alert-info"><b>Tip:</b>
Many of us will struggle with picking good variable names. Typically, we should always use lowercase letters, and names that have a meaning. Calling something info1, info2, and info3 is not very meaningful. Read these tips about <a href="https://geo-python.github.io/site/notebooks/L1/gcp-1-variable-naming.html" target="_blank">naming variables and other coding conventions</a>.
</div>

At this point, our function has just returned one data point. But what if we want to return multiple data points, such as a display name, the number of subscribers, etc.?

For returning more complex data, we make use of the JSON format, which Python represents as a dictionary. It consists of attribute-value pairs, e.g., `{'name': 'student', 'age': 25}`.

Run the code below to see how you can store information in this file format, and subsequently query it!


In [17]:
# store data in dictionary
my_data = {'name': 'jeroen', 'age': 25}
# retrieve data
name = my_data.get('name')
age = my_data.get('age')
# print message
message = f"I am {name} and I am {age} years old."
print(message)

I am jeroen and I am 25 years old.


__Exercise 3.4__

Please modify the `get_reddit_data` function to return a dictionary, holding the `display_name` and `title`, and store the dictionary in a variable called `output`.

__Solution__

In [18]:
def get_reddit_data(subreddit):
    url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
    content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()
    my_data = {'display_name': content['data']['display_name'],
               'title': content['data']['title']}
    print('Getting data from...', url)

    return my_data


In [19]:
output = get_reddit_data('skateboarding')
output

Getting data from... https://www.reddit.com/r/skateboarding/about.json


{'display_name': 'skateboarding', 'title': 'Skateboarding'}

### 3.6 Arrays and looping
- In web data, you typically want to get data on MORE than one page/endpoint (e.g., subreddit).
- Copy-pasting -- even simple function calls -- is to be avoided (copy-paste mistakes, code complexity)
- To hold multiple records of information, we can use so-called __arrays__ (in Python, they are called *lists*). Think of them as lists of multiple values, e.g., subreddit names `subreddits = ['university','marketing','sports']`
- We can now make use of a process called "looping", i.e., iterating through these lists. Combining arrays ("the starting point") with loops (to repeat some action on them) is very powerful! See the snippet below, which first defines a list of subreddit names, and then prints them to the screen.

In [20]:
subreddits = ['university','marketing','sports']

for item in subreddits:
    print(item)

university
marketing
sports


__Exercise 3.5__

- Now that you know what a loop is, can you modify the code-snippet above, to repeatedly execute the `get_reddit_data()` function on the subreddits?

- Test the function --- does it work? Where does the output go?

__Solution__


In [21]:
subreddits = ['university','marketing','sports']

for item in subreddits:
    get_reddit_data(item)

# the data is nowhere to be found - it's not saved anywhere.

Getting data from... https://www.reddit.com/r/university/about.json
Getting data from... https://www.reddit.com/r/marketing/about.json
Getting data from... https://www.reddit.com/r/sports/about.json
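
One way to keep what a function returns inside a loop is to append each result to a list. The sketch below illustrates the idea without making any web requests: `get_fake_data` is a hypothetical stand-in for `get_reddit_data`, and the values it returns are made up.

```python
def get_fake_data(subreddit):
    # hypothetical stand-in for get_reddit_data(): no web request, made-up return value
    return {'display_name': subreddit, 'name_length': len(subreddit)}

subreddits = ['university', 'marketing', 'sports']

results = []  # empty list that will hold one dictionary per subreddit
for item in subreddits:
    results.append(get_fake_data(item))  # keep the returned dictionary instead of discarding it

print(len(results))
print(results[0])
```

The collected data only lives in memory, though, and disappears once the notebook is closed.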


### 3.7 Saving text in "flat files"

As you have seen above, while the function is repeatedly executed, the data is nowhere to be found. So, at this stage, we are going to learn how we can actually write information to a file. For this, we will make use of a file writing process, consisting of the commands `open`, `write`, and `close`. Study the code snippet below, which saves some output to a file. Try to locate this file on your disk and open it!


In [None]:
f = open('filename.txt', 'w', encoding = 'utf-8') # open new file for writing
f.write('Hello world!\n') # write content and NEW LINE character to file ('\n')
f.close() # close file

A few more explanations follow.

First, observe the `'w'` in the `open` call above. This indicates that we open the file for writing. Files can also be opened just for reading (`'r'`) or for appending (`'a'`): whereas `'w'` overwrites the file every time it is executed, `'a'` adds new information to the end of the file.

Second, we must be aware of file encodings. Back in the day, computers only understood English characters: no umlauts in German and no accents in French. Because we are extracting web data, we are highly likely to encounter non-English text. So it is safest to use the most universal encoding, which is called `utf-8`. This is important not only for writing data, but also for reading it later.

Third, observe that we have chosen a file extension to be `.txt`. This is just a convention to indicate to users what type of data format we have used. But, it really doesn't matter what you call it. You can leave it out, call it `.json`, or anything else.

Fourth, note that we are writing to a flat file, which can be opened in any text editor. To start new lines, we use special characters. Above, we have used `"\n"` to start a new line. Another special character is `"\t"`, which inserts a TAB character.
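
To see the reading mode, the encoding, and the special characters in action, consider the sketch below. It writes two lines to a file and reads them back. The file name `example.txt`, the temporary folder, and the sample text are all chosen for this illustration only.

```python
import os
import tempfile

# write to a temporary folder so that no file on your computer is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

f = open(path, 'w', encoding = 'utf-8')  # 'w': create (or overwrite) the file
f.write('name\tcity\n')                  # '\t' separates the two words, '\n' ends the line
f.write('Zoë\tZürich\n')                 # non-English characters are stored safely with utf-8
f.close()

f = open(path, 'r', encoding = 'utf-8')  # 'r': open the same file for reading
text = f.read()
f.close()

print(text)
```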

__Exercise 3.6__

Use the code (`f = open`...) from the cell above to create a new cell below, and write some code that appends `"I am working on a tutorial"` to the `filename.txt` file.

__Solution__

In [22]:
f = open('filename.txt', 'a', encoding = 'utf-8') # note we have used `a` to append here!
f.write('I am working on a tutorial.\n')
f.close()

# open the file now and verify it has worked!

### 3.8 Storing dictionaries in new-line separated JSON files

Now, we know how to write text to a file. But how can we save JSON objects in a text file?

To do so, we first convert each JSON object to simple text. We then store the objects in a text file, one per line. This very popular file format is called "new-line separated JSON".

To be able to use it, we first need to learn how to convert JSON data into "flat" and textual information. For this, we use the `json` library.


In [23]:
my_datapoint = {'name': 'student', 'age': 25}

import json

json.dumps(my_datapoint)

'{"name": "student", "age": 25}'

In the code above, the `json.dumps` command takes a JSON object, and converts it into a character string (indicated by `'`).

<div class="alert alert-block alert-info"><b>Tip:</b>
    Have you noticed we have imported the json library to make use of the <code>json.dumps</code> function? Typically, all necessary packages are imported at the top of your script (which we have not done here for didactic reasons only). Just remember it when working on your actual projects!
</div>
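
The conversion also works in the opposite direction: `json.loads` turns a character string back into a dictionary. A minimal sketch of the round trip, with the variable names `as_text` and `restored` chosen for this example:

```python
import json

my_datapoint = {'name': 'student', 'age': 25}

as_text = json.dumps(my_datapoint)  # dictionary -> character string
restored = json.loads(as_text)      # character string -> dictionary

print(type(as_text))
print(restored['age'] + 1)  # 'age' is a number again, so we can calculate with it
```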

__Exercise 3.7__

Please write some code that *saves* the variable content of `my_datapoint` to a flat `json` file, called `my_data.json`.

__Solution__


In [None]:
import json
my_datapoint = {'name': 'student', 'age': 25}
f = open('my_data.json', 'w', encoding = 'utf-8')
f.write(json.dumps(my_datapoint))
f.close()

### 3.9 Tying things together

Now it's your turn. Use the concepts from above to...

- Create an array, holding ten subreddit names of your choice
- Write a function that returns as a dictionary the following data points from the about page of a subreddit: `display_name`, `title`, `subscribers`, and the date of creation, `created` (e.g., this is the link to the viewable about page for the [subreddit "University"](https://www.reddit.com/r/University/about), and this is the link to the [JSON version of the same page](https://www.reddit.com/r/University/about/.json)).
- Write a loop to retrieve data for the ten subreddits, and store the data in a new-line separated JSON file called `my_first_web_data.json`.

<div class="alert alert-block alert-info"><b>Tips:</b>
    
<ul>
    <li>Did you know you can "look" at the API output directly in Firefox or Chrome? Just open the URL that is called for a particular subreddit in your browser. Try it with <a href='https://www.reddit.com/r/University/about.json'>this one first (click)!</a></li>
  <li>You can use <code>f.write</code> multiple times in your code. To write a new line to the file, use <code>f.write('\n')</code>.</li>
    <li>Please pay attention to where you open the file for the first time, and how (<code>'a'</code> vs. <code>'w'</code>)</li>

    
</ul>
    <br>
Watch the solution to this exercise on <a href='https://youtu.be/WOXUFOolPgI' target="_blank">YouTube</a>.
</div>
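
Once you have a new-line separated JSON file, you will probably want to read it back to verify its content. The sketch below shows one way of doing so. It first creates a small sample file with made-up content in a temporary folder (the file name `sample.json` is an assumption for this illustration), and then converts every line back into a dictionary.

```python
import json
import os
import tempfile

# create a small sample file with made-up content (two JSON objects, one per line)
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
f = open(path, 'w', encoding = 'utf-8')
f.write('{"name": "student", "age": 25}\n{"name": "teacher", "age": 40}\n')
f.close()

# read the file back, converting every line into a dictionary
records = []
f = open(path, 'r', encoding = 'utf-8')
for line in f:
    records.append(json.loads(line))
f.close()

print(len(records))
print(records[1]['name'])
```

To inspect your own data, replace `path` with `'my_first_web_data.json'` and skip the part that creates the sample file.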



__That's it! Good job!__