# Web Scraping Advanced (oDCM)

*We finally move towards some real-world examples: Twitter & Instagram! You'll quickly realize that both websites pose some extra challenges for scraping. We'll show you how to overcome them and build a scraper that you can use for your own project!*


## Learning Objectives

Students will be able to: 
- Make more advanced use of `selenium`, emulating user interaction on a site (e.g., scrolling and filling in forms)
- Access data that is hidden behind a login-screen
- Apply search parameters to obtain subsets of data
- Capture and store images from the web
- Save the retrieved data as tabular files (e.g., CSV)

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


---
## 1. Selenium 

### 1.1 Let's recap: Why Selenium? 

In the Web Scraping 101 tutorial, we mainly used BeautifulSoup to turn HTML into a data structure that we could search and access using Python-like syntax. While it's easy to get started with this library, it has limitations when it comes to dynamic websites, that is, websites whose content changes after each page refresh. Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches a separate web browser window in which all actions are visible, which makes it feel more intuitive. For example, the video below launches a regular Google Chrome window and visits [`instagram.com`](https://www.instagram.com). This browser window behaves as usual, so you can click on buttons and fill out fields. Yet you can distinguish it from your normal web browser by the banner that indicates that Chrome is being controlled by automated test software. Before you can try it out yourself, you need to install some additional software, which we'll explain next. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/selenium_instagram.gif" align="left" width=70%/>

### 1.2 Installing Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>



In [None]:
# Installing and starting up Chrome using Webdriver Manager
!pip install webdriver_manager
!pip install selenium

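from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Opening an example site (e.g., Twitch)
# Selenium 4 expects the path to the driver to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))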

url = "https://twitch.tv/"
driver.get(url)

If everything went smoothly, your computer opened a new Chrome window and loaded `twitch.tv`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open up.
    
Whenever you switch pages, just manually open that page in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>


### 1.3 Access Dynamic Sites Programmatically

**Importance**  
Next, we're going to tell the browser to visit the Tilburg University Twitter account. We call the `driver` object we created above and use its `get` method, to which we pass the URL of the page we'd like to scrape. 

In [None]:
driver.get("https://twitter.com/TilburgU")

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/twitter_tilburgu.png" align="left" width=40%/>


<div class="alert alert-block alert-info"><b>Access Twitter via Web Scraping?</b> 
Recall that Twitter also offers an API for authorized, large-scale data access. Although we're using Twitter below as an example for a dynamic website, we don't want to imply it's the best data extraction method! Always critically reflect on whether the data is best obtained using scraping or APIs! More on this? See challenge #1.2 in "Fields of Gold"!
</div>


**Let's try it out!**  
As most information can only be obtained once you're signed in, manually log in to your Twitter account through the driver page (create a new account if you don't have one yet, or if you don't want to run the risk of getting blocked on your personal account). 

From this point, we can use BeautifulSoup as we learned previously, though we create the `res` object from the `driver` object this time. 

In [None]:
# make sure to login to your Twitter account first!
from bs4 import BeautifulSoup
res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")

Once you inspect the HTML code of the Twitter page you'll discover that the class names are more complex than the ones we looked at earlier. Take a look at the gigantic class name of the Twitter bio, for example...

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/twitter_bio.png" align="left" width=60%/>

... and more importantly: these class names are dynamic. If you open up the Inspector and look at the class name right now, it is likely to be different from the one above (check for yourself!). Therefore, you may want to look out for attributes that are not subject to change, such as the `data-testid` attribute.

In [None]:
soup.find(attrs={"data-testid": "UserDescription"}).text

**Exercise 1.1**  
Using the same approach as above, extract (i) the number of followers, (ii) the location, and (iii) the join date of the [TilburgU](https://twitter.com/tilburgU) Twitter account. Tip: use the Inspector in Google Chrome to determine an appropriate navigation strategy. Take the dynamic nature of class names into consideration, so look for ways to navigate the source code without relying on these temporary classes.

In [None]:
# your answer goes here!

In [None]:
# solution
followers = soup.find(attrs = {"href": "/TilburgU/followers"}).find_all('span')[1].text
location = soup.find(attrs={"data-testid": "UserProfileHeader_Items"}).find_all('span')[1].text
join_date = soup.find(attrs = {"data-testid": "UserJoinDate"}).find_all('span')[0].text

print(f"Followers: {followers} \nLocation: {location} \nJoin date: {join_date}")

### 1.4 Scroll Sites Programmatically

**Importance**  
In a similar way, we can scrape the content of the most recent tweet as follows: 

In [None]:
# 1st tweet
soup.find_all(attrs={"data-testid": "tweet"})[0].find_all(attrs={"dir": "auto"})[4].text

And for older tweets we simply increment the counter by one: 

In [None]:
# 2nd tweet
soup.find_all(attrs={"data-testid": "tweet"})[1].find_all(attrs={"dir": "auto"})[4].text

Easy, right? Not so fast... From the 10th tweet onwards (in your case it may be a different number, depending on screen size, resolution, etc.), the code returns an `IndexError: list index out of range`. This is because Twitter only loads more tweets once you scroll down the page. 

In [None]:
# 10th tweet
soup.find_all(attrs={"data-testid": "tweet"})[9].find_all(attrs={"dir": "auto"})[4].text

Therefore, we need to scroll down to the bottom of the page if we want to obtain more than a few tweets. Every time you run the cell below, it loads another 5-10 tweets.  

In [None]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

**Let's try it out!**  
Try running the cell above a couple of times. What happens to the most recent tweets? If you rerun the cells that extract the 1st, 2nd, and 10th tweet, does the output change? 

Indeed, we need to recreate the `res` object after each iteration because the HTML code changes once you scroll down (older tweets are added and newer ones are hidden). The number of tweets in the view varies depending on the type of media (e.g., images take up more space than text). Therefore, we first determine the number of tweets in the current view to make sure we capture all of them. After we have stored the last tweet in the view, we scroll down the page and start all over again.  

In [None]:
from time import sleep
tweets = []

for _ in range(5):
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    
    # total number of tweets in current view
    num_tweets_view = len(soup.find_all(attrs={"data-testid": "tweet"}))
        
    # add tweets to list
    for counter in range(num_tweets_view):
        tweets.append(soup.find_all(attrs={"data-testid": "tweet"})[counter].find_all(attrs={"dir": "auto"})[4].text)
    
    # scroll down the page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    
    # pause for 1 second
    sleep(1)
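
Because consecutive views can overlap (some tweets are still in the HTML after scrolling), the loop above may append the same tweet more than once. Below is a minimal sketch of one way to drop repeats while keeping the original order. The helper name `deduplicate` and the example strings are our own and only serve as an illustration.

```python
def deduplicate(items):
    """Return the items without repeats, keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


example = ["tweet A", "tweet B", "tweet A", "tweet C", "tweet B"]
print(deduplicate(example))
```

After scraping, you could call `tweets = deduplicate(tweets)` before processing the list any further.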

For pages with infinite loading, often used in social media (e.g., Twitter and Facebook), we can also use a while-loop to keep loading content until we reach the end of the page. For that, we can use the following code snippet:

In [None]:
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Pause for one second
    sleep(1)

    # Recalculate scroll height and break the loop if the end is reached, otherwise continue scrolling
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
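
On feeds that are practically endless, a loop like the one above can run for a very long time. The sketch below wraps the same logic in a function and adds an upper limit on the number of scrolls. The function name `scroll_to_bottom` and its parameters `pause` and `max_scrolls` are our own choices, not part of Selenium; the function only relies on the `execute_script` method of the `driver` object.

```python
from time import sleep


def scroll_to_bottom(driver, pause=1.0, max_scrolls=50):
    """Scroll down until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0

    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(pause)  # give the site time to load new content

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height

    return scrolls
```

For example, `scroll_to_bottom(driver, pause=1, max_scrolls=20)` stops after 20 scrolls at the latest, even if the page keeps loading new content.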

**Exercise 1.2**
1. What happens once you first scroll down the page and then run the cell with "tweets = [ ]"? Does `tweets` differ? Why? 
2. Estimate how many times you would need to scroll in order to capture all tweets (tip: you find the total number of tweets at the top). By the way, there's no need to collect all tweets!
3. Write a function `process_tweets()` that takes a list of `tweets` as input and writes a CSV file that contains the original tweet, a list of mentions (e.g., `@gemeentetilburg`), and a list of hashtags (e.g., `#makeitintilburg`). We call this "parsing" (i.e., converting raw data, here the tweet text, into structured fields that we can store in a CSV file). Tip: you may first want to split each tweet into a list of words and work from there. When is a word considered a hashtag? And a mention? How about punctuation? Test your function with the list of `tweets` above. 

In [None]:
# your answer goes here!

**Solution**  
1. The scraper will start from the current view. Since more recent tweets are hidden as you scroll down, the scraper would skip the first few tweets in that case. 
2. Scrolling down five times yielded 56 tweets in my case (about 11 tweets per scroll), so capturing all 3,986 tweets would take about 3986/56 ≈ 71 rounds of five scrolls, or roughly 356 scrolls in total. Again, the answer may differ for your machine.
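
The estimate for question 2 can be written out as a short calculation. The numbers are the ones from the solution above and will differ on your machine. Note that 3986/56 counts rounds of five scrolls, so the number of individual scrolls is five times larger.

```python
import math

total_tweets = 3986      # total number of tweets, taken from the solution above
tweets_collected = 56    # tweets captured in the trial run
scrolls_in_trial = 5     # number of scrolls in that trial run

tweets_per_scroll = tweets_collected / scrolls_in_trial        # 11.2
scrolls_needed = math.ceil(total_tweets / tweets_per_scroll)   # round up to whole scrolls

print(scrolls_needed)  # 356
```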

In [None]:
# Question 3
import pandas as pd

def process_tweets(tweets):
    output = []
    
    for tweet in tweets: 
        mentions = []
        hashtags = []
        
        # a more elegant solution can be achieved using regular expressions (outside the scope of this course)
        
        # remove punctuation (to avoid #hashtag?, #hashtag!, etc.)
        for character in ["?", ".", ",", "!"]:
            tweet_clean = tweet.replace(character, "")
            tweet = tweet_clean
            
        # separate words chained by an enter with a space (to avoid #hashtag\nABCDEF)
        tweet = tweet.replace("\n", " ")
        
        for word in tweet.split(" "): 
            try: 
                if word[0] == "@" and word.count("@") == 1: 
                    mentions.append(word)
                if word[0] == "#" and word.count("#") == 1: 
                    hashtags.append(word) 
            except: 
                pass
            
        output.append({
            "tweet": tweet,
            "mentions": mentions,
            "hashtags": hashtags
        })
        
    df = pd.DataFrame(output)
    df.to_csv("tweets_mentions_hashtags.csv", index=False)

process_tweets(tweets)
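
The solution above mentions that regular expressions offer a more elegant approach. For the curious, here is a minimal sketch using Python's built-in `re` module. The patterns are a simplification of our own (a `@` or `#` followed by letters, digits, or underscores) and do not necessarily match Twitter's exact rules; `extract_entities` is a hypothetical helper name.

```python
import re

# "@" or "#" followed by letters, digits, or underscores; the lookbehind (?<!\w)
# skips cases such as e-mail addresses (name@example.com)
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")
HASHTAG_PATTERN = re.compile(r"(?<!\w)#\w+")


def extract_entities(tweet):
    """Return the original tweet plus the mentions and hashtags found in it."""
    return {
        "tweet": tweet,
        "mentions": MENTION_PATTERN.findall(tweet),
        "hashtags": HASHTAG_PATTERN.findall(tweet),
    }


example = "Thanks @gemeentetilburg!\n#makeitintilburg, mail us at info@example.com"
print(extract_entities(example))
```

Because the patterns stop at the first character that is not a letter, digit, or underscore, there is no need to strip punctuation or line breaks first, and the original tweet text stays untouched.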

**Exercise 1.3**  
Retrieve the list of all accounts Tilburg University is following on Twitter (see https://twitter.com/TilburgU/following). __Make sure to log in to your Twitter account in the browser that you're using for scraping!__

1. Store their full name, Twitter handle, and biography in a csv-file (`followings.csv`). 
2. Load `followings.csv` and determine the percentage of accounts that use the word `Professor` or `professor` in their bio.

**Solutions**

In [None]:
# Question 1
from time import sleep

def twitter_followings():
    users = []

    # we set the range to 3 here so the code works reasonably fast.
    # in practice, you ideally would like to set this to 10, because 
    # each view contains 10-15 accounts and we know that there are approximately 100 accounts in total
    
    for _ in range(3): 
        res = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(res, "html.parser")
        
        # if you don't specify the primary column it will also scrape the accounts below "Who to follow" (right sidebar)
        data = soup.find(attrs={"data-testid": "primaryColumn"}).find_all(attrs={"data-testid": "UserCell"})

        for counter in range(len(data)):
            user = data[counter].find_all("span")

            full_name = user[1].text
            handle = user[2].text

            # not all users have a bio 
            try: 
                bio = user[5].text
            except: 
                bio = None

            user_data = {"full_name": full_name, 
                          "handle": handle,
                          "bio": bio
                         }
                
            if user_data not in users:  # to filter out potential duplicates
                users.append(user_data)
                
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(5)
         
    df = pd.DataFrame(users)
    df.to_csv("followings.csv", index=False)
              
# don't forget to manually login to your Twitter account to access the following list before proceeding
driver.get("https://twitter.com/TilburgU/following") 

# add a short pause here, or Twitter recognizes that it's a scraper and will block your request!
sleep(5)

twitter_followings()

In [None]:
# Question 2
followings = pd.read_csv("followings.csv")

professor_count = 0 

for row in followings["bio"]: 
    try:  # some users don't have a bio which causes an error
        if "professor" in row.lower():
            professor_count += 1
    except: 
        pass

print(f"The percentage of professors among the accounts Tilburg University follows is {round(professor_count / len(followings) * 100, 1)}%.")
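
The same percentage can also be computed without a `try`/`except` block by explicitly skipping missing bios. This is a minimal sketch; the helper name `percentage_with_keyword` and the example bios are made up for illustration.

```python
def percentage_with_keyword(bios, keyword):
    """Percentage of bios that contain `keyword` (case-insensitive)."""
    if len(bios) == 0:
        return 0.0
    # missing bios are not strings (None, or NaN after pd.read_csv), so we skip them
    hits = sum(
        1 for bio in bios
        if isinstance(bio, str) and keyword.lower() in bio.lower()
    )
    return round(hits / len(bios) * 100, 1)


example_bios = ["Professor of Marketing", None, "PhD student", "Assistant professor", float("nan")]
print(percentage_with_keyword(example_bios, "professor"))  # 40.0
```

With the data from Question 1, the call would be `percentage_with_keyword(followings["bio"], "professor")`.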

Up to now, we have either picked an arbitrary value for the number of scrolls, or - at best - we have approximated the number of times based on the total number of records. An alternative strategy is based on the idea that the current position on the page remains the same if you're already at the bottom of the page and still try to scroll down. Simply put, if scrolling down changes the current position then we're not at the bottom of the page yet. 

With this idea in mind, we can implement this procedure with a `while` loop that remains true as long as we have not reached the end. Once the `current_height` equals the height before scrolling down (`last_height`), we `break` out of the loop and print the total number of scrolls (`scroll_counter`):

In [None]:
scroll_counter = 0 
last_height = 0
driver.get("https://twitter.com/TilburgU")

# running this cell may take a minute or two
while True:    
    current_height = driver.execute_script('return document.body.scrollHeight')
    print("current height: " + str(current_height))

    if current_height == last_height:
        break

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    scroll_counter += 1
    last_height = current_height
    print("last height: " + str(last_height))
    sleep(1)
    
print(f"The number of scrolls required to scrape all tweets is: {scroll_counter}")

**Exercise 1.4**  
Extend your answer to Exercise 1.3 so that the number of scrolls is dynamic. That is, it scrolls the minimum number of times required to capture all followings (not tweets!) and updates its value once new accounts are followed.

In [None]:
# your answer goes here!

In [None]:
# solution
from time import sleep

def twitter_followings_while():
    users = []
    current_height = 0
    last_height = 0
    
    while True: 
        current_height = driver.execute_script('return document.body.scrollHeight')
              
        if current_height == last_height: 
            break
            
        res = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(res, "html.parser")
        data = soup.find(attrs={"data-testid": "primaryColumn"}).find_all(attrs={"data-testid": "UserCell"})

        for counter in range(len(data)):
            user = data[counter].find_all("span")

            full_name = user[1].text
            handle = user[2].text

            # not all users have a bio 
            try: 
                bio = user[5].text
            except: 
                bio = None

            # avoid duplicates
            user_data = {"full_name": full_name, 
                          "handle": handle,
                          "bio": bio
                         }
                
            if user_data not in users: 
                users.append(user_data)

        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')       
        last_height = current_height
        sleep(5)
        
    df = pd.DataFrame(users)
    df.to_csv("followings_dynamic.csv", index=False)
              
# don't forget to manually log in to your Twitter account to access the following list before proceeding (otherwise it will not load the list of followed accounts!)
driver.get("https://twitter.com/TilburgU/following") 

# add a short pause here, or Twitter recognizes that it's a scraper and will block your request!
sleep(1)

twitter_followings_while()


### 1.5 Search Tweets

**Importance**  
Although scraping data from a specific Twitter account is a good starting point, you may come across a scenario in which the tweets you're after come from a variety of sources. We can use Twitter's advanced search functionality to filter down on those tweets we're looking for.   

On Twitter, you find a search bar at the top which allows you to search for a topic. In the right sidebar, a search filter panel appears that includes `Advanced search` tools. Filling out any of the search boxes will automatically update the user input in the search box above. For example, tweets that contain either `cats` or `dogs` (or both) can be obtained with the search query `(cats OR dogs)`. The specified search term may occur in the user name, handle, bio, tweet, or in any of the replies of a thread. All search queries are case insensitive, so `cats` and `Cats` are considered equal.

<img src="https://github.com/hannesdatta/course-odcm/blob/master/content/docs/tutorials/webscrapingadvanced/images/cats_dogs.gif?raw=true" align="left" width=60%/>

Did you notice the URL was updated in accordance with our search parameters (remember from [icanhazdadjoke](https://icanhazdadjoke.com/api))? In fact, the new URL became: `https://twitter.com/search?q=(cats%20OR%20dogs)&src=typed_query`: 
* `q=` stands for search query
* `(cats%20OR%20dogs)` corresponds with the contents of the query `(cats OR dogs)`
* `&src=typed_query` indicates that we filled out the search query manually

Note that `%20` is the URL encoding of a space character. A full list of search commands and syntax can be found below: 

| Command | Syntax | Interpretation | URL suffix | 
| :------- | :------- | :------ | :----- | 
| All of these words | `cats dogs` | Contains both `cats` and `dogs` | `cats%20dogs` | 
| Exact phrase | `"cats"` | Contains the exact phrase `cats` | `%22cats%22` | 
| Any of these words | `(cats OR dogs)` | Contains either `cats` or `dogs` (or both) | `(cats%20OR%20dogs)` | 
| None of these words | `-cats` | Does not contain `cats` | `-cats` | 
| These hashtags | `(#cats)` | Contains the hashtag `#cats` | `(%23cats)` | 
| Language | `lang:nl` | Tweets in specified language (default: all) | `lang%3Anl` | 
| From these accounts | `(from:hannesdatta)` | Tweets from Hannes Datta | `(from%3Ahannesdatta)` | 
| Replies to these accounts | `(to:hannesdatta)` | Replies to tweets from Hannes Datta | `(to%3Ahannesdatta)` | 
| Mentioning these accounts | `(@hannesdatta)` | Tweets mentioning Hannes Datta | `(%40hannesdatta)` | 
| Minimum replies | `min_replies:10` | Tweets with at least 10 replies | `min_replies%3A10` | 
| Minimum likes | `min_faves:10` | Tweets with at least 10 likes | `min_faves%3A10` | 
| Minimum retweets | `min_retweets:10` | Tweets with at least 10 retweets | `min_retweets%3A10` | 
| From (date) | `since:2020-01-01` | Tweets after the 1st of January 2020 | `since%3A2020-01-01` | 
| To (date) | `until:2020-01-01` | Tweets before the 1st of January 2020 | `until%3A2020-01-01` | 
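
You don't have to encode search queries by hand: Python's built-in `urllib.parse.quote` function does it for you. Below is a minimal sketch; the helper name `build_search_url` is our own, and we keep the parentheses unencoded only to mirror the URLs shown above.

```python
from urllib.parse import quote


def build_search_url(query):
    """Turn a Twitter search query into a search URL."""
    # quote() percent-encodes spaces, quotes, "#", "@" and ":";
    # safe="()" keeps parentheses as-is to match the URLs shown above
    encoded_query = quote(query, safe="()")
    return f"https://twitter.com/search?q={encoded_query}&src=typed_query"


print(build_search_url("(cats OR dogs)"))
# https://twitter.com/search?q=(cats%20OR%20dogs)&src=typed_query
print(build_search_url("(#cats)"))
# https://twitter.com/search?q=(%23cats)&src=typed_query
```

The resulting string can be passed directly to `driver.get()`.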

**Let's try it out!**  
Play around with the filters until you get the hang of it. How do you chain multiple search commands? (e.g., tweets about cats with at least 100 likes)

**Exercise 1.5**  
1. Suppose that you're responsible for Tilburg University's Public Relations & Digital Communication and want to keep an eye on what others are writing about the university. Compile a search query in the Twitter web interface to collect tweets that refer to either `Tilburg University` (English) or `Tilburg Universiteit` (Dutch). Take note of the search command and keep track of the URL.
2. Write a function `search_tweets()` that takes a Twitter search query as input and returns a list of all tweets (including a link to the original tweet so that your PR colleague can respond if necessary) as a csv-file. Test your function with the URL of the previous question. 
3. After further inspection, you come to the conclusion that your data include a dozen or so train (Dutch: "trein") disruption alerts related to "Tilburg Universiteit" (e.g., see example below). How can you easily exclude those tweets upfront? 

<img src="https://github.com/hannesdatta/course-odcm/blob/master/content/docs/tutorials/webscrapingadvanced/images/train_alerts.png?raw=true" align="left" width=40%/>

*English: #NS Resolved: Tilburg University-Eindhoven: broken train. Boxtel-Eindhoven C.: train traffic has resumed.*

In [None]:
# Question 1 and 2
def search_tweets(query):
    driver.get("https://twitter.com/search?q=" + query) 
    tweets = []
    last_height = 0
    current_height = 0

    iterator = 0
    while True: 
        iterator += 1
        
        current_height = driver.execute_script('return document.body.scrollHeight')

        if current_height == last_height: 
            break

        res = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(res, "html.parser")
        data = soup.find(attrs={"data-testid": "primaryColumn"}).find_all(attrs={"data-testid":"tweet"})

        for counter in range(len(data)):
            tweet = data[counter].find_all(attrs={"dir":"auto"})

            text = tweet[4].text
            try:
                link = tweet[3]['href']
            except:
                link = '/LINK-NOT-AVAILABLE'

            tweet_data = {
                            "tweet": text, 
                            "link": "https://twitter.com" + link  # link already starts with "/"
                         }

            if tweet_data not in tweets: 
                tweets.append(tweet_data)

        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')       
        last_height = current_height
        sleep(5)
        # stop after two iterations (remove these two lines to keep scrolling until the end)
        if iterator == 2:
            break
        
    df = pd.DataFrame(tweets)
    df.to_csv("search_tweets.csv", index=False)
    return df  # also return the data so it can be inspected in the notebook

collected_tweets = search_tweets('("Tilburg University" OR "Tilburg Universiteit")')

In [None]:
# Question 3

# add -trein to the search query - see table above (as simple as that!)
search_tweets('("Tilburg University" OR "Tilburg Universiteit") -trein')

By now, you have undoubtedly noticed that Twitter search shows the so-called "Top" tweets by default. However, you can click on any of the other tabs to return data of a specific type, data sorted in a given order, or data from Twitter users that meet your selection. Note that you can select at most one tab and one search filter. 

| Tab | Interpretation | URL suffix | 
| :----- | :------ | :------ |
| Latest | Sort the tweets chronologically (in descending order) | `&f=live` |
| People | Return a list of Twitter accounts (as opposed to tweets) | `&f=user` |
| Photos | Filter on tweets that include an image | `&f=image` |
| Videos | Filter on tweets that include a video | `&f=video` |

| Search filter| Interpretation | URL suffix | 
| :----- | :------ | :------ |
| Near you | Return tweets that are published by someone in your neighborhood | `&lf=on` |
| From people you follow | Filter on tweets from accounts you follow | `&pf=on` |

**Exercise 1.6**  
Brainstorm about search strategies you can deploy to narrow down your Twitter search and make the output of Exercise 1.5 more useful and actionable for the PR department. 

**Your answer**  

... (double click to edit this cell)

**Solution**  
You may want to filter on the latest tweets near Tilburg to separate the signal from the noise. After all, some public statements may require immediate action.  

---
### 1.6 Wrap-Up
In this section, you have learned how to use a browser emulation (Selenium) to access data that is hidden behind a login-screen or only appears once scrolling down. This opens up a whole new world of opportunities for you to explore. Some ideas to explore on your own: analyze the hashtags that get the highest engagement, plot the Twitter follower growth over time, identify upcoming influencers, and conduct market research to derive insights to grow your audience. 


---
## 2. Instagram

### 2.1 Click Sites Programmatically

**Importance**  
Now that you have a feeling for how to work with Selenium, we're going to have a brief look at how to scrape data from Instagram. In many ways, the techniques and code follow the same logic as above. 


First, we navigate to the [Instagram account](https://www.instagram.com/tilburguniversity/) of Tilburg University with Selenium, accept the cookies, log in with our own credentials (we recommend creating a separate account for scraping purposes), and close any pop-ups that may appear (e.g., "Save your login info" or "Turn on notifications").

In [None]:
# first create a new driver object if you accidentally closed it (see beginning of this notebook)
driver.get("https://www.instagram.com/tilburguniversity/")

The page structure of Instagram is slightly different from Twitter in the sense that it presents an overview of all images, from which we gather the links to all separate posts (like the webshop example). Inspection of the HTML structure tells us that a link to each post can be obtained from the `<a>` tags with the attribute `tabindex="0"` within the `<article>` tags: 

In [None]:
# login and close pop-ups manually before proceeding!
driver.get("https://www.instagram.com/tilburguniversity/")
res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")
link = soup.find("article").find_all(attrs={"tabindex": "0"})[0]["href"]
print(f"The link of the most recent post is: https://www.instagram.com{link}")

**Let's try it out!**  
Get the links from all other posts! How many images/videos can you access (without scrolling down)? What does the `alt` attribute in the `img` tags tell you? 

---
One of the [links](https://www.instagram.com/p/B_KalD_DHcV/) that you'll get this way showcases a picture of the Heuvelplein in Tilburg. Although comments are listed by default, replies are only visible after clicking on `View replies`. Note that replies are not simply hidden somewhere in the code; instead, a new `<div>` block is added to the source code after you click (see below).

<img src="https://github.com/hannesdatta/course-odcm/blob/master/content/docs/tutorials/webscrapingadvanced/images/click_instagram.gif?raw=true" align="left" width=80%/>

Sure, you can click on each link manually, but that would take ages for posts with many replies. Hence, we look up the class of the `View replies` element and use Selenium to *click* on the link for us. 

In [None]:
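from selenium.webdriver.common.by import By

driver.get("https://www.instagram.com/p/B_KalD_DHcV/")
sleep(3)  # wait a few seconds until the page is fully loaded
# find_elements_by_class_name() was removed in recent Selenium 4 releases; use find_elements(By.CLASS_NAME, ...) instead
view_replies = driver.find_elements(By.CLASS_NAME, "EizgU")[0]
view_replies.click()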

**Let's try it out!**  
Run the `view_replies.click()` command one more time (without the other lines of code), what happens then? Which replies are triggered by this click? 

**Exercise 2.1**  
Write a program that extracts all replies for a given Instagram URL and writes it to a csv file. It should include both the username of the author and the reply text (not the original comments). You can break down your code in two separate functions: `click_buttons()` and `extract_replies()`. 

* `click_buttons()` should identify the relevant buttons (e.g., by checking whether the content contains the text "View replies") and click on all of them. Ideally, your function should work without hardcoding a class name of the button as we did in the previous example. 


* `extract_replies()` scrapes all replies and usernames once the buttons are clicked upon. Keep in mind that a comment may have more than one reply. Tip: you can use `.find_parent()` to move up one level in the Document Object Model (DOM) hierarchy.


In [None]:
# your answer goes here!

In [None]:
# solution
from selenium.webdriver.common.by import By

def click_buttons(url):
    driver.get(url)
    
    # extract HTML of the Instagram page
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")

    # there are a bunch of buttons on the page, but the one we're after has the text "View replies" on it
    temp = []
    for element in soup.find_all("button"): 
        if "View replies" in element.text: 
            temp.append(element)

    # all buttons have the same class so it doesn't really matter which one we pick        
    class_name = temp[0].find("span")["class"][0] 
    
    # click on buttons for the class we identified above
    # (find_elements_by_class_name() was removed in recent Selenium 4 releases)
    buttons = driver.find_elements(By.CLASS_NAME, class_name)
    for button in buttons: 
        button.click()
        sleep(1)
        
    return class_name


def extract_replies(url):
    replies = []
    class_name = click_buttons(url)  # here we call the function above which clicks on all buttons and returns the class_name (of the buttons) which we re-use below
    
    # extract HTML again (but this time with the replies included)
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    data = soup.find_all(class_ = class_name)

    # collect replies
    for comment in data: 
        for reply in comment.find_parent("ul").find_all(class_ = "notranslate"):
            text = reply.find_parent("span").text
            user = reply.find_parent("div").find(attrs={"tabindex": 0}).text
            replies.append({"reply": text, 
                            "user": user})

    df = pd.DataFrame(replies)
    df.to_csv("replies.csv", index=False)
    
extract_replies("https://www.instagram.com/p/B_KalD_DHcV/")
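
As an alternative to looking up the class name with BeautifulSoup first, Selenium can locate elements by their visible text using an XPath expression. This is a minimal sketch, assuming the "View replies" links are `<button>` elements as in the solution above. `click_all_by_text` is a hypothetical helper of our own, and the string `"xpath"` is the value behind `By.XPATH` in Selenium 4.

```python
from time import sleep


def click_all_by_text(driver, text, pause=1.0):
    """Click every <button> whose text contains `text`; return the number of clicks."""
    # contains(., '...') checks the text inside the button, including nested tags;
    # note that this simple version breaks if `text` itself contains a single quote
    xpath = f"//button[contains(., '{text}')]"
    buttons = driver.find_elements("xpath", xpath)

    for button in buttons:
        button.click()
        sleep(pause)  # wait for the replies to load before the next click

    return len(buttons)
```

After loading a post with `driver.get(url)`, you could call `click_all_by_text(driver, "View replies")`. If the page changes a lot after each click, the remaining buttons may need to be located again.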

---
### 2.2 Scrape Image Files

**Importance**  
In previous examples, we looked at scraping textual data from a web page. On Instagram, however, it would make sense to store the image files as well. To this end, we extract a link to the image source (`image_link`) and pass it to the `wget` library. You can name the image whatever you want (e.g., `my_image.jpg`). By default, the image is stored in your current working directory (i.e., where this notebook resides).

Please note that the `wget` module is not a standard Python package, so before running the cell below you need to install it first. Either type `pip install wget` in your terminal or use the Anaconda Navigator interface (see below - in your case there will likely be only the base (root) environment).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/install_packages.gif" align="left" width=70%/>

In [None]:
!pip install wget
import wget
image_link = soup.find(attrs={"decoding": "auto"})["src"]
wget.download(image_link, "my_image.jpg")

**Let's try it out!**  
Visit another Instagram page (with `driver.get()`), extract the HTML of that page, recreate a `soup` object, and see whether you can store the image on your local machine. Can you also scrape videos in this way? How about Instagram [carousels](https://www.instagram.com/p/CIGLXWMoPkh/) (i.e., posts that contain multiple media)? 

**Exercise 2.2**  
Write a function `scrape_image()` that takes an Instagram URL and downloads the images associated with the post. 

* Use the post identifier (e.g., for [this](https://www.instagram.com/p/CIGLXWMoPkh/) post use `CIGLXWMoPkh`) followed by `_image[NR].jpg` (e.g., `CIGLXWMoPkh_image1.jpg`, `CIGLXWMoPkh_image2.jpg`, etc.) as the file name for the images. 
* For this exercise, you can restrict yourself to the first two images of the image carousel (more on that later).
* Skip the posts displayed under "More posts from ..." (see below):

![](https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/more_posts.png)

In [None]:
# your answer goes here!

In [None]:
# solution
def scrape_image(url):
    driver.get(url)
   
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    
    data = soup.find(attrs={"role": "presentation"}).find_all(attrs={"decoding": "auto"})
    
    for counter in range(len(data)):
        item_link = data[counter]["src"]
        file_name = url.strip('/')[-11:] + "_image" + str(counter + 1) + ".jpg"
        wget.download(item_link, file_name)
        
scrape_image("https://www.instagram.com/p/CIGLXWMoPkh/")
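
The solution above builds the file name with `url.strip('/')[-11:]`, which assumes that the post identifier is exactly 11 characters long and that nothing follows it in the URL. A possible alternative, sketched below, takes the last segment of the URL path instead. The helper names `post_id()` and `image_file_name()` are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlparse

def post_id(url):
    """Return the last path segment of an Instagram post URL (the post identifier)."""
    path = urlparse(url).path               # e.g. "/p/CIGLXWMoPkh/" (query string is dropped)
    return path.rstrip("/").split("/")[-1]  # e.g. "CIGLXWMoPkh"

def image_file_name(url, number):
    """Build a file name such as CIGLXWMoPkh_image1.jpg."""
    return f"{post_id(url)}_image{number}.jpg"

print(image_file_name("https://www.instagram.com/p/CIGLXWMoPkh/", 1))
```

Inside `scrape_image()` you could then write `file_name = image_file_name(url, counter + 1)`.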

**Exercise 2.3**  
By default, Instagram only loads the first two images of an image carousel. Therefore, if you run `scrape_image()` above, it will export two out of three images. If you click on the arrow pointing to the right, however, the 3rd image is loaded and becomes extractable in the code (see below). Extend the functionality of `scrape_image()` so that it scrapes all images. Your code should still work for a "normal" image (not a carousel). 

![](https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscrapingadvanced/images/load_images.gif)

In [None]:
# your answer goes here!

In [None]:
# solution
def scrape_image(url):
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    
    data = soup.find(attrs={"role": "presentation"}).find_all(attrs={"decoding": "auto"})
    
    for counter in range(len(data)):
        item_link = data[counter]["src"]
        file_name = url.strip('/')[-11:] + "_image" + str(counter + 1) + ".jpg"
        wget.download(item_link, file_name)

def right_arrow():
    return driver.find_elements_by_class_name("coreSpriteRightChevron")  # this is the class for the right arrow button

def scrape_all_images(url):
    driver.get(url)
    arrow = right_arrow()

    # click on the right arrow until you reach the last image (at that point the right arrow class is no longer on the page)
    while len(arrow) > 0: 
        arrow[0].click()
        arrow = right_arrow()
    
    # scrape images
    scrape_image(url)

scrape_all_images("https://www.instagram.com/p/CIGLXWMoPkh/") # image carousel
scrape_all_images("https://www.instagram.com/p/CJItoJODy9y/") # normal image
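
The `while` loop in `scrape_all_images()` keeps clicking for as long as a right arrow is found. If a page behaves unexpectedly (e.g., the arrow never disappears), such a loop would run forever. The sketch below shows one way to guard against that with a maximum number of clicks and a short pause between clicks. The function name `click_until_gone()` and the parameters `max_clicks` and `pause` are assumptions made for this example, not part of the original solution.

```python
from time import sleep

def click_until_gone(find_buttons, max_clicks=20, pause=0.5):
    """Click the first element returned by find_buttons() until none is left.

    find_buttons: a function that returns a list of clickable elements,
                  such as right_arrow() from the solution above.
    Returns the number of clicks that were performed.
    """
    clicks = 0
    buttons = find_buttons()
    while len(buttons) > 0 and clicks < max_clicks:
        buttons[0].click()
        clicks += 1
        sleep(pause)              # give the page a moment to load the next image
        buttons = find_buttons()  # look again: the arrow is gone on the last image
    return clicks
```

In `scrape_all_images()`, the loop could then be replaced by `click_until_gone(right_arrow)`.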

### 2.3 Wrap-Up
We have come a long way since the Web data for dummies tutorial and learned a variety of techniques to get to the data we're after. Social network services in particular rely on scrolling and clicking to load their content, so a scraper needs to emulate these interactions. As said many times before, this is just the beginning of your scraping journey. For example, see whether you can use Instagram's explore function to filter posts by a specific [tag](https://www.instagram.com/explore/tags/tilburguniversity/) or [location](https://www.instagram.com/explore/locations/213125596/tilburg-netherlands/). Good luck!
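
As a starting point, the two explore links above follow a simple URL pattern. The sketch below builds such URLs from a tag or a location; the helper names `tag_url()` and `location_url()` are hypothetical, and the pattern is inferred only from the two example links, so it may not hold for every tag or location.

```python
BASE_URL = "https://www.instagram.com/explore"

def tag_url(tag):
    """Build the explore URL for a tag, e.g. "tilburguniversity"."""
    return f"{BASE_URL}/tags/{tag}/"

def location_url(location_id, slug):
    """Build the explore URL for a location from its numeric id and its name slug."""
    return f"{BASE_URL}/locations/{location_id}/{slug}/"

print(tag_url("tilburguniversity"))
print(location_url(213125596, "tilburg-netherlands"))
```

You can pass the resulting URL to `driver.get()` and continue with the scrolling and extraction techniques from this tutorial.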
