Tips and examples

Tips

Separate lists vs. lists of dictionaries
- Prefer a list of dictionaries, so that fields belonging to the same item stay together. Don’t break the structure of what belongs to what!
Looping: while loops are also an option!
Try & except: wrap one operation per block, not many things at the same time, so that you know which step failed
Break up code into smaller modules (e.g., first seeds, then getting the data)
Clean up code (e.g., comments)
Modularize code (so that it works on multiple categories, pages, etc.)
Make “class names” flexible (e.g., store them in variables) so that you don’t have to repeat yourself over and over again
For anonymization: use a hash function (salted!)
Read the paper and align or update your code accordingly (e.g., metadata enrichment)
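
As a minimal sketch of the anonymization tip above, the snippet below replaces user names with a salted SHA-256 digest while keeping each record together in a list of dictionaries. The `reviews` data, the `anonymize` helper, and the way the salt is generated are all illustrative assumptions. In a real project you would create the salt once, store it privately, and never publish it with the data.

```python
import hashlib
import secrets

# Hypothetical salt: generate it once, keep it private, and reuse it
# for the whole data collection so that identical users map to identical hashes.
SALT = secrets.token_hex(16)


def anonymize(identifier: str, salt: str = SALT) -> str:
    """Return a salted SHA-256 digest of an identifier such as a user name."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


# Made-up example data: a list of dictionaries keeps each review's fields together.
reviews = [
    {"user": "alice", "rating": 5},
    {"user": "bob", "rating": 3},
    {"user": "alice", "rating": 4},
]

# Copy each record, replacing only the identifying field.
anonymized = [{**review, "user": anonymize(review["user"])} for review in reviews]
```

Because the same salt is used throughout, the same user always receives the same hash, so you can still count reviews per user without storing the original name.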

Store raw data as JSON, then parse it in a second step
Separate the “seeding” stage from the “collecting information” stage
Write the data to a file (e.g., JSON) as soon as you can, not only at the end of a long (1.5 days!) scraping session, to minimize data loss
For extended data collections, consider saving the raw HTML files first and parsing them afterwards!
How to find max. page numbers? You can do some calculations with information from the site (e.g., for AH.nl: 1077 items / 36 items per page = 29.9, so 30 pages)
Store all of the raw JSON first, and only then preprocess it
Use Selenium for dynamic websites
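
Two of the tips above can be sketched together: computing the number of pages from the item count, and writing every record to disk right away instead of at the end of the session. The helper names (`page_count`, `append_record`), the output file name, and the placeholder page content are assumptions for illustration. A real scraper would put its own request where the placeholder is.

```python
import json
import math
import os
import tempfile


def page_count(total_items: int, items_per_page: int) -> int:
    """Round up, because a partially filled last page is still a page."""
    return math.ceil(total_items / items_per_page)


def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line, so earlier records survive a crash."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Numbers from the AH.nl example: 1077 items, 36 items per page.
n_pages = page_count(1077, 36)

# Hypothetical output file; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "raw_pages.jsonl")

# Collection stage: store the raw content of each page immediately.
for page in range(1, n_pages + 1):
    # Placeholder for the real request; no website is contacted here.
    raw = f"<html>placeholder for page {page}</html>"
    append_record(out_path, {"page": page, "raw": raw})

# Parsing stage: read the stored raw data back in a separate, second step.
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Writing one JSON object per line means that an interrupted run loses at most the record being written, and the parsing step can be rerun as often as needed without scraping again.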

Great websites to start your first scraping project
- Boxofficemojo is well structured and has stats on movie revenue, release date, and actors
- Sports Reference has stats on players in US sports (baseball, basketball, football, hockey)
- Wikipedia is also locally available, but keep your retrieval frequency to at most 1 page per second.

More advanced use cases
- Ever tried extracting data from data widgets (e.g., like those available at The New York Times)?
- Netflix Home Screen Capture
- Playlist Promotions and New Releases at Spotify

Code to monitor the health of online data collections via Push Messages
Scrapy and Morph.io are comprehensive, code-based frameworks that collect the data for you in the cloud
Try visualizing your results dynamically/interactively, for example with D3, Plotly, ShinyApps, or Tableau
Snscrape - an amazing Python package to scrape data from social networks like Twitter, Facebook, Instagram, Telegram, VKontakte and Weibo

Listen to a podcast with Kimberly Fessel, who shares some best practices on scraping the web. She has also shared a fantastic tutorial on YouTube