Tips and examples
Tips
Coding
- Using separate lists vs. lists of dictionaries
- Don’t break the structure of what belongs to what: keep all fields of one item together so that values cannot get misaligned
- Looping: while loops are also an option!
- Try & except: wrap one operation per block, not many things at the same time, so you can tell what failed
- Break up code into smaller modules (e.g., first seeds, then getting the data)
- Clean up code (e.g., comments)
- Modularizing code (so that it works on multiple categories, pages, etc.)
- Make “class names” flexible so that you don’t have to repeat yourself over and over again
- For anonymization: use a hash function (salted!)
- Read the paper and align or update your code accordingly (e.g., metadata enrichment)
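A minimal sketch of three of the coding tips above: a list of dictionaries that keeps each item's fields together, a try/except that guards a single operation, and a salted hash for anonymization. The records, the field names, and the helper names `anonymize` and `parse_price` are made up for illustration. SHA-256 is only one possible choice of hash function.

```python
import hashlib
import secrets

# Hypothetical salt: generate it once, keep it secret, and reuse it for the
# whole project so the same user always maps to the same hash.
SALT = secrets.token_hex(16)


def anonymize(user_id: str, salt: str = SALT) -> str:
    """Return the SHA-256 hex digest of the salt followed by the user id."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def parse_price(raw: str):
    """Convert a price string to a float; return None if that fails."""
    # The try/except guards exactly one operation: the conversion.
    try:
        return float(raw.replace(",", "."))
    except ValueError:
        return None


# Made-up raw records, standing in for scraped data.
raw_items = [
    {"user": "alice", "product": "milk", "price": "1,09"},
    {"user": "bob", "product": "bread", "price": "n/a"},
]

# One dictionary per item: the fields of an item cannot drift apart,
# which can happen with three separate lists.
reviews = []
for item in raw_items:
    reviews.append(
        {
            "user_hash": anonymize(item["user"]),
            "product": item["product"],
            "price": parse_price(item["price"]),
        }
    )
```

Because the price conversion has its own try/except, a bad price leaves the rest of the record intact instead of discarding the whole item.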
Data collection
- Store raw data as JSON - parse in a second step
- Separate “seeding” from “collecting information” stage
- Write the data as soon as you can to a file (e.g., JSON) - not only at the end of a long (1.5 days!) scraping session (minimize data loss)
- For extended data collections, consider saving the raw HTML files first and parsing them only afterwards
- How to find the maximum page number? Calculate it from information on the site (e.g., for AH.nl: 1077 items / 36 items per page = 29.9, so 30 pages after rounding up)
- Store all of the JSON first, and preprocess only afterwards
- Use Selenium for dynamic websites
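The page calculation and the "write early, parse later" tips above can be sketched as follows. The function `fetch_page` is a hypothetical stand-in for a real request, and the file name `raw_pages.jsonl` and the one-JSON-object-per-line layout are assumptions, not something the tips prescribe.

```python
import json
import math
import os
import tempfile


def max_pages(total_items: int, items_per_page: int) -> int:
    """Round up, because a partly filled last page still counts as a page."""
    return math.ceil(total_items / items_per_page)


def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line, right after it is collected."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def fetch_page(page: int) -> dict:
    """Hypothetical stand-in for a real request; returns fake raw content."""
    return {"page": page, "raw": f"<html>page {page}</html>"}


# Stage 1 (seeding): work out which pages exist.
n_pages = max_pages(1077, 36)  # 1077 / 36 = 29.9..., so 30 pages

# Stage 2 (collecting): store every raw result immediately, so a crash
# late in a long session loses at most the page being fetched.
out_path = os.path.join(tempfile.mkdtemp(), "raw_pages.jsonl")
for page in range(1, n_pages + 1):
    append_record(out_path, fetch_page(page))

# Stage 3 (parsing): a separate step that reads the stored raw data back.
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Appending line by line means the file is usable even if the session stops halfway, and the parsing step can be rerun without scraping again.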
Examples and extra material
Web scrapers
- Great websites to start your first scraping project
- Boxofficemojo is well structured and has stats on movie revenue, release date, and actors
- Sports Reference has stats on players in US sports (baseball, basketball, football, hockey)
- Wikipedia is also locally available, but keep your maximum retrieval frequency at 1 page per second.
- More advanced use cases
APIs
Podcasts and tutorials