Imagine you run a business selling shoes online and want to monitor how your competitors price their products. You could spend hours a day clicking through page after page, or you could write a script for a web bot, an automated piece of software that keeps track of a site’s updates. That’s where web scraping comes in.
Scraping websites lets you extract information from hundreds or thousands of webpages at once. You can search websites like Indeed for job opportunities or Twitter for tweets. In this gentle introduction to web scraping, we’ll go over the basic code to scrape websites such that anyone, regardless of background, can extract and analyze these kinds of results.
Getting Started
Using my GitHub repository on web scraping, you can install the software and run the scripts as instructed. Click on the src directory on the repository page to see the README.md file that explains each script and how to run them.
Examining the Site
You can use a sitemap file to locate where websites upload content without crawling every single webpage. Here’s a sample one. You can also find out how large a site is and how much information you can actually extract from it: searching a site with Google’s Advanced Search will tell you roughly how many pages you may need to scrape. This comes in handy when building a web scraper that needs to pause for updates or behave differently after reaching a certain number of pages.
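If you want to see roughly how sitemap crawling works, here’s a minimal sketch that downloads a sitemap and pulls out the page URLs listed in its <loc> tags. The example.com URL is a placeholder, and the repository’s download.py handles this more robustly.

```python
# Minimal sitemap-crawling sketch (not the repository's download.py):
# fetch a site's sitemap.xml and pull out the listed page URLs.
import re
import urllib.request

def crawl_sitemap(sitemap_url):
    """Download a sitemap and return the URLs inside its <loc> tags."""
    with urllib.request.urlopen(sitemap_url) as response:
        sitemap = response.read().decode("utf-8")
    # Each page in a sitemap is wrapped in a <loc>...</loc> element
    return re.findall(r"<loc>(.*?)</loc>", sitemap)

# Example (hypothetical URL):
# for url in crawl_sitemap("http://example.com/sitemap.xml"):
#     print(url)
```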
You can also run the identify.py script in the src directory to find out more about how each site was built. It should give you information about the frameworks, programming languages, and servers used to build each website, as well as the registered owner of the domain. It also uses robotparser to check for crawling restrictions.
Many websites have a robots.txt file with crawling restrictions. Make sure you check this file for any site you plan to crawl, and follow the rules it lays out. The sample protocol can be found here.
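Since the robots.txt check comes up in identify.py, here’s a minimal sketch of how you might run that check yourself with Python’s standard-library urllib.robotparser; the site URL and user agent below are placeholders.

```python
# Minimal robots.txt check using the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # hypothetical site
rp.read()

# can_fetch() reports whether a given user agent may crawl a given URL
print(rp.can_fetch("MyScraper", "http://example.com/some/page"))
```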
Crawling a Site
There are three general approaches to crawling a site: crawling a sitemap, iterating through an ID for each webpage, and following webpage links. download.py shows how to download a webpage by crawling the sitemap, results.py shows you how to scrape those results while iterating through webpage IDs, and indeedScrape.py uses webpage links for crawling. download.py also contains code for inserting delays, returning a list of links from HTML, and supporting proxies so you can access websites that would otherwise block your requests.
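To make the link-following idea concrete, here’s a minimal sketch of a crawler that downloads pages, follows links matching a pattern, and pauses between requests. The function names and regular expressions are my own, not taken from download.py or indeedScrape.py.

```python
# Minimal link-following crawler with a polite delay between requests.
import re
import time
import urllib.error
import urllib.request
from urllib.parse import urljoin

def download(url):
    """Download a page and return its HTML, or None on error."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        return None

def link_crawler(seed_url, link_regex, delay=1, max_pages=10):
    """Follow links matching link_regex, starting from seed_url."""
    queue, seen, pages = [seed_url], {seed_url}, []
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = download(url)
        if html is None:
            continue
        pages.append(url)
        # Pull href values out of the HTML and queue unseen matches
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(seed_url, link)  # make relative links absolute
            if re.match(link_regex, link) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # pause so we don't hammer the server
    return pages
```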
Scraping the Data
In the file compare.py, you can compare the efficiency of the three web scraping methods.
You can use regular expressions (known as regex or regexp) to pull information straight out of a page’s HTML. The script regex.py shows how this is done.
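As a rough illustration of the regex approach (regex.py covers the real thing), the snippet below pulls prices out of a made-up block of HTML.

```python
# Regex-based extraction from a made-up HTML snippet.
import re

html = '<td class="price">$59.99</td><td class="price">$42.00</td>'

# Grab the text inside every <td class="price"> cell
prices = re.findall(r'<td class="price">(.*?)</td>', html)
print(prices)  # ['$59.99', '$42.00']
```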
You can also use the browser extension Firebug Lite to get information from a webpage. In Chrome, you can click View >> Developer >> View Source to get the source behind a webpage.
Beautiful Soup, one of the required packages to run indeedScrape.py, parses a webpage and provides a convenient interface for navigating the content, as shown in bs4test.py. lxml does the same in lxmltest.py (a short Beautiful Soup example follows the table below). A comparison of these three scraping methods is in the following table.
| Scraping method | Performance | Ease of use | Ease of install |
| --- | --- | --- | --- |
| Regex | Fast | Hard | Easy |
| Beautiful Soup | Slow | Easy | Easy |
| lxml | Fast | Easy | Hard |
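Here’s the Beautiful Soup sketch mentioned above; the HTML is made up, and bs4test.py shows the repository’s actual usage. lxml’s interface in lxmltest.py is similar.

```python
# Parsing a made-up HTML snippet with Beautiful Soup.
from bs4 import BeautifulSoup

html = '<table><tr><td class="price">$59.99</td></tr></table>'

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", attrs={"class": "price"})
print(cell.text)  # $59.99
```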
The callback.py script lets you scrape data and save it to an output .csv file.
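As a rough sketch of the scrape-and-save idea, here’s a callback class that writes one CSV row per page. The interface (a callable that receives a URL and its HTML) is an assumption and may differ from what callback.py actually does.

```python
# Sketch of a scraping callback that appends one CSV row per page.
import csv
import re

class ScrapeCallback:
    def __init__(self, path="output.csv"):
        self.file = open(path, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["url", "title"])  # header row

    def __call__(self, url, html):
        # Pull the page title out of the HTML and write one row
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else ""
        self.writer.writerow([url, title])

    def close(self):
        self.file.close()
```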
Caching Downloads
Caching crawled webpages lets you store them in a manageable format while only having to download them once. In download.py, there’s a Python class Downloader that shows how to cache URLs after downloading their webpages. cache.py has a Python class that maps a URL to a filename when caching.
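To illustrate the URL-to-filename idea, here’s a minimal disk cache sketch; the directory layout and class name are assumptions rather than cache.py’s actual implementation.

```python
# Sketch of a disk cache keyed by URL: each URL maps to a file path
# built from its host and path, stored under a cache directory.
import os
from urllib.parse import urlsplit

class DiskCache:
    def __init__(self, cache_dir="cache"):
        self.cache_dir = cache_dir

    def url_to_path(self, url):
        """Build a filesystem path from a URL's host and path."""
        parts = urlsplit(url)
        path = parts.path or "/index"
        if path.endswith("/"):
            path += "index"
        return os.path.join(self.cache_dir, parts.netloc, path.lstrip("/") + ".html")

    def __getitem__(self, url):
        with open(self.url_to_path(url), encoding="utf-8") as f:
            return f.read()

    def __setitem__(self, url, html):
        path = self.url_to_path(url)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
```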
Depending on which operating system you’re using, there are restrictions on the filenames you can use when caching.
| Operating system | File system | Invalid filename characters | Max filename length |
| --- | --- | --- | --- |
| Linux | Ext3/Ext4 | /, \0 | 255 bytes |
| OS X | HFS Plus | :, \0 | 255 UTF-16 code units |
| Windows | NTFS | \, /, ?, :, *, >, <, \| | 255 characters |
Though cache.py is easy to use, you can also take a hash of the URL itself and use it as the filename, ensuring that your cached files map directly to their URLs. Using MongoDB, you can build on top of the existing filesystem-based cache and avoid its limitations. This method is found in mongocache.py using pymongo, a Python wrapper for MongoDB.
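Here’s a minimal sketch of a MongoDB-backed cache using pymongo; the database, collection, and field names are assumptions and may differ from mongocache.py.

```python
# Sketch of a MongoDB-backed page cache keyed by URL.
from pymongo import MongoClient

class MongoCache:
    def __init__(self, client=None):
        self.client = client or MongoClient("localhost", 27017)
        self.db = self.client.cache

    def __getitem__(self, url):
        record = self.db.webpage.find_one({"_id": url})
        if record is None:
            raise KeyError(url + " is not cached")
        return record["html"]

    def __setitem__(self, url, html):
        # Upsert: insert the document or replace the existing one for this URL
        self.db.webpage.replace_one({"_id": url}, {"_id": url, "html": html}, upsert=True)
```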
Test out the other scripts, such as alexacb.py, which downloads information on the top sites by Alexa ranking. mongoqueue.py has functionality for queueing MongoDB queries and can be imported into other scripts.
You can work with dynamic webpages using the code from browserrender.py. The majority of leading websites use JavaScript for functionality, meaning you can’t view all of their content in the barebones HTML.
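browserrender.py has its own approach to rendering; as one common illustration of the idea, the sketch below drives a real browser with Selenium (which is not necessarily what the repository uses) so the JavaScript runs before you grab the HTML.

```python
# One way to render JavaScript-driven pages: drive a real browser.
from selenium import webdriver

driver = webdriver.Chrome()  # requires ChromeDriver on your PATH
driver.get("http://example.com")  # hypothetical dynamic page
html = driver.page_source  # HTML after the page's JavaScript has run
driver.quit()
```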