
Web Scraping with Python Made Easy

Imagine you run a business selling shoes online and want to monitor how your competitors price their products. You could spend hours a day clicking through page after page, or you could write a web bot, an automated piece of software that keeps track of a site's updates. That's where web scraping comes in.

Scraping websites lets you extract information from hundreds or thousands of webpages at once. You can search websites like Indeed for job opportunities or Twitter for tweets. In this gentle introduction to web scraping, we'll go over the basic code to scrape websites so that anyone, regardless of background, can extract and analyze these kinds of results.

Getting Started

Using my GitHub repository on web scraping, you can install the software and run the scripts as instructed. Click the src directory on the repository page to find the README.md file, which explains each script and how to run it.

Examining the Site

You can use a sitemap file to locate where websites upload content without crawling every single web page. Here's a sample one. You can also find out how large a site is and how much information you can actually extract from it: searching a site with Google's Advanced Search can tell you roughly how many pages you may need to scrape. This comes in handy when building a web scraper that may need to pause for updates or behave differently after reaching a certain number of pages.
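As a minimal sketch of reading a sitemap, the snippet below pulls the page URLs out of a sitemap's <loc> entries. The inline XML here is a made-up stand-in for a downloaded sitemap.xml file:

```python
import re

def parse_sitemap(sitemap_xml):
    """Extract page URLs from a sitemap's <loc> entries."""
    return re.findall(r"<loc>(.*?)</loc>", sitemap_xml)

# Hypothetical sitemap content; a real script would download this
# from a URL like http://example.com/sitemap.xml first.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1</loc></url>
  <url><loc>http://example.com/page2</loc></url>
</urlset>"""

print(parse_sitemap(sample))  # ['http://example.com/page1', 'http://example.com/page2']
```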

You can also run the identify.py script in the src directory to learn more about how each site was built. It should give you information about the frameworks, programming languages, and servers used to build each website, as well as the registered owner of the domain. It also uses robotparser to check for crawling restrictions.

Many websites have a robots.txt file that specifies crawling restrictions. Check this file before crawling a website for information about how to crawl it and any rules you should follow. The sample protocol can be found here.
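Python's standard library ships a robots.txt parser. This sketch feeds it a hypothetical robots.txt (a real crawler would fetch the file from the site, e.g. via set_url() and read()) and asks whether a given user agent may fetch a page:

```python
from urllib import robotparser

# Hypothetical robots.txt content; normally fetched from
# http://example.com/robots.txt before crawling.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyBot", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyBot", "http://example.com/private/data")) # False
```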

Crawling a Site

There are three general approaches to crawling a site: crawling a sitemap, iterating through an ID for each webpage, and following webpage links. download.py shows how to download a webpage using sitemap crawling, results.py shows how to scrape results while iterating through webpage IDs, and indeedScrape.py follows webpage links. download.py also contains code for inserting delays between requests, returning a list of links from HTML, and supporting proxies, which can let you access websites whose requests would otherwise be blocked.
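To give a flavor of the link-following approach, here's a minimal sketch of extracting the links from a page's HTML with a regular expression; this mirrors the idea of returning a list of links from HTML, though it isn't the actual code from download.py:

```python
import re

def get_links(html):
    """Return the href targets of all <a> tags in an HTML string."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# A crawler would download this HTML, queue the links it finds,
# and sleep between requests to avoid hammering the server.
html = '<a href="/page1">One</a> <a class="nav" href="/page2">Two</a>'
print(get_links(html))  # ['/page1', '/page2']
```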

Scraping the Data

In the file compare.py, you can compare the efficiency of the three web scraping methods.

You can use regular expressions (known as regex or regexp) to perform neat tricks with text for getting information from websites. The script regex.py shows how this is done.
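For instance, a regex with capture groups can pull structured fields out of repetitive HTML. The product markup below is invented for illustration and isn't taken from regex.py:

```python
import re

# Hypothetical competitor product listing; a real script would
# download this HTML from the site being scraped.
html = """<tr><td class="name">Trail Runner</td><td class="price">$79.99</td></tr>
<tr><td class="name">City Loafer</td><td class="price">$54.50</td></tr>"""

# Pair each product name with its listed price.
pattern = re.compile(r'<td class="name">(.*?)</td><td class="price">\$(.*?)</td>')
for name, price in pattern.findall(html):
    print(name, price)
```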

You can also use the browser extension Firebug Lite to get information from a webpage. In Chrome, you can click View >> Developer >> View Source to get the source behind a webpage.

Beautiful Soup, one of the required packages to run indeedScrape.py, parses a webpage and provides a convenient interface for navigating the content, as shown in bs4test.py. lxml does the same in lxmltest.py. A comparison of these three scraping methods is in the following table.

| Scraping method | Performance | Ease of use | Ease of install |
|-----------------|-------------|-------------|-----------------|
| Regex           | Fast        | Hard        | Easy            |
| Beautiful Soup  | Slow        | Easy        | Easy            |
| lxml            | Fast        | Easy        | Hard            |
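Of the three, Beautiful Soup is the most approachable. Here's a minimal sketch, assuming the beautifulsoup4 package is installed; the HTML and class names are made up, not taken from bs4test.py:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical product listing to parse.
html = '<ul id="products"><li class="price">$79.99</li><li class="price">$54.50</li></ul>'

# html.parser is the stdlib backend; lxml can be swapped in for speed.
soup = BeautifulSoup(html, "html.parser")
prices = [li.text for li in soup.find_all("li", class_="price")]
print(prices)  # ['$79.99', '$54.50']
```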

The callback.py script lets you scrape data and save it to an output .csv file.
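The idea behind a scrape callback can be sketched as a small class the crawler invokes after each page; the class name, fields, and file name below are illustrative, and callback.py's actual interface may differ:

```python
import csv

class ScrapeCallback:
    """Save one row of scraped values per page to a CSV file."""
    def __init__(self, path, fields):
        self.file = open(path, "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(fields)  # header row

    def __call__(self, url, row):
        # The crawler calls this after scraping each page.
        self.writer.writerow(row)

    def close(self):
        self.file.close()

cb = ScrapeCallback("prices.csv", ["product", "price"])
cb("http://example.com/shoe1", ["Trail Runner", "79.99"])
cb.close()
```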

Caching Downloads

Caching crawled webpages lets you store them in a manageable format while only having to download them once. In download.py, there's a Python class Downloader that shows how to cache URLs after downloading their webpages. cache.py has a Python class that maps a URL to a filename when caching.
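A URL-to-filename mapping in the same spirit as cache.py (though not its actual code) might replace the characters that are invalid on common file systems and cap each path segment at the usual 255-character limit:

```python
import re
from urllib.parse import urlsplit

def url_to_path(url):
    """Map a URL to a safe relative cache path."""
    parts = urlsplit(url)
    path = parts.path or "/index"
    if path.endswith("/"):
        path += "index"  # directory URLs get an index file
    filename = parts.netloc + path
    # Replace characters that are invalid on common file systems.
    filename = re.sub(r"[^/0-9a-zA-Z\-.,;_ ]", "_", filename)
    # Cap each path segment at 255 characters.
    return "/".join(seg[:255] for seg in filename.split("/")) + ".html"

print(url_to_path("http://example.com/shoes/mens?page=2"))  # example.com/shoes/mens.html
```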

Depending on which operating system you’re using, there’s a limit to how much you can cache.

| Operating system | File system | Invalid filename characters | Max filename length    |
|------------------|-------------|-----------------------------|------------------------|
| Linux            | Ext3/Ext4   | /, \0                       | 255 bytes              |
| OS X             | HFS Plus    | :, \0                       | 255 UTF-16 code units  |
| Windows          | NTFS        | \, /, ?, :, *, >, <, \|     | 255 characters         |

Though cache.py is easy to use, you can instead take a hash of the URL itself and use it as the filename, ensuring your files map directly to the URLs of the saved cache. Using MongoDB, you can build on top of the current file system database and avoid its limitations. This method is found in mongocache.py, which uses pymongo, a Python wrapper for MongoDB.
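The hash-based approach is a one-liner with the standard library: hashing the URL yields a fixed-length hexadecimal name that sidesteps every invalid-character and length limit in the table above. (The helper name and .html suffix are just illustrative.)

```python
import hashlib

def url_to_cache_name(url):
    """Hash the URL so every distinct URL maps to a unique, valid filename."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

print(url_to_cache_name("http://example.com/shoes?page=2"))
```

The trade-off is that hashed names are no longer human-readable, so you can't tell which page a cache file holds without a lookup table.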

Test out the other scripts, such as alexacb.py, which downloads information on the top sites by Alexa ranking. mongoqueue.py has functionality for queueing MongoDB inquiries and can be imported into other scripts.

You can work with dynamic webpages using the code in browserrender.py. The majority of leading websites use JavaScript for functionality, meaning you can't view all their content in barebones HTML.
