Imagine you run a business selling shoes online and want to monitor how your competitors price their products. You could spend hours a day clicking through page after page, or you could write a script for a web bot, an automated piece of software that keeps track of a site’s updates. That’s where web scraping comes in.
Scraping websites lets you extract information from hundreds or thousands of webpages at once. You can search websites like Indeed for job opportunities or Twitter for tweets. In this gentle introduction to web scraping, we’ll go over the basic code to scrape websites such that anyone, regardless of background, can extract and analyze these kinds of results.
Getting Started
Using my GitHub repository on web scraping, you can install the software and run the scripts as instructed. Click on the src directory on the repository page to see the README.md file that explains each script and how to run it.
Examining the Site
You can use a sitemap file to locate where websites upload content without crawling every single web page. Here’s a sample one. You can also find out how large a site is and how much information you can actually extract from it. You can search a site using Google’s Advanced Search to figure out how many pages you may need to scrape. This will come in handy when creating a web scraper that may need to pause for updates or act differently after reaching a certain number of pages.
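As a rough illustration (not the exact code from the repository), here is a minimal sketch of pulling page URLs out of a sitemap; the sitemap location is a hypothetical example:

```python
import re
import urllib.request

def get_sitemap_urls(sitemap_url):
    # Download the sitemap XML and extract every <loc> entry.
    xml = urllib.request.urlopen(sitemap_url).read().decode('utf-8')
    return re.findall(r'<loc>(.*?)</loc>', xml)

# Hypothetical sitemap location; substitute the site you are examining.
links = get_sitemap_urls('http://example.com/sitemap.xml')
print(len(links), 'pages listed in the sitemap')
```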
You can also run the identify.py script in the src directory to find out more about how each site was built. This should give information about the frameworks, programming languages, and servers used in building each website, as well as the registered owner of the domain. It also uses robotparser to check for crawling restrictions.
Many websites have a robots.txt file with crawling restrictions. Make sure you check this file for information about how to crawl the website and any rules you should follow. The sample protocol can be found here.
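For example, a small sketch of checking robots.txt with the standard library (urllib.robotparser in Python 3; the repository’s scripts may use the older robotparser module name) could look like this:

```python
from urllib import robotparser

# Load and parse a site's robots.txt (hypothetical site used here).
rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# can_fetch() reports whether a given user agent is allowed to crawl a URL.
print(rp.can_fetch('MyScraperBot', 'http://example.com/some/page'))
```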
Crawling a Site
There are three general approaches to crawling a site: crawling a sitemap, iterating through an ID for each webpage, and following webpage links. download.py shows how to download a webpage using sitemap crawling, results.py shows you how to scrape those results while iterating through webpage IDs, and indeedScrape.py follows webpage links for crawling. download.py also contains information on inserting delays, returning a list of links from HTML, and supporting proxies so you can access websites that block direct requests.
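Below is a simplified sketch of the link-following approach with a polite delay between requests; it is not the code in download.py, which also handles proxies and other details:

```python
import re
import time
import urllib.request
from urllib.parse import urljoin

def get_links(html):
    # Return every double-quoted href found in the page's anchor tags.
    return re.findall(r'<a[^>]+href="(.*?)"', html, re.IGNORECASE)

def crawl(seed_url, delay=1, max_pages=50):
    # Breadth-first crawl that follows links, waiting `delay` seconds per request.
    seen, queue = set(), [seed_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        try:
            html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        except Exception:
            continue
        seen.add(url)
        time.sleep(delay)  # be polite between downloads
        queue.extend(urljoin(url, link) for link in get_links(html))
    return seen

pages = crawl('http://example.com/')  # hypothetical seed URL
```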
Scraping the Data
In the file compare.py, you can compare the efficiency of the three web scraping methods.
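As a minimal illustration of the idea (not the code in compare.py), you can time repeated parses of the same page with each approach:

```python
import re
import time
from bs4 import BeautifulSoup
import lxml.html

html = '<div class="price">$39.99</div>' * 100  # small synthetic page

def time_it(parse, runs=1000):
    # Time how long `runs` repeated parses take.
    start = time.time()
    for _ in range(runs):
        parse(html)
    return time.time() - start

print('regex:', time_it(lambda h: re.findall(r'\$[\d.]+', h)))
print('bs4:  ', time_it(lambda h: [d.text for d in BeautifulSoup(h, 'html.parser').find_all('div')]))
print('lxml: ', time_it(lambda h: lxml.html.fromstring(h).xpath('//div/text()')))
```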
You can use regular expressions (known as regex or regexp) to perform neat tricks with text when pulling information out of websites. The script regex.py shows how this is done.
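For instance, a small sketch (not the exact pattern used in regex.py) that pulls prices out of a snippet of HTML:

```python
import re

html = '<span class="price">$39.99</span> <span class="price">$54.50</span>'

# Capture anything that looks like a dollar amount inside the markup.
prices = re.findall(r'\$\d+(?:\.\d{2})?', html)
print(prices)  # ['$39.99', '$54.50']
```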
You can also use the browser extension Firebug Lite to get information from a webpage. In Chrome, you can click View >> Developer >> View Source to get the source behind a webpage.
Beautiful Soup, one of the required packages to run indeedScrape.py, parses a webpage and provides a convenient interface to navigate the content, as shown in bs4test.py. lxml does the same in lxmltest.py. A comparison of these three scraping methods is given in the following table; a short Beautiful Soup sketch follows it.
Scraping method   Performance   Ease of use   Ease of install
Regex             Fast          Hard          Easy
Beautiful Soup    Slow          Easy          Easy
lxml              Fast          Easy          Hard
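Here is a small illustration of the Beautiful Soup interface (a sketch with made-up markup, not the contents of bs4test.py):

```python
from bs4 import BeautifulSoup

html = '<div id="results"><a href="/job/1">Data Analyst</a><a href="/job/2">Engineer</a></div>'

# Parse the markup and navigate it by tag and attribute.
soup = BeautifulSoup(html, 'html.parser')
results = soup.find('div', id='results')
for link in results.find_all('a'):
    print(link['href'], link.get_text())
```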
The callback.py script lets you scrape data and save it to an output .csv file.
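A stripped-down sketch of the idea, assuming a callback object that receives each scraped row as a dictionary and appends it to a CSV file (the actual callback.py may differ), might look like this:

```python
import csv

class CsvCallback:
    # Called once per scraped page; appends the extracted fields to a CSV file.
    def __init__(self, path='output.csv', fieldnames=('url', 'title', 'price')):
        self.path = path
        self.fieldnames = fieldnames
        with open(self.path, 'w', newline='') as f:
            csv.writer(f).writerow(self.fieldnames)

    def __call__(self, row):
        with open(self.path, 'a', newline='') as f:
            csv.writer(f).writerow([row.get(name, '') for name in self.fieldnames])

save_row = CsvCallback()
save_row({'url': 'http://example.com/item/1', 'title': 'Running shoe', 'price': '$39.99'})
```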
Caching Downloads
Caching crawled webpages lets you store them in a manageable format while only having to download them once. In download.py, there’s a Python class Downloader that shows how to cache URLs after downloading their webpages. cache.py has a Python class that maps a URL to a filename when caching.
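A toy version of the idea, assuming cached pages live under a cache/ directory and the URL is sanitized into a filename (the real cache.py handles more edge cases), is sketched below:

```python
import os
import re

class DiskCache:
    # Map each URL to a file on disk so a page only has to be downloaded once.
    def __init__(self, cache_dir='cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # Replace characters that are invalid in filenames (see the table below).
        filename = re.sub(r'[^0-9a-zA-Z\-.]', '_', url)
        return os.path.join(self.cache_dir, filename[:255])

    def __setitem__(self, url, html):
        with open(self._path(url), 'w', encoding='utf-8') as f:
            f.write(html)

    def __getitem__(self, url):
        with open(self._path(url), encoding='utf-8') as f:
            return f.read()
```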
Depending on which operating system you’re using, there are limits on the filenames you can use when caching.
Operating system   File system   Invalid filename characters   Max filename length
Linux              Ext3/Ext4     /, \0                         255 bytes
OS X               HFS Plus      :, \0                         255 UTF-16 code units
Windows            NTFS          \, /, ?, :, *, >, <, |        255 characters
Though cache.py is easy to use, you can instead take the hash of the URL itself and use it as the filename, ensuring your files map directly to the URLs of the saved cache. Using MongoDB, you can replace the file system–based cache with a database and avoid those file system limitations. This method is found in mongocache.py, which uses pymongo, a Python driver for MongoDB.
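Here is a hedged sketch of both ideas: hashing the URL into a fixed-length filename, and storing pages in MongoDB via pymongo (the actual mongocache.py may structure its documents differently, and this assumes a MongoDB server running locally):

```python
import hashlib
from pymongo import MongoClient

def url_to_filename(url):
    # A hash gives a fixed-length filename that always maps back to the same URL.
    return hashlib.md5(url.encode('utf-8')).hexdigest() + '.html'

class MongoCache:
    # Store each downloaded page as a document keyed by its URL,
    # sidestepping filename restrictions entirely.
    def __init__(self, client=None):
        self.db = (client or MongoClient('localhost', 27017)).cache
        self.db.webpage.create_index('url', unique=True)

    def __setitem__(self, url, html):
        self.db.webpage.replace_one({'url': url}, {'url': url, 'html': html}, upsert=True)

    def __getitem__(self, url):
        record = self.db.webpage.find_one({'url': url})
        if record is None:
            raise KeyError(url + ' is not cached')
        return record['html']
```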
Test out the other scripts, such as alexacb.py for downloading information on the top sites by Alexa ranking. mongoqueue.py has queueing functionality built on MongoDB that can be imported into other scripts.
You can work with dynamic webpages using the code from browserrender.py. The majority of leading websites use JavaScript for functionality, meaning you can’t view all their content in bare-bones HTML.
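One common way to handle this is to let a real browser execute the page’s JavaScript and then read the rendered DOM, for example with Selenium (a sketch of that approach; browserrender.py may use a different library, and this assumes Chrome and its driver are installed):

```python
from selenium import webdriver

# Launch a browser, let it run the page's JavaScript, then read the rendered DOM.
driver = webdriver.Chrome()
driver.get('http://example.com/dynamic-page')  # hypothetical dynamic page
rendered_html = driver.page_source
driver.quit()
```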