In this article we’re going to learn how to scrape a website using Scrapy. This will let us extract data from web pages and export it in a variety of formats.
First we need to install Scrapy and set up a project for the site we want to scrape. Then we’ll write the code that tells Scrapy how to crawl the site and extract data.
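If you don’t already have Scrapy installed, setup is typically just a couple of commands; the project name `faculty` below is only a placeholder:

```bash
pip install scrapy            # install Scrapy into the active Python environment
scrapy startproject faculty   # generate a project skeleton to hold our spiders
```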
We’ll start by writing a spider. A spider is a class that tells Scrapy where to start crawling, what requests to make, and how to parse the data it finds. It also defines callback methods that process each response and yield the results in the format we want.
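As a minimal sketch of that structure, here is the smallest useful spider; the class name, spider name, and URL are placeholders rather than part of the example we’ll build:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                      # identifier used by `scrapy crawl example`
    start_urls = ["https://example.com"]  # the first request(s) the spider makes

    def parse(self, response):
        # Default callback: Scrapy calls this with the response for each
        # start URL. Yield dictionaries (or items) to emit scraped data.
        yield {"title": response.xpath("//title/text()").get()}
```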
Next we’ll create a method called parse, which tells the spider what data to look for on each page, which links to follow, and how to extract that data. In this example we’ll scrape the names and email addresses of faculty members from the detail pages linked from the UCSB psychology faculty page.
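A sketch of what that might look like follows. The start URL, the CSS selector for the detail-page links, and the XPath expressions are all assumptions about the site’s markup and would need to be checked against the live pages:

```python
import scrapy

class FacultySpider(scrapy.Spider):
    name = "faculty"
    # Assumed listing URL; verify against the actual faculty page.
    start_urls = ["https://psych.ucsb.edu/people/faculty"]

    def parse(self, response):
        # Follow every link to a faculty detail page. The selector is a
        # guess at the listing page's markup.
        for href in response.css("a.faculty-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Illustrative XPath expressions for the detail pages.
        yield {
            "name": response.xpath("//h1/text()").get(default="").strip(),
            "email": response.xpath('//a[starts-with(@href, "mailto:")]/text()').get(),
        }
```

Running `scrapy crawl faculty -o faculty.json` from the project directory would export whatever the spider yields as JSON.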
To scrape this information we’ll use XPath queries to locate the elements we’re interested in and regular expressions to pull specific patterns, such as email addresses, out of their text. Once Scrapy has extracted the information from each page it visits, we’ll output it as items. Items are akin to Python dictionaries: they hold the extracted data in named fields.
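To illustrate, here is one way the item and the XPath-plus-regex extraction could fit together. `FacultyItem`, the sample HTML, and the email pattern are illustrative stand-ins, not taken from a finished project; the snippet runs on its own, without a crawl:

```python
import scrapy
from scrapy.selector import Selector

class FacultyItem(scrapy.Item):
    # Declared fields behave like the keys of a dictionary.
    name = scrapy.Field()
    email = scrapy.Field()

# A tiny inline page standing in for a real faculty detail page.
html = """<html><body><h1>Jane Doe</h1>
          <p>Contact: jdoe@psych.ucsb.edu</p></body></html>"""
sel = Selector(text=html)

item = FacultyItem()
# XPath selects the nodes; re_first() applies a regular expression to their
# extracted text and returns the first match, or None if nothing matches.
item["name"] = sel.xpath("//h1/text()").get(default="").strip()
item["email"] = sel.xpath("//body").re_first(r"[\w.+-]+@[\w-]+\.[\w.-]+")
print(dict(item))  # {'name': 'Jane Doe', 'email': 'jdoe@psych.ucsb.edu'}
```

In the spider itself, parse_detail could yield a FacultyItem like this instead of a plain dictionary, which lets Scrapy validate the fields and feed them through its export pipeline.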