Web scraping using Python and Scrapy
Overview
Teaching: 90 min
Exercises: 30 min
Questions
How can scraping a web site be automated?
How can I set up a scraping project using the Scrapy framework for Python?
How do I tell Scrapy what elements to scrape from a webpage?
How do I tell Scrapy to follow URLs and scrape their contents?
What to do with the data extracted with Scrapy?
Objectives
Setting up a Scrapy project.
Understanding the various elements of a Scrapy project.
Creating a spider to scrape a website and extract specific elements.
Creating a two-step spider to first extract URLs, visit them, and scrape their contents.
Storing the extracted data.
Recap
Here is what we have learned so far:
- We can use XPath queries to select what elements on a page to scrape.
- We can look at the HTML source code of a page to find how target elements are structured and how to select them.
- We can use the browser console and the $x(...) function to try out XPath queries on a live site.
- We can use the Scraper browser extension to scrape data from a single web page. Its interface even tries to guess the XPath query to target the elements we are interested in.
This is quite a toolset already, and it’s probably sufficient for a number of use cases, but there are limitations in using the tools we have seen so far. Scraper requires manual intervention and only scrapes one page at a time. Even though it is possible to save a query for later, it still requires us to operate the extension.
Introducing Scrapy
Enter Scrapy! Scrapy is a framework for the Python programming language.
A framework is a reusable, “semi-complete” application that can be specialized to produce custom applications. (Source: Johnson & Foote, 1988)
In other words, the Scrapy framework provides a set of Python scripts that contain most of the code required to use Python for web scraping. We need only to add the last bit of code required to tell Python what pages to visit, what information to extract from those pages, and what to do with it. Scrapy also comes with a set of scripts to setup a new project and to control the scrapers that we will create.
It also means that Scrapy doesn’t work on its own. It requires a working Python installation (recent versions of Scrapy, including the 2.x series used here, require Python 3) and a series of libraries to work. If you haven’t installed Python or Scrapy on your machine, you can refer to the setup instructions. If you install Scrapy as suggested there, it should take care of installing all required libraries as well.
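If Scrapy is not installed yet, a common way to add it to an existing Python 3 environment is with pip (this is only a sketch; follow the setup instructions for your own environment, and conda users can use the conda-forge channel instead):
pip install scrapy
# or, in a conda environment:
conda install -c conda-forge scrapy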
You can verify that you have the latest version of Scrapy installed by typing
scrapy version
in a shell. If all is good, you should get the following back (as of mid-2020):
Scrapy 2.1.0
If you have a newer version, you should be fine as well.
To introduce the use of Scrapy, we will reuse the same example we used in the previous section. We will start by scraping a list of URLs from the faculty listing of the Psychological & Brain Sciences department at UCSB, and then visit those URLs to scrape detailed information about those faculty members.
Set up a new Scrapy project
The first thing to do is to create a new Scrapy project.
Let’s navigate first to a folder on our drive where we want to create our project (refer to Software Carpentry’s lesson about the UNIX shell if you are unsure about how to do that). Then, type the following
scrapy startproject carpwebscraping
where carpwebscraping
is the name of our project.
Scrapy should respond with something similar to (the paths will reflect your own file structure)
New Scrapy project 'carpwebscraping', using template directory '/opt/conda/lib/python3.7/site-packages/scrapy/templates/project', created in:
/home/jovyan/carpwebscraping
You can start your first spider with:
cd carpwebscraping
scrapy genspider example example.com
If we list the files in the directory where we ran the previous command
ls -F
we should see that a new directory was created:
carpwebscraping/
(alongside any other files and directories you had lying around previously). Moving into that new directory
cd carpwebscraping
we can see that it contains two items:
ls -F
carpwebscraping/ scrapy.cfg
Yes, confusingly, Scrapy creates a subdirectory called carpwebscraping
within the carpwebscraping
project
directory. Inside that second directory, we see a bunch of additional files:
ls -F carpwebscraping
__init__.py items.py settings.py
__pycache__ pipelines.py spiders/
To recap, here is the structure that scrapy startproject
created:
carpwebscraping/ # the root project directory
scrapy.cfg # deploy configuration file
carpwebscraping/ # project's Python module, you'll import your code from here
__init__.py
items.py # project items file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
...
We will introduce what those files are for in the next paragraphs. The most important item is the
spiders
directory: this is where we will write the scripts that will scrape the pages we
are interested in. Scrapy calls such scripts spiders.
Creating a spider
Spiders are the business end of the scraper. They are the bits of code that comb through a website and harvest data. Their general structure is as follows:
- One or more start URLs, where the spider will start crawling
- A list of allowed domains to constrain the pages we allow our spider to crawl (this is a good way to avoid accidentally writing an out-of-control spider that starts crawling the entire Internet…)
- A method called
parse
in which we will write what data the spider should look for on the pages it visits, what links to follow, and how to parse the data it finds.
To create a spider, Scrapy provides a handy command-line tool:
scrapy genspider <SCRAPER NAME> <START URL>
Don’t include http:// when running scrapy genspider
scrapy genspider expects URLs without the http:// prefix. If you do include the http:// prefix, you might see that the value in start_urls in the generated spider has that prefix twice, because Scrapy prepends it by default. This will cause your spider to fail. Either run scrapy genspider without http:// or check the resulting spider so that it looks like the code below.
We just need to replace <SCRAPER NAME>
with the name we want to give our spider and <START URL>
with
the URL we want the spider to start crawling. In our case, we can type:
scrapy genspider psychfaculty 'www.psych.ucsb.edu/people?people_type=6'
This will create a file called psychfaculty.py
inside the spiders
directory of our project.
Using our favourite text editor, let’s open that file. It should look something like this:
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty' # The name of this spider
# The allowed domain and the URLs where the spider should start crawling:
allowed_domains = ['www.psych.ucsb.edu/people?people_type=6']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
# And a 'parse' function, which is the main method of the spider. The content of the scraped
# URL is passed on as the 'response' object:
def parse(self, response):
pass
Note that some comments have been added here for extra clarity; they will not be there when you first create a spider.
Object-oriented programming and Python classes
You might be unfamiliar with the class PsychfacultySpider(scrapy.Spider) syntax used above. This is an example of object-oriented programming.
All elements of a piece of Python code are objects: functions, variables, strings, integers, etc. Objects of a certain type have certain things in common. For example, it is possible to apply special functions to all strings in Python by using syntax such as mystring.upper() (this will make the contents of mystring all uppercase).
We call these types of objects classes. A class defines the components of an object (called attributes), as well as specific functions, called methods, we might want to run on those objects. For example, we could define a class called Pet that would contain the attributes name, colour, age, etc. as well as the methods run() or cuddle(). Those are common to all pets.
We can use the object-oriented paradigm to describe a specific type of pet: Dog would inherit the attributes and methods of Pet (dogs have names and can run and cuddle) but would extend the Pet class by adding dog-specific things like a pedigree attribute and a bark() method.
The code in the example above defines a class called PsychfacultySpider that inherits the Spider class defined by Scrapy (hence the scrapy.Spider syntax). We are extending the default Spider class by defining the name, allowed_domains and start_urls attributes, as well as the parse() method.
The Spider class
A Spider class will define how a certain site (or a group of sites, defined in start_urls) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items) in the parse() method.
In other words, spiders are the place where we define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
Once we have the spider open in a text editor, we can start by tidying up the code that Scrapy has automatically generated.
Paying attention to allowed_domains
Looking at the code that was generated by genspider, we see that by default the entire start URL has ended up in the allowed_domains attribute. Is this desired? What do you think would happen if later in our code we wanted to scrape a page living at the address https://www.psych.ucsb.edu/people?people_type=7?
Solution
allowed_domains is a safeguard for our spider: it restricts which pages the spider is allowed to crawl. Scrapy expects it to contain domain names, such as www.mydomain.ca, not full URLs. A spider with allowed_domains = ["www.mydomain.ca"] may follow links anywhere on that domain, for example both www.mydomain.ca/house/dog.html and www.mydomain.ca/garage/, but requests to pages on other domains, say www.otherdomain.ca, will be filtered out as “offsite”.
The value that genspider put there, allowed_domains = ['www.psych.ucsb.edu/people?people_type=6'], is a full URL (complete with a query string) rather than a domain, so we cannot rely on it to let our spider follow links to other pages on the site, such as https://www.psych.ucsb.edu/people?people_type=7 or the faculty detail pages like https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada, which is not what we want.
How should allowed_domains be set to prevent this from happening?
Solution
We should let the spider scrape all pages inside the www.psych.ucsb.edu domain by editing it so that it reads:
allowed_domains = ["www.psych.ucsb.edu"]
Here is what the spider looks like after cleaning the code a little:
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty'
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
pass
Don’t forget to save the file once changes have been applied.
Running the spider
Now that we have a first spider set up, we can try running it. Going back to the Terminal, we first make sure
we are located in the project’s top level directory (where the scrapy.cfg
file is) by using ls
, pwd
and
cd
as required, then we can run:
scrapy crawl psychfaculty
Note that we can now use the name we have chosen for our spider (psychfaculty
, as specified in the name
attribute)
to call it. This should produce the following result
2020-06-15 22:13:32 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: carpwebscraping)
(followed by a bunch of debugging output ending with:)
2020-06-15 22:13:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people?people_type=6/> (referer: None)
2020-06-15 22:13:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-15 22:13:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13901,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.005382,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 15, 22, 13, 33, 856350),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 57393152,
'memusage/startup': 57393152,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 6, 15, 22, 13, 32, 850968)}
2020-06-15 22:13:33 [scrapy.core.engine] INFO: Spider closed (finished)
The line that starts with DEBUG: Crawled (200)
is good news, as it tells us that the spider was
able to crawl the website we were after. The number in parentheses is the HTTP status code that
Scrapy received in response to its request to access that page. 200 means that the request was successful
and that data (the actual HTML content of that page) was sent back in response.
However, we didn’t do anything with it, because the parse
method in our spider is currently empty.
Let’s change that by editing the spider as follows (note the contents of the parse
method):
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
Add the following two lines inside the parse() method:
with open("test.html", 'wb') as file:
    file.write(response.body)
The whole file should now read:
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty'
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
with open("test.html", 'wb') as file:
file.write(response.body)
Now, if we go back to the command line and run our spider again
scrapy crawl psychfaculty
we should get similar debugging output as before, but there should also now be a file called
test.html
in our project’s root directory:
ls -F
carpwebscraping/ scrapy.cfg test.html
We can check that it contains the HTML from our target URL:
head -n 12 test.html
<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema# ">
<head>
<meta charset="utf-8" />
<meta name="Generator" content="Drupal 8 (https://www.drupal.org)" />
<meta name="MobileOptimized" content="width" />
<meta name="HandheldFriendly" content="true" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="shortcut icon" href="/core/misc/favicon.ico" type="image/vnd.microsoft.icon" />
<script>window.a2a_config=window.a2a_config||{};a2a_config.callbacks=[];a2a_config.overlays=[];a2a_config.templates={};</script>
<title>People | Psychological & Brain Sciences | UCSB</title>
Defining which elements to scrape using XPath
Now that we know how to access the content of the web page with the contact information of the psychology faculty, the next step is to extract the information we are interested in, in this case the URLs pointing to the detail pages for each faculty member.
Using the techniques we have learned earlier, we can start by looking at the source code for our target page by using either the “View Source” or “Inspect” functions of our browser. Here is an excerpt of that page:
(...)
<div id="block-psych-content" class="block-system block-psych-content">
<div class="views-element-container"><div class="peep--main-container js-view-dom-id-b141f8be3e80bfcacf768b6ca21b47b1730ea874f362d3a36f945b2c5bab2e4d">
<table class="cols-0 sticky-enabled">
<tbody>
<tr class="rev--people--row">
<td>
<h2>
<a href="/people/faculty/nicole-alea-albada" hreflang="en">Nicole Alea Albada</a></h2>
<p>Assistant Teaching Professor</p></td>
<td class="about--teaser"></td>
<td> </td>
</tr>
<tr class="rev--people--row">
<td>
<h2>
<a href="/people/faculty/greg-ashby" hreflang="en">Greg Ashby</a></h2>
<p>Distinguished Professor</p> </td>
<td class="about--teaser">Professor Ashby is interested in the basic cognitive and neural mechanisms that mediate human learning. His approach combines experimental psychology, cognitive neuroscience, and mathematical modeling </td>
<td>
<ul>
<li>
<a href="/research/cognition-perception-and-cognitive-neuroscience" hreflang="en">Cognition, Perception, and Cognitive Neuroscience</a></li>
</ul>
<a href="http://www.dyns.ucsb.edu/">Dynamical Neuroscience</a></td>
</tr>
(...)
There are different strategies to target the data we are interested in. One of them is to identify that the URLs are inside a elements, which are nested in h2 elements, which are inside td elements, which sit within the tr elements of class rev--people--row.
We can determine that the XPath syntax to access all such elements is //tr[@class='rev--people--row']
, which we can
try out in the browser console:
> $x("//tr[@class='rev--people--row']")
Selecting elements assigned to multiple classes
The above XPath works in this case because the target tr elements are only assigned the one rev--people--row class. It wouldn’t work if those elements had more than one class, for example <tr class="psychFacultyDisplay primary rev--people-row">, which has 3 classes applied to the tr.
A lengthier but more specific XPath to select the elements we’re interested in can be found by right-clicking the element in the inspector of the webpage and then choosing “Copy full XPath”. (You might need to trim a number in square brackets at the end of the XPath to select more than a single element.) This results in the following XPath:
/html/body/div[2]/div[2]/div/div[2]/div/section/div[2]/div/div[2]/div/div/table[2]/tbody/tr
However, as there is only one class assigned to the tr elements we’re interested in, we can proceed with the shorter XPath.
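If we did have to deal with elements carrying several classes, one common approach (shown here only as a sketch, since our page doesn’t need it) is to match a single class with XPath’s contains() function. Note that contains() does plain substring matching, so it can also match class names that merely contain the given text:
> $x("//tr[contains(@class, 'rev--people--row')]")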
Once we have confirmed that we are targeting the right elements, we can expand our XPath query
to only select the href
attribute of the URL:
> $x("//tr[@class='rev--people--row']/td/h2/a/@href")
This returns an array of objects:
<- (44) [href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href, href]
Debugging using the Scrapy shell
As we learned in the previous section, using the browser console and the $x()
syntax can be useful to make sure
we are selecting the desired elements using our XPath queries. But it is not the only way. Scrapy provides a similar
way to test out XPath queries, with the added benefit that we can then also debug how to further work on those
queries from within Scrapy.
This is achieved by calling the Scrapy shell from the command line:
scrapy shell 'https://www.psych.ucsb.edu/people?people_type=6'
which launches a Python console that allows us to type live Python and Scrapy code to
interact with the page which Scrapy just downloaded from the provided URL. We can see that we are inside an
interactive Python console because the prompt will have changed to In [1]:
(similar Scrapy debug text as before)
2020-06-15 22:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people?people_type=6> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10c7eae80>
[s] item {}
[s] request <GET https://www.psych.ucsb.edu/people?people_type=6>
[s] response <200 https://www.psych.ucsb.edu/people?people_type=6>
[s] settings <scrapy.settings.Settings object at 0x10c7ead30>
[s] spider <DefaultSpider 'default' at 0x10cb837f0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
2020-06-16 23:36:09 [asyncio] DEBUG: Using selector: EpollSelector
In [1]:
We can now try running the XPath query we just devised against the response
object, which in Scrapy
contains the downloaded web page:
In [1]: response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href")
This will return a bunch of Selector
objects (one for each URL found):
Out[1]:
[<Selector xpath="//tr[@class='rev--people--row']/td/h2/a/@href" data='/people/faculty/nicole-alea-albada'>,
<Selector xpath="//tr[@class='rev--people--row']/td/h2/a/@href" data='/people/faculty/greg-ashby'>,
...]
In [2]:
Challenge: What’s the difference between the outputs of these two XPath statements?
If we play with the XPaths, we find an even shorter XPath that gets a similar result:
response.xpath("//a/@href")
Out[2]:
(...)
<Selector xpath='//a/@href' data='/people/faculty/nicole-alea-albada'>,
<Selector xpath='//a/@href' data='/people/faculty/greg-ashby'>,
<Selector xpath='//a/@href' data='/research/cognition-perception-and-co...'>,
(...)
In [3]:
Compare:
response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href")
versus:
response.xpath("//a/@href")
Why are these XPaths not synonymous with one another?
Solution
response.xpath("//a/@href")
matches all the link URLs on the page, including ones we’re not particularly interested in. You can see that the third selector in the output above contains a relative link for Greg Ashby’s research area rather than for the next faculty member.
The objects returned from our response.xpath()
are pointers to the elements in the scraped page (href
attributes) that match our XPath query. To get to the actual content of those elements (the text of the URLs),
we can use the extract()
method. A variant of that method is extract_first()
which does the
same thing as extract()
but only returns the first element if there is more than one:
response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract_first()
returns
Out[3]: '/people/faculty/nicole-alea-albada'
In [4]:
Dealing with relative URLs
Looking at this result and at the source code of the page, we realize that the URLs are all relative to that page. They are all missing part of the URL to become absolute URLs, which we will need if we want to ask our spider to visit those URLs to scrape more data. We could prefix all those URLs with https://www.psych.ucsb.edu to make them absolute, but since this is a common occurrence when scraping web pages, Scrapy provides a built-in function to deal with this issue.
To try it out, still in the Scrapy shell, let’s first store the first returned URL into a variable:
testurl = response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract_first()
Then, we can try passing it on to the urljoin() method:
response.urljoin(testurl)
which returns
Out[5]: 'https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada'
We see that Scrapy was able to reconstruct the absolute URL by combining the URL of the current page context (the page in the response object) and the relative link we had stored in testurl.
Extracting URLs using the spider
Armed with the correct query, we can now update our spider accordingly. The parse method receives the contents of the scraped page in the response object. The response object supports a variety of methods to act on its contents:
Method | Description
---|---
xpath() | Returns a list of selectors, each of which points to the nodes selected by the XPath query given as argument
css() | Works similarly to the xpath() method, but uses CSS expressions to select elements.
Those methods will return objects of a different type, called selectors
. As their name implies,
these objects are “pointers” to the elements we are looking for inside the scraped page. In order
to get the “content” that the selectors
are pointing to, the following methods should be used:
Method | Description
---|---
extract() | Returns the entire contents of the element(s) selected by the selector object, as a list of strings.
extract_first() | Returns the content of the first element selected by the selector object.
re() | Returns a list of unicode strings within the element(s) selected by the selector object by applying the regular expression given as argument.
re_first() | Returns the first match of the regular expression.
Know when to use extract()
The important thing to remember is that xpath() and css() return selector objects, on which it is then possible to apply the xpath() and css() methods a second time in order to further refine a query. Once you’ve reached the elements you’re interested in, you need to call extract() or extract_first() to get their contents as string(s).
re(), by contrast, returns a list of strings, so it is no longer possible to apply xpath() or css() to the results of re(). Since it already returns strings, you don’t need to use extract() there.
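As a small sketch of how these methods fit together, the following could be run in the Scrapy shell on the faculty listing page (the exact values returned depend on the live site):
# xpath() returns selectors, which can be refined with further xpath() calls
rows = response.xpath("//tr[@class='rev--people--row']")
rows.xpath("./td/h2/a/@href").extract_first()  # a string such as '/people/faculty/nicole-alea-albada'
# re() is applied to a selector and returns strings directly, no extract() needed
rows.xpath("./td/h2/a/text()").re(r'\w+')      # a list of strings matched by the regular expression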
Since we have an XPath query we know will extract the URLs we are looking for, we can now use
the xpath()
method and update the spider accordingly:
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty'
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract():
print(response.urljoin(url))
Looping through results
Why are we using extract() instead of extract_first() in the code above? Why do we need the for clause?
Solution
We are not only interested in the first extracted URL but in all of them. extract_first() only returns the content of the first in a series of selected elements, while extract() will return all of them in the form of a list. The for syntax allows us to loop through each of the returned elements one by one.
We can now run our new spider:
scrapy crawl psychfaculty
which produces a result similar to:
2020-06-16 20:52:29 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: carpwebscraping)
(..)
https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada
https://www.psych.ucsb.edu/people/faculty/greg-ashby
https://www.psych.ucsb.edu/people/faculty/michael-beyeler
https://www.psych.ucsb.edu/people/faculty/jim-blascovich
https://www.psych.ucsb.edu/people/faculty/nancy-collins
https://www.psych.ucsb.edu/people/faculty/daniel-conroy-beam
https://www.psych.ucsb.edu/people/faculty/leda-cosmides
https://www.psych.ucsb.edu/people/faculty/miguel-eckstein
https://www.psych.ucsb.edu/people/faculty/aaron-ettenberg
https://www.psych.ucsb.edu/people/faculty/john-foley
(...)
2020-06-16 20:52:29 [scrapy.core.engine] INFO: Spider closed (finished)
We can now pat ourselves on the back, as we have completed the first stage of our project by successfully extracting all URLs leading to the faculty profiles!
Limit the number of URLs to scrape while debugging
We’ve seen by testing the code above that we are able to successfully gather all URLs from the list of faculty. But while we’re working towards the final code that will extract the data we want from those pages, it’s probably a good idea to only run it on a handful of pages at a time.
This will not only run faster and allow us to iterate more quickly between different revisions of our code, it will also not burden the server too much while we’re debugging. This is probably not such an issue for a couple of hundred pages, but it’s good practice, as it can make a difference for larger scraping projects. If you are planning to scrape a massive website with thousands of pages, it’s better to start small. Other visitors to that site will thank you for respecting their legitimate desire to access it while you’re debugging your scraper…
An easy way to limit the number of URLs we want to send our spider to is to take advantage of the fact that the extract() method returns a list of matching elements. In Python, lists can be sliced using the list[start:end] syntax, and we can leave out either the start or end delimiters:
list[start:end]  # items from start through end-1
list[start:]     # items from start through the rest of the list
list[:end]       # items from the beginning through end-1
list[:]          # all items
We can therefore edit our spider thusly to only scrape the first five URLs:
import scrapy

class PsychfacultySpider(scrapy.Spider):
    name = 'psychfaculty'
    allowed_domains = ['www.psych.ucsb.edu']
    start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']

    def parse(self, response):
        for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract()[:5]:
            print(response.urljoin(url))
Note that if fewer than five URLs were returned, the slice would simply return all of them rather than causing an error; here there are more than enough.
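A quick illustration of this forgiving behaviour in plain Python:
urls = ['/a', '/b', '/c']
print(urls[:5])  # prints ['/a', '/b', '/c']: no error, just everything that is available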
Part 2: More Scraping with Scrapy
Recursive scraping
Now that we have successfully harvested the URLs to the detail pages, let’s edit our spider to instruct it to visit those pages one by one. To do this, we define a new method, biopage, that we want to run on the individual faculty bio pages:
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = "psychfaculty" # The name of this spider
# The allowed domain and the URLs where the spider should start crawling:
allowed_domains = ['www.psych.ucsb.edu', 'psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
# The main method of the spider. It scrapes the URL(s) specified in the
# 'start_url' argument above. The content of the scraped URL is passed on
# as the 'response' object.
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract()[:5]:
# This loops through all the URLs found inside an element of class 'rev--people--row'
# Constructs an absolute URL by combining the response’s URL with a possible relative URL:
full_url = response.urljoin(url)
print("Found URL: "+full_url)
# The following tells Scrapy to scrape the URL in the 'full_url' variable
# and calls the 'biopage()' method below with the content of this
# URL:
yield scrapy.Request(full_url, callback=self.biopage)
def biopage(self, response):
# This method is called on by the 'parse' method above. It scrapes the URLs
# that have been extracted in the previous step.
print("Visited URL: "+response.url)
We’ve also added some comments to the code to make it easier to read and understand.
If we now run our spider again:
scrapy crawl psychfaculty
We should see the result of our print
statements interspersed with the regular Scrapy
debugging output, something like:
2020-06-16 21:24:30 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: carpwebscraping)
(...)
2020-06-16 21:24:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/robots.txt> (referer: None)
2020-06-16 21:24:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people?people_type=6> (referer: None)
Found URL: https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada
Found URL: https://www.psych.ucsb.edu/people/faculty/greg-ashby
Found URL: https://www.psych.ucsb.edu/people/faculty/michael-beyeler
Found URL: https://www.psych.ucsb.edu/people/faculty/jim-blascovich
Found URL: https://www.psych.ucsb.edu/people/faculty/nancy-collins
2020-06-16 21:24:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-16 21:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/jim-blascovich> (referer: https://www.psych.ucsb.edu/people?people_type=6)
Visited URL: https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada
2020-06-16 21:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/nancy-collins> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-16 21:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/greg-ashby> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-16 21:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/michael-beyeler> (referer: https://www.psych.ucsb.edu/people?people_type=6)
Visited URL: https://www.psych.ucsb.edu/people/faculty/jim-blascovich
Visited URL: https://www.psych.ucsb.edu/people/faculty/nancy-collins
Visited URL: https://www.psych.ucsb.edu/people/faculty/greg-ashby
Visited URL: https://www.psych.ucsb.edu/people/faculty/michael-beyeler
(...)
2020-06-16 21:24:31 [scrapy.core.engine] INFO: Spider closed (finished)
We’ve truncated the results above to make it easier to read, but on your console
you should see that all 5 URLs (remember, we are limiting the number of URLs to scrape
for now) have been first “found” (by the parse()
method) and then “visited”
(by the biopage()
method).
Asynchronous requests
If you look closely at the output of the code we’ve just run, you might be surprised to see that the “Found URL” and “Visited URL” statements didn’t necessarily get printed out one after the other, as we might expect.
The reason this is so is that Scrapy requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed before it runs another, or do other things in the meantime. This is more efficient than running each request one after the other, and it also allows Scrapy to keep working away even if some requests fail for whatever reason.
This is especially advantageous when scraping large websites. Depending on the resources of the computer on which Scrapy runs, it can scrape hundreds or thousands of pages simultaneously.
If you want to know more, the Scrapy documentation has a page detailing how the data flows between Scrapy’s components.
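With that power comes the responsibility of not overloading the sites we scrape. If you ever need to slow a spider down, concurrency and delays can be tuned in the project’s settings.py. The values below are only a sketch, not recommendations; see the Scrapy documentation for the defaults and for the optional AutoThrottle extension:
# carpwebscraping/carpwebscraping/settings.py (excerpt)
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # limit simultaneous requests to the same domain
DOWNLOAD_DELAY = 1                  # wait about one second between requests to the same site
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server response times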
Scrape the detail pages
Now that we are able to visit each one of the detail pages, we should work on getting the data that we want out of them. In our example, we are primarily looking to extract the following details:
- Name
- Email address
Challenge: Scrape name and email address
Write XPath queries to scrape the faculty member’s name and email address displayed on each of the detail pages that are linked from the UCSB psychology faculty page.
Try out your queries on a handful of detail pages to make sure you are getting consistent results.
Tips:
- Look at the source code and try out your XPath queries until you find what you are looking for.
- You can use the browser console to try out your queries, but always confirm using the Scrapy shell mode (see above).
- The syntax for selecting an element like <div class="mytarget"> is div[@class = 'mytarget'].
- The syntax to select the value of an attribute of the type <element attribute="value"> is element/@attribute.
Solution
This is one way of returning the faculty member name (using the Scrapy shell):
scrapy shell https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada

In [1]: response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract()
Out[1]: ['\n Nicole Alea Albada\n']
As you can see, the returned list contains a single element: Nicole’s name, surrounded by whitespace and newline characters.
HINT: use strip() to remove the extra characters and whitespace (i.e. \n) from the strings. In order to use strip() we have to combine it with extract_first() instead of extract(). This is okay in our use case because there is only one name and one email address per faculty member.
In [2]: response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract_first().strip()
Out[2]: 'Nicole Alea Albada'
And this returns Nicole’s email address:
In [3]: response.xpath('//*[@id="block-psych-content"]/div/section[2]/div[1]/i/a/text()').extract_first().strip()
'nicole.albada@psych.ucsb.edu'
Scraping using Regular Expressions
In combination with XPath queries, it is also possible to use Regular Expressions to scrape the contents of a web page.
This is done by using the re() method. That method behaves a bit differently than the xpath() method in that it has to be applied to a selector object, and it returns a list of unicode strings (it is therefore not necessary to use extract() on its results).
Using the Scrapy shell, try writing a query that uses Regular Expressions to select the room number found on a faculty member’s detail page, regardless of where on the page it appears.
You might find the Regex 101 interactive Regular Expressions tester useful to get to the proper syntax.
Tips:
- We are looking for a string of the form: Room NNNN
- re() expects a regular expression string, which should be prefixed by r, as in re(r'Name:\s*(.*)').
- Remember that re() is run on a selector object, so you can’t do response.re(r'...'). Instead you may want to try something like response.xpath('//body').re(r'...').
Solution
This returns the room number (using the Scrapy shell):
scrapy shell https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada

In [1]: response.xpath('//body').re(r'Room (\d+)')
['3839']
Once we have found XPath queries to run on the detail pages and are happy with the result (we’ll take the name query from our previous challenge),
we can add them to the biopage()
method of our spider:
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty' # The name of this spider
# The allowed domain and the URLs where the spider should start crawling:
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
# The main method of the spider. It scrapes the URL(s) specified in the
# 'start_url' argument above. The content of the scraped URL is passed on
# as the 'response' object.
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract()[:5]:
# This loops through all the URLs found inside an element of class 'rev--people--row'
# Constructs an absolute URL by combining the response’s URL with a possible relative URL:
full_url = response.urljoin(url)
print("Found URL: "+full_url)
# The following tells Scrapy to scrape the URL in the 'full_url' variable
# and calls the 'biopage()' method below with the content of this
# URL:
yield scrapy.Request(full_url, callback=self.biopage)
def biopage(self, response):
# This method is called on by the 'parse' method above. It scrapes the URLs
# that have been extracted in the previous step.
name = response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract_first().strip()
email = response.xpath('//*[@id="block-psych-content"]/div/section[2]/div[1]/i/a/text()').extract_first().strip()
print("Found details: " + name + ', ' + email)
Running our scraper again
scrapy crawl psychfaculty
produces something like
2020-06-17 13:46:59 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: carpwebscraping)
(...)
Found details: Nicole Alea Albada, nicole.albada@psych.ucsb.edu
2020-06-17 13:55:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/nancy-collins> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-17 13:55:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/jim-blascovich> (referer: https://www.psych.ucsb.edu/people?people_type=6)
Found details: Greg Ashby, greg.ashby@psych.ucsb.edu
Found details: Michael Beyeler, michael.beyeler@psych.ucsb.edu
Found details: Nancy Collins, nancy.collins@psych.ucsb.edu
Found details: Jim Blascovich, jim.blascovich@psych.ucsb.edu
(...)
2020-06-17 13:55:11 [scrapy.core.engine] INFO: Spider closed (finished)
We appear to be getting somewhere! The last step is doing something useful with the scraped data instead of printing it out on the terminal. Enter the Scrapy Items.
Using Items to store scraped data
Scrapy conveniently includes a mechanism to collect scraped data and output it
in several different useful ways. It uses objects called Items
. Those are akin
to Python dictionaries in that each Item can contain one or more fields to
store individual data elements. Another way to put it: if you visualize the
data as a spreadsheet, each Item represents a row of data, and the fields within
each item are columns.
Before we can begin using Items, we need to define their structure. Using our editor,
let’s navigate to and edit the following file that Scrapy created for us when we
first created our project: carpwebscraping/carpwebscraping/items.py
Scrapy has pre-populated this file with an empty “CarpwebscrapingItem” class:
(editing carpwebscraping/carpwebscraping/items.py
)
import scrapy
class CarpwebscrapingItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
Let’s add a few fields to store the data we aim to extract from the detail pages for each faculty member:
import scrapy
class CarpwebscrapingItem(scrapy.Item):
# define the fields for your item here like:
name = scrapy.Field()
email = scrapy.Field()
Then save this file. We can then edit our spider one more time:
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
from carpwebscraping.items import CarpwebscrapingItem # We need this so that Python knows about the item object
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty' # The name of this spider
# The allowed domain and the URLs where the spider should start crawling:
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
# The main method of the spider. It scrapes the URL(s) specified in the
# 'start_url' argument above. The content of the scraped URL is passed on
# as the 'response' object.
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract()[:5]:
# This loops through all the URLs found inside an element of class 'rev--people--row'
# Constructs an absolute URL by combining the response’s URL with a possible relative URL:
full_url = response.urljoin(url)
print("Found URL: "+full_url)
# The following tells Scrapy to scrape the URL in the 'full_url' variable
# and calls the 'biopage()' method below with the content of this
# URL:
yield scrapy.Request(full_url, callback=self.biopage)
def biopage(self, response):
# This method is called on by the 'parse' method above. It scrapes the URLs
# that have been extracted in the previous step.
# Store desired scraped data into variables:
name = response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract_first().strip()
email = response.xpath('//*[@id="block-psych-content"]/div/section[2]/div[1]/i/a/text()').extract_first().strip()
print("Found details: " + name + ", " + title + ", " + email) #print for debugging
item = CarpwebscrapingItem() # Creating a new Item object
# Store scraped data into that item:
item['name'] = name
item['email'] = email
# Return that item to the main spider method:
yield item
We made two significant changes to the file above:
- We’ve included the line from carpwebscraping.items import CarpwebscrapingItem at the top. This is required so that our spider knows about the CarpwebscrapingItem object we’ve just defined.
- We’ve also extended biopage() to create a CarpwebscrapingItem object, in whose fields we now store the scraped data. The item is then passed back to the main spider method using the yield statement.
If we now run our spider again:
scrapy crawl psychfaculty
we see something like
2020-06-17 14:16:18 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: carpwebscraping)
(...)
2020-06-17 14:16:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-17 14:16:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.psych.ucsb.edu/people/faculty/heejung-kim> (referer: https://www.psych.ucsb.edu/people?people_type=6)
2020-06-17 14:16:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.psych.ucsb.edu/people/faculty/nicole-alea-albada>
{'email': 'nicole.albada@psych.ucsb.edu',
'name': 'Nicole Alea Albada'}
(...)
2020-06-17 14:16:19 [scrapy.core.engine] INFO: Spider closed (finished)
We see that Scrapy is dumping the contents of the items within the debugging output using a syntax that looks a lot like JSON.
But let’s now try running the spider with an extra -o
(‘o’ for ‘output’) argument that
specifies the name of an output file with a .csv
file extension:
scrapy crawl psychfaculty -o output.csv
This produces similar debugging output to the previous run, but now let’s look inside the directory in which we just ran Scrapy: we’ll see that it has created a file called output.csv, and when we look inside that file, we see that it contains the scraped data, conveniently arranged using the Comma-Separated Values (CSV) format, ready to be imported into our favourite spreadsheet!
cat output.csv
Returns
email,name
nicole.albada@psych.ucsb.edu,Nicole Alea Albada
michael.goard@psych.ucsb.edu,Michael Goard
emily.jacobs@psych.ucsb.edu,Emily Jacobs
janusonis@ucsb.edu,Skirmantas Janusonis
jerry.jacobs@psych.ucsb.edu,Jerry Jacobs
scott.grafton@psych.ucsb.edu,Scott Grafton
david.hamilton@psych.ucsb.edu,David Hamilton
By changing the file extension to .json
or .xml
we can output the same data
in JSON or XML format.
Refer to the Scrapy documentation
for a full list of supported formats.
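For example, the following command (the file name is just an illustration) would write the same items to a JSON file instead:
scrapy crawl psychfaculty -o output.json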
Now that everything looks to be in place, we can finally remove our limit on the number of scraped elements…
(editing carpwebscraping/carpwebscraping/spiders/psychfaculty.py
)
import scrapy
from carpwebscraping.items import CarpwebscrapingItem # We need this so that Python knows about the item object
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty' # The name of this spider
# The allowed domain and the URLs where the spider should start crawling:
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
# The main method of the spider. It scrapes the URL(s) specified in the
# 'start_url' argument above. The content of the scraped URL is passed on
# as the 'response' object.
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract():
# This loops through all the URLs found inside an element of class 'rev--people--row'
# Constructs an absolute URL by combining the response’s URL with a possible relative URL:
full_url = response.urljoin(url)
print("Found URL: "+full_url)
# The following tells Scrapy to scrape the URL in the 'full_url' variable
# and calls the 'biopage()' method below with the content of this
# URL:
yield scrapy.Request(full_url, callback=self.biopage)
def biopage(self, response):
# This method is called on by the 'parse' method above. It scrapes the URLs
# that have been extracted in the previous step.
# Store desired scraped data into variables:
name = response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract_first().strip()
email = response.xpath('//*[@id="block-psych-content"]/div/section[2]/div[1]/i/a/text()').extract_first().strip()
print("Found details: " + name + ", " + title + ", " + email) #print for debugging
item = CarpwebscrapingItem() # Creating a new Item object
# Store scraped data into that item:
item['name'] = name
item['email'] = email
# Return that item to the main spider method:
yield item
(we’ve removed the [:5]
at the end of the for statement in the parse() method of the above code)
and below is the code without comments, to copy and paste into your own project:
psychfaculty.py:
import scrapy
from carpwebscraping.items import CarpwebscrapingItem
class PsychfacultySpider(scrapy.Spider):
name = 'psychfaculty'
allowed_domains = ['www.psych.ucsb.edu']
start_urls = ['https://www.psych.ucsb.edu/people?people_type=6']
def parse(self, response):
for url in response.xpath("//tr[@class='rev--people--row']/td/h2/a/@href").extract():
yield scrapy.Request(response.urljoin(url), callback=self.biopage)
def biopage(self, response):
name = response.xpath('//*[@id="block-psych-content"]/div/header/div/h1/text()').extract_first().strip()
email = response.xpath('//*[@id="block-psych-content"]/div/section[2]/div[1]/i/a/text()').extract_first().strip()
item = CarpwebscrapingItem()
item['name'] = name
item['email'] = email
yield item
and items.py:
import scrapy
class CarpwebscrapingItem(scrapy.Item):
name = scrapy.Field()
email = scrapy.Field()
… and we run our spider one last time:
scrapy crawl psychfaculty -o psychfaculty.csv
Add other data elements to the spider
Try modifying the spider code to extract more data from the faculty bio pages. Remember to edit the Item definition so that all extracted fields are accounted for.
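As a sketch of one possible extension, we could reuse the Room regular expression from the earlier challenge to add an office number to each item. The office field name is just an illustration, and the query may need adapting to the pages you target:
# in carpwebscraping/carpwebscraping/items.py
class CarpwebscrapingItem(scrapy.Item):
    name = scrapy.Field()
    email = scrapy.Field()
    office = scrapy.Field()  # hypothetical extra field for the room number

# in the biopage() method of psychfaculty.py
item['office'] = response.xpath('//body').re_first(r'Room (\d+)')  # may be None if no room is listed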
You are now ready to write your own spiders!
Key Points
Scrapy is a Python framework that can be used to scrape content from the web.
A Scrapy project is a set of configuration files and pieces of code that tell Scrapy what to do.
In Scrapy, a “Spider” is the code that tells it what to do on a specific website.
A Scrapy project can have more than one spider but needs at least one.
With Scrapy, we can use XPath, CSS selectors and Regular Expressions to define what elements to scrape from a page.
Extracted data can be stored in “Item” objects. Such objects must be defined before they can be used.
Scrapy will automatically store extracted data in CSV, JSON or XML format based on the file extension given with the -o option.