Introduction to Webscraping

UC Santa Barbara, Library

May 12 and 13, 2022

4:00 pm - 7:00 pm

Instructors: Renata Curty, Ryan Horne, Jon Jablonski, Greg Janée

Helpers: Amanda Ho, Kristi Liu

Some adblockers block the registration window. If you do not see the registration box below, please check your adblocker settings.

General Information

Library Carpentry is made by people working in library- and information-related roles to help you:

Library Carpentry introduces you to the fundamentals of computing and provides you with a platform for further self-directed learning. For more information on what we teach and why, please see our paper "Library Carpentry: software skills training for library professionals".

Web scraping is the process of extracting data from websites. Some data that is available on the web is presented in a format that makes it easier to collect and use it, for example in the form of downloadable comma-separated values (CSV) datasets that can then be imported in a spreadsheet or loaded into a data analysis script. Often however, even though it is publicly available, data is not readily available for reuse. For example it can be contained in a PDF, or a table on a website, or spread across multiple web pages.
There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, b ut this can be unpractical if there is a large amount of data to be extracted, or if it spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process, by defining what sites to visit, what information to look for, and whether data extraction should stop once the end of a page has been reached, or whether to follow hyperlinks and repeat the process recursively. Automating web scraping also allows to define whether the process should be run at regular intervals and capture changes in the data.

Who: The course is for people working in library- and information-related roles.

Where: This workshop will support in-person and remote, online attendace. If you register as an in-person attendeee, the workshop will take place at Davidson Library, UCEN Rd, Santa Barbara, CA. If you register as a remote attendeee, the instructors will provide you with the information you will need to connect to this meeting.

When: May 12 and 13, 2022. Add to your Google Calendar.

Requirements: Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) that they have administrative privileges on. They should have a few specific software packages installed (listed below).

Accessibility: We are committed to making this workshop accessible to everybody. For workshops at a physical location, the workshop organizers have checked that:

Materials will be provided in advance of the workshop and large-print handouts are available if needed by notifying the organizers in advance. If we can help making learning easier for you (e.g. sign-language interpreters, lactation facilities) please get in touch (using contact details below) and we will attempt to provide them.

Contact: Please email library-collaboratory@ucsb.edu for more information.

Roles: To learn more about the roles at the workshop (who will be doing what), refer to our Workshop FAQ.


Code of Conduct

Everyone who participates in Carpentries activities is required to conform to the Code of Conduct. This document also outlines how to report an incident if needed.


Collaborative Notes

We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code.


Surveys

Please be sure to complete these surveys before and after the workshop.

Pre-workshop Survey

Post-workshop Survey


Schedule

Day 1

Before Starting Pre-workshop survey
4:00 pm Workshop Introduction
04:15 pm What is web scraping?
04:25 pm Selecting content on a web page with XPath
05:10 pm Break
05:25 pm Manually scrape data using browser extensions
06:30 pm Introduction to JupyterLab
07:00 pm End Day 1

Day 2

04:00 pm Review
04:15 pm Web scraping using Python and Scrapy
05:15 pm Break
05:30 pm More Scraping with Scrapy
06:30 pm Ethics & Legality of Webscraping
07:00 pm End Workshop
After Post-workshop survey

Setup

To participate in a Library Carpentry workshop, you will need access to software as described on the Setup Page. In addition, you will need an up-to-date web browser.

We maintain a list of common issues that occur during installation as a reference for instructors that may be useful on the Configuration Problems and Solutions wiki page.

Install the videoconferencing client

If you haven't used Zoom before, go to the official website to download and install the Zoom client for your computer.

Set up your workspace

Like other Carpentries workshops, you will be learning by "coding along" with the Instructors. To do this, you will need to have both the window for the tool you will be learning about (a terminal, RStudio, your web browser, etc..) and the window for the Zoom video conference client open. In order to see both at once, we recommend using one of the following set up options:

This blog post includes detailed information on how to set up your screen to follow along during the workshop.