May 12 and 13, 2022
4:00 pm - 7:00 pm
Instructors: Renata Curty, Ryan Horne, Jon Jablonski, Greg Janée
Helpers: Amanda Ho, Kristi Liu
Library Carpentry is made by people working in library- and information-related roles to help you:
Library Carpentry introduces you to the fundamentals of computing and provides you with a platform for further self-directed learning. For more information on what we teach and why, please see our paper "Library Carpentry: software skills training for library professionals".
Web scraping is the process of extracting data from websites.
Some data available on the web is presented in a format that makes it easy to collect and use,
for example as downloadable comma-separated values (CSV) datasets that can be imported into a spreadsheet or loaded into a data analysis script.
Often, however, publicly available data is not readily available for reuse. For example, it can be contained in a PDF or in a table on a website,
or spread across multiple web pages.
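As a minimal sketch of the "table on a website" case, the Python snippet below turns an HTML table into CSV text using only the standard library. The sample HTML, the `TableExtractor` class, and the `table_to_csv` function are hypothetical names made up for this illustration; they are not part of the workshop materials, which use XPath, browser extensions, and Scrapy.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up sample page fragment, standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Moby-Dick</td><td>1851</td></tr>
  <tr><td>Middlemarch</td><td>1871</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML), end="")
```

This sketch assumes a single, simple table with no nested tables or merged cells; real pages usually need more careful selection of the content to extract.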
There are a variety of ways to scrape a website to extract information for reuse.
In its simplest form, this can be achieved by copying and pasting snippets from a web page,
but this can be impractical if there is a large amount of data to be extracted, or if it is spread over a large number of pages.
Instead, specialized tools and techniques can be used to automate this process, by defining what sites to visit, what information to look for,
and whether data extraction should stop once the end of a page has been reached, or whether to follow hyperlinks and repeat the process recursively.
Automating web scraping also makes it possible to run the process at regular intervals and capture changes in the data.
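The choices described above (where to start, what to look for, and whether to stop at the end of a page or follow hyperlinks) can be sketched in a few lines of Python. This is an illustration only, not the Scrapy approach taught in the workshop: the `FAKE_SITE` dictionary is an invented in-memory stand-in for a website, and `fetch`, `crawl`, and `LinkAndTitleParser` are hypothetical names. A real scraper would replace `fetch` with an HTTP request and should respect the target site's terms of use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Record the page title (what we look for) and all link targets."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


# Invented pages keyed by URL, so the example runs without network access.
FAKE_SITE = {
    "https://example.org/": '<title>Home</title><a href="/a.html">A</a><a href="b.html">B</a>',
    "https://example.org/a.html": '<title>Page A</title><a href="/">home</a><a href="/c.html">C</a>',
    "https://example.org/b.html": "<title>Page B</title>",
    "https://example.org/c.html": '<title>Page C</title><a href="/a.html">back</a>',
}


def fetch(url):
    """Stand-in for an HTTP request; returns None for unknown pages."""
    return FAKE_SITE.get(url)


def crawl(start_url, max_pages=10, follow_links=True):
    """Visit pages breadth-first and return a {url: title} mapping."""
    seen = set()
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue  # never visit the same page twice
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        titles[url] = "".join(parser.title_parts).strip()
        if follow_links:
            # Resolve relative links against the current page's URL.
            queue.extend(urljoin(url, href) for href in parser.links)
    return titles


if __name__ == "__main__":
    for url, title in crawl("https://example.org/").items():
        print(url, title)
```

Calling `crawl(start_url, follow_links=False)` stops after the first page, while the default follows links until no unseen pages remain or `max_pages` is reached. The `seen` set is what keeps the recursion from looping when pages link back to each other.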
Who: The course is for people working in library- and information-related roles.
Where: This workshop will support both in-person and remote (online) attendance. If you register as an in-person attendee, the workshop will take place at Davidson Library, UCEN Rd, Santa Barbara, CA. If you register as a remote attendee, the instructors will provide you with the information you will need to connect to this meeting.
When: May 12 and 13, 2022. Add to your Google Calendar.
Requirements: Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) that they have administrative privileges on. They should have a few specific software packages installed (listed below).
Accessibility: We are committed to making this workshop accessible to everybody. For workshops at a physical location, the workshop organizers have checked that:
Materials will be provided in advance of the workshop, and large-print handouts are available if you notify the organizers in advance. If we can help make learning easier for you (e.g. sign-language interpreters, lactation facilities), please get in touch (using the contact details below) and we will attempt to provide them.
Contact: Please email library-collaboratory@ucsb.edu for more information.
Roles: To learn more about the roles at the workshop (who will be doing what), refer to our Workshop FAQ.
Everyone who participates in Carpentries activities is required to conform to the Code of Conduct. This document also outlines how to report an incident if needed.
We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code.
Please be sure to complete these surveys before and after the workshop.
Before Starting | Pre-workshop survey |
04:00 pm | Workshop Introduction |
04:15 pm | What is web scraping? |
04:25 pm | Selecting content on a web page with XPath |
05:10 pm | Break |
05:25 pm | Manually scrape data using browser extensions |
06:30 pm | Introduction to JupyterLab |
07:00 pm | End Day 1 |
04:00 pm | Review |
04:15 pm | Web scraping using Python and Scrapy |
05:15 pm | Break |
05:30 pm | More Scraping with Scrapy |
06:30 pm | Ethics & Legality of Web Scraping |
07:00 pm | End Workshop |
After | Post-workshop survey |
To participate in a Library Carpentry workshop, you will need access to software as described on the Setup Page. In addition, you will need an up-to-date web browser.
We maintain a list of common issues that occur during installation on the Configuration Problems and Solutions wiki page. It is written as a reference for instructors, but it may be useful to you as well.
If you haven't used Zoom before, go to the official website to download and install the Zoom client for your computer.
As in other Carpentries workshops, you will be learning by "coding along" with the Instructors. To do this, you will need to have both the window for the tool you will be learning about (a terminal, RStudio, your web browser, etc.) and the window for the Zoom video conference client open. In order to see both at once, we recommend using one of the following setup options: