Summary and Schedule
This is a new lesson built with The Carpentries Workbench.
| | Setup Instructions | Download files required for the lesson |
| Duration: 00h 00m | 1. Hello-Scraping | What is behind a website and how can I extract its information? What is there to consider before I do web scraping? |
| Duration: 00h 50m | 2. Scraping a real website | How can I get the data and information from a real website? How can I start automating my web scraping tasks? |
| Duration: 01h 45m | 3. Dynamic websites | What are the differences between static and dynamic websites? Why is it important to understand these differences when doing web scraping? How can I start my own web scraping project? |
| Duration: 02h 20m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
In this workshop you will learn how to extract data from websites, a practice known as web scraping, using Python. In Episode 1 we begin by reviewing how websites are structured in HTML and how to retrieve information from them using your browser and the BeautifulSoup package. In Episode 2 we take a closer look at how to get the HTML behind a website using the requests package and how to parse it and find information with BeautifulSoup. Finally, in Episode 3 you’ll learn about the differences between static and dynamic webpages, and how to scrape the latter with the Selenium package.
This workshop is designed for participants who already have a basic understanding of Python programming. In particular, it’s best to know how to:
- Install and import packages and modules
- Use lists and dictionaries
- Use conditional statements (if, else, elif)
- Use for loops
- Call functions, and understand parameters/arguments and return values
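As a quick self-check of the prerequisites above, the short sketch below combines an import, a list of dictionaries, conditional statements, a for loop, and a function with a default parameter and a return value. The sample data and the names `pages`, `describe`, and `busy_threshold` are invented for this illustration and are not part of the lesson material.

```python
import json  # importing a module from the standard library

# Hypothetical sample data: a list of dictionaries.
pages = [
    {"title": "Home", "links": 12},
    {"title": "About", "links": 3},
    {"title": "Contact", "links": 0},
]


def describe(page, busy_threshold=10):
    """Return a short label for a page based on its number of links."""
    if page["links"] == 0:
        return "no links"
    elif page["links"] >= busy_threshold:
        return "many links"
    else:
        return "few links"


# Loop over the list and collect one label per page in a dictionary.
labels = {}
for page in pages:
    labels[page["title"]] = describe(page)

print(json.dumps(labels, indent=2))
```

If you can follow what each line does and predict the label assigned to each page, you should be comfortable with the Python used in the episodes.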
Software Setup
Steps:
- If you already have Anaconda, JupyterLab, or Jupyter Notebook installed on your computer, skip to step 2. Otherwise, follow Miniforge’s download and installation instructions for your operating system. If you are using a Windows machine, make sure you select the option “Add Miniforge3 to my PATH environment variable”.
- If you are using Mac or Linux, open the ‘Terminal’. If you are using Windows, open the ‘Command Prompt’ or ‘Miniforge Prompt’.
- Activate the base conda environment by running:
conda activate
- Install the necessary packages by running:
pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab
- Start Jupyter Lab by running:
jupyter lab
- In a new Jupyter Notebook run the following code in a cell to check the necessary libraries can be loaded:
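One way to run this check is sketched below. It assumes the import names that correspond to the packages installed in the earlier step, which differ from the pip names in two cases: `beautifulsoup4` is imported as `bs4` and `webdriver-manager` as `webdriver_manager`. The helper `check_imports` and the list `REQUIRED` are hypothetical names introduced for this illustration.

```python
import importlib

# Import names (not pip names) for the packages installed with pip above.
REQUIRED = ["requests", "bs4", "selenium", "webdriver_manager", "pandas", "tqdm"]


def check_imports(names):
    """Try to import each module and map its name to True (loaded) or False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # Print one status line per package.
    for name, ok in check_imports(REQUIRED).items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If any package is reported as missing, rerun the `pip install` command in the same environment from which you started Jupyter Lab, then restart the notebook kernel.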
Additional resources
- Mitchell, R. (2024). Web Scraping with Python: Data Extraction from the Modern Web (3rd ed.). O’Reilly Media.
- Chapagain, A. (2023). Hands-On Web Scraping with Python: Extract Quality Data from the Web Using Effective Python Techniques (2nd ed.). Packt Publishing.