Content from Hello-Scraping
Last updated on 2024-11-04
Overview
Questions
- What is behind a website and how can I extract its information?
- What is there to consider before I do web scraping?
Objectives
- Identify the structure and basic components of an HTML document.
- Use BeautifulSoup to locate elements, tags, attributes and text in an HTML document.
- Understand the situations in which web scraping is not suitable for obtaining the desired data.
Introduction
This is part two of an Introduction to Web Scraping workshop we offered in February 2024. You can refer to those workshop materials for a gentler introduction to scraping using XPath and the Scraper Chrome extension.
We’ll refresh some of the concepts covered there to get a practical understanding of how content and data are structured in a website. For that purpose, we’ll see what Hypertext Markup Language (HTML) is and how it structures and formats content using tags. From there, we’ll use the BeautifulSoup library to parse the HTML content so we can easily search and access the elements of the website we are interested in. Starting from basic examples, we’ll move on to scraping more complex, real-life websites.
HTML quick overview
All websites have a Hypertext Markup Language (HTML) document behind them. The following text is HTML for a very simple website, with only three sentences. If you look at it, can you imagine how that website looks?
HTML
<!DOCTYPE html>
<html>
<head>
<title>Sample web page</title>
</head>
<body>
<h1>h1 Header #1</h1>
<p>This is a paragraph tag</p>
<h2>h2 Sub-header</h2>
<p>A new paragraph, now in the <b>sub-header</b></p>
<h1>h1 Header #2</h1>
<p>
This other paragraph has two hyperlinks,
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
and another to the
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
</p>
</body>
</html>
Well, if you put that text in a file with a .html extension, the job of your web browser when opening the file will be to interpret that (markup) language and display a nicely formatted website.
An HTML document is composed of elements, which can be identified by tags written inside angle brackets (< and >). For example, the HTML root element, which delimits the beginning and end of an HTML document, is identified by the <html> tag.
Most elements have both an opening and a closing tag, determining the span of the element. In the previous simple website, we see a head element that goes from the opening tag <head> up to the closing tag </head>. Given that an element can be inside another element, an HTML document has a tree structure, where every element is a node that can contain child nodes, as the following image shows.
Finally, we can define or modify the behavior, appearance, or functionality of an element by using attributes. Attributes are placed inside the opening tag and consist of a name and a value, formatted as name="value". For example, in the previous simple website, we added a hyperlink with the <a>...</a> tags, but to set the destination URL we used the href attribute, writing <a href="https://carpentries.org/past_workshops/"> in the opening tag.
Here is a non-exhaustive list of elements you’ll find in HTML and their purpose:
- <html>...</html> The root element, which contains the entirety of the document.
- <head>...</head> Contains metadata, for example, the title that the web browser displays.
- <body>...</body> The content that is going to be displayed.
- <h1>...</h1>, <h2>...</h2>, <h3>...</h3> Define headers of level 1, 2, 3, etc.
- <p>...</p> A paragraph.
- <a href="">...</a> Creates a hyperlink, and we provide the destination URL with the href attribute.
- <img src="" alt=""> Embeds an image, giving a source for the image with the src attribute and specifying alternate text with alt.
- <table>...</table>, <th>...</th>, <tr>...</tr>, <td>...</td> Define a table, which will have as children a header (defined inside th), rows (defined inside tr), and cells inside each row (defined as td).
- <div>...</div> Is used to group sections of HTML content.
- <script>...</script> Embeds or references JavaScript code.
In the previous list we’ve described some attributes specific to the hyperlink (<a>) and image (<img>) elements, but there are a few other global attributes that most HTML elements can have and that are useful for identifying specific elements when doing web scraping:
- id="" Assigns a unique identifier to an element, which cannot be repeated in the entire HTML document.
- title="" Provides extra information, displayed as a tooltip when the user hovers over the element.
- class="" Is used to apply similar styling to multiple elements at once.
To summarize, an element is identified by tags, and we can assign properties to an element by using attributes. Knowing this about HTML will make our lives easier when trying to get specific data from a website.
Parsing HTML with BeautifulSoup
Now that we know how a website is structured, we can start extracting information from it. The BeautifulSoup package is our main tool for that task, as it will parse the HTML so we can search and access the elements of interest in a programmatic way.
To see how this package works, we’ll use the simple website example we showed before. As our first step, we will load the BeautifulSoup package, along with Pandas.
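A minimal way to do that (assuming both packages are already installed) is:
PYTHON
from bs4 import BeautifulSoup
import pandas as pd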
Let’s get the HTML content inside a string variable called example_html:
PYTHON
example_html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample web page</title>
</head>
<body>
<h1>h1 Header #1</h1>
<p>This is a paragraph tag</p>
<h2>h2 Sub-header</h2>
<p>A new paragraph, now in the <b>sub-header</b></p>
<h1>h1 Header #2</h1>
<p>
This other paragraph has two hyperlinks,
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
and another to the
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
</p>
</body>
</html>
"""
We parse this HTML using the BeautifulSoup()
function we
imported, specifying that we want to use the html.parser
.
This object will represent the document as a nested data structure,
similar to a tree as we mentioned before. If we use the
.prettify()
method on this object, we can see the nested
structure, as inner elements will be indented to the right.
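A short sketch of that step, reusing the example_html string from above, could be:
PYTHON
soup = BeautifulSoup(example_html, 'html.parser')
print(soup.prettify())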
OUTPUT
<!DOCTYPE html>
<html>
<head>
<title>
Sample web page
</title>
</head>
<body>
<h1>
h1 Header #1
</h1>
<p>
This is a paragraph tag
</p>
<h2>
h2 Sub-header
</h2>
<p>
A new paragraph, now in the
<b>
sub-header
</b>
</p>
<h1>
h1 Header #2
</h1>
<p>
This other paragraph has two hyperlinks, one to
<a href="https://carpentries.org/">
The Carpentries homepage
</a>
, and another to the
<a href="https://carpentries.org/past_workshops/">
past workshops
</a>
.
</p>
</body>
</html>
Now that our soup
variable holds the parsed document, we
can use the .find()
and .find_all()
methods.
.find() will search for the tag we specify and return the entire element, including the opening and closing tags. If there are
multiple elements with the same tag, .find()
will only
return the first one. If you want to return all the elements that match
your search, you’d want to use .find_all()
instead, which
will return them in a list. Additionally, to return the text contained
in a given element and all its children, you’d use
.get_text()
. Below you can see how all these commands play
out in our simple website example.
PYTHON
print("1.", soup.find('title'))
print("2.", soup.find('title').get_text())
print("3.", soup.find('h1').get_text())
print("4.", soup.find_all('h1'))
print("5.", soup.find_all('a'))
print("6.", soup.get_text())
OUTPUT
1. <title>Sample web page</title>
2. Sample web page
3. h1 Header #1
4. [<h1>h1 Header #1</h1>, <h1>h1 Header #2</h1>]
5. [<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>]
6.
Sample web page
h1 Header #1
This is a paragraph tag
h2 Sub-header
A new paragraph, now in the sub-header
h1 Header #2
This other paragraph has two hyperlinks, one to The Carpentries homepage, and another to the past workshops.
How would you extract all hyperlinks identified with
<a>
tags? In our example, we see that there are only
two hyperlinks, and we could extract them in a list using the
.find_all('a')
method.
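One possible way to do that, reusing the parsed soup object, is:
PYTHON
links = soup.find_all('a')
print("Number of hyperlinks found:", len(links))
print(links)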
OUTPUT
Number of hyperlinks found: 2
[<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>]
To access the value of a given attribute in an element, for example the value of the href attribute in <a href="">, we use square brackets and the name of the attribute (['href']), just as we would access a value in a Python dictionary using its key. Let’s make a loop that prints only the URL for each hyperlink we have in our example.
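A short loop that does this, reusing the links list from above, could be:
PYTHON
for item in links:
    print(item['href'])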
OUTPUT
https://carpentries.org/
https://carpentries.org/past_workshops/
Challenge
Create a Python dictionary with the following three items, containing information about the first hyperlink in the HTML of our example: 'element' (the full element as a string), 'url' (its destination URL), and 'text' (the text displayed for the link).
One way of completing the exercise is as follows.
PYTHON
first_link = {
'element': str(soup.find('a')),
'url': soup.find('a')['href'],
'text': soup.find('a').get_text()
}
An alternative but similar way is to store the tag we found in a variable, so we don’t call soup.find('a') multiple times, and to first create an empty dictionary and then add to it the keys and values we want. This will be useful when we do the same thing multiple times in a for loop.
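A sketch of that alternative approach could look like this:
PYTHON
first_a = soup.find('a')
first_link = {}
first_link['element'] = str(first_a)
first_link['url'] = first_a['href']
first_link['text'] = first_a.get_text()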
To finish this introduction to HTML and BeautifulSoup, let’s write code to extract, in a structured way, all the hyperlink elements, their destination URLs, and the text displayed for each link. For that, let’s use the links variable that we created before with links = soup.find_all('a'). We’ll loop over each hyperlink element found, storing for each one the three pieces of information we want in a dictionary, and finally appending that dictionary to a list called list_of_dicts. At the end we will have a list with two elements, which we can transform into a Pandas dataframe.
PYTHON
links = soup.find_all('a')
list_of_dicts = []
for item in links:
    dict_a = {}
    dict_a['element'] = str(item)
    dict_a['url'] = item['href']
    dict_a['text'] = item.get_text()
    list_of_dicts.append(dict_a)

links_df = pd.DataFrame(list_of_dicts)
print(links_df)
OUTPUT
element \
0 <a href="https://carpentries.org/">The Carpent...
1 <a href="https://carpentries.org/past_workshop...
url text
0 https://carpentries.org/ The Carpentries homepage
1 https://carpentries.org/past_workshops/ past workshops
You’ll find more useful information about the BeautifulSoup package and how to use all its methods in the Beautiful Soup Documentation website.
The rights, wrongs, and legal barriers to scraping
The internet is no longer what it used to be. Once an open and boundless source of information, the web has become an invaluable pool of data, widely used by companies to train machine learning and generative AI models. Now, social media platforms and other website owners have either recognized the potential for profit and licensed their data or have become overwhelmed by bots crawling their sites and straining their server resources.
For this reason, it’s now more common to see that a website’s Terms of Service (TOS) explicitly prohibit web scraping. If we want to avoid getting into trouble, we need to carefully check the TOS of the website we are interested in, as well as its robots.txt file. You should be able to find both using your preferred search engine, but you can go directly to the latter by appending '/robots.txt' to the root URL of the website (e.g. for Facebook you’ll find it at 'https://facebook.com/robots.txt', not at any other URL like 'https://facebook.com/user/robots.txt').
Challenge
Visit Facebook’s Terms of Service and its robots.txt file. What do they say about web scraping or collecting data using automated means? Compare it to Reddit’s TOS and Reddit’s robots.txt.
Besides checking the website’s policies, you should also be aware of the legislation applicable to your location regarding copyright and data privacy laws. If you plan to start harvesting a large amount of data for research or commercial purposes, you should probably seek legal advice first. If you work in a university, chances are it has a copyright office that will help you sort out the legal aspects of your project. The university library is often the best place to start looking for help on copyright.
To conclude, here is a brief code of conduct you should consider when doing web scraping:
- Ask nicely if you can access the data in another way. If your project requires data from a particular organisation, you can try asking them directly if they could provide you what you are looking for, or check if they have an API to access the data. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
- Don’t download copies of documents that are clearly not public. For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
- Check your local legislation. For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly available websites, can be illegal (e.g. in Australia).
- Don’t share downloaded content illegally. Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
- Share what you can. If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
- Publish your own data in a reusable way. Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
- Don’t break the Internet. Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. More on this topic in the next episode.
Key Points
- Every website has an HTML document behind it that gives structure to its content.
- An HTML document is composed of elements, which usually have an opening <tag> and a closing </tag>.
- Elements can have different properties, assigned by attributes in the form <tag attribute_name="value">.
- We can parse any HTML document with BeautifulSoup() and find elements using the .find() and .find_all() methods.
- We can access the text of an element using the .get_text() method, and attribute values as we do with Python dictionaries (element["attribute_name"]).
- We must be careful not to violate the Terms of Service (TOS) of the website we are scraping.
Content from Scraping a real website
Last updated on 2024-11-04
Overview
Questions
- How can I get the data and information from a real website?
- How can I start automating my web scraping tasks?
Objectives
- Use the requests package to get the HTML document behind a website.
- Navigate the tree structure behind an HTML document to extract the information we need.
- Know how to avoid being blocked by sending too many requests to a website.
“Requests” the website HTML
In the previous episode we used a simple HTML document, not an actual website. Now that we move to a more realistic, complex scenario, we need to add another package to our toolbox: the requests package. For the purposes of this web scraping lesson, we will only use requests to get the HTML behind a website. However, there’s a lot of extra functionality that we are not covering but that you can find in the Requests package documentation.
We’ll be scraping The Carpentries website, https://carpentries.org/, and specifically, the list of upcoming and past workshops you can find at the bottom. For that, we’ll first load the requests package and then use requests.get(url).text to store the HTML document of the website. Furthermore, to simplify our navigation through the HTML document, we will use the regular expressions module re to remove all newline characters ("\n") and their surrounding whitespace. You can think of removing newlines as a preprocessing or cleaning step; in this lesson we won’t be explaining the intricacies of regular expressions, but you can refer to this introductory explanation from Library Carpentry.
PYTHON
import requests
import re
url = 'https://carpentries.org/'
req = requests.get(url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
print(cleaned_req[0:1000])
OUTPUT
<!doctype html><html class="no-js" lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>The Carpentries</title><link rel="stylesheet" type="text/css" href="https://carpentries.org/assets/css/styles_feeling_responsive.css"><script src="https://carpentries.org/assets/js/modernizr.min.js"></script><!-- matomo --><script src="https://carpentries.org/assets/js/matomo-analytics.js"></script><link href="https://fonts.googleapis.com/css?family=Lato:400,400i,700,700i|Roboto:400,400i,700,700i&display=swap" rel="stylesheet"><!-- Search Engine Optimization --><meta name="description" content="The Carpentries is a fiscally sponsored project of Community Initiatives, a registered 501(c)3 non-profit organisation based in California, USA. We are a global community teaching foundational computational and data science skills to researchers in academia, industry and government."><link rel="canonical" href="https://carpentries.org/index.html"><
We truncated the output to the first 1000 characters of the document, as it is too long, but we can see it is HTML and has some elements we didn’t see in the example from the previous episode, like those identified with the <meta>, <link>, and <script> tags.
There’s another way to see the HTML document behind a website, directly from your web browser. Using Google Chrome, you can right-click in any part of the website (on a Mac, press and hold the Control key in your keyboard while you click), and from the pop-up menu, click ‘View page source’, as the next image shows. If the ‘View page source’ option didn’t appear for you, try clicking in another part of the website. A new tab will open with the HTML document for the website you were in.
In the HTML page source in your browser, you can scroll down and look for the second-level header (<h2>) with the text “Upcoming Carpentries Workshops”. More easily, you can use the Find Bar (Ctrl + F on Windows or Command + F on Mac) to search for “Upcoming Carpentries Workshops”. Just below that header we have the table element we are interested in, which starts with the opening tag <table class="table table-striped" style="width: 100%;">. Inside that element we see nested the different elements familiar for a table, the rows (<tr>) and the cells for each row (<td>), and additionally image (<img>) and hyperlink (<a>) elements.
Finding the information we want
Now, going back to our coding, we left off at getting the HTML behind the website using requests and storing a cleaned version of it in the variable called cleaned_req. From here we can proceed with BeautifulSoup as we learned in the previous episode, using the BeautifulSoup() function to parse our HTML, as the following code block shows. With the parsed document, we can use the .find() or .find_all() methods to find the table element.
PYTHON
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(cleaned_req, 'html.parser')
tables_by_tag = soup.find_all('table')
print("Number of table elements found: ", len(tables_by_tag))
print("Printing only the first 1000 characters of the table element: \n", str(tables_by_tag[0])[0:1000])
OUTPUT
Number of table elements found: 1
Printing only the first 1000 characters of the table element:
<table class="table table-striped" style="width: 100%;"><tr><td><img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24"/></td><td><img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX"><img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online"><a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">Nodo Nacional de Bioinformática UNAM</a><br><b>Instructors:</b> César Aguilar, Diana Oaxaca, Nelly Selem-Mojica<br><b>Helpers:</b> Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto GarcÃa-Estrada</br></br></img></img></td><td>Oct 28 - Oct 31, 2024</td></tr><tr><td><img alt="dc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/dc.svg" title="dc workshop" width="24"/></td><td><img alt="de" class="flags" s
From our output we see that there was only one table element in the entire HTML, which corresponds to the table we are looking for. The output you see in the previous code block will be different from what you get on your computer, as the data in the upcoming workshops table is continuously updated.
Besides searching for elements using tags, sometimes it will be useful to search using attributes, like id or class. For example, we can see the table element has a class attribute with the two values “table table-striped”, which identifies all elements with similar styling. Therefore, we could get the same result as before using the class_ argument of the .find_all() method, as follows.
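A sketch of that equivalent search (using a new variable name, tables_by_class, just for illustration) could be:
PYTHON
tables_by_class = soup.find_all('table', class_="table table-striped")
print("Number of table elements found: ", len(tables_by_class))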
Now that we know there is only one table element, we can start working with it directly by storing the first and only item of the tables_by_tag result set in another variable, which we will call workshops_table. We can see that we moved from working with a “ResultSet” object to a “Tag” object, which we can now use to extract information from each row and cell.
PYTHON
print("Before: ", type(tables_by_tag))
workshops_table = tables_by_tag[0]
print("After:", type(workshops_table))
print("Element type:", workshops_table.name)
OUTPUT
Before: <class 'bs4.element.ResultSet'>
After: <class 'bs4.element.Tag'>
Element type: table
Navigating the tree
If we use the prettify()
method on the
workshops_table
variable, we see that this table element
has a nested tree structure. On the first level is the
<table>
tag. Inside that, we have rows
<tr>
, and inside rows we have table data cells
<td>
. We can start to identify certain information we
may be interested in, for example:
- What type of workshop it was (‘swc’ for Software Carpentry, ‘dc’ for Data Carpentry, ‘lc’ for Library Carpentry, and ‘cp’ for workshops based on The Carpentries curriculum). We find this in the first <td> tag, or put differently, in the first cell of the row.
- In what country the workshop was held. We can see the two-letter country code in the second cell of the row.
- The URL of the workshop website, which will contain additional information. It is also in the second cell of the row, as the href attribute of the <a> tag.
- The institution hosting the workshop. Also in the second <td>, as the text of the hyperlink <a> tag.
- The names of the instructors and helpers involved in the workshop.
- The dates of the workshop, in the third and final cell.
OUTPUT
<table class="table table-striped" style="width: 100%;">
<tr>
<td>
<img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24"/>
</td>
<td>
<img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX">
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
<a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">
Nodo Nacional de Bioinformática UNAM
</a>
<br>
<b>
Instructors:
</b>
César Aguilar, Diana Oaxaca, Nelly Selem-Mojica
<br>
<b>
Helpers:
</b>
Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto GarcÃa-Estrada
</br>
</br>
</img>
</img>
</td>
<td>
Oct 28 - Oct 31, 2024
</td>
</tr>
<tr>
<td>
...
</td>
</tr>
</table>
To navigate this HTML document tree we can use the following properties of the “bs4.element.Tag” object: .contents (to access the direct children of a node), .parent (to access the parent node), and .next_sibling and .previous_sibling (to access the siblings of a node). For example, if we want to access the second row of the table, which is the second child of the table element, we could use the following code.
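A minimal sketch of that, remembering that .contents is zero-indexed, could be:
PYTHON
second_row = workshops_table.contents[1]
print(second_row)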
If you go back to the ‘View page source’ of the website, you’ll
notice that the table element is nested inside a
<div class="medium-12 columns">
element, which means
this <div>
is the parent of our
<table>
. If we needed to, we could access this parent
by using workshops_table.parent
.
Now imagine we had selected the second data cell of our fifth row using workshops_table.contents[4].contents[1]. We could then access the third data cell with .next_sibling, or the first data cell with .previous_sibling.
PYTHON
# Access the fifth row, and from there, the second data cell
row5_cell2 = workshops_table.contents[4].contents[1]
# Access the third cell of the fifth row
row5_cell3 = row5_cell2.next_sibling
# Access the first cell of the fifth row
row5_cell1 = row5_cell2.previous_sibling
Why do we bother to learn all these methods? Depending on your web scraping use case, they can prove useful on complex websites. Let’s apply them to extract the information we want about the workshops, for example, to see how many upcoming workshops there are, which corresponds to the number of children the table element has.
PYTHON
num_workshops = len(workshops_table.contents)
print("Number of upcoming workshops listed: ", num_workshops)
Let’s work to extract data from only the first row, and later we can use a loop to iterate over all the rows of the table.
PYTHON
# Empty dictionary to hold the data
dict_w = {}
# First row of data
first_row = workshops_table.contents[0]
# To get to the first cell
first_cell = first_row.contents[0]
second_cell = first_cell.next_sibling
third_cell = second_cell.next_sibling
# From the first cell, find the <img> tag and get its 'title' attribute, which contains the type of workshop
dict_w['type'] = first_cell.find('img')['title']
# In the second cell, get the country from the 'title' attribute of the <img> tag
dict_w['country'] = second_cell.find('img')['title']
# The link to the workshop website is in the 'href' attribute of the <a> tag
dict_w['link'] = second_cell.find('a')['href']
# The institution that hosts the workshop is the text inside that <a> tag
dict_w['host'] = second_cell.find('a').get_text()
# Get all the text from the second cell
dict_w['all_text'] = second_cell.get_text(strip=True)
# Get the dates from the third cell
dict_w['date'] = third_cell.get_text(strip=True)
print(dict_w)
OUTPUT
{'type': 'swc workshop',
'country': 'MX',
'link': 'https://galn3x.github.io/-2024-10-28-Metagenomics-online/',
'host': 'Nodo Nacional de Bioinformática UNAM',
'all_text': 'Nodo Nacional de Bioinformática UNAMInstructors:César Aguilar, Diana Oaxaca, Nelly Selem-MojicaHelpers:Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto GarcÃ\xada-Estrada',
'date': 'Oct 28 - Oct 31, 2024'}
This was just for one row, but we can iterate over all the rows in the table adding a for loop and appending each dictionary to a list. That list will be transformed to a Pandas dataframe so we can see the results nicely.
PYTHON
list_of_workshops = []
for row in range(num_workshops):
    n_row = workshops_table.contents[row]
    first_cell = n_row.contents[0]
    second_cell = first_cell.next_sibling
    third_cell = second_cell.next_sibling
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

result_df = pd.DataFrame(list_of_workshops)
Great! We’ve finished our first scraping task on a real website. Please be aware that there are multiple ways of achieving the same result. For example, instead of using the .contents attribute to access the different rows of the table, we could have used .find_all('tr') to scan the table and loop through the row elements. Similarly, instead of moving to the siblings of the first data cell, we could have used .find_all('td'). Code using that other approach would look like this. Remember, the results are the same!
PYTHON
list_of_workshops = []
for row in workshops_table.find_all('tr'):
    cells = row.find_all('td')
    first_cell = cells[0]
    second_cell = cells[1]
    third_cell = cells[2]
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

upcomingworkshops_df = pd.DataFrame(list_of_workshops)
A key takeaway from this exercise is that, when we want to scrape data in a structured way, we have to spend some time getting to know how the website is structured and how we can identify and extract only the elements we are interested in.
Challenge
Extract the same information as in the previous exercise, but this time from the Past Workshops Page at https://carpentries.org/past_workshops/. Which 5 countries have held the most workshops, and how many has each held?
We can directly reuse the code we wrote before, changing only the URL from which we get the HTML.
PYTHON
url = 'https://carpentries.org/past_workshops/'
req = requests.get(url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
soup = BeautifulSoup(cleaned_req, 'html.parser')
workshops_table = soup.find('table')
list_of_workshops = []
for row in workshops_table.find_all('tr'):
    cells = row.find_all('td')
    first_cell = cells[0]
    second_cell = cells[1]
    third_cell = cells[2]
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

pastworkshops_df = pd.DataFrame(list_of_workshops)
print('Total number of workshops in the table: ', len(pastworkshops_df))
print('Top 5 of countries by number of workshops held: \n',
      pastworkshops_df['country'].value_counts().head())
OUTPUT
Total number of workshops in the table: 3830
Top 5 of countries by number of workshops held:
country
US 1837
GB 468
AU 334
CA 225
DE 172
Name: count, dtype: int64
Challenge
For a more challenging exercise, try adding to our output dataframe whether the workshop was held online or not.
You’ll notice on the website that online workshops have a world icon between the country flag and the name of the institution that hosts the workshop.
To start, we can see in the HTML document that the world icon appears in the second data cell of a row. For workshops that are online, there is an additional image element with the attributes <img title="Online" alt="globe image" class="flags"/>. So we could check whether the second data cell has an element with the attribute title="Online". If it doesn’t, the .find() method returns nothing, which in Python is the None value (“NoneType”). So if .find() returns None, we should fill the respective cell in our dataframe with “No”, meaning the workshop was not held online, and otherwise fill it with “Yes”. Here is a possible code solution, which you would add to the previous code where we extracted the other data and created the dataframe.
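Here is a minimal sketch of that idea, to be added inside the row loop of the previous solution (after third_cell is defined); the 'online' key is just one possible name for the new column:
PYTHON
# Check if the second cell contains an image element with the attribute title="Online"
online_icon = second_cell.find('img', attrs = {'title': 'Online'})
if online_icon is None:
    dict_w['online'] = 'No'
else:
    dict_w['online'] = 'Yes'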
Automating data collection
Until now we’ve only scraped one website at a time. But there may be situations where the information you need is split across different pages, or where you have to follow a trail of hyperlinks. With the tools we’ve learned so far, this new task is straightforward. We would add a loop that goes to those other pages, gets the HTML document using the requests package, and parses the HTML with BeautifulSoup to extract the required information.
The additional and important step to consider in this task is to add a wait time between each request to the website, so we don’t overload the web server that is providing us the information we need. If we send too many requests in a short period of time, we can prevent other “normal” users from accessing the site during that time, or even cause the server to run out of resources and crash. If the provider of the website detects an excessive use, it could block our computer from accessing that website, or even take legal action in extreme cases.
To make sure we don’t crash the server, we can add a wait time
between each step of our loop with the built-in Python module
time
and its sleep()
function. With this
function, Python will wait for the specified number of seconds before
continuing to execute the next line of code. For example, when you run
the following code, Python will wait 10 seconds between each print
execution.
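For example, a small sketch of how sleep() pauses execution:
PYTHON
from time import sleep

print("First message")
sleep(10)
print("Second message, printed 10 seconds after the first one")
sleep(10)
print("Third message, printed another 10 seconds later")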
Let’s incorporate this important principle as we extract additional information from each of the workshop websites in the upcoming list. We already have our upcomingworkshops_df dataframe and, in it, a link column with the URL of the website for each individual workshop. For example, let’s make a request for the HTML of the first workshop in the dataframe and take a look.
PYTHON
first_url = upcomingworkshops_df.loc[0, 'link']
print("URL we are visiting: ", first_url)
req = requests.get(first_url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
soup = BeautifulSoup(cleaned_req, 'html.parser')
print(soup.prettify())
If we explore the HTML this way, or using ‘View page source’ in the browser, we notice something interesting in the <head> element. As this information is inside <head> instead of the <body> element, it won’t be displayed in our browser when we visit the page, but the meta elements provide metadata that helps search engines better understand, display, and index the page. These <meta> tags contain useful information for our table of workshops, for example, a well-formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. Each of these data points can be identified by the “name” attribute of the <meta> tag, and the information we want to extract is the value of its “content” attribute.
The following code automates the process of getting this data from each website, for the first five workshops in our upcomingworkshops_df dataframe. We only do it for five workshops so we don’t overwhelm the server with too many requests, but we could also do it for all the workshops.
PYTHON
# The tqdm package displays a progress bar as we loop over the URLs
from tqdm import tqdm
from time import sleep

# List of URLs in our dataframe (the first five workshops; .loc slicing is inclusive)
urls = list(upcomingworkshops_df.loc[:4, 'link'])

# Start an empty list to store the different dictionaries with our data
list_of_workshops = []

# Start a loop over each URL
for item in tqdm(urls):
    # Get the HTML and parse it
    req = requests.get(item).text
    cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
    soup = BeautifulSoup(cleaned_req, 'html.parser')
    # Start an empty dictionary and fill it with the URL, which
    # is our identifier with our other dataframe
    dict_w = {}
    dict_w['link'] = item
    # Use the find function to search for the <meta> tag that
    # has each specific 'name' attribute and get the value in the
    # 'content' attribute
    dict_w['startdate'] = soup.find('meta', attrs = {'name': 'startdate'})['content']
    dict_w['enddate'] = soup.find('meta', attrs = {'name': 'enddate'})['content']
    dict_w['language'] = soup.find('meta', attrs = {'name': 'language'})['content']
    dict_w['latlng'] = soup.find('meta', attrs = {'name': 'latlng'})['content']
    dict_w['instructor'] = soup.find('meta', attrs = {'name': 'instructor'})['content']
    dict_w['helper'] = soup.find('meta', attrs = {'name': 'helper'})['content']
    # Append to our list
    list_of_workshops.append(dict_w)
    # Be respectful, wait at least 3 seconds before a new request
    sleep(3)

extradata_upcoming_df = pd.DataFrame(list_of_workshops)
Challenge
It is possible that you received an error when executing the previous code block, and the most probable reason is that a URL you tried to visit doesn’t exist. This is known as a 404 error, which indicates that the requested page doesn’t exist or, more precisely, cannot be found on the server. What would be your approach to work around this possible error?
A crude but Pythonic way of working around any error for a given URL is to use a try/except block, in which you ignore any URL that throws an error and continue with the next one.
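A minimal sketch of that approach, wrapping the body of the loop we wrote before, could be:
PYTHON
list_of_workshops = []
for item in tqdm(urls):
    try:
        req = requests.get(item).text
        cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
        soup = BeautifulSoup(cleaned_req, 'html.parser')
        dict_w = {}
        dict_w['link'] = item
        dict_w['startdate'] = soup.find('meta', attrs = {'name': 'startdate'})['content']
        # ... extract the remaining <meta> fields as before ...
        list_of_workshops.append(dict_w)
    except Exception:
        # Ignore any URL that throws an error and continue with the next one
        continue
    sleep(3)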
A more elegant way to deal with a web page that doesn’t exist is to check the response code that requests gets when it tries to reach the page. If you receive a 200 code, the request was successful. In any other case, you’d want to store the code and skip the scraping of that page. The code you’d use to get the response code is shown below.
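For example, a sketch of how you could use it inside the loop (storing it under a 'status_code' key is just an illustration):
PYTHON
response = requests.get(item)
print("Response status code:", response.status_code)
if response.status_code == 200:
    # The request was successful, so we can proceed with the scraping as before
    req = response.text
else:
    # Store the code and skip the scraping of this page
    dict_w['status_code'] = response.status_code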
Key Points
- We can get the HTML behind any website using the “requests” package and the function requests.get('website_url').text.
- An HTML document is a nested tree of elements. Therefore, from a given element, we can access its children, parent, or siblings using .contents, .parent, .next_sibling, and .previous_sibling.
- It’s polite not to send too many requests to a website in a short period of time. For that, we can use the sleep() function of the built-in Python module time.
Content from Dynamic websites
Last updated on 2024-11-04
Overview
Questions
- What are the differences between static and dynamic websites?
- Why is it important to understand these differences when doing web scraping?
- How can I start my own web scraping project?
Objectives
- Use the Selenium package to scrape dynamic websites.
- Identify the elements of interest using the browser’s “Inspect” tool.
- Understand the usual pipeline of a web scraping project.
You see it but the HTML doesn’t have it!
Visit the following webpage, created by Hartley Brody for the purpose of learning and practicing web scraping: https://www.scrapethissite.com/pages/ajax-javascript/ (read the terms of use first). Select “2015” to see that year’s Oscar-winning films. Now look at the HTML behind the page as we’ve learned, using the “View page source” tool in your browser or in Python with the requests and BeautifulSoup packages. Can you find the best picture winner “Spotlight” anywhere in the HTML? Can you find any of the other movies or the data from the table? If you can’t, how could you scrape this page?
When you explore a page like this, you’ll notice that the movie data,
including the title “Spotlight,” isn’t actually in the initial HTML
source. This is because the website uses JavaScript to load the
information dynamically. JavaScript is a programming language that runs
in your browser, allowing websites to fetch, process, and display
content on the fly, often based on user actions like clicking a button.
When you select “2015”, the browser executes JavaScript (called by one
of the <script>
elements you see in the HTML) to
retrieve the relevant movie information from the web server and build a
new HTML document with actual information in the table. This makes the
page feel more interactive, but it also means the initial HTML you see
doesn’t contain the movie data itself.
Let’s explore a new way to view HTML elements in your browser to better understand the differences in an HTML document before and after JavaScript is executed. On the Oscar winners page you just visited, right-click (or Control key + Click on Mac) and select “Inspect” from the pop-up menu, as shown in the image below. This opens DevTools on the side of your browser, offering a range of features for inspecting, debugging, and analyzing web pages in real-time. For this workshop, however, we’ll focus on just the “Elements” tab. In the “Elements” tab, you’ll see an HTML document that actually includes the table element, unlike what you saw in “View Page Source”. This difference is because “View Page Source” displays the original HTML, before any JavaScript is run, while “Inspect” reveals the HTML as it appears after JavaScript has executed.
As the requests package only retrieves the source HTML, we need a different approach to scrape these types of websites. For this, we will use the Selenium package. But don’t forget about the “Inspect” tool we just learned; it will come in handy when scraping.
Using Selenium to scrape dynamic websites
Selenium is an open source project for web browser automation. It will be useful for our scraping tasks as it will act as a real user interacting with a webpage in a browser. This way, Selenium will render pages in a browser environment, allowing JavaScript to load dynamic content, and therefore giving us access to the website HTML after JavaScript has executed. Additionally, this package simulates real user interactions like filling in text boxes, clicking, scrolling, or selecting drop-down menus, which will be useful when we scrape dynamic websites.
To use it, we’ll load “webdriver” and “By” from the selenium package. webdriver opens or simulates a web browser and interacts with it based on the instructions we give. By allows us to specify how we will select a given element in the HTML: by tag (using “By.TAG_NAME”) or by attributes like class (“By.CLASS_NAME”), id (“By.ID”), or name (“By.NAME”). We will also load the other packages we used in the previous episode.
PYTHON
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
Selenium can simulate multiple browsers like Chrome, Firefox, Safari, etc. For now, we’ll use Chrome. When you run the following line of code, you’ll notice that a Google Chrome window opens up. Don’t close it, as this is how Selenium interacts with the browser. Later we’ll see how to do headless browser interactions, headless meaning that the browser interactions happen in the background, without opening a browser window or user interface.
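The line of code that opens the browser and creates the driver object we’ll use from now on is:
PYTHON
driver = webdriver.Chrome()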
Now, to tell the browser to visit our Oscar winners page, use the
.get()
method on the driver
object we just
created.
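That is:
PYTHON
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")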
How can we direct Selenium to click the text “2015” so that year’s table shows up? First, we need to find that element, in a similar way to how we find elements with BeautifulSoup. Just as we used .find() in BeautifulSoup to find the first element that matches the specified criteria, in Selenium we have .find_element(). Likewise, as we used .find_all() in BeautifulSoup to return a list of all matches for our search criteria, we can use .find_elements() in Selenium. But the syntax for specifying the search parameters is a little different.
If we wanted to search for a table element identified by the <table> tag, we would run driver.find_element(by=By.TAG_NAME, value="table"). If we wanted to search for an element with a specific value in its “class” attribute, for example an element like <tr class="film">, we would run driver.find_element(by=By.CLASS_NAME, value="film"). To know which element we need to click to open the 2015 table of Oscar winners, we can use the “Inspect” tool (remember you can do this in Google Chrome by pointing your mouse at the “2015” value, right-clicking, and selecting “Inspect” from the pop-up menu). In the DevTools window, you’ll see the element <a href="#" class="year-link" id="2015">2015</a>. As the id attribute is unique to a single element in the HTML, we can directly select the element by this attribute using the code you’ll find after the following image.
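That code is:
PYTHON
button_2015 = driver.find_element(by=By.ID, value="2015")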
We’ve located the hyperlink element we want to click to get the table for that year, and we will use the .click() method on that element to interact with it. As the table takes a couple of seconds to load, we will also use the sleep() function from the “time” module to wait while the JavaScript runs and the table loads. Then, we use driver.page_source to have Selenium get the HTML document from the website, and we store it in a variable called html_2015. Finally, we close the web browser that Selenium was using with driver.quit().
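Putting those steps together:
PYTHON
button_2015.click()
sleep(3)
html_2015 = driver.page_source
driver.quit()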
Importantly, the HTML document we stored in html_2015
is the HTML after the dynamic content loaded, so it
will contain the table values for 2015 that weren’t there originally and
that we wouldn’t be able to see if we had used the requests
package instead.
We could continue using Selenium and its .find_element() and .find_elements() methods to extract our data of interest. But instead, we will use BeautifulSoup to parse the HTML and find elements, as we already have some practice with it. If we search for the first element with a class attribute equal to “film-title” and return the text inside it, we see that this HTML does contain the “Spotlight” movie.
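A minimal sketch of that search, assuming the html_2015 variable from the previous step, could be:
PYTHON
soup = BeautifulSoup(html_2015, 'html.parser')
print(soup.find(class_="film-title").get_text(strip=True))
For reference, the first row of the rendered table looks like this: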
HTML
<tr class="film">
<td class="film-title">
Spotlight
</td>
<td class="film-nominations">
6
</td>
<td class="film-awards">
2
</td>
<td class="film-best-picture">
<i class="glyphicon glyphicon-flag">
</i>
</td>
</tr>
The following code repeats the process of clicking and loading the 2015 data, but now using “headless” mode (i.e. without opening a browser window). Then, it extracts data from the table one column at a time, taking advantage of the fact that each column has a unique class attribute that identifies it. Instead of using for loops to extract data from each element that .find_all() finds, we use list comprehensions. You can learn more about them by reading Python’s documentation for list comprehensions, or with this short Programiz tutorial.
PYTHON
# Create the Selenium webdriver and make it headless
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
# Load the website. Find and click 2015. Get post JavaScript execution HTML. Close webdriver
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
button_2015 = driver.find_element(by=By.ID, value="2015")
button_2015.click()
sleep(3)
html_2015 = driver.page_source
driver.quit()
# Parse HTML using BeautifulSoup and extract each column as a list of values using list comprehensions
soup = BeautifulSoup(html_2015, 'html.parser')
titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
# For the best picture column, we can't use .get_text() as there is no text
# Rather, we want to see if there is an <i> tag
best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
# Create a dataframe based on the previous lists
movies_2015 = pd.DataFrame(
{'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc}
)
Challenge
Based on what we’ve learned in this episode, write code to get the data for all the years from 2010 to 2015 from Hartley Brody’s website of Oscar-winning film information. Hint: You’ll use the same code, but add a loop through each year.
Besides adding a loop for each year, the following solution refactors the code into two functions: one that finds and clicks a year, returning the HTML after the data shows up, and another that takes that HTML and parses it to extract the data and create a dataframe.
So you can see the process of how Selenium opens the browser and clicks the years, we are not adding the “headless” option.
PYTHON
# Function to search for a year hyperlink, click it, and return the HTML after the data shows up
def findyear_click_gethtml(year):
    button = driver.find_element(by=By.ID, value=year)
    button.click()
    sleep(3)
    html = driver.page_source
    return html

# Function to parse the HTML, extract the table data, and assign a year column
def parsehtml_extractdata(html, year):
    soup = BeautifulSoup(html, 'html.parser')
    titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
    nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
    awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
    best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
    movies_df = pd.DataFrame(
        {'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc, 'year': year}
    )
    return movies_df

# Open the Selenium webdriver and go to the page
driver = webdriver.Chrome()
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")

# Create an empty dataframe where we will append/concatenate the dataframes we get for each year
result_df = pd.DataFrame()
for year in ["2010", "2011", "2012", "2013", "2014", "2015"]:
    html_year = findyear_click_gethtml(year)
    df_year = parsehtml_extractdata(html_year, year)
    result_df = pd.concat([result_df, df_year])

# Close the browser that Selenium opened
driver.quit()
Challenge
If you are tired of scraping table data like we’ve been doing for the last two episodes, here is another dynamic website exercise where you can practice what you’ve learned. Go to this product page created by scrapingcourse.com and extract all product names and prices, and also the hyperlink that each product card has to a detailed view page.
When you complete that, and if you are up to an additional challenge, scrape from the detailed view page of each product its SKU, Category and Description.
To identify which elements contain the data you need, you should use the “Inspect” tool in your browser. The following image is a screenshot of the website. There we can see that each product card is a <div> element with multiple attributes that we can use to narrow down our search to the specific elements we want. For example, we can use 'data-testid'='product-item'. After we find all the divs that satisfy that condition, we can extract from each one the hyperlink, the name, and the price. The hyperlink is the ‘href’ attribute of the <a> tag. The name and price are inside <span> tags, and we could use multiple attributes to get each of them. In the following code, we use 'class'='product-name' to get the name and 'data-content'='product-price' to get the price.
PYTHON
# Open Selenium webdriver in headless mode and go to the desired page
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# As we don't have to click anything, just wait for the JavaScript to load, we can get the HTML right away
sleep(3)
html = driver.page_source
# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')
# Find all <div> elements that have a 'data-testid' attribute with the value of 'product-item'
divs = soup.find_all("div", attrs = {'data-testid': 'product-item'})
# Loop through the <div> elements we found, and for each get the href,
# the name (inside a <span> element with attribute class="product-name"),
# and the price (inside a <span> element with attribute data-content="product-price")
list_of_dicts = []
for div in divs:
    # Create a dictionary to store the data we want for each product
    item_dict = {
        'link': div.find('a')['href'],
        'name': div.find('span', attrs = {'class': 'product-name'}).get_text(),
        'price': div.find('span', attrs = {'data-content': 'product-price'}).get_text()
    }
    list_of_dicts.append(item_dict)

all_products = pd.DataFrame(list_of_dicts)
We could arrive at the same result by replacing the for loop with list comprehensions. Here is another possible solution using that approach.
PYTHON
links = [elem['href'] for elem in soup.find_all('a', attrs = {'class': 'product-link'})]
names = [elem.get_text() for elem in soup.find_all('span', attrs = {'class': 'product-name'})]
prices = [elem.get_text() for elem in soup.find_all('span', attrs = {'data-content': 'product-price'})]
all_products_v2 = pd.DataFrame(
{'link': links, 'name': names, 'price': prices}
)
The scraping pipeline
By now, you’ve learned about the core tools for web scraping: requests, BeautifulSoup, and Selenium. These three tools form a versatile pipeline for almost any web scraping task. When starting a new scraping project, there are several important steps to follow that will help ensure you capture the data you need.
The first step is understanding the website structure. Every website is different and structures data in its own particular way. Spend some time exploring the site and identifying the HTML elements that contain the information you want. Next, determine if the content is static or dynamic. Static content can be directly accessed from the HTML source code using requests and BeautifulSoup, while dynamic content often requires Selenium to load JavaScript on the page before BeautifulSoup can parse it.
Once you know how the website presents its data, start building your pipeline. If the content is static, make a requests call to get the HTML document, and use BeautifulSoup to locate and extract the necessary elements. If the content is dynamic, use Selenium to load the page fully, perform any interactions (like clicking or scrolling), and then pass the rendered HTML to BeautifulSoup for parsing and extracting the necessary elements. Finally, format and store the data in a structured way that’s useful for your specific project and that makes it easy to analyse later.
This scraping pipeline helps break down complex scraping tasks into manageable steps and allows you to adapt the tools based on the website’s unique features. With practice, you’ll be able to efficiently combine these tools to extract valuable data from almost any website.
Key Points
- Dynamic websites load content using JavaScript, which isn’t present in the initial or source HTML. It’s important to distinguish between static and dynamic content when planning your scraping approach.
- The Selenium package and its webdriver module simulate a real user interacting with a browser, allowing it to execute JavaScript and to click, scroll, or fill in text boxes.
- Here are the commands we learned for Selenium:
  - webdriver.Chrome() # Start the Google Chrome browser simulator
  - .get("website_url") # Go to a given website
  - .find_element(by, value) and .find_elements(by, value) # Get a given element
  - .click() # Click the selected element
  - .page_source # Get the HTML after JavaScript has executed, which can later be parsed with BeautifulSoup
  - .quit() # Close the browser simulator
- The browser’s “Inspect” tool allows users to view the HTML document after dynamic content has loaded, revealing elements added by JavaScript. This tool helps identify the specific elements you are interested in scraping.
- A typical scraping pipeline involves understanding the website’s structure, determining content type (static or dynamic), using the appropriate tools (requests and BeautifulSoup for static, Selenium and BeautifulSoup for dynamic), and structuring the scraped data for analysis.