Hello-Scraping


  • Every website has an HTML document behind it that gives structure to its content.
  • An HTML document is composed of elements, which usually have an opening <tag> and a closing </tag>.
  • Elements can have different properties, assigned by attributes in the form of <tag attribute_name="value">.
  • We can parse any HTML document with BeautifulSoup() and find elements using the .find() and .find_all() methods.
  • We can access the text of an element with the .get_text() method, and its attribute values as we do with Python dictionaries: element["attribute_name"] (see the sketch after this list).
  • We must be careful not to violate the Terms of Service (TOS) of the website we are scraping.
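
As a minimal sketch of these ideas, assuming the bs4 package is installed (the HTML snippet, tags, and attribute values below are made up for illustration):

    from bs4 import BeautifulSoup

    # A tiny hand-written HTML document, just for illustration
    html = """
    <html>
      <body>
        <h1 id="title">Hello-Scraping</h1>
        <p class="intro">Welcome to <a href="https://example.com">my site</a>.</p>
        <p class="intro">Another paragraph.</p>
      </body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # .find() returns the first match; .find_all() returns every match
    title = soup.find("h1")
    paragraphs = soup.find_all("p", class_="intro")

    print(title.get_text())   # text inside the element -> "Hello-Scraping"
    link = soup.find("a")
    print(link["href"])       # attribute access, like a dict -> "https://example.com"
    print(len(paragraphs))    # -> 2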

Scraping a real website


  • We can get the HTML behind any website using the “requests” package: requests.get('website_url') sends the request, and the response’s .text attribute holds the HTML.
  • An HTML document is a nested tree of elements. Therefore, from a given element, we can access its children, parent, or siblings using .contents, .parent, .next_sibling, and .previous_sibling.
  • It’s polite not to send too many requests to a website in a short period of time. For that, we can use the sleep() function from the built-in Python module time, as in the sketch after this list.
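
A short sketch combining these points, assuming the requests and bs4 packages are installed; the URL is a placeholder, and the elements you navigate will depend on the actual page structure:

    import time

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com"    # placeholder URL
    html = requests.get(url).text  # the raw HTML behind the page
    soup = BeautifulSoup(html, "html.parser")

    # The document is a nested tree: from any element we can move around it
    body = soup.find("body")
    print(body.contents)             # list of the element's direct children
    first_child = body.contents[0]
    print(first_child.parent.name)   # back up the tree -> "body"
    print(first_child.next_sibling)  # the node right after it (if any)
    # ...and .previous_sibling steps one node back

    time.sleep(2)  # pause before the next request, to be polite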

Dynamic websites


  • Dynamic websites load content using JavaScript, which isn’t present in the initial or source HTML. It’s important to distinguish between static and dynamic content when planning your scraping approach.
  • The Selenium package and its webdriver module simulate a real user interacting with a browser, allowing it to execute JavaScript and to click, scroll, or fill in text boxes.
  • Here are the commands we learned for Selenium (combined in the sketch at the end of this section):
    • webdriver.Chrome() # Start the Google Chrome browser simulator
    • .get("website_url") # Go to a given website
    • .find_element(by, value) and .find_elements(by, value) # Get the first or all matching elements
    • .click() # Click the element selected
    • .page_source # Get the HTML after JavaScript has executed, which can later be parsed with BeautifulSoup
    • .quit() # Close the browser simulator
  • The browser’s “Inspect” tool allows users to view the HTML document after dynamic content has loaded, revealing elements added by JavaScript. This tool helps identify the specific elements you are interested in scraping.
  • A typical scraping pipeline involves understanding the website’s structure, determining whether the content is static or dynamic, using the appropriate tools (requests and BeautifulSoup for static pages; Selenium plus BeautifulSoup for dynamic ones), and structuring the scraped data for analysis, as in the sketch below.
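
Here is a minimal end-to-end sketch tying the commands and the pipeline together. It assumes the selenium and bs4 packages are installed and a matching ChromeDriver is available; the URL, the "load-more" button ID, and the "quote" class name are placeholders for whatever the Inspect tool reveals on the site you are actually scraping:

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()        # start the Chrome browser simulator
    driver.get("https://example.com")  # go to the (placeholder) website

    # Click a button that loads more content; the ID is hypothetical
    driver.find_element(By.ID, "load-more").click()
    time.sleep(2)  # give the JavaScript time to run

    # Get the HTML after JavaScript has executed and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()  # close the browser simulator

    # Structure the scraped data for analysis; the "quote" class is hypothetical
    rows = [{"text": el.get_text(strip=True)} for el in soup.find_all("p", class_="quote")]
    print(rows)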