Web Scraping using Python and headless Chrome — Tutorial & Explanation

Yair Nevet
5 min read · Aug 22, 2023

Web scraping is a technique used to extract data from websites. This method involves making HTTP requests to websites, fetching the returned content (typically HTML), and then parsing this content to extract desired data. Web scraping is often used when the target website does not offer an API or if the data on the website is presented in a way that’s not easily downloadable.

Photo by Florian Olivo on Unsplash

In this article, I will outline the process of scraping product information from an online store’s webpage using a Python script and a couple of libraries. Additionally, I’ll guide you through saving the scraped data in both CSV and Parquet formats for subsequent review and analysis.

In particular, you’ll see how I use a Chrome WebDriver to drive a headless Chrome browser. The goal? To retrieve content from a specific web page that showcases a list of MacBook laptops. Once the page is obtained, I sift through the product details with the aid of the BeautifulSoup library, iterating over every listed product. Finally, I store these details in both a CSV and a Parquet file. Dive in to see this in action!

The diagram below illustrates the web scraping process described above:

First Steps: Setting up Python, Virtual Environment, and Installing the Required Packages

Follow these steps to get your Python environment set up with all the necessary packages:

1. Install Python3:
— Visit the official Python website: https://www.python.org/downloads/
— Download the latest Python3 version.
— Follow the installation steps and ensure you check the box that says Add Python to PATH during installation.
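— To confirm the installation succeeded, you can check the interpreter version from a terminal:

python3 --version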

2. Setting up a Project Directory:
— First, ensure you have Visual Studio Code (VSCode) installed. If not, download it from here: https://code.visualstudio.com/
— Open a terminal or command prompt.
— Create a new directory for your project:

mkdir py-web-scraper

— Navigate into the directory:

cd py-web-scraper

— Open this directory with VSCode:

code .

3. Setting up a Python Virtual Environment:
— Inside VSCode’s terminal, create a new virtual environment:

python3 -m venv .venv

— Activate the virtual environment:
Windows:

.\.venv\Scripts\Activate

Mac/Linux:

source .venv/bin/activate

— Ensure Python3 is the interpreter for your virtual environment. Within VSCode, you can select the interpreter by pressing Ctrl+Shift+P, typing Python: Select Interpreter, and then choosing the one within your virtual environment.

4. Installing the Required Packages:
— With your virtual environment activated, install the required packages using pip:

pip install selenium webdriver-manager beautifulsoup4 pandas pyarrow html5lib

5. Setting up Chrome WebDriver:
— Determine your Chrome version by navigating to the three-dot menu in the upper right corner of your Chrome browser, selecting Help, and then About Google Chrome.
— Visit the ChromeDriver download page: https://sites.google.com/a/chromium.org/chromedriver/downloads and download the matching version.
— Once downloaded, you have two options:
a. Copy the chromedriver binary into your py-web-scraper directory.
b. Or, add the location of the chromedriver binary to your system’s PATH variable.
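Alternatively, since the webdriver-manager package was installed in step 4, you can skip the manual download and let it fetch a matching driver at runtime. Here is a minimal sketch of that approach (not part of the original script):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a ChromeDriver that matches the installed Chrome
# and returns its path; Selenium then launches Chrome through that driver.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))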

6. Creating Your Main File:
— In VSCode, within your project directory, create a new file and name it main.py

7. Paste the following Python code into the main.py file:
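Below is a sketch of the script described in the step-by-step walkthrough later in this article. The listing URL and the CSS class names are illustrative placeholders; replace them with the actual values from the page you are scraping (inspect the page’s HTML to find them).

import html
from datetime import datetime

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure a headless Chrome browser
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)

# Navigate to the product listing page (placeholder URL)
url = "https://example.com/macbook-laptops"  # replace with the real listing URL
driver.get(url)

# Retrieve the rendered HTML and parse it with BeautifulSoup
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, "html5lib")

# Collect product details (placeholder class names)
products, prices, currencies, ratings = [], [], [], []
for item in soup.find_all("a", attrs={"class": "product-item"}):
    name = item.find("div", attrs={"class": "product-title"})
    price = item.find("span", attrs={"class": "price"})
    currency = item.find("span", attrs={"class": "currency"})
    rating = item.find("span", attrs={"class": "rating"})
    products.append(html.unescape(name.text.strip()) if name else None)
    prices.append(price.text.strip() if price else None)
    currencies.append(currency.text.strip() if currency else None)
    ratings.append(rating.text.strip() if rating else None)

# Organize the scraped data into a DataFrame
df = pd.DataFrame({
    "Product Name": products,
    "Price": prices,
    "Currency": currencies,
    "Rating": ratings,
})

# Stamp the scrape date with the current year and month
now = datetime.now()
df["Year"] = now.year
df["Month"] = now.month

# Export to CSV and to a year/month-partitioned Parquet dataset
df.to_csv("products.csv", index=False)
df.to_parquet("data.parquet", engine="pyarrow", partition_cols=["Year", "Month"])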

8. Execute your script by running the following command:

python main.py

9. Inspect your root directory, and you’ll discover newly generated files from your web scraping efforts: a products.csv file and a data.parquet directory. The latter holds Parquet files partitioned into year and month folders. If you open the CSV file using Excel or the Numbers app, it should resemble the table view outlined below:
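If you prefer a quick programmatic check instead of a spreadsheet, a short snippet like this (an illustrative sketch, assuming the column names from the script above) prints the first few rows:

import pandas as pd

# Load the generated CSV and preview the first rows
df = pd.read_csv("products.csv")
print(df.head())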

The data representation provided above illustrates our capability to transform raw data from a webpage into a structured and queryable dataset for in-depth analysis and evaluation. This underscores the significant advantage of web scraping: the ability to convert unstructured data into organized and easily accessible formats.

Let’s dissect the code sequence in the `main.py` file to ensure you grasp each segment:

1. Imports:
— The script starts by importing necessary libraries and modules:
selenium: For automating web browser interactions.
BeautifulSoup from bs4: To parse and navigate the HTML content.
pandas: For data manipulation and analysis.
html: To handle HTML entities.
datetime: To work with dates and times.

2. Webdriver Configuration:
— ChromeOptions are set for the browser driver so Chrome runs without a visible user interface (--headless) and with GPU hardware acceleration disabled (--disable-gpu). The --no-sandbox option is often required when Chrome runs in restricted environments, such as containers, where its security sandbox cannot be used.
— A new Chrome browser driver instance is then created with these options.

3. Webpage Access:
— The driver navigates to the given URL (a page listing MacBooks on the Sharaf DG website).

4. HTML Extraction:
— The entire source of the webpage is retrieved and stored in the variable content.
— This content is then parsed using BeautifulSoup, producing an object (soup) that provides methods to search and navigate the HTML structure.

5. Data Scraping:
— The script looks for all anchor (<a>) tags with a specific class, each of which represents an individual product listing.
— For each of these products, it extracts the product’s name, price, currency, and rating by locating the respective HTML elements and their classes.
— This extracted data is then appended to respective lists (products, prices, currencies, and ratings).

6. Data Wrangling with Pandas:
— A new DataFrame (df) is created using the Pandas library. This DataFrame organizes the scraped data into columns like Product Name, Price, Currency, and Rating.

7. Date Handling:
— The current date is retrieved, and both the year and month are extracted.
— These values are then added as new columns in the DataFrame, indicating the year and month when the data was scraped.

8. Data Export:
— The DataFrame is saved both as a Parquet dataset (you’ll find it under a directory named data.parquet) and as a CSV file named products.csv in the root folder.
— For the Parquet output, the data is partitioned by year and month using the pyarrow engine, which helps optimize reads on large datasets.
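To illustrate why partitioning helps (a sketch assuming the Year and Month partition columns from the script above), a reader can push a filter down to the partition folders so that only the relevant ones are read from disk:

import pandas as pd

# With a partitioned dataset, pyarrow reads only the matching
# Year=.../Month=... folders instead of the whole dataset.
recent = pd.read_parquet(
    "data.parquet",
    engine="pyarrow",
    filters=[("Year", "=", 2023), ("Month", "=", 8)],
)
print(recent.head())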

In summary, this script automates the process of visiting an online store’s webpage, extracting product details, organizing them into a structured format, and then saving the data in both CSV and Parquet formats, all while adding the current year and month for reference.

You can find the entire tutorial code and use it through this repo link: https://github.com/ynevet/py-web-scraper

If you have any thoughts or questions, please leave a comment below. I’d be happy to engage in a discussion. Hope you found this content enjoyable!
