Building a Concurrent Web Scraper with Python and Selenium


Web scraping and crawling is the process of automatically extracting data from websites. Websites are built with HTML, so the data on them is often unstructured; still, we can recover structure with the help of HTML tags, IDs, and classes. In this article, we will find out how to speed up this process using concurrent techniques such as multithreading. Concurrency means overlapping two or more tasks so that they make progress during the same period of time; because the tasks no longer wait for one another, total execution time drops, making our web scraper fast.
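To see why this helps before we touch any scraping code, here is a toy sketch (not part of the scraper) in which time.sleep stands in for waiting on a web page. Five simulated one-second requests take about five seconds sequentially, but only about one second when overlapped:

import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    time.sleep(1)  # stand-in for waiting on a network response
    return i

start = time.perf_counter()
for i in range(5):
    fake_request(i)  # one after another: roughly 5 seconds
print(f"sequential: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
with ThreadPoolExecutor() as executor:
    list(executor.map(fake_request, range(5)))  # overlapped: roughly 1 second
print(f"threaded: {time.perf_counter() - start:.1f}s")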


Concretely, we will create a bot that scrapes all quotes, authors, and tags from a given sample website, and then speed it up using multithreading and multiprocessing.

Modules required 🔌 :

  • Python: a working Python 3 installation (python.org)
  • Selenium: browser automation library
    Installing selenium using pip
    pip install selenium

  • Webdriver-manager: downloads and manages headless browser drivers
    Installing webdriver-manager using pip
    pip install webdriver-manager
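To confirm both packages are installed, a quick sanity check (importlib.metadata is in the standard library from Python 3.8 onwards):

from importlib.metadata import version

# print the installed versions of both packages
print("selenium:", version("selenium"))
print("webdriver-manager:", version("webdriver-manager"))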
    

Basic Setup 🗂 :

Let us create a directory named web-scraper, with a subfolder named scraper. In the scraper folder, we will place our helper functions.

For creating a directory, we can use the following command.

mkdir -p web-scraper/scraper

After creating the folders, we create the files using the commands below. scrapeQuotes.py contains our main scraper code,

cd web-scraper && touch scrapeQuotes.py

whereas scraper.py contains our helper functions.

cd scraper && touch scraper.py __init__.py

After this, our folder structure will look like the tree below. Open the project in any text editor of your choice.

web-scraper
├── scrapeQuotes.py
└── scraper
    ├── __init__.py
    └── scraper.py

Editing scraper.py in the scraper Folder 📝 :

Now, let us write some helper functions for our bot:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    return webdriver.Chrome(
        service=ChromeService(ChromeDriverManager().install()),
        options=options,
    )


def persist_data(data, filename):
    # append each scraped line to the output file; the context manager
    # guarantees the file is closed even if the write fails
    # (note: concurrent writers appending to the same file may interleave lines)
    try:
        with open(filename, "a") as file:
            file.writelines([f"{line}\n" for line in data])
    except OSError as e:
        return e
    return True
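Before moving on, we can sanity-check these helpers with a quick throwaway snippet (run it from the web-scraper directory so the scraper package is importable; the smoke-test.txt filename is just an example):

from scraper.scraper import get_driver, persist_data

driver = get_driver()
driver.get("https://quotes.toscrape.com")
print(driver.title)  # the page title, e.g. "Quotes to Scrape"
persist_data([driver.title], "smoke-test.txt")
driver.quit()  # end the session and shut down the headless browser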

Editing scrapeQuotes.py ๐Ÿ“ :

For scraping, we need an identifier in order to locate the quotes on this website. Inspecting the page, we can see that each quote element carries the class quote.

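For reference, the visible text of one such element, which Selenium exposes through the element's .text attribute, looks roughly like this (illustrative, based on the site's first quote):

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein (about)
Tags: change deep-thoughts thinking world

We will rely on this three-line shape later when cleaning the output.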

To fetch all matching elements, we will use the By.CLASS_NAME locator together with an explicit wait:

def scrape_quotes(self):
    quotes_list = []
    quotes = WebDriverWait(self.driver, 20).until(
        EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
    )
    for quote in quotes:
        quotes_list.append(self.clean_output(quote))
    self.close_driver()
    return quotes_list
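The same elements could equally be located with a CSS selector; assuming the markup stays as described above, this is a drop-in replacement for the wait call in scrape_quotes:

quotes = WebDriverWait(self.driver, 20).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.quote"))
)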

Complete script with a few more functions:

  • load_page: loads the web page at the URL passed to the constructor
  • scrape_quotes: fetches all quote elements by class name and returns them as cleaned, comma-joined lines
  • close_driver: closes the Selenium session once the quotes are scraped
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data


class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

    def clean_output(self, quote):
        raw_quote = quote.text.strip().split("\n")
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the WebDriver session and shuts down the browser process
        self.driver.quit()


def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")


if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]
    for tag in tags:
        main(tag)
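One caveat: clean_output joins fields with commas, but quote texts often contain commas themselves, so quotes.csv is only a naive CSV. If you need a file that spreadsheet tools parse reliably, a sketch using the standard library's csv module (which handles quoting and escaping for us) would look like this:

import csv

# writerow escapes embedded commas and quotes automatically
with open("quotes.csv", "a", newline="") as f:
    csv.writer(f).writerow(['"A quote, with a comma"', "An Author", "tag1 tag2"])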

Calculating Time ⏱️ :

In order to see the difference between the sequential web scraper and its concurrent counterparts, we will run each version of the script under the time command.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see our script took 22.84 seconds.

Configure Multithreading ⚙️ :

We will rewrite our script to use the concurrent.futures library. From it we import ThreadPoolExecutor, a class that manages a pool of worker threads and lets several calls to our main function run concurrently.

    # list to store the threads
    threadList = []

    # initialize the thread pool
    with ThreadPoolExecutor() as executor:
        for tag in tags:
            threadList.append(executor.submit(main, tag))

    # wait for all the threads to complete
    wait(threadList)

Here, executor.submit takes the callable to run, which is main in our case, along with the arguments we need to pass to it, and returns a Future object. The wait function blocks the flow of execution until all submitted tasks are complete.
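As a minimal, standalone sketch of the submit/wait pattern (the greet function and its arguments are made up for illustration):

from concurrent.futures import ThreadPoolExecutor, wait

def greet(name):
    return f"hello, {name}"

# max_workers caps how many tasks run at once; if omitted, Python derives
# a default from the CPU count
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(greet, name) for name in ["ada", "alan"]]
    done, _ = wait(futures)  # blocks until every submitted task finishes

for future in futures:
    print(future.result())  # hello, ada / hello, alan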

Editing scrapeQuotes.py for using Multithreading 📝 :

from concurrent.futures import ThreadPoolExecutor, wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data


class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

    def clean_output(self, quote):
        raw_quote = quote.text.strip().split("\n")
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the WebDriver session and shuts down the browser process
        self.driver.quit()


def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")


if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]

    # list to store the threads
    threadList = []

    # initialize the thread pool
    with ThreadPoolExecutor() as executor:
        for tag in tags:
            threadList.append(executor.submit(main, tag))

    # wait for all the threads to complete
    wait(threadList)

Calculating Time ⏱️ :

Let us measure how long the multithreaded version of the script takes.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see our multithreaded script took 9.27 seconds.

Configure Multiprocessing ⚙️ :

We can easily convert the multithreaded script into a multiprocessing one: in Python, multiprocessing is provided by ProcessPoolExecutor, which has the same interface as ThreadPoolExecutor but runs each task in a separate process. Because separate processes are not bound by a shared Global Interpreter Lock, multiprocessing achieves true parallelism across CPU cores, whereas multithreading provides concurrency within a single process.

    # list to store the processes
    processList = []

    # initialize the multiprocessing pool
    with ProcessPoolExecutor() as executor:
        for tag in tags:
            processList.append(executor.submit(main, tag))

    # wait for all the processes to complete
    wait(processList)
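One caveat worth noting: ProcessPoolExecutor pickles the function and its arguments to ship them to worker processes, so the submitted callable must be defined at module top level, and the pool should be created under an if __name__ == "__main__": guard, as our script already does. A minimal sketch (the square function is made up for illustration):

from concurrent.futures import ProcessPoolExecutor

def square(n):
    # workers receive a pickled reference to this top-level function
    return n * n

if __name__ == "__main__":
    # the guard stops child processes from re-running the pool setup
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(square, range(5))))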

Editing scrapeQuotes.py for using Multiprocessing 📝 :

from concurrent.futures import ProcessPoolExecutor, wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data


class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

    def clean_output(self, quote):
        raw_quote = quote.text.strip().split("\n")
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the WebDriver session and shuts down the browser process
        self.driver.quit()


def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")


if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]

    # list to store the processes
    processList = []

    # initialize the multiprocessing pool
    with ProcessPoolExecutor() as executor:
        for tag in tags:
            processList.append(executor.submit(main, tag))

    # wait for all the processes to complete
    wait(processList)

Calculating Time ⏱️ :

Let us measure how long the multiprocessing version of the script takes.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see our multiprocessing script took 8.23 seconds.
