Scrapy vs BeautifulSoup vs Selenium – 3 Powerful Tools

Scrapy vs. BeautifulSoup vs. Selenium: this question comes up again and again. In this blog, we will try to understand the differences. Let’s talk about Selenium first.

Selenium

Selenium was created as a UI testing tool, and testers in software companies still use it that way. Let’s see a typical scenario.

Let’s assume a Tester is working for the IT department of the Malaysian Government. This tester is testing the Registration Page. So this is how Selenium is used:

The Testers write a script in Python (or another language supported by Selenium). The script will be something like this:

Launch Chrome
Open a URL – https://www.malaysia.gov.my/portal/register
Click Citizenship status and click on Permanent Citizen.
Verify that the Permanent Citizen value is selected [This is known as POSITIVE TESTING]
Click Citizenship Status and click on Indian Citizen.
Verify that the Indian Citizen value is not found [This is known as NEGATIVE TESTING]
Do this for all fields: Enter valid values and verify no errors. Enter invalid values and verify expected errors
Submit the page
If there is unexpected behavior, take a screenshot and send it to developers for a fix
Close the browser
Repeat the same steps in Firefox
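
The steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the element id `citizenship`, the option labels, and the function names are hypothetical, because the real page's markup is not shown here. The locator strategy is passed as the plain string "xpath", which is the value Selenium's `By.XPATH` constant holds.

```python
# Sketch of the test steps; the "citizenship" id and option labels are hypothetical.
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH

URL = "https://www.malaysia.gov.my/portal/register"


def option_xpath(label):
    """XPath for one option of the (assumed) citizenship drop-down."""
    return f"//select[@id='citizenship']/option[normalize-space()='{label}']"


def check_citizenship(driver, url=URL):
    """Run the positive and negative checks; return a list of failure messages."""
    failures = []
    driver.get(url)

    # Positive test: a valid value can be selected.
    option = driver.find_element(XPATH, option_xpath("Permanent Citizen"))
    option.click()
    if not option.is_selected():
        failures.append("Permanent Citizen could not be selected")

    # Negative test: an invalid value must not be offered at all.
    if driver.find_elements(XPATH, option_xpath("Indian Citizen")):
        failures.append("Indian Citizen should not be in the list")

    return failures


def run_in_browser(browser_name):
    """Launch a browser ("Chrome" or "Firefox"), run the checks, then close it."""
    from selenium import webdriver  # imported here so the checks above need no browser

    driver = getattr(webdriver, browser_name)()
    try:
        failures = check_citizenship(driver)
        if failures:
            # Keep evidence of unexpected behavior for the developers.
            driver.save_screenshot(f"failure_{browser_name}.png")
        return failures
    finally:
        driver.quit()  # always close the browser
```

Calling `run_in_browser("Chrome")` and then `run_in_browser("Firefox")` would repeat the same checks in both browsers.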

So I hope you understand how Selenium was originally intended to be used.

Note that in the above case, I DID NOT talk about HTML, JavaScript, etc., but only about the browser.

So when we use Selenium and Python, we see what the browser sees. There is no complication of HTML, JavaScript, etc. In fact, you can call
driver.save_screenshot("screenshot.png")
to save a screenshot any time in your script.

BeautifulSoup

Because we have the browser as a mediator, using Selenium makes things slow and memory hungry. So we use Selenium for web scraping only when other approaches don’t work.

In most cases, we actually don’t need to wait for HTML to be “painted” on the browser window. For example, we don’t need to see how the price looks, what fonts are applied, what the color is, etc. We just need the price, and it is there in HTML.

So the browser is replaced with the requests library, whose role is to send raw HTTP requests and receive raw HTTP responses.

This HTTP response now needs to be parsed so that valid HTML can be extracted. This is done by a parser. In our case, we use the lxml library.

This HTML now needs to be made “pretty” so that we, as developers, can navigate and select the elements that we need. This is where BeautifulSoup comes into the picture.

I hope this code from my detailed tutorial is more understandable now:

import requests
from bs4 import BeautifulSoup
response_text = requests.get(address, headers=config.header_values).text
soup = BeautifulSoup(response_text, 'lxml')

By the way, we can actually do the same things using lxml alone, but it is not as easy. BeautifulSoup always sits on top of an HTML parser, and its job is to make finding and selecting values easier.
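
To make the comparison concrete, here is a small sketch that pulls the same price out of a hard-coded HTML snippet twice, once with BeautifulSoup and once with lxml alone. The HTML, the `price` class name, and the function names are made up for illustration.

```python
from bs4 import BeautifulSoup
from lxml import html

# Made-up HTML standing in for a downloaded product page.
PAGE = """
<html><body>
  <div class="product">
    <h2>Sample Widget</h2>
    <span class="price">19.99</span>
  </div>
</body></html>
"""


def price_with_beautifulsoup(page):
    """BeautifulSoup on top of the lxml parser: search by tag name and class."""
    soup = BeautifulSoup(page, "lxml")
    return soup.find("span", class_="price").get_text(strip=True)


def price_with_lxml(page):
    """lxml alone: the same lookup written as an XPath expression."""
    tree = html.fromstring(page)
    return tree.xpath("//span[@class='price']/text()")[0].strip()
```

Both functions return the same string. The lxml version requires writing XPath, which is the part many people find harder.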

I hope things are more apparent now.

So let’s understand what we are trying to solve.

If you just need a utility for smaller tasks, like reading prices from one or a few web pages, you can stop here.

BeautifulSoup can handle most scenarios, and if you add Selenium, you can handle the remaining ones.

BeautifulSoup + Requests is a Utility for simpler tasks.

Scrapy

In the job world, the problems that need to be solved by Web Scraping are much more extensive and complex.

For example, “Get all product prices from these ten sites” [Competitor Price Monitoring]

“Get contact details of all hiring managers from LinkedIn, along with their photos” [Sales Prospecting]

“Go to https://www.webmd.com/a-to-z-guides/qa, select a topic, go through all the questions, and get the answer (three-level deep links)” [REAL job posted on a freelance site]

Scrapy is a very powerful tool with a lot of functionality built in.

It can send a lot of requests in parallel, and works asynchronously (it doesn’t have to wait for one request processing to complete before moving on to the next), making it really fast.

Also, tasks like saving data to CSV or JSON are built in.
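
For instance, Scrapy's `FEEDS` setting can write the scraped items to files without any extra code. A minimal sketch for a project's `settings.py`, with made-up file names:

```python
# settings.py (file names are placeholders)
FEEDS = {
    "prices.csv": {"format": "csv"},
    "prices.json": {"format": "json", "encoding": "utf8"},
}
```

The same can be requested from the command line, for example `scrapy crawl <spider_name> -o prices.csv`.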

Scraped data can be cleaned up, updated, and processed with ease.
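
In Scrapy this kind of cleanup usually lives in an item pipeline, a class with a `process_item` method that receives every scraped item. The sketch below assumes items are plain dicts with a hypothetical `price` field holding text such as "$1,299.00"; the class name is made up as well.

```python
class PriceCleanupPipeline:
    """Turn a scraped price string into a float before the item is stored."""

    def process_item(self, item, spider):
        raw = str(item.get("price", ""))
        # Keep digits and the decimal point; drop currency symbols and commas.
        digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
        try:
            item["price"] = float(digits)
        except ValueError:
            item["price"] = None  # nothing numeric was found
        return item
```

It would be switched on through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PriceCleanupPipeline": 300}`, where `myproject` is a placeholder project name.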

Feature	Scrapy	BeautifulSoup	Selenium
Speed	Fastest (built on top of the Twisted engine)	Slower	Slower
Asynchronous requests	Yes	No	No
Cache	Built-in	No	No
HTTP error handling	Built-in	No	No
Retrying failed requests	Built-in	No	No
Plugin system	Yes	No	No
Middlewares	Built-in (e.g., modifying request headers)	No	No
Flexibility	More flexible and robust architecture	Simple	Simple
Browser emulation	No	No	Yes
JavaScript execution	No	No	Yes

A comparison of Scrapy, BeautifulSoup, and Selenium
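
Several of the built-in Scrapy features in the table are switched on or tuned through project settings. A sketch of a `settings.py` fragment; the values are arbitrary examples, not recommendations:

```python
# settings.py (example values only)
CONCURRENT_REQUESTS = 16          # how many requests may be in flight at once
HTTPCACHE_ENABLED = True          # cache responses on disk between runs
RETRY_ENABLED = True              # retry failed requests automatically
RETRY_TIMES = 2                   # extra attempts after the first failure
DEFAULT_REQUEST_HEADERS = {       # headers added to every request
    "Accept-Language": "en",
}
```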

See the official Scrapy documentation for a few more comparison examples.

Ready to learn more about Scrapy? Head over to this page to learn more.

codeRECODE

© 2024 codeRECODE. All Rights Reserved.
