Scrapy vs BeautifulSoup vs Selenium – 3 Powerful Tools

Scrapy vs. BeautifulSoup vs. Selenium – this question comes up again and again. In this blog, we will look at the differences between the three. Let’s talk about Selenium first.

Selenium

Selenium was created as a UI testing tool, and it is still used by testers in software companies. Let’s look at a typical scenario.

Let’s assume a tester working for the IT department of the Malaysian government is testing the registration page. This is how Selenium is used:

The tester writes a script in Python (or another language supported by Selenium). The script will do something like this:

  • Launch Chrome
  • Open a URL – https://www.malaysia.gov.my/portal/register
  • Click Citizenship Status and select Permanent Citizen
  • Verify that the Permanent Citizen value is selected [This is known as POSITIVE TESTING]
  • Click Citizenship Status and try to select Indian Citizen
  • Verify that Indian Citizen is not found [This is known as NEGATIVE TESTING]
  • Do this for all fields: enter valid values and verify there are no errors; enter invalid values and verify the expected errors appear
  • Submit the page
  • If there is unexpected behavior, take a screenshot and send it to the developers for a fix
  • Close the browser
  • Repeat the same steps in Firefox

So I hope you understand how Selenium was originally intended to be used.

Note that in the above case, I DID NOT talk about HTML, JavaScript, etc., but only about the browser.

So when we use Selenium with Python, we see what the browser sees. There is no complication of HTML, JavaScript, etc. In fact, you can call
driver.save_screenshot("screenshot.png")
at any point in your script to save a screenshot.
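Here is a rough sketch of such a test script using Selenium 4. The field locator ("citizenship_status") is a placeholder I made up for illustration; the real page will have its own element names:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # launch Chrome
driver.get("https://www.malaysia.gov.my/portal/register")  # open the URL

# "citizenship_status" is a placeholder locator, not the page's real field name
status = Select(driver.find_element(By.NAME, "citizenship_status"))
status.select_by_visible_text("Permanent Citizen")

# Positive test: the value we picked should now be selected
assert status.first_selected_option.text == "Permanent Citizen"

driver.save_screenshot("screenshot.png")  # capture evidence for the developers
driver.quit()  # close the browser

To repeat the run in Firefox, you would just swap webdriver.Chrome() for webdriver.Firefox().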

BeautifulSoup

Because the browser sits in the middle as a mediator, using Selenium makes things slow and memory-hungry. So we use Selenium for web scraping only when other approaches don’t work.

In most cases, we actually don’t need to wait for HTML to be “painted” on the browser window. For example, we don’t need to see how the price looks, what fonts are applied, what the color is, etc. We just need the price, and it is there in HTML.

So the browser is replaced with the requests library, which works directly with raw HTTP requests and HTTP responses.
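As a quick sketch of what that looks like (example.com stands in for a real target site):

import requests

# One raw HTTP GET request; no browser, no rendering, no JavaScript
response = requests.get("https://example.com")

print(response.status_code)  # e.g., 200
print(response.text[:200])   # the raw HTML, exactly as the server sent it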

This HTTP response now needs to be parsed so that valid HTML can be extracted. This is done by a parser; in our case, the lxml library.

This HTML now needs to be made easy to navigate so that we, as developers, can select the elements we need. This is where BeautifulSoup comes into the picture.

I hope this code from my detailed tutorial is more understandable now:

# Fetch the raw HTML, then hand it to BeautifulSoup with the lxml parser
response_text = requests.get(address, headers=config.header_values).text
soup = BeautifulSoup(response_text, 'lxml')

By the way, we can actually do the same thing using lxml alone, but it is not easy. BeautifulSoup always sits on top of an HTML parser, and its job is to make finding and selecting values easier.
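As a small illustration, assume we scraped this made-up product snippet; the HTML and the span.price selector exist only for this example:

from bs4 import BeautifulSoup

html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, 'lxml')

# select_one() finds the first element matching a CSS selector
price = soup.select_one('span.price').text
print(price)  # $19.99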

I hope things are more apparent now.

So let’s understand what we are trying to solve.

If you just need a utility for smaller tasks, like reading prices from one or a few web pages, we can stop here.

BeautifulSoup can handle most scenarios, and Selenium covers the remaining ones.

BeautifulSoup + requests is a utility for simpler tasks.

Scrapy

In the professional world, the problems that web scraping needs to solve are much larger and more complex.

For example, “Get all product prices from these ten sites” [Competitor Price Monitoring]

“Get contact details of all hiring managers from LinkedIn, along with their photos” [Sales Prospecting]

“Go to https://www.webmd.com/a-to-z-guides/qa, select a topic, go through all the questions, and get the answer (three-level deep links)” [REAL job posted on a freelance site]

Scrapy is a very powerful tool with a lot of functionality built in.

It can send many requests in parallel and works asynchronously (it doesn’t have to wait for one request to complete before moving on to the next), making it really fast.

Also, tasks like saving data to CSV or JSON are built in, via feed exports.

Scraped data can be cleaned up, updated, and processed with ease.
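To make this concrete, here is a minimal spider sketch against quotes.toscrape.com, a practice site built for scraping exercises; the CSS selectors match that site’s markup:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy fetches pages asynchronously and calls parse() for each response
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link; Scrapy schedules it alongside other requests
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running scrapy runspider quotes_spider.py -o quotes.csv writes the results to CSV with no extra code, thanks to the built-in feed exports.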

Feature                  | Scrapy                                      | BeautifulSoup | Selenium
-------------------------|---------------------------------------------|---------------|---------
Speed                    | Fastest (built on top of the Twisted engine)| Slower        | Slower
Asynchronous requests    | Yes                                         | No            | No
Cache                    | Built-in                                    | No            | No
HTTP error handling      | Built-in                                    | No            | No
Retrying failed requests | Built-in                                    | No            | No
Plugin system            | Yes                                         | No            | No
Middlewares              | Built-in (e.g., modifying request headers)  | No            | No
Flexibility              | More flexible and robust architecture       | Simple        | Simple
Browser emulation        | No                                          | No            | Yes
JavaScript execution     | No                                          | No            | Yes

A comparison of Scrapy, BeautifulSoup, and Selenium

See the Scrapy official documentation for a few more comparison examples.

Ready to learn more about Scrapy? Head over to this page.

