Scrapy vs. BeautifulSoup vs. Selenium: this question keeps coming up again and again. In this blog, we will try to understand the differences. Let’s talk about Selenium first.
Selenium
Selenium was created as a UI testing tool, and testers in software companies still use it that way. Let’s see a typical scenario.
Let’s assume a Tester is working for the IT department of the Malaysian Government. This tester is testing the Registration Page. So this is how Selenium is used:
The Testers write a script in Python (or another language supported by Selenium). The script will be something like this:
- Launch Chrome
- Open a URL – https://www.malaysia.gov.my/portal/register
- Click Citizenship Status and click on Permanent Citizen.
- Verify that the Permanent Citizen value is selected [This is known as POSITIVE TESTING]
- Click Citizenship Status and look for Indian Citizen.
- Verify that the Indian Citizen option is not found [This is known as NEGATIVE TESTING]
- Do this for all fields: Enter valid values and verify no errors. Enter invalid values and verify expected errors
- Submit the page
- If there is unexpected behavior, take a screenshot and send it to developers for a fix
- Close the browser
- Repeat the same steps in Firefox
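To make the steps above a little more concrete, here is a minimal sketch of the dropdown checks in Python. It is only an illustration under assumptions: the element ID `citizenship-status` is hypothetical, the dropdown is assumed to be a plain `<select>` with `<option>` children, and the function names are mine. The browser is only launched inside `run_in_browsers`, so the checking logic itself can be tried with any object that behaves like a driver.

```python
# Hypothetical locator for the dropdown; the real page's markup is not known here.
# ("id", ...) is the same value Selenium exposes as By.ID.
CITIZENSHIP_DROPDOWN = ("id", "citizenship-status")
REGISTER_URL = "https://www.malaysia.gov.my/portal/register"


def option_is_offered(driver, url, option_text):
    """Open the page and report whether the dropdown lists `option_text`."""
    driver.get(url)
    dropdown = driver.find_element(*CITIZENSHIP_DROPDOWN)
    # "tag name" is the same value Selenium exposes as By.TAG_NAME.
    labels = [opt.text.strip() for opt in dropdown.find_elements("tag name", "option")]
    return option_text in labels


def run_checks(driver, url=REGISTER_URL):
    """Run one positive and one negative check; take a screenshot on failure."""
    results = {
        # Positive test: a valid value must be available.
        "positive": option_is_offered(driver, url, "Permanent Citizen"),
        # Negative test: an invalid value must NOT be available.
        "negative": not option_is_offered(driver, url, "Indian Citizen"),
    }
    if not all(results.values()):
        driver.save_screenshot("unexpected_behaviour.png")
    return results


def run_in_browsers(url=REGISTER_URL):
    """Launch Chrome, then Firefox, and run the same checks in each."""
    from selenium import webdriver  # imported here so the checks above need no browser

    for name, make_driver in (("Chrome", webdriver.Chrome), ("Firefox", webdriver.Firefox)):
        driver = make_driver()
        try:
            print(name, run_checks(driver, url))
        finally:
            driver.quit()  # close the browser even if a check raised
```

Calling `run_in_browsers()` needs Selenium plus a working Chrome and Firefox setup on your machine.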
So I hope you understand how Selenium was originally intended to be used.
Note that in the above case, I DID NOT talk about HTML, JavaScript, etc., but only about the browser.
So when we use Selenium with Python, we see what the browser sees. There is no complication of HTML, JavaScript, etc. In fact, you can call
driver.save_screenshot("screenshot.png")
to save a screenshot at any time in your script.
BeautifulSoup
Because the browser sits in the middle as a mediator, Selenium is slow and memory hungry. So we use Selenium for web scraping only when other approaches don’t work.
In most cases, we actually don’t need to wait for HTML to be “painted” on the browser window. For example, we don’t need to see how the price looks, what fonts are applied, what the color is, etc. We just need the price, and it is there in HTML.
So the browser is replaced with the requests library, which works with raw HTTP requests and HTTP responses.
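As a small illustration of what “raw HTTP requests” means here, the sketch below prepares a GET request without sending it, so you can look at the method, URL, and headers that would go over the wire. The header values and function names are made up for this example; a real script would use its own.

```python
import requests

# Hypothetical header values; a real script would load its own.
HEADERS = {"User-Agent": "price-checker/0.1"}


def build_request(url):
    """Prepare (but do not send) the GET request that requests would issue."""
    return requests.Request("GET", url, headers=HEADERS).prepare()


def fetch_html(url):
    """Send the request and return the response body as text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text
```

No browser is involved at any point: `fetch_html` only returns text, which is why this route is so much lighter than Selenium.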
This HTTP response now needs to be parsed so that valid HTML can be extracted. This is done by a parser; in our case, we use the lxml library.
This HTML now needs to be made “pretty” so that we, as developers, can navigate and select the elements we need. This is where BeautifulSoup comes into the picture.
I hope this code from my detailed tutorial is more understandable now:
import requests
from bs4 import BeautifulSoup
response_text = requests.get(address, headers=config.header_values).text
soup = BeautifulSoup(response_text, 'lxml')
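Once the soup object exists, picking out a value is short. Here is a minimal sketch that reads a price from a piece of HTML text; the sample markup, the `price` class name, and the `extract_price` helper are all assumptions made for illustration, not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded product page.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Sample Product</h1>
  <span class="price">$19.99</span>
</body></html>
"""


def extract_price(html_text, parser="lxml"):
    """Return the text of the first <span class="price">, or None if absent."""
    soup = BeautifulSoup(html_text, parser)
    tag = soup.find("span", class_="price")
    return tag.get_text(strip=True) if tag else None
```

If lxml is not installed, passing `parser="html.parser"` uses the parser that ships with Python instead.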
By the way, we can actually do the same things using lxml alone, but it is not as easy. BeautifulSoup always sits on top of an HTML parser, and its job is to make finding and selecting values easier.
I hope things are more apparent now.
So let’s understand what we are trying to solve.
If you just need a utility for smaller tasks, like reading prices from one or a few web pages, you can stop here.
BeautifulSoup can handle most scenarios, and Selenium can handle the remaining ones.
BeautifulSoup + Requests is a utility for simpler tasks.
Scrapy
In the job world, the problems that need to be solved by Web Scraping are much more extensive and complex.
For example, “Get all product prices from these ten sites” [Competitor Price Monitoring]
“Get contact details of all hiring managers from LinkedIn, along with their photos” [Sales Prospecting]
“Go to https://www.webmd.com/a-to-z-guides/qa, select a topic, go through all the questions, and get the answer (three-level deep links)” [REAL job posted on a freelance site]
Scrapy is a very powerful tool with a lot of functionality built in.
It can send a lot of requests in parallel, and works asynchronously (it doesn’t have to wait for one request processing to complete before moving on to the next), making it really fast.
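To see why not waiting matters, here is a tiny sketch of the idea using only Python’s standard asyncio module. This is not how Scrapy is implemented (Scrapy is built on Twisted), and `fake_fetch` just sleeps instead of touching the network; it only shows that several slow “requests” started together finish in roughly the time of one.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls):
    """Start every request at once and collect the results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_crawl(urls):
    """Run all fake requests concurrently; return the pages and elapsed seconds."""
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    return pages, time.perf_counter() - start
```

With five URLs, fetching one after another would take about five times the delay, while the concurrent version takes about one.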
Also, tasks like saving data to CSV or JSON are built in.
Scraped data can be cleaned up, updated, and processed with ease.
Feature | Scrapy | BeautifulSoup | Selenium |
---|---|---|---|
Speed | Fastest (built on top of the Twisted engine) | Slower | Slower |
Asynchronous requests | Yes | No | No |
Cache | Built-in | No | No |
HTTP error handling | Built-in | No | No |
Retrying failed requests | Built-in | No | No |
Plugin system | Yes | No | No |
Middlewares | Built-in (e.g., modifying request headers) | No | No |
Flexibility | More flexible and robust architecture | Simple | Simple |
Browser emulation | No | No | Yes |
JavaScript execution | No | No | Yes |
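Several of the “Built-in” rows in the table are switched on or tuned through ordinary Scrapy settings. As a rough sketch, a project’s settings.py might contain something like this; the values are illustrative assumptions, not recommendations.

```python
# settings.py (illustrative values only)

# Asynchronous requests: how many requests Scrapy may have in flight at once.
CONCURRENT_REQUESTS = 16

# Cache: store responses on disk so repeated runs do not re-download pages.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600

# Retrying failed requests: retry a few times on common transient errors.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]

# Middlewares: these headers are added to every outgoing request.
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en",
}
```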
See the official Scrapy documentation for a few more comparison examples.
Ready to learn more about Scrapy? Head over to this page to learn more.