BeautifulSoup, Selenium, Scrapy: What’s the Difference?

So let’s understand Selenium first.

Selenium

Selenium was created as a UI Testing Tool, and it is still used by Testers in software companies. Let’s see a typical scenario.

Let’s assume a Tester is working for the IT department of the Malaysian Government. This tester is testing the Registration Page. So this is how Selenium is used:

The Testers write a script in Python (or another language supported by Selenium). The script will be something like:

  • Launch Chrome
  • Open a URL – https://www.malaysia.gov.my/portal/register
  • Click Citizenship status and click on Permanent Citizen.
  • Verify that the Permanent Citizen value is selected [This is known as POSITIVE TESTING]
  • Click Citizenship Status and click on Indian Citizen.
  • Verify that Indian Citizen was not found [This is known as NEGATIVE TESTING]
  • Do this for all fields: Enter valid values and verify no errors. Enter invalid values and verify expected errors
  • Submit the page
  • If there is unexpected behavior, take a screenshot and send to developers for a fix
  • Close the browser
  • Repeat the same steps in Firefox

So I hope you get the idea of how Selenium was originally intended to be used.

Note that in the above case, I DID NOT talk about HTML, JavaScript, etc., but only the Browser.

So when we use Selenium and Python, we see what Browser sees. There is no complication of HTML, JavaScript, etc. In fact, you can call
driver.save_screenshot("screenshot.png")
to save a screenshot any time in your script.

BeautifulSoup

Now, because we have the Browser as a mediator, using Selenium makes things slow and memory hungry. So we use Selenium for web scraping only when other approaches don’t work.

In most cases, we actually don’t need to wait for HTML to be “painted” on the browser window. For example, we don’t need to see how price looks, what fonts are applied, what is the color, etc. We just need the price, and it is there in HTML.

So the browser is replaced with the requests library, which works with raw HTTP Requests and HTTP Responses.

This HTTP Response now needs to be parsed so that valid HTML can be extracted. This is done by a parser. In our case, we use the lxml library.

This HTML now needs to be made “pretty” so that we, as developers, can navigate it and select the elements that we need. This is where BeautifulSoup comes into the picture.

I hope this code from my detailed tutorial is more understandable now:

response_text = requests.get(address, headers=config.header_values).text
soup = BeautifulSoup(response_text, 'lxml')

By the way, we can actually do the same things using lxml alone, but it is not easy. BeautifulSoup always sits above an HTML parser, and its job is to make finding and selecting values easier.
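To see the difference, here is the same price extracted both ways from a small hand-written HTML snippet (the tag and class names are made up for illustration). The XPath route works, but BeautifulSoup’s CSS-style selectors are friendlier:

```python
from bs4 import BeautifulSoup
from lxml import html

snippet = '<div class="product"><span class="price">RM 49.90</span></div>'

# lxml alone: XPath gets the job done, but is less forgiving
tree = html.fromstring(snippet)
price_lxml = tree.xpath('//span[@class="price"]/text()')[0]

# BeautifulSoup on top of the same parser: simpler selection
soup = BeautifulSoup(snippet, 'lxml')
price_soup = soup.select_one('span.price').get_text()

print(price_lxml)  # RM 49.90
print(price_soup)  # RM 49.90
```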

I hope things are clearer now.

So let’s understand what we are trying to solve.

If you just need a utility for smaller tasks, like reading prices from one or a few web pages, you can stop here.

BeautifulSoup can handle most scenarios, and if you use Selenium, you can handle all remaining scenarios.

BeautifulSoup + Requests is a Utility for simpler tasks.

Scrapy

In the job world, the problems that need to be solved by Web Scraping are much bigger and complex.

For example, “Get all product prices from these 10 sites” [Competitor Price Monitoring]

“Get contact details of all Hiring Managers from LinkedIn, along with their photos” [Sales Prospecting]

“Go to https://www.webmd.com/a-to-z-guides/qa, select a topic, go through all the questions, and get the answer (three-level deep links)” [REAL job posted on a freelance site]

Scrapy is a very powerful tool with a lot of functionality inbuilt.

It can send a lot of requests in parallel and works asynchronously (it doesn’t have to wait for one request to complete before moving on to the next), making it really fast.

Also, common tasks like saving data to CSV or JSON are built in.

Scraped data can be cleaned up, updated, and processed with ease.

See this link to the Scrapy Official Documentation for a few more comparison examples.

Ready to learn more about Scrapy? Head over to this page to learn more.
