So let’s understand Selenium first.
Selenium was created as a UI testing tool, and testers at software companies still use it that way. Let’s see a typical scenario.
Let’s assume a Tester is working for the IT department of the Malaysian Government. This tester is testing the Registration Page. So this is how Selenium is used:
The tester writes a script in Python (or another language supported by Selenium). The script will do something like:
- Launch Chrome
- Open a URL – https://www.malaysia.gov.my/portal/register
- Click Citizenship status and click on Permanent Citizen.
- Verify that the Permanent Citizen value is selected [This is known as POSITIVE TESTING]
- Click Citizenship Status and click on Indian Citizen.
- Verify that Indian Citizen was not found [This is known as NEGATIVE TESTING]
- Do this for all fields: Enter valid values and verify no errors. Enter invalid values and verify expected errors
- Submit the page
- If there is unexpected behavior, take a screenshot and send to developers for a fix
- Close the browser
- Repeat the same steps in Firefox
So I hope you get the idea of how Selenium was originally intended to be used.
to save a screenshot any time in your script.
Now, because the browser sits in the middle as a mediator, using Selenium makes things slow and memory hungry. So we use Selenium for web scraping only when other approaches don’t work.
In most cases, we don’t actually need to wait for the HTML to be “painted” in the browser window. For example, we don’t need to see how the price looks, which fonts are applied, what the color is, etc. We just need the price, and it is there in the HTML.
So the browser is replaced with the requests library. This library works with raw HTTP requests and HTTP responses. This is the role of
This HTTP response now needs to be parsed so that valid HTML can be extracted. This is done by a parser. In our case, we used
This HTML now needs to be made “pretty” so that we as developers can navigate it and select the elements we need. This is where BeautifulSoup comes into the picture.
I hope this code is more understandable now:
response_text = requests.get(address, headers=config.header_values).text
soup = BeautifulSoup(response_text, 'lxml')
By the way, we can actually do the same thing using lxml alone, but it is not as easy. BeautifulSoup always sits on top of an HTML parser, and its job is to make finding and selecting values easier.
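To make the comparison concrete, here is a small sketch that pulls the same price out of the same HTML twice, once through BeautifulSoup and once with lxml alone. The HTML fragment, class names, and function names are invented for the example, and it assumes both `beautifulsoup4` and `lxml` are installed.

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Hypothetical fragment standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="product">
    <h2 class="title">Blue Widget</h2>
    <span class="price">$19.99</span>
  </div>
</body></html>
"""


def price_with_bs4(page_source):
    """BeautifulSoup on top of the lxml parser, using a CSS selector."""
    soup = BeautifulSoup(page_source, "lxml")
    tag = soup.select_one("div.product span.price")
    return tag.get_text(strip=True) if tag else None


def price_with_lxml(page_source):
    """lxml alone, using an XPath expression."""
    tree = lxml_html.fromstring(page_source)
    matches = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
    return matches[0].strip() if matches else None


print(price_with_bs4(SAMPLE_HTML))
print(price_with_lxml(SAMPLE_HTML))
```

Both functions return the same text. The difference is mostly in how the query is written: a short CSS selector in one case, an XPath expression in the other.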
I hope things are clearer now.
So let’s understand what we are trying to solve.
If you just need a utility for the smaller tasks, like reading prices from one or a few web pages, we can stop here.
BeautifulSoup can handle most scenarios, and if you add Selenium, you can handle the remaining scenarios.
Summary: It is a Utility for comparatively simple tasks.
In the job world, the problems that need to be solved by web scraping are much bigger and more complex.
For example, “Get all product prices from these 10 sites” [Competitor Price Monitoring]
“Get contact details of all hiring managers from LinkedIn, along with their photos” [Sales Prospecting]
“Go to https://www.webmd.com/a-to-z-guides/qa, select a topic, go through all the questions, and get the answer (three-level deep links)” [REAL job posted on a freelance site]
Scrapy is a very powerful tool with a lot of functionality inbuilt.
It can send a lot of requests in parallel and works asynchronously (it doesn’t have to wait for one request to finish processing before moving on to the next), which makes it really fast.
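To get a feel for why not waiting matters, here is a tiny illustration using Python’s standard `asyncio` module. This is not how Scrapy is implemented (Scrapy is built on Twisted), and `fake_fetch` only sleeps instead of making a real request; the URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all(urls):
    # gather() schedules every coroutine and waits for all of them together,
    # so the waits overlap instead of adding up.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_run(urls):
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    return results, time.perf_counter() - start


URLS = [f"https://example.com/page/{n}" for n in range(5)]
results, elapsed = timed_run(URLS)
print(f"{len(results)} responses in {elapsed:.2f}s")
```

Five sequential waits of 0.1 seconds would take about half a second. Here the waits overlap, so the total stays close to a single wait.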
Also, tasks like saving data to CSV or JSON are built in.
Scraped data can be cleaned up, updated, and processed with ease.
See this link in the official Scrapy documentation for a few more comparison examples.
Ready to learn more about Scrapy? Head over to this page to learn more.