This post will show you a few ways to read CSV and Excel files and use the keywords in them as seeds for your Scrapy spiders.
In this post, we are going to have a look at various ways of reading CSV and Excel files into Scrapy.
Introduction
Reading data from CSV and Excel files is useful whenever you need to load a file on the fly and do something with the data in it.
One scenario is a list of first and last names in an Excel file that you need to search one by one on a page, saving the results.
In the field of automated testing, these files with data are often called data pools.
A Hypothetical Job Post
Here is a hypothetical task that we would be working with:
- Go to stackoverflow.com
- Search for every tag listed in the second column of the Excel/CSV file
- Get the number of questions tagged with that keyword
- Export the output to a JSON file
I have created one CSV and one Excel file that you can download from the GitHub repository.
A quick analysis of StackOverflow will reveal that the URL for any tag search follows this pattern:
https://stackoverflow.com/questions/tagged/java
https://stackoverflow.com/questions/tagged/javascript
In the Scrapy spider, we can store “https://stackoverflow.com/questions/tagged/” in a variable and append the tag name to reach the page that we need.
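As a quick sketch of that idea (the tag value here is just a placeholder; in the spider it will come from the file):
base_url = 'https://stackoverflow.com/questions/tagged/'
tag = 'python'  # placeholder tag for illustration
print(base_url + tag)  # https://stackoverflow.com/questions/tagged/python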
Great! Let’s move on to the second part, reading a CSV file.
Reading CSV files
The two most commonly used ways of reading CSV files are the built-in csv module, which doesn’t need any separate installation, and the Pandas library.
Reading CSV with the csv module
First things first, let’s import the csv module:
import csv
Now let’s open the CSV file in read mode. Optionally, you can specify the encoding.
with open('so_tags.csv', 'r') as csv_file:
The next step is to get the reader object. It needs the file handle, csv_file, that we just created:
reader = csv.reader(csv_file)
Now we can simply loop over the reader to get each row:
for row in reader:
    print(row)
You will see that each row is a list containing all the cells in that row. Let’s make it even better.
The first thing we want is to skip the first row, which contains only the header names. For this, we can use the enumerate function. This will allow us to get the row along with a counter.
for i, row in enumerate(reader):
    if i == 0:
        continue  # skip header row
    print(row)
Next, we need to get only a specific column, not all columns. Remember that the index begins with 0, which means column A. In our case, we need column B, which would be number 1.
print(row[1])
That’s all we need to do. Let’s put everything together and wrap it in a function:
def get_list_from_csv(filename, column_index, ignore_header=True):
    tags = []
    with open(filename, 'r') as csv_file:
        reader = csv.reader(csv_file)
        for i, row in enumerate(reader):
            if i == 0 and ignore_header:
                continue  # skip header row
            tags.append(row[column_index])
    return tags
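As a quick usage sketch, assuming so_tags.csv keeps the tags in its second column (index 1):
tags = get_list_from_csv('so_tags.csv', 1)
print(tags)  # list of tag strings read from column B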
You can see that I improved it a little and made it more reusable.
Again, you can download everything from https://github.com/coderecode-com/read_csv_excel_into_scrapy
Reading CSV using Pandas
Pandas is a very powerful library for working with large amounts of data. If you are using Anaconda, you should already have it installed. If not, you can install it using pip on Windows or pip3 on Linux/macOS:
pip install pandas
Once you have pandas, import it:
import pandas as pd
Once imported, call the read_csv() method and supply the file name. Store the data frame returned by this function in a variable:
df = pd.read_csv('so_tags.csv')
If you don’t know what a data frame is, you don’t need to at this point. The only thing that matters is that you can select any column by specifying its name. For example, we can extract the Tags column like this:
df['Tags']
To get the values of this column as a list, we access the .values attribute and then call the tolist() method:
print(df['Tags'].values.tolist())
Note: It’s easier to get the columns by name. If you want to get them by index, you would need to use .iloc:
df.iloc[:,1]
Effectively, all you needed to do was write two lines:
df = pd.read_csv('so_tags.csv')
print(df['Tags'].values.tolist())
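If you want the Pandas version in the same reusable shape as the csv-module function, a minimal sketch could look like this (this helper is not part of the original repo, and it relies on the import pandas as pd from above):
def get_list_from_csv_pandas(filename, column_name='Tags'):
    # Read the CSV into a data frame and return one column as a plain list
    df = pd.read_csv(filename)
    return df[column_name].values.tolist()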
Easier? Definitely. If you are in a position where you can use Pandas, go ahead and use it.
But now you know how to do it without using Pandas too!
Let’s have a very quick look at Excel now.
Reading Excel
Again, there are two ways to read Excel files. Let’s see the fast way first.
Reading Excel using Pandas
I could just copy-paste the two lines from the previous section, change one function and it will work. Have a look at this:
df = pd.read_excel('so_tags.xlsx')
print(df['Tags'].values.tolist())
The only change is that instead of read_csv, the function to load Excel files is read_excel.
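Note: depending on your pandas version, read_excel may need the openpyxl package installed to handle .xlsx files, so install it if you get an engine-related error.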
Reading Excel with OpenPyXl
This is the longer way, but openpyxl is a very useful library! It can both read and write XLSX files.
You can install it using:
pip install openpyxl
For now, I am just sharing a simple snippet and leaving the interpretation to my readers:
def get_list_from_excel(filename, column_index, ignore_header=True):
    from openpyxl import load_workbook
    tags = []
    wb = load_workbook(filename=filename)
    sheet = wb.worksheets[0]  # Loading first sheet
    for row_num, row in enumerate(sheet.iter_rows()):
        if row_num == 0 and ignore_header:
            continue  # skipping header
        # Each row is a collection of cell objects,
        # so a nested loop is needed
        for col_num, cell in enumerate(row):
            if col_num == column_index:
                tags.append(cell.value)
    return tags
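Its usage mirrors the CSV function, again assuming the tags sit in the second column (index 1) of so_tags.xlsx:
tags = get_list_from_excel('so_tags.xlsx', 1)
print(tags)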
Using it in the Scrapy spider
When you write your Scrapy spider, delete start_urls. Instead, use the start_requests method. In this method, you can simply take the base URL and append each tag in a loop.
Here is the complete code of the spider:
import scrapy
from read_files import read_csv, read_excel

base_url = 'https://stackoverflow.com/questions/tagged/{}'

class SoSpider(scrapy.Spider):
    name = 'so'

    def start_requests(self):
        for tag in read_excel():
            yield scrapy.Request(base_url.format(tag))

    def parse(self, response):
        questions = response.xpath('normalize-space(//*[@id="mainbar"]/div[4]/div/div[1]/text())').get()
        questions = questions.strip('questions')
        yield {
            'questions': questions,
            'url': response.url
        }
The complete code and example CSV and Excel files can be downloaded from the GitHub repo.
To run the code, simply clone the repository and run the spider.
git clone https://github.com/coderecode-com/read_csv_excel_into_scrapy.git
cd read_csv_excel_into_scrapy
scrapy runspider so.py
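To satisfy the last requirement of the task, exporting the output to JSON, you can use Scrapy's feed export flag (the output file name here is arbitrary):
scrapy runspider so.py -o so_tags.json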