Read CSV, Excel in Scrapy – The BEST way!

This post will show you a few ways to read CSV and Excel files and use the keywords in them as seeds for your Scrapy spiders.

Introduction

Reading data from CSV and Excel files is useful when you need to load a file at runtime and do something with the data in it.

One scenario could be a list of first and last names in an Excel file that you need to search one by one on a page, saving the results.

In the field of automated testing, these files with data are often called data pools. 

A Hypothetical Job Post

Here is a hypothetical task that we would be working with:

  1. Go to stackoverflow.com 
  2. Search for all the tags from the second column of the Excel/CSV file
  3. Get the number of questions tagged with that keyword
  4. Export the output to a JSON file

I have created one CSV and one Excel file that you can download from the GitHub repository.

A quick analysis of Stack Overflow reveals that the URL for any tag search follows this pattern:

https://stackoverflow.com/questions/tagged/java
https://stackoverflow.com/questions/tagged/javascript

In the Scrapy spider, we can store “https://stackoverflow.com/questions/tagged/” in a variable and append the tag name to reach the page that we need.
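The pattern above can be sketched in plain Python; the tag list here is just an example standing in for the ones read from the file:

```python
# base pattern for Stack Overflow tag pages
base_url = 'https://stackoverflow.com/questions/tagged/{}'

# example tags standing in for the ones read from the file
tags = ['java', 'javascript']

urls = [base_url.format(tag) for tag in tags]
print(urls)
```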

Great! Let’s move on to the second part, reading a CSV file.

Reading CSV files 

The two most commonly used ways of reading CSV files are the built-in csv module, which doesn’t need any separate installation, and the Pandas library.

Reading CSV with the csv module

First things first, let’s import the csv module:

import csv

Now let’s open the CSV file in read mode. Optionally, you can specify the encoding.

with open('so_tags.csv', 'r') as csv_file:

The next step is to get the reader object. It needs the file handle, csv_file, that we just created:

reader = csv.reader(csv_file)

Now we can simply loop over the reader to get each row:

for row in reader:
    print(row)

You will see that each row is a list containing all the cells in that row. Let’s make it even better.

The first thing that we want is to skip the first row, which contains only the header names. For this, we can use the enumerate function, which gives us each row along with a counter.


for i, row in enumerate(reader):
    if i == 0:
        continue  # skip the header row
    print(row)

Next, we need only a specific column, not all of them. Remember that the index begins at 0, which is column A. In our case, we need column B, which is index 1.

print(row[1])

That’s all we need to do. Let’s put everything together and wrap it in a function:

def get_list_from_csv(filename, column_index, ignore_header=True):
    tags = []
    with open(filename, 'r') as csv_file:
        reader = csv.reader(csv_file)
        for i, row in enumerate(reader):
            if i == 0 and ignore_header:
                continue  # skip the header row
            tags.append(row[column_index])
    return tags

You can see that I improvised it a little more and made it more reusable.
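To sanity-check it, here is a quick, self-contained sketch that writes a tiny CSV mirroring the assumed layout of so_tags.csv (header row, tags in column B) and reads it back; the function body is included so the snippet runs on its own:

```python
import csv

def get_list_from_csv(filename, column_index, ignore_header=True):
    tags = []
    with open(filename, 'r') as csv_file:
        reader = csv.reader(csv_file)
        for i, row in enumerate(reader):
            if i == 0 and ignore_header:
                continue  # skip the header row
            tags.append(row[column_index])
    return tags

# write a tiny sample file mirroring the assumed two-column layout
with open('sample_tags.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['Id', 'Tags'], ['1', 'java'], ['2', 'javascript']])

print(get_list_from_csv('sample_tags.csv', 1))  # column B -> index 1
```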

Again, you can download everything from https://github.com/coderecode-com/read_csv_excel_into_scrapy

Reading CSV using Pandas

Pandas is a very powerful library for working with large amounts of data. If you are using Anaconda, you already have it installed. If not, you can install it using pip on Windows or pip3 on Linux/macOS:

pip install pandas

Once you have pandas, import it:

import pandas as pd

Once imported, call the read_csv() method and supply the file name. Store the data frame returned by this function in a variable:

df = pd.read_csv('so_tags.csv')

If you don’t know what a data frame is, you don’t need to at this point. The only thing that matters is that you can select any column by its name. For example, we can extract the Tags column like this:

df['Tags']

To get the values of this column as a list, access .values and then call the tolist() method:

print(df['Tags'].values.tolist())

Note: It’s easier to get columns by name. If you want to get them by index, use .iloc:

df.iloc[:,1]
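As a quick sketch, both selection styles return the same values; the frame here is built in memory to stand in for so_tags.csv, and the column names are assumptions:

```python
import pandas as pd

# in-memory stand-in for so_tags.csv; the column names are assumptions
df = pd.DataFrame({'Id': [1, 2], 'Tags': ['java', 'javascript']})

by_name = df['Tags'].values.tolist()   # select by column name
by_index = df.iloc[:, 1].tolist()      # select by column position
print(by_name, by_index)
```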

Effectively, all you needed to do were two lines:

df = pd.read_csv('so_tags.csv')
print(df['Tags'].values.tolist())

Easier? Definitely. If you are in a position where you can use Pandas, go ahead and use it. 

But now you know how to do it without using Pandas too!

Let’s have a very quick look at Excel now.

Reading Excel 

Again, there are two ways to read Excel files. Let’s see the fast way first.

Reading Excel using Pandas

I could just copy-paste the two lines from the previous section, change one function and it will work. Have a look at this:

df = pd.read_excel('so_tags.xlsx')
print(df['Tags'].values.tolist())

The only change is that instead of read_csv, the function to load Excel files is read_excel.

Reading Excel with OpenPyXl

This is the longer way, but openpyxl is a very useful library! It can both read and write XLSX files.

You can install it using 

pip install openpyxl

For now, I am just sharing a simple snippet and leaving the interpretation to my readers:

def get_list_from_excel(filename, column_index, ignore_header=True):
    from openpyxl import load_workbook
    tags = []
    wb = load_workbook(filename=filename)
    sheet = wb.worksheets[0]  # load the first sheet
    for row_num, row in enumerate(sheet.iter_rows()):
        if row_num == 0 and ignore_header:
            continue  # skip the header row
        # each row is a collection of cell objects,
        # so a nested loop is needed
        for col_num, cell in enumerate(row):
            if col_num == column_index:
                tags.append(cell.value)
    return tags
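Assuming openpyxl is installed, a quick round trip shows the idea: build a tiny workbook mirroring the assumed layout of so_tags.xlsx, then read column B back with iter_rows:

```python
from openpyxl import Workbook, load_workbook

# build a tiny workbook mirroring the assumed two-column layout
wb = Workbook()
ws = wb.active
ws.append(['Id', 'Tags'])        # header row
ws.append([1, 'java'])
ws.append([2, 'javascript'])
wb.save('sample_tags.xlsx')

# read column B (index 1) back, skipping the header row
sheet = load_workbook('sample_tags.xlsx').worksheets[0]
tags = [row[1].value for row in sheet.iter_rows(min_row=2)]
print(tags)
```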

Using it in a Scrapy Spider

When you write your Scrapy spider, delete start_urls.

Instead, use the start_requests method. In it, you can simply take the base URL and append each tag in a loop.

Here is the complete code of the spider:

import scrapy
from read_files import get_list_from_csv, get_list_from_excel

base_url = 'https://stackoverflow.com/questions/tagged/{}'

class SoSpider(scrapy.Spider):
    name = 'so'

    def start_requests(self):
        # column B (index 1) holds the tags
        for tag in get_list_from_excel('so_tags.xlsx', 1):
            yield scrapy.Request(base_url.format(tag))

    def parse(self, response):
        questions = response.xpath('normalize-space(//*[@id="mainbar"]/div[4]/div/div[1]/text())').get()
        # str.strip('questions') would strip characters, not the word;
        # remove the word and trim whitespace instead
        questions = questions.replace('questions', '').strip()
        yield {
            'questions': questions,
            'url': response.url
        }

The complete code and the example CSV and Excel files can be downloaded from the GitHub repo.

To run the code, simply clone the repo and run the spider:

git clone https://github.com/coderecode-com/read_csv_excel_into_scrapy.git
cd read_csv_excel_into_scrapy
scrapy runspider so.py
