
Scrapy and JSON Data: A Simple Spider


How easy is it to get JSON data with Scrapy?

The answer: very easy, even with only a basic knowledge of Scrapy.

Background

The most common question I get asked is: which is the best tool for getting data from web pages?

It is difficult to give a one-size-fits-all answer, as use cases differ widely. I wrote about this earlier, explaining these differences.

Using Scrapy

Scrapy is perceived to be difficult simply because it can do so many things.

It is actually very easy to get started if you follow the right approach.

Getting Dynamic Data

Let’s look at one example problem: getting live stock data from the NSE India website, where the table on the page is populated dynamically.

Let’s try to solve this problem in the easiest way possible.

Understanding How the Web Page Works

Open this web page in Chrome and open Developer Tools. Go to the Network tab and filter by XHR.

[Screenshot: dynamic page with JSON. The data is populated from a JSON file.]

After examining the requests, we can see that the data is actually being loaded from a JSON file:

https://www.nseindia.com/live_market/dynaContent/live_watch/stock_watch/niftyStockWatch.json
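
If you want to confirm what this endpoint returns before writing any code, you can fetch it directly, for example with curl. (A quick sanity check; some sites, NSE included, may expect browser-like headers, so the raw response can vary.)

curl 'https://www.nseindia.com/live_market/dynaContent/live_watch/stock_watch/niftyStockWatch.json'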

This makes our work very easy. We don’t need to worry about selecting elements with CSS or XPath selectors, or about other, more complex techniques.

Ready-to-Use Templates in Scrapy

Let’s create our scrapy spider.

First, install Scrapy (ideally in a virtual environment):

pip install scrapy
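
If you want to set up the virtual environment first, a typical sequence looks like this (the environment name venv is just a convention):

python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install scrapy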

Now, create a simple spider with the default template. You can run the following command to see the list of available templates:

scrapy genspider -l

The output of this command is like this:

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Now we can either use the -t basic switch to specify the basic template, or skip the -t switch entirely. The default template is basic, so this is not a problem.

scrapy genspider live nseindia.com

This will create a live.py file with the skeleton of a Scrapy spider.

import scrapy

class LiveSpider(scrapy.Spider):
    name = 'live'
    allowed_domains = ['nseindia.com']
    start_urls = ['http://nseindia.com/']

    def parse(self, response):
        pass

We know that the request will return a JSON response. We can use Python’s json module to parse it into a regular Python dictionary.
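
As a quick illustration, here is what json.loads does with a payload shaped like the one this endpoint returns. The values below are made up; only the field names match what our spider will use:

import json

# A tiny, made-up sample with the same structure: a 'data' key holding a list of records
sample = '{"data": [{"symbol": "ABC", "open": "100.0", "high": "105.0", "low": "99.0"}]}'
parsed = json.loads(sample)  # returns a regular Python dict
print(parsed['data'][0]['symbol'])  # -> ABC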

Scraping the JSON Data

import scrapy
import json


class LiveSpider(scrapy.Spider):
    name = 'live'
    start_urls = ['https://www.nseindia.com/live_market/dynaContent/live_watch/stock_watch/niftyStockWatch.json']

    def parse(self, response):
        # Parse the JSON body into a Python dictionary
        json_response = json.loads(response.text)
        # The stock records live under the 'data' key
        listings = json_response['data']
        for listing in listings:
            # Yield one item per stock with the fields we care about
            yield {
                "symbol": listing['symbol'],
                "open": listing['open'],
                "high": listing['high'],
                "low": listing['low'],
            }
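
As a side note, on recent Scrapy versions (2.2 and later) you can drop the json import and use the built-in response.json() helper instead. The parse method above could be replaced with:

    def parse(self, response):
        # response.json() parses the JSON body for us (Scrapy 2.2+)
        for listing in response.json()['data']:
            yield {
                "symbol": listing['symbol'],
                "open": listing['open'],
                "high": listing['high'],
                "low": listing['low'],
            }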

Finally, we can run the spider with the -o switch to save the output to a CSV file.

scrapy runspider live.py -o stocks.csv
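
The -o switch infers the feed format from the file extension, so the same spider can export JSON just as easily:

scrapy runspider live.py -o stocks.json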

Easy, isn’t it?
