24 September 2017

Scrapy

Build and run your web spiders
Terminal
 pip install shub
 shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
 shub deploy

# Schedule the spider for execution
 shub schedule blogspider 
Spider blogspider scheduled, watch it running here:
https://app.scrapinghub.com/p/26731/job/1/8

# Retrieve the scraped data
 shub items 26731/1/8
{"title": "Improved Frontera: Web Crawling at Scale with Python 3 Support"}
{"title": "How to Crawl the Web Politely with Scrapy"}
...
Deploy them to Scrapy Cloud, or use Scrapyd to host the spiders on your own server
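
If you'd rather host the spiders yourself, a rough sketch of the Scrapyd workflow looks like this (the project name myproject and spider name blogspider are placeholders, and your project's scrapy.cfg needs a [deploy] target pointing at the Scrapyd server):

# Install and start the Scrapyd server
 pip install scrapyd scrapyd-client
 scrapyd

# Deploy the current Scrapy project (run from the project directory;
# scrapyd-deploy reads the [deploy] target from scrapy.cfg)
 scrapyd-deploy

# Schedule a spider through Scrapyd's JSON API
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=blogspider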

Fast and powerful

Write the rules to extract the data and let Scrapy do the rest.

Easily extensible

Extensible by design: plug in new functionality easily without having to touch the core.

Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
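For instance, a minimal sketch of a spider consuming a JSON API rather than HTML could look like the snippet below; the endpoint URL and the items/next/id/name fields are made up purely for illustration:

import json

import scrapy


class APISpider(scrapy.Spider):
    name = "api_example"
    # Hypothetical JSON endpoint, used only for illustration
    start_urls = ['http://example.com/api/items?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        # Yield one item per record returned by the API
        for record in data.get('items', []):
            yield {'id': record.get('id'), 'name': record.get('name')}
        # Follow the API's pagination link, if any
        next_url = data.get('next')
        if next_url:
            yield response.follow(next_url, self.parse)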

Walk-through of an example spider

To show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Crawling starts from the quotes tagged "humor"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the pagination link, reusing this method as the callback
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:
scrapy runspider quotes_spider.py -o quotes.json
When this finishes, the quotes.json file will contain a list of the quotes in JSON format, with text and author, looking like this (reformatted here for better readability):
[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
    "author": "Groucho Marx",
    "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
    "author": "Steve Martin",
    "text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as the callback.
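For reference, the response.follow call at the end of parse is essentially a shortcut for building that request yourself; an equivalent sketch:

        # Equivalent to: yield response.follow(next_page, self.parse)
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)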
Here you can see one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
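As a rough sketch, these politeness knobs are just settings (in settings.py or via the spider's custom_settings); the values below are arbitrary examples, not recommendations:

# Politeness-related settings (example values only)
DOWNLOAD_DELAY = 1.0                # wait about a second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
CONCURRENT_REQUESTS_PER_IP = 0      # 0 keeps the per-domain limit; non-zero switches to per-IP
AUTOTHROTTLE_ENABLED = True         # let the AutoThrottle extension adjust delays automatically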
Note
This example uses feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
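For example, running the same spider with -o quotes.csv writes CSV instead of JSON. And a minimal item pipeline for database storage could look roughly like the sketch below, which stores the dict items from the example spider in a local SQLite file (enable it through the ITEM_PIPELINES setting):

import sqlite3


class SQLiteQuotesPipeline:

    def open_spider(self, spider):
        # Open (or create) a local SQLite database when the spider starts
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.conn.execute(
            'INSERT INTO quotes (text, author) VALUES (?, ?)',
            (item.get('text'), item.get('author')))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()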

What else?

You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:
  • Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
  • An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders (see the shell sketch after this list).
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in extensions and middlewares for handling:
    • cookies and session handling
    • HTTP features like compression, authentication, caching
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
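
For instance, the interactive shell mentioned in the list above lets you try selectors against a live page before committing them to a spider:

 scrapy shell 'http://quotes.toscrape.com/tag/humor/'
>>> response.css('div.quote span.text::text').extract_first()   # text of the first quote
>>> response.css('li.next a::attr(href)').extract_first()       # link to the next page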

What’s next?

The next steps for you are to install Scrapy, follow through the tutorial to learn how to create a full-blown Scrapy project, and join the community. Thanks for your interest!