
I'm quite new to web scraping, and in particular to Scrapy's spiders and pipelines. I'm getting a 202 status code from some of my spider requests' responses, meaning the page content is not available yet. How do I handle this status code properly, i.e. "wait for the page to fully load"? I've looked into both Scrapy's get_retry_request (from scrapy.downloadermiddlewares.retry) and the reactor's callLater with a lambda, to retry with a delay, but without success so far.

For example:

yield scrapy.downloadermiddlewares.retry.get_retry_request(
    request=response.request,
    spider=self,
    reason='202 Accepted - retrying after delay',
    max_retry_times=self.max_retries,
)

or something like:

reactor.callLater(
    self.retry_delay,
    lambda: self.crawler.engine.crawl(
        scrapy.Request(
            url=response.url,
            callback=self.parse,
            meta={'location_name': response.meta.get('location_name', ''),
                  'retries': retries + 1},
        )
    ),
)


Thanks in advance for any support!

1 Answer


Found a nice way using Scrapy's downloader middlewares!

Here is a snippet for such a middleware, handling this specific status code:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from twisted.internet import reactor, defer
import random

class Retry202Middleware(RetryMiddleware):
    DEFAULT_DELAY = 3  # seconds
    MAX_RETRIES = 3

    @defer.inlineCallbacks
    def process_response(self, request, response, spider):
        if response.status != 202:
            return response

        retries = request.meta.get('retry_times', 0)
        if retries >= self.MAX_RETRIES:
            spider.logger.warning(
                f'Gave up on {response.url} after {self.MAX_RETRIES} 202s')
            return response

        # Exponential backoff with a little random jitter.
        delay = self.DEFAULT_DELAY * (2 ** retries) + random.uniform(0, 2)

        spider.logger.info(
            f"202 for {response.url}, retry {retries + 1}/{self.MAX_RETRIES} in {delay:.1f}s..."
        )

        # Non-blocking delay: the inlineCallbacks-decorated method returns a
        # Deferred that Scrapy waits on, without blocking the reactor.
        yield self._deferred_sleep(delay)

        reason = response_status_message(response.status)
        # _retry (inherited from RetryMiddleware) also honours the RETRY_TIMES
        # setting, so make sure it is >= MAX_RETRIES, or it may return None.
        new_request = self._retry(request, reason, spider)
        if new_request:
            return new_request
        return response

    def _deferred_sleep(self, delay):
        # Build a Deferred that fires after `delay` seconds via the reactor.
        d = defer.Deferred()
        reactor.callLater(delay, d.callback, True)
        return d

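For reference, the backoff schedule above is exponential: with DEFAULT_DELAY = 3, the base delays (before the random 0-2 s jitter) are 3 s, 6 s and 12 s for retries 0 through 2. A quick standalone sketch of that computation, using the same constants as the middleware:

```python
import random

DEFAULT_DELAY = 3  # seconds, matching the middleware above
MAX_RETRIES = 3


def backoff_delay(retries, jitter=True):
    """Exponential backoff: base * 2**retries, plus up to 2 s of jitter."""
    delay = DEFAULT_DELAY * (2 ** retries)
    if jitter:
        delay += random.uniform(0, 2)
    return delay


print([backoff_delay(r, jitter=False) for r in range(MAX_RETRIES)])
# → [3, 6, 12]
```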
Note that you need to have the Twisted reactor initialized first.
Also, a settings usage tip: add the following to your settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'middlewares.Retry202Middleware': 544,
    }

(or the equivalent 'DOWNLOADER_MIDDLEWARES' dict entry in your spider's custom_settings).