Can't crawl scrapy with depth more than 1

I couldn't configure scrapy to run with depth > 1. I have tried the following 3 options; none of them worked, and request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to crawl deeper than 1?

3 Answers

#1

warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

So let's scrape mininova and see what happens. Starting at the today page we see that there are two tor links:

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrète', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's scrape the first link, where we see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]

No luck there, we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

Nope, this page only contains one link as well, a link to itself, which also gets filtered. So there are actually no new links to scrape, and Scrapy closes the spider (at depth == 1).

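As an aside, those self-links get dropped by Scrapy's duplicate request filter. A minimal sketch of bypassing that filter with dont_filter=True, using a hypothetical spider whose name and callback are illustrative, not from the answer (it only demonstrates the flag, since it would just keep re-fetching the same page):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class RefetchSpider(BaseSpider):
    name = 'refetch-example'
    start_urls = ['http://www.mininova.org/tor/13204738']

    def parse(self, response):
        # This request targets a URL that was already crawled; normally the
        # duplicate filter would drop it, but dont_filter=True forces Scrapy
        # to schedule it again.
        yield Request(response.url, callback=self.parse, dont_filter=True)
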
#2

I had a similar issue; it helped to set follow=True when defining the Rule:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

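A minimal sketch of such a rule, reusing the link extractor from the question; the spider name and callback are illustrative assumptions, not the asker's code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowingSpider(CrawlSpider):
    name = 'following-example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    # A callback is given, so follow would default to False; setting it to
    # True makes the spider keep extracting and following links from the
    # pages matched by this rule, letting the crawl go deeper than 1.
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
                  callback='parse_page', follow=True)]

    def parse_page(self, response):
        # item extraction would go here
        pass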

#3

The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

You wrote:

request_depth_max at summary log is always 1

What you see in the logs is a statistic, not a setting. When it reports request_depth_max as 1, it means that no further requests have been yielded from the first callback.

You have to show your spider code to understand what is going on.

But please create another question for that.

UPDATE:

Ah, I see you are running the mininova spider from the Scrapy intro:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# TorrentItem is the Item class defined in the intro's items module

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # The only rule sends every /tor/ link straight to parse_torrent and
    # does not follow any further links.
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

As you can see from the code, the spider never issues any requests for other pages; it scrapes all the data right from the top-level pages. That's why the maximum depth is 1.

If you make your own spider that follows links to other pages, the maximum depth will be greater than 1.

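A minimal sketch of what that could look like, yielding a request for every extracted link so the crawl actually goes deeper; the spider name and callback are assumptions, not code from the answer:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DeeperSpider(BaseSpider):
    name = 'deeper-example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    def parse(self, response):
        # Every yielded request is crawled one level deeper than the page it
        # was found on, so request_depth_max rises above 1 (bounded by
        # DEPTH_LIMIT if that setting is non-zero).
        for link in SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response):
            yield Request(link.url, callback=self.parse)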

