Can't crawl scrapy with depth more than 1

I couldn't configure scrapy to run with depth > 1. I have tried the following 3 options; none of them worked, and request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to crawl deeper than 1?

3 Answers

#1

warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

So let's scrape mininova and see what happens. Starting at the today page we see that there are two tor links:

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrète', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's scrape the first link, where we see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]

No luck there, we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

Nope, this page only contains one link as well, a link to itself, which also gets filtered. So there are actually no new links to scrape, and Scrapy closes the spider (at depth == 1).

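As an aside, those self-links get dropped by Scrapy's duplicate request filter. A minimal sketch of bypassing that filter with dont_filter=True, using a hypothetical spider whose name and callback are illustrative, not from the answer (it only demonstrates the flag, since it would just keep re-fetching the same page):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class RefetchSpider(BaseSpider):
    name = 'refetch-example'
    start_urls = ['http://www.mininova.org/tor/13204738']

    def parse(self, response):
        # This request targets a URL that was already crawled; normally the
        # duplicate filter would drop it, but dont_filter=True forces Scrapy
        # to schedule it again.
        yield Request(response.url, callback=self.parse, dont_filter=True)
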
#2

I had a similar issue; it helped to set follow=True when defining the Rule:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

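A minimal sketch of such a rule, reusing the link extractor from the question; the spider name and callback are illustrative assumptions, not the asker's code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowingSpider(CrawlSpider):
    name = 'following-example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    # A callback is given, so follow would default to False; setting it to
    # True makes the spider keep extracting and following links from the
    # pages matched by this rule, letting the crawl go deeper than 1.
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
                  callback='parse_page', follow=True)]

    def parse_page(self, response):
        # item extraction would go here
        pass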

#3

The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

You wrote:

request_depth_max at summary log is always 1

What you see in the logs is a statistic, not a setting. When it reports request_depth_max as 1, it means that no further requests have been yielded from the first callback.

You have to show your spider code to understand what is going on.

But please create another question for that.

UPDATE:

Ah, I see you are running the mininova spider from the Scrapy intro:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# TorrentItem is the Item class defined in the intro's items module

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # The only rule sends every /tor/ link straight to parse_torrent and
    # does not follow any further links.
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

As you can see from the code, the spider never issues any requests for other pages; it scrapes all the data right from the top-level pages. That's why the maximum depth is 1.

If you make your own spider that follows links to other pages, the maximum depth will be greater than 1.

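A minimal sketch of what that could look like, yielding a request for every extracted link so the crawl actually goes deeper; the spider name and callback are assumptions, not code from the answer:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DeeperSpider(BaseSpider):
    name = 'deeper-example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    def parse(self, response):
        # Every yielded request is crawled one level deeper than the page it
        # was found on, so request_depth_max rises above 1 (bounded by
        # DEPTH_LIMIT if that setting is non-zero).
        for link in SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response):
            yield Request(link.url, callback=self.parse)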

