python - Error when crawling NetEase News with scrapy [beginner]

 Posted by user d4k2wd8en1 on 2022-11-02 16:14

items.py

import scrapy
class News163Item(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    source = scrapy.Field()
    content = scrapy.Field()

news_spider.py

#coding:utf-8
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider,Rule
class ExampleSpider(CrawlSpider):
    name = "news"
    allowed_Domains = ["news.163.com"]
    start_urls = ['http://news.163.com/']
    rules = [
        Rule(LinkExtractor(allow=r"/14/12\d+/\d+/*"),
        'parse_news')
    ]

def parse_news(self,response):
    news = News163Item()
    news['title'] = response.xpath("//*[@id="h1title"]/text()").extract()
    news['source'] = response.xpath("//*[@id="ne_article_source"]/text()").extract()
    news['content'] = response.xpath("//*[@id="endText"]/text()").extract()
    news['url'] = response.url
    return news

After cd-ing into the project directory, I run this on the command line:

scrapy crawl news -o news163.json

It fails with the following error:

Traceback (most recent call last):
  File "/usr/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==0.24.4', 'console_scripts', 'scrapy')()
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/pymodules/python2.7/scrapy/commands/crawl.py", line 57, in run
    crawler = self.crawler_process.create_crawler()
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 87, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 25, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/usr/lib/pymodules/python2.7/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 68, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/gao/news/news/spiders/news_spider.py", line 15
    news['title'] = response.xpath("//*[@id="h1title"]/text()").extract()
                                                   ^
SyntaxError: invalid syntax

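Where did I go wrong? I'm new to Python and only started using scrapy recently, so I'm still unfamiliar with it. Any pointers would be appreciated.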

Thanks to @捏造的信仰 for the answer, but after making that change I still get an error:

2014-12-02 20:13:02+0800 [news] ERROR: Spider error processing <GET http://lady.163.com/14/1201/09/ACCAN4MB002649P6.html>
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
        for x in result:
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
        cb_res = callback(response, **cb_kwargs) or ()
      File "/home/gao/news/news/spiders/news_spider.py", line 14, in parse_news
        news = News163Item()
    exceptions.NameError: global name 'News163Item' is not defined

What is causing this one?

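3 answers
  • You need to import the item class at the top of news_spider.py:
    from your_project_name.items import News163Item

    Here your_project_name stands for your Scrapy project's package; judging by the path in the traceback, it is probably news.

    Answered 2022-11-04 17:38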
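  • The syntax error is caused by unescaped quotes inside the string. The line should be changed to

    news['title'] = response.xpath("//*[@id=\"h1title\"]/text()").extract()

    The same applies to the next few lines.

    Answered 2022-11-04 17:58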
  • The string is delimited by double quotes, so if you need quotes inside it you can use single quotes instead. For example

    news['title'] = response.xpath("//*[@id='h1title']/text()").extract()

    Answered 2022-11-04 18:05
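
Both errors in this thread can be reproduced without scrapy, which may help when checking a fix in isolation. The sketch below is only an illustration in plain Python: is_valid_python, parse_news_sketch and MissingItem are hypothetical names made up for the example, not part of scrapy or of the original project.

```python
# Stand-alone sketch: plain Python only, no scrapy required.

def is_valid_python(source):
    """Return True if `source` compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# The problem line from the question, reduced to the string literal and
# kept as text so that this file itself still parses.
broken_line = 'xp = "//*[@id="h1title"]/text()"'

# Fix from the second answer: escape the inner double quotes.
escaped = "//*[@id=\"h1title\"]/text()"

# Fix from the third answer: single quotes inside a double-quoted string.
single_inner = "//*[@id='h1title']/text()"

print(is_valid_python(broken_line))  # the inner quotes end the string early
print(escaped)
print(single_inner)

# The second error: a missing import is only noticed when the function
# body runs, not when the function is defined. That is why the crawl
# started and the NameError appeared later, inside the callback.
def parse_news_sketch():
    # MissingItem is deliberately never imported or defined (hypothetical).
    return MissingItem()

name_error_message = None
try:
    parse_news_sketch()
except NameError as exc:
    name_error_message = str(exc)

print(name_error_message)
```

The two fixed strings differ only in which quote character surrounds h1title. XPath accepts either form for a string literal, so both select the same node.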