I've been stuck on how Scrapy passes data around for almost half a month now. I don't really understand the transfer mechanism, and even after reading a lot of material I still don't get it. My fundamentals are weak, so I'm asking the experts here for help.

Sticking with Scrapy's default behavior (no customization):

What format should the thing the spider returns be in?
A dict, like {a: 1, b: 2, ...}?
Or a list of dicts, like [{a: 1, aa: 11}, {b: 2, bb: 22}, {...}]?

And where does the returned value go? Is it the item in the code below?

class pipeline:
    def process_item(self, item, spider):

I know I'm a beginner, but I really want to learn and would appreciate your guidance. My code is below; please point out its shortcomings.
spider:
# -*- coding: utf-8 -*-
import re

import scrapy

from pm25.items import Pm25Item


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile(r"\d+-\d+-\d+")
        # parse out the date separately
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0]
        # items = []
        # establish the parsing scope within the response
        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li")
        for subselector in selector:  # parse the entries within that scope one by one
            try:  # guard against [0] raising an IndexError
                rank = subselector.xpath("span[1]/text()").extract()[0]
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank, quality, city, province, aqi, pm25)
            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)
            # I don't understand how yield should be used here or what format comes out;
            # some tutorials return items instead, so I'd appreciate some guidance
            yield item
pipeline:
import time


class Pm25Pipeline(object):
    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = str(today) + ".txt"
        with open(fname, "a") as f:
            # not sure whether this loop is right; my understanding is that the spider
            # yields dicts, so the item looks like [{a:1,aa:11},{b:2,bb:22},{...}]
            for tmp in item:
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n')
            f.close()
        return item
items:
import scrapy


class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
Part of the error output from the run:
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30', 'city': '新疆', 'date': '2017-04-02', 'pm25': '13 ', 'province': '伊犁哈萨克州', 'quality': '优', 'rank': '357'}
[... the same TypeError traceback and ERROR line repeat for every remaining item ...]
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)
Hoping for some help from you all. Thanks again!
You can treat an item as a dictionary; under the hood it is a dict-like mapping class. When you iterate directly over the item in your pipeline, each tmp you get is actually one of the dictionary's keys, which is a string, so an operation like tmp['pm25'] raises "TypeError: string indices must be integers".
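You can see the same behavior outside of Scrapy. Here is a minimal sketch using a plain dict as a stand-in for the Item (the field values are made up for illustration):

item = {'date': '2017-04-02', 'rank': '357', 'pm25': '13'}  # stand-in for the scraped Item

for tmp in item:
    print(type(tmp), tmp)   # <class 'str'> 'date' ... -- iterating a mapping yields its KEYS
    # tmp["pm25"]           # indexing a string with another string raises
                            # TypeError: string indices must be integers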
A Scrapy Item is similar to a Python dict, just extended with some extra functionality.
By design, every time the spider generates an Item it is passed straight to the pipeline for processing, one at a time. The for tmp in item you wrote loops over the item's keys, which are strings, so the subsequent __getitem__ lookup complains that the index you used is not an integer.
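Roughly, you can picture the flow like this (a simplified sketch only; fake_engine is a made-up name, and the real Scrapy engine is asynchronous and far more involved):

def fake_engine(spider, pipeline, response):
    # parse() is a generator: each `yield item` hands over exactly one Item
    for item in spider.parse(response):
        # so process_item() always receives a single Item, never a list of them
        pipeline.process_item(item, spider)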
Search for "TypeError: string indices must be integers" and work out what the error means, then locate the offending line from the traceback and fix the problem.
Just write the item directly, no loop needed. Items are processed one at a time; they do not arrive as the list you imagined:
import time


class Pm25Pipeline(object):
    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = str(today) + ".txt"
        with open(fname, "a") as f:
            f.write(item["date"] + '\t' +
                    item["rank"] + '\t' +
                    item["quality"] + '\t' +
                    item["province"] + '\t' +
                    item["city"] + '\t' +
                    item["aqi"] + '\t' +
                    item["pm25"] + '\n')
        return item
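For what it's worth, the same write can be expressed a bit more compactly with str.join, which keeps the field order in one place. This is just an alternative sketch, functionally equivalent to the version above:

import time


class Pm25Pipeline(object):
    def process_item(self, item, spider):
        fname = time.strftime("%y%m%d", time.localtime()) + ".txt"
        # all values are already strings, so they can be joined directly
        fields = ("date", "rank", "quality", "province", "city", "aqi", "pm25")
        with open(fname, "a") as f:
            f.write("\t".join(item[k] for k in fields) + "\n")
        return item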