Crawling with Selenium but not scraping

霞霞123321 posted on 2022-12-04 12:38

I have read all the threads on using Scrapy for AJAX pages and installed Selenium webdriver to make the task easier. My spider can partially crawl, but it fails to get any data into my Items.

My goals are:

    Crawl from this page to this page

    Scrape the following from every item (post); a sketch of a matching Item class follows the list:

    author_name (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/text())
    author_page_url (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/@href)
    post_title (xpath://a[@class="title_txt"])
    post_page_url (xpath://a[@class="title_txt"]/@href)
    post_text (xpath on a separate post page: //div[@id="a_NMContent"]/text())
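
For reference, here is a minimal sketch of an Item class covering the fields listed above. The class and field names are assumed for illustration only; the original code references ItalkiItem without showing its definition:

import scrapy

class ItalkiItem(scrapy.Item):
    # one Field per value listed above; the names simply mirror the list
    author_name = scrapy.Field()
    author_page_url = scrapy.Field()
    post_title = scrapy.Field()
    post_page_url = scrapy.Field()
    post_text = scrapy.Field()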
    

Here is my monkey code (I am only taking my first steps in Python as an aspiring NLP student; I majored in linguistics in the past):

import scrapy
import time
from selenium import webdriver
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import XPathSelector

class ItalkiSpider(CrawlSpider):
    name = "italki"
    allowed_domains = ['italki.com']
    start_urls = ['http://www.italki.com/entries/korean']
    # not sure if the rule is set correctly
    rules = (Rule(LxmlLinkExtractor(allow="\entry"), callback = "parse_post", follow = True),)
    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # adding necessary search parameters to the URL
        self.driver.get(response.url+"#language=korean&author-language=russian&marks-min=-5&sort=1&page=1")
        # pressing the "Show More" button at the bottom of the search results page to show the next 15 posts, when all results are loaded to the page, the button disappears
        more_btn = self.driver.find_element_by_xpath('//a[@id="a_show_more"]')

        while more_btn:
            more_btn.click()
            # sometimes waiting for 5 sec made spider close prematurely so keeping it long in case the server is slow
            time.sleep(10)

        # here is where the problem begins, I am making a list of links to all the posts on the big page,
        # but I am afraid links will contain only the first link, because selenium doesn't do the multiple
        # selection as one would expect from this xpath... how can I grab all the links and put them in
        # the links list (and should I?)
        links=self.driver.find_elements_by_xpath('/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li/div[2]/a')
        for link in links:
            link.click()
            time.sleep(3)

    # this is the function for parsing individual posts, called back by the *parse* method as specified
    # in the rule of the spider; if it is correct, it should have saved at least one post into an item...
    # I don't really understand how and where this callback function gets the response from the new page
    # (the page of the post in this case)... is it automatically loaded to drive and then passed on to
    # the callback function as soon as selenium has clicked on the link (link.click())? or is it all
    # total nonsense...
    def parse_post(self, response):
        hxs = Selector(response)
        item = ItalkiItem()
        item["post_item"] = hxs.xpath('//div [@id="a_NMContent"]/text()').extract()
        return item
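
As an aside on the comment inside parse() above: instead of clicking each element with Selenium, a common pattern is to read the href attribute of every matched element and feed the URLs back to Scrapy as requests. A minimal sketch of what the end of parse() could look like, assuming the same XPath and the Selenium 3 API used in the question:

        # sketch: collect every post URL from the fully loaded page instead of clicking the links
        links = self.driver.find_elements_by_xpath(
            '/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li/div[2]/a')
        post_urls = [link.get_attribute('href') for link in links]

        # hand the URLs to Scrapy so parse_post receives each post page as a normal response
        for url in post_urls:
            yield scrapy.Request(url, callback=self.parse_post)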

alecxe · 6 · answered 2022-12-11 02:11

Let's think it through:

open the page in the browser and click "Show More" until you get to the desired page

initialize a scrapy TextResponse with the current page source (with all the necessary posts loaded)

for every post, initialize an Item, yield a Request to the post page and pass the item instance from the request to the response in the meta dictionary

Notes and changes I'm introducing:

use a normal Spider class

use Selenium to wait for the "Show More" button to become visible

close the driver instance in the spider_closed signal handler

The code:

import scrapy
from scrapy import signals
from scrapy.http import TextResponse 
from scrapy.xlib.pydispatch import dispatcher

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ItalkiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    text = scrapy.Field()


class ItalkiSpider(scrapy.Spider):
    name = "italki"
    allowed_domains = ['italki.com']
    start_urls = ['http://www.italki.com/entries/korean']

    def __init__(self):
        self.driver = webdriver.Firefox()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        # selenium part of the job
        self.driver.get('http://www.italki.com/entries/korean')
        while True:
            more_btn = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.ID, "a_show_more"))
            )

            more_btn.click()

            # stop when we reach the desired page
            if self.driver.current_url.endswith('page=52'):
                break

        # now scrapy should do the job
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//ul[@id="content"]/li'):
            item = ItalkiItem()
            item['title'] = post.xpath('.//a[@class="title_txt"]/text()').extract()[0]
            item['url'] = post.xpath('.//a[@class="title_txt"]/@href').extract()[0]

            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.parse_post)

    def parse_post(self, response):
        item = response.meta['item']
        item["text"] = response.xpath('//div[@id="a_NMContent"]/text()').extract()
        return item

This is what you should use as base code and improve on to fill out all the other fields, like author and author_url. Hope that helps.
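
As a hypothetical extension along those lines, the loop in parse() could also capture the author fields. The relative XPath below is only derived from the absolute path given in the question and has not been checked against the live markup, so treat it as an assumption; author and author_url would also need to be added as Fields on ItalkiItem:

        for post in response.xpath('//ul[@id="content"]/li'):
            item = ItalkiItem()
            item['title'] = post.xpath('.//a[@class="title_txt"]/text()').extract()[0]
            item['url'] = post.xpath('.//a[@class="title_txt"]/@href').extract()[0]

            # assumed relative selector for the author link, verify it in the browser first
            author_link = post.xpath('.//span[2]/ul/li[3]/a')
            item['author'] = author_link.xpath('text()').extract()
            item['author_url'] = author_link.xpath('@href').extract()

            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.parse_post)

Once the spider is in place, the items can be run and exported straight from the command line, for example:

    scrapy crawl italki -o posts.json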
