python - Cache error when crawling with selenium and phantomjs?

 天使犯罪de快乐 posted on 2022-10-27 17:11

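I ran into a problem while crawling with selenium and phantomjs. (I was going through the Lantern proxy software while crawling; the site cannot be crawled directly.)

The code is as follows: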

from selenium import webdriver
import time 
driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()

It scrapes the content of one link and then fails with this error:

Traceback (most recent call last):
  File "D:/python-work/test.py", line 10, in <module>
    a = a.get_attribute('href')
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen
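
The traceback is consistent with how WebDriver handles element references: every element in `alla` belongs to the page that was loaded when it was found, so after the first `driver.get(a)` the remaining elements are stale. One common workaround is to read all `href` values into plain strings before navigating anywhere. A minimal sketch, assuming the same Selenium 3 style calls used in the code above; `collect_hrefs` is a hypothetical helper name, not part of selenium:

```python
def collect_hrefs(driver, class_name="question_link"):
    """Return the href of every matching element as a plain string.

    Strings stay valid after navigation; WebElement objects do not.
    """
    elements = driver.find_elements_by_class_name(class_name)
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # skip elements that have no href attribute
            hrefs.append(href)
    return hrefs
```

With this, the loop can become `for url in collect_hrefs(driver): driver.get(url)`, and nothing inside the loop touches an element from the index page again.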
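1 answer
  • Experts, I modified the code, but it runs very slowly even though I disabled image loading, and the same error still shows up sometimes. Could you take a look and tell me what else could be changed or optimized? The code is as follows: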

    __author__ = 'Administrator'
    
    from selenium import webdriver
    import time
    
    cap = webdriver.DesiredCapabilities.PHANTOMJS
    cap["phantomjs.page.settings.resourceTimeout"] = 1000
    cap["phantomjs.page.settings.loadImages"] = False
    #cap["phantomjs.page.settings.javascriptEnabled"] = False
    cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
    driver = webdriver.PhantomJS(desired_capabilities=cap)
    
    #driver = webdriver.PhantomJS()
    driver.get('http://chuansong.me')
    length = len(driver.find_elements_by_class_name('question_link'))
    for i in range(0,length):
        alla = driver.find_elements_by_class_name('question_link')
        a = alla[i]
        print(a)
        if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
            a.click()
            driver.get(a.get_attribute('href'))
            title = driver.find_element_by_id('activity-name').text
            writer = driver.find_element_by_id('post-user').text
            content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
            print(writer,title,content)
            driver.back()
            time.sleep(8)
    driver.close()
    driver.quit()
    Answered 2022-10-28 14:24
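
In the revised code, `a.click()` already navigates away from the index page, so the `a.get_attribute('href')` on the next line can hit the same stale reference. Each article is also loaded twice (once by `click()`, once by `get()`), followed by `back()` and an 8-second sleep, which probably accounts for much of the slowness. A sketch of a loop that visits each URL exactly once, assuming the URLs were first read into a list of strings and that the element ids are the ones from the post; `first_text` and `scrape_articles` are hypothetical names:

```python
def first_text(driver, element_id):
    """Return the text of the first element with this id, or None if absent."""
    # The plural finder returns an empty list instead of raising.
    matches = driver.find_elements_by_id(element_id)
    return matches[0].text if matches else None


def scrape_articles(driver, urls):
    """Visit each URL once and return a list of dicts, one per article page."""
    results = []
    for url in urls:
        driver.get(url)  # single navigation; no click() or back()
        title = first_text(driver, "activity-name")
        if title is None:
            continue  # page has no article title, so skip it
        results.append({
            "url": url,
            "title": title,
            "writer": first_text(driver, "post-user"),
            "content": first_text(driver, "js_content"),
        })
    return results
```

Because the loop only holds strings between page loads, there is no element left over from an earlier page that could go stale. This sketch reads `.text` for the content, as in the first version; `get_attribute('outerHTML')` would need a small change to `first_text`.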