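I ran into a problem while scraping with selenium and phantomjs. (I used the Lantern (蓝灯) proxy software while scraping, because the site cannot be fetched directly.) The code is as follows: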
from selenium import webdriver
import time

driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()
It scrapes the content of one link and then fails with this error:
Traceback (most recent call last):
  File "D:/python-work/test.py", line 10, in <module>
    a = a.get_attribute('href')
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen
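Experts, I have modified the code, but it now runs very slowly even though I disabled image loading, and the same error still shows up from time to time. Could you take a look and tell me what else can be changed or optimized? The code is as follows: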
__author__ = 'Administrator'
from selenium import webdriver
import time

cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = False
#cap["phantomjs.page.settings.javascriptEnabled"] = False
cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
driver = webdriver.PhantomJS(desired_capabilities=cap)
#driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
length = len(driver.find_elements_by_class_name('question_link'))
for i in range(0,length):
    alla = driver.find_elements_by_class_name('question_link')
    a = alla[i]
    print(a)
    if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
        a.click()
        driver.get(a.get_attribute('href'))
        title = driver.find_element_by_id('activity-name').text
        writer = driver.find_element_by_id('post-user').text
        content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
        print(writer,title,content)
        driver.back()
        time.sleep(8)
driver.close()
driver.quit()
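
A possible direction, offered as a sketch rather than a confirmed fix: the StaleElementReferenceException above is raised when an element handle is used after the page it came from has been replaced, which is what happens once driver.get() or a.click() navigates away from the listing. One way around it is to read every href into plain strings while the listing page is still loaded, and only then visit the links. That would also remove the need for driver.back() and for re-querying the list on each iteration. The helper names collect_links and scrape_articles and the delay parameter are assumptions made for illustration. The element ids and the find_element(s)_by_* calls are taken from the code above (Selenium 3 style, later removed in Selenium 4).

```python
import time


def collect_links(driver, class_name="question_link"):
    """Read every href while the listing page is still loaded."""
    elements = driver.find_elements_by_class_name(class_name)
    hrefs = [el.get_attribute("href") for el in elements]
    # Drop missing values and duplicates, keeping the original order.
    seen = set()
    unique = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            unique.append(href)
    return unique


def scrape_articles(driver, start_url, delay=0.0):
    """Visit each collected link and return (writer, title, content) tuples."""
    driver.get(start_url)
    # From here on we only hold strings, not element handles,
    # so navigating to another page cannot make them stale.
    links = collect_links(driver)
    results = []
    for link in links:
        driver.get(link)
        title = driver.find_element_by_id("activity-name").text
        writer = driver.find_element_by_id("post-user").text
        content = driver.find_element_by_id("js_content").text
        results.append((writer, title, content))
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results
```

With the setup from the post, this could be called as scrape_articles(webdriver.PhantomJS(desired_capabilities=cap), 'http://chuansong.me'), followed by driver.quit(). Pages that lack one of the three ids would still raise NoSuchElementException, so a try/except around the lookups may be needed.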