I want to download the images from http://www.chuiyao.com/manhua/3670/393022.html, but with the code below the only image I find is http://www.chuiyao.com/static/skin5/images/pic_loading.gif. For the headers, I copied the Request Headers that Chrome's Network tab shows for 393022.html.
import requests
from lxml import html


def main():
    url = "http://www.chuiyao.com/manhua/3670/393022.html"
    # Request headers copied from Chrome's Network tab for 393022.html
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Cookie": "__cfduid=d1fd7e3291dbb9fc63a4884a0441f78ee1486866309; bdshare_firstime=1486866314583; UM_distinctid=15af3096a8336d-08f956321b6c03-1d3b6853-1fa400-15af3096a84296; qtmhhis=2017-2-21-18-47-47%5E%5E%u6597%u7834%u82CD%u7A79%5E%5E%u7B2C189%u8BDD%20%u6BD2%u9B54%u6591%5E%5E1%5E%5E393022%5E%5E3670_ShG_; Hm_lvt_1317de45b1b9f5aacfe358d1694b22f9=1488746420,1490136167,1490136167,1490136753; Hm_lpvt_1317de45b1b9f5aacfe358d1694b22f9=1490136753; CNZZDATA1254167849=49322564-1486864570-https%253A%252F%252Fwww.google.com%252F%7C1490131571",
        "Host": "www.chuiyao.com",
        "Pragma": "no-cache",
        "Referer": "http://www.chuiyao.com/manhua/3670/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    }
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    # Prints the src of the comic image element as found in the downloaded HTML
    print(tree.xpath('//*[@id="qTcms_pic"]/@src'))


if __name__ == "__main__":
    main()
Why does this happen?
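The src of the img tag is replaced dynamically after the page finishes loading, by a call to the JS function Show_Pic_w(). requests does not execute JavaScript, so from Python you only get the initial HTML with the placeholder image, not the page the browser ends up showing.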
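To make the point above concrete, here is a minimal sketch of what a non-JavaScript client sees. The sample HTML is a simplified, hypothetical stand-in for the real page (only the element id and the placeholder path come from the question; the inline script and the path it assigns are made up for illustration). It uses the standard-library html.parser instead of lxml purely so that it runs without extra dependencies.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the HTML the server returns.
# In a browser the inline script would run and swap the src; a plain
# HTML parser never executes it.
SAMPLE_HTML = """
<html><body>
<img id="qTcms_pic" src="/static/skin5/images/pic_loading.gif">
<script>
  // made-up example of a script replacing the placeholder
  document.getElementById("qTcms_pic").src = "/hypothetical/real-page.jpg";
</script>
</body></html>
"""


class ImgSrcById(HTMLParser):
    """Collects the src attribute of <img> tags with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("id") == self.target_id:
            self.srcs.append(attr_map.get("src"))


def static_img_src(page_source, target_id="qTcms_pic"):
    """Return the src values visible in the static HTML, without running JS."""
    parser = ImgSrcById(target_id)
    parser.feed(page_source)
    return parser.srcs


print(static_img_src(SAMPLE_HTML))
```

The parser reports only the loading GIF, which matches what the XPath query in the question returns.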
The image is loaded dynamically by JS; your crawler only fetched the static page.
The main image-loading logic is in this JS file:
www.chuiyao.com/static/skin5/js/wdshow.js?v=20160713.1
Reimplement what that JS does using the relevant Python modules, and you can work out the image URL yourself.
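A possible first step toward that, sketched under assumptions: before porting the JS, it helps to see which script files the static page references and which string variables its inline scripts define, since the JS presumably builds the image URL from data like that. The sample HTML and the variable names example_picdata and example_page below are hypothetical; what wdshow.js actually reads and how it decodes it has to be checked in the file itself.

```python
import re

# Matches the src attribute of external <script> tags.
SCRIPT_SRC_RE = re.compile(r'<script[^>]*\bsrc=["\']([^"\']+)["\']', re.I)
# Matches simple inline assignments of the form: var name = "value";
VAR_RE = re.compile(r'\bvar\s+([A-Za-z_$][\w$]*)\s*=\s*(["\'])(.*?)\2\s*;')

# Hypothetical sample; the variable names are invented for illustration.
SAMPLE_HTML = '''
<script src="/static/skin5/js/wdshow.js?v=20160713.1"></script>
<script>var example_picdata = "abc123"; var example_page = "1";</script>
<img id="qTcms_pic" src="/static/skin5/images/pic_loading.gif">
'''


def external_scripts(page_source):
    """List the URLs of external scripts referenced by the page."""
    return SCRIPT_SRC_RE.findall(page_source)


def inline_string_vars(page_source):
    """Map variable names to string values for simple `var x = "..."` lines."""
    return {m.group(1): m.group(3) for m in VAR_RE.finditer(page_source)}


print(external_scripts(SAMPLE_HTML))
print(inline_string_vars(SAMPLE_HTML))
```

On the real page you would pass page.text from the requests call in the question instead of SAMPLE_HTML, then read Show_Pic_w() in wdshow.js to see which of those values it uses and how it turns them into the image address.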