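I have used a program like this to scrape similar sites that are protected by a captcha, and it always retrieved the correct data. Only this site behaves strangely: it keeps reporting a captcha error. I have to scrape through proxy IPs, and after the program runs two or three times the IP gets banned. Please help. At first I assumed the captcha had been refreshed while I was fetching the captcha image, so the value no longer matched. That is why I now fetch the captcha directly and then GET the data from the page. This approach works on other sites; only this one keeps failing!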
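Addendum: this is the sequence I see in Fiddler when I type the captcha by hand and click search: 1. a response saying whether the captcha was right or wrong (the captcha is passed as a parameter); 2. a response with the search results; 3. a new captcha is generated. So the question is: is it right or wrong for the program to fetch the captcha image at the very start? What should I do instead? In the crawlers I have written before, the captcha was passed in the search URL as one of its parameters, so the data was easy to get. Here, however, the captcha is unrelated to the search URL: it is passed as a parameter to a separate URL, which only returns whether the captcha was correct.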
#coding=utf-8
#from bs4 import BeautifulSoup
import urllib
import urllib2
import re
import sys
import time
import requests
from PIL import Image
#from pytesser import *
import cookielib

reload(sys)
sys.setdefaultencoding('utf-8')

time=(time.time())
session=requests.session()
user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36'
headers={'User-Agent':user_agent,'Referer':'http://www.jsgsj.gov.cn:58888/mini/netweb/SMLibrary.jsp','Connection':'keep-alive','Host':'www.jsgsj.gov.cn:58888'}
captcha_url='http://www.jsgsj.gov.cn:58888/mini/netWebServlet.json?randImg=true&tmp='+str(time)
print captcha_url
captcha=(session.get(captcha_url,headers=headers)).content
with open('captcha.jpg','wb') as imgfile:
    imgfile.write(captcha)
im = Image.open('captcha.jpg')
im.show()
captcha=raw_input("enter captcha:")
url_company='http://www.jsgsj.gov.cn:58888/mini/netWebServlet.json?codeCheck=true&corpName=苏州&yzm='+str(captcha)
html0=session.get(url=url_company,headers=headers)
company=(html0.content)
print (company)
url='http://www.jsgsj.gov.cn:58888/mini/netWebServlet.json?querySMLibrary=true&corpName=苏州&yzm='+str(captcha)+'&pageSize=10&curPage=1&sortName=&sortOrder='
html1=session.get(url=url,headers=headers)
page=(html1.content)
print type(page),page
#coding=utf-8
import requests, time
from PIL import Image

url = 'http://www.jsgsj.gov.cn:58888/mini/netWebServlet.json'
basic_url = 'http://www.jsgsj.gov.cn:58888/mini/netweb/SMLibrary.jsp'
check_url = '{0}?codeCheck=true'.format(url)
captcha_url = '{0}?randImg=true&tmp={1}'.format(url, time.time())
url_company = '{0}?querySMLibrary=true'.format(url)

session = requests.session()
session.headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36'
}

# Visit the search page first; the session keeps any cookies it sets.
session.get(basic_url)

# Fetch the captcha image in the same session and show it to the user.
with open('captcha.jpg','wb') as file:
    file.write(session.get(captcha_url).content)
Image.open('captcha.jpg').show()
captcha = raw_input("enter captcha:")

# POST the captcha to the check URL, then POST the search with the same captcha.
session.post(url=check_url, data='corpName=ngk&yzm={0}'.format(captcha))
data = 'corpName=ngk&yzm={0}&pageSize=10&curPage=1&sortName=&sortOrder='.format(captcha)
r = session.post(url=url_company, data=data)
print r.text
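Comparing the two listings, the answer differs from the question's code in a few visible ways: it loads SMLibrary.jsp before requesting the captcha, it sends the check and the search as POST requests with a form-encoded body instead of GET requests with a query string, and it sets the headers once on the session. The thread does not say which of these the site actually requires, so the following is only a sketch of the same sequence, restated in Python 3 (the listings above are Python 2). The names `search_with_captcha`, `read_captcha` and `FORM_HEADERS` are made up for illustration; `session` is assumed to be a `requests.Session` or any object with compatible `get` and `post` methods.

```python
import time
from urllib.parse import urlencode

BASE = 'http://www.jsgsj.gov.cn:58888/mini/netWebServlet.json'
PAGE = 'http://www.jsgsj.gov.cn:58888/mini/netweb/SMLibrary.jsp'
FORM_HEADERS = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}


def search_with_captcha(session, corp_name, read_captcha):
    """Run the page -> captcha -> check -> search sequence in one session.

    read_captcha is a hypothetical callback: it receives the captcha image
    bytes and returns the text the user (or an OCR step) reads from it.
    """
    # 1. Load the search page so the session holds whatever cookies it sets.
    session.get(PAGE)

    # 2. Fetch the captcha image in the same session.
    captcha_url = '{0}?randImg=true&tmp={1}'.format(BASE, time.time())
    captcha = read_captcha(session.get(captcha_url).content)

    # 3. POST the captcha to the check endpoint as a form body.
    check_body = urlencode({'corpName': corp_name, 'yzm': captcha})
    check = session.post(BASE + '?codeCheck=true',
                         data=check_body, headers=FORM_HEADERS)

    # 4. POST the search with the same captcha, again as a form body.
    search_body = urlencode({
        'corpName': corp_name, 'yzm': captcha,
        'pageSize': 10, 'curPage': 1,
        'sortName': '', 'sortOrder': '',
    })
    result = session.post(BASE + '?querySMLibrary=true',
                          data=search_body, headers=FORM_HEADERS)
    return check.text, result.text
```

With `requests` installed, a call would look like `search_with_captcha(requests.Session(), 'ngk', ask_user)`, where `ask_user` is a function you write that saves or displays the image bytes and returns what the user types. Using `urlencode` also percent-encodes a non-ASCII company name such as 苏州 rather than placing it raw in the URL or body.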