HTTP requests with timeout, maximum size and connection pooling

Question

Posted by 墨镜元龙高卧 on 2023-01-16 13:05

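I'm looking for a way in Python (2.7) to make HTTP requests that meets three requirements:

timeout (for reliability)

maximum content size (for security)

connection pooling (for performance)

I've checked just about every Python HTTP library, but none of them meets all three requirements. For example:

urllib2: good, but no pooling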

import urllib2
import json

# Read at most limit+1 bytes: getting more than `limit` back means the body is too large.
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100 + 1)
r.close()
if len(content) > 100:
    print 'too large'
else:
    print json.loads(content)

# Same check with a 100000-byte limit.
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000 + 1)
r.close()
if len(content) > 100000:
    print 'too large'
else:
    print json.loads(content)
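
The `read(limit + 1)` check above can be wrapped in a small helper. This is a minimal sketch and not part of the original question: `read_limited` and `ResponseTooLarge` are hypothetical names, and the only assumption is that the response object exposes `read(n)`, as the object returned by `urllib2.urlopen` does. It is written to run on both Python 2.7 and Python 3, and it uses an in-memory stream in place of a real response.

```python
import io


class ResponseTooLarge(ValueError):
    """Raised when a body exceeds the configured limit (hypothetical name)."""


def read_limited(fp, maxsize, chunk_size=8192):
    """Read from a file-like object, refusing bodies larger than maxsize bytes.

    Loops because a single read(n) call may return fewer than n bytes.
    """
    buf = io.BytesIO()
    remaining = maxsize + 1  # one extra byte is enough to detect an overflow
    while remaining > 0:
        chunk = fp.read(min(chunk_size, remaining))
        if not chunk:  # end of stream
            break
        buf.write(chunk)
        remaining -= len(chunk)
    if remaining == 0:
        raise ResponseTooLarge('body exceeds %d bytes' % maxsize)
    return buf.getvalue()


# In-memory stand-in for the object returned by urllib2.urlopen(...)
small = read_limited(io.BytesIO(b'{"ok": true}'), maxsize=100)
```

Because the helper never asks for more than `maxsize + 1` bytes in total, an oversized body is rejected without being downloaded in full.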

requests: no max size

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and not safe to rely on
content = r.raw.read(100000+1)
print content # ARF this is gzipped, so not the real size
print json.loads(content) # content is gzipped so pretty useless
print r.json() # Does not work anymore since raw.read was used

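urllib3: I never got the "read" method working, even with a 50 MB file...

httplib: httplib.HTTPConnection is not a pool (only one connection)

I can hardly believe that urllib2 is the best HTTP library I can use! So if anyone knows of a library that can do this, or how to use one of the libraries above...

EDIT:

The best solution I found, thanks to Martijn Pieters (StringIO does not slow down even for huge files, whereas repeated str concatenation slows down a lot).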

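import requests
from StringIO import StringIO

maxsize = 100000  # limit in bytes; same value as in the examples above

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
size = 0
ctt = StringIO()

for chunk in r.iter_content(2048):
    size += len(chunk)
    ctt.write(chunk)
    if size > maxsize:
        r.close()
        raise ValueError('Response too large')

content = ctt.getvalue()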
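
The question's third requirement, connection pooling, is not shown in the snippet above. As a hedged sketch: a `requests.Session` reuses connections through urllib3's pool, so the same chunked size check can be run through a session. `fetch_limited` is a hypothetical helper name, the 100000-byte default is taken from the examples in this thread, and `io.BytesIO` stands in for `StringIO` so the code also runs on Python 3. The function relies only on `session.get(...)`, `iter_content()` and `close()`.

```python
import io


def fetch_limited(session, url, maxsize=100000, timeout=5, chunk_size=2048):
    """GET url through session and return the body, or raise if it exceeds maxsize.

    session is expected to behave like requests.Session (an assumption of this sketch).
    """
    r = session.get(url, timeout=timeout, stream=True)
    try:
        size = 0
        buf = io.BytesIO()
        for chunk in r.iter_content(chunk_size):
            size += len(chunk)
            if size > maxsize:
                raise ValueError('Response too large')
            buf.write(chunk)
        return buf.getvalue()
    finally:
        # Always close the response, also when aborting early
        r.close()


# Usage, assuming the requests package is installed:
#   import requests
#   session = requests.Session()  # one session, reused across calls
#   body = fetch_limited(session, 'https://github.com/timeline.json')
```

Note that the `timeout` passed to requests applies to connecting and to each wait for data, not to the download as a whole, so a slow but steady response is still bounded only by `maxsize`.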

Martijn Piet.. 13

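You can do this with requests just fine; but you need to know that the raw object is part of the urllib3 internals, and make use of the extra arguments the HTTPResponse.read() call supports, which let you specify that you want to read decoded data: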

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

content = r.raw.read(100000+1, decode_content=True)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

Alternatively, you can set the decode_content flag on the raw object before reading:

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

r.raw.decode_content = True
content = r.raw.read(100000+1)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

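If you don't like reaching into the urllib3 internals like that, use response.iter_content() to iterate over the decoded content in chunks; this also uses the underlying HTTPResponse (through its .stream() generator version):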

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

maxsize = 100000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')

print content
print json.loads(content)

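There is a subtle difference in how compressed data sizes are handled here; r.raw.read(100000+1) will only ever read 100k bytes of compressed data, and the uncompressed data is then tested against your max size. The iter_content() method will read more uncompressed data in the rare case that the compressed stream is larger than the uncompressed data.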
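
To make the gap between compressed and decompressed size concrete, here is a small standard-library sketch that does not touch requests or urllib3 at all, so it illustrates the general effect and not the exact behavior of either library. It gzips one megabyte of zero bytes, shows that the compressed form is far below a 100000-byte limit, and then uses `zlib.decompressobj` with the `max_length` argument of `decompress()` to stop once the limit is crossed. `decompress_limited` is a hypothetical helper, and the payload and limit are made up for illustration.

```python
import gzip
import io
import zlib


def decompress_limited(gzipped, maxsize):
    """Decompress gzip data, raising ValueError if the output exceeds maxsize bytes."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # max_length caps the output, so at most maxsize + 1 bytes are ever produced
    out = d.decompress(gzipped, maxsize + 1)
    if len(out) > maxsize:
        raise ValueError('Decompressed data too large')
    return out


raw = b'\0' * 1000000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(raw)
compressed = buf.getvalue()

print(len(compressed) < 100000)  # the compressed body passes a naive size check
try:
    decompress_limited(compressed, 100000)
except ValueError as exc:
    print(exc)
```

A limit that is meant to protect memory should therefore be checked against the decoded bytes, which is what both approaches in this answer do.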

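Neither method allows r.json() to work; the response._content attribute isn't set by these; you can of course do so manually. But since the .raw.read() and .iter_content() calls already give you access to the content in question, there is really no need.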
