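Ten Ways to Fetch Web Resources with Python 3

Author: 机智的孙志嵘 | Source: Internet | 2017-05-14 02:44

Over the past couple of days I have been learning how to fetch web resources with Python 3. There turn out to be quite a few ways to do it, so here are some short notes.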

1. The simplest way

import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
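read() returns bytes, and the response object can be used as a context manager so the connection is closed promptly. The following is a minimal sketch, assuming a hypothetical helper named fetch_text (not part of the original notes) and a UTF-8 fallback when the server declares no charset:

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str.

    fetch_text is a hypothetical helper; the UTF-8 fallback is an assumption.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header if the server sent one.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)
```

It would be called as html = fetch_text('http://python.org/').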

2. Using Request

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
# Not sent in this example; example 4 passes it as a header.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
     'act' : 'login',
     'login[email]' : 'yzhang@i9i8.com',
     'login[password]' : '123456'
     }

# urlencode() returns a str; the request body must be bytes.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))

4. Sending data and headers

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
     'act' : 'login',
     'login[email]' : 'yzhang@i9i8.com',
     'login[password]' : '123456'
     }
headers = { 'User-Agent' : user_agent }

# urlencode() returns a str; the request body must be bytes.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))
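The request can be inspected before it is sent, which makes it easy to see that attaching a body turns it into a POST. A small sketch, assuming a hypothetical helper build_post_request (the name and its user_agent parameter are illustrative only):

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields, user_agent=None):
    """Build a form-encoded POST request without sending it (hypothetical helper)."""
    # urlencode() returns str; Request needs bytes for the body.
    data = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, data=data, headers=headers)


req = build_post_request(
    "http://localhost/login.php",
    {"act": "login"},
    user_agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
)
print(req.get_method())  # POST, because a body is attached
print(req.data)          # b'act=login'
```

Passing the returned object to urllib.request.urlopen() sends it exactly as in the examples above.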

5. HTTP errors

#! /usr/bin/env python3

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
  urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
  print(e.code)
  print(e.read().decode("utf8"))

6. Exception handling, approach 1

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("http://twitter.com/")
try:
  response = urlopen(req)
# HTTPError is a subclass of URLError, so it must be caught first.
except HTTPError as e:
  print('The server couldn\'t fulfill the request.')
  print('Error code: ', e.code)
except URLError as e:
  print('We failed to reach a server.')
  print('Reason: ', e.reason)
else:
  print("good!")
  print(response.read().decode("utf8"))

7. Exception handling, approach 2

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("http://twitter.com/")
try:
  response = urlopen(req)
except URLError as e:
  # HTTPError has both .code and .reason, so test for .code first.
  if hasattr(e, 'code'):
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
  elif hasattr(e, 'reason'):
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
  print("good!")
  print(response.read().decode("utf8"))
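Both approaches depend on HTTPError being a subclass of URLError, so the more specific case has to be checked first. A minimal sketch of that ordering, using a hypothetical helper describe_failure (not from the original notes):

```python
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, URLError):
        return "failed to reach server: {}".format(exc.reason)
    raise TypeError("not a urllib error: {!r}".format(exc))
```

Inside an except URLError as e: block, print(describe_failure(e)) would then cover both cases.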

8. HTTP authentication

#! /usr/bin/env python3
 
import urllib.request
 
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
 
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
 
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
 
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
 
# use the opener to fetch a URL
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())
 
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
 
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
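HTTPBasicAuthHandler waits for the server's 401 challenge and then retries with an Authorization header. The sketch below builds that header by hand and sends it up front, which is a different technique from the handler above. The helper names are hypothetical. Basic credentials are only base64-encoded, not encrypted, so this assumes an HTTPS URL:

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(token).decode("ascii")


def build_authenticated_request(url, username, password):
    # Hypothetical helper: attaches credentials without waiting for a 401 challenge.
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req
```

The resulting request can be passed to urllib.request.urlopen() like any other Request.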

9. Using a proxy

#! /usr/bin/env python3
 
import urllib.request
 
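# urllib has no built-in SOCKS support; the key must be a URL scheme such as 'http'.
# This assumes an HTTP proxy is listening on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})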
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

 
a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)
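install_opener() changes the behaviour of every later urlopen() call in the process. When only some requests should go through the proxy, the opener can be kept as a local object instead. A sketch under the assumption that an HTTP proxy listens on localhost:1080 (make_proxy_opener is a hypothetical helper):

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https requests through proxy_url."""
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


opener = make_proxy_opener("http://localhost:1080")
# opener.open("http://g.cn") would go through the proxy;
# plain urllib.request.urlopen() calls are left unaffected.
```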

10. Timeouts

#! /usr/bin/env python3
 
import socket
import urllib.request
 
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
 
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
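socket.setdefaulttimeout() affects every socket created afterwards. urlopen() also accepts a timeout argument that applies to a single call. A minimal sketch, assuming a hypothetical helper fetch_with_timeout that returns None on failure (that return convention is an illustrative choice):

```python
import socket
import urllib.request
from urllib.error import URLError


def fetch_with_timeout(url, timeout=2.0):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (URLError, socket.timeout):
        # Connection-phase timeouts typically arrive wrapped in URLError;
        # a timeout during read() is raised as socket.timeout directly.
        return None
```

It would be called as a = fetch_with_timeout('http://twitter.com/', timeout=2).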
