Author: 岁月完好 | Source: Internet | 2023-05-18 13:22
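I'm looking for a Python script (using 3.4.3) that fetches an HTML page from a URL and can walk the DOM to try to find specific elements.
This is what I currently have: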
#!/usr/bin/env python
import urllib.request

def getSite(url):
    # Open the URL and return the HTTP response object
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    # read() returns the raw response body
    content = getSite('http://www.google.com').read()
    print(content)
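When I print content, it prints out the entire HTML page, which is close to what I want... but I'd like to be able to navigate the DOM rather than treat it as one giant string.
I'm still new to Python, but I have experience with a number of other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar in Java before, but wanted to try it in Python.
Any help is appreciated. Cheers!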
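A side note on the snippet above: under Python 3, read() returns bytes rather than str, so the body usually has to be decoded before it can be handled as text. The following is a minimal sketch of one way to do that, assuming the charset is taken from a Content-Type header value. The helper name decode_body and the sample bytes are hypothetical and exist only for illustration.

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw response bytes using the charset named in a Content-Type value.

    decode_body is a hypothetical helper, not part of urllib.
    """
    header = Message()
    header["Content-Type"] = content_type
    # Falls back to `default` when the header carries no charset parameter
    charset = header.get_content_charset(default)
    # errors="replace" substitutes malformed bytes instead of raising
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # Made-up sample body, encoded the way a Latin-1 server might send it
    sample = "<title>café</title>".encode("iso-8859-1")
    print(decode_body(sample, "text/html; charset=ISO-8859-1"))
```

With a live response object, the same charset lookup should be available as response.headers.get_content_charset(), since the headers object is built on email.message.Message.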
1> Zach Gates:
There are many different modules you can use. For example, lxml or BeautifulSoup.
Here is an lxml example:
import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)  # parse the page into an element tree
description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
A BeautifulSoup example:
import urllib.request
from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup_mysite = BeautifulSoup(mysite, 'html.parser')
description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute
>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
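Note how BeautifulSoup returns a unicode string, whereas lxml does not. Depending on what you need, this can be helpful or harmful. (That difference is a Python 2 behavior, as the u prefix in the output suggests; under Python 3 both libraries return str for attribute values.)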
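If installing a third-party package is not an option, the standard library's html.parser module can handle a lookup like the one above, although it is event-based rather than a navigable DOM tree. The following is a rough sketch, not a replacement for lxml or BeautifulSoup. The class name MetaDescriptionParser, the helper find_description, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Records the content attribute of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def find_description(html_text):
    """Return the meta description of html_text, or None if there is none."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.description


# Hypothetical sample page so the sketch runs without network access
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta name="description" content="A made-up page used for illustration.">
<title>Sample</title></head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    print(find_description(SAMPLE_HTML))
```

Note that feed() expects str, so bytes returned by urlopen(...).read() would need to be decoded first. For anything beyond a single lookup, the tree-based APIs of lxml and BeautifulSoup shown above are likely to be more convenient.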