Question

html5 - Processing scraped HTML page data with Python

牛玺峻国_781 posted on 2022-10-26 17:38

http

The URL I am requesting:
http://www.hkex.com.hk/chi/st...
Note that I only need one table from the page; I want to extract the data of that key table.

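The data I want to scrape is this turnover report, but the HTML holds it as preformatted text (`<pre>`/`<font>` elements) rather than table markup, which makes the data hard to extract.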

                             賣空成交量                      成交量

代號 股票名稱          股數(SH)       金額($)        股數(SH)         金額($)

  1 長和　　　　　　     299,500    27,572,475       2,201,171       202,964,029
  2 中電控股　　　　      61,000     4,622,825       1,452,853       110,040,699
  3 香港中華煤氣　　   2,939,000    42,694,880       8,024,558       116,691,466
  4 九龍倉集團　　　     297,000    17,349,550       3,136,238       183,105,286
  5 匯豐控股　　　　   1,102,800    73,202,940       8,630,868       572,622,103
  6 電能實業　　　　   1,016,500    76,262,725       4,876,990       365,926,231
  8 電訊盈科　　　　     731,000     3,478,240      13,579,323        64,672,175
 10 恒隆集團　　　　     172,000     5,209,850         967,980        29,308,292
 11 恒生銀行　　　　     189,000    30,047,370       1,075,185       170,873,130
 12 恒基地產　　　　      94,000     4,025,500       1,382,533        59,183,598
 14 希慎興業　　　　      33,000     1,167,900         642,424        22,747,393
 16 新鴻基地產　　　     425,000    45,490,800       1,635,959       175,284,039
 17 新世界發展　　　     651,000     5,833,670      10,135,381        90,633,244
 19 太古股份公司Ａ　     132,000    10,405,600         554,962        43,709,235
 20 會德豐　　　　　      72,000     3,407,750         683,368        32,286,993
 23 東亞銀行　　　　     451,600    14,991,890       1,817,000        60,295,348
 27 銀河娛樂　　　　   1,134,000    40,408,550      15,089,117       538,712,668
 31 航天控股　　　　     210,000       211,580       4,367,526         4,386,198
 34 九龍建業　　　　      31,000       228,260         292,000         2,156,291
 35 遠東發展　　　　      10,000        33,600         428,075         1,440,321
 38 第一拖拉機股份　       8,000        38,200       1,634,000         7,825,940
 41 鷹君　　　　　　      12,000       422,400         470,146        16,546,562
 45 大酒店　　　　　      35,500       305,605         503,559         4,335,522

import requests
from bs4 import BeautifulSoup

url = "http://www.hkex.com.hk/chi/stat/smstat/dayquot/d20170202c.htm"

response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "lxml")

How should I extract the data in this table?

3 answers

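Why go to the trouble of using BeautifulSoup? That is a sledgehammer to crack a nut.

Your page is just rows of plain data, and the format could hardly be simpler.

Copy the data from the page, save it as a .txt file, and then extract the fields with readline, split, and regular expressions.

Answered 2022-11-12 01:40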
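The split-based approach described above can be sketched as follows. This is only a minimal illustration, assuming each data row starts with a numeric stock code and ends with four comma-grouped numbers; the function name `parse_row` and the inline `sample` text are made up for the example. With a saved file, the same function could be applied to each line read from it.

```python
def parse_row(line):
    """Split one report line into [code, name, four integer columns]."""
    # str.split() with no argument also splits on the full-width
    # padding spaces (U+3000) that follow the stock name.
    parts = line.split()
    if len(parts) < 6 or not parts[0].isdigit():
        return None  # header, blank line, or anything that is not a data row
    try:
        numbers = [int(p.replace(",", "")) for p in parts[-4:]]
    except ValueError:
        return None  # the last four fields were not all numeric
    # Anything between the code and the four numbers belongs to the name.
    name = "".join(parts[1:-4])
    return [int(parts[0]), name] + numbers


# Assumed sample: two data rows copied from the question plus a header line.
sample = """\
  1 長和　　　　　　     299,500    27,572,475       2,201,171       202,964,029
 19 太古股份公司Ａ　     132,000    10,405,600         554,962        43,709,235
代號 股票名稱 股數(SH) 金額($) 股數(SH) 金額($)
"""

rows = [r for r in map(parse_row, sample.splitlines()) if r is not None]
for row in rows:
    print(row)
```

Rejoining the middle fields keeps names that contain spaces intact, at the cost of dropping those spaces.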

谢撒旦法_774

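Here is one approach.

The data is all plain text with no tags wrapped around it, and capturing the network traffic did not reveal any dedicated data query endpoint. So the data is most likely generated on the server and sent to the browser hard-coded in the HTML.
Each field occupies a column of the same width and is right-aligned, and each field has a specific format, so the values can be extracted with a regular expression.
I will leave the actual implementation to you.

Answered 2022-11-12 01:40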
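As a rough sketch of the regular-expression idea, the pattern below matches one data row. It assumes the row layout shown in the question (numeric code, name, then four comma-grouped numbers); the pattern, the group names, and the helper `parse_line` are illustrative choices, not something taken from the page itself.

```python
import re

# One row: code, name, then four comma-grouped numeric columns.
# In Python 3, \s also matches the full-width padding space (U+3000).
ROW_RE = re.compile(
    r"^\s*(?P<code>\d+)\s+(?P<name>\S.*?)\s+"
    r"(?P<short_shares>[\d,]+)\s+(?P<short_amount>[\d,]+)\s+"
    r"(?P<shares>[\d,]+)\s+(?P<amount>[\d,]+)\s*$"
)


def parse_line(line):
    """Return a dict for a data row, or None if the line does not match."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    record = m.groupdict()
    for key, value in record.items():
        if key != "name":
            record[key] = int(value.replace(",", ""))
    return record


line = "  1 長和　　　　　　     299,500    27,572,475       2,201,171       202,964,029"
print(parse_line(line))
```

The lazy `name` group grows only as far as needed for the four numeric groups to reach the end of the line, so names containing spaces still match.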

陈先森的记忆

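Solution 1:

First locate the short-selling turnover section with a = soup.find('a', attrs={'name':'short_selling'}). Then follow the pre->font sibling relationship and keep walking down the page, stopping at the first line that has fewer than 6 columns.

Here is the result: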

[['代號', '股票名稱', '股數(SH)', '金額($)', '股數(SH)', '金額($)'],
 ['1', '長和', '299,500', '27,572,475', '2,201,171', '202,964,029'],
 ['2', '中電控股', '61,000', '4,622,825', '1,452,853', '110,040,699'],
 ['3', '香港中華煤氣', '2,939,000', '42,694,880', '8,024,558', '116,691,466'],
....

Source code

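import pprint

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.hkex.com.hk/chi/stat/smstat/dayquot/d170202c.htm')
r.encoding = 'big5'  # decode the response body as Big5
soup = BeautifulSoup(r.text, 'lxml')  # name the parser explicitly

# The short-selling section starts at the <a name="short_selling"> anchor.
a = soup.find('a', attrs={'name': 'short_selling'})
data = []

pre = a.find_parent('pre')

# The <pre> block containing the anchor holds the 6-column header row.
for line in pre.font.text.splitlines():
    item = line.strip().split()
    if len(item) == 6:
        data.append(item)

end = False

# Visit only the following <pre> siblings; text nodes have no .font.
for next_pre in pre.find_next_siblings('pre'):
    if next_pre.font is None:
        continue
    for line in next_pre.font.text.splitlines():
        item = line.strip().split()
        if len(item) > 6:
            # The stock name contained spaces: rejoin it, keeping the code first.
            item = item[:1] + ["".join(item[1:-4])] + item[-4:]
        elif len(item) < 6:
            # Fewer than 6 columns means the table has ended.
            end = True
            break
        data.append(item)
    if end:
        break

pprint.pprint(data)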
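
The rows collected above are still lists of strings. The sketch below shows one possible way to turn them into typed records. The field names in `FIELDS` and the helper `to_records` are hypothetical, and the `data` literal simply repeats the first rows of the result shown earlier. Fixed field names are used instead of the header row because the header repeats '股數(SH)' and '金額($)', which would collide as dictionary keys.

```python
# Hypothetical field names; the page header repeats two column titles.
FIELDS = ["code", "name", "short_shares", "short_amount", "shares", "amount"]


def to_records(data):
    """Convert [header, row, row, ...] string lists into dicts with ints."""
    _header, *rows = data  # the header row is skipped
    records = []
    for row in rows:
        record = dict(zip(FIELDS, row))
        record["code"] = int(record["code"])
        for key in FIELDS[2:]:
            # Strip the thousands separators before converting.
            record[key] = int(record[key].replace(",", ""))
        records.append(record)
    return records


# Assumed input: the first rows of the result printed by the script.
data = [
    ['代號', '股票名稱', '股數(SH)', '金額($)', '股數(SH)', '金額($)'],
    ['1', '長和', '299,500', '27,572,475', '2,201,171', '202,964,029'],
    ['2', '中電控股', '61,000', '4,622,825', '1,452,853', '110,040,699'],
]

records = to_records(data)
for record in records:
    print(record)
```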

Answered 2022-11-12 01:40

书友55218170
