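I did this purely out of curiosity about who actually uses Douyin. Records scraped: 2,999,801.
(1) I used AnyProxy to inspect the app's network requests and found a URL that directly returns the follower list of a given user.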
https://api.amemv.com/aweme/v1/user/follower/list/?user_id=96744033525&max_time=1527236030&count=20&retry_type=no_retry&iid=31995420310&device_id=51778233807&ac=wifi&channel=aweGW&aid=1128&app_name=aweme&version_code=181&version_name=1.8.1&device_platform=android&ssmix=a&device_type=MI+6&device_brand=Xiaomi&language=zh&os_api=26&os_version=8.0.0&uuid=863264038588223&openudid=99b06d2a82221c9c&manifest_version_code=181&resolution=1080*1920&dpi=480&update_version_code=1810&_rticket=1527236030783&ts=1527236030&as=a1e53c30eeab4b25475991&cp=c3b0bf58e7780452e1bkbe&mas=000de87c6bd683b509ae83095f3572eb948c9c9cacec2cac4c462c
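I tried to forge the request parameters. No such luck. Without knowing the exact algorithm, the as, cp and mas parameters cannot be forged, so fetching data directly through the URL failed.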
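To see which parts of the captured request are ordinary parameters and which are the signature fields, the query string can be split apart. This is only a minimal sketch: the URL is a shortened copy of the one above (most device fields omitted), and `split_params` and `SIGNATURE_KEYS` are names made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the captured follower-list URL (most device fields omitted).
CAPTURED_URL = (
    "https://api.amemv.com/aweme/v1/user/follower/list/"
    "?user_id=96744033525&max_time=1527236030&count=20"
    "&ts=1527236030&as=a1e53c30eeab4b25475991"
    "&cp=c3b0bf58e7780452e1bkbe"
    "&mas=000de87c6bd683b509ae83095f3572eb948c9c9cacec2cac4c462c"
)

# The parameters that could not be reproduced by hand.
SIGNATURE_KEYS = ("as", "cp", "mas")


def split_params(url):
    """Return (plain_params, signature_params) as two dicts of single values."""
    query = parse_qs(urlsplit(url).query)
    flat = {key: values[0] for key, values in query.items()}
    signed = {k: v for k, v in flat.items() if k in SIGNATURE_KEYS}
    plain = {k: v for k, v in flat.items() if k not in SIGNATURE_KEYS}
    return plain, signed


plain, signed = split_params(CAPTURED_URL)
print("plain:", plain)
print("signed:", signed)
```

Changing anything in `plain` without recomputing the values in `signed` is exactly what did not work here.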
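(2) Next I tried decompiling the Douyin APK. On the phone I used Xposed to hook the network requests directly and, together with adb forward, exposed a URL of my own. It takes the uid of a big account, the current timestamp, and the number of followers to fetch. The next page is requested with the min_time returned by the current page. After a lot of effort I finally got data, only to find that a single big account yields just 30,000 followers before has_more turns false. On top of that, the Douyin process stops responding after running continuously for about an hour, so after every 100 big accounts I have to kill the process and start it again.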
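The paging rule described above (feed each page's min_time into the next request until has_more is false) can be sketched roughly like this. The hook's real interface is not shown in this post, so `fetch_page` is a hypothetical callable, the `followers` key is an assumed field name, and `fake_fetch_page` is just a stand-in so the loop can run on its own.

```python
def fetch_all_followers(fetch_page, uid, start_time, page_size=20):
    """Collect followers page by page until has_more is false.

    fetch_page(uid, max_time, count) is a hypothetical callable that returns
    one decoded JSON page as a dict.
    """
    followers = []
    max_time = start_time
    while True:
        page = fetch_page(uid, max_time, page_size)
        followers.extend(page.get("followers", []))  # assumed field name
        if not page.get("has_more"):
            break
        next_time = page["min_time"]
        if next_time >= max_time:
            # Guard against a cursor that does not move, which would loop forever.
            break
        max_time = next_time
    return followers


def fake_fetch_page(uid, max_time, count):
    """Stand-in for the phone-side endpoint: serves ids below max_time."""
    ids = list(range(max_time - 1, 0, -1))[:count]
    return {
        "followers": [{"uid": str(i)} for i in ids],
        "min_time": ids[-1] if ids else 0,
        "has_more": bool(ids) and ids[-1] > 1,
    }


result = fetch_all_followers(fake_fetch_page, uid="96744033525", start_time=51)
print(len(result))
```

In the real crawl, `fetch_page` would be an HTTP call to the forwarded local port, and the loop would simply end at around 30,000 followers when has_more flips to false.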
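In Terminal, enter:
adb forward tcp:18390 tcp:18390 # accessing port 18390 on the computer is now the same as accessing port 18390 on the phone.
In Python 3: kill and restart the Douyin process, then return to the phone's home screen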
import subprocess
import time

# Forward local port 18390 to port 18390 on the phone.
subprocess.call("adb forward tcp:18390 tcp:18390", shell=True)
time.sleep(3)
# Kill the Douyin process.
subprocess.call("adb shell am force-stop com.ss.android.ugc.aweme", shell=True)
time.sleep(10)
# Start the Douyin app.
subprocess.call("adb shell am start com.ss.android.ugc.aweme/com.ss.android.ugc.aweme.main.MainActivity", shell=True)
time.sleep(15)
# Return to the home screen so Douyin runs in the background;
# otherwise videos keep playing and drain the battery.
subprocess.call("adb shell input keyevent 3", shell=True)
time.sleep(3)
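The "restart after every 100 big accounts" routine could be wired into the crawl loop along these lines. This is a sketch, not the script I actually ran: `crawl_in_batches`, `crawl_one` and `restart_app` are hypothetical names, and `restart_app` would wrap the adb force-stop / start sequence.

```python
def crawl_in_batches(uids, crawl_one, restart_app, batch_size=100):
    """Call crawl_one(uid) for every uid, restarting the app between batches.

    crawl_one and restart_app are hypothetical callables supplied by the
    caller. Returns the number of restarts performed.
    """
    restarts = 0
    for index, uid in enumerate(uids):
        if index > 0 and index % batch_size == 0:
            restart_app()
            restarts += 1
        crawl_one(uid)
    return restarts


# Dry run with stand-ins: record which uids were crawled and when restarts fired.
seen = []
events = []
count = crawl_in_batches(
    uids=range(250),
    crawl_one=seen.append,
    restart_app=lambda: events.append(len(seen)),
    batch_size=100,
)
print(count, events)
```

Passing the two steps in as callables keeps the batching logic testable without a phone attached.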
Postman test
Got the API working by hand in Postman
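(3) How to get the ids of big accounts in bulk?
I used AnyProxy to inspect the network requests behind Douyin's trending searches and found this URL: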
https://aweme.snssdk.com/aweme/v1/challenge/fresh/aweme/?ch_id=1599721829135383&query_type=0&cursor=1&count=1&type=5&retry_type=no_retry&iid=31995420310&device_id=%s&ac=wifi&channel=aweGW&aid=1128&app_name=aweme&version_code=181&version_name=1.8.1&device_platform=android&ssmix=a&device_type=MI+6&device_brand=Xiaomi&language=en&os_api=26&os_version=8.0.0&uuid=%s&openudid=%s&manifest_version_code=181&resolution=1080*1920&dpi=480&update_version_code=181
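Here ch_id is the id of a trending topic. Walking through just the first ch_id yields the uids of at least 5,000 big accounts, which is enough. A Python loop then fetches the data and inserts it into the database. One problem along the way: when the Douyin process runs continuously for more than an hour, the phone shuts itself down…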
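The "loop, fetch, insert" step might look roughly like the sketch below. Several things here are assumptions: the `aweme_list -> author -> uid` layout of the response, the `big_accounts` table name, and the use of an in-memory SQLite database in place of the MySQL instance I actually used, purely so the example runs on its own.

```python
import sqlite3


def extract_uids(page):
    """Pull author uids out of one decoded response page.

    The aweme_list -> author -> uid layout is an assumption about the JSON.
    """
    uids = []
    for item in page.get("aweme_list", []):
        uid = item.get("author", {}).get("uid")
        if uid:
            uids.append(str(uid))
    return uids


def save_uids(conn, uids):
    """Insert uids, silently skipping ones that are already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO big_accounts (uid) VALUES (?)",
        [(uid,) for uid in uids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_accounts (uid TEXT PRIMARY KEY)")

# Two made-up pages standing in for successive responses.
pages = [
    {"aweme_list": [{"author": {"uid": "1001"}}, {"author": {"uid": "1002"}}]},
    {"aweme_list": [{"author": {"uid": "1002"}}, {"author": {}}]},
]
for page in pages:
    save_uids(conn, extract_uids(page))

stored = [row[0] for row in conn.execute("SELECT uid FROM big_accounts ORDER BY uid")]
print(stored)
```

The primary key plus `INSERT OR IGNORE` handles the same author showing up under a topic more than once.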
(4) Results.
Douyin user profile data
Total number of records scraped
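(5) Data analysis.
1. User age distribution chart. MySQL query.
# A large share of people, me included, just scroll through Douyin casually
# and never fill in things like age. So first, count the users who did.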
SELECT count(1) FROM yk_ios_cloud.douyin_fans where birthday != '';
# count(1)
'1411395'
About 1.41 million users filled in their age (birthday).
# Note: the first bucket is user_age <= 10 and the second starts at 11, so age 10 is counted under '(-∞,10)'.
select '(-∞,10)' value, sum(case when user_age<=10 then 1 else 0 end) counts from (
    SELECT (YEAR(CURDATE())-YEAR(birthday)) as user_age from yk_ios_cloud.douyin_fans where 1
) TA
union
select '[10,20)' value, sum(case when (user_age>=11 and user_age<20) then 1 else 0 end) counts from (
    SELECT (YEAR(CURDATE())-YEAR(birthday)) as user_age from yk_ios_cloud.douyin_fans where 1
) TA
union
select '[20,35)' value, sum(case when (user_age>=20 and user_age<35) then 1 else 0 end) counts from (
    SELECT (YEAR(CURDATE())-YEAR(birthday)) as user_age from yk_ios_cloud.douyin_fans where 1
) TA
union
select '[35,+∞)' value, sum(case when user_age>=35 then 1 else 0 end) counts from (
    SELECT (YEAR(CURDATE())-YEAR(birthday)) as user_age from yk_ios_cloud.douyin_fans where 1
) TA
# value, counts
'(-∞,10)', '96697'
'[10,20)', '391988'
'[20,35)', '836370'
'[35,+∞)', '86340'
When the count finished and I saw 96,697 users aged 10 or under, I was shocked…
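For anyone who finds the CASE expressions hard to read, the same bucketing can be written in Python. This is just a sketch mirroring the conditions in the query: `age_bucket` is a made-up helper, the 'YYYY-MM-DD' birthday format is an assumption about how the column is stored, and the sample birthdays are invented.

```python
from collections import Counter


def age_bucket(birthday, current_year):
    """Return the bucket label for a 'YYYY-MM-DD' birthday, or None if blank.

    Age is computed as year difference only, like YEAR(CURDATE())-YEAR(birthday).
    """
    if not birthday:
        return None
    age = current_year - int(birthday[:4])
    if age <= 10:
        return "(-∞,10)"   # includes age 10, as in the SQL
    if age < 20:
        return "[10,20)"   # effectively 11..19
    if age < 35:
        return "[20,35)"
    return "[35,+∞)"


# Invented sample rows; the empty string stands for a user with no birthday.
sample = ["2008-05-01", "2000-01-01", "1990-12-31", "1970-06-15", ""]
counts = Counter(age_bucket(b, 2018) for b in sample if b)
print(counts)
```

Because only the year is compared, someone born in December counts as a full year older from January 1, so the buckets are approximate at the edges.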
Generating the age distribution chart.
Python 3 code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt


def main():
    plt.figure(figsize=(6, 9))
    labels = ['-∞,10', '10,20', '20,35', '35,+∞']
    # Raw counts from the MySQL query; pie() normalizes them to percentages.
    sizes = [96697, 391988, 836370, 86340]
    colors = ['red', 'yellowgreen', 'lightskyblue', 'yellow']
    explode = [0, 0, 0, 0]
    patches, l_text, p_text = plt.pie(
        x=sizes,
        explode=explode,
        labels=labels,
        colors=colors,
        labeldistance=1.05,
        autopct='%3.1f%%',
        shadow=False,
        startangle=90,
        pctdistance=0.6)
    # Use the same font size for the bucket labels and the percentage labels.
    for t in l_text:
        t.set_size(10)
    for t in p_text:
        t.set_size(10)
    plt.axis('equal')  # keep the pie circular
    plt.legend()
    plt.show()


if __name__ == '__main__':
    main()
2. Gender distribution chart. MySQL query
SELECT count(1) as total_gender_not_null FROM yk_ios_cloud.douyin_fans where gender != 0;
# total_gender_not_null
'1686323'
gender = 2 # female
gender = 1 # male
boy: '798336'
girl: '887987'
User gender distribution chart
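The gender split is easier to read as percentages than as raw counts. A small sketch using the two counts above (`shares` is a helper name made up here):

```python
def shares(counts):
    """Convert a {label: count} mapping to {label: percentage of the total}."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Counts taken from the query results above (gender = 1 male, gender = 2 female).
gender_counts = {"male": 798336, "female": 887987}
gender_shares = shares(gender_counts)
for label, pct in gender_shares.items():
    print(f"{label}: {pct:.1f}%")
```

Keep in mind these shares only cover users whose gender field is set, not all 2,999,801 scraped records.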
3. Frustratingly, the location in the user info I got is only ever CN, so there is no chart of users by city.
4. Word cloud of user signatures.
Signature word cloud.
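A word cloud is drawn from word frequencies, so the core of this step is a frequency count over all signatures. The sketch below is not the code behind the image: `signature_frequencies` is a made-up helper, the sample signatures are invented, and the default tokenizer only splits on whitespace and punctuation. For Chinese text a proper segmenter (jieba, for example) would have to be passed in as `tokenize`, which is an assumption about tooling rather than something shown in this post.

```python
import re
from collections import Counter


def signature_frequencies(signatures, tokenize=None, min_length=2):
    """Count tokens across all signatures.

    tokenize is an optional callable (text -> iterable of tokens). The default
    only splits on whitespace and punctuation, which is too crude for Chinese.
    """
    if tokenize is None:
        def tokenize(text):
            return re.split(r"[\s,.!?;:，。！？；：、]+", text)

    counter = Counter()
    for text in signatures:
        for token in tokenize(text):
            token = token.strip()
            if len(token) >= min_length:  # drop empty and one-character tokens
                counter[token] += 1
    return counter


# Invented sample signatures; the empty string stands for a blank signature.
sample = ["love music, love life", "music is life!", ""]
freq = signature_frequencies(sample)
print(freq.most_common(3))
```

The resulting counter is the kind of word-to-count mapping that a word cloud renderer takes as input.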
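The data I collected also includes the videos each user has posted and liked. I have no time to analyze that for now… I'll get to it when I can.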