Python crawler - a multithreaded Python crawler: with daemon=False the main program cannot exit, with daemon=True it can

 小洲相册居士 posted on 2022-10-27 14:22

The code was tested under Python 2.7 and runs as-is.

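Could an expert point me in the right direction? I have been reading about daemon for a long time and still cannot figure it out. Please also look at the comments under the answer by shomy, where I added some more details.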

Problem description

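If line 54 of mutiple.py is changed to t.daemon=False (in the listing below, this is the line worker.daemon = True), then after all the images have been downloaded the program stays stuck here and never exits.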

$ python mutiple.py
一共下载了 253 张图片
Took 57.710124015808105s
...now it hangs and can only be killed with kill -9

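Next I ran $ pstree -h | grep python. Clearly the main thread and its child threads have not exited. Why is that? queue.join() was called on the Queue and has returned, and the print statements printed successfully, so the child threads should have finished their work.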

python(6591)-+-{python}(6596)
            |-{python}(6597)
            |-{python}(6598)
            |-{python}(6599)
            |-{python}(6600)
            |-{python}(6601)
            |-{python}(6602)
            '-{python}(6603)
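
The state shown by pstree can be reproduced in miniature. The following is a minimal sketch, not the code from mutiple.py: worker and demo are hypothetical names, and the thread is made a daemon only so that the sketch itself can never hang. It shows that Queue.join() returns as soon as every item has been marked with task_done(), while the worker thread is still alive, blocked in get().

```python
import threading

try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2.7, as in the question


def worker(q):
    """Hypothetical worker: same loop shape as DownloadWorker.run()."""
    while True:
        item = q.get()        # blocks forever once the queue is empty
        if item is None:
            q.task_done()
            break
        q.task_done()         # the real worker would download here first


def demo():
    q = Queue()
    t = threading.Thread(target=worker, args=(q,))
    t.daemon = True           # only so this sketch cannot hang on exit
    t.start()

    for n in range(3):
        q.put(n)

    q.join()                  # returns: all three items are marked done
    alive_after_join = t.is_alive()

    q.put(None)               # sentinel lets run() return
    t.join()
    return alive_after_join, t.is_alive()
```

Calling demo() returns a pair of flags: whether the thread was alive right after join(), and whether it is alive after the sentinel was consumed.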

Code for mutiple.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Queue import Queue
from threading import Thread
from time import time
from itertools import chain
from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    # Prints "Downloaded N images in total"
    print u'一共下载了 {} 张图片'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
一共下载了 253 张图片
Took 57.710124015808105s
"""

Code for download.py

#!/usr/bin/env python
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

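While a program runs, it executes a main thread. If the main thread creates a child thread, the two go their separate ways and run independently. When the main thread finishes and wants to exit, it checks whether the child thread has finished; if it has not, the main thread waits for the child thread to finish before exiting. Sometimes, though, what we want is for the child thread to exit together with the main thread as soon as the main thread is done, whether or not the child has finished. That is what setDaemon(True) is for (it must be called before start()).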
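
The difference can be observed from the outside. The following is a minimal sketch under stated assumptions: CHILD is a hypothetical throwaway script and run_child a hypothetical helper, and a one-second sleep stands in for a worker that is still busy. The script is run in a separate interpreter once with a daemon thread and once with a non-daemon thread, and its output is collected.

```python
import subprocess
import sys
import textwrap

# Hypothetical child script: starts one worker thread, then the main
# thread finishes immediately.
CHILD = textwrap.dedent("""
    import sys, threading, time

    def worker():
        time.sleep(1)              # stand-in for work still in progress
        print("worker finished")

    t = threading.Thread(target=worker)
    t.daemon = (sys.argv[1] == "daemon")   # must be set before start()
    t.start()
    print("main done")
""")


def run_child(mode):
    """Run CHILD in a fresh interpreter and return its output lines."""
    out = subprocess.check_output([sys.executable, "-c", CHILD, mode])
    return out.decode().splitlines()


if __name__ == "__main__":
    print(run_child("daemon"))       # worker is abandoned at exit
    print(run_child("non-daemon"))   # interpreter waits for the worker
```

With the daemon thread the interpreter exits without waiting, so the worker never gets to print. With the non-daemon thread the interpreter waits for the worker before exiting.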

1 answer
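  • My understanding is as follows:

    1. setDaemon(True) marks the thread as a daemon thread: when it is set to True, the child threads are forcibly terminated when the main thread ends.

    2. queue.join() blocks the main thread until every item in the queue has been marked with task_done(); only then does the main thread continue. It does not wait for the worker threads themselves to exit.

    3. Threads provide no exit function: a thread cannot be stopped from the outside and only ends when its run() returns.

    Putting these three points together: with setDaemon(False), the main thread keeps waiting at exit for the child threads to finish. The workers, however, are still blocked in queue.get(), because nothing ever puts None into the queue, so the program hangs.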
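
One common way out is to keep daemon=False and tell the workers to stop. The following is a minimal sketch, not the original crawler: worker and run are hypothetical names, and doubling a number stands in for download_link(directory, link). After queue.join() returns, one None sentinel is queued per worker so that every run loop reaches its break, and the threads are then joined.

```python
import threading

try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2.7, as in the question


def worker(q, results):
    while True:
        item = q.get()
        if item is None:           # sentinel: no more work
            q.task_done()
            break
        results.append(item * 2)   # stand-in for download_link(...)
        q.task_done()


def run(num_workers=8, num_tasks=20):
    q = Queue()
    results = []
    threads = []
    for _ in range(num_workers):
        t = threading.Thread(target=worker, args=(q, results))
        t.daemon = False           # non-daemon, as in the failing case
        t.start()
        threads.append(t)

    for n in range(num_tasks):
        q.put(n)

    q.join()                       # all real tasks are done

    for _ in threads:              # one sentinel per worker
        q.put(None)
    for t in threads:              # wait until every run loop has returned
        t.join()
    return results, threads
```

Because every worker consumes exactly one sentinel and then returns, no non-daemon thread is left blocked in get(), and the interpreter can exit normally.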

    Answered 2022-10-28 12:19