
python crawler - A multithreaded Python crawler: with daemon=False the main program never exits, with daemon=True it exits

Posted by 小洲相册居士 on 2022-10-27 14:22


The code was tested under Python 2.7 and runs as-is.

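Could an expert point me in the right direction? I have been reading about daemon for a long time and still don't understand it. Please also look at my comments under shomy's answer, where I added some more details.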

Problem description

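If line 54 of mutiple.py is changed to t.daemon=False (in the listing below, this is the line worker.daemon = True), the program hangs at this point after all the images have been downloaded and never exits.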

$ python mutiple.py
一共下载了 253 张图片
Took 57.710124015808105s
...now it hangs and can only be stopped with kill -9

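Next I ran $ pstree -h | grep python. Clearly the main thread and its child threads have not exited. Why is that? queue.join() has already returned and the print statements ran successfully, so the child threads should have finished their work.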

python(6591)-+-{python}(6596)
            |-{python}(6597)
            |-{python}(6598)
            |-{python}(6599)
            |-{python}(6600)
            |-{python}(6601)
            |-{python}(6602)
            '-{python}(6603)

Code for mutiple.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Queue import Queue
from threading import Thread
from time import time
from itertools import chain
from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'一共下载了 {} 张图片'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
一共下载了 253 张图片
Took 57.710124015808105s
"""

Code for download.py

#!/usr/bin/env python
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

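A running program executes a main thread. If the main thread creates a child thread, the two go their separate ways and run independently. When the main thread finishes and wants to exit, it checks whether the child thread has finished; if it has not, the main thread waits for it before exiting. Sometimes, though, we want the child threads to exit together with the main thread as soon as the main thread is done, whether or not they have finished. That is what setDaemon(True) is for.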
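The difference can be illustrated with a small sketch. The helper names start_blocked_thread and holds_interpreter_open are made up for this example, and the thread blocks on an Event as a stand-in for a worker stuck in queue.get():

```python
import threading


def start_blocked_thread(release, daemon):
    """Start a thread that blocks until `release` is set."""
    t = threading.Thread(target=release.wait)
    t.daemon = daemon  # must be set before start()
    t.start()
    return t


def holds_interpreter_open(threads):
    """Return the threads the interpreter would wait for at exit:
    those that are still alive and are not daemon threads."""
    return [t for t in threads if t.is_alive() and not t.daemon]


if __name__ == '__main__':
    release = threading.Event()
    daemon_t = start_blocked_thread(release, daemon=True)
    normal_t = start_blocked_thread(release, daemon=False)
    # Only the non-daemon thread keeps the process from exiting.
    print(holds_interpreter_open([daemon_t, normal_t]) == [normal_t])
    # Without this line the script would hang at exit, waiting on normal_t.
    release.set()
    normal_t.join()
```

If release.set() is left out, the script shows the same symptom as the question: the main code is finished, but the process stays alive because one non-daemon thread is still blocked.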

1 answer

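My understanding is this:
1. setDaemon(True) marks a thread as a daemon thread: when it is set to True, the child threads are forcibly terminated as soon as the main thread ends.
2. queue.join() blocks the main thread until every item put on the queue has been marked with task_done(); only then does the main thread continue. It does not wait for the worker threads themselves to exit.
3. Threads do not provide a function for making them exit from the outside.
Putting these three points together: with setDaemon(False), the main thread keeps waiting for the child threads to exit. The workers never receive the None that would break their loop, so they stay blocked in queue.get() and the program hangs.
Answered 2022-10-28 12:19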
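One possible way out, assuming the None check in DownloadWorker.run was meant as a stop signal, is to put one None per worker on the queue once the real work is done and then join the threads. The sketch below is not the asker's code: run_all and handle are hypothetical names, with handle standing in for download_link, and the import fallback is only there so the same file runs on both Python 2.7 and Python 3.

```python
import threading

try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2.7, as used in the question


def worker(queue, handle):
    while True:
        item = queue.get()
        if item is None:
            # Sentinel: no more work. Mark it done, then leave the loop
            # so the thread can finish on its own.
            queue.task_done()
            break
        handle(item)
        queue.task_done()


def run_all(items, handle, num_workers=8):
    queue = Queue()
    threads = [threading.Thread(target=worker, args=(queue, handle))
               for _ in range(num_workers)]
    for t in threads:
        t.daemon = False  # non-daemon: the process waits for these threads
        t.start()
    for item in items:
        queue.put(item)
    queue.join()          # every real item has been processed
    for _ in threads:
        queue.put(None)   # one sentinel per worker
    for t in threads:
        t.join()          # workers have actually exited
    return threads
```

With the sentinels in place the workers leave their loops, so the process can exit normally even with daemon=False. A real item that is itself None would be mistaken for the sentinel, which is fine here because the queued items are tuples.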

ex7776647
