If line 54 of mutiple.py is changed to `t.daemon = False`, then after all the images have finished downloading the program just hangs there and never exits.
```
$ python mutiple.py
一共下载了 253 张图片
Took 57.710124015808105s
```

At this point the process is stuck; the only way to stop it is `kill -9`.
Next I ran `$ pstree -h | grep python`. Clearly the main thread and its child threads have not exited. Why? `queue.join()` has already returned, and the `print` statements were executed successfully, so the worker threads should have finished their work.
```
python(6591)-+-{python}(6596)
             |-{python}(6597)
             |-{python}(6598)
             |-{python}(6599)
             |-{python}(6600)
             |-{python}(6601)
             |-{python}(6602)
             `-{python}(6603)
```
The code of mutiple.py:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Queue import Queue
from threading import Thread
from time import time
from itertools import chain

from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):

    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()
    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()
    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))
    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'一共下载了 {} 张图片'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
一共下载了 253 张图片
Took 57.710124015808105s
"""
```
The code of download.py:
```python
#!/usr/bin/env python
import os

import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    ''' return the links in a list '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    ''' download the img by the link and save it '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    ''' set the dir and create a new dir if not exists '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir
```
When a program runs, it executes a main thread. If the main thread creates a child thread, the two run side by side. When the main thread finishes and is about to exit, the interpreter checks whether the child threads have finished; if they have not, it waits for them to finish before exiting. Sometimes, though, we want the process to exit as soon as the main thread is done, regardless of whether the child threads have finished. That is exactly what `setDaemon(True)` (equivalently, setting `thread.daemon = True`) is for.
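A minimal standalone sketch of that daemon behaviour (this is illustrative code, not taken from the question's files; it runs on both Python 2 and 3):

```python
import threading
import time


def worker():
    # Stands in for a task that never finishes on its own
    time.sleep(60)


t = threading.Thread(target=worker)
t.daemon = True      # must be set before start()
t.start()

# Because t is a daemon thread, the interpreter does not wait for it:
# the process exits as soon as the main thread falls off the end here.
# With t.daemon = False, the process would instead hang for the full
# 60 seconds before exiting.
```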
My understanding is as follows:

1. `setDaemon(True)` marks a thread as a daemon thread: when the main thread ends, the interpreter exits and the daemon threads are forcibly terminated along with the process.
2. `queue.join()` blocks the main thread until every item put on the queue has been matched by a `task_done()` call; only then does the main thread continue.
3. The `threading` module provides no way to kill a thread from the outside.

Putting the three points together: each `DownloadWorker` loops forever in `while True`, and since `main()` never puts a `None` sentinel on the queue, `self.queue.get()` blocks forever once the last task is done. With `daemon = False`, the interpreter waits for these non-daemon threads to exit, which they never do, so the program hangs. With `daemon = True`, the interpreter simply kills them when the main thread finishes.
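If the goal is to keep `daemon = False` and still exit cleanly, the usual pattern is to enqueue one `None` sentinel per worker after `queue.join()` returns, so that the `if item is None: break` branch in `run()` actually fires. A minimal, self-contained sketch of that pattern (the doubling worker here is just a stand-in for `download_link`):

```python
from threading import Thread
try:
    from Queue import Queue   # Python 2
except ImportError:
    from queue import Queue   # Python 3

results = []


def worker(q):
    while True:
        item = q.get()
        if item is None:        # sentinel: leave the loop so the thread can end
            q.task_done()
            break
        results.append(item * 2)
        q.task_done()


q = Queue()
threads = [Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()                   # non-daemon: main must arrange their exit

for item in range(10):
    q.put(item)
q.join()                        # blocks until every task_done() call

for _ in threads:
    q.put(None)                 # one sentinel per worker
for t in threads:
    t.join()                    # every worker has returned; process can exit
```

Applied to mutiple.py, that means keeping the 8 workers in a list, and after `queue.join()` calling `queue.put(None)` eight times followed by `worker.join()` on each thread; then `daemon = False` no longer hangs.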