
YARN: maximum parallel Map task count


The following is mentioned in Hadoop: The Definitive Guide:


"What qualifies as a small job? By default one that has less than 10 mappers, only one reducer, and the input size is less than the size of one HDFS block. "

But how does YARN count the number of mappers in a job before executing it? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well? In YARN, containers are flexible. So is there any way of computing the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?


2 solutions

#1


But how does YARN count the number of mappers in a job before executing it? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well?


Yes, in YARN as well: if you are using MapReduce-based frameworks, the number of mappers depends on the input splits.

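For intuition, here is a minimal sketch (not Hadoop's actual FileInputFormat code) of how the mapper count falls out of the input splits; the 128 MB block size and the example file sizes below are assumptions for illustration.

import math

# Approximate FileInputFormat-style split counting: one mapper per split,
# where the split size defaults to the HDFS block size.
def num_mappers(file_sizes_bytes, block_size=128 * 1024 * 1024,
                min_split=1, max_split=float("inf")):
    split_size = max(min_split, min(max_split, block_size))
    return sum(math.ceil(size / split_size) for size in file_sizes_bytes if size > 0)

# Example: input files of 1 GB, 200 MB and 50 MB -> 8 + 2 + 1 = 11 mappers.
print(num_mappers([1 << 30, 200 * 1024 * 1024, 50 * 1024 * 1024]))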

In YARN, containers are flexible. So is there any way of computing the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?


The number of map tasks that can run in parallel on the YARN cluster depends on how many containers can be launched and run in parallel on the cluster. This ultimately depends on how you configure MapReduce in the cluster, which is explained clearly in this guide from Cloudera.


#2


mapreduce.job.maps = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

mapreduce.job.reduces = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.reduce.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.

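As a quick illustration of the formula above, here is a small Python sketch; the node sizes, per-task resources and node counts are assumptions, not recommended values.

# Rough upper bound on tasks that can run in parallel, per the formula above.
def max_parallel_tasks(node_mem_mb, node_vcores, task_mem_mb, task_vcores,
                       physical_drives, worker_nodes, workload_factor=2.0):
    per_node = min(node_mem_mb // task_mem_mb,
                   node_vcores // task_vcores,
                   physical_drives * workload_factor)
    return int(per_node * worker_nodes)

# Example: 10 workers, 96 GB and 16 vcores per node for containers, 6 data
# disks; 2 GB / 1 vcore per map task -> min(48, 16, 12) x 10 = 120 map slots.
print(max_parallel_tasks(96 * 1024, 16, 2048, 1, 6, 10))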

yarn.nodemanager.resource.memory-mb (memory available on a node for containers) = total system memory – reserved memory (e.g. 10–20% of memory for Linux and its daemon services) – memory allocated to other services on the node (HDFS DataNode, default 1024 MB; NodeManager; RegionServer, etc.) – resources for task buffers, such as the HDFS sort I/O buffer
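A minimal sketch of this memory budgeting, with all sizes assumed for illustration:

# Memory left on a node for YARN containers after OS and service overheads.
def container_memory_mb(total_mb, reserved_fraction=0.15, datanode_mb=1024,
                        nodemanager_mb=1024, other_services_mb=0,
                        task_buffers_mb=0):
    reserved_os = int(total_mb * reserved_fraction)   # Linux and daemon services
    return (total_mb - reserved_os - datanode_mb - nodemanager_mb
            - other_services_mb - task_buffers_mb)

# Example: a 128 GB node with ~15% reserved for the OS and 1 GB each for the
# DataNode and NodeManager -> roughly 107 GB for containers.
print(container_memory_mb(128 * 1024))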

Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores allocated to the NodeManager should be the lesser of either:


[(total vcores) – (number of vcores reserved for non-YARN use)] or [2 x (number of physical disks used for DataNode storage)]

So

yarn.nodemanager.resource.cpu-vcores = min{ ((total vcores) – (number of vcores reserved for non-YARN use)),  (2 x (number of physical disks used for DataNode storage))}

Available vcores on a node for containers = total number of vcores – vcores for the operating system (for estimating vcore demand, consider the number of concurrent processes or tasks each service runs as an initial guide; for the OS we take 2) – YARN NodeManager (default is 1) – HDFS DataNode (default is 1).
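The same logic as a sketch (the core and disk counts are assumptions):

# Vcores available for containers: the lesser of what is left after non-YARN
# reservations and 2 x the number of DataNode disks.
def container_vcores(total_vcores, datanode_disks, os_vcores=2,
                     nodemanager_vcores=1, datanode_vcores=1):
    non_yarn = os_vcores + nodemanager_vcores + datanode_vcores
    return min(total_vcores - non_yarn, 2 * datanode_disks)

# Example: 24 cores and 6 data disks -> min(24 - 4, 12) = 12 vcores.
print(container_vcores(24, 6))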

Note ==>

mapreduce.map.memory.mb is the combination of mapreduce.map.java.opts.max.heap plus some headroom (a safety margin).

The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted for mapper and reducer heap size, respectively. The mapreduce.[map | reduce].memory.mb settings specify the memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size. Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap setting. The optimal value depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default, and can be increased if your jobs contain many concurrent tasks.

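To make the 1.2x headroom rule concrete, a small sketch (the 1536 MB heap is an assumed example, not a recommendation):

# Container memory derived from the task heap size plus ~20% headroom.
def container_mb_from_heap(heap_mb, overhead_factor=1.2):
    return int(heap_mb * overhead_factor)

map_heap_mb = 1536                                        # mapreduce.map.java.opts.max.heap
map_container_mb = container_mb_from_heap(map_heap_mb)    # mapreduce.map.memory.mb ~ 1843
reduce_container_mb = 2 * map_container_mb                # reducer container ~ 2x the mapper
print(map_container_mb, reduce_container_mb)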


Reference –

  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html
  • http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
