
YARN: maximum parallel Map task count


The following is mentioned in Hadoop: The Definitive Guide:


"What qualifies as a small job? By default one that has less than 10 mappers, only one reducer, and the input size is less than the size of one HDFS block. "

But how does YARN count the number of mappers in a job before executing it? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well? In YARN, containers are flexible. So is there any way of computing the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?


2 solutions

#1


But how does YARN count the number of mappers in a job before executing it? In MR1 the number of mappers depends on the number of input splits. Does the same apply to YARN as well?


Yes, in YARN as well: if you are using MapReduce-based frameworks, the number of mappers depends on the input splits.

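For intuition, here is a minimal sketch (not Hadoop's actual FileInputFormat code) of how the mapper count falls out of the input splits; the 128 MB block size and the example file sizes below are assumptions for illustration.

import math

# Approximate FileInputFormat-style split counting: one mapper per split,
# where the split size defaults to the HDFS block size.
def num_mappers(file_sizes_bytes, block_size=128 * 1024 * 1024,
                min_split=1, max_split=float("inf")):
    split_size = max(min_split, min(max_split, block_size))
    return sum(math.ceil(size / split_size) for size in file_sizes_bytes if size > 0)

# Example: input files of 1 GB, 200 MB and 50 MB -> 8 + 2 + 1 = 11 mappers.
print(num_mappers([1 << 30, 200 * 1024 * 1024, 50 * 1024 * 1024]))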

In YARN, containers are flexible. So is there any way of computing the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?


The number of map tasks that can run in parallel on the YARN cluster depends on how many containers can be launched and run in parallel on the cluster. This ultimately depends on how you configure MapReduce in the cluster, which is explained clearly in this guide from Cloudera.


#2


mapreduce.job.maps = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

mapreduce.job.reduces = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.reduce.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.

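As a quick illustration of the formula above, here is a small Python sketch; the node sizes, per-task resources and node counts are assumptions, not recommended values.

# Rough upper bound on tasks that can run in parallel, per the formula above.
def max_parallel_tasks(node_mem_mb, node_vcores, task_mem_mb, task_vcores,
                       physical_drives, worker_nodes, workload_factor=2.0):
    per_node = min(node_mem_mb // task_mem_mb,
                   node_vcores // task_vcores,
                   physical_drives * workload_factor)
    return int(per_node * worker_nodes)

# Example: 10 workers, 96 GB and 16 vcores per node for containers, 6 data
# disks; 2 GB / 1 vcore per map task -> min(48, 16, 12) x 10 = 120 map slots.
print(max_parallel_tasks(96 * 1024, 16, 2048, 1, 6, 10))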

yarn.nodemanager.resource.memory-mb (memory available on a node for containers) = total system memory – reserved memory (e.g. 10–20% of memory for Linux and its daemon services) – memory allocated to other services on the node (HDFS DataNode, default 1024 MB; NodeManager; RegionServer, etc.) – resources for task buffers, such as the HDFS sort I/O buffer
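A minimal sketch of this memory budgeting, with all sizes assumed for illustration:

# Memory left on a node for YARN containers after OS and service overheads.
def container_memory_mb(total_mb, reserved_fraction=0.15, datanode_mb=1024,
                        nodemanager_mb=1024, other_services_mb=0,
                        task_buffers_mb=0):
    reserved_os = int(total_mb * reserved_fraction)   # Linux and daemon services
    return (total_mb - reserved_os - datanode_mb - nodemanager_mb
            - other_services_mb - task_buffers_mb)

# Example: a 128 GB node with ~15% reserved for the OS and 1 GB each for the
# DataNode and NodeManager -> roughly 107 GB for containers.
print(container_memory_mb(128 * 1024))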

Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores allocated to the NodeManager should be the lesser of either:


[(total vcores) – (number of vcores reserved for non-YARN use)] or [2 x (number of physical disks used for DataNode storage)]

So

yarn.nodemanager.resource.cpu-vcores = min{ ((total vcores) – (number of vcores reserved for non-YARN use)),  (2 x (number of physical disks used for DataNode storage))}

Available vcores on a node for containers = total number of vcores – vcores for the operating system (for estimating vcore demand, consider the number of concurrent processes or tasks each service runs as an initial guide; for the OS we take 2) – YARN NodeManager (default is 1) – HDFS DataNode (default is 1).
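The same logic as a sketch (the core and disk counts are assumptions):

# Vcores available for containers: the lesser of what is left after non-YARN
# reservations and 2 x the number of DataNode disks.
def container_vcores(total_vcores, datanode_disks, os_vcores=2,
                     nodemanager_vcores=1, datanode_vcores=1):
    non_yarn = os_vcores + nodemanager_vcores + datanode_vcores
    return min(total_vcores - non_yarn, 2 * datanode_disks)

# Example: 24 cores and 6 data disks -> min(24 - 4, 12) = 12 vcores.
print(container_vcores(24, 6))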

Note ==>

mapreduce.map.memory.mb is the combination of mapreduce.map.java.opts.max.heap plus some headroom (a safety margin).

The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted for mapper and reducer heap size, respectively. The mapreduce.[map | reduce].memory.mb settings specify the memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size. Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap setting. The optimal value depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default, and can be increased if your jobs contain many concurrent tasks.

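To make the 1.2x headroom rule concrete, a small sketch (the 1536 MB heap is an assumed example, not a recommendation):

# Container memory derived from the task heap size plus ~20% headroom.
def container_mb_from_heap(heap_mb, overhead_factor=1.2):
    return int(heap_mb * overhead_factor)

map_heap_mb = 1536                                        # mapreduce.map.java.opts.max.heap
map_container_mb = container_mb_from_heap(map_heap_mb)    # mapreduce.map.memory.mb ~ 1843
reduce_container_mb = 2 * map_container_mb                # reducer container ~ 2x the mapper
print(map_container_mb, reduce_container_mb)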


Reference –

  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html
  • http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
