热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

MapReduce执行计划及优化

WordCount:selectrank,count(*)cntfromcitygroupbyrank;Map与Reduce都是一个JVM进程,可以理解为都是一个独立

WordCount:

select rank, count(*) cnt from city group by rank;

Map与Reduce 都是一个JVM进程,可以理解为都是一个独立的应用程序

MapReduce 框架的作用就是自动帮你把map端的输出,经过shuffle后变为reduce端的输入

方法:

public void map(Text value,Context context);

public void reduce(Text key, Iterable values, Context context)

 

Shuffle:

 

 

 

common join

 

Map Join

 

 

 

 

 

MapReduce

word count:

 

 


### 执行计划
explain select
a.name,a.entity_id
from
tangbao.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| a |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| a |
| TableScan |
| alias: a |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+

explain select
a.name,a.entity_id
from
tangbao.shop a
left join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| value expressions: name (type: string) |
| TableScan |
| alias: b |
| Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
| value expressions: pt (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38, _col117 |
| Statistics: Num rows: 5304174985 Data size: 312946324142 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (_col117 = '20180501') (type: boolean) |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+

explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;

+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string), fee (type: double) |
| outputColumnNames: entity_id, curr_date, kindpay_id, fee |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: avg(fee), max(fee) |
| keys: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+

set hive.map.aggr=false;

set hive.groupby.skewindata=true;

explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;

+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: curr_date (type: string), entity_id (type: string), fee (type: double), kindpay_id (type: string) |
| outputColumnNames: curr_date, entity_id, fee, kindpay_id |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| sort order: +++ |
| Map-reduce partition columns: rand() (type: double) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: fee (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col0) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: partial1 |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: true |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: final |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+

set hive.optimize.skewjoin=true;

explain select
a.name,a.entity_id
from
ods_shop_org.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

+------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-7 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-4 depends on stages: Stage-1 , consists of Stage-8 |
| Stage-8 |
| Stage-3 depends on stages: Stage-8 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-7 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 443697 Data size: 88739406 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
| value expressions: name (type: string) |
| TableScan |
| alias: b |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| handleSkewJoin: true |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-4 |
| Conditional Operator |
| |
| Stage: Stage-8 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| 1 |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| 1 |
| TableScan |
| HashTable Sink Operator |
| keys: |
| 0 reducesinkkey0 (type: string) |
| 1 reducesinkkey0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 reducesinkkey0 (type: string) |
| 1 reducesinkkey0 (type: string) |
| outputColumnNames: _col9, _col38 |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
+------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+------------------------------------------------------------------------------------------------------------------+--+
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+------------------------------------------------------------------------------------------------------------------+--+

 

explain select
a.name,a.entity_id
from
ods_order_org.instancedetail a
left semi join
tangbao.shop b
on a.entity_id=b.entity_id
where a.pt='20180501';

+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| b |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| b |
| TableScan |
| alias: b |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: entity_id (type: string) |
| outputColumnNames: entity_id |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: entity_id (type: string) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 entity_id (type: string) |
| 1 _col0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 41134851 Data size: 8226970212 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 _col0 (type: string) |
| outputColumnNames: _col9, _col29 |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col29 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+

 

 



转:https://www.cnblogs.com/hyc123-/p/9190515.html



推荐阅读
author-avatar
乌鸦bz_371
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有