(7) Spark Streaming Operators Walkthrough: the repartition Operator

Author: 傲慢的小草7_170 | Source: Internet | 2023-10-12 12:06


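Contents
天小天: (1) Spark Streaming Operators Walkthrough: a brief introduction to how streaming runs
天小天: (2) Spark Streaming Operators Walkthrough: flatMap and mapPartitions
天小天: (3) Spark Streaming Operators Walkthrough: the transform operator
天小天: (4) Spark Streaming Operators Walkthrough: Kafka createDirectStream
天小天: (5) Spark Streaming Operators Walkthrough: foreachRDD
天小天: (6) Spark Streaming Operators Walkthrough: the glom operator
天小天: (7) Spark Streaming Operators Walkthrough: the repartition operator
天小天: (8) Spark Streaming Operators Walkthrough: the window operator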

Preface

This article explains what repartition does and how it works.

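Purpose

repartition changes the number of partitions of the parent RDD. Its only argument is the number of partitions wanted after the change, so usage is simple: call it with the target partition count.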

Source Code Analysis

The following sections walk through the source code to show how repartitioning is implemented.

DStream

    /**
     * Return a new DStream with an increased or decreased level of parallelism. Each RDD in the
     * returned DStream has exactly numPartitions partitions.
     */
    def repartition(numPartitions: Int): DStream[T] = ssc.withScope {
      this.transform(_.repartition(numPartitions))
    }

As the method shows, DStream.repartition is implemented by using the DStream transform operator to call the RDD repartition operator directly on every RDD of the stream.
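To make that delegation concrete, here is a minimal usage sketch. It assumes spark-streaming is on the classpath and runs with a local master; the object name `RepartitionUsage`, the helper `observedPartitionCounts`, and the queue-backed input are illustrative choices, not part of the original article. It builds the same stream twice, once through `repartition` and once through an explicit `transform`, and records the partition count of each resulting RDD.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepartitionUsage {

  /** Runs a short local stream and returns the partition counts seen in each batch. */
  def observedPartitionCounts(target: Int): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-usage")
    val ssc = new StreamingContext(conf, Seconds(1))

    // One queued RDD with 2 partitions stands in for a real source (assumption).
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(1 to 100, 2))
    val input = ssc.queueStream(queue)

    // The operator form and the explicit transform form do the same thing.
    val viaOperator = input.repartition(target)
    val viaTransform = input.transform(_.repartition(target))

    // foreachRDD bodies run on the driver, so a driver-side buffer can collect results.
    val seen = mutable.ListBuffer.empty[Int]
    viaOperator.foreachRDD { rdd =>
      val n = rdd.getNumPartitions
      seen.synchronized { seen += n; () }
    }
    viaTransform.foreachRDD { rdd =>
      val n = rdd.getNumPartitions
      seen.synchronized { seen += n; () }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    seen.synchronized(seen.toList)
  }

  def main(args: Array[String]): Unit =
    println(observedPartitionCounts(target = 4))
}
```

Both output streams should report the requested partition count for every batch, since the Scaladoc quoted above promises exactly numPartitions partitions per RDD.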

Next, let's look at how the RDD repartition operator is implemented.

RDD

    /**
     * Return a new RDD that has exactly numPartitions partitions.
     *
     * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
     * a shuffle to redistribute data.
     *
     * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
     * which can avoid performing a shuffle.
     *
     * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
     */
    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
      coalesce(numPartitions, shuffle = true)
    }

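The first thing to notice is that RDD.repartition is implemented by calling the coalesce method. It passes two arguments: numPartitions, the number of partitions after repartitioning, and shuffle, which controls whether a shuffle takes place. Here shuffle is true, so a shuffle is performed.

Next, let's see how coalesce is implemented.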

    def coalesce(numPartitions: Int, shuffle: Boolean = false,
                 partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
                (implicit ord: Ordering[T] = null)
        : RDD[T] = withScope {
      require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
      if (shuffle) { // shuffle requested; repartition takes this branch
        /** Distributes elements evenly across output partitions, starting from a random partition. */
        // distributePartition prepares the data for the shuffle:
        // it assigns a key to every element of the iterator, and the shuffle uses those keys
        // to spread the elements evenly over the partitions of the next stage.
        val distributePartition = (index: Int, items: Iterator[T]) => {
          var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
          items.map { t =>
            // Note that the hash code of the key will just be the key itself. The HashPartitioner
            // will mod it with the number of total partitions.
            position = position + 1
            (position, t)
          }
        } : Iterator[(Int, T)]

        // include a shuffle step so that our upstream tasks are still distributed
        new CoalescedRDD(
          new ShuffledRDD[Int, T, T](
            mapPartitionsWithIndex(distributePartition), // assign each element a key via distributePartition
            new HashPartitioner(numPartitions)), // ShuffledRDD shuffles the elements by key
          numPartitions,
          partitionCoalescer).values
      } else {
        // no shuffle: return a CoalescedRDD directly
        new CoalescedRDD(this, numPartitions, partitionCoalescer)
      }
    }

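The source shows that both branches build a CoalescedRDD, whether or not a shuffle is involved (the shuffle branch wraps a ShuffledRDD in one and then calls .values to drop the keys). The difference is that the shuffle path first assigns a key to every element and then uses those keys to distribute the elements evenly across the tasks.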
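The key assignment is easier to follow outside Spark. The sketch below is a plain-Scala imitation, not Spark code: `DistributeSketch`, `assignKeys`, `targetPartition`, and `countsPerTarget` are hypothetical names, and `targetPartition` is a stand-in for HashPartitioner that assumes an Int key hashes to itself and is reduced with a non-negative modulo. It shows that consecutive keys starting from a seeded random position spread one input partition round-robin over the output partitions.

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

object DistributeSketch {

  /** Imitates distributePartition: tag each element with a consecutive Int key. */
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    // The start position is random but seeded by the partition index, so it is repeatable.
    var position = new Random(byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  /** Stand-in for HashPartitioner (assumption): Int key modulo partition count, kept non-negative. */
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  /** Counts how many elements of one input partition land in each output partition. */
  def countsPerTarget(index: Int, size: Int, numPartitions: Int): Map[Int, Int] =
    assignKeys(index, Iterator.range(0, size), numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toList
      .groupBy(identity)
      .map { case (target, hits) => target -> hits.size }

  def main(args: Array[String]): Unit =
    println(countsPerTarget(index = 0, size = 10, numPartitions = 4))
}
```

Because the keys of a single input partition are consecutive, the output partition sizes coming from that input partition differ by at most one. With several input partitions, each starts at its own position, so the overall result is approximately rather than exactly even.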

CoalescedRDD

    private[spark] class CoalescedRDD[T: ClassTag](
        @transient var prev: RDD[T], // the parent RDD
        maxPartitions: Int, // maximum partition count; here, the partition count after repartitioning
        partitionCoalescer: Option[PartitionCoalescer] = None // coalescing algorithm, None by default
      ) extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies

      require(maxPartitions > 0 || maxPartitions == prev.partitions.length,
        s"Number of partitions ($maxPartitions) must be positive.")
      if (partitionCoalescer.isDefined) {
        require(partitionCoalescer.get.isInstanceOf[Serializable],
          "The partition coalescer passed in must be serializable.")
      }

      override def getPartitions: Array[Partition] = {
        // pick the coalescing algorithm; the default is DefaultPartitionCoalescer
        val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())

        // coalesce works out, from the parent RDD and the maximum partition count,
        // which old partitions each new partition is responsible for
        pc.coalesce(maxPartitions, prev).zipWithIndex.map {
          case (pg, i) =>
            // pg is a PartitionGroup: a set of old partitions that maps to one new partition
            val ids = pg.partitions.map(_.index).toArray
            new CoalescedRDDPartition(i, prev, ids, pg.prefLoc) // build the new partition
        }
      }

      override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
        // By the time this runs, the partitions have already been reassigned;
        // this code executes inside the task of the new partition.
        // The new partition walks through all of its old (parent) partitions in order
        // and calls the parent RDD's iterator on each of them.
        partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
          firstParent[T].iterator(parentPartition, context)
        }
      }

      override def getDependencies: Seq[Dependency[_]] = {
        Seq(new NarrowDependency(prev) {
          def getParents(id: Int): Seq[Int] =
            partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
        })
      }

      override def clearDependencies() {
        super.clearDependencies()
        prev = null
      }

      /**
       * Returns the preferred machine for the partition. If split is of type CoalescedRDDPartition,
       * then the preferred machine will be one which most parent splits prefer too.
       * @param partition
       * @return the machine most preferred by split
       */
      override def getPreferredLocations(partition: Partition): Seq[String] = {
        partition.asInstanceOf[CoalescedRDDPartition].preferredLocation.toSeq
      }
    }

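For CoalescedRDD, getPartitions is the core method. It is where Spark works out which old partitions each new partition covers. The actual algorithm lives in the coalesce method of DefaultPartitionCoalescer.

The compute method runs inside the new task, that is, after the partitions have been reassigned. It pulls the elements of the parent RDD partitions assigned to the new partition and feeds them to the downstream iterator.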
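The interplay between getPartitions and compute can be sketched without Spark. In the plain-Scala example below, `CoalesceSketch`, `groupParents`, and `compute` are hypothetical names, and the grouping rule (contiguous ranges of parent indices, ignoring data locality) is a simplifying assumption rather than the DefaultPartitionCoalescer algorithm, which the article does not cover.

```scala
object CoalesceSketch {

  /**
   * Simplified grouping (an assumption, not DefaultPartitionCoalescer itself):
   * split the parent partition indices into contiguous ranges, ignoring locality.
   */
  def groupParents(numParents: Int, maxPartitions: Int): Vector[Vector[Int]] = {
    require(maxPartitions > 0, "maxPartitions must be positive")
    val groups = math.min(numParents, maxPartitions)
    Vector.tabulate(groups) { i =>
      val start = ((i.toLong * numParents) / groups).toInt
      val end = (((i.toLong + 1) * numParents) / groups).toInt
      (start until end).toVector
    }
  }

  /** Imitates compute: a new partition chains the iterators of its parent partitions. */
  def compute[T](parents: Vector[Vector[T]], group: Vector[Int]): Iterator[T] =
    group.iterator.flatMap(idx => parents(idx).iterator)

  def main(args: Array[String]): Unit = {
    // Five parent partitions, modelled as in-memory vectors, coalesced into two.
    val parents = Vector(Vector("a"), Vector("b", "c"), Vector("d"), Vector.empty[String], Vector("e"))
    val groups = groupParents(numParents = parents.length, maxPartitions = 2)
    groups.zipWithIndex.foreach { case (group, i) =>
      println(s"new partition $i <- parents $group: ${compute(parents, group).toList}")
    }
  }
}
```

The mapping is decided once, up front, and each new partition then simply reads its parents one after another. That is why the dependency in CoalescedRDD is a narrow one and no data has to move through a shuffle in this step.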

Diagrams

Next, two diagrams illustrate how repartitioning works.

Without shuffle

[Figure: repartition without shuffle]

With shuffle

[Figure: repartition with shuffle]

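Summary

That covers most of the repartition logic. The repartitioning algorithm inside DefaultPartitionCoalescer was not expanded on here; if time permits, a separate article will describe it in detail.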
