热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

PySpark/Spark窗口函数第一个/最后一个问题

如何解决《PySpark/Spark窗口函数第一个/最后一个问题》经验,为你挑选了1个好方法。



1> user6910411..:

这不是不正确的.您的窗口定义与您的想法不同.

如果您提供ORDER BY子句,则默认框架为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:

from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

w = Window.partitionBy('Dept').orderBy('Age')

df = spark.createDataFrame(
    [(38, "medicine", 4), (41, "medicine", 5), (55, "medicine", 7)],
    ("Age", "Dept", "ID")
)

df.select(
    "*",
    first('ID').over(w).alias("first_id"), 
    last('ID').over(w).alias("last_id")
).explain()
== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS first_id#38L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS last_id#40L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]

这意味着窗口函数永远不会向前看,帧中的最后一行是当前行.

您应该将窗口重新定义为

w_uf = (Window
   .partitionBy('Dept')
   .orderBy('Age')
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = df.select(
    "*", 
    first('ID').over(w_uf).alias("first_id"),
    last('ID').over(w_uf).alias("last_id")
)
== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS first_id#56L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS last_id#58L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]
result.show()
+---+--------+---+--------+-------+
|Age|    Dept| ID|first_id|last_id|
+---+--------+---+--------+-------+
| 38|medicine|  4|       4|      7|
| 41|medicine|  5|       4|      7|
| 55|medicine|  7|       4|      7|
+---+--------+---+--------+-------+


推荐阅读
author-avatar
Paul
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有