
Implementing TopN in Hive: Grouped TopN with the row_number() Function

Author: 小寒风 | Source: Internet | 2023-08-21 12:39


References:

1. Usage of Hive row_number() and related functions

https://www.cnblogs.com/Allen-rg/p/9268627.html

2. Taking the first N values per group in Hive

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

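TopN has always been a popular topic. Below we look at how to implement grouped TopN in Hive.

Starting with version 0.11, Hive provides the ROW_NUMBER() function, which solves this kind of problem very conveniently.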

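Achieving similar functionality before 0.11

Hive versions before 0.11 have no built-in function for TopN. So how can TopN be implemented on those versions? The following article does it with a Hive UDF; we list it here for reference.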

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

Using the function

The function's prototype is:

ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC)

First, let's set up a scenario. Suppose we have a table of exam scores with three columns (student name, subject, and score), and we need the top three students in each subject.

The table, student2, has the columns name, score, and subject.

It contains the following data:

+----------------+-----------------+-------------------+
| student2.name  | student2.score  | student2.subject  |
+----------------+-----------------+-------------------+
| a              | 22.2            | english           |
| a              | 90.2            | chinese           |
| a              | 33.0            | english           |
| b              | 72.2            | english           |
| b              | 80.2            | chinese           |
| b              | 63.0            | math              |
| c              | 64.2            | english           |
| c              | 85.2            | chinese           |
| c              | 73.0            | math              |
| d              | 24.2            | english           |
| d              | 75.2            | chinese           |
| d              | 43.0            | math              |
| e              | 74.2            | english           |
| e              | 55.2            | chinese           |
| e              | 93.0            | math              |
| f              | 76.2            | english           |
| f              | 20.2            | chinese           |
| f              | 84.0            | math              |
| f              | 63.0            | math              |
+----------------+-----------------+-------------------+

select *
from (
  select name, subject, score,
         row_number() over(partition by subject order by score desc) rank
  from student2
) tmp
where tmp.rank <= 3;

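A brief explanation of the statement: the inner subquery partitions the rows by subject and numbers the rows of each partition in descending order of score. When two rows have the same score, they still receive distinct, consecutive numbers (the second row simply gets the next number), and the order between the tied rows is not guaranteed. The outer query then keeps only the rows numbered 1 to 3.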

Result:

+-----------+--------------+------------+-----------+
| tmp.name  | tmp.subject  | tmp.score  | tmp.rank  |
+-----------+--------------+------------+-----------+
| a         | chinese      | 90.2       | 1         |
| c         | chinese      | 85.2       | 2         |
| b         | chinese      | 80.2       | 3         |
| f         | english      | 76.2       | 1         |
| e         | english      | 74.2       | 2         |
| b         | english      | 72.2       | 3         |
| e         | math         | 93.0       | 1         |
| f         | math         | 84.0       | 2         |
| c         | math         | 73.0       | 3         |
+-----------+--------------+------------+-----------+
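To make the partition-then-number logic concrete, here is a minimal Python sketch that emulates what the query computes on the same student2 rows. It is only an illustration of the semantics, not how Hive executes the query; the helper name top_n_per_group is hypothetical.

```python
from itertools import groupby

# Rows copied from the student2 table: (name, score, subject).
rows = [
    ("a", 22.2, "english"), ("a", 90.2, "chinese"), ("a", 33.0, "english"),
    ("b", 72.2, "english"), ("b", 80.2, "chinese"), ("b", 63.0, "math"),
    ("c", 64.2, "english"), ("c", 85.2, "chinese"), ("c", 73.0, "math"),
    ("d", 24.2, "english"), ("d", 75.2, "chinese"), ("d", 43.0, "math"),
    ("e", 74.2, "english"), ("e", 55.2, "chinese"), ("e", 93.0, "math"),
    ("f", 76.2, "english"), ("f", 20.2, "chinese"), ("f", 84.0, "math"),
    ("f", 63.0, "math"),
]


def top_n_per_group(rows, n):
    """Emulate row_number() over(partition by subject order by score desc) <= n."""
    # Sort by the partition key, then by score descending inside each partition.
    ordered = sorted(rows, key=lambda r: (r[2], -r[1]))
    result = []
    for subject, group in groupby(ordered, key=lambda r: r[2]):
        # enumerate() plays the role of row_number(): 1, 2, 3, ... per partition.
        for row_number, (name, score, _) in enumerate(group, start=1):
            if row_number <= n:
                result.append((name, subject, score, row_number))
    return result


top3 = top_n_per_group(rows, 3)
for row in top3:
    print(row)
```

The printed rows match the result table above: three rows per subject, numbered 1 to 3.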

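Execution order

Besides knowing how to use the function, we should also understand when it is evaluated.

When row_number() over() is used, the partitioning and ordering inside over() are evaluated after the query's WHERE, GROUP BY, and HAVING clauses, and before the query's final ORDER BY and LIMIT. This is why the row number cannot be filtered in the WHERE clause of the same query: the examples here compute it in a subquery and filter on it in the outer query.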
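The practical consequence of this ordering can be sketched in Python. The sample below uses a few math rows from the table above; the helper number_rows is hypothetical and only mimics row_number() with a descending sort. Filtering before numbering (what a WHERE clause in the same query does) changes which rows get numbered, while filtering on the number afterwards (the subquery pattern) does not.

```python
# Illustrative sample of one partition: (name, score).
math_rows = [("b", 63.0), ("c", 73.0), ("d", 43.0), ("e", 93.0), ("f", 84.0)]


def number_rows(rows):
    """Assign 1-based positions by descending score, like row_number()."""
    ordered = sorted(rows, key=lambda r: -r[1])
    return [(name, score, i) for i, (name, score) in enumerate(ordered, start=1)]


# WHERE runs first: rows are filtered, then only the survivors are numbered.
where_then_window = number_rows([r for r in math_rows if r[1] < 90])

# Subquery pattern: number every row first, then filter on the number outside.
window_then_filter = [r for r in number_rows(math_rows) if r[2] <= 2]

print(where_then_window)   # e (93.0) is gone, so f becomes number 1
print(window_then_filter)  # e is number 1, f is number 2
```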

Similar functions

Besides row_number(), does Hive provide functions with similar behavior? Yes, it does.

There are two of them:

rank() over()

dense_rank() over()

rank() over()

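rank() over() works much like row_number() over().

The difference lies in how equal values within a partition are handled.

rank() over() is closer to everyday ranking logic. For example, if two rows tie for first place with 88 points, the output is:

88 points: rank 1 (both rows)

80 points: rank 3

Example: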

select *
from (
  select name, subject, score,
         rank() over(partition by subject order by score desc) rank
  from student2
) tmp
where tmp.rank <= 5;

+-----------+--------------+------------+-----------+
| tmp.name  | tmp.subject  | tmp.score  | tmp.rank  |
+-----------+--------------+------------+-----------+
| a         | chinese      | 90.2       | 1         |
| c         | chinese      | 85.2       | 2         |
| b         | chinese      | 80.2       | 3         |
| d         | chinese      | 75.2       | 4         |
| e         | chinese      | 55.2       | 5         |
| f         | english      | 76.2       | 1         |
| e         | english      | 74.2       | 2         |
| b         | english      | 72.2       | 3         |
| c         | english      | 64.2       | 4         |
| a         | english      | 33.0       | 5         |
| e         | math         | 93.0       | 1         |
| f         | math         | 84.0       | 2         |
| c         | math         | 73.0       | 3         |
| f         | math         | 63.0       | 4         |
| b         | math         | 63.0       | 4         |
+-----------+--------------+------------+-----------+

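As you can see, f and b both scored 63 in math, and both are ranked 4th. The next math row (d, 43.0) would be ranked 6th, not 5th, so the rank <= 5 filter excludes it.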
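The tie handling of rank() can be sketched in a few lines of Python. This is an emulation of the semantics on the math scores from the table, not Hive's implementation; the function name rank_desc is hypothetical.

```python
def rank_desc(scores):
    """Emulate rank() over(order by score desc): ties share a rank, then ranks skip."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])  # tie: reuse the previous row's rank
        else:
            ranks.append(position)   # otherwise the rank is the 1-based position
    return list(zip(ordered, ranks))


# Math scores from the student2 table.
math_ranked = rank_desc([63.0, 73.0, 43.0, 93.0, 84.0, 63.0])
print(math_ranked)
```

Because the rank of a non-tied row is its position in the sorted order, the row after a two-way tie at rank 4 lands on rank 6.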

dense_rank() over()

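This function also differs from row_number() over() in how equal values are handled.

Unlike rank() over(), which skips ranks (two rows tied at rank 2 are followed by rank 4), dense_rank() over() ranks consecutively: two rows tied at rank 2 are followed by rank 3. For example: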

select name, subject, score,
       dense_rank() over(partition by subject order by score desc) rank
from student2;

Result:

+-------+----------+--------+-------+
| name  | subject  | score  | rank  |
+-------+----------+--------+-------+
| a     | chinese  | 90.2   | 1     |
| c     | chinese  | 85.2   | 2     |
| b     | chinese  | 80.2   | 3     |
| d     | chinese  | 75.2   | 4     |
| e     | chinese  | 55.2   | 5     |
| f     | chinese  | 20.2   | 6     |
| f     | english  | 76.2   | 1     |
| e     | english  | 74.2   | 2     |
| b     | english  | 72.2   | 3     |
| c     | english  | 64.2   | 4     |
| a     | english  | 33.0   | 5     |
| d     | english  | 24.2   | 6     |
| a     | english  | 22.2   | 7     |
| e     | math     | 93.0   | 1     |
| f     | math     | 84.0   | 2     |
| c     | math     | 73.0   | 3     |
| f     | math     | 63.0   | 4     |
| b     | math     | 63.0   | 4     |
| d     | math     | 43.0   | 5     |
+-------+----------+--------+-------+
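For comparison, here is a minimal Python sketch of the dense_rank() semantics on the same math scores. Again this only emulates the behavior, and the function name dense_rank_desc is hypothetical.

```python
def dense_rank_desc(scores):
    """Emulate dense_rank() over(order by score desc): ties share a rank, no gaps."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    current = 0
    previous = None
    for score in ordered:
        if score != previous:
            current += 1  # a new distinct score gets the next consecutive rank
            previous = score
        ranks.append(current)
    return list(zip(ordered, ranks))


# Math scores from the student2 table.
math_dense = dense_rank_desc([63.0, 73.0, 43.0, 93.0, 84.0, 63.0])
print(math_dense)
```

The rank here counts distinct scores rather than rows, which is why 43.0 gets rank 5 in the table above instead of 6.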

That's it. After this article you should have a detailed understanding of grouped ranking in Hive.

Wishing everyone steady progress. Keep it up!

A practical scenario

We have a table that records each user's name, sex, and score, and we want to find the top 3 among the boys and among the girls.

The table contains the following data:

0: jdbc:hive2://cdh-manager:10000> select * from test_sex_rank;

Hive reports the schema as name (string), sex (boolean), and score (double).

+---------------------+--------------------+----------------------+
| test_sex_rank.name  | test_sex_rank.sex  | test_sex_rank.score  |
+---------------------+--------------------+----------------------+
| a1                  | false              | 82.0                 |
| a2                  | false              | 98.0                 |
| a3                  | false              | 67.4                 |
| a4                  | false              | 87.0                 |
| b1                  | true               | 42.0                 |
| b2                  | true               | 98.0                 |
| b3                  | true               | 77.4                 |
| b4                  | true               | 87.0                 |
+---------------------+--------------------+----------------------+
8 rows selected (0.179 seconds)

The SQL we wrote is:

select *
from (
  select name, score, sex,
         rank() over(partition by sex order by score) as rank
  from test_sex_rank
) tmp
where rank <= 3;

The result is below. Note that the query orders by score without desc, so the sort is ascending and these are the three lowest scores in each group. To get the three highest, use order by score desc inside over().

+-----------+------------+----------+-----------+
| tmp.name  | tmp.score  | tmp.sex  | tmp.rank  |
+-----------+------------+----------+-----------+
| a3        | 67.4       | false    | 1         |
| a1        | 82.0       | false    | 2         |
| a4        | 87.0       | false    | 3         |
| b1        | 42.0       | true     | 1         |
| b3        | 77.4       | true     | 2         |
| b4        | 87.0       | true     | 3         |
+-----------+------------+----------+-----------+
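To see how the sort direction changes the outcome on this data, the following Python sketch emulates rank() over(partition by sex order by score) with a switchable direction. The rows are copied from test_sex_rank; the helper top_by_rank is hypothetical and is only meant to illustrate the semantics.

```python
from itertools import groupby

# Rows copied from test_sex_rank: (name, sex, score).
sex_rows = [
    ("a1", False, 82.0), ("a2", False, 98.0), ("a3", False, 67.4), ("a4", False, 87.0),
    ("b1", True, 42.0), ("b2", True, 98.0), ("b3", True, 77.4), ("b4", True, 87.0),
]


def top_by_rank(rows, n, descending):
    """Emulate rank() over(partition by sex order by score [desc]) <= n."""
    sign = -1 if descending else 1
    ordered = sorted(rows, key=lambda r: (r[1], sign * r[2]))
    result = []
    for sex, group in groupby(ordered, key=lambda r: r[1]):
        prev_score, prev_rank = None, 0
        for position, (name, _, score) in enumerate(group, start=1):
            # Tied scores share a rank; otherwise the rank is the position.
            rank = prev_rank if score == prev_score else position
            prev_score, prev_rank = score, rank
            if rank <= n:
                result.append((name, score, sex, rank))
    return result


lowest3 = top_by_rank(sex_rows, 3, descending=False)   # same ordering as the query above
highest3 = top_by_rank(sex_rows, 3, descending=True)   # order by score desc
print(lowest3)
print(highest3)
```

With ascending order the sketch reproduces the table above; with descending order the groups are led by a2 and b2, the two rows scoring 98.0.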
