By default Hive keeps its metastore in an embedded Derby database, which allows only one session at a time and is mainly used for testing.
For production, store the metadata in MySQL instead.
Create the warehouse directory:
hadoop fs -mkdir -p /user/hive/warehouse
Grant write permission:
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -chmod a+w /tmp
As the hadoop user, extract the Hive tarball into the /home/hadoop/apps installation directory:
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /home/hadoop/apps
Switch to the root user and create a symbolic link (the extracted directory is apache-hive-1.2.2-bin, matching the tarball name):
ln -s /home/hadoop/apps/apache-hive-1.2.2-bin /usr/local/hive
Change the owner:
chown -R hadoop:hadoop /usr/local/hive
Add environment variables:
vim /etc/profile
Append:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:${HIVE_HOME}/bin
Apply the changes:
source /etc/profile
Switch back to the hadoop user.
Edit HIVE_HOME/conf/hive-site.xml (create it if it does not exist).
Note: adjust the IP address and password to match your environment.
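For the MySQL-backed metastore, a minimal hive-site.xml might look like the sketch below. The host, database name, user, and password values are placeholders to replace with your own; the property names are Hive's standard metastore settings:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- JDBC connection to the MySQL database that stores the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://your_mysql_host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>your_mysql_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your_mysql_password</value>
  </property>
</configuration>
```

The MySQL JDBC driver jar (mysql-connector-java) must also be placed in $HIVE_HOME/lib.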
Start Hive:
/usr/local/hive/bin/hive
[hadoop@hadoop4 bin]$ hive
Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
Startup succeeded.
To start Hive with log output printed to the console:
hive -hiveconf hive.root.logger=DEBUG,console
List databases:
hive> show databases;
OK
default
Time taken: 0.02 seconds, Fetched: 1 row(s)
create database mytestDB;
create database IF NOT EXISTS mytestDB;
Hive folds identifiers to lowercase, so the database directory appears in HDFS as mytestdb.db:
hadoop fs -ls /user/hive/warehouse
use mytestdb;
show tables;
Create the user_info table:
create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
After the table is created, Hive also creates the corresponding directory in HDFS:
/user/hive/warehouse/mytestdb.db/user_info
drop table user_info;
create table student_info(
student_id string comment 'student ID',
name string comment 'name',
age int comment 'age',
origin string comment 'origin region'
)
comment 'student information table'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
hive> select * from student_info where origin='11'
> ;
OK
1 xiaoming 20 11
6 zhangsan 20 11
7 lisi 19 11
Time taken: 0.473 seconds, Fetched: 3 row(s)
hive>
To load from a path in HDFS instead of the local filesystem, first upload the file:
hadoop fs -put student_info_data.txt /
Load it in append mode:
load data inpath '/student_info_data.txt' into table student_info;
The table's location in HDFS now holds two files:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 2 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:39 /user/hive/warehouse/mydb.db/student_info/student_info_data_copy_1.txt
and /student_info_data.txt in HDFS has been moved (cut, not copied) into the table's directory /user/hive/warehouse/mydb.db/student_info.
Load in overwrite mode:
load data inpath '/student_info_data.txt' overwrite into table student_info;
This replaces the existing data.
create table employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
hive> select * from employee;
OK
zhangsan 10800 ["beijing","shanghai"] {"养老":1000.0,"医疗":600.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi 20000 ["beijing","nanjing"] {"养老":2000.0,"医疗":1200.0} {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu 17000 ["shanghai","nanjing"] {"养老":1800.0,"医疗":1100.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
hive> select user_id,
> salary,
> worked_citys[0],
> social_security['养老'],
> welfare.meal_allowance
> from employee
> where welfare.if_regular=true;
OK
zhangsan 10800 beijing 1000.0 2000.0
wangwu 17000 shanghai 1800.0 2000.0
You can create the HDFS path in advance:
hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
If it is not created in advance, it is created automatically from the specified location when the external table is defined.
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
select * from student_school_info;
Create a student enrollment table.
Columns: student ID, name, institute ID.
Partition column: major ID.
create table student_school_info_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
insert into table student_school_info_partition_maj partition(major_id ='bigdata')
select t1.student_id,t1.name,t1.institute_id
from student_school_info t1
where t1.major_id = 'bigdata';
View the partitions:
show partitions student_school_info_partition_maj;
Check the HDFS path; a major_id=bigdata directory has been added:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj/
Drop a partition:
alter table student_school_info_partition_maj drop partition (major_id='bigdata');
Query the partitioned table again; that partition's data is gone.
Load data using dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
hive> show partitions student_school_info_partition_maj;
OK
major_id=bigdata
major_id=computer
major_id=software
Time taken: 0.114 seconds, Fetched: 3 row(s)
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=software
Three directories are added, one holding the data for each partition.
Create an external student enrollment table.
Columns: student ID, name, institute ID.
Partition column: major ID.
create external table rel.student_school_info_external_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition_maj';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_external_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
Dropping the managed partitioned table also removes its warehouse directory:
drop table student_school_info_partition_maj;
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/
Found 1 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:23 /user/hive/warehouse/rel.db/student_school_info_partition
hive> drop table student_school_info_external_partition_maj;
OK
Time taken: 0.63 seconds
hive> show tables;
OK
student_school_info
student_school_info_partition
Time taken: 0.027 seconds, Fetched: 2 row(s)
Check the files in HDFS; the external table's data still exists:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:06 /user/hive/warehouse/data/student_school_info
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 46 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software/000000_0
Use the like keyword to create a new table with exactly the same schema as an existing table:
create table student_info2 like student_info;
Use the as keyword to create a table whose columns match a query's result, and insert that result into the new table:
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
Rename student_info4 to student_id_name:
alter table student_info4 rename to student_id_name;
Add a gender column to student_info3. New columns are appended after all existing columns and before any partition columns. For rows in data files written before the column was added, queries return NULL for it. Separate multiple new columns with commas.
alter table student_info3 add columns (gender string comment 'gender');
Replacing or removing columns.
To change a column, redefine every column that should remain and give the changed column its new definition:
alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
To remove a column, redefine only the columns that should remain:
alter table student_info3 replace columns(student_id string,name string,age int,origin string);
Bucketing distributes rows by the hash of a chosen column; a row's bucket is hash(column) modulo the number of buckets.
Create a bucketed student table.
Columns: student ID, name, age, origin.
Bucket column: student_id, 4 buckets, rows sorted by student_id ascending within each bucket.
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
Do not use load to populate a bucketed table: load only copies the file into the table's storage directory; Hive does not analyze the data and hash it into buckets, so the whole file lands in the table unbucketed. Populate bucketed tables with insert ... select from another table instead.
Creating a bucketed table only defines the model; each insert needs the following in the current session:
1. set hive.enforce.bucketing = true; to declare that this execution is a bucketed insert.
2. set mapreduce.job.reduces=4; so that the number of reducers equals the number of buckets; this guarantees one output file per bucket.
3. If the table declares a sort order, sort the rows on that column in the query that feeds the buckets; Hive does not sort the data inside each bucket for you.
When the sort column and the bucket column are the same, use cluster by (col); cluster by sorts ascending within each bucket. For descending order within buckets, combine distribute by (col) sort by (col desc):
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
Export query results to a local directory:
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data' row format delimited fields terminated by '\t' select * from student_info;
Or redirect the output of hive -e to a file:
hive -e "select * from rel.student_info" > ./student_info_data.txt
Joins. A plain join is an inner join:
select * from a join b on a.id=b.id;
is equivalent to
select * from a inner join b on a.id=b.id;
full outer join (or full join)
joins the two tables on id and returns every row from both sides:
select * from a full join b on a.id=b.id;
left join
A left join keeps every row of the left table; right-table columns are NULL where there is no match.
select * from a left join b on a.id=b.id;
right join
A right join keeps every row of the right table; left-table columns are NULL where there is no match.
select * from a right join b on a.id=b.id;
left semi join
A left semi join implements IN/EXISTS-style semantics, returning the left table's rows that have a match. Hive of this vintage does not support relational-database IN ... EXISTS subqueries, and there is no right semi join. The relational query
select a.id, a.name from a where a.id in (select b.id from b);
is written in Hive as:
select a.id,a.name from a left semi join b on a.id = b.id;
A map join hint loads the small table (here b) into memory and performs the join on the map side:
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Returns c if a equals b; otherwise e if a equals d; otherwise f.
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Returns b if a is true; otherwise d if c is true; otherwise e.
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
When Hive's built-in functions cannot meet a processing need, consider a user-defined function (UDF).
A UDF operates on a single data row and produces a single data row as output.
Steps: write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate() method; package it into a jar; register it in Hive with add jar and create temporary function.
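As a sketch of those steps (the package, class name, function name, and jar path are illustrative, and hive-exec 1.2.2 plus hadoop-common must be on the compile classpath), a UDF that lower-cases a string could look like:

```java
package com.example.hive.udf; // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// One row in, one value out: return the input text lower-cased.
public final class MyLower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null; // propagate NULL
        }
        return new Text(s.toString().toLowerCase());
    }
}
```

After packaging the class into a jar, register and call it in the Hive session:

add jar /path/to/my-udf.jar;
create temporary function my_lower as 'com.example.hive.udf.MyLower';
select my_lower(name) from student_info;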
Cause: a MySQL character-set problem; set the character set of the hive metastore database in MySQL to latin1.