伪分布式安装部署CDH4.2.1与Impala[原创实践]

作者：娜丷衣阵风丶 | 来源：互联网 | 2018-06-11 05:38

参考资料:www.cloudera.comcontentcloudera-contentcloudera-docsCDH4latestCDH4-Quick-Startcdh4qs_topic_3_3.htmlwww.cloudera.comcontentcloudera-contentcloudera-docsImpalalatestInstalling-and-Using-ImpalaInstalling

参考资料: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Quick-Start/cdh4qs_topic_3_3.html http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing

参考资料:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Quick-Start/cdh4qs_topic_3_3.html
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html
http://blog.cloudera.com/blog/2013/02/from-zero-to-impala-in-minutes/

什么是Impala？
Cloudera发布了实时查询开源项目Impala，根据多款产品实测表明，它比原来基于MapReduce的Hive SQL查询速度提升3～90倍。Impala是Google Dremel的模仿，但在SQL功能上青出于蓝胜于蓝。

1. 安装JDK
$ sudo yum install jdk-6u41-linux-amd64.rpm

2. 伪分布式模式安装CDH4
$ cd /etc/yum.repos.d/
$ sudo wget http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
$ sudo yum install hadoop-conf-pseudo

格式化NameNode.
$ sudo -u hdfs hdfs namenode -format

启动HDFS
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

创建/tmp目录
$ sudo -u hdfs hadoop fs -rm -r /tmp
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

创建YARN与日志目录
$ sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging

$ sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate

$ sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging

$ sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

检查HDFS文件树
$ sudo -u hdfs hadoop fs -ls -R /

drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn

启动YARN
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
$ sudo service hadoop-mapreduce-historyserver start

创建用户目录(以用户dong.guo为例):
$ sudo -u hdfs hadoop fs -mkdir /user/dong.guo
$ sudo -u hdfs hadoop fs -chown dong.guo /user/dong.guo

测试上传文件
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input

Found 4 items
-rw-r--r--   1 dong.guo supergroup       1461 2013-05-14 03:30 input/core-site.xml
-rw-r--r--   1 dong.guo supergroup       1854 2013-05-14 03:30 input/hdfs-site.xml
-rw-r--r--   1 dong.guo supergroup       1325 2013-05-14 03:30 input/mapred-site.xml
-rw-r--r--   1 dong.guo supergroup       2262 2013-05-14 03:30 input/yarn-site.xml

配置HADOOP_MAPRED_HOME环境变量
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

运行一个测试Job
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'

Job完成后，可以看到以下目录
$ hadoop fs -ls

Found 2 items
drwxr-xr-x   - dong.guo supergroup          0 2013-05-14 03:30 input
drwxr-xr-x   - dong.guo supergroup          0 2013-05-14 03:32 output23

$ hadoop fs -ls output23

Found 2 items
-rw-r--r--   1 dong.guo supergroup          0 2013-05-14 03:32 output23/_SUCCESS
-rw-r--r--   1 dong.guo supergroup        150 2013-05-14 03:32 output23/part-r-00000

$ hadoop fs -cat output23/part-r-00000 | head

1	dfs.safemode.min.datanodes
1	dfs.safemode.extension
1	dfs.replication
1	dfs.namenode.name.dir
1	dfs.namenode.checkpoint.dir
1	dfs.datanode.data.dir

3. 安装 Hive
$ sudo yum install hive hive-metastore hive-server

$ sudo yum install mysql-server

$ sudo service mysqld start

$ cd ~
$ wget 'http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.25.tar.gz'
$ tar xzf mysql-connector-java-5.1.25.tar.gz
$ sudo cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar /usr/lib/hive/lib/

$ sudo /usr/bin/mysql_secure_installation

[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:hadoophive
Re-enter new password:hadoophive
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!

$ mysql -u root -phadoophive

mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hadoophive';
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hadoophive';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'%';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'%';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

$ sudo mv /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak
$ sudo vim /etc/hive/conf/hive-site.xml



  javax.jdo.option.ConnectionURL
  jdbc:mysql://localhost/metastore
  the URL of the MySQL database
  javax.jdo.option.ConnectionDriverName
  com.mysql.jdbc.Driver
  javax.jdo.option.ConnectionUserName
  hive
  javax.jdo.option.ConnectionPassword
  hadoophive
  datanucleus.autoCreateSchema
  false
  datanucleus.fixedDatastore
  true
  hive.metastore.uris
  thrift://127.0.0.1:9083
  IP address (or fully-qualified domain name) and port of the metastore host
  hive.aux.jars.path
  file:///usr/lib/hive/lib/zookeeper.jar,file:///usr/lib/hive/lib/hbase.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar,file:///usr/lib/hive/lib/guava-11.0.2.jar

$ sudo service hive-metastore start

Starting (hive-metastore):                                 [  OK  ]

$ sudo service hive-server start

Starting (hive-server):                                    [  OK  ]

$ sudo -u hdfs hadoop fs -mkdir /user/hive
$ sudo -u hdfs hadoop fs -chown hive /user/hive
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod 777 /tmp
$ sudo -u hdfs hadoop fs -chmod o+t /tmp
$ sudo -u hdfs hadoop fs -mkdir /data
$ sudo -u hdfs hadoop fs -chown hdfs /data
$ sudo -u hdfs hadoop fs -chmod 777 /data
$ sudo -u hdfs hadoop fs -chmod o+t /data

$ sudo chown -R hive:hive /var/lib/hive
$ sudo vim /tmp/kv1.txt

1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinx.com

$ sudo -u hive hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201305140801_825709760.txt
hive> CREATE TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n";
hive> show tables;
OK
pokes
Time taken: 0.415 seconds
hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE pokes;
Copying data from file:/tmp/kv1.txt
Copying file: file:/tmp/kv1.txt
Loading data to table default.pokes
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/pokes
Table default.pokes stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 79, raw_data_size: 0]
OK
Time taken: 1.681 seconds

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

4. 安装 Impala
$ cd /etc/yum.repos.d/
$ sudo wget http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo
$ sudo yum install impala impala-shell
$ sudo yum install impala-server impala-state-store

$ sudo vim /etc/hadoop/conf/hdfs-site.xml

...
   dfs.client.read.shortcircuit
   true
   dfs.domain.socket.path
   /var/run/hadoop-hdfs/dn._PORT
   dfs.client.file-block-storage-locations.timeout
   3000    
  dfs.datanode.hdfs-blocks-metadata.enabled
  true

$ sudo cp -rpa /etc/hadoop/conf/core-site.xml /etc/impala/conf/
$ sudo cp -rpa /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/

$ sudo service hadoop-hdfs-datanode restart

$ sudo service impala-state-store restart
$ sudo service impala-server restart

$ sudo /usr/java/default/bin/jps

5. 安装 Hbase
$ sudo yum install hbase

$ sudo vim /etc/security/limits.conf

hdfs - nofile 32768
hbase - nofile 32768

$ sudo vim /etc/pam.d/common-session

session required pam_limits.so

$ sudo vim /etc/hadoop/conf/hdfs-site.xml

  dfs.datanode.max.xcievers
  4096

$ sudo cp /usr/lib/impala/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar

$ sudo /etc/init.d/hadoop-hdfs-namenode restart
$ sudo /etc/init.d/hadoop-hdfs-datanode restart

$ sudo yum install hbase-master
$ sudo service hbase-master start

$ sudo -u hive hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/hive/hive_job_log_hive_201305140905_2005531704.txt
hive> CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 3.587 seconds
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1368502088579_0004, Tracking URL = http://ip-10-197-10-4:8088/proxy/application_1368502088579_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1368502088579_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2013-05-14 09:12:45,340 Stage-0 map = 0%,  reduce = 0%
2013-05-14 09:12:53,165 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.63 sec
MapReduce Total cumulative CPU time: 2 seconds 630 msec
Ended Job = job_1368502088579_0004
1 Rows loaded to hbase_table_1
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.63 sec   HDFS Read: 288 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 630 msec
OK
Time taken: 21.063 seconds
hive> select * from hbase_table_1;
OK
5	heylinx.com
Time taken: 0.685 seconds
hive> SELECT COUNT (*) FROM pokes;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_1368502088579_0005, Tracking URL = http://ip-10-197-10-4:8088/proxy/application_1368502088579_0005/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1368502088579_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-05-14 10:32:04,711 Stage-1 map = 0%,  reduce = 0%
2013-05-14 10:32:11,461 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:12,554 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:13,642 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:14,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:15,918 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:16,991 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:18,111 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:19,188 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.04 sec
MapReduce Total cumulative CPU time: 4 seconds 40 msec
Ended Job = job_1368502088579_0005
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.04 sec   HDFS Read: 288 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
5
Time taken: 28.195 seconds

6. 测试Impala性能
View parameters on http://ec2-204-236-182-78.us-west-1.compute.amazonaws.com:25000

$ impala-shell

[ip-10-197-10-4.us-west-1.compute.internal:21000] > CREATE TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n";
Query: create TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"
[ip-10-197-10-4.us-west-1.compute.internal:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+-------+
| name  |
+-------+
| pokes |
+-------+
Returned 1 row(s) in 0.00s
[ip-10-197-10-4.us-west-1.compute.internal:21000] > SELECT * from pokes;
Query: select * from pokes
Query finished, fetching results ...
+-----+-----------------+
| foo | bar             |
+-----+-----------------+
| 1   | www.baidu.com   |
| 2   | www.google.com  |
| 3   | www.sina.com.cn |
| 4   | www.163.com     |
| 5   | heylinx.com     |
+-----+-----------------+
Returned 5 row(s) in 0.28s
[ip-10-197-10-4.us-west-1.compute.internal:21000] > SELECT COUNT (*) from pokes;
Query: select COUNT (*) from pokes
Query finished, fetching results ...
+----------+
| count(*) |
+----------+
| 5        |
+----------+
Returned 1 row(s) in 0.34s

通过两个COUNT的结果来看，Hive使用了 28.195 seconds 而 Impala仅使用了0.34s，由此可以看出Impala的性能确实要优于Hive。

原文地址：伪分布式安装部署CDH4.2.1与Impala[原创实践], 感谢原作者分享。

推荐阅读

hdfs
每天收获一点点Hadoop概述

一、Hadoop来历Hadoop的思想来源于Google在做搜索引擎的时候出现一个很大的问题就是这么多网页我如何才能以最快的速度来搜索到，由于这个问题Google发明 ... [详细]

蜡笔小新 2023-12-14 18:58:01
search
Hadoop源码解析1Hadoop工程包架构解析

1 Hadoop中各工程包依赖简述 Google的核心竞争技术是它的计算平台。Google的大牛们用了下面5篇文章，介绍了它们的计算设施。 GoogleCluster：ht ... [详细]

蜡笔小新 2023-10-17 13:28:20
search
hadoop基础----hadoop实战(六)-----hadoop管理工具---Cloudera Manager---CDH介绍

我们在之前的文章中已经初步介绍了Cloudera。hadoop基础----hadoop实战(零)-----hadoop的平台版本选择从版本选择这篇文章中我们了解到除了hadoop官方版本外很多 ... [详细]

蜡笔小新 2023-10-16 14:21:13
search
使用clouderaquickstartvm无配置快速部署Hadoop应用

http:zzj270919.blog.163.comblogstatic68997776201522561659999目录：通过CDH网站下载cloudera-vm ... [详细]

蜡笔小新 2023-10-11 18:27:57
python
WinPythonHadoop在Win10上安装教程

本文介绍了在Win10上安装WinPythonHadoop的详细步骤，包括安装Python环境、安装JDK8、安装pyspark、安装Hadoop和Spark、设置环境变量、下载winutils.exe等。同时提醒注意Hadoop版本与pyspark版本的一致性，并建议重启电脑以确保安装成功。 ... [详细]

蜡笔小新 2023-12-14 11:26:56
default
REVERT权限切换的操作步骤和注意事项

本文介绍了在SQL Server中进行REVERT权限切换的操作步骤和注意事项。首先登录到SQL Server，其中包括一个具有很小权限的普通用户和一个系统管理员角色中的成员。然后通过添加Windows登录到SQL Server，并将其添加到AdventureWorks数据库中的用户列表中。最后通过REVERT命令切换权限。在操作过程中需要注意的是，确保登录名和数据库名的正确性，并遵循安全措施，以防止权限泄露和数据损坏。 ... [详细]

蜡笔小新 2023-12-10 19:41:02
jar
Maven构建Hadoop,

Maven构建Hadoop工程阅读目录序Maven安装构建示例下载系列索引序　　上一篇，我们编写了第一个MapReduce，并且成功的运行了Job，Hadoop1.x是通过ant ... [详细]

蜡笔小新 2023-10-17 16:11:18
jar
什么是大数据lambda架构

一、什么是Lambda架构Lambda架构由Storm的作者[NathanMarz]提出，根据维基百科的定义，Lambda架构的设计是为了在处理大规模数 ... [详细]

蜡笔小新 2023-10-17 16:06:09
python
《Spark核心技术与高级应用》——1.2节Spark的重要扩展

本节书摘来自华章社区《Spark核心技术与高级应用》一书中的第1章，第1.2节Spark的重要扩展，作者于俊向海代其锋马海平，更多章节内容可以访问云栖社区“华章社区”公众号查看1. ... [详细]

蜡笔小新 2023-10-16 18:07:56
ip
Hadoop （CDH4发行版）集群部署（部署脚本，namenode高可用，hadoop管理）

前言折腾了一段时间hadoop的部署管理，写下此系列博客记录一下。为了避免各位做部署这种重复性的劳动，我已经把部署的步骤写成脚本，各位只需要按着本文把脚本执行完，整个环境基本就部署 ... [详细]

蜡笔小新 2023-10-16 15:11:51
ip
Zookeeper详解应用程序（七）

Zookeeper为分布式环境提供灵活的协调基础架构。ZooKeeper框架支持许多当今最好的工业应用程序。我们将在本章中讨论ZooKeeper的一些最显着的应用。雅虎ZooKee ... [详细]

蜡笔小新 2023-10-16 08:30:29
ip
Hadoop——Hive简介和环境配置

一、Hive的简介和配置1.简介Hive是构建在Hadoop之上的数据操作平台lHive是一个SQL解析引擎，它将SQL转译成MapReduce作业，并 ... [详细]

蜡笔小新 2023-10-14 16:22:56
ip
Hadoop中的MapReduce框架原理、自定义Partitioner步骤、在Job驱动中，设置自定义Partitioner、Partition 分区案例

文章目录13.MapReduce框架原理13.3Shuffle机制13.3.2Partition分区13.3.2.3自定义Partitioner步骤13.3.2.3.1自定义类继承 ... [详细]

蜡笔小新 2023-10-14 11:44:52
ip
Hive简介,HIV的介绍

hive的本质是hadoop客户端通过写sql转换成MapReduce提交给yarn、hdfs执行hive的优点操作接口采用类sql语法提供快速开发能力避免了去写MapReduce ... [详细]

蜡笔小新 2023-10-12 23:50:55
ip
无服务器_云原生数据湖架构中的无服务器 Kafka

篇首语：本文由编程笔记#小编为大家整理，主要介绍了云原生数据湖架构中的无服务器Kafka相关的知识，希望对你有一定的参考价值。 ... [详细]

蜡笔小新 2023-10-12 15:37:48

娜丷衣阵风丶

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章