Spark Bulkload (Java)

Author: 艳斐儿M | Source: Internet | 2023-06-06 18:06

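1. Importing data into HBase with Spark via Bulkload

Before switching to Bulkload, I wrapped the RDD into a key-value RDD of Tuple2 and wrote it to HBase through saveAsNewAPIHadoopDataset. This was very slow: 400 GB of data had still not finished after more than 2 hours, so with no other option I turned to Bulkload for the import.
Before testing, I found that most of the material online is written in Scala and only handles a single column. Real production tables have multiple column families and columns, and there are many pitfalls along the way.
Code first: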

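import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ClusterConnection;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HRegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles; // HBase 1.x package (assumed from the API used)
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

// ConfigUtils and getRowValue are the author's own helpers; they are not shown in this post.
public class HbaseSparkUtils {
    private static Configuration hbaseConf;

    static {
        hbaseConf = HBaseConfiguration.create();
        hbaseConf.set(ConfigUtils.getHbaseZK()._1(), ConfigUtils.getHbaseZK()._2());
        hbaseConf.set(ConfigUtils.getHbaseZKPort()._1(), ConfigUtils.getHbaseZKPort()._2());
    }

    public static void saveHDFSHbaseHFile(SparkSession spark,   // Spark session
                                          Dataset<Row> ds,      // dataset to load
                                          String table_name,    // HBase table name
                                          Integer rowKeyIndex,  // index of the row key field
                                          String fields)        // comma-separated field list of the dataset
            throws Exception {
        hbaseConf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024);
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, table_name);
        Job job = Job.getInstance();
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);
        Connection conn = ConnectionFactory.createConnection(hbaseConf);
        TableName tableName = TableName.valueOf(table_name);
        HRegionLocator regionLocator = new HRegionLocator(tableName, (ClusterConnection) conn);
        Table realTable = ((ClusterConnection) conn).getTable(tableName);
        HFileOutputFormat2.configureIncrementalLoad(job, realTable, regionLocator);

        JavaRDD<Row> javaRDD = ds.toJavaRDD();
        JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRDD = javaRDD
            .mapToPair(new PairFunction<Row, ImmutableBytesWritable,
                    List<Tuple2<ImmutableBytesWritable, KeyValue>>>() {
                @Override
                public Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>
                        call(Row row) throws Exception {
                    List<Tuple2<ImmutableBytesWritable, KeyValue>> tps = new ArrayList<>();
                    String rowkey = row.getString(rowKeyIndex);
                    ImmutableBytesWritable writable = new ImmutableBytesWritable(Bytes.toBytes(rowkey));

                    // Sort columns: the columns must be sorted here, otherwise HFile writing fails.
                    // Each column name is paired with its original index so the value can still be looked up.
                    ArrayList<Tuple2<Integer, String>> tuple2s = new ArrayList<>();
                    String[] columns = fields.split(",");
                    for (int i = 0; i < columns.length; i++) {
                        tuple2s.add(new Tuple2<Integer, String>(i, columns[i]));
                    }
                    // The sort call itself was lost from the original listing; this byte-order sort restores it.
                    tuple2s.sort((a, b) -> Bytes.compareTo(Bytes.toBytes(a._2()), Bytes.toBytes(b._2())));

                    // First pass: every column except the row key and "main" goes to the info family.
                    // Writing this family first is only valid if its name sorts before the main family's name.
                    for (Tuple2<Integer, String> t : tuple2s) {
                        String[] fieldNames = row.schema().fieldNames();
                        // Do not store the row key field as a column.
                        if (t._2().equals(fieldNames[rowKeyIndex])) {
                            System.out.println(String.format("%s == %s continue", t._2(), fieldNames[rowKeyIndex]));
                            continue;
                        }
                        if ("main".equals(t._2())) {
                            continue;
                        }
                        String value = getRowValue(row, t._1(), tuple2s.size());
                        KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                                Bytes.toBytes(ConfigUtils.getFamilyInfo()._2()),
                                Bytes.toBytes(t._2()), Bytes.toBytes(value));
                        tps.add(new Tuple2<>(writable, kv));
                    }
                    // Second pass: the "main" field goes to its own column family.
                    for (Tuple2<Integer, String> t : tuple2s) {
                        String value = getRowValue(row, t._1(), tuple2s.size());
                        if ("main".equals(t._2())) { // field == 'main'
                            KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                                    Bytes.toBytes(ConfigUtils.getFamilyMain()._2()),
                                    Bytes.toBytes(t._2()), Bytes.toBytes(value));
                            tps.add(new Tuple2<>(writable, kv));
                            break;
                        }
                    }
                    return new Tuple2<>(writable, tps);
                }
            })
            // Rows must be sorted by row key here. This is slow, and I have not found a better alternative yet.
            .sortByKey()
            .flatMapToPair(new PairFlatMapFunction<
                    Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>,
                    ImmutableBytesWritable, KeyValue>() {
                @Override
                public Iterator<Tuple2<ImmutableBytesWritable, KeyValue>> call(
                        Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>> tuple2s)
                        throws Exception {
                    return tuple2s._2().iterator();
                }
            });

        // Create a temporary HFile directory on HDFS.
        String temp = "/tmp/bulkload/" + table_name + "_" + System.currentTimeMillis();
        javaPairRDD.saveAsNewAPIHadoopFile(temp, ImmutableBytesWritable.class,
                KeyValue.class, HFileOutputFormat2.class, job.getConfiguration());
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConf);
        Admin admin = conn.getAdmin();
        loader.doBulkLoad(new Path(temp), admin, realTable, regionLocator);
    }
}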

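2. Exceptions encountered along the way

1. Can not create a Path from a null string

Reading the source code shows that the output directory property is not set. One of the following properties needs to be added: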

job.getConfiguration().set("mapred.output.dir", "/user/wangwei/tmp/" + tableName);

job.getConfiguration().set("mapreduce.output.fileoutputformat.outputdir", "/tmp/" + tableName); // the recommended parameter

2. Bulk load operation did not find any files to load in directory /tmp/wwtest. Does it contain files in subdirectories that correspond to column family names?

17/10/11 15:54:09 WARN LoadIncrementalHFiles: Skipping non-directory file:/tmp/wwtest/_SUCCESS

17/10/11 15:54:09 WARN LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /tmp/wwtest. Does it contain files in subdirectories that correspond to column family names?

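Things to check:

1. Whether the input data is empty.

2. Whether the classes passed to setMapOutputKeyClass / setMapOutputValueClass match the ones passed to saveAsNewAPIHadoopFile.

3. Whether there is a bug in your own code.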
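The warning above says the loader expects one subdirectory per column family under the output directory, and that marker files such as _SUCCESS are skipped. The following is a minimal sketch of that layout check, not the loader's own logic. It assumes the output directory is visible as a local filesystem path; on HDFS the same listing would go through Hadoop's FileSystem API instead. The class name HFileDirCheckSketch and the file name hfile-0001 are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class HFileDirCheckSketch {

    /**
     * Returns a map of family-directory name to the number of regular files inside it.
     * Entries whose name starts with "_" or "." (for example _SUCCESS) and
     * non-directories are skipped.
     */
    public static Map<String, Integer> countFilesPerFamily(Path outputDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(outputDir)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                if (name.startsWith("_") || name.startsWith(".") || !Files.isDirectory(child)) {
                    continue;
                }
                int files = 0;
                try (DirectoryStream<Path> hfiles = Files.newDirectoryStream(child)) {
                    for (Path f : hfiles) {
                        if (Files.isRegularFile(f)) {
                            files++;
                        }
                    }
                }
                counts.put(name, files);
            }
        }
        return counts;
    }

    /** Builds a small temporary layout (hypothetical names) and counts the files in it. */
    public static Map<String, Integer> demo() {
        try {
            Path dir = Files.createTempDirectory("bulkload-sketch");
            Files.createFile(dir.resolve("_SUCCESS"));
            Path info = Files.createDirectory(dir.resolve("info"));
            Files.createFile(info.resolve("hfile-0001"));
            Files.createDirectory(dir.resolve("main")); // family directory with no HFiles
            return countFilesPerFamily(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

An empty map, or a family with a count of 0, points at the same causes as the checklist above: empty input, mismatched classes, or a bug before the write step.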

3. Added a key not lexically larger than previous

java.io.IOException: Added a key not lexically larger than previous key=\x00\x02Mi\x0BsearchIndexuserId\x00\x00\x01>\xD5\xD6\xF3\xA3\x04, lastkey=\x00\x01w\x0BsearchIndexuserId\x00\x00\x01>\xD5\xD6\xF3\xA3\x04

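The main cause: when producing HFiles, the data must be sorted by row key. Data written with Put is sorted automatically, but HFiles you build yourself are not.

So everything has to be sorted by hand before the HFiles are generated: first by row key, then by column family, then by column. Otherwise the write simply fails with this error.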
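The required ordering can be illustrated without HBase on the classpath. The sketch below uses a hypothetical Cell class as a stand-in for KeyValue and sorts by row key, then column family, then column, comparing bytes as unsigned values. It deliberately leaves out the timestamp and type components that a real KeyValue key also carries, so treat it as an illustration of the idea and not as HBase's comparator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CellOrderSketch {

    /** Hypothetical stand-in for an HBase KeyValue: row key, column family, column qualifier. */
    public static final class Cell {
        public final byte[] row;
        public final byte[] family;
        public final byte[] qualifier;

        public Cell(String row, String family, String qualifier) {
            this.row = row.getBytes(StandardCharsets.UTF_8);
            this.family = family.getBytes(StandardCharsets.UTF_8);
            this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return new String(row, StandardCharsets.UTF_8) + "/"
                    + new String(family, StandardCharsets.UTF_8) + ":"
                    + new String(qualifier, StandardCharsets.UTF_8);
        }
    }

    /** Lexicographic comparison of two byte arrays, treating each byte as unsigned. */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Orders cells by row key, then column family, then qualifier. */
    public static final Comparator<Cell> CELL_ORDER = (c1, c2) -> {
        int c = compareBytes(c1.row, c2.row);
        if (c != 0) {
            return c;
        }
        c = compareBytes(c1.family, c2.family);
        if (c != 0) {
            return c;
        }
        return compareBytes(c1.qualifier, c2.qualifier);
    };

    /** Returns the cells' string forms in sorted order without modifying the input list. */
    public static List<String> sortedNames(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(CELL_ORDER);
        List<String> out = new ArrayList<>();
        for (Cell c : copy) {
            out.add(c.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("row2", "info", "name"));
        cells.add(new Cell("row1", "main", "main"));
        cells.add(new Cell("row1", "info", "name"));
        cells.add(new Cell("row1", "info", "age"));
        // prints [row1/info:age, row1/info:name, row1/main:main, row2/info:name]
        System.out.println(sortedNames(cells));
    }
}
```

The unsigned comparison matters: with Java's signed bytes, a key starting with \x80 would sort before one starting with \x7F, which is the opposite of the byte order HFiles expect.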

4. Caused by: java.lang.ClassCastException: org.apache.hadoop.hbase.client.Put cannot be cast to org.apache.hadoop.hbase.Cell (occurs when Put is used as the map output value; it does not occur with KeyValue)

I did not solve this one directly. The workaround is to use KeyValue: put the KeyValues of each row into a List, then flatMap the lists.

5. java.io.IOException: Trying to load more than 32 hfiles to one family of one region

Raise the limit in the HBase configuration:

hbaseConf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024);
