
HBase, HDFS and durable sync


HBase and HDFS go hand in hand to provide HBase's durability and consistency guarantees.

One way of looking at this setup is that HDFS handles the distribution and storage of your data whereas HBase handles the distribution of CPU cycles and provides a consistent view of that data.

As described in many other places, HBase
  1. Appends all changes to a WAL
  2. Batches/sorts changes in memory
  3. Flushes memory to immutable, sorted data files in HDFS
  4. Combines smaller data files into fewer, larger ones during compactions (a minimal sketch of this write path follows the list).
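
To make the four steps concrete, here is a minimal, self-contained Java sketch of such an LSM-style write path (append to a log, sort in memory, flush to an immutable sorted file). It is an illustration only, not HBase code; all class and file names are invented and error handling is omitted.

  import java.io.*;
  import java.util.Map;
  import java.util.TreeMap;

  class MiniLsmStore {
      private final File dir;
      private final DataOutputStream wal;                               // 1. append-only write-ahead log
      private final TreeMap<String, String> memstore = new TreeMap<>(); // 2. changes sorted in memory
      private int flushCount = 0;

      MiniLsmStore(File dir) throws IOException {
          this.dir = dir;
          this.wal = new DataOutputStream(new FileOutputStream(new File(dir, "wal.log"), true));
      }

      void put(String key, String value) throws IOException {
          wal.writeUTF(key);           // 1. append the change to the WAL first
          wal.writeUTF(value);
          wal.flush();                 // only reaches OS buffers; see the sync discussion below
          memstore.put(key, value);    // 2. batch/sort the change in memory
          if (memstore.size() >= 1000) {
              flush();                 // 3. spill to an immutable, sorted data file
          }
      }

      private void flush() throws IOException {
          File dataFile = new File(dir, "store-" + (flushCount++) + ".dat");
          try (DataOutputStream out = new DataOutputStream(new FileOutputStream(dataFile))) {
              for (Map.Entry<String, String> e : memstore.entrySet()) {
                  out.writeUTF(e.getKey());
                  out.writeUTF(e.getValue());
              }
          }
          memstore.clear();
          // 4. a background compaction would later merge these sorted files into fewer, larger ones
      }
  }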
What is typically less clearly understood and documented are the exact durability guarantees provided by HDFS (and hence HBase).

HDFS sync has a colorful history, with the support HBase needs only available in an unreleased "append" branch of HDFS for a long time. (Note that the append and sync features are independent, and contrary to common belief HBase only relies on the sync feature.) See also this Cloudera blog post.

In order to understand what HDFS provides let's take a look at how a DFSClient (client) interacts with a Datanode (DN).

In a nutshell a DN just waits for commands. One of these commands is WRITE_BLOCK. When the DN receives a WRITE_BLOCK command it instantiates a BlockReceiver thread.

The BlockReceiver then simply waits for packets on an InputStream and flushes the data to OS buffers. An open block is maintained at the DN as an open file. When a block is filled, the block, and hence its associated file, is closed and the BlockReceiver ends. For all practical purposes the DN forgets that the block existed.
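
The following is a deliberately simplified sketch of that receive loop. It is not the real BlockReceiver code, just an illustration of the key point: each packet is written and flushed into OS buffers, and nothing ever forces it to a physical disk.

  import java.io.*;

  class SimplifiedBlockReceiver implements Runnable {
      private final DataInputStream in;         // packets from the client or the upstream DN
      private final FileOutputStream blockFile; // the open local file backing the block

      SimplifiedBlockReceiver(DataInputStream in, File block) throws IOException {
          this.in = in;
          this.blockFile = new FileOutputStream(block);
      }

      @Override
      public void run() {
          try {
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) != -1) { // wait for the next packet
                  blockFile.write(buf, 0, n);    // data lands in OS buffers only
                  blockFile.flush();
                  // note: no blockFile.getFD().sync() here - the bytes are never forced onto disk
              }
              blockFile.close();                 // block is full: close the file and forget about it
          } catch (IOException e) {
              throw new UncheckedIOException(e);
          }
      }
  }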

Replication to replica DNs is done via pipelining. The first DN forwards each packet to the next DN in the chain before the data is flushed locally, and waits for the downstream DN to respond. The default length of the replication chain is 3.

The other side of the equation is the DFSClient, which batches changes until a packet is filled.
Since HADOOP-6313, a Syncable supports hflush and hsync:
  • hflush flushes all outstanding data (i.e. the current unfinished packet) from the client into the OS buffers on all DN replicas.
  • hsync flushes the data to the DNs like hflush and should also force the data to disk via fsync (or equivalent). But currently for HDFS hsync is implemented as hflush!
The gist is that both closing a block and issuing hsync/hflush on the client currently only guarantee that the data was flushed from the client to the replica DNs and into the OS buffers on each DN; not that the data actually reached a physical disk.
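
For reference, this is what the two calls look like from an application's point of view, using the FSDataOutputStream API (which implements Syncable), assuming a Hadoop version that includes HADOOP-6313; the path is just a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SyncExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-example"))) {
              out.write("edit-1".getBytes("UTF-8"));
              out.hflush(); // pushes the current packet to all replica DNs (OS buffers only)
              out.write("edit-2".getBytes("UTF-8"));
              out.hsync();  // intended to also fsync on the DNs, but see the caveat above
          }
      }
  }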

For HBase and similar applications with durability guarantees this can be insufficient. If three or more DN machines crash at the same time (assuming three replicas), for example due to a multi-rack or data center power outage, data might be lost.
Further, since HBase constantly compacts older, smaller HFiles into newer, larger ones, this potential data loss is not limited to new data.
(But note that, like most database setups, HBase should be deployed with redundant power supplies anyway, so this is not necessarily an issue.)

Due to the inner workings of the DN it is difficult to implement 100% POSIX fsync semantics. Imagine a client that writes many blocks' worth of data and then issues an hsync. In order to sync the data correctly to disk, either the client or all involved DNs would need to keep track of all blocks (full or partial) written so far that have not yet been sync'ed.

This would be a significant change to how either the client or the DN works and would lead to more complicated code. It would also require keeping the block files open in order to retain the file descriptors, so that an fsync could potentially be issued in the future. Since the client might in fact never issue a sync request, the number of open files to retain is unbounded.

The other option (similar to POSIX O_SYNC) is to have the DNs call fsync upon receipt of every single packet, leading to many unnecessary fsyncs.

In HDFS-744 I propose a hybrid solution. A data stream can be created with a SYNC_BLOCK flag. This flag causes the DFSClient to set a "sync" flag on the last packet of a block, i.e. the block file is fsync'ed upon close.

This flag is also set when the client issues hsync. If the client has outstanding data, the current packet is tagged with the "sync" flag and sent immediately; otherwise an empty packet with this flag is sent.

When a DN receives such a packet, it will immediately flush the currently open file (representing the current block - full on close or partial on hsync - being written) to disk.
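
Here is a hedged sketch of what using this could look like from a client, assuming the flag is exposed as CreateFlag.SYNC_BLOCK through the FileSystem.create overload that accepts CreateFlags (as in Hadoop releases that include HDFS-744); block size, replication, and path are illustrative values:

  import java.util.EnumSet;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.CreateFlag;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class SyncBlockExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          FSDataOutputStream out = fs.create(
                  new Path("/tmp/hfile-example"),
                  FsPermission.getFileDefault(),
                  EnumSet.of(CreateFlag.CREATE, CreateFlag.SYNC_BLOCK),
                  4096,                 // buffer size
                  (short) 3,            // replication
                  64L * 1024 * 1024,    // block size
                  null);                // no progress callback
          out.write(new byte[128]);
          out.hsync();  // tags the current packet with the "sync" flag
          out.close();  // the last packet of the block is also tagged, so the
                        // block file is fsync'ed on the DNs when it is closed
      }
  }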

In summary: with this compromise a client can guarantee - byte-by-byte if needed - which portion of an open file is on a durable medium, while avoiding either syncing every packet to disk or keeping track of past unsync'ed blocks.
For HBase this conveniently deals with compactions, since blocks are sync'ed upon close, and also with WAL edits, since it correctly allows sync'ing the current block.

The downside is that upon close each block needs to be sync'ed to disk, even though the client might never issue a sync request for this stream; this leads to potentially unneeded fsyncs.

HBASE-5954 proposes matching changes to HBase to make use of this new HDFS feature. This issue introduces a WAL sync config option and an HFile sync option.
The former causes HBase to issue an hsync when a batch of WAL entries is written. The latter makes sure HFiles (generated from memstore flushes or compactions) are guaranteed to be on a durable medium when the stream is closed.
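
A possible way to enable both options from a client-side configuration is sketched below; the property names hbase.wal.hsync and hbase.hfile.hsync are my assumption of the keys (check HBASE-5954 for the exact names shipped):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class DurabilityConfig {
      public static Configuration durableConf() {
          Configuration conf = HBaseConfiguration.create();
          conf.setBoolean("hbase.wal.hsync", true);   // assumed key: hsync each batch of WAL edits
          conf.setBoolean("hbase.hfile.hsync", true); // assumed key: hsync HFiles when the stream is closed
          return conf;
      }
  }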

There are also a few simple performance tests listed in that issue.

Future optimization is possible:
  • Only one of the replica DNs could issue the sync
  • Only one DN in each Rack could issue the sync
  • The sync could be done in parallel, and the response from the DN need not wait for it to finish (in this case the client has no guarantee that the sync actually finished when hsync returns, only that the DN promised to do it)
  • For HBase, both options could be made configurable per column family.