热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

HBaseProfiling

ByLarsHofhanslModernCPUcorescanexecutehundredsofinstructionsinthetimeittakestoreloadtheL1cache.RAMisthenewdiskasacoworkeratSalesforcelikestosay.TheL1-cacheisthenewRAMImightadd.Asweaddmoreand

By Lars Hofhansl Modern CPU cores can execute hundreds of instructions in the time it takes to reload the L1 cache. "RAM is the new disk" as a coworker at Salesforce likes to say. The L1-cache is the new RAM I might add. As we add more and

By Lars Hofhansl

Modern CPU cores can execute hundreds of instructions in the time it takes to reload the L1 cache. "RAM is the new disk" as a coworker at Salesforce likes to say. The L1-cache is the new RAM I might add.

As we add more and more CPU cores, we can easily be memory IO bound unless we are a careful.

Many common problems I have seen over the years were related to:
  1. concurrency problems
    Aside from safety and liveliness considerations, a typical problem is too much synchronization limiting potential parallel execution.
  2. unneeded or unintended memory barriers
    Memory barriers are required in Java by the following language constructs:
    • synchronized - sets read and write barriers as needed (details depend on JVM, version, and settings)
    • volatile - sets a read barrier before a read to a volatile, and write barrier after a write
    • final - set a write barrier after the assignment
    • AtomicInteger, AtomicLong, etc - uses volatiles and hardware CAS instructions
  3. unnecessary, unintended, or repeated memory copy or access
    Memory copying is often seen in Java for example because of the lack of in-array pointers, or really just general unawareness and the expectation that "garbage collector will clean up the mess." Well, it does, but not without a price.
(Entire collections of books are dedicated to each of these topics, so I won't embarrass myself by going into more detail.)

Like any software project of reasonable size, HBase has problems of all the above categories.

Profiling in Java has become extremely convenient. Just start jVisualVM which ships with the SunOracleJDK, pick the process to profile (in my case a local HBase regionserver) and start profiling.

Over the past few weeks I did some on and off profiling in HBase, which lead to the following issues:

HBASE-6603 - RegionMetricsStorage.incrNumericMetric is called too often

Ironically here it was the collection of a performance metric that caused a measurable slowdown of up 15%(!) for very wide rows (> 10k columns).
The metric was maintained as an AtomicLong, which introduced a memory barrier in one of the hottest code paths in HBase.
The good folks at Facebook have found the same issue at roughly the same time. (It turns that they were also... uhm... the folks who introduced the problem.)

HBASE-6621 - Reduce calls to Bytes.toInt

A KeyValue (the data structure that represents "columns" in HBase) is currently backed by a single byte[]. The sizes of the various parts are encoded in this byte[] and have to read and decoded; each time an extra memory access. In many cases that can be avoided, leading to slight performance improvement.

HBASE-6711 - Avoid local results copy in StoreScanner

All references pertaining to a single row (i.e. KeyValue with the same row key) were copied at the StoreScanner layer. Removing this lead to another slight performance increase with wide rows.

HBASE-7180 - RegionScannerImpl.next() is inefficient

This introduces a mechanism for coprocessors to access RegionScanners at a lower level, thus allowing skipping of a lot of unnecessary setup for each next() call. In tight loops a coprocessor can make use of this new API to save another 10-15%.

HBASE-7279 - Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

The row key of KeyValue was copied in the various scan related classes. To reduce that effect the row key was previously cached in the KeyValue class - leading to extra memory required for each KeyValue.
This change avoids all copying and hence also obviates the need for caching the row key.
A KeyValue now is hardly more than an array pointer (a byte[], an offset, and a length), and no data is copied any longer all the way from the block loaded from disk or cache to the RPC layer (unless the KeyValues are optionally encoded on disk, in which case they still need to be decoded in memory - we're working on improving that too).

Previously the size of a KeyValue on the scan path was at least 116 bytes + the length of the rowkey (which can be arbitrarily long). Now it is ~60 bytes, flat and including its own reference.
(remember during a course of a large scan we might be creating millions or even billions of KeyValue objects)

This is nice improvement both in term of scan performance (15-20% for small row keys of few bytes, much more for large ones) and in terms of produced garbage.
Since all copying is avoided, scanning now scales almost linearly with the number of cores.

HBASE-6852 - SchemaMetrics.updateOnCacheHit costs too much while full scanning a table with all of its fields

Other folks have been busy too. Here Cheng Hao found another problem with a scan related metric that caused a noticeable slowdown (even though I did not believe him first).
This removed another set of unnecessary memory barriers.

HBASE-7336 - HFileBlock.readAtOffset does not work well with multiple threads

This is slightly different issue caused by bad synchronization of the FSReader associated with a Storefile. There is only a single reader per storefile. So if the file's blocks are not cached - possibly because the scan indicated that it wants no caching, because it expects to touch too many blocks - the scanner threads are now competing for read access to the store file. That lead to outright terrible performance, such a scanners timing out even with just two scanners accessing the same file in tight loop.
This patch is a stop gap measure: Attempt to acquire the lock on the reader, if that failed switch to HDFS positional reads, which can read at an offset without affecting the state of the stream, and hence requires no locking.

Summary

Together these various changes can lead to ~40-50% scan performance improvement when using a single core. Even more when using multiple cores on the same machines (as is the case with HBase)

An entirely unscientific benchmark

20m rows, with two column families just a few dozen bytes each.

I performed two tests:

1. A scan that returns rows to the client

2. A scan that touches all rows via a filter but does not return anything to the client.

(This is useful to gauge the actual server side performance).


Further I tested with (1) no caching, all reads from disk (2) all data in the OS cache and (3) all data in HBase's block cache.


I compared 0.94.0 against the current 0.94 branch (what I will soon release as 0.94.4).


Results:

  • Scanning with scanner caching set to 10000:
    • 0.94.0
      no data in cache: 54s
      data in OS cache: 51s
      data in block cache: 35s

    • 0.94.4-snapshot
      no data in cache: 50s (IO bound between disk and network)
      data in OS cache: 43s
      data in block cache: 32s
      (limiting factor was shipping the results to the client)

  • all data filtered at the server (with a SingleValueColumnFilter that does not match anything, so each rows is still scanned)
    • 0.94.0
      no data in cache: 31s

      data in OS cache: 25s
      data in block cache: 11s

    • 0.94.4-snapshot
      no data in cache: 22s
      data in OS cache: 17s
      cache in block cache: 6.3s

I have not quantified the same with multiple concurrent scanners, yet.

So as you can see scan performance has significantly improved since 0.94.0.

Salesforce just hired some performance engineers from a well known chip manufacturer, and I plan to get some of their time to analyze HBase in even more details, to track down memory stalls, etc.

推荐阅读
  • 知识图谱——机器大脑中的知识库
    本文介绍了知识图谱在机器大脑中的应用,以及搜索引擎在知识图谱方面的发展。以谷歌知识图谱为例,说明了知识图谱的智能化特点。通过搜索引擎用户可以获取更加智能化的答案,如搜索关键词"Marie Curie",会得到居里夫人的详细信息以及与之相关的历史人物。知识图谱的出现引起了搜索引擎行业的变革,不仅美国的微软必应,中国的百度、搜狗等搜索引擎公司也纷纷推出了自己的知识图谱。 ... [详细]
  • React 小白初入门
    推荐学习:React官方文档:https:react.docschina.orgReact菜鸟教程:https:www.runoob.c ... [详细]
  • Hadoop源码解析1Hadoop工程包架构解析
    1 Hadoop中各工程包依赖简述   Google的核心竞争技术是它的计算平台。Google的大牛们用了下面5篇文章,介绍了它们的计算设施。   GoogleCluster:ht ... [详细]
  • mapreduce源码分析总结
    这篇文章总结的非常到位,故而转之一MapReduce概述MapReduce是一个用于大规模数据处理的分布式计算模型,它最初是由Google工程师设计并实现的ÿ ... [详细]
  • Hello.js 是一个用于连接OAuth2服务的JavascriptRESTFULAPI库,如Go ... [详细]
  • 项目需要将音视频文件上传服务器,考虑并发要求高,通过七牛来实现。直接上代码usingQiniu.IO;usingQiniu.IO.Resumable;usingQiniu.RPC; ... [详细]
  • java布尔字段用is前缀_POJO类中布尔类型的变量都不要加is前缀详解
    前言对应阿里巴巴开发手册第一章的命名风格的第八条。【强制】POJO类中布尔类型的变量都不要加is前缀,否则部分框架解析会引起序列化错误。反例:定义为基本 ... [详细]
  • Hadoop 源码学习笔记(4)Hdfs 数据读写流程分析
    Hdfs的数据模型在对读写流程进行分析之前,我们需要先对Hdfs的数据模型有一个简单的认知。数据模型如上图所示,在NameNode中有一个唯一的FSDirectory类负责维护文件 ... [详细]
  • OAuth2.0指南
    引言OAuth2.0是一种应用之间彼此访问数据的开源授权协议。比如,一个游戏应用可以访问Facebook的用户数据,或者一个基于地理的应用可以访问Foursquare的用户数据等。 ... [详细]
  • FontAwesome是一种webfont,它包含了几乎所有常用的图标,比如Twitter、facebook等等。用户可以自定义这些图标字体,包括大小、颜色、阴影效果以及其它可以通过CSS控制的属性。它有以下的优点:1、像矢量图形一样,可以无限放大2、只需一种字体,同时拥有多个图标,目前支持439个图标3、无需考虑兼容性问题,fontawesome不依赖于javascri ... [详细]
  • 篇首语:本文由编程笔记#小编为大家整理,主要介绍了Flutter添加APP启动StoryView相关的知识,希望对你有一定的参考价值。 ... [详细]
  • navicat生成er图_实践案例丨ACL2020 KBQA 基于查询图生成回答多跳复杂问题
    摘要:目前复杂问题包括两种:含约束的问题和多跳关系问题。本文对ACL2020KBQA基于查询图生成的方法来回答多跳复杂问题这一论文工作进行了解读 ... [详细]
  • 基于深度学习的遥感应用
    文章目录深度学习的发展过程深度学习在遥感中的应用基于深度学习的遥感样例库建设基于深度学习的遥感影像目标及场景检索基于深度学习的建筑物提取基于深度学习的密集建筑物自动检测基于深度学习 ... [详细]
  • windows平台使用NSP拦截具体进程的域名解析过程(xFsRedir的代理功能之域名代理)
    byfanxiushu2022-10-17转载或引用请注明原始作者。xFsRedir软件其中之一的功能就是实现了全方位的网络代理,从主机代理,到本地代理 ... [详细]
  • 为元宇宙提供动力的 5 项重要技术
    元宇宙是你肯定听说过的东西。在过去的一年里,每个人都在谈论它。这是技术领域的下一件大事。Bloomberg情报高级行业分析师马修·坎特曼(MatthewKanterman)的分析显 ... [详细]
author-avatar
小美女阿风
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有