
[Repost] Integrating Lucene with HBase

Reposted from: http://www.infoq.com/articles/LuceneHbase

Search plays a pivotal role in just about any modern application, from shopping sites to social networks to points of interest. The Lucene search library is today's de facto standard for implementing search engines. It is used by Apple, IBM, Atlassian (Jira), Wolfram - pick your favorite company [1]. As a result, any implementation that improves Lucene's scalability and performance is of great interest.

Quick introduction to Lucene

Searchable entities in Lucene are represented as documents comprised of fields and their values. Every field value is comprised of one or more searchable elements - terms. Lucene search is based on an inverted index containing information about searchable documents. Unlike normal indexes, where you look up a document to find out what fields it contains, in an inverted index you look up a field's term to find all the documents it appears in. A high-level Lucene architecture [2] is presented in Figure 1. Its main components are IndexSearcher, IndexReader, IndexWriter, and Directory. IndexSearcher implements the search logic. IndexWriter writes the inverted index for each inserted document. IndexReader reads the contents of the indexes in support of IndexSearcher. Both IndexReader and IndexWriter rely on Directory, which provides APIs for manipulating index data sets that directly mimic a file system API.

Figure 1: High-level Lucene architecture
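
To make these roles concrete, here is a minimal, self-contained sketch (ours, not the article's code) that indexes and searches one document using the Lucene 3.x API of the period; RAMDirectory stands in for whatever Directory implementation a deployment would actually use.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        // Directory abstracts index storage; RAMDirectory keeps it in memory.
        Directory dir = new RAMDirectory();

        // IndexWriter builds the inverted index for each added document.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
        Document doc = new Document();
        doc.add(new Field("title", "integrating lucene with hbase",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // IndexSearcher consults the inverted index through an IndexReader.
        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "hbase")), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
        reader.close();
    }
}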

The standard Lucene distribution contains several directory implementations, including file system-based and memory-based ones[1]. The drawback of a standard file system-based backend (directory implementation) is performance degradation caused by index growth. Different techniques have been used to overcome this problem, including load balancing and index sharding - splitting indexes between multiple Lucene instances. Although powerful, sharding complicates the overall implementation architecture and requires a certain amount of a priori knowledge about the expected documents to properly partition Lucene indexes. A different approach is to let the index backend itself shard the data correctly and build an implementation on top of such a backend. One such backend is a NoSQL database. In this article we describe an implementation based on HBase [4].

Implementation approach

As explained in [3], at a very high level Lucene operates on two distinct data sets:
  • The index data set keeps all the field/term pairs (with additional info such as term frequency, position, etc.) and the documents containing these terms in the appropriate fields.
  • The document data set stores all the documents, including stored fields, etc.
As mentioned above, directly implementing the Directory interface is not always the simplest (or most convenient) way to port Lucene to a new backend. As a result, several Lucene ports, including the limited memory index support in the Lucene contrib module, Lucandra [5], and HBasene [6], took a different approach [2] and overrode not the Directory but Lucene's higher-level classes - IndexReader and IndexWriter - thus bypassing the Directory APIs (Figure 2).

Figure 2: Integrating Lucene with a backend without a file system

Although this approach often requires more work [2], it leads to significantly more powerful implementations that can fully utilize the backend's native capabilities. The implementation [2] presented in this article follows this approach.
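
As a sketch of what overriding the higher-level classes entails, the hypothetical skeleton below extends Lucene 3.x's abstract IndexReader with a few representative methods. It stays abstract because a real port must cover many more of IndexReader's methods, and none of the bodies below reflect the article's actual code.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Hypothetical skeleton: a backend-specific reader answers index queries from
// HBase (via the in-memory cache) instead of going through a Directory.
public abstract class HBaseIndexReader extends IndexReader {

    @Override
    public int docFreq(Term t) throws IOException {
        // Look up the (field, term) row in the cache/HBase index table and
        // return the number of document columns it contains.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public TermEnum terms() throws IOException {
        // Return an enumerator that scans (field, term) values.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public TermDocs termDocs() throws IOException {
        // Return per-term document/frequency data backed by the memory model.
        throw new UnsupportedOperationException("sketch only");
    }
}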

Overall Architecture

The overall implementation (Figure 3) is based on a memory-based backend used as an in-memory cache and a mechanism for synchronizing this cache with the HBase backend.

Figure 3: Overall Architecture of HBase-based Lucene implementation

The implementation tries to balance two conflicting requirements. The first is performance: an in-memory cache can drastically improve performance by minimizing the number of HBase reads for search and document retrieval. The second is scalability: the ability to run as many Lucene instances as required to support a growing population of search clients. The latter requires minimizing the cache lifetime to keep content synchronized with the HBase instance (the single copy of truth). A compromise is achieved through a configurable cache time-to-live parameter, which limits how long data stays cached in each Lucene instance.

Underlying data model for in memory cache

As mentioned before, the internal Lucene data model is based on two main data sets - index and documents - which are implemented as two models: IndexMemoryModel and DocumentMemoryModel. In our implementation, both reads and writes (IndexReader/IndexWriter) go through the memory cache, but their implementations are very different. For reads, the cache first checks whether the required data is in memory and is not stale[3]; if so, it uses it directly. Otherwise the cache reads/refreshes the data from HBase and then returns it to the IndexReader. For writes, on the other hand, the data is written directly to HBase without being stored in memory. Although this might delay the actual availability of the data, it makes the implementation significantly simpler: we do not need to worry about which caches new or updated data must be delivered to. This delay can be controlled, to adhere to business requirements, by setting an appropriate cache expiration time.
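
A minimal sketch of this read path follows, with a generic loader callback standing in for the actual HBase read; all names here are illustrative, not the article's.

import java.util.concurrent.ConcurrentHashMap;

// Read-through cache with a time-to-live: reads are served from memory while
// fresh, refreshed from HBase when stale; writes bypass the cache entirely.
public class TtlCache<K, V> {
    public interface Loader<K, V> { V load(K key); }

    private static class Entry<W> {
        final W value;
        final long loadedAt;
        Entry(W value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<K, Entry<V>>();
    private final long ttlMillis;        // configurable cache time-to-live
    private final Loader<K, V> loader;   // fetches fresh data from HBase

    public TtlCache(long ttlMillis, Loader<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> e = cache.get(key);
        // "Stale" means it has been in memory longer than the TTL (footnote [3]).
        if (e == null || System.currentTimeMillis() - e.loadedAt > ttlMillis) {
            V fresh = loader.load(key);   // read/refresh from HBase
            cache.put(key, new Entry<V>(fresh, System.currentTimeMillis()));
            return fresh;
        }
        return e.value;                   // served from memory
    }
}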

IndexMemoryModel

The class diagram for the index memory model is presented in Figure 4.

Figure 4: IndexMemoryModel class diagram

In this implementation (a structural sketch in code follows the list):
  • The LuceneIndexMemoryModel class contains a FieldTermDocuments instance for every field currently present in memory. It also provides all of the internal APIs necessary for the implementation of IndexReader/IndexWriter.
  • The FieldTermDocuments class manages TermDocuments for every field value. For a scannable database, the list of fields and the list of field values are typically combined into one navigable list of field/term values. For the memory-based cache implementation we have split them into two separate maps to make search times more predictable.
  • The TermDocuments class contains a TermDocument instance for every document ID.
  • The TermDocument class contains the information stored in the index for a given document: the document frequency and an array of positions.
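
Rendered as plain Java, the model above might look like the following nested-map sketch; the class names follow the article, while the exact member layout is our assumption.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Structural sketch of the index memory model (assumed layout).
public class LuceneIndexMemoryModel {
    // One FieldTermDocuments per field currently present in memory.
    private final Map<String, FieldTermDocuments> fields =
            new ConcurrentHashMap<String, FieldTermDocuments>();
}

class FieldTermDocuments {
    // Term value -> documents containing that term; kept as a separate map
    // (rather than one navigable field/term list) for predictable lookups.
    private final Map<String, TermDocuments> termValues =
            new ConcurrentHashMap<String, TermDocuments>();
}

class TermDocuments {
    // Document ID -> per-document index data.
    private final Map<Integer, TermDocument> documents =
            new ConcurrentHashMap<Integer, TermDocument>();
}

class TermDocument {
    int docFrequency;     // term frequency in the document
    int[] docPositions;   // positions of the term within the document
}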

DocumentMemoryModel

A class diagram for the document memory model is presented in Figure 5.

Figure 5: DocumentMemoryModel class diagram

In this implementation (again, a structural sketch follows the list):
  • The LuceneDocumentMemoryModel class contains a map with a DocumentStructure entry for every indexed document.
  • The DocumentStructure class contains information about a single document: a list of its saved fields and information about each indexed field.
  • The FieldData class contains the information saved for a stored field, including the field name, the value, and a binary/string flag.
  • The DocumentTermFrequency class contains information about each indexed field, including a back reference to the corresponding index structure (field, term), the term frequency in the document, the positions of the term in the document, and offsets from the beginning of the document.
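
A corresponding structural sketch, again using the article's class names with an assumed member layout:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Structural sketch of the document memory model (assumed layout).
public class LuceneDocumentMemoryModel {
    // One DocumentStructure per indexed document.
    private final Map<Integer, DocumentStructure> documents =
            new ConcurrentHashMap<Integer, DocumentStructure>();
}

class DocumentStructure {
    List<FieldData> storedFields;                       // saved fields
    Map<String, DocumentTermFrequency> indexedFields;   // per indexed field/term
}

class FieldData {
    String fieldName;
    boolean binary;   // binary vs. string flag
    byte[] data;      // the stored value
}

class DocumentTermFrequency {
    String field;       // back reference to the index structure (field, term)
    String term;
    int docFrequency;   // term frequency in this document
    int[] positions;    // positions of the term in the document
    int[] startOffsets; // offsets from the beginning of the document
    int[] endOffsets;
}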

LuceneDocumentNormMemoryModel

As explained in [9], norms are used to represent a document/field's boost factor, thus providing better ranking of search results at the expense of a significant amount of memory. The class implementation is based on a map of maps, where the inner map stores a norm for each document, while the outer one stores a norm map for each field. Although norm information is keyed by field name and thus could be appended to the LuceneIndexMemoryModel class, we decided to implement norms management as a separate class, LuceneDocumentNormMemoryModel, because the usage of norms in Lucene is optional.
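
A sketch of that map-of-maps layout, assuming Lucene 3.x's one-byte-per-document norm representation (the accessor shown is our guess, mirroring IndexReader.norms):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the outer map is keyed by field name, the inner map stores one
// norm byte per internal document ID.
public class LuceneDocumentNormMemoryModel {
    private final Map<String, Map<Integer, Byte>> norms =
            new ConcurrentHashMap<String, Map<Integer, Byte>>();

    public byte[] norms(String field, int maxDoc) {
        // Lucene 3.x represents norms as one byte per document; document IDs
        // are assumed to be smaller than maxDoc.
        byte[] result = new byte[maxDoc];
        Map<Integer, Byte> fieldNorms = norms.get(field);
        if (fieldNorms != null) {
            for (Map.Entry<Integer, Byte> e : fieldNorms.entrySet()) {
                result[e.getKey()] = e.getValue();
            }
        }
        return result;
    }
}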

IndexWriter

With the underlying memory model described above, the implementation of the index writer is fairly straightforward. Because Lucene does not define an IndexWriter interface, we had to implement our IndexWriter by implementing all of the methods that exist in the standard Lucene implementation. The workhorse of this class is the addDocument method. This method iterates through all of a document's fields; for every field, it checks whether the field should be tokenized and, if so, uses the specified analyzer. The method also updates all three memory structures - index, document, and (optionally) norms - storing information for the added document.
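
The fragment below sketches the tokenization step at the heart of such an addDocument method, using the standard Lucene 3.x analysis API; the memory-model updates themselves are elided and marked by a comment.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Sketch: tokenize one field's value and track term positions.
public class FieldTokenizer {
    public static void tokenizeField(Analyzer analyzer, String field, String value)
            throws IOException {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(value));
        CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posAttr =
                stream.addAttribute(PositionIncrementAttribute.class);
        stream.reset();
        int position = -1;
        while (stream.incrementToken()) {
            position += posAttr.getPositionIncrement();
            String term = termAttr.toString();
            // A real implementation would update the index memory model here:
            // record (field, term, docId, position) and bump the term frequency.
            System.out.println(field + ":" + term + " @ " + position);
        }
        stream.end();
        stream.close();
    }
}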

IndexReader

IndexReader implements the IndexReader interface provided by Lucene core. Because list gets in HBase are much faster than individual reads, we extended this class with a method that allows reading multiple documents at once. The class itself does not do much, outsourcing most of the processing to several classes that it manages:
  • While a document ID is typically a string, Lucene internally operates on integers. The DocIDManager class is responsible for managing the string-to-number translation (see the sketch after this list). This class is used by IndexReader in the form of thread-local storage, allowing for automatic cleanup when the thread ends.
  • The MemoryTermEnum class extends the TermEnum class provided by Lucene and is responsible for scanning through field/term values.
  • The MemoryTermFrequencyVector class implements the TermDocs and TermPositions interfaces provided by Lucene and is responsible for processing information about the documents for a given field/term pair.
  • The MemoryTermFrequencyVector class implements the TermFreqVector and TermPositionVector interfaces provided by Lucene and is responsible for returning information about the frequencies and positions of terms in documents' fields for given document IDs.
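
A hedged sketch of what such a DocIDManager might look like; the class name comes from the article, but the API shown is our guess.

import java.util.HashMap;
import java.util.Map;

// Maps string document IDs to the integers Lucene uses internally, with one
// mapping per thread so it is discarded when the search thread finishes.
public class DocIDManager {
    private static final ThreadLocal<DocIDManager> INSTANCE =
            new ThreadLocal<DocIDManager>() {
                @Override protected DocIDManager initialValue() {
                    return new DocIDManager();
                }
            };

    private final Map<String, Integer> stringToInt = new HashMap<String, Integer>();
    private final Map<Integer, String> intToString = new HashMap<Integer, String>();
    private int next = 0;

    public static DocIDManager current() { return INSTANCE.get(); }

    public int toInt(String docId) {
        Integer n = stringToInt.get(docId);
        if (n == null) {
            n = next++;
            stringToInt.put(docId, n);
            intToString.put(n, docId);
        }
        return n;
    }

    public String toString(int n) { return intToString.get(n); }
}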

HBase tables

The proposed solution is based on two main HBase tables: the index table (Figure 6) and the document table (Figure 7).

Figure 6: Hbase Index table


Figure 7: HBase document table

An optional third table (Figure 8) can be implemented if Lucene norms need to be supported.

Figure 8: HBase norm table

The HBase index table (Figure 6) is the workhorse of the implementation. This table has an entry (row) for every field/term combination known to the Lucene instance, containing one column family: the documents family. This column family contains a column (named after the document ID) for every document containing this field/term. The content of each column is the value of the TermDocument class.

The HBase document table (Figure 7) stores the documents themselves, back references to the indexes/norms referencing these documents, and some additional bookkeeping information used by Lucene for document processing. It has an entry (row) for every document known to the Lucene instance. Each document is uniquely identified by a document ID (key) and contains two column families: the fields family and the index family. The fields column family contains a column (named after the field name) for every document field stored by Lucene; the column value is comprised of the value type (string or byte array) and the value itself. The index column family contains a column (named after the field/term) for every index referencing this document; the column value includes the document frequency, positions, and offsets for the given field/term.

The HBase norm table (Figure 8) stores document norms for every field. It has an entry (row) for every field (key) known to the Lucene instance. Each row contains a single column family: the norms family. This family has a column (named after the document ID) for every document for which the given field's norm needs to be stored.
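
To make the index-table layout concrete, the sketch below writes and reads one field/term row using the HBase client API of the era; the table and family names are illustrative, not taken from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable indexTable = new HTable(conf, "luceneIndex");

        // Row key: field/term pair; one column per document in the "documents" family.
        byte[] row = Bytes.toBytes("title/hbase");
        byte[] family = Bytes.toBytes("documents");
        byte[] docId = Bytes.toBytes("doc-42");
        byte[] termDocument = new byte[0]; // Avro-serialized TermDocument (Listing 1)

        Put put = new Put(row);
        put.add(family, docId, termDocument);
        indexTable.put(put);

        // Search side: a single get returns every document containing title:hbase.
        Result result = indexTable.get(new Get(row));
        System.out.println("documents for title/hbase: " +
                result.getFamilyMap(family).size());
        indexTable.close();
    }
}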

Data formats

A final design decision is the data format for storing data in HBase. For this implementation we have chosen Avro [10], based on its performance, the minimal size of the resulting data, and its tight integration with Hadoop. The main data structures used by the implementation are TermDocument (Listing 1), the document's FieldsData (Listing 2), and DocumentTermFrequency (Listing 3).
{
  "type" : "record",
  "name" : "TermDocument",
  "namespace" : "com.navteq.lucene.hbase.document",
  "fields" : [ {
    "name" : "docFrequency",
    "type" : "int"
  }, {
    "name" : "docPositions",
    "type" : ["null", {
      "type" : "array",
      "items" : "int"
   }]
  } ]
}
Listing 1: TermDocument Avro definition
{
  "type" : "record",
  "name" : "FieldsData",
  "namespace" : "com.navteq.lucene.hbase.document",
  "fields" : [ {
    "name" : "fieldsArray",
    "type" : {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "singleField",
        "fields" : [ {
          "name" : "binary",
          "type" : "boolean"
        }, {
          "name" : "data",
          "type" : [ "string", "bytes" ]
        } ]
      }
    }
  } ]
}
Listing 2: FieldsData Avro definition
{
  "type" : "record",
  "name" : "TermDocumentFrequency",
  "namespace" : "com.navteq.lucene.hbase.document",
  "fields" : [ {
    "name" : "docFrequency",
    "type" : "int"
  }, {
    "name" : "docPositions",
    "type" : ["null",{
      "type" : "array",
      "items" : "int"
    }]
  }, {
    "name" : "docOffsets",
    "type" : ["null",{
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "TermsOffset",
        "fields" : [ {
          "name" : "startOffset",
          "type" : "int"
        }, {
          "name" : "endOffset",
          "type" : "int"
        } ]
      }
    }]
  } ]
}
Listing 3: TermDocumentFrequency Avro definition
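
As an illustration of the format in use, the sketch below (ours, not the article's code) serializes one TermDocument from Listing 1 to the compact Avro binary form that would be stored in an index-table cell.

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class TermDocumentSerializer {
    // The schema from Listing 1, inlined as a JSON string.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"TermDocument\"," +
        "\"namespace\":\"com.navteq.lucene.hbase.document\",\"fields\":[" +
        "{\"name\":\"docFrequency\",\"type\":\"int\"}," +
        "{\"name\":\"docPositions\",\"type\":[\"null\"," +
        "{\"type\":\"array\",\"items\":\"int\"}]}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord termDoc = new GenericData.Record(schema);
        termDoc.put("docFrequency", 3);
        termDoc.put("docPositions", Arrays.asList(1, 17, 42));

        // Encode to Avro binary; the resulting bytes become an HBase cell value.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(termDoc, encoder);
        encoder.flush();
        System.out.println("serialized TermDocument: " + out.size() + " bytes");
    }
}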

Conclusion

The simple implementation described in this article fully supports all of the Lucene functionality, as validated by many unit tests from both the Lucene core and contrib modules. It can be used as a foundation for building a very scalable search implementation, leveraging the inherent scalability of HBase and its fully symmetric design, which allows adding any number of processes serving HBase data. It also avoids the need to close and reopen a Lucene index reader to incorporate newly indexed data, which becomes automatically available to users with a possible delay controlled by the cache time-to-live parameter. In the next article we will show how to extend this implementation to incorporate geospatial search support.

About the Authors

Boris Lublinsky is principal architect at NAVTEQ, where he works on defining the architecture vision for large data management and processing and for SOA, and on implementing various NAVTEQ projects. He is also an SOA editor for InfoQ and a participant in the SOA RA working group in OASIS. Boris is an author and frequent speaker; his most recent book is "Applied SOA".

Michael Segel has spent the past 20+ years working with customers, identifying and solving their business problems. Michael has worked in multiple roles in multiple industries. He is an independent consultant who is always looking to solve challenging problems. Michael has a software engineering degree from the Ohio State University.

References

1. Lucene-java Wiki
2. Animesh Kumar. Apache Lucene and Cassandra.
3. Animesh Kumar. Lucandra - an inside story!
4. HBase
5. Lucandra
6. HBasene
7. Bigtable
8. Cassandra
9. Michael McCandless, Erik Hatcher, Otis Gospodnetic. Lucene in Action, Second Edition.
10. Boris Lublinsky. Using Apache Avro.
[1] Additionally, Lucene contrib contains a DB directory built for Berkeley DB.
[2] This implementation is inspired by the Lucandra source code [3].
[3] Has not been in memory too long.