
MySQL Internals: Index Merge Optimization

Author: mobiledu2502856483 | Source: Internet | 2013-05-20 17:18


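0 Preface

I had mistakenly assumed that Index Merge was a new feature in MySQL 5.6. It is not: 5.5 has it too, and the manual shows it was already present in 5.0, so it is a long-established optimization. Either way, today let's lift the veil and look at how the optimizer generates an Index Merge plan internally. Only the Intersect operation is covered in detail here; I have not yet studied the code for Union and Sort-Union.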

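1 Index Merge: Theoretical Background

Index Merge means using several indexes on the same table in a single query and then combining the rows each index produces. The combination can be an intersection (Intersect) or a union (Union). Union also has a supplementary algorithm, Sort-Union. Oddly, there is no Sort-Intersect, even though in principle it should be feasible.

When does using several indexes at once pay off? Suppose the WHERE condition is C1=10 AND C2=100, and there are separate indexes on C1 and on C2 but no composite (C1,C2) index. That is when using both indexes makes sense: each index quickly locates a batch of rows, and the batches are then intersected (for AND conditions) or unioned (for OR conditions). This can be faster than a full table scan, and faster than scanning only one of the indexes and then looking up every row in the primary index.

Both Intersect and Union require the indexes involved to be ROR, that is, rowid-ordered: the rows returned by each index scan must come back sorted by ROWID. The ROWID here is simply the InnoDB primary key (if no primary key is defined, InnoDB implicitly adds a ROWID column to serve as one). Only when every index scan is ROR can the streams be combined in a single merge pass, as in merge sort. You might ask: couldn't the rows just be sorted internally after they are fetched, so why insist on ROR? Exactly, and that is why SORT-UNION exists: it sorts the output of each non-ROR index scan and then merges. As for why there is no SORT-INTERSECT, I am just as puzzled.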
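To make the role of rowid ordering concrete, here is a minimal Python sketch of the three combinations described above. It is not MySQL code: the function names `ror_intersect`, `ror_union` and `sort_union` are made up for illustration, and rowids are modelled as plain integers. The point is only that sorted inputs can be intersected or unioned in one streaming pass, while unsorted inputs need a sort first.

```python
import heapq
from typing import Iterable, Iterator


def ror_intersect(*streams: Iterable[int]) -> Iterator[int]:
    """Yield rowids present in every stream (each stream sorted ascending)."""
    iters = [iter(s) for s in streams]
    try:
        heads = [next(it) for it in iters]
    except StopIteration:
        return  # one stream is empty, so the intersection is empty
    while True:
        target = max(heads)
        all_equal = True
        for i, it in enumerate(iters):
            # Skip rowids that cannot be in the intersection.
            while heads[i] < target:
                try:
                    heads[i] = next(it)
                except StopIteration:
                    return
            if heads[i] != target:
                all_equal = False
        if all_equal:
            yield target
            try:
                heads = [next(it) for it in iters]
            except StopIteration:
                return


def ror_union(*streams: Iterable[int]) -> Iterator[int]:
    """Yield each rowid once, merging streams that are already sorted."""
    last = None
    for rowid in heapq.merge(*streams):
        if rowid != last:
            yield rowid
            last = rowid


def sort_union(*streams: Iterable[int]) -> Iterator[int]:
    """Sort every (non rowid-ordered) stream first, then merge as above."""
    return ror_union(*(sorted(s) for s in streams))


if __name__ == "__main__":
    print(list(ror_intersect([1, 3, 5, 7], [3, 4, 7, 9])))
    print(list(ror_union([1, 3, 5], [3, 4])))
    print(list(sort_union([5, 1, 3], [4, 3])))
```

The intersection never buffers more than one rowid per stream, which is what the ROR requirement buys; `sort_union` pays for an extra sort per stream in exchange for accepting unordered scans.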

2 Preparing the Data

mysql> show create table im\G
*************************** 1. row ***************************
       Table: im
Create Table: CREATE TABLE `im` (
  `c1` int(11) DEFAULT NULL,
  `c2` int(11) DEFAULT NULL,
  `c3` int(11) DEFAULT NULL,
  KEY `c1` (`c1`,`c3`),
  KEY `c2` (`c2`,`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show create procedure fill_im1\G
*************************** 1. row ***************************
           Procedure: fill_im1
            sql_mode: NO_ENGINE_SUBSTITUTION
    Create Procedure: CREATE DEFINER=`root`@`127.0.0.1` PROCEDURE `fill_im1`(cnt int)
begin
  declare i int default 0;
  repeat
    insert into im values(100, 50, 100);
    set i=i+1;
  until i > cnt end repeat;
end
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: latin1_swedish_ci
1 row in set (0.07 sec)

mysql> show create procedure fill_im2\G
*************************** 1. row ***************************
           Procedure: fill_im2
            sql_mode: NO_ENGINE_SUBSTITUTION
    Create Procedure: CREATE DEFINER=`root`@`127.0.0.1` PROCEDURE `fill_im2`(cnt int)
begin
  declare i int default 0;
  repeat
    insert into im values(100, 100, 50);
    set i=i+1;
  until i > cnt end repeat;
end
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: latin1_swedish_ci
1 row in set (0.00 sec)

mysql> call fill_im1(2000);

mysql> call fill_im2(2000);

mysql> insert into im values(100,50,50);
Query OK, 1 row affected (0.00 sec)

mysql> insert into im values(100,50,50);
Query OK, 1 row affected (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.05 sec)

mysql> select * from im where c1=100 and c2 = 50 and c3 = 50\G
*************************** 1. row ***************************
c1: 100
c2: 50
c3: 50
*************************** 2. row ***************************
c1: 100
c2: 50
c3: 50
2 rows in set (0.13 sec)

3 Execution Plan

mysql> explain select * from im where c1=100 and c2 = 50 and c3 = 50\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: im
         type: index_merge
possible_keys: c1,c2
          key: c1,c2
      key_len: 10,10
          ref: NULL
         rows: 1001
        Extra: Using intersect(c1,c2); Using where; Using index
1 row in set (0.00 sec)

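4 Code Analysis

As the data-generation steps show, the data was constructed specifically for this query. A lookup through either the (c1,c3) index or the (c2,c1) index alone returns about half of the rows, so the cost of either is close to half of a full table scan. Filtering with both indexes together, however, leaves very few rows: just the two rows in the result.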
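The row counts behind that claim can be checked with a small Python sketch that rebuilds the same data in memory. It assumes the procedures behave as listed in section 2: a `repeat ... until i > cnt` loop starting from `i = 0` runs `cnt + 1` times, so each call with 2000 inserts 2001 rows. The helper name `build_rows` is made up for this illustration.

```python
def build_rows(cnt: int = 2000):
    """Rebuild the contents of table `im` as (c1, c2, c3) tuples."""
    rows = []
    rows += [(100, 50, 100)] * (cnt + 1)   # call fill_im1(cnt)
    rows += [(100, 100, 50)] * (cnt + 1)   # call fill_im2(cnt)
    rows += [(100, 50, 50)] * 2            # the two manual inserts
    return rows


rows = build_rows()
total = len(rows)

# Rows reachable through index c1 (c1, c3) with c1 = 100 AND c3 = 50.
via_c1 = sum(1 for c1, c2, c3 in rows if c1 == 100 and c3 == 50)
# Rows reachable through index c2 (c2, c1) with c2 = 50 AND c1 = 100.
via_c2 = sum(1 for c1, c2, c3 in rows if c2 == 50 and c1 == 100)
# Rows satisfying the full WHERE clause.
both = sum(1 for c1, c2, c3 in rows if c1 == 100 and c2 == 50 and c3 == 50)

print(total, via_c1, via_c2, both)
```

Under these assumptions the table holds 4004 rows, each index alone matches 2003 of them, and only 2 rows satisfy all three conditions.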

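In other words, each index on its own filters poorly, but merging the two can filter very well. The more precise term for this filtering power is selectivity, and computing it is part of how the optimizer costs a plan. If combining several indexes yields a small selectivity, so that the merged result set is small, the optimizer leans toward Index Merge; otherwise it does not.

Below is the code that computes the selectivity: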

static double ror_scan_selectivity(const ROR_INTERSECT_INFO *info,
                                   const ROR_SCAN_INFO *scan)
{
  double selectivity_mult= 1.0;
  const TABLE * const table= info->param->table;
  const KEY_PART_INFO * const key_part= table->key_info[scan->keynr].key_part;
  /**
    key values tuple, used to store both min_range.key and
    max_range.key. This function is only called for equality ranges;
    open ranges (e.g. "min_value < X < max_value") cannot be used for
    rowid ordered retrieval, so in this function we know that
    min_range.key == max_range.key
  */
  uchar key_val[MAX_KEY_LENGTH+MAX_FIELD_WIDTH];
  uchar *key_ptr= key_val;
  SEL_ARG *sel_arg, *tuple_arg= NULL;
  key_part_map keypart_map= 0;
  bool cur_covered;
  bool prev_covered= test(bitmap_is_set(&info->covered_fields,
                                        key_part->fieldnr-1));
  key_range min_range;
  key_range max_range;
  min_range.key= key_val;
  min_range.flag= HA_READ_KEY_EXACT;
  max_range.key= key_val;
  max_range.flag= HA_READ_AFTER_KEY;
  ha_rows prev_records= table->file->stats.records;
  DBUG_ENTER("ror_scan_selectivity");

  for (sel_arg= scan->sel_arg; sel_arg;
       sel_arg= sel_arg->next_key_part)
  {
    DBUG_PRINT("info",("sel_arg step"));
    cur_covered= test(bitmap_is_set(&info->covered_fields,
                                    key_part[sel_arg->part].fieldnr-1));
    if (cur_covered != prev_covered)
    {
      /* create (part1val, ..., part{n-1}val) tuple. */
      bool is_null_range= false;
      ha_rows records;
      if (!tuple_arg)
      {
        tuple_arg= scan->sel_arg;
        /* Here we use the length of the first key part */
        tuple_arg->store_min(key_part[0].store_length, &key_ptr, 0);
        is_null_range|= tuple_arg->is_null_interval();
        keypart_map= 1;
      }
      while (tuple_arg->next_key_part != sel_arg)
      {
        tuple_arg= tuple_arg->next_key_part;
        tuple_arg->store_min(key_part[tuple_arg->part].store_length,
                             &key_ptr, 0);
        is_null_range|= tuple_arg->is_null_interval();
        keypart_map= (keypart_map << 1) | 1;
      }
      min_range.length= max_range.length= (size_t) (key_ptr - key_val);
      min_range.keypart_map= max_range.keypart_map= keypart_map;

      /*
        Get the number of rows in this range. This is done by calling
        records_in_range() unless all these are true:
          1) The user has requested that index statistics should be used
             for equality ranges to avoid the incurred overhead of
             index dives in records_in_range()
          2) The range is not on the form "x IS NULL". The reason is
             that the number of rows with this value are likely to be
             very different than the values in the index statistics
          3) Index statistics is available.
        @see key_val
      */
      if (!info->param->use_index_statistics ||        // (1)
          is_null_range ||                             // (2)
          !(records= table->key_info[scan->keynr].
                     rec_per_key[tuple_arg->part]))    // (3)
      {
        DBUG_EXECUTE_IF("crash_records_in_range", DBUG_SUICIDE(););
        DBUG_ASSERT(min_range.length > 0);
        records= (table->file->
                  records_in_range(scan->keynr, &min_range, &max_range));
      }
      if (cur_covered)
      {
        /* uncovered -> covered */
        double tmp= rows2double(records)/rows2double(prev_records);
        DBUG_PRINT("info", ("Selectivity multiplier: %g", tmp));
        selectivity_mult *= tmp;
        prev_records= HA_POS_ERROR;
      }
      else
      {
        /* covered -> uncovered */
        prev_records= records;
      }
    }
    prev_covered= cur_covered;
  }
  if (!prev_covered)
  {
    double tmp= rows2double(table->quick_rows[scan->keynr]) /
                rows2double(prev_records);
    DBUG_PRINT("info", ("Selectivity multiplier: %g", tmp));
    selectivity_mult *= tmp;
  }
  // Todo: This assert fires in PB sysqa RQG tests.
  // DBUG_ASSERT(selectivity_mult <= 1.0);
  DBUG_PRINT("info", ("Returning multiplier: %g", selectivity_mult));
  DBUG_RETURN(selectivity_mult);
}

When I first read this code I was honestly a bit lost; the comment that accompanies the function helped a great deal:

/*
  Get selectivity of adding a ROR scan to the ROR-intersection.

  SYNOPSIS
    ror_scan_selectivity()
      info  ROR-interection, an intersection of ROR index scans
      scan  ROR scan that may or may not improve the selectivity
            of 'info'

  NOTES
    Suppose we have conditions on several keys
    cond=k_11=c_11 AND k_12=c_12 AND ...  // key_parts of first key in 'info'
         k_21=c_21 AND k_22=c_22 AND ...  // key_parts of second key in 'info'
          ...
         k_n1=c_n1 AND k_n3=c_n3 AND ...  (1) //key_parts of 'scan'

    where k_ij may be the same as any k_pq (i.e. keys may have common parts).

    Note that for ROR retrieval, only equality conditions are usable so there
    are no open ranges (e.g., k_ij > c_ij) in 'scan' or 'info'

    A full row is retrieved if entire condition holds.

    The recursive procedure for finding P(cond) is as follows:

    First step:
    Pick 1st part of 1st key and break conjunction (1) into two parts:
      cond= (k_11=c_11 AND R)

    Here R may still contain condition(s) equivalent to k_11=c_11.
    Nevertheless, the following holds:

      P(k_11=c_11 AND R) = P(k_11=c_11) * P(R | k_11=c_11).

    Mark k_11 as fixed field (and satisfied condition) F, save P(F),
    save R to be cond and proceed to recursion step.

    Recursion step:
    We have a set of fixed fields/satisfied conditions) F, probability P(F),
    and remaining conjunction R
    Pick next key part on current key and its condition "k_ij=c_ij".
    We will add "k_ij=c_ij" into F and update P(F).
    Lets denote k_ij as t, R = t AND R1, where R1 may still contain t. Then

     P((t AND R1)|F) = P(t|F) * P(R1|t|F) = P(t|F) * P(R1|(t AND F)) (2)

    (where '|' mean conditional probability, not "or")

    Consider the first multiplier in (2). One of the following holds:
    a) F contains condition on field used in t (i.e. t AND F = F).
      Then P(t|F) = 1

    b) F doesn't contain condition on field used in t. Then F and t are
     considered independent.

     P(t|F) = P(t|(fields_before_t_in_key AND other_fields)) =
          = P(t|fields_before_t_in_key).

     P(t|fields_before_t_in_key) = #records(fields_before_t_in_key) /
                                   #records(fields_before_t_in_key, t)

    The second multiplier is calculated by applying this step recursively.

  IMPLEMENTATION
    This function calculates the result of application of the "recursion step"
    described above for all fixed key members of a single key, accumulating set
    of covered fields, selectivity, etc.

    The calculation is conducted as follows:
    Lets denote #records(keypart1, ... keypartK) as n_k. We need to calculate

     n_{k1}      n_{k2}
    --------- * ---------  * .... (3)
     n_{k1-1}    n_{k2-1}

    where k1,k2,... are key parts which fields were not yet marked as fixed
    ( this is result of application of option b) of the recursion step for
      parts of a single key).
    Since it is reasonable to expect that most of the fields are not marked
    as fixed, we calculate (3) as

                                  n_{i1}      n_{i2}
    (3) = n_{max_key_part}  / (   --------- * ---------  * ....  )
                                  n_{i1-1}    n_{i2-1}

    where i1,i2, .. are key parts that were already marked as fixed.

    In order to minimize number of expensive records_in_range calls we
    group and reduce adjacent fractions. Note that on the optimizer's
    request, index statistics may be used instead of records_in_range
    @see RANGE_OPT_PARAM::use_index_statistics.

  RETURN
    Selectivity of given ROR scan, a number between 0 and 1. 1 means that
    adding 'scan' to the intersection does not improve the selectivity.
*/

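What the comment explains is how the probability behind the selectivity is computed. In essence, columns that appear in one index but not in the others keep shrinking the selectivity: the more the indexes differ, the more rows are filtered out and the smaller the selected set becomes. "Difference" here means whether the same columns show up in more than one index; the more similar two indexes are, the larger the merged result set. I will leave the implementation details for you to read yourselves; once the principle is clear, the implementation is straightforward.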
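The covered/uncovered bookkeeping can be mimicked in a few lines of Python. This is a simplified model and not the server's implementation: it ignores NULL ranges and the index-statistics shortcut, the names `scan_selectivity` and `intersect_estimate` are invented here, and the per-prefix row counts that the real code obtains from `records_in_range()` are passed in by hand. It also assumes that the caller multiplies the table's row count by each scan's selectivity in turn and then marks that scan's columns as covered.

```python
from typing import Sequence, Set, Tuple


def scan_selectivity(covered: Set[str],
                     key_parts: Sequence[str],
                     prefix_rows: Sequence[int]) -> float:
    """Selectivity gained by adding one equality scan to the intersection.

    prefix_rows[k] is the number of rows matching the conditions on the
    first k key parts; prefix_rows[0] is the table's total row count.
    """
    mult = 1.0
    prev_covered = key_parts[0] in covered
    prev_records = prefix_rows[0]
    for idx, field in enumerate(key_parts):
        cur_covered = field in covered
        if cur_covered != prev_covered:
            records = prefix_rows[idx]  # rows matching the parts before idx
            if cur_covered:
                mult *= records / prev_records   # uncovered -> covered
            else:
                prev_records = records           # covered -> uncovered
            prev_covered = cur_covered
    if not prev_covered:
        # The trailing key parts are new columns: use the full scan's rows.
        mult *= prefix_rows[len(key_parts)] / prev_records
    return mult


def intersect_estimate(
        table_rows: int,
        scans: Sequence[Tuple[Sequence[str], Sequence[int]]]) -> float:
    """Estimated output rows after intersecting the scans in order."""
    covered: Set[str] = set()
    out_rows = float(table_rows)
    for key_parts, prefix_rows in scans:
        out_rows *= scan_selectivity(covered, key_parts, prefix_rows)
        covered.update(key_parts)
    return out_rows


# Exact counts for the test table: 4004 rows, c1 = 100 matches all of
# them, c2 = 50 matches 2003, and each full index condition matches 2003.
scans = [
    (("c1", "c3"), (4004, 4004, 2003)),   # index c1 (c1, c3)
    (("c2", "c1"), (4004, 2003, 2003)),   # index c2 (c2, c1)
]
estimate = intersect_estimate(4004, scans)
print(round(estimate, 2))
```

With exact counts the model yields roughly 1002 rows, in the same range as the `rows: 1001` shown by EXPLAIN, although the server works from sampled InnoDB statistics, so the match should be read as suggestive only. Note that for the second scan only the `c2` part contributes, because `c1` is already covered by the first index. The estimate is also far above the 2 rows actually returned: the independence assumption in option b) of the comment does not hold for data in which `c2` and `c3` are this strongly correlated.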

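BTW, the optimizer trace in 5.6 is very handy. If you want to follow what the optimizer does internally, first print the detailed plan-generation steps with the optimizer trace, then match them against the optimization flow; that makes it much easier to locate the relevant code.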
