Twilio incident and Redis

Twilio just released a post mortem about an incident that caused issues with the billing system:

http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

The problem was about a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slaves setup, with multiple slaves in different data centers for obvious availability and data safety concerns.

This is a short analysis of the incident, and of what Twilio and Redis can each do to avoid this kind of issue.

The first observation is that Twilio uses Redis, an in-memory system, to store balances, so everybody will say "WTF Twilio! Are you serious with your data?". Actually Redis uses memory to serve data and to internally manipulate its data structures, but the incident has *nothing to do* with the durability of Redis as a DB. In fact Twilio stated that they are using the append only file, which can be a very durable solution, as explained here: http://oldblog.antirez.com/post/redis-persistence-demystified.html

The incident is actually centered around two main aspects of Redis:

1) The replication system.
2) The configuration.

I'll address these two things in turn.

Analysis of the replication issue
===

Redis 2.6 always needs a full resynchronization between a master and a slave after a connection issue between the two.
Redis 2.8 addressed this problem, but is currently a release candidate, so Twilio had no way to use the new feature called "partial resynchronization".

Apparently the master became unavailable because many slaves tried to resynchronize at the same time.

Actually, because of the way Redis works, a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. When the second slave attaches and there is already a background save in progress to create the first RDB (used for the bulk data transfer), it is put in a queue with the previous slave, and so forth for all the other slaves attaching. Redis will just produce a single RDB file.
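The queueing behavior described above can be pictured with a small toy model. This is only a sketch of the idea, not Redis source code: the class `ToyMaster`, its methods, and the slave names are all hypothetical and exist just for illustration.

```python
class ToyMaster:
    """Toy model of the queueing described above (hypothetical, not Redis code)."""

    def __init__(self):
        self.bgsave_in_progress = False
        self.waiting_slaves = []    # slaves waiting for the RDB being produced
        self.rdb_files_created = 0

    def slave_attaches(self, name):
        # Only the first slave triggers a background save. Later slaves
        # simply join the queue for the RDB that is already being created.
        if not self.bgsave_in_progress:
            self.bgsave_in_progress = True
            self.rdb_files_created += 1
        self.waiting_slaves.append(name)

    def bgsave_done(self):
        # The single finished RDB is handed to every queued slave.
        served, self.waiting_slaves = self.waiting_slaves, []
        self.bgsave_in_progress = False
        return served


master = ToyMaster()
for name in ["slave-a", "slave-b", "slave-c", "slave-d"]:
    master.slave_attaches(name)

served = master.bgsave_done()
print(master.rdb_files_created, served)
```

In this model four slaves attaching during the same background save cost one RDB, which is the point of the paragraph above: the number of slaves does not multiply the disk work.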

However what is true is that Redis may use additional memory with many slaves attaching at the same time, since each slave has its own output buffer that "records" the new writes to transfer once the RDB file is ready. This is true especially in the case of replication over WAN. In the Twilio blog post I read "multiple data centers", so it is possible that the replication process may be slow in some cases.
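To get a feeling for how this memory can add up, here is a back-of-the-envelope sketch. It assumes every slave buffers its own full copy of the writes received while the RDB is created and transferred, so it is a rough upper bound and not a description of Redis internals. The function name and all the numbers are hypothetical; the 50 bytes per write and the 60 seconds in particular are just guesses for illustration.

```python
def estimate_output_buffer_bytes(slaves, writes_per_sec, bytes_per_write, seconds_buffered):
    """Rough upper bound: every slave buffers every write for the whole period."""
    return slaves * writes_per_sec * bytes_per_write * seconds_buffered


# Hypothetical numbers: 4 slaves, 70k writes/sec, 50 bytes per write,
# 60 seconds needed to create and transfer the RDB.
total = estimate_output_buffer_bytes(
    slaves=4, writes_per_sec=70_000, bytes_per_write=50, seconds_buffered=60
)
print(f"{total / 1e6:.1f} MB")
```

The estimate grows linearly with the time needed to ship the RDB, which is why a slow WAN link makes the memory side of the problem worse.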

The bottom line is, Redis normally does not need to go slow when multiple slaves are resynchronizing at the same time, unless something strange happens like hitting the memory limit of the server, with the master starting to swap and/or problems with very slow disks (probably EC2?) so that creating an RDB starts to mess with the ability to write to the AOF file.

However issues writing to the AOF are a bit unlikely to be the cause, since during the AOF rewrite there is the same kind of disk I/O stress, with one thread writing a lot of data to the new AOF, and the other (main) thread logging every new write to the AOF. Everything considered, memory pressure seems more probable, but Twilio engineers can just comment with details about what happened; this will be a useful real-world data point for sure.

From the Twilio side, what can be done to minimize incidents is to understand exactly why the master, with the current architecture, is not able to survive many slaves resynchronizing without a serious loss of performance.

From the Redis side, well, we should have done our homework and provided partial resynchronization a *long time ago*. We finally have it in Redis 2.8, and it is a very good thing that a few days ago I pushed the 2.8 release forward, postponing all the other pending features to the next release. Now we have the first release candidate, and in a few weeks this should be a release in the hands of users.

The configuration
===

The other obvious problem, probably the biggest one, was restarting the master with the wrong configuration.

Again I think there was a human error here that was "helped" by a less than perfect Redis mechanism.

Basically up to Redis 2.6 you had CONFIG SET to change the configuration by hand, so it was possible for example to switch the system from RDB to AOF for more data safety with just:

redis-cli CONFIG SET appendonly yes

However you had to change the configuration file manually in order to ensure that the change will affect the instance after the next restart. Otherwise the change is only in the current in memory configuration and a restart will bring you back to the old config.

Maybe this was not the case, but it is not unlikely that Twilio engineers modified the wrong redis.conf file or forgot to do it in some way.
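One way to catch this kind of mistake before a restart is to compare the configuration file with the configuration the instance is actually running. The sketch below is only an illustration: the helper names are hypothetical, the running configuration is passed in as a plain dictionary (in practice it would come from CONFIG GET), and the parser is simplified so that a directive appearing multiple times keeps only its last value.

```python
def parse_redis_conf(text):
    """Parse 'directive value' lines, skipping comments and blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # Simplification: a repeated directive keeps only its last value.
        config[parts[0].lower()] = parts[1].strip() if len(parts) > 1 else ""
    return config


def config_drift(file_text, running):
    """Return {directive: (value_on_disk, value_in_memory)} for every mismatch."""
    on_disk = parse_redis_conf(file_text)
    return {
        name: (on_disk.get(name), value)
        for name, value in running.items()
        if on_disk.get(name) != value
    }


# Hypothetical example: AOF was enabled with CONFIG SET, the file was not edited.
redis_conf = """
# persistence
appendonly no
appendfsync everysec
"""
running_config = {"appendonly": "yes", "appendfsync": "everysec"}

drift = config_drift(redis_conf, running_config)
print(drift)
```

Any entry in the result is a setting that would silently revert at the next restart.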

Fortunately Redis 2.8 provides a better workflow for on-the-fly configuration changes, that is:

redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE

Basically the config rewriting feature updates the configuration file currently in use so that it contains the changes made via CONFIG SET, which is definitely safer.

In the end
===

I'll be happy to work with the Twilio engineers in the coming weeks in order to understand the details and their requests, and to see how Redis can be improved to make incidents like this less likely to happen.

A real world test
===

I just tried to set up a master with AOF enabled, rotating disks, and a huge write load. The only catch is that it is bare metal, entry-level hardware.

Then I put a steady load on it of 70k writes per second across 10 million keys.

Finally I tried to mass-resync four slaves from scratch multiple times.

Results:

$ redis-cli -h 192.168.1.10 --latency-history
min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
min: 0, max: 5, avg: 0.66 (1287 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.62 (1290 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1306 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1310 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.45 (1311 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1305 samples) -- 15.01 seconds range
min: 0, max: 23, avg: 0.49 (1306 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 36, avg: 0.86 (1255 samples) -- 15.00 seconds range
min: 0, max: 6, avg: 1.05 (1246 samples) -- 15.01 seconds range
min: 0, max: 21, avg: 0.52 (619 samples)^C

As you can see there is no moment in which the server struggles with this load. During the test the load continued to be accepted at the rate of 70k writes/sec.
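To summarize output like the above without reading it line by line, a small parser is enough. This is a minimal sketch that assumes the line format shown in the results (latencies are reported by redis-cli in milliseconds); the function name `summarize_latency` is hypothetical, and the sample below is just the first three lines of the output above.

```python
import re

# Matches lines such as:
# min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
LINE_RE = re.compile(r"min: (\d+), max: (\d+), avg: ([\d.]+) \((\d+) samples\)")


def summarize_latency(output):
    """Return (worst max, sample-weighted average) over all parsed lines."""
    worst_max = 0
    weighted_sum = 0.0
    total_samples = 0
    for match in LINE_RE.finditer(output):
        _min_ms, max_ms, avg_ms, samples = match.groups()
        worst_max = max(worst_max, int(max_ms))
        weighted_sum += float(avg_ms) * int(samples)
        total_samples += int(samples)
    if total_samples == 0:
        raise ValueError("no latency lines found")
    return worst_max, weighted_sum / total_samples


sample = """\
min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
min: 0, max: 5, avg: 0.66 (1287 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.62 (1290 samples) -- 15.00 seconds range
"""

worst, average = summarize_latency(sample)
print(worst, round(average, 3))
```

The average is weighted by the number of samples in each range, so a short final range (like the one interrupted with ^C) does not count as much as a full 15 second one.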

This test is in no way able to simulate the Twilio architecture, but the bottom line here is that Redis is supposed to handle this well with minimally capable hardware. So either something odd happened, or there was a low memory condition, or there was the "EC2 effect", that is, some very poor disk performance that led to memory pressure.