一、基本环境信息
select version();
+---------------------+
| version() |
+---------------------+
| 10.0.20-MariaDB-log |
+---------------------+
1 row in set (0.00 sec)
uname -a
Linux MCBL051 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
select @@tx_isolation;
+----------------+
| @@tx_isolation |
+----------------+
| READ-COMMITTED |
+----------------+
1 row in set (0.00 sec)
二、问题描述
应用人员反应某系统生产环境,有一支SQL跑了很久,被kill了,但一直没有kill掉,后面多支类似SQL执行时被前面这支sql堵住,请dba支援。
三、问题分析
从processlist来查看,TIME为52113秒,SQL执行大约14小时没有完成,而手工kill,无法释放线程资源,而堵住后面执行的多个类似SQL,应用人员将所有内外应用停了后,也没结束被kill的线程。下面显示SQL的command一直处于killed,没有真正完成。
show processlist\G
*************************** 3. row ***************************
Id: 38023534
User: lotsprd
Host: 172.16.12.155:44805
db: lots
Command: Killed
Time: 52113
State: Sending data
Info: SELECT DISTINCT
o.CN,
o.WNE,
o.CON,
o.ON
Progress: 0.000
查看事务信息,trx_mysql_thread_id为38023534的事务正处于RUNNING,对应同一支SQL,没有其它事务了,说明这个线程没有被其它线程锁住,事务中的表也没有上锁。
select * from innodb_trx\G
*************************** 1. row ***************************
trx_id: 636733127
trx_state: RUNNING
trx_started: 2016-05-13 20:57:54
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 38023534
trx_query: SELECT DISTINCT
o.CN,
o.WNE,
o.CON,
o.ON,
o.A_NAME,
o.A_NO,
o.CREATE_DATE receipt_time,
o.delivery_date delivery_time
FROM t o
WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23'
AND o.business_type = 'B2C'
AND o.is_delivery = 1
and o.order_type = 'PO'
AND o.WNe = 'xxx小电仓'
AND o.A_CODE NOT IN (38549, 44926)
AND o.A_NO != ''
AND o.is_got = 0
AND o.IS_SIGN = 0
AND NOT EXISTS (
SELECT id FROM t_carrier
WHERE ON = o.ON
AND remark NOT IN (
'开始处理',
'待配货',
'开始处理',
'信息审核通过',
'已通知配货',
'待配货',
'您的包裹已打包'
)
)
ORDER BY o.WNE, o.CREATE_DATE limit 20
trx_operation_state: NULL
trx_tables_in_use: 2
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 360
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 2256
trx_isolation_level: READ COMMITTED
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 10000
trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
查看show engine innodb status的事务部分信息,发现跑该SQL的事务也没有被锁住,这样看来,这个应该就只是一个查询SQL,但select这种为什么会记录innodb_trx表中,有待进一步验证
。
------------
TRANSACTIONS
------------
Trx id counter 637134193
Purge done for trx's n:o <636733130 undo n:o <0 state: running but idle
History list length 84354
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 38193498, OS thread handle 0x2b73c5442700, query id 2807386390 localhost dba init
show engine innodb status
---TRANSACTION 0, not started
MySQL thread id 38168362, OS thread handle 0x2b739680a700, query id 2807382298 10.63.32.108 lotsprd cleaning up
---TRANSACTION 3843, not started
MySQL thread id 1, OS thread handle 0x2b7396253700, query id 0 Waiting for background binlog tasks
---TRANSACTION 636733127, ACTIVE 63973 sec, thread declared inside InnoDB 2256
mysql tables in use 2, locked 0
MySQL thread id 38023534, OS thread handle 0x2b7396e99700, query id 2805177701 172.16.12.155 lotsprd Sending data
SELECT DISTINCT
o.CN,
o.WNE,
o.CON,
o.ON,
o.A_NAME,
o.A_NO,
o.CREATE_DATE receipt_time,
o.delivery_date delivery_time
FROM t o
WHERE o.CREATE_DATE BETWEEN '2016-05-02 20:39:23' AND '2016-05-12 17:00:23'
AND o.business_type = 'B2C'
AND o.is_delivery = 1
and o.order_type = 'PO'
AND o.WNe = 'xxx小电仓'
AND o.A_CODE NOT IN (38549, 44926)
AND o.A_NO != ''
AND o.is_got = 0
AND o.IS_SIGN = 0
AND NOT EXISTS (
SELECT id FROM t_carrier
Trx read view will not see trx with id >= 636733128, sees <636709213
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
又查看主库processlist情况,显示已将所有binlog发到了salve上:
-----+
| Id | User | Host | db | Command | Time | State | Info
| Progress |
+----------+-----------------+---------------------+------+-------------+----------+-----------------------------------------------------------------------
+------------------------------------------------------------------------------------------------------+----------+
| 2 | event_scheduler | localhost | NULL | Daemon | 18320596 | Waiting on empty queue | NULL
| 0.000 |
| 12575445 | rep | 10.16.19.88:30630 | NULL | Binlog Dump | 12434220 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL
| 0.000 |
| 38023534 | lotsprd | 172.16.12.155:44805 | lots | Killed | 65293 | Sending data | SELECT
DISTINCT
o.CN,
o.WNE,
o.CON,
o.ON | 0.000 |
| 38168361 | lotsprd | 10.63.32.108:10310 | NULL | Sleep | 15127 | | NULL
| 0.000 |
| 38168362 | lotsprd | 10.63.32.108:10311 | lots | Sleep | 2185 | | NULL
| 0.000 |
| 38171800 | lotsprd | 10.63.32.108:11787 | lots | Sleep | 665 | | NULL
| 0.000 |
| 38189047 | lotsprd | 10.63.32.108:5627 | lots | Sleep | 1412 | | NULL
| 0.000 |
| 38200552 | dba | localhost | NULL | Query | 0 | init | show
processlist | 0.000 |
+----------+-----------------+---------------------+------+-------------+----------+-----------------------------------------------------------------------
+------------------------------------------------------------------------------------------------------+-----
从库状态显示主从数据是同步的:
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.16.19.87
Master_User: rep
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000267
Read_Master_Log_Pos: 476541394
Relay_Log_File: relay-bin.000819
Relay_Log_Pos: 476541681
Relay_Master_Log_File: mysql-bin.000267
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 476541394
Relay_Log_Space: 476542015
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 19873306
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
1 row in set (0.00 sec)
四、问题处理
从网上查找资料,显示该问题只能重启mysql或是进行主从切换,由于上面查看相关信息有确认主从库的数据是同步的,而在应用人员已将所有应用都停了情况下,应用人员同意重启mysql
,用msyqladmin shutdown执行关闭时,一直hang住很久,查看错误日志显示如下:
160514 15:10:16 [Note] /apps/svr/mariadb10/bin/mysqld: Normal shutdown
160514 15:10:16 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
160514 15:10:16 [Note] Event Scheduler: Waiting for the scheduler thread to reply
160514 15:10:16 [Note] Event Scheduler: Stopped
160514 15:10:16 [Note] Event Scheduler: Purging the queue. 0 events
160514 15:10:36 [Warning] /apps/svr/mariadb10/bin/mysqld: Forcing close of thread 38023534 user: 'lotsprd'
160514 15:10:36 [ERROR] mysqld: Got an error writing communication packets
shutdow 已hang住,此时主库也无法再提供对外服务,前面主从数据已确认同步,将从库停止slave。现在只能对mysql进程进行kill操作了,如下:
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps 2584 1 0 2015 ? 00:00:00 /bin/sh /apps/svr/mariadb10/bin/mysqld_safe --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf
apps 3121 2584 17 2015 ? 36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 --
datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid-
file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306
apps 16651 22150 0 15:26 pts/0 00:00:00 grep mysql
[apps@MCBL051 bin]$ kill -9 2584
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps 3121 1 17 2015 ? 36-11:01:53 /apps/svr/mariadb10/bin/mysqld --defaults-file=/apps/conf/mysql/mariadb10_3306.cnf --basedir=/apps/svr/mariadb10 --
datadir=/apps/dbdat/mariadb10_data3306 --plugin-dir=/apps/svr/mariadb10/lib/plugin --log-error=/apps/logs/mysql/error3306.log --open-files-limit=8192 --pid-
file=/apps/dbdat/mariadb10_data3306/mysql.pid --socket=/tmp/mysql3306.sock --port=3306
apps 17837 22150 0 15:27 pts/0 00:00:00 grep mysql
[apps@MCBL051 bin]$ kill -9 3121
[apps@MCBL051 bin]$ ps -ef|grep mysql
apps 17947 22150 0 15:27 pts/0 00:00:00 grep mysql
再启动mysql主库正常,启动从库slave正常,检查主从数据同步正常。