My system is a Tomcat 7 server running on Ubuntu that talks to a MongoDB cluster running on CentOS. We have this set up on AWS and it works fine.
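I recently brought up exactly the same setup on Azure, and we get persistent, seemingly random timeouts when the Tomcat application tries to query MongoDB. A typical error is: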
    Jan 31 08:13:54 catalina.out: Jan 31, 2014 4:14:09 PM com.mongodb.DBPortPool gotError
    Jan 31 08:13:54 catalina.out: WARNING: emptying DBPortPool to xxx.cloudapp.net/xxx.xxx.xxx.xxx:21191 b/c of error
    Jan 31 08:13:54 catalina.out: java.net.SocketException: Connection timed out
    Jan 31 08:13:54 catalina.out:   at java.net.SocketInputStream.socketRead0(Native Method)
    Jan 31 08:13:54 catalina.out:   at java.net.SocketInputStream.read(SocketInputStream.java:146)
    Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:46)
    Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:33)
    Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:28)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.Response.<init>(Response.java:40)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBPort.go(DBPort.java:142)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBPort.call(DBPort.java:92)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DB.command(DB.java:262)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DB.command(DB.java:244)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:985)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:956)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:931)
    Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.count(DBCollection.java:878)
    Jan 31 08:13:54 catalina.out:   at com.eweware.service.base.store.impl.mongo.dao.BaseDAOImpl._exists(BaseDAOImpl.java:788)
    Jan 31 08:13:54 catalina.out:   at com.eweware.service.base.store.impl.mongo.dao.GroupDAOImpl._exists(GroupDAOImpl.java:18)
I'm using Java driver 2.11.4 and initializing it as follows:
    builder.autoConnectRetry(true)
           .connectionsPerHost(10)
           .writeConcern(WriteConcern.FSYNCED)
           .connectTimeout(30000)
           .socketKeepAlive(true);
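For context, socketKeepAlive(true) only asks the operating system to turn on TCP keepalive (SO_KEEPALIVE) for the driver's sockets; how long a connection must sit idle before the first probe goes out is decided by the kernel, and connectTimeout only bounds connection establishment. The sketch below is a stdlib-only illustration with a plain java.net.Socket, not the driver's implementation; KeepAliveSocketSketch and its methods are hypothetical names, and the read timeout is a parameter that the configuration above does not set.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketException;

/** Stdlib illustration of socket-level options; not the MongoDB driver's own code. */
final class KeepAliveSocketSketch {

    /** Creates an unconnected socket with the given keepalive flag and read timeout. */
    static Socket newSocket(boolean keepAlive, int readTimeoutMillis) throws SocketException {
        Socket socket = new Socket();
        // SO_KEEPALIVE: the kernel probes idle connections. On Linux the idle time before
        // the first probe is net.ipv4.tcp_keepalive_time (7200 s by default), not a Java setting.
        socket.setKeepAlive(keepAlive);
        // SO_TIMEOUT: maximum time a blocking read waits; 0 means wait forever.
        socket.setSoTimeout(readTimeoutMillis);
        return socket;
    }

    /** Opens a connection, bounding only the connect phase (compare connectTimeout(30000)). */
    static Socket connect(String host, int port, int connectTimeoutMillis, int readTimeoutMillis)
            throws IOException {
        Socket socket = newSocket(true, readTimeoutMillis);
        try {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the descriptor on failure
            throw e;
        }
    }
}
```

Because the probe timing lives in the kernel, no Java-side option in this setup changes how soon an idle pooled connection is probed.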
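Reading around the internet, I have found some material suggesting this is an Azure issue, along with some suggestions for C#, but nothing on how to fix it from Java.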
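More details:

- It happens whether MongoDB is a single node or a replica set.
- It happens whether or not the server is under load. In fact, the server seems to behave better under some load than from a cold start, but it still times out even under constant load.
- The timeouts seem random: there is no discernible pattern in which calls fail or when. Sometimes it runs for an hour without a problem; sometimes every call fails.
- If I retry a call after a timeout error, it eventually works. Sometimes it takes more than 100 retries, sometimes only one.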
Here is the retry code I'm trying:
    private DBObject findOneRetry(DBObject criteria, DBObject fields, DBCollection collection) throws SystemErrorException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return collection.findOne(criteria, fields); // the SocketException is raised in here
            } catch (Exception e) {
                if (attempt == MAX_RETRIES) {
                    // Out of attempts: report the last error instead of silently returning null.
                    throw new SystemErrorException(makeErrorMessage("findOneRetry", "find", attempt, e, null), e, ErrorCodes.SERVER_DB_ERROR);
                }
                logger.warning(getClass().getName() + ": findOneRetry failed and will retry in attempt #" + (attempt + 1) + " in collection " + _getCollection());
            }
        }
        return null; // only reached if MAX_RETRIES < 1
    }
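The same loop can be factored into a small generic helper. This is a hypothetical stdlib-only sketch (RetrySketch.withRetries and its parameters are illustrative names, not part of the driver): it makes at most maxAttempts calls, sleeps between failures with an exponential backoff capped at maxDelayMillis, and rethrows the last failure rather than returning null.

```java
import java.util.concurrent.Callable;

/** Hypothetical generic retry helper; not part of the MongoDB driver. */
final class RetrySketch {

    /**
     * Calls {@code action} up to {@code maxAttempts} times, sleeping between failures with
     * exponential backoff capped at {@code maxDelayMillis}, and rethrows the last failure.
     */
    static <T> T withRetries(Callable<T> action, int maxAttempts,
                             long initialDelayMillis, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last error to the caller
                }
                Thread.sleep(delay);                          // wait before the next attempt
                delay = Math.min(delay * 2, maxDelayMillis); // double the wait, up to the cap
            }
        }
    }
}
```

With Java 8 or later, the question's call could then be written as `RetrySketch.withRetries(() -> collection.findOne(criteria, fields), MAX_RETRIES, 100, 5_000)`, where the delay values are arbitrary examples. The backoff keeps a burst of retries from hammering a server whose connections are being dropped.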
Any suggestions on how to fix this?

Thanks in advance!
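This is caused by the high tcp_keepalive_time on the CentOS servers. The Linux default is 7200 seconds (2 hours), while Azure silently drops TCP connections that stay idle longer than its idle timeout (4 minutes by default on the Azure load balancer), so idle pooled connections are cut before the first keepalive probe is ever sent.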
On your MongoS servers, run `sudo nano /proc/sys/net/ipv4/tcp_keepalive_time` and change 7200 to 60.
Restart the Azure instance.
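If you want an application to confirm which value a Linux host is actually using (for example, logging it at Tomcat startup), a small stdlib sketch can read the same /proc file and compare it with a threshold. KeepaliveCheck is a hypothetical name, and the 240-second threshold used below is an assumption that matches the Azure load balancer's default 4-minute idle timeout; the /proc file exists only on Linux.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical startup check for the Linux TCP keepalive idle time. */
final class KeepaliveCheck {

    static final Path PROC_FILE = Paths.get("/proc/sys/net/ipv4/tcp_keepalive_time");

    /** Parses the file content, e.g. "7200\n", into seconds. */
    static int parseSeconds(String content) {
        return Integer.parseInt(content.trim());
    }

    /** True if the first keepalive probe goes out before an idle connection would be dropped. */
    static boolean probesBeforeIdleTimeout(int keepaliveSeconds, int idleTimeoutSeconds) {
        return keepaliveSeconds < idleTimeoutSeconds;
    }

    /** Reads the current value; throws IOException where the file does not exist (non-Linux). */
    static int readKeepaliveSeconds() throws IOException {
        String content = new String(Files.readAllBytes(PROC_FILE), StandardCharsets.UTF_8);
        return parseSeconds(content);
    }
}
```

With the stock value, `probesBeforeIdleTimeout(7200, 240)` is false; after the change to 60, `probesBeforeIdleTimeout(60, 240)` is true.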
Update: to make sure your VM always keeps this value for tcp_keepalive_time, add this line:

    bash -c 'echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time'

to:

    /etc/rc.d/rc.local
Update to the update:

Most Linux distributions have an /etc/sysctl.d/ directory. Create a file, for example mongo.conf, containing:

    net.ipv4.tcp_keepalive_time = 60
Put that file in that directory and run (as root):

    sudo sysctl -p /etc/sysctl.d/mongo.conf
Verify the change:

    sysctl net.ipv4.tcp_keepalive_time
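This setting persists across reboots on the RedHat- and Debian-based systems I have come across. Make sure to check /etc/sysctl.conf and any other files in /etc/sysctl.d to see whether the variable is already set to something else, and change it accordingly.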
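That check can also be scripted. The following is a hypothetical stdlib sketch (SysctlScan is an illustrative name) that lists every non-comment assignment to net.ipv4.tcp_keepalive_time in /etc/sysctl.conf and in the *.conf files under /etc/sysctl.d. It only reports what it finds; it does not try to work out which file takes precedence, and it matches only the plain `key = value` form shown above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: finds assignments to one sysctl key in config files. */
final class SysctlScan {

    static final String KEY = "net.ipv4.tcp_keepalive_time";

    /** Returns "file: value" for each line in {@code file} that sets {@code key}. */
    static List<String> findSettings(Path file, String key) throws IOException {
        List<String> hits = new ArrayList<>();
        for (String raw : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String line = raw.trim();
            // Skip blank lines and comments ('#' or ';').
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq > 0 && line.substring(0, eq).trim().equals(key)) {
                hits.add(file + ": " + line.substring(eq + 1).trim());
            }
        }
        return hits;
    }

    /** Scans the main config file (if present), then every *.conf file in the directory, sorted by name. */
    static List<String> scan(Path sysctlConf, Path sysctlDir, String key) throws IOException {
        List<String> hits = new ArrayList<>();
        if (Files.isRegularFile(sysctlConf)) {
            hits.addAll(findSettings(sysctlConf, key));
        }
        if (Files.isDirectory(sysctlDir)) {
            List<Path> files = new ArrayList<>();
            try (DirectoryStream<Path> confs = Files.newDirectoryStream(sysctlDir, "*.conf")) {
                for (Path p : confs) {
                    files.add(p);
                }
            }
            Collections.sort(files); // deterministic output order
            for (Path p : files) {
                hits.addAll(findSettings(p, key));
            }
        }
        return hits;
    }
}
```

On a real host it would be called as `SysctlScan.scan(Paths.get("/etc/sysctl.conf"), Paths.get("/etc/sysctl.d"), SysctlScan.KEY)`; any entry other than the 60 you just set points at a file to edit.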