Author: 乔志造型店长阿杰 | Source: Internet | 2022-12-09 10:54
We are currently running Hortonworks 2.6.5.0:
$ hadoop version
Hadoop 2.7.3.2.6.5.0-292
Subversion git@github.com:hortonworks/hadoop.git -r 3091053c59a62c82d82c9f778c48bde5ef0a89a1
Compiled by jenkins on 2018-05-11T07:53Z
Compiled with protoc 2.5.0
From source with checksum abed71da5bc89062f6f6711179f2058
This command was run using /usr/hdp/2.6.5.0-292/hadoop/hadoop-common-2.7.3.2.6.5.0-292.jar
The operating system is CentOS 7:
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
We recently started noticing these problems in the ambari-agent log files:
$ grep -iE "error|warn" /var/log/ambari-agent/*
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:03:50,982 NetUtil.py:124 - Server at https://hbase26-2.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:00,986 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:00,990 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:04:00,990 NetUtil.py:124 - Server at https://hbase26-2.aa.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:10,993 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:10,994 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:04:10,994 NetUtil.py:124 - Server at https://hbase26-2.aa.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:20,996 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:20,997 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
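Once these started appearing, we could no longer manage any aspect of the Hadoop cluster through Ambari. Every service showed a yellow question mark and reported "Heartbeat lost".
Multiple restarts did not let us recover Ambari and regain control of the cluster.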
1> slm..:
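The root cause turned out to be a TLS protocol mismatch: the server could not handle TLSv1.1 when the agent tried to connect to the CA service on port 8440.
We confirmed that the service was in fact running: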
$ netstat -tapn|grep 8440
tcp 0 0 0.0.0.0:8440 0.0.0.0:* LISTEN 1203/java
But curl would fail unless we disabled the TLS checks with the --insecure switch. This was our first clue that the problem was TLS-related.
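To narrow a clue like this down to a specific protocol version, one option is to attempt a handshake with exactly one TLS version at a time. The following is a minimal sketch, not part of the original troubleshooting: the helper names `pinned_context` and `probe` are hypothetical, and it assumes Python 3.7+ on the machine doing the probing (not the Python 2.7 that ships with CentOS 7). Certificate verification is switched off on purpose, mirroring `curl --insecure`, so only the protocol negotiation is being tested.

```python
import socket
import ssl


def pinned_context(version):
    """Build a client context that offers exactly one TLS version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Mirror `curl --insecure`: skip certificate checks, test the protocol only.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.maximum_version = version
    ctx.minimum_version = version
    return ctx


def probe(host, port, version, timeout=5.0):
    """Return the negotiated protocol name, or None if the handshake fails."""
    try:
        ctx = pinned_context(version)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except (OSError, ValueError):
        # OSError covers ssl.SSLError and connection failures;
        # ValueError covers a version the local OpenSSL build does not support.
        return None
```

For example, calling `probe("hbase26-2.aa.mydom.com", 8440, ssl.TLSVersion.TLSv1_2)` and then the same with `ssl.TLSVersion.TLSv1_1` would show which versions complete a handshake. A `None` for TLSv1.1 is not conclusive on its own, because a recent client-side OpenSSL may refuse that version before the server is ever asked.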
Further investigation led us to NetUtil.py (part of Ambari), which appeared to be fine. Other leads included this:
$ cat /etc/ambari-agent/conf/ambari-agent.ini
...
[security]
ssl_verify_cert = 0
...
And this:
$ grep -E '\[https|verify' /etc/python/cert-verification.cfg
[https]
#verify=platform_default
verify=disable
None of these worked. What finally did work was forcing ambari-agent to use TLSv1.2 instead of TLSv1.1:
$ grep -E "\[security|force" /etc/ambari-agent/conf/ambari-agent.ini
[security]
force_https_protocol=PROTOCOL_TLSv1_2
Then restart the agent with ambari-agent restart.
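If the same change has to be rolled out to many agents, the edit to the `[security]` section can be scripted. This is a rough sketch under stated assumptions: the function name `force_tls12` is hypothetical, it assumes ambari-agent.ini parses as a plain INI file with no duplicate keys, and rewriting through `configparser` drops any comments in the file, so a backup copy is advisable before writing the result back.

```python
import configparser
import io


def force_tls12(ini_text):
    """Return the INI text with force_https_protocol set in [security]."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("security"):
        parser.add_section("security")
    # Same setting as shown in the grep output above.
    parser.set("security", "force_https_protocol", "PROTOCOL_TLSv1_2")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

The function works on text rather than on a path so it can be tried against a copy of the file first; the agent still needs a restart afterwards for the setting to take effect.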
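I was able to piece all of this together from various hints scattered across the internet. I'm posting it here in the hope that it helps any other poor soul who runs into this on their Hadoop/Hortonworks cluster.
References
Ambari agent - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
Java/Python updates and Ambari agent TLS settings
Openssl error upon host registration
Cleaning up Ambari Metrics System data
Why did this happen?
Digging into this further, I found an article titled: Disabling TLSv1 and TLS1.1 - Enable TLSv1.2. With those older protocols disabled, you must now configure the Ambari Agent to use TLSv1.2.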
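Since the fix depends on TLSv1.2 being available to the agent's Python interpreter, it can be worth checking that first. The sketch below assumes that the `force_https_protocol` value is looked up as a constant name in Python's `ssl` module; that is an assumption about the agent's behaviour, not something stated in this post, and `resolve_protocol` is a hypothetical helper.

```python
import ssl


def resolve_protocol(name):
    """Return the ssl module constant with this name, or None if it is missing."""
    return getattr(ssl, name, None)


# A None result would suggest this interpreter's ssl module lacks TLSv1.2 support.
print(resolve_protocol("PROTOCOL_TLSv1_2") is not None)
```

Running this with the same interpreter the agent uses gives a quick indication of whether the configured protocol name can be honoured at all.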