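CPU: Intel Xeon E5-2640 v3, 2.6 GHz, 8 cores, x2
Memory: 8 GB DDR4-2133 RDIMM x 32, 256 GB total
Disk 1: 1.2 TB 10K RPM SAS, data disks, x24
Disk 2: 600 GB 10K RPM SAS, system disks, x2
RAID controller: 2 GB cache
NIC: 2 x 10GE (SFP+), OEM original
OS: SUSE 11 SP4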
Linux hebda_data_33 3.0.101-77-default #1 SMP Tue Jun 14 20:33:58 UTC 2016 (a082ea6) x86_64 x86_64 x86_64 GNU/Linux
Uplink switch: Huawei 12812
NIC information:
ethtool -i p4p2
driver: bnx2x
version: 1.710.51-0
firmware-version: FFV08.07.25 bc 7.13.54
bus-info: 0000:83:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
hebda_data_33:~ # ethtool -i em1
driver: bnx2x
version: 1.710.51-0
firmware-version: FFV08.07.25 bc 7.13.54
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
hebda_data_33:~ # lspci -s 0000:83:00.1 -vvv
83:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
        Subsystem: Broadcom Corporation Device 1006
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
        Latency: 0
        Interrupt: pin B routed to IRQ 60
        Region 0: Memory at c8000000 (64-bit, prefetchable) [size=8M]
        Region 2: Memory at c8800000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at ca000000 (64-bit, prefetchable) [size=64K]
        Expansion ROM at ca500000 [disabled] [size=512K]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
                Not readable
        Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00001000
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <2us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk: RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [13c v1] Device Serial Number f4-e9-d4-ff-fe-9d-ba-10
        Capabilities: [150 v1] Power Budgeting >
        Capabilities: [160 v1] Virtual Channel
                Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb: Fixed- WRR32- WRR64- WRR128-
                Ctrl: ArbSelect=Fixed
                Status: InProgress-
                VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [220 v1] #15
        Kernel driver in use: bnx2x
        Kernel modules: bnx2x
hebda_data_33:~ # lspci -s 0000:01:00.0 -vvv
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)
        Subsystem: Dell BCM57800 10-Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
        Latency: 0
        Interrupt: pin A routed to IRQ 40
        Region 0: Memory at 95000000 (64-bit, prefetchable) [size=8M]
        Region 2: Memory at 95800000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at 96030000 (64-bit, prefetchable) [size=64K]
        Expansion ROM at 96080000 [disabled] [size=512K]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=8 DScale=1 PME-
        Capabilities: [50] Vital Product Data
                Not readable
        Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00001000
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <2us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk: RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [13c v1] Device Serial Number 18-66-da-ff-fe-65-77-0b
        Capabilities: [150 v1] Power Budgeting >
        Capabilities: [160 v1] Virtual Channel
                Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb: Fixed- WRR32- WRR64- WRR128-
                Ctrl: ArbSelect=Fixed
                Status: InProgress-
                VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [220 v1] #15
        Capabilities: [300 v1] #19
        Kernel driver in use: bnx2x
        Kernel modules: bnx2x
hebda_data_33:~ # ethtool -S p4p2|grep dis
     [0]: rx_discards: 79516
     [0]: rx_phy_ip_err_discards: 0
     [0]: rx_skb_alloc_discard: 28517
     [1]: rx_discards: 88484
     [1]: rx_phy_ip_err_discards: 0
     [1]: rx_skb_alloc_discard: 27102
     [2]: rx_discards: 13667973
     [2]: rx_phy_ip_err_discards: 0
     [2]: rx_skb_alloc_discard: 35207
     [3]: rx_discards: 33056205
     [3]: rx_phy_ip_err_discards: 0
     [3]: rx_skb_alloc_discard: 33533
     [4]: rx_discards: 13263091
     [4]: rx_phy_ip_err_discards: 0
     [4]: rx_skb_alloc_discard: 34748
     [5]: rx_discards: 7583294
     [5]: rx_phy_ip_err_discards: 0
     [5]: rx_skb_alloc_discard: 32756
     [6]: rx_discards: 3703892
     [6]: rx_phy_ip_err_discards: 0
     [6]: rx_skb_alloc_discard: 28380
     [7]: rx_discards: 31746726
     [7]: rx_phy_ip_err_discards: 0
     [7]: rx_skb_alloc_discard: 32609
     rx_discards: 103189181
     rx_mf_tag_discard: 0
     rx_brb_discard: 90068
     rx_phy_ip_err_discards: 0
     rx_skb_alloc_discard: 252852
(no other errors)
hebda_data_23:~ # for i in `seq 1 10`; do ifconfig p4p2 | grep RX | grep overruns; sleep 1; done
          RX packets:253639505018 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639552428 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639566818 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639585722 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639597202 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639610209 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639622800 errors:305619311 dropped:0 overruns:305375168 frame:244143
          RX packets:253639642350 errors:305620450 dropped:0 overruns:305376307 frame:244143
          RX packets:253639675509 errors:305620450 dropped:0 overruns:305376307 frame:244143
          RX packets:253639723772 errors:305620471 dropped:0 overruns:305376328 frame:244143
hebda_data_23:~ # for i in `seq 1 10`; do ifconfig p4p2 | grep RX | grep overruns; sleep 1; done
          RX packets:253639788669 errors:305620773 dropped:0 overruns:305376630 frame:244143
          RX packets:253639812355 errors:305621201 dropped:0 overruns:305377058 frame:244143
          RX packets:253639834600 errors:305621201 dropped:0 overruns:305377058 frame:244143
          RX packets:253639892990 errors:305621455 dropped:0 overruns:305377312 frame:244143
          RX packets:253639913026 errors:305621455 dropped:0 overruns:305377312 frame:244143
          RX packets:253639919136 errors:305621455 dropped:0 overruns:305377312 frame:244143
          RX packets:253639935095 errors:305622380 dropped:0 overruns:305378237 frame:244143
          RX packets:253639954560 errors:305623012 dropped:0 overruns:305378869 frame:244143
          RX packets:253639961150 errors:305623012 dropped:0 overruns:305378869 frame:244143
          RX packets:253639971680 errors:305623012 dropped:0 overruns:305378869 frame:244143
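To make the per-queue imbalance in the `ethtool -S` output easier to see, here is a rough sketch that turns the per-queue `rx_discards` lines into percentages of the total. The function name `queue_discard_share` is made up for illustration, and it assumes the `[N]: rx_discards: VALUE` line format that bnx2x prints above; other drivers may format their statistics differently.

```shell
#!/bin/sh
# Hypothetical helper: summarise per-queue rx_discards from `ethtool -S <iface>`
# output read on stdin. Assumes bnx2x-style "[N]: rx_discards: VALUE" lines.
queue_discard_share() {
    awk '
        # Only per-queue lines match; the aggregate "rx_discards:" line has no "[N]:" prefix.
        $1 ~ /^\[[0-9]+\]:$/ && $2 == "rx_discards:" {
            q = $1; gsub(/[^0-9]/, "", q)      # "[3]:" -> "3"
            count[q] = $3; total += $3
            if (q + 0 > max) max = q + 0
        }
        END {
            for (i = 0; i <= max; i++)
                if (i in count)
                    printf "queue %d: %d (%.1f%%)\n", i, count[i], total ? 100 * count[i] / total : 0
            printf "total: %d\n", total
        }'
}

# Usage (on the affected host):
#   ethtool -S p4p2 | queue_discard_share
```

With the counters shown above, the eight per-queue values add up to the aggregate 103189181, queues 3 and 7 together account for roughly 63% of it, and queues 0 and 1 stay below 0.1% each, so the discards are far from evenly spread across queues.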
Gp DB 4.3
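Problem description
After installing the application, NIC usage is as shown in the figure below:
However, at peak times Nagios shows every node in the cluster reporting the error below. Similar errors also appeared when running bare (without the application), but I did not manage to capture NIC packets at the time: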
Interface 11 Active checks of the service have been disabled - only passive checks are being accepted Perform Extra Service Actions
CRITICAL 09-20-2016 10:47:51 0d 0h 11m 46s 1/1
CRIT - [p4p2] (up) MAC: f4:e9:d4:9d:cb:92, 10.00 Gbit/s, in: 262.67 MB/s, in-errors: 0.16%(!!) >= 0.1, out: 237.76 MB/s
The command the check actually runs is:
echo '<<
sed 1,2d /proc/net/dev
Overall, the error rate stays between 0.1% and 0.6% and only very rarely reaches 1%; traffic at those times ranged from roughly 20 MB/s to 200 MB/s.
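For anyone who wants to reproduce the percentage by hand, here is a minimal sketch that computes an RX error rate from two snapshots of `/proc/net/dev`. The function name `rx_error_pct` is invented for this post, and the formula errors / (packets + errors) is my assumption about how the monitoring check defines "in-errors"; I have not confirmed it against the check's source.

```shell
#!/bin/sh
# Hypothetical helper: RX error percentage for interface $1 between two
# snapshots of /proc/net/dev saved in files $2 (earlier) and $3 (later).
# Assumed formula: 100 * delta_errs / (delta_packets + delta_errs).
rx_error_pct() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }                      # "p4p2:123" -> "p4p2 123"
        $1 == ifc {
            # After the substitution: $2=rx_bytes $3=rx_packets $4=rx_errs
            if (NR == FNR) { p0 = $3; e0 = $4 }    # first file: earlier sample
            else           { p1 = $3; e1 = $4 }    # second file: later sample
        }
        END {
            dp = p1 - p0; de = e1 - e0
            if (dp + de <= 0) { print "n/a"; exit }
            printf "%.2f\n", 100 * de / (dp + de)
        }' "$2" "$3"
}

# Usage (on the affected host):
#   cat /proc/net/dev > /tmp/s0; sleep 10; cat /proc/net/dev > /tmp/s1
#   rx_error_pct p4p2 /tmp/s0 /tmp/s1
```

As a sanity check with the same formula: between the first and last line of the first `ifconfig` loop above, packets grew by 218754 and errors by 1160, which works out to about 0.53%, inside the 0.1% to 0.6% range reported by Nagios.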
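First question: is this actually a problem? My own feeling is that it is, which is why I have put effort into it. What do the experts here think?
Second question: how do I fix it? I have a rough idea and would appreciate criticism.
From what others have written online, I suspect the problem lies in the RX errors. Since most of them are overruns, could it be that the cause is not the ring buffer but interrupt handling?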
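One way to start separating the two hypotheses is to look at how the NIC's interrupts are spread across CPUs. The sketch below reads `/proc/interrupts` on stdin and, for every IRQ whose line matches a pattern, prints the total count and the CPU that handled most of them. The function name `irq_spread` is made up, and it assumes the interface name appears in the IRQ description (for example something like `p4p2-fp-0`), which depends on the driver.

```shell
#!/bin/sh
# Hypothetical helper: per-IRQ totals and busiest CPU for lines of
# /proc/interrupts (read on stdin) that match the pattern in $1.
irq_spread() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }            # header row: one column per CPU
        $0 ~ pat {
            irq = $1; sub(/:$/, "", irq)       # "60:" -> "60"
            total = 0; best = 0; bestcpu = 0
            for (i = 2; i <= ncpu + 1; i++) {  # per-CPU counters follow the IRQ number
                total += $i
                if ($i + 0 > best) { best = $i + 0; bestcpu = i - 2 }
            }
            printf "irq %s %s total=%d busiest=CPU%d (%d)\n", irq, $NF, total, bestcpu, best
        }'
}

# Usage (on the affected host):
#   irq_spread p4p2 < /proc/interrupts
```

If all queues turn out to be serviced by one or two CPUs, interrupt affinity looks like the more likely culprit. The current and maximum RX ring sizes can be compared with `ethtool -g p4p2`, provided the driver reports them.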