I just deployed a Rocket M5 XW with an Omni antenna. I have several of these deployed with no difficulties to date, but what makes this one different is that it is DtD-linked with a PowerBeam M5 400 and a Nanobridge M2. All have been running the latest release candidates, and the longest cable run to any single node is about 60’.
Since the XW went live on Friday, I have been experiencing dropouts of this node and an attached node. By dropout, I mean losing the ability to navigate to the individual node’s GUI. There is no evidence of the node resetting or losing power, nor is there evidence of dropouts in connectivity to the ToughSwitch. Originally it appeared it might be an IP conflict; however, this was worked through and appears to be resolved. I experienced the same “drop” while in an SSH session to the node. I am actually not convinced that the problem is with the XW; it might be in the PowerBeam, as that is still the newer technology and is using the gigabit connection to the ToughSwitch. Andre, Joe, and Conrad, this feels like it may be similar to the experience we had while on the Elsinore tower working with the PowerBeam to Sleeping Indian. I think that was a cable-length issue, solved by locking the node to 100 Mbps, but this cable run is significantly shorter than what we dealt with there. Perhaps I am on the wrong path.
I did load 171 to the XW this morning with no change and was thinking of trying it on the PowerBeam.
Ideas?
Keith - AI6BX
So SSH sessions drop when you're directly connected to the node, is that correct?
Do the ToughSwitches have any kind of port-level error reporting to rule out a layer 1 problem, i.e., framing errors, or simply TX or RX errors?
I’ve been pinging the Rocket Omni and Nanobridge M2, and when a ping fails, the opposite node sends an ICMP unreachable packet from its DtD address. It happens in both directions. It sounds to me like OLSR is flapping for whatever reason. Also, as expected, the DtD LQ% begins to drop.
I'm seeing no loss on the PowerBeam link or nodes back to me in this direction. Ping is solid on the LAN and DtD IPs of the PowerBeam, which should follow the physical Ethernet port.
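For anyone wanting to replicate this kind of monitoring, here is a minimal Python sketch (not what was actually run here; the target IPs are just example DtD addresses from this cluster, and it assumes the Linux iputils ping) that timestamps ping failures so they can be lined up against the OLSR LQ changes:

#!/usr/bin/env python3
# Sketch only: timestamp ping failures to the DtD addresses so drops can be
# correlated with OLSR LQ changes.  Substitute the real DtD IPs for TARGETS.
import subprocess
import time
from datetime import datetime

TARGETS = ["10.209.138.103", "10.63.24.253"]   # example DtD addresses

def ping_once(host):
    """Return True if a single ping gets a reply within 1 second."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

while True:
    for host in TARGETS:
        if not ping_once(host):
            print(datetime.now().isoformat(), "no reply from", host)
    time.sleep(2)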
Can you remotely control the PoE to one node at a time?
Yes, SSH sessions drop as well, though I have not been physically connected to the node; this has all been remote via the mesh.
Yes, at the switch, and I can also power down all but the PowerBeam. If I shut that one down, I lose access to the site.
I see the same decrease in the LQ/NLQ of the DtD node as well when this happens. Hmmm.
Sounds like the symptoms I have seen a time or two, which are already being looked into by the dev team.
Best bet for now might be to turn off the new device creating the new link until it gets sorted, since it sounds like that's your trigger point; other than that, it's probably not the hardware.
Thanks, Conrad. Are you working on that now, or what might the timeline be? I'm just trying to estimate how long I might keep this node down, as people are already linking into it.
Keith
Does 'ifconfig' show errors on any interfaces?
eth0 Link encap:Ethernet HWaddr 44:D9:E7:D1:8A:67
inet addr:10.132.83.57 Bcast:10.132.83.63 Mask:255.255.255.248
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:126011 errors:0 dropped:0 overruns:0 frame:0
TX packets:35988 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:107887293 (102.8 MiB) TX bytes:15449412 (14.7 MiB)
Interrupt:4
eth0.1 Link encap:Ethernet HWaddr 44:D9:E7:D1:8A:67
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14967 errors:0 dropped:0 overruns:0 frame:0
TX packets:7487 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4909176 (4.6 MiB) TX bytes:2560554 (2.4 MiB)
eth0.2 Link encap:Ethernet HWaddr 44:D9:E7:D1:8A:67
inet addr:10.209.138.103 Bcast:10.255.255.255 Mask:255.0.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:99806 errors:0 dropped:0 overruns:0 frame:0
TX packets:28500 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:100237923 (95.5 MiB) TX bytes:12744568 (12.1 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:56560 errors:0 dropped:0 overruns:0 frame:0
TX packets:56560 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:5668893 (5.4 MiB) TX bytes:5668893 (5.4 MiB)
wlan0 Link encap:Ethernet HWaddr 44:D9:E7:D0:8A:67
inet addr:10.208.138.103 Bcast:10.255.255.255 Mask:255.0.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:73370 errors:0 dropped:0 overruns:0 frame:0
TX packets:73873 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:94195022 (89.8 MiB) TX bytes:98572599 (94.0 MiB)
wlan0-1 Link encap:UNSPEC HWaddr 44-D9-E7-D0-8A-67-00-44-00-00-00-00-00-00-00-00
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:111250 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:104605454 (99.7 MiB) TX bytes:0 (0.0 B)
Keith, in this 3-device dtdlink cluster, the node W6LAR-8-NB-M2-P2P-RCH does not show any RF links in its current state. This node meets the conditions to experience what we call "slugbug" in the RC and nightly build images. This shouldn't be the root cause of drops to the Rocket XW Omni (I don't see how your route path could be going through the M2), but it is something additional to keep an eye on until this 2 GHz link comes online: when you try to access it, you may need to reboot it (power cycle from the TS).
http://bloodhound.aredn.org/products/AREDN/ticket/234
If you want to give me access to W6LAR's Rocket XW Omni, I can log in to remotely monitor and help get to the bottom of it. Check out the TS port statistics: is anything different between the ports for the PBE M5 and the Rocket M5? I put a PBE-M5-620 on a tower with ~100' of cat5 into the TS. I did not have to do any special configuration for the 1 Gbit port on that mesh node, and it has always worked fine. But that cable may be shorter than the one involved in the issues we saw at Elsinore Pk.
For others' benefit: these newer PBE devices come with a 1 Gbit port instead of a 100 Mbit Ethernet port. The ToughSwitch had an incompatibility establishing a stable link with the mesh node over the cat5, and we ended up using a custom mesh node setting to lock it at 100 Mbit. I didn't have to do that at another site, I'm guessing due to a shorter cable.
Joe AE6XE
Joe,
I sent you the login info for the Rocket XW as well as the Nanobridge M2 via email. I can also shut down the M2 from the switch to see if that has any positive impact.
No TX or RX errors in the TS at this time. I did also just start system error logging in the switch for drops etc.
Thanks,
Keith
Trouble-shooting update.
After monitoring and poking around a bit. Here's what Keith and I have discovered.
There's a 'dtdlink' cluster of 3 nodes; call them A, B, and C. We are accessing node C from node A over DtDlink cat5 cables (with the ToughSwitch in the middle). I discovered a lot of unrelated UDP traffic also going between nodes A and C. As I monitored this link, the OLSR Hello UDP packets were also on the link. Not all of these packets were arriving on the other side; OLSR was dropping down to 89% LQ and 100% NLQ, and at that point it was showing an ETX of 1.123 for this DtDlink.
I'll have to dig through some code; as I recall, there's a threshold at 95% LQ to drop a cat5 link. But the logic seems to revert to the OLSR settings used for an RF path, with a best ETX of 1 and a link drop at something like 20% LQ. Normally this DtDlink has its ETX pegged at 0.1.
At this point, 'A' still thinks the ETX (cost) to hop through B on to C is 0.1 + 0.1 = 0.2, so it flips routing through node B, thinking it's now lower cost. Then the A -> C link goes back up to 100% LQ, and about ~30 seconds later the route flips back to normal. I watched this flip-flop ~6 times while tracking the OLSR changes.
This route flip-flop coincides with the symptoms we see -- connections dropping and delays in responding. We'll have to investigate further. Is there a hardware factor causing packets to be lost? All 3 nodes see the traffic from everyone else, so why is only one directional cat5 path showing lower LQ and not the others?
Here's what I captured in the middle of the event on node 'A' (when the access to node 'C' drops):
Table: Links
Local IP Remote IP Hyst. LQ NLQ Cost
10.63.24.253 10.17.252.60 0.00 1.000 1.000 0.100 (A -> B)
10.63.24.253 10.209.138.103 0.00 0.890 1.000 1.123 (A -> C)
10.62.24.253 10.176.139.122 0.00 0.862 0.925 1.252 (A -> RF remote node)
The dtdlink directional path from 10.209.138.103 -> 10.63.24.253 is showing the 89% LQ.
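For context, here is a rough worked example (a sketch only, not olsrd source code) of the arithmetic behind the flip, assuming the standard OLSR link-quality extension where a link's ETX is 1/(LQ x NLQ) and routes are chosen by lowest total cost; the numbers come from the capture above, and the 0.1 pegged DtD cost is the value shown for the healthy A -> B link:

# Sketch: why the route flip-flops (values from the olsr Links capture above)
def etx(lq, nlq):
    """Expected transmission count for one link: 1 / (LQ * NLQ)."""
    return float("inf") if lq == 0 or nlq == 0 else 1.0 / (lq * nlq)

DTD_PEGGED = 0.1                    # cost of a healthy DtDlink, per the A -> B row

direct_a_to_c = etx(0.890, 1.000)   # ~1.124, matching the 1.123 cost shown
via_b = DTD_PEGGED + DTD_PEGGED     # A -> B -> C while B's DtD links look clean

print("direct A->C cost:", round(direct_a_to_c, 3))
print("A->B->C cost    :", round(via_b, 3))
# 0.2 < 1.124, so OLSR routes through B; once the direct link reports 100% LQ
# again, its cost drops back to ~0.1 and the route flips back -- the flip-flop
# observed above.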
Keith, as the next step I'd inspect the cabling and ports between the ToughSwitch and this node, 10.63.24.253 (dtdlink.AI6BX-8-PBE-M5-P2P-HC.local.mesh). Shielded? Corrosion in the ports/connectors? Crimping good?
10.209.138.103 is dtdlink.AI6SW-8-RM5-XW-Omni.local.mesh
Joe AE6XE
While scanning the horizon with my PowerBeam M5 400, I started to notice what looked like a bad network cable or connection. Pings were being lost (over 5%) between the laptop and the node’s LAN IP address, so much so that at times the entire Ethernet connection would disconnect per Windows. I could even see the link light on the 400 going on and off. Upon further testing with a big pile of patch cables, various power injectors, laptops, switches, and a lot of patience, I narrowed down a few things.
Two laptops, a Dell and a Toughbook, containing Intel i219 and i218 NICs and running Windows 7, both exhibited issues. However, running a bootable Linux image on either laptop seemed to resolve the problem.
One laptop, an older i7 Dell on Win 7 with an older 85xxx chip: no problems.
HP 1810 gigabit switch: no issues. Several Cisco 3750 (10/100-only models) switches: no problem.
Connecting any of the laptops through any switch: no problem.
I replicated this on two other live M5 400s (KD6EPQ-9-PBM5-RdlsEOC-1.local.mesh and -2), but only while using one of the triggering laptops. These nodes are normally plugged into a 3750 10/100 switch with no problem. No errors on the switch ports.
Updating laptop drivers, setting to 100/Full (or anything else), changing power-save settings, flow control, offloading, etc. had no positive effect.
Paths through the RF side of the node never exhibited any issues. Also, the node’s LAN IP remained pingable at all times through RF.
Keith forced the ToughSwitch 5 and the M5 400 node referenced above (AI6BX-8-PBE-M5-P2P-HC.local.mesh) into 100/Full, and the route-flapping problem at the site went away. On a side note, all on-site cabling was verified the last time I was up there. I plugged in through the switch and saw the problem; we thought maybe it was a bad switch. I had plugged in directly to the node but apparently didn’t let it sit long enough to fail; it can go a minute or so without failing at times.
I forced my M5-400 node into 100/Full and the problem went away on the direct laptop tests. Laptop NIC settings didn’t matter at this point whether set to 100/Full or Auto. (Set both the same of course in the real world.)
So it would seem there may be a driver issue on the M5 400 affecting auto-negotiation only with some specific chipset/driver/switch/OS combinations.
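If anyone wants to watch this from the Linux/bootable-image side, here is a small sketch (my own illustration, not something from these tests) that polls sysfs for the carrier state and negotiated speed/duplex, which makes a bouncing link or a failed 100/Full lock easy to spot; adjust IFACE to the interface facing the node:

# Sketch: poll Linux sysfs for link state and negotiated rate on the laptop NIC
from pathlib import Path
import time

IFACE = "eth0"   # adjust to the interface plugged into the node

def read_attr(name):
    try:
        return Path("/sys/class/net/{}/{}".format(IFACE, name)).read_text().strip()
    except OSError:
        return "n/a"   # speed/duplex are unreadable while the link is down

while True:
    print("carrier={} speed={}Mb/s duplex={}".format(
        read_attr("carrier"), read_attr("speed"), read_attr("duplex")))
    time.sleep(1)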
3.17.1.0RC1 and Dev 171 both exhibit this behavior. These are all non-ISO models. I have not seen a problem with the PB M5-300 model. Let me know if you want any support files or access. I'll leave the "fixed" test node on the air (KD6EPQ-1-TEST.local.mesh).
Ian
Bingo. Sorry, I should have caught that earlier given the model involved. We saw this previously at Elsinore Pk with a PBE-M5-400 too.
These newer devices with the Gigabit ports are at risk of bouncing around on the rates and causing havoc. The PBE-M5-620 may also exhibit this symptom, although I have one at Peasants Pk on an 8-port ToughSwitch. This combination is working fine with no special settings.
Andre, Conrad, did either of you create a ticket earlier on this issue from the Elsinore experience? Or do we need to submit this now?
Joe AE6XE
Joe,
Glad to know we are all human and occasionally miss something in our reading. Good to know too that my initial thought was not that far off. :)
To my knowledge no ticket has been created for this issue.
No ticket was created. It was my node and I assumed it to be a result of a marginal cable termination. It was the first time any of us had seen the problem, so we were content to wait until we saw it again before documenting it in a ticket.
Andre
I entered ticket #265 for this issue.