
Access Dropping

Ai6bx
Access Dropping
I just deployed a Rocket M5 XW with an omni antenna. I have several of these deployed at this time with no difficulties to date, but what makes this one different is that it is DtD linked with a PowerBeam M5 400 and a NanoBridge M2. All have been running the latest release candidates, and the longest cable run to any single node is about 60'.

Since the XW went live on Friday I have been experiencing dropouts of this node and an attached node. By dropout, I mean losing the ability to navigate to the individual node GUI. There is no evidence of the node resetting or losing power, nor is there evidence of dropouts in connectivity to the ToughSwitch. Originally it appeared it might be an IP conflict; however, this was worked through and appears to be resolved. I experienced the same "drop" while in an SSH session to the node. I am actually not convinced that the problem is with the XW; it might be in the PowerBeam, as that is the newer technology and is using the gigabit connection to the ToughSwitch. Andre, Joe, and Conrad, this feels like it may be similar to the experience we had on the Elsinore tower working with the PowerBeam to Sleeping Indian. I think that was a cable-length issue and was solved by locking the node to 100 Mbps. This cable run is significantly shorter than what we dealt with there, so perhaps I am on the wrong path.

I did load 171 to the XW this morning with no change and was thinking of trying it on the PowerBeam.

Ideas?

Keith - AI6BX
KG6JEI
So ssh sessions drop when
So SSH sessions drop when you're directly connected to the node, is that correct?
AJ6GZ
Do the Toughswitches have any
Do the ToughSwitches have any kind of port-level error reporting to rule out a layer 1 problem, i.e., framing errors, or simply TX or RX errors? I've been pinging the Rocket omni and the NanoBridge M2, and when the ping fails, the opposite node sends an ICMP unreachable packet from its DtD address. It happens in both directions. Sounds to me like OLSR is flapping for whatever reason. Also, as expected, the DtD LQ% begins to drop. I'm seeing no loss on the PowerBeam link or on nodes back to me in this direction. Ping is solid on the LAN and DtD IPs of the PowerBeam, which should follow the physical Ethernet port. Can you remotely control the PoE to one node at a time?
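A rough sketch of how that kind of check can be run from a shell on the node itself, assuming tcpdump is available there (it is not necessarily part of the stock image) and substituting the far node's real DtD address for the placeholder:

# Watch the DtD VLAN (eth0.2 on these nodes) for ICMP "destination unreachable" replies
tcpdump -i eth0.2 -n 'icmp[icmptype] == icmp-unreach'

# In a second session, ping the far node's DtD address (10.x.x.x is a placeholder) and timestamp each line
ping 10.x.x.x | while read line; do echo "$(date): $line"; done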
Ai6bx
Yes, SSH drops
Yes, SSH sessions drop as well, though I have not physically been connected to the node. This has been remote via the mesh.
Ai6bx
Yes, Ian, I can do some port level reporting
at the switch and can also power down all but the PowerBeam. If I shut it down, I lose access to the site.
Ai6bx
I see the same
I see the same decrease in the LQ/NLQ of the DtD node as well when this happens. Hmmm.
KG6JEI
Sounds like it's the symptoms

Sounds like the symptoms I have seen a time or two, which are already being looked into by the dev team.

Best bet for now might be to turn off the new device creating the new link until it gets sorted, as it sounds like that's your trigger point; other than that, it's probably not the hardware.

Ai6bx
Thanks, Conrad
Thanks, Conrad. Are you developing that now, or what might the timeline be? I'm just estimating how long I might keep this node down, as people are already linking into it.

Keith
AJ6GZ
.
Does 'ifconfig' show errors on any interfaces?
Ai6bx
None, but it did take three attempts to get the connection:
eth0      Link encap:Ethernet  HWaddr 44:D9:E7:D1:8A:67
          inet addr:10.132.83.57  Bcast:10.132.83.63  Mask:255.255.255.248
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:126011 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35988 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:107887293 (102.8 MiB)  TX bytes:15449412 (14.7 MiB)
          Interrupt:4

eth0.1    Link encap:Ethernet  HWaddr 44:D9:E7:D1:8A:67
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14967 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7487 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4909176 (4.6 MiB)  TX bytes:2560554 (2.4 MiB)

eth0.2    Link encap:Ethernet  HWaddr 44:D9:E7:D1:8A:67
          inet addr:10.209.138.103  Bcast:10.255.255.255  Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:99806 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28500 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:100237923 (95.5 MiB)  TX bytes:12744568 (12.1 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:56560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5668893 (5.4 MiB)  TX bytes:5668893 (5.4 MiB)

wlan0     Link encap:Ethernet  HWaddr 44:D9:E7:D0:8A:67
          inet addr:10.208.138.103  Bcast:10.255.255.255  Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:73370 errors:0 dropped:0 overruns:0 frame:0
          TX packets:73873 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:94195022 (89.8 MiB)  TX bytes:98572599 (94.0 MiB)

wlan0-1   Link encap:UNSPEC  HWaddr 44-D9-E7-D0-8A-67-00-44-00-00-00-00-00-00-00-00
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:111250 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104605454 (99.7 MiB)  TX bytes:0 (0.0 B)
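As a side note, the same counters can be watched over time without re-running ifconfig. The sysfs paths below are standard Linux (not anything AREDN-specific), so a loop like this sketch would show whether errors or drops ever tick up during a dropout:

# Poll the eth0 error/drop counters every 10 seconds
while true; do
  echo "$(date) rx_err=$(cat /sys/class/net/eth0/statistics/rx_errors)" \
       "tx_err=$(cat /sys/class/net/eth0/statistics/tx_errors)" \
       "rx_drop=$(cat /sys/class/net/eth0/statistics/rx_dropped)"
  sleep 10
done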
AE6XE
Keith,  In this 3 device
Keith, in this 3-device dtdlink cluster, the node W6LAR-8-NB-M2-P2P-RCH does not show any RF links in its current state. This node meets the conditions to experience what we call "slugbug" in the RC and Nightly build images. This shouldn't be the root cause of the drops to the Rocket XW omni (I don't see how your route path could be going through the M2). It is something additional to keep an eye on until this 2 GHz link comes online; if you try to access it and it needs a reboot, power cycle it from the ToughSwitch.
   
http://bloodhound.aredn.org/products/AREDN/ticket/234

If you want to give me access to W6LAR's Rocket XW Omni, I can log in to remotely monitor and help get to the bottom of it. Check out the ToughSwitch port statistics: is anything different between the statistics for the PBE M5 port and the Rocket M5 port? I put a PBE-M5-620 on a tower with a ~100' cat5 run into the ToughSwitch and did not have to do any special configuration for the 1 Gbit port on the mesh node; it has always worked fine. But that cable may be shorter than the one involved in the issues we saw at Elsinore Pk.

For others' benefit: these newer PBE devices come with 1 Gbit ports instead of a 100 Mbit Ethernet port. The ToughSwitch was having trouble establishing a stable link with the mesh node over the cat5, and we ended up using a custom mesh node setting to lock it at 100 Mbit. I didn't have to do that at another site, presumably due to a shorter cable.
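For illustration only, the "100 Mbit lock" amounts to turning off auto-negotiation on the node's Ethernet port. On a generic OpenWrt/Linux device that could be sketched with ethtool as below; the actual custom setting used on the mesh node may be implemented differently, so treat this as an assumption-laden example rather than the exact procedure:

# Force the port to 100/Full instead of letting it negotiate 1 Gbit (requires ethtool on the device)
ethtool -s eth0 speed 100 duplex full autoneg off

# Confirm what the port is actually running at
ethtool eth0 | grep -i -E 'speed|duplex'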

Joe AE6XE
Ai6bx
Credentials

Joe,

I sent you the login info for the Rocket XW as well as the Nanobridge M2 via email. I can also shut down the M2 from the switch to see if that has any positive impact. 

No TX or RX errors in the TS at this time. I did also just start system error logging in the switch for drops etc.

Thanks,

Keith

AE6XE
Trouble-shooting update.  
Trouble-shooting update.  

After monitoring and poking around a bit, here's what Keith and I have discovered.

There's a 'dtdlink' cluster of 3 nodes; call them A, B, and C. We are accessing node C from node A over DtDlink cat5 cables (ToughSwitch in the middle). I discovered a lot of unrelated UDP traffic also going between nodes A and C. As I monitored this link, the OLSR Hello UDP packets were also on the link. These packets were not all arriving on the other side: OLSR was dropping down to 89% LQ and 100% NLQ, and at that point OLSR was showing an ETX of 1.23 for this DtDlink.

I'll have to dig through some code, as there's a threshold to drop a cat5 link, as I recall at 95% LQ. But the logic seems to revert to the OLSR settings used for an RF path, with a best ETX of 1 and a link drop at something like 20% LQ. Normally this DtDlink has its ETX pegged at 0.1.

At this point, 'A' still thinks the ETX, or cost, to hop through B on to C is 0.1 + 0.1 = 0.2, so it flips routing through node B, thinking it's now lower cost. Then the A -> C link goes back up to 100% LQ, and about ~30 seconds later it flips back to normal. I watched this flip-flop ~6 times while tracking the OLSR changes.

This route flip-flop coincides with the symptoms we see -- connection dropping and delay in responding.    We'll have to investigate further.   Is there a hardware factor causing packets to be lost?  All 3 nodes see the traffic from everyone else, so why is only one cat5 directional path showing lower LQ and not the others?   

Here's what I captured in the middle of the event on node 'A' (when the access to node 'C' drops):
Table: Links
Local IP         Remote IP         Hyst.   LQ      NLQ     Cost
10.63.24.253     10.17.252.60      0.00    1.000   1.000   0.100    (A -> B)
10.63.24.253     10.209.138.103    0.00    0.890   1.000   1.123    (A -> C)
10.63.24.253     10.176.139.122    0.00    0.862   0.925   1.252    (A -> RF remote node)

The dtdlink directional path from 10.209.138.103 -> 10.63.24.253 is showing the 89% LQ.  
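For anyone who wants to watch this happen, the same Links table can be polled directly from olsrd. Assuming the txtinfo plugin is enabled on its default port (2006), which may not be the case on every build, a simple loop like this logs LQ/NLQ/Cost so the flip-flop shows up with timestamps:

# Log the OLSR link table every 5 seconds (assumes the txtinfo plugin is listening on port 2006)
while true; do
  date
  echo /links | nc 127.0.0.1 2006
  sleep 5
done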

Keith, as the next step I'd inspect the cabling and ports between the ToughSwitch and this node, 10.63.24.253 Dtdlink.AI6BX-8-PBE-M5-P2P-HC.local.mesh. Shielded? Corrosion in the ports/connectors? Crimping good?

10.209.138.103 is dtdlink.AI6SW-8-RM5-XW-Omni.local.mesh

Joe AE6XE
AJ6GZ
Another update… sorry if this is long-winded.
While scanning the horizon with my PowerBeam M5 400, I started to notice what looked like a bad network cable or connection. Pings were being lost (over 5%) between the laptop and the node's LAN IP address, so much so that at times the entire Ethernet connection would disconnect per Windows. I could even see the link light on the 400 going on and off.

Upon further testing with a big pile of patch cables, various power injectors, laptops, switches, and a lot of patience, I narrowed down a few things. Two laptops, a Dell and a Toughbook, containing the Intel i219 and i218 NICs and running Windows 7, both exhibited issues. However, running a bootable Linux image on both laptops seemed to resolve the problem. One laptop, an older i7 Dell on Win 7 with an older 85xxx chip: no problems. HP 1810 gigabit switch: no issues. Several Cisco 3750 (10/100-only models) switches: no problem. Connecting any of the laptops through any switch: no problem.

I replicated this on two other live M5 400s (KD6EPQ-9-PBM5-RdlsEOC-1.local.mesh and -2), but only while using one of the triggering laptops. These nodes are normally plugged into a 3750 10/100 switch with no problem, and there are no errors on the switch ports. Updating laptop drivers, setting to 100/Full (or anything else), changing power-save settings, flow control, offloading, etc. had no positive effect. Paths through the RF side of the node never exhibited any issues, and the node's LAN IP remained ping-able at all times through RF.

Keith forced the ToughSwitch 5 and the M5 400 node referenced above (AI6BX-8-PBE-M5-P2P-HC.local.mesh) to 100/Full, and the route-flapping problem at the site went away. On a side note, all on-site cabling was verified the last time I was up there. I plugged in through the switch and saw the problem; we thought maybe it was a bad switch. I had plugged in directly to the node but apparently didn't let it sit long enough to fail; it can go a minute or so without failing at times.

I forced my M5 400 node to 100/Full and the problem went away on the direct laptop tests. Laptop NIC settings didn't matter at that point, whether set to 100/Full or Auto. (Set both the same, of course, in the real world.) So it would seem there may be a driver issue on the M5 400 affecting auto-negotiation, but only with some specific chipset/driver/switch/OS combinations. 3.17.1.0RC1 and Dev 171 both exhibit this behavior. These are all non-ISO models, and I have not seen the problem with the PB M5-300 model.

Let me know if you want any support files or access. I'll leave the "fixed" test node on the air (KD6EPQ-1-TEST.local.mesh).

Ian
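A quick way to see what a Linux host actually negotiated, and to catch the link bouncing the way the Windows laptops did, is to watch carrier and speed from sysfs. These are standard Linux paths; adjust the interface name to match the laptop's NIC:

# Print link state and negotiated speed once per second (speed reads -1 or errors while the link is down)
while true; do
  echo "$(date) carrier=$(cat /sys/class/net/eth0/carrier 2>/dev/null)" \
       "speed=$(cat /sys/class/net/eth0/speed 2>/dev/null) Mb/s"
  sleep 1
done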
AE6XE
Bingo.  Sorry, I should have
Bingo.  Sorry, I should have caught that earlier given the model involved.   We saw this previously at Elsinore Pk with a PBE-M5-400 too.   

These newer devices with the Gigabit ports are at risk of bouncing around on the rates and causing havoc.  The PBE-M5-620 may also exhibit this symptom, although I have one at Peasants Pk on an 8-port ToughSwitch.  This combination is working fine with no special settings. 

Andre, Conrad, did either of you create a ticket earlier on this issue from the Elsinore experience? Or do we need to submit it now?

Joe AE6XE
Ai6bx
We are all human!
Joe, 

Glad to know we are all human and occasionally miss something in our reading. Good to know too that my initial thought was not that far off. :)
KG6JEI
To my knowledge no ticket has
To my knowledge no ticket has been created for this issue.
K6AH
No ticket was created.  It
No ticket was created.  It was my node and I assumed it to be a result of a marginal cable termination.  It was the first time any of us had seen the problem, so we were content to wait until we saw it again before documenting it in a ticket.

Andre
AJ6GZ
#265
I entered ticket #265 for this issue.
