Anyone experiencing link degradation?

AE6XE
Anyone experiencing link degradation?

Everyone in SoCal is seeing a significant degradation in link performance. Many links that have been great quality for months or years are suddenly unusable. Symptoms seem to be across all the channels in all the bands. At a major tower site, the 6+ P2P links coming in are usually near 100%/100%, but over the last couple of days they are showing ~30%/60% LQ/NLQ. This is going on from San Diego up through Ventura, maybe beyond.

Anyone elsewhere seeing similar?  

Joe AE6XE

w6bi
Tropo

We're attributing it to this: http://www.dxinfocentre.com/tropo_wam.html

Some local stations have reported multi-state propagation on 450 MHz.

kg6wxc
Might be Tropo

I have been watching those maps that Orv linked to for a couple of days now.
Keep in mind those maps are only predictions, but there does seem to be a correlation.

If you look now, you'll notice an orange "blob" just off the So. Cal. coast. That sucker has been hanging around for a couple of days and is finally moving away.
This was that same map a couple of days ago: tropo_map-1800_9_5.png
Those dashed lines indicate an "Unstable" area due mostly to local thunderstorms.
http://www.dxinfocentre.com/propagation/hti.htm
It got worse after that: the pink area in the center got larger but was still off the coast, and then today the red area was right on top of us up here in the Ventura area (it may not have stretched all the way to the OC).
Before today (and for the last few days) it has mostly been an evening phenomenon, going away a couple of hours after sunset or lasting until about midnight to 1 am. It was consistent.
Today, the link degradation lasted all day and it did not let up. It has only just started to get better as of now.

I am certainly no expert on this kinda stuff and this should be taken with a grain of salt (or 3), but I think there may be a pattern here.

Ai6bx
Absolutely seeing

I am absolutely seeing what you are. I just got back from a real quick trip to NY and MA and have been looking more closely, as I had seen this while tunneling in Thursday night. Some of my most solid links have been 100x100 since they went online and have been providing near-perfect throughput. These links are now down in the 20s to high 50s. Not good. I jumped in and started rebooting some of these today before seeing your post, hoping it might be an OLSR issue or something solvable. No joy.

K6AH
San Marcos Pk

Keith, I thought SMP was also experiencing this on the link to Elsinore. But it's not. It's definitely a radio issue. Unfortunately, I also can't talk to the switch, so I can't even power-cycle the radio. I'll need to get back up there.

The San Diego network as a whole has been rock-solid throughout this.  Interesting we weren't affected down here.
 

AJ6GZ
Yup

I'm seeing OLSR entries run up and down from a normal 1200-1300 to as low as Total = 30, Nodes = 10. A few minutes later it's back to normal. It has been doing this continually since Friday. And yes, LQ/NLQ values are all low.

It's also dragging down the LQ on my tunnel connection (as seen from the server-side node, which is closest to the mesh at large). NLQ is still 100%. There are only 3 nodes down that tunnel, with almost zero traffic: just keepalives and ping monitoring.

Anyone seeing any different or new symptoms?

Ian

AE6XE
One of the possibilities is

One of the possibilities is links being saturated with traffic and impacting OLSR. I looked at some packet captures on a few nodes, and at others sent to me. While I do see heavy OLSR traffic, it's still not enough traffic by itself to saturate the links and be a root cause. It looks more to me like a symptom -- lots of link updates to propagate around.

I am seeing a lot of meshchat traffic and other traffic. It would be a very good idea to turn off meshchat instances, particularly if there are still other instances that can be used. Map crawlers can be turned off too. This is just a process of elimination.

Check the link rates. If these are still good or high, then this isn't an RF-related issue.

Joe AE6XE

Ai6bx
Andre

Yes, I agree that the issue at SMP is the radio. I have reached out to see if Chris can power-cycle it through his web-enabled power strip; I think he also has Internet-based access to that, which would let him power-cycle it and reset the TS and all nodes. SMP being down may be what is making SD appear unaffected.

Ai6bx
Link Rates

Joe,

The link rates I am seeing are still decent: slightly worse than usual but still reasonable given the LQ/NLQ.

AE6XE
Best guess...   there's some

Best guess... there's some traffic occurring on the greater mesh network that in and of itself should be normal, with adequate RF links to support it. But this traffic is somehow triggering a defect in OLSR or blocking hello packets, which then go missing, and OLSR takes down the link because it drops below the threshold. There haven't been any big changes to the node counts, generally around 400+? Any backbone link that goes down then causes a cycle of lots of OLSR updates for all the nodes on the other side as it comes back up.

What we're not seeing (please post if you are seeing otherwise):

A) enough OLSR traffic that it is a root cause or significant contributor to the symptoms. I'm only seeing from ~8 to ~18 OLSR packets being sent out from a given node in a second. 18 x 1500 bytes/packet is tiny in comparison to our Mbps links. The traffic, while it is elevated, isn't enough volume to explain this issue, and is expected traffic when links are going up and down.
B) flooding the network with traffic everywhere. Just not seeing data flooding the network.
C) link rates are still relatively good/high. If we were having atmospheric ducting issues, bringing in more noise, etc., then the link rates would be correspondingly dropping with the additional interference and/or noise.
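
A rough sanity check of the arithmetic in point A can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate under stated assumptions: it treats every OLSR packet as a full 1500 bytes (an upper bound), and the 40 Mbps link rate is an assumed example figure, not a measurement tied to any particular node. The function name is made up for illustration.

```python
def olsr_share_percent(packets_per_second, bytes_per_packet, link_mbps):
    """Return OLSR traffic as a percentage of a link's nominal rate.

    Assumes every packet is bytes_per_packet long (worst case) and that
    link_mbps is the nominal link rate in megabits per second.
    """
    bits_per_second = packets_per_second * bytes_per_packet * 8
    return 100.0 * bits_per_second / (link_mbps * 1_000_000)


# Worst case from point A: 18 packets/s at 1500 bytes each.
# 18 * 1500 * 8 = 216,000 bit/s, i.e. about 0.2 Mbps.
print(f"{olsr_share_percent(18, 1500, 40):.2f}% of an assumed 40 Mbps link")
```

Even with the worst-case packet size, the OLSR load comes out well under 1% of such a link, which is consistent with treating it as a symptom and not a cause.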

Joe AE6XE
 

K6AH
...and San Diego, which, due

...and San Diego, which, due to a bad backbone radio, is currently disconnected from the greater SoCal network and is not seeing any of this.

K5DLQ
Wondering if there are non

Wondering if there are non-standard or modified nodes running that could be contributing? I.e., Pis running olsr, hamwan linking experiments, PCs running olsr, etc...

(not pointing fingers, but, just thought this may be a good data point)

 

w6bi
Back to normal

From our point of view in Ventura County, the network began recovering about 24 hours ago, and has since returned to normal.   This coincides (coincidentally?) with the fading of the tropo propagation noted earlier.
I skimmed the network mapper database, and found no 'non-standard' nodes that might be running funny OLSR daemons.    We had seen one or two in the past, but that's not the case at the moment.
 

K6CCC
A data point

Here's the data point I don't really understand. At my house I have a hAP (K6CCC-hAP-at-home) that is tunnel connected to Redlands, Glendale, Oxnard, and sometimes Orange County. Additionally there is a DtD link to a Mikrotik LHG 5HPnD (just the driven element) that connects to a LHG 5nD about 50 feet away in my garage. On 2 GHz, there is another hAP and a LHG 2nD (just the driven element) that are all within 4 feet of the connected hAP. All the 2 GHz links have S/N ratios in the mid 50 dB range, and the 5 GHz link is in the low 30s. Except for the hAP that has the tunnels, none of the other nodes have ANY traffic except OLSR.
When the "problem" was happening, all of my links including the tunnels went to crap.  Someone who knows more will have to figure out why the DtD and 4 foot RF links that have no traffic were affected.
And as I type this on Monday morning, all is healthy...
 

w6bi
More tropo?

I again have very poor link quality to my normally very strong 5 GHz access point 5 miles LOS away.
I see we have another tropo hot spot that's developed right off the coast.  http://www.dxinfocentre.com/tropo_wam.html
Anyone else seeing sudden link degradations this afternoon?

 

w6bi
..and now back to normal

We got reports from about 30 miles around with similar issues.   Everything has returned to normal now.

AE6XE
Older unpatched firmware?

Orv,   How many nodes on the SoCal network have firmware older than 3.18.x.x?   These devices are unpatched and subject to OLSR problems with map crawlers.  It is a possible explanation, particularly if any key links do not have current firmware.  

w6bi
Oldest firmware

Joe, there are 102 nodes reachable by the mapper that are running firmware older than 3.18.x.x.  Would you like the list?

AE6XE
Yes, we can make a push with

Yes, we can make a push with local stakeholders to get them upgraded. These devices are at risk of locking up until patched.

w6bi
Email?

I'll email the list out to the SoCal Hamnet mailing list.

w6bi
More on tropo

One of our local mesh networking guys happened to meet up with an FAA guy (also a ham I believe) recently and asked him if their 11 GHz links had had any recent disruption.  The guy said yes, they'd pretty much gone to hell a couple of times.  When shown the tropo map he said the tropo opening occurrences matched up closely with their observed outages.

AE6XE
The next time we see these

The next time we see these bad conditions, let's check the link rates and/or run iperf throughput tests. During the bad conditions, when I was checking on Pleasants Pk, the backbone links were only degraded maybe ~30%. A backbone link usually at 65 Mbps was down around 40 Mbps (from memory).

While this is a degradation we may be able to attribute to (and confirm as) environmental conditions, it doesn't fully explain the LQ% we were seeing. These are UDP broadcast packets that are going missing. What we need next time is the link rate table, to see why a given MCS rate is selected and the packet success rate over the link. If the link itself is showing only 5% loss for the chosen MCS rate, but OLSR UDP packets are showing 80% loss, then there's still more to explain.

The magic of 802.11n is that it is supposed to keep rolling with the punches of the environment: inversion layers, tropo ducting, fading, etc. This means the modulation and coding schemes are changed to maximize the data throughput possible given the current conditions. Thus, the link rates go down as conditions worsen. But the quality of the link (LQ%) should be maintained (or minimally affected) until the link drops to the lowest setting, MCS0, below which it can't function.

Joe AE6XE

AJ6GZ
Data collection

I have done packet captures, and the evidence points to an OLSR issue, not RF. We see LQs dropping on reliable tunnels and DtD links during the outages, too.

Are there any particular diagnostic commands that we can all run during the outage that might be helpful?

Ian

AE6XE
This data might be helpful:

This data might be helpful:

Compare the actual received OLSR packets with what OLSR says is received.    This can be done on the node, after installing the tcpdump package:

tcpdump -w /tmp/<hostname>.pcap -c 500 port 698
tcpdump -w /tmp/<hostname>.pcap -c 500 -i eth0.2 port 698  # packets coming in over dtdlink

ifconfig will show interfaces other than eth0.2 (dtdlink) to substitute for a tunnel, etc. -c 500 says capture 500 packets. Copy the pcap data file in /tmp down to your computer and open it in Wireshark. Each packet from a given neighbor will have a sequence # in the OLSR protocol. Out of 10, what % were missing? Does this compare to the LQ OLSR shows for that neighbor? (Did we receive it on the interface, and did OLSR also receive it, or was OLSR blocked/busy and lost it?)
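
The "what % were missing" count can be sketched in Python once the sequence numbers for one neighbor have been copied out of Wireshark by hand. This is a minimal sketch, not part of any AREDN tooling: the function name and the sample numbers are made up for illustration, and it assumes the packets are listed in capture order with no reordering. The OLSR packet sequence number is a 16-bit field, so the gap calculation wraps at 65536.

```python
def olsr_loss_percent(seq_numbers, modulus=65536):
    """Estimate the % of packets missing from one neighbor's capture.

    seq_numbers: OLSR packet sequence numbers from a single neighbor,
    in capture order. Gaps between consecutive numbers count as lost
    packets; exact duplicates are ignored. Assumes no reordering.
    """
    if len(seq_numbers) < 2:
        return 0.0
    expected = 1  # packets the neighbor sent over the observed span
    received = 1  # packets that actually showed up in the capture
    prev = seq_numbers[0]
    for seq in seq_numbers[1:]:
        gap = (seq - prev) % modulus  # handles 16-bit wraparound
        if gap == 0:
            continue  # duplicate sequence number, skip it
        expected += gap
        received += 1
        prev = seq
    return 100.0 * (expected - received) / expected


# Hypothetical sample: 13, 16 and 17 never arrived (3 of 9 missing).
sample = [10, 11, 12, 14, 15, 18]
print(f"{olsr_loss_percent(sample):.1f}% missing")
```

If the percentage computed from the capture is much lower than the loss implied by the LQ that OLSR reports for the same neighbor, the packets reached the interface but were not counted by OLSR.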

Joe AE6XE
 

K6CCC
Where is tcpdump?

Where do we find tcpdump so we can install it?  I found the tcpdump website, but there is no indication how to install it.  And I would assume that we need an "AREDN" version, so I doubt that is where to get it anyway.
Yes, once I have the file, I know how to upload a package to an AREDN node (I do it every time I do firmware updates)...
 

nc8q
Where do we find tcpdump so we can install it?

https://arednmesh.readthedocs.io/en/latest/arednGettingStarted/advanced_...
Scroll to "Package Management".

On a node with internet access; select:
Setup -> Administration -> Download Package -> Select Package (from drop-down menu)
tcpdump-mini 4.9.2-1

Hope this helps, Chuck

K6CCC
Learn something new every day!

I had never figured out what "Download Package" was for. I had assumed it was something else, and that did not make sense. Now I know, but I could not install it on the Rocket M5 that I wanted to because of insufficient memory. I will get it onto my hAP at home tonight.

 

kg6wxc
Not sure what it is, but it's gone for now.

I would not think it's my map. The issue with the older FW and my mapping scripts was fixed a long time ago.
I just stopped polling the nodes the way I was; now it's all HTTP when going over the mesh.
The amount of data coming back from a remote node to one of my mappers is small. How big is the sysinfo.json output? That's all that is polled for, and even that is only once per hour.

Anyways... Over the weekend I was talking to a friend of mine up in Santa Barbara and he does a lot of work with some pretty long range 11GHz links, 40-60 miles.
I asked him if over the last 2 weeks or so he was seeing this same thing and his answer was "Yes, Like you would not believe".

So I still don't know. I agree with Joe's post #7 in this thread, that if it happens again we should start turning off some of the "distributed" services and see if it makes a difference... I kind of think it won't, but it's worth a shot. :)
Also, if we're going to think about the "odd devices" running olsr, what about something/someone trying to come "in" from a mesh gateway causing it too? Who knows at this point, there are several "ways in" too.

If we see it again around here, I'll try to capture some packets and see what's in them.
In Ventura here, we (and I) have changed nothing and it's all gone back to normal.

*edit* Sorry to repeat what Orv already said, and you're right Joe, I completely forgot to look at the link mod rate table to see what it was doing during all this. Very good point!
