
Potosi M2Rocket still off-line

AA7AU

Starting a new thread to deal with on-going problem. For details, please see last few posts in this thread:
https://www.arednmesh.org/content/bullet-m2hp-xw-firmware

As of Sunday 20-Jan-2019, the Rocket M2 (with 120° sector) on Mt Potosi in the SW corner of the LV area is still unavailable. This is a very important node here (in the short to interim term) and this problem has crippled many new node users who were relying upon linking thru Potosi.

Basically we can "see" the node in WiFi Scan at good signal strength and with the proper SSID etc. It just seems like it won't handshake with other nodes which used to hit it fairly easily. This removes any IP-based options for trouble-shooting.

It almost looks like it flipped into AP mode instead of Mesh mode. Physical access to this node is currently problematic, and I don't have full details yet.

I have two questions now so as to hopefully move forward on this:

1- is there any way to use that MAC address to connect over RF to the node if it's in Mesh operation?

2- if we wanted to try to connect to the node in AP mode, how would we configure one of our operational nodes to contact it (on -2/10), and then could we somehow remotely reboot that node back into proper operation?

TIA,
- Don - AA7AU

AA7AU
Still shows up in Real-Time SNR charts

The SNR charts must be MAC based as Potosi still shows up with a consistent reading for real-time SNR - so that can't be IP-based.

Just adding another data point - still need HELP!

- Don - AA7AU

AE6XE
If you see this node on a 10MHz channel width, then it can't be in AP mode -- no settings were ever defined that would put it in that state.   There have been occurrences of moisture shorting out the cat5 wiring, which can put a node into firstboot state (same as pressing the remote reset button on a UBNT power brick for ~15 seconds).   In that case the node would be in firstboot, but its AP would be on a standard 20MHz channel.

On a node receiving the Potosi signal, please grab a support download file.   That data includes the output of a command, "iw dev wlan0 station dump", which shows whether the node is connected over the 802.11 adhoc network.   If Potosi is listed in this output, that confirms it is still in mesh mode.   If it has an 802.11n adhoc connection, then we'd be looking at the next level, where OLSR exchanges IP addresses and hostnames.  This gets a bit more technical, but on your local node, install the tcpdump package and from the command line run "tcpdump -i wlan0 port 698" and look to see if any data is coming from the Potosi node.  If not, then OLSR is not functioning at Potosi.
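
The first check can be scripted. A minimal sketch, assuming a POSIX shell on the receiving node; the function name `station_listed` and the `PEER_MAC` variable are hypothetical names for illustration, not part of AREDN:

```shell
# station_listed MAC
# Reads `iw dev wlan0 station dump` output on stdin and succeeds (exit 0)
# if a "Station <MAC>" header line for that MAC is present.
# The match is case-insensitive, so upper- or lowercase MACs both work.
station_listed() {
    grep -qi "^Station $1 "
}

# Example use on the node (commented out so the block has no side effects;
# PEER_MAC is a placeholder for the MAC reported by WiFi Scan):
#   iw dev wlan0 station dump | station_listed "$PEER_MAC" \
#       && echo "802.11 adhoc link present" \
#       || echo "peer not in station list"
```

A hit only shows that the 802.11 layer is linked; it says nothing about whether OLSR is running on the far end.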

Joe AE6XE
AA7AU
Potosi is the first entry in that list

Potosi's MAC is DC:9F:DB:36:81:99 - still shows up in WiFi scans. Here's your data:

root@W7HEN-HARC-M2R90-TDY:~# iw dev wlan0 station dump
Station dc:9f:db:36:81:99 (on wlan0)
        inactive time:  350 ms
        rx bytes:       2632078556
        rx packets:     12549391
        tx bytes:       821955257
        tx packets:     6158694
        tx retries:     4135115
        tx failed:      1683
        rx drop misc:   1136849
        signal:         -82 [-85, -85] dBm
        signal avg:     -82 [-84, -87] dBm
        tx bitrate:     19.5 MBit/s MCS 2
        rx bitrate:     39.0 MBit/s MCS 10
        expected throughput:    13.366Mbps
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       long
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    0
        beacon interval:100
        connected time: 1500991 seconds

What's next?

Potosi remains unresponsive on the mesh IP-layer,
- Don - AA7AU

edited to add: this data is from a node which has NOT rebooted since before Potosi went missing.
 

AE6XE
This says that there is an 802.11 adhoc connection between the Potosi node and this node.   It looks like about a 17 dB SNR on the received signal.   The Potosi node is live and making a wireless link.   Next step is to run the tcpdump command to see if OLSR is up and sending out hello packets.  I'd suspect there are none, and thus no traffic can be exchanged, as there is no routing information to carry IP traffic.

You'll need to locally sync with owners of the node to gain access to further investigate.  Don KE6BXT is at Quartzsite, not sure about Frank to gain access.

Joe AE6XE
AA7AU
TCPDUMP doesn't find Potosi

OK, did the tcpdump now:       tcpdump -i wlan0 port 698
just cycles thru a few known nodes except not sure about this one;
22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: OLSRv4, seq 0xfde8, length 60
but no entries for Potosi and its old IP#
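
To check the capture for one specific node rather than reading it by eye, the output can be filtered by source address. A rough sketch, assuming tcpdump's one-line text format as in the line above; `olsr_seen_from` and `NODE_IP` are hypothetical names, and the Potosi address is not given in this thread:

```shell
# olsr_seen_from IP
# Reads tcpdump text output on stdin and succeeds (exit 0) if any line
# shows a packet sent from the given IPv4 address, source port 698 (OLSR).
# grep -F treats the dots in the address literally.
olsr_seen_from() {
    grep -qF "IP $1.698 > "
}

# Example use on the node (commented out). -n keeps addresses and ports
# numeric, -c 50 stops the capture after 50 packets:
#   tcpdump -n -i wlan0 -c 50 port 698 2>/dev/null | olsr_seen_from "$NODE_IP" \
#       && echo "OLSR traffic seen" \
#       || echo "no OLSR traffic from that address"
```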

Looks like it's not talking .... but it still shows up in WiFi scan this eve.

Frank responded to my earlier email this evening: "... no remote resets available. Physical hill top access is not likely any time soon."

Is there anything else we can try remotely?

Thanks,
- Don - AA7AU

AE6XE
Not much that can be done at this point.   The node has a watchdog reset feature if olsr stops responding on the node, so how it got into this state is unexplained.  I'd want a support data download which can be obtained from a laptop on the LAN of the node at the site before rebooting it. 

You might try this command to see if any traffic is coming out, "tcpdump -i wlan0 ether host dc:9f:db:36:81:99".

Joe AE6XE 
AA7AU
Nada

Thanks, Joe!

Ran  the "tcpdump -i wlan0 ether host dc:9f:db:36:81:99" and got *nothing* for three minutes of waiting:
root@W7HEN-HARC-M2R90-TDY:~# tcpdump -i wlan0 ether host dc:9f:db:36:81:99
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
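
For an unattended version of this check, the capture can be time-limited and its closing summary parsed. A sketch only: `packets_captured` is a hypothetical helper, `PEER_MAC` is a placeholder, and it assumes a `timeout` command with this syntax is available on the node, which may not hold for every firmware build:

```shell
# packets_captured
# Reads tcpdump's closing summary on stdin and prints the number from
# the "N packets captured" line.
packets_captured() {
    awk '/packets captured/ { print $1 }'
}

# Example use on the node (commented out). tcpdump writes its summary to
# stderr, so stderr is sent down the pipe and the packet lines are dropped:
#   timeout 180 tcpdump -i wlan0 ether host "$PEER_MAC" 2>&1 >/dev/null | packets_captured
```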

Understand you need a support data file from the node using LAN to try to figure this one out. I hope we can get that for you. However, I have no control over who will ultimately end up at the site and perform the power cycle. I will inform Frank (no reply yet to my last email) that this data capture needs to be done to help AREDN going forward.

I think that this Potosi failure certainly makes it real clear to all that a hub-and-spoke network design is a very poor choice - when the central focus stumbles and fails ... and even more so when that central point is only accessible at certain times of the year and then with difficulty. I have five new mesh users here in the HOA-dominated west side of Henderson, all pointed at Potosi (with no other current alternative) who are now deaf!  --sigh--

- Don - AA7AU

K6AH
Hub and Spoke Not Always a Bad Choice

Don, central site nodes are maintainable if you have alternate ways into them.  Most all such sites in the SoCal network have access through a separate channel and usually on a different band.  In addition, I would never place a node at a hard-to-get-to site without having a managed PoE switch that you can turn power off/on to each node.  It's all in designing to a set of requirements which must include maintainability. 

Andre, K6AH
 

AA7AU
Apologies for the overly broad comment.

Thanks Andre. Sorry for the overly broad comment. You are absolutely right and AREDN has good guidance on how to properly design networking.

The fellow in charge of this site now writes that they intend to implement a power-cycle type control over another RF access. But for now that node is unavailable; hopefully we'll get that data capture before power-cycle. But, I'm somewhat out of the loop on that.

Up in Idaho, our shoe-string budget sometimes precludes common-sense things like remote control over POE. But we're working on it. Luckily we have mostly a true interconnected mesh up there and the mountain top mesh node is not central to continued operations.

Thanks for all you do,
- Don - AA7AU

K7FYI
Different Node; Same Symptoms
Joe:  
Interestingly, a different node here in Vegas seems to be exhibiting this same behavior as of around 2:45 AM this morning.  Like Potosi, I see signal strength indicated and its MAC address shows up in a WiFi scan - "????" is shown in the host name.  This node is DTD'd on the mountaintop; coming in through that link doesn't work either.  Thankfully, this one is easier to physically access.

This is a Rocket M5 XW that has been running a beta build (it's what was available when we installed, and we didn't want to risk an OTA upgrade).  It has been running for ~2 months; probably 30+ days since a reboot.

Any predictions on what it will take to get it running?

root@K7FYI-MKTK5HP-SE:~# iw dev wlan0 station dump
Station fc:ec:da:66:bc:37 (on wlan0)
        inactive time:  60 ms
        rx bytes:       769707
        rx packets:     12411
        tx bytes:       0
        tx packets:     0
        tx retries:     0
        tx failed:      0
        rx drop misc:   0
        signal:         -69 [-69, -83] dBm
        signal avg:     -68 [-68, -82] dBm
        tx bitrate:     6.0 MBit/s
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       long
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    0
        beacon interval:100
        short slot time:yes
        connected time: 2734 seconds

Rick
K7FYI
AE6XE
Rick,  at the end of January, a watchdog fix went into the Nightly Build to reset OLSR if it froze up while the process was still running.  This must be a slightly older firmware version   ...or a different root cause.   We really need to capture a support data download (locally from the LAN of the node) before the nodes are rebooted, to confirm the state and what the issue is.

Has the snow melted yet?  I hear people will be skiing in Mammoth until July this year.
 
K7FYI
Thanks, Joe - will do.  We're trying to arrange a trip up today or tomorrow.  I do think the firmware is older than the OLSR fix.  In retrospect, I should have risked the OTA upgrade.

Thankfully, this one is only ~3,400' so no snow there.  The higher peaks around town...  still plenty of white!
K7FYI
Joe:  Good news, bad news and good news:

Yesterday, on a trip to the mountain, we got the troublesome node back up after a reboot.  Unfortunately, I couldn't get into it on the LAN side to get the support data file (...turns out, I had the wrong IP address).  While at the site, I upgraded the firmware to 3.19.3.0.  Everything was running fine when we left the mountain; when I checked on it from home, I found that it had stopped responding ~30 min earlier.  The symptoms seemed to be the same.

Today, we traveled back to the mountain and this time I was able to get the support data file (attached).  We now have the PoEs for both mesh nodes at this site on a remotely accessible power switch on another network so we can cycle them without traveling to the site.

Maybe this is a red herring, but here is our only thought on what's related to the sudden crashes of this node:  A local user on a 31 mile path has been connecting to this node with a Ubiquiti AirGrid.  The AirGrid is running 3.19.3.0 firmware and I previously used it at my house (25 mile path) to connect to the same node without problems.  We've noticed that when that device connects now, the node seems less responsive to other users.  I don't have any firm metrics on the connection, but know it is marginal at best.  Out of an abundance of caution, he powered it off this morning and it remains off.

Hopefully the support data is helpful!

Rick
K7FYI
AE6XE
Rick,  this is good data and does show an issue.   I don't believe the marginal AirGrid link is a root cause, although it could be exacerbating the failure.  Any marginal link, particularly if traffic is trying to get across it, will tie up the channel for all the other traffic.  With extra handshaking to try and get the traffic across, latency gets worse and VOIP traffic can start losing bits, if bad enough.   

What I found in the data download was the suspected situation where OLSR is misbehaving -- it's running, but not producing the hostname or routing information fundamental for the node to operate.  There is a 'watchdog' script that restarts OLSR, but the watchdog script is not running to perform this step.  It's not obvious why this olsr-watchdog script is not running, so I need to dig around a bit to figure out why.
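
The general shape of such a watchdog can be sketched as below. This is not the AREDN olsr-watchdog script, whose contents are not shown in this thread; the heartbeat idea, the function name `heartbeat_stale`, the file path, the restart command and the 300-second limit are all assumptions for illustration:

```shell
# heartbeat_stale LAST NOW MAX_AGE
# All three arguments are integers in seconds. Succeeds (exit 0) when the
# last heartbeat is older than MAX_AGE, i.e. the daemon looks frozen.
heartbeat_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Example check, run periodically (commented out; the heartbeat file path
# and the restart command are assumptions, not taken from AREDN):
#   last=$(cat /tmp/olsrd.heartbeat 2>/dev/null || echo 0)
#   if heartbeat_stale "$last" "$(date +%s)" 300; then
#       /etc/init.d/olsrd restart
#   fi
```

A daemon that is running but producing nothing would still pass a plain "is the process alive" test, which is why a check of this kind looks at recent output instead. It also only helps if the watchdog itself is running.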

Meanwhile, what is unique about the site(s)?   This is the only reported instance, so we're seeing the planets only aligning for your group for some reason.   Is there any extra heavy use of:

1) repeatedly refreshing the mesh status on these nodes?
2) continually polling the node over the mesh with sysinfo or other map generating scripts?
3) Is the marginal link continually going up/down, or is it always established and just marginal?
4) other?

Let's connect up separately.  KE6BXT previously set up a tunnel to connect in, but I haven't set up my client yet.   I'll work to access tomorrow night.

Joe AE6XE

 
K7FYI
Excellent.  I'm glad the data is helpful.

I don't "think" there is anything unique about the site or its use, and it's behaved flawlessly for about 45 days; then these two crashes.  I do have a habit of setting mesh status to "Auto", but I'm normally viewing my QTH node in my browser (and I leave the browser window up 24x7). 

Two nodes with strong connections are always present; one node running an omni comes and goes (very low LQ; high NLQ, as viewed from the mountaintop site) and I'm not sure how well the AirGrid site was connected in most recently (I was out of town until Friday night).

Iperfspeed is the only service installed; my sense is that it's used very, very infrequently.

What I find interesting is that the Potosi node appears to have the same problem...  different hardware (M2 XM, IIRC).  It seems odd that something so rare affects two separate mountaintop nodes here in Vegas.

Let me know what you think when you've tunneled in and feel free to contact me offline.  Appreciate the help!

Rick
K7FYI
ke6bxt
QST QST QST KE6BXT-N7ZEV-Potosi-M2R-54-129-153 is back on the air.
AA7AU
May 8th

Frank was right - it did take until May. Thanks Frank!

- Don - AA7AU
