Occasional DNS resolution/broadcast failures (?) in new release.

w6bi
We're seeing occasional DNS broadcast failures (IP address showing instead of hostname) on DtD links of perhaps a half dozen nodes.  Sometimes they'll go away after several hours or several days, sometimes they persist.  (See attached screenshot).
All of the affected nodes are running the latest production release code.  They're all high-level nodes, all Rockets (which may not be relevant).
Any ideas?

Orv W6BI
w6bi
Screenshot
Why can't I add screenshots to the initial post? Anyway, here it is.
K6AH
That appears to be a bug.  I have the same issue here appearing in Current Neighbors.

K6AH
I think I figured out the problem.  The node it was happening on was a PBE-M5-400 with a gigabit Ethernet interface.  I ran the commands Joe, AE6XE, has given out and the problem is gone.  It must be some issue on boot-up when the eth0 interface is being negotiated that's hosing up the DNS.

Anyway, run the following:
opkg install http://downloads.openwrt.org/releases/packages-18.06/mips_24kc/base/ethtool_4.15-1_mips_24kc.ipk  
ethtool eth0
ethtool -s eth0 speed 100 duplex full autoneg off

To persist this change, also add the last command to /etc/rc.local.
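For anyone scripting this across several nodes, here is a minimal sketch of adding the line to /etc/rc.local only once. The helper name persist_cmd is made up for illustration; it assumes a POSIX shell (such as the BusyBox ash on the node) and that rc.local may end with a bare "exit 0" line, which the helper keeps as the last line (adding one if it was missing).

```shell
#!/bin/sh
# Hypothetical helper (not part of any AREDN or OpenWrt tooling):
# add a command line to an rc.local-style file exactly once.
persist_cmd() {
    file="$1"
    cmd="$2"
    # Nothing to do if the exact line is already present.
    if grep -qxF -e "$cmd" "$file" 2>/dev/null; then
        return 0
    fi
    tmp="$file.tmp.$$"
    # Copy every line except a bare "exit 0", append the command,
    # then put "exit 0" back as the final line.
    { grep -vx 'exit 0' "$file" 2>/dev/null; echo "$cmd"; echo 'exit 0'; } > "$tmp"
    # Write back through redirection so the original file mode is kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage on the node, with the command from the post above:
# persist_cmd /etc/rc.local 'ethtool -s eth0 speed 100 duplex full autoneg off'
```

Running it a second time leaves the file unchanged, so it should be safe to repeat.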
K6AH
I have another idea...
Looking at your nodes exhibiting the problem...
  • WD6EBY-LA-MtWilson-SW-Sector-5G
  • WD6EBY-LA-Verdugo-S-Sector-5G
Both have really long host names. Perhaps move the salient info to the comment field and reduce the character count of the node name.
w6bi
Debugging

Andre, Rockets don't have Gigabit interfaces, so I don't think that was the issue. But the nodes in question seem to have node names exactly 32 characters long, which is exactly half of the 64-character maximum for a node name. How long are your problematic node names?
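A quick way to count name lengths rather than eyeballing them, as a minimal sketch. The function names are made up for illustration, and MAX_NAME_LEN is simply the 64-character figure quoted above, not something checked against the firmware. Names typed into a forum post may differ from what is configured, so it is worth running this against the name reported by the node itself.

```shell
#!/bin/sh
# Limit as quoted in this thread; an assumption, not verified here.
MAX_NAME_LEN=64

# Print the number of characters in one node name.
name_length() {
    printf '%s\n' "${#1}"
}

# Print "<length> <name>" for each name given, flagging any over the limit.
report_names() {
    for name in "$@"; do
        len=$(name_length "$name")
        if [ "$len" -gt "$MAX_NAME_LEN" ]; then
            printf '%s %s (over %s)\n' "$len" "$name" "$MAX_NAME_LEN"
        else
            printf '%s %s\n' "$len" "$name"
        fi
    done
}

# Usage on a node:
# report_names "$(uname -n)"
```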

W2TTT
DNS Issue

Orv and Andre,
While navigating through our network over RF-only paths, through nodes running several older and newer code versions, we ran into this DNS issue: we couldn't get to an immediate neighbor whose node name was listed on a mesh status screen. To solve the node navigation issue, we used the older-code nodes to look up the IP addresses in their OLSR module screens. We were then able to get to the desired node.
All our node names are short, and this issue is being seen on nodes with older code, so I wouldn't jump on the current 3.18.9 code as the source of this defect.

While the dropping of the OLSR GUI on port 1978 was helpful in what it freed in the way of resources, maybe we should consider a second type of mesh status screen with just IP addresses or one that combines them both? 

We also might want to make the periodicity of OLSR broadcast adjustable along with the number of hops.  A poor man's BGP  alternative could be set up between cooperative sets of nodes through administration and reduce this traffic among stable partners.

Thoughts?

73,
Gordon Beattie, W2TTT
201.314.6964
 

AE6XE
I had seen this "IP address showing for Neighbor Nodes in mesh status" previously, early on, but couldn't reproduce it. In the example I saw, I captured OLSR packets from the node whose IP address was showing on the neighbor. The OLSR packets were not including this node's hostname to communicate to the neighbor. I haven't been able to reproduce the problem since, so I chalked it up to the twilight zone, as no one had been seeing it for the last couple of months.

I suspect this is not a problem on the node displaying the IP address, but rather on the neighbor node, which is not sending its hostname for others to learn. This may be complicated if olsr continues to cache the host/IP of nodes; e.g. when renaming a node, the old name still hangs around for a while on nodes across the mesh.

You should be able to reach the node by using the IP address directly. If you have this scenario, install the "tcpdump-mini" package and capture data with this command (it assumes the path is over RF; if the path is over Cat5, change wlan0-1 to eth0.2):

tcpdump -i wlan0-1 -c 1000 -w /tmp/my-node-name.pcap  port 698

It may take several minutes to collect 1000 packets. Then send me this data file along with the support download.

Now reboot this node, and/or run "/etc/init.d/olsrd restart" and "/etc/init.d/dnsmasq restart". Did that make the symptoms go away?
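To avoid mistyping the interface, the capture command above can be assembled by a small helper. This is only a sketch: the function names and the "rf"/"dtd" labels are made up for illustration, while the interfaces, packet count, output path, and port are the ones given in this post. It prints the command instead of running it, so it can be checked first.

```shell
#!/bin/sh
# Map a link type to the capture interface named in the post:
# RF path -> wlan0-1, cabled (Cat5) path -> eth0.2.
capture_iface() {
    case "$1" in
        rf)  echo wlan0-1 ;;
        dtd) echo eth0.2 ;;
        *)   echo "unknown link type: $1" >&2; return 1 ;;
    esac
}

# Print the tcpdump command for a link type and a node name.
capture_cmd() {
    iface=$(capture_iface "$1") || return 1
    echo "tcpdump -i $iface -c 1000 -w /tmp/$2.pcap port 698"
}

# Usage on the node (requires the tcpdump-mini package):
# capture_cmd rf my-node-name          # inspect the command
# sh -c "$(capture_cmd rf my-node-name)"   # then run it
```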

Joe AE6XE
w6bi
PCAP file.
PCAP file was emailed, as it was too large to attach.
Orv W6BI
w6bi
Different issue?
When installing the tcpdump package, I got this message on two different nodes. What's the significance of the last line?
Installing tcpdump-mini (4.9.2-1) to root...
Downloading http://downloads.arednmesh.org/releases/3/18/3.18.9.0/packages/mips_24kc/base/tcpdump-mini_4.9.2-1_mips_24kc.ipk
Installing libpcap (1.8.1-1) to root...
Downloading http://downloads.arednmesh.org/releases/3/18/3.18.9.0/packages/mips_24kc/base/libpcap_1.8.1-1_mips_24kc.ipk
Configuring libpcap.
Configuring tcpdump-mini.
Failed to restart all services, please reboot this node.
AE6XE
Orv, as you may have observed, tcpdump works immediately without a reboot. This is a 3rd-party package, and I haven't dug into the details behind this message because there haven't been any repercussions, so I'm ignoring it.
w6bi
Service restarts
When trying to restart services on a suspect node per your suggestion Joe, I got this series of responses:

root@WD6EBY-LA-MtWilson-SE:~# cd /etc/init.d
root@WD6EBY-LA-MtWilson-SE:/etc/init.d# ./olsrd restart
packet_write_wait: Connection to 10.176.140.239 port 2222: Broken pipe
[obeach@jethro_house temp]$ ssh -p 2222 root@10.176.140.239
ssh: connect to host 10.176.140.239 port 2222: Connection refused

<reconnected>

root@WD6EBY-LA-MtWilson-SE:~# cd /etc/init.d
root@WD6EBY-LA-MtWilson-SE:/etc/init.d# ./dnsmasq restart
/etc/rc.common: line 1: can't create /tmp/hosts/dhcp: nonexistent directory
/etc/rc.common: line 1: can't create /tmp/hosts/dhcp: nonexistent directory
udhcpc: started, v1.28.3
udhcpc: sending discover
udhcpc: no lease, failing
root@WD6EBY-LA-MtWilson-SE:/etc/init.d#

Orv W6BI
AE6XE
Orv, that looks right. Routing to the node was broken when olsr went down, so it takes a minute or two for olsr to come back up before you can communicate with it again; that is normal behavior. dnsmasq has some start-up warnings that are normally not visible and can be ignored. It's another item on the long list of things to do, if I can find a few more minutes in the day.
AE6XE
How to recover -- to get hostnames to display
Orv and I have confirmed that OLSR is at times (1%?) not communicating the hostname for a node's primary IP. The neighbor nodes will then display the IP address only. Restarting olsrd resolves the issue, and a node reboot is also expected to resolve it. (There is nothing to do on the nodes where you see the IP address; take this action on the node that has this IP address.)
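As a rough sketch of that recovery step: given a list of neighbor entries, one per line, pick out the ones that are bare IPv4 addresses and print the restart command to run against each. The function names are made up for illustration, and the list of entries is assumed to be collected by hand from the mesh status page. SSH port 2222 is the one used in Orv's session earlier in the thread. Expect the SSH session to drop when olsrd restarts, as it did there.

```shell
#!/bin/sh
# Return success if the argument looks like a dotted-quad IPv4 address.
is_ipv4() {
    case "$1" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    # Split on dots into positional parameters.
    old_ifs=$IFS; IFS=.; set -- $1; IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for field in "$@"; do
        [ "${#field}" -le 3 ] || return 1
        [ "$field" -le 255 ] || return 1
    done
    return 0
}

# Read neighbor entries on stdin; for each one that is an IP address
# rather than a hostname, print the command that restarts olsrd there.
suggest_restarts() {
    while read -r entry; do
        if is_ipv4 "$entry"; then
            echo "ssh -p 2222 root@$entry /etc/init.d/olsrd restart"
        fi
    done
}

# Usage:
# printf '%s\n' K6AH-node 10.176.140.239 | suggest_restarts
```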

We'll have to dig into OLSR to find out why it is not performing its function, and why this is infrequent or not easily repeatable.     If anyone finds repeatable steps and/or particular configuration to reproduce, please post.   Half the battle sometimes is reproducing an issue.

Joe AE6XE 
