While investigating the transient routing loops (and accompanying packet losses) that I've seen in our (San Diego) network, I came across this essay from the German group that wrote much of the existing code in olsrd, the Optimized Link State Routing daemon that is currently running in both AREDN and Broadband Hamnet nodes.
https://www.open-mesh.org/projects/open-mesh/wiki/The-olsr-story
It's easy to criticize someone else's work. And I don't have an opinion about their newer BATMAN protocol because I haven't studied it yet. But when somebody invests a lot of effort in something only to say that their own work is fundamentally flawed and should be junked in favor of something else, that tends to carry a lot of weight with me.
Note that the document you reference doesn't represent the last 10 years of OLSR development and improvements. It's quite old and appears to date from the 2006 split that created BATMAN. It's nice to see that these groups are healthily competing to evolve the state of the art: http://battlemesh.org
Conrad and I are following along on the OLSRv2 developer list. They still have a way to go for consideration. What would you recommend is a better path than to fall into OLSRv2 when it is released?
Joe AE6XE
The "something else" is the B.A.T.M.A.N. (Better Approach To Mobile Adhoc Networking) protocol I mentioned. I noticed it in the Linux kernel quite some time ago so I know it's not brand new.
I don't have an opinion on whether it is better or worse than olsrd because, as I said, I haven't dug into it yet. I just thought it interesting that we're encountering some of the same problems (specifically routing loops) that induced the writers of that essay to abandon olsrd entirely.
It did occur to me a while ago that synchronized clocks might help minimize routing loops by allowing nodes to update their routing tables simultaneously. It probably wouldn't eliminate loops entirely because, as long as routing information can be delayed or lost in transit, nodes aren't guaranteed a consistent view of the network from which to build their routing tables. But the only way to know is to try things out; routing algorithm behavior can be amazingly counter-intuitive.
Moderation comment:
https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance
https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II
I'm curious: what is currently in the AREDN firmware that deals with loop issues?
Thanks
AJ6BL
A) "Bridge" loop avoidance -- when nodes are configured as bridges such that a broadcast packet is propagated from node to node
B) "Transient Route Change" avoidance -- when there is a change of routing with time delay for all nodes in the mesh to receive updated information
These BATMAN documents are about 'A'. The AREDN mesh nodes, using OLSR, aren't set up as this kind of bridge and don't pass broadcast packets through. Instead, each node is a layer 3 (IP) router: a packet must carry the IP address of its final destination, and the mesh node uses its routing table to identify the neighbor to send the traffic to. If the nodes propagated broadcast packets, that traffic would flood the RF networks everywhere and, with more and more nodes, would quickly make the network unusable.
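To make the layer 3 point concrete, here is a minimal Python sketch of how a router picks a neighbor from a routing table and declines to forward broadcasts. The addresses, table contents, and the `forward` helper are made up for illustration; this is not how olsrd or the AREDN firmware is actually implemented.

```python
import ipaddress

# Hypothetical routing table: destination network -> next-hop neighbor.
ROUTES = {
    ipaddress.ip_network("10.1.0.0/16"): "10.0.0.2",
    ipaddress.ip_network("10.1.2.0/24"): "10.0.0.3",
}
# Hypothetical subnet this node sits on.
LOCAL_NET = ipaddress.ip_network("10.0.0.0/24")
LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def forward(dest):
    """Return the neighbor to hand the packet to, or None if it is not forwarded."""
    addr = ipaddress.ip_address(dest)
    # A router does not propagate broadcast packets to other networks.
    if addr == LIMITED_BROADCAST or addr == LOCAL_NET.broadcast_address:
        return None
    # Longest-prefix match: the most specific matching route wins.
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


specific = forward("10.1.2.7")          # matches both routes, /24 wins
general = forward("10.1.9.9")           # matches only the /16
broadcast = forward("255.255.255.255")  # not forwarded
```

A bridge, by contrast, would copy the broadcast out of its other ports, which is the case the BATMAN bridge loop avoidance documents address.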
AREDN nodes are susceptible to 'B', which is inherent in designs that are based on "Link State". The physics are such that the 2 nodes on each end of a direct RF link need time to communicate information about this link to everyone else. If, e.g., conditions suddenly change and the link degrades, other nodes receive this information after different delays, which has the potential to route IP traffic in a loop for a few seconds until every node has received the update. For this condition to occur, there have to be alternative multi-path options with similar ETX values.
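A toy example of how a stale view produces a loop, as a Python sketch. The three-node topology and the link costs (standing in for ETX, where higher is worse) are invented for illustration, and the shortest-path code is plain Dijkstra rather than anything taken from olsrd.

```python
import heapq


def next_hop(source, dest, costs):
    """Dijkstra over an undirected cost map {(u, v): cost}.

    Returns the first hop on the cheapest path from source to dest,
    or None if dest is unreachable.
    """
    graph = {}
    for (u, v), c in costs.items():
        graph.setdefault(u, []).append((v, c))
        graph.setdefault(v, []).append((u, c))
    best = {source: 0.0}
    heap = [(0.0, source, None)]  # (cost so far, node, first hop used)
    while heap:
        cost, node, first = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        if node == dest:
            return first
        for nbr, c in graph.get(node, []):
            new_cost = cost + c
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(heap, (new_cost, nbr, nbr if first is None else first))
    return None


def trace(src, dest, views, max_hops=6):
    """Follow a packet hop by hop; each node routes using its OWN view."""
    path = [src]
    node = src
    while node != dest and len(path) <= max_hops:
        node = next_hop(node, dest, views[node])
        path.append(node)
    return path


# Hypothetical link costs before and after the A-D link degrades.
before = {("A", "B"): 1, ("A", "D"): 1, ("B", "D"): 3}
after = {("A", "B"): 1, ("A", "D"): 5, ("B", "D"): 3}

# A sits on the degraded link and knows at once; B has not heard yet.
stale = trace("A", "D", {"A": after, "B": before, "D": after})
# Once B receives the update, the loop disappears.
converged = trace("A", "D", {"A": after, "B": after, "D": after})
```

With the stale view, A hands the packet to B (1 + 3 beats 5) while B still believes A's direct link costs 1 and hands it straight back, so the packet bounces between A and B until the hop limit. After B updates, the path is A, B, D.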
BATMAN does appear to be a better option for scaling. However, the OLSR developers have been rewriting their approach as OLSRv2, so there is potential for these competing groups to leapfrog one another. AREDN uses OLSRv1 today. If we start to see scaling issues with OLSRv1, we'd be highly motivated to make a decision and jump to BATMAN or OLSRv2. No one is eager to make this jump because the old and new protocols are not compatible. It would be challenging for groups to go through a migration that upgrades their entire network all at once.
Joe AE6XE
That helps clear things up a bit for me. I am doing some research this week into some looping issues we are experiencing here in the LA/Ventura county area and wasn't clear about the protocols we are currently using.
Also, according to the BATMAN docs there is a way to integrate with non-BATMAN devices and still take advantage of the benefits, but you have to set up an interface for them and bridge it to the BATMAN interface. I don't know if this would be a possible workaround for integrating non-BATMAN nodes prior to upgrading everyone (if we go this route), but I thought I should mention it. Maybe create a 2nd SSID for the BATMAN nodes and keep the existing AREDN-v3 SSID for "legacy" nodes? Just a thought. I'm not an expert in mesh, but I am trying to learn as much as I can, so please bear with me here. Check out the link below for more info. It does say "a couple computers", so this idea may not work after all.
https://www.open-mesh.org/projects/batman-adv/wiki/Quick-start-guide
Lastly, what is the optimal setup for a switch that has more than one node connected to it, in proximity to each other? Is channel separation important (since they are tied together with the switch)? Should we be enabling RSTP or anything like that? I know you said broadcast traffic isn't able to flood the network, so STP shouldn't benefit anything. Any other suggestions to avoid looping issues? This is more of a question about the bigger sector antennas and backbone nodes we have up on the peaks, not really the smaller nodes, but any advice would be appreciated.
I'm going to be reading up on OLSR now, so I may have some more questions later.
Thanks again Joe,
AJ6BL
Another scenario: consider two tower sites that can hear one another on the same channel, possibly 20+ miles apart. Each site has lots of clients. Only one site can transmit at a time, blocking out all other traffic, and while it does, the other tower site can't hear its own local clients, etc.
Look for parallel multi-hop paths between two nodes with similar ETX. This is where you will find transient route loops and flapping. One path gets loaded down with a video stream, its ETX gets worse, and OLSR selects the other path. That path then gets loaded down with the traffic, its ETX gets worse, and the cycle repeats. The symptoms are that everything seems to be working fine, but then the connection times out, drops out, or can't keep up because of lost data. Too many links or coverage nodes on the same RF channel will just give poor performance most of the time, which is a different symptom.
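The flapping cycle can be sketched in a few lines of Python. This is a deliberately crude model under stated assumptions: a higher ETX means a worse path, carrying the stream adds a fixed made-up penalty to a path's ETX, and the router simply picks the lowest ETX each round. The path names and numbers are hypothetical, not measurements from any real network.

```python
def simulate(base_etx, load_penalty, rounds=6):
    """Return the path chosen in each round when traffic follows the lowest ETX."""
    chosen = min(base_etx, key=base_etx.get)
    history = []
    for _ in range(rounds):
        history.append(chosen)
        # The path carrying the stream looks worse (higher ETX) while loaded;
        # the idle path goes back to its base value.
        observed = {
            path: etx + (load_penalty if path == chosen else 0.0)
            for path, etx in base_etx.items()
        }
        chosen = min(observed, key=observed.get)
    return history


# Two paths with similar ETX: the load penalty is enough to swap their order.
similar = simulate({"path1": 2.0, "path2": 2.2}, load_penalty=1.0)
# Two paths with clearly different ETX: the loaded path still wins.
distinct = simulate({"path1": 2.0, "path2": 4.0}, load_penalty=1.0)
```

In this model `similar` alternates between the two paths every round, while `distinct` stays on `path1` throughout, which matches the advice to look for parallel paths whose ETX values are close together.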
Joe AE6XE