You are here

Route flapping and looping in the SD network

21 posts / 0 new
Last post
KA9Q
Route flapping and looping in the SD network

I'm exercising part of the San Diego network by updating the software on a remote Raspberry Pi.

The data path looks like this, starting with marge.ka9q.net:

via Ethernet to ka9q-north (in University City);

via radio to w6qar-south (in Sorrento Mesa);

via ethernet to w6qar-e or w6qar-n (see below);

via radio to WL7AG-PQ (in Rancho Penasquitos);

via ethernet to WL7AQ-PQ-NE;

via radio to KG6EQU-MCHS (also in Rancho Penasquitos);

via ethernet to KG6EQU-MCHS-PI.

Throughput reaches a steady state between 300-450 kb/s or so, but every few minutes (and sometimes as often as 30 seconds), throughput drops way down and maggie receives a flurry of ICMP Time Exceeded messages from either w6qar-e or w6qar-n, sometimes from both. Eventually, the path recovers and TCP slow start ramps back up.

Traceroutes will change as well. Here's a healthy one:

1  192.168.42.1 (192.168.42.1)  1.856 ms  0.940 ms  0.839 ms
 2  dtdlink.ka9q-north.local.mesh (10.19.73.183)  1.260 ms  1.205 ms  1.195 ms
 3  w6qar-wts.local.mesh (10.242.28.171)  3.204 ms  10.888 ms  8.751 ms
 4  mid1.w6qar-wte.local.mesh (10.3.17.29)  5.297 ms  7.507 ms  4.871 ms
 5  wl7ag-pq.local.mesh (10.248.87.64)  7.212 ms  20.966 ms  27.897 ms
 6  mid1.wl7ag-pq-ne.local.mesh (10.253.178.62)  24.672 ms  21.919 ms  17.763 ms
 7  kg6equ-mchs.local.mesh (10.18.74.69)  51.919 ms  33.347 ms  22.061 ms
 8  kg6equ-mchs-pi.local.mesh (10.146.82.42)  91.296 ms  52.530 ms  188.385 ms

And here's an unhealthy one:

1  192.168.42.1 (192.168.42.1)  1.270 ms  0.969 ms  0.899 ms
2  dtdlink.ka9q-north.local.mesh (10.19.73.183)  1.260 ms  1.281 ms  1.227 ms
3  w6qar-wts.local.mesh (10.242.28.171)  369.206 ms  137.997 ms  6.314 ms
4  dtdlink.w6qar-wtn.local.mesh (10.151.6.207)  31.033 ms  21.327 ms  93.373 ms
5  wl7ag-pq.local.mesh (10.248.87.64)  932.122 ms  544.169 ms  1019.233 ms
6  dtdlink.w6qar-wtn.local.mesh (10.151.6.207)  1256.483 ms  492.448 ms  69.649 ms

(finishes before reaching destination).

Note the difference in route beyond hop 3; w6qar-s is switching between w6qar-e and w6qar-n in reaching wl7ag-pq. It may be that wl7aq-pq is returning the packet to w6qar-n instead of forwarding to its destination, or it could be just that the route changed during the traceroute yielding an inconsistent result. This plus the flurry of TTL ICMP messages, shows frequent routing flaps with transient routing loops forming during each flap. Even if the links are unreliable, the routing protocol is supposed to avoid loops because they can be very wasteful as they bounce around until the TTL (which typically starts at 64) reaches zero.

Hopefully this can be fixed with adjustments to the OLSR parameters and is not a fundamental flaw in the routing algorithm itself. I don't know the inner details of OLSR yet but other routing algorithms I've seen usually have hysteresis settings designed specifically to avoid this sort of thing.

AE6XE
AE6XE's picture
KA9Q,

KA9Q,

I've seen similar issues.   I speculate that there are 2 paths on the mesh with very close ETX cost.     Loading one path down with traffic seems to cause the olsr ETX to increase just enough to trigger a route change to the other path.     We have a dual  RF link (2.4Ghz and 5Ghz) between sites (nodes DtDlinked on both ends)  with both bands at 100% LQ and NLQ.   The HD video stream goes in and out--not usable.    We reduced the power way down on one band to effectively take it out of the equation and video works fine now.

Joe AE6XE

KA9Q
I'm continuing to investigate

I'm continuing to investigate. With the path otherwise idle, I'm still seeing period route flaps so I don't think it's related to traffic.

If two paths are nearly equal in cost, there should be a hysteresis mechanism to keep them from flapping. But flapping isn't the real problem -- it's the looping that forms every time they do. That's much more damaging. Avoidance of loops is supposed to be one of the advantages of a link state routing algorithm.

It's been a while since I read up on the theory but I vaguely remember this may only be true when everything is properly timestamped. This could be a very good reason to put NTP into the nodes, huh?

KA9Q
OK, I think you're on to

OK, I think you're on to something. Looking at the mesh topology entries (as seen on my own node), the linkcost from W6QAR-WTE to WL7AG-PQ is (0.866/0.796) 1.449. The entry for W6QAR-WTN to WL7AG-PQ at the same time is (0.815/0.854) 1.434 -- almost the same.

The remaining links in the two paths don't change, so this must be the culprit.

It should not be necessary to sabotage one link to keep the algorithm stable. It ought to apply some hysteresis to the selection, i.e., it won't switch routes unless the new route is better than the old one by some minimum amount, and/or for some minimum time that you specify.

AE6XE
AE6XE's picture
The config settings of OLSR

The config settings of OLSR are from the early days of BBHN and as I recall, all defaults.  It is overdue to tune these settings.   Need to dig into the code, but I think the routing recalculation based on ETX is triggered if there is a change in topology, link drop/new, etc..   There may not be any hysteresis in the trigger to recalculate the route table on a node, probably otherwise triggered every 30 seconds if that is what you are seeing.  

We're using olsrd 0.6.7 package with some unrelated enhancements and 'etx_ffeth' LQ algorithm if you have interest and time to dig into better tuning for everyone's use.  This ubnt change in flash partitioning that can brick nodes as well as the 'deaf receive chain' issue blocking the 3.15.1.0 release is occupying the time for the moment.

Joe AE6XE

KA9Q
The problem is not that the

The problem is not that the route keeps changing; it's the transient routing loop that forms when it does, and the data packets that are dropped (and link capacity possibly wasted) as a result. So the fix does not have to involve avoiding equal-cost paths, or preventing route switching between them. Doing that just avoids the problem without really fixing it. This is supposed to be an ad-hoc network, which means it should do the best it can no matter what network topology you give it.

Although we don't see the actual path that each data packet takes, we can infer when a routing loop has formed by the blast of ICMP Destination Unreachable - TTL Exceeded packets that come back. These messages are generated when the number of hops a packet takes exceeds the initial value in the packet's Time To Live field. This does not normally happen except in a routing loop; the only exception is during traceroute when the initial TTL is deliberately set less than the hop count to identify intermediate routers. (There are no traceroutes here; all the data packets here were sent with initial TTL = 64).

I've attached a packet trace that shows these ICMP messages. It's a pcap file run through tar cjf to produce a tgz as required by this forum.

It was received at maggie.ka9q.net (10.44.77.5), one end of the TCP path with kg6equ-mchs-pi (10.146.82.42) at the other end. Only the ICMP messages are shown; I've filtered out the actual data packets that triggered them, but you can see their headers in the returned ICMP messages.

In case you can't see the other San Diego IP/DNS pairings:

10.3.17.29 is dtdlink.w6qar-wte.local.mesh (Qualcomm east)

10.151.6.207 is dtdlink.w6qar-wtn.local.mesh (Qualcomm north)

10.242.28.171 is w6qar-wts.local.mesh (Qualcomm south)

10.19.73.183 is dtdlink.ka9q-north.local.mesh (my node, pointed north at Qualcomm South)

A router generates an ICMP message with the source address set to the address of the interface over which the router received the offending packet. So when an ICMP message comes from dtdlink.(somebody), that means the offending packet arrived on that node's Ethernet interface. Note that maggie.ka9q.net natively speaks olsr over Ethernet VLAN 2 to ka9q-north, the Ubiquiti node running on my roof. Hence the two ICMP messages from dtdlink.ka9q-north.local.mesh.

As you can see, the first event at 686.574 seconds consists of a flurry of ICMP Time Exceeded messages from dtdlink.w6qar-wtn and dtdlink.w6qar-wte, two co-located Qualcomm nodes connected by local Ethernet. Each thinks that the other is the best route to the next hop in the path, wl7ag-pq. The loop appears between them over their Ethernet link, so at least the wasted bandwidth is on Ethernet, not the air. Whichever happens to have the packet when its TTL goes to 0 is the one that generates the ICMP message.

More curious is the event at 694.5 seconds when an ICMP message is returned from w6qar-wts. This is the Qualcomm radio interface that faces me, so the loop is apparently now over the air between Qualcomm South and somebody else (maybe me? Normally I am its only neighbor.) This is more distressing since it means that air channel capacity is being wasted. Note these are interspersed with ICMP messages from dtdlink.w6qar-wte, but not -wtn.

The two ICMP messages at 714.3 sec are destination unreachable - network unreachable messages, not TTL exceeded, so my own Ubiquiti node is reporting that the destination has completely disappeared from my routing table. But it returns thereafter, and things are fine for several minutes until we see several more episodes of routing loops over the Ethernet between Qualcomm North and Qualcomm East.

I will have to dig into the protocol, the implementation and the parameters to better understand what's going on. Link state routing algorithms are known to be less prone to loops than the simpler distance vector (Bellman-Ford) routing algorithms but they are not immune (as we see). I seem to recall that the key to avoiding loops in a link state algorithm is to ensure that all routing tables are based on a consistent picture of the network, which means making those changes simultaneously on up-to-date topology tables. It may be that part of the fix here is to use NTP to synchronize all the node clocks so that they can do this more easily.

As I discussed with Conrad last night, I am going to try to use SNMP to rapidly sample the routing tables of the nodes involved to catch them in the act of forming a routing loop. This should shed more light on the subject.

 

 

 

Support File Attachments: 
AE6XE
AE6XE's picture
Yes, a hysteresis on route

Yes, a hysteresis on route changes would minimize the occurrence of  route loops, but not resolve this root cause.   There is only a known hysteresis on establishing the route/link to begin with.  

There's papers out there investigating the known routing loops that occur in OLSR during the transition to propagate route changes and update all route tables.  One of our prior forum discussions touched on this.  It remains a weakness of OLSRv1.   

In previously looking through the code, the internal application clocks are based on relative time with neighbors.  Syncing clocks with NTP wouldn't make an impact.  There is a trade-off of trying not to flood the network with Link State update packets and how long it takes the 'network' to change to a more preferred route.  Consequently, some (reluctant) time delay (order of seconds) is tolerated with these inherent routing loops during a route change transition.  

The OLSR approach is for a node to be autonomous.   Short of writing our own and/or changing the game (we'd have to pay in time-effort), we have the options to minimize occurrence and parameter tuning to reduce exposure time.   From a practical perspective, the immediate option available is to avoid parallel paths with comparable ETX values.and drive the occurrences to zero.      

Joe AE6XE

KA9Q
You might be right that the

You might be right that the most expedient thing to do in the short term is to avoid frequent route changes with ad-hoc (so to speak) hacks. But there can be an opportunity cost to that. If you have two independent equal-cost routes to your destination, you'd really like to take advantage of them and divide your traffic among them.

Of course the two routes here are not actually independent in that the two Qualcomm nodes (North and East) share the same RF channel to the next hop; because of their pointing, they both happen to see it equally well. But if they were on separate channels, then you would want to divide the traffic.

Would be nice to find some way to incorporate co-channel RF interference into the link metrics.
 

KG6JEI
" have two independent equal

" have two independent equal-cost routes to your destination, you'd really like to take advantage of them and divide your traffic among them."

Can't say that I have seen a routing daemon available on Linux that does this automatically, nor can I think of a reasonable way off to do this without significantly coupling multiple devices together (active active cluster type deployment) which may possibly not be as well a fit for the self autonomous design of the network as it actually would insert one more area for a failure.

"Would be nice to find some way to incorporate co-channel RF interference into the link metrics."

Since the routing daemon we run is Open Source if this feature was programmed it could be submitted upstream to OLSR, once they test it and prove it out in experimental networks and prove it works we certainly could consider it based on the results of field testing. I could see in theory how it may prove somewhat useful. Would be curious how well theory turns into practice.
 

KG6JEI
This thread is really started

This thread is really started to get off topic

As noted in person, the OLSR data is really what is wanted.

This issue can not progress forward without that to help us see exactly what is going on inside of OLSR at the time it occurs and all these random topic splits under this thread really are not going to help solve the root question on this thread either.

 

AE6XE
AE6XE's picture
Multi-Path OLSR

While we hack around OLSRv1 weaknesses, the good news is there are folks out there working on the problem.  For longer term consideration, check out this research for Multi-Path MP-OLSR based on OLSRv2.   Basically,  "every node no longer maintains the routing table. Instead, the source computes the whole path when it needs to send packets, and then puts the paths in the packets (different packets may get different paths, but one packet just gets one path)".    Might be issues for longer route paths, but there wouldn't be any routing loops anymore.   ...probably at least a year out for OLSRv2 to mature or stabalize and we can begin to consider these upgrades.

For everyone reading this, there's a good olsr-for-dummies overview from Section 2.   Very good to know for any AREDN mesh admin and quick reading (AREDN uses OLSRv1 and ETX today).

Jing Wen. Multiple Path Optimized Link State Routing (MP-OLSR). 2013. 

http://hal.univ-nantes.fr/file/index/docid/861074/filename/MP-OLSR_Repor...

Joe AE6XE

 

KA9Q
Given that several ad-hoc

Given that several ad-hoc routing protocols have been under development for quite a few years, I think it's pretty clear already that there is still no one, universal, bug-proof, ideal routing algorithm for everybody and I personally think it would be a mistake to cast any one of them (e.g. olsr) into stone in an operational network.

Besides Multi-Path OLSR, which I recognize as a form of "source routing" (something like AX.25 digipeater strings) there's BATMAN, which goes the other way - the nodes don't try to figure out an entire path, only the next hop.

And all these belong to only one of the classes of ad-hoc routing protocols, the "pro active" class. They maintain routes in advance of their actually being needed. The alternative is on-demand routing, where a route is determined only when a packet makes it necessary. There are some potentially important advantages here, such as a substantial reduction in routing overhead when the network is large, the topology changes frequently, and traffic is light. (Right now on the San Diego network, I'd say OLSR broadcasts are orders of magnitude more than actual data traffic.)

OLSR routing overhead may be tolerable on WiFi-like channels, but it would seem intolerable on lower-speed links that may be our only alternative in an emergency situation. There are still good uses for mobile VHF/UHF radios with speeds of only a few tens or hundreds of kilobits/sec. They can carry voice and still images, and the routing algorithm shouldn't prevent this.

We could avoid the problem by bounding (say) the OLSR protocol at the edges of a reasonably fast WiFi-like backbone and running different routing protocols on the other networks, then joining these networks with IP routers using traditional inter-network protocols like BGP, but it's too soon to say whether this would be acceptable or come at a significant cost in flexibility and robustness. There's still something very seductive in the idea of a pure ad-hoc network where you can just toss a bunch of nodes into random places with random data rates and random antennas and have them self-organize into the best possible network that can be constructed with the available links, but this has proven very difficult in the general case.

KG6JEI
"OLSR routing overhead may be

"OLSR routing overhead may be tolerable on WiFi-like channels, but it would seem intolerable on lower-speed links that may be our only alternative in an emergency situation. There are still good uses for mobile VHF/UHF radios with speeds of only a few tens or hundreds of kilobits/sec. They can carry voice and still images, and the routing algorithm shouldn't prevent this."

I'm aware of this already being done,  Winlink gateways for example, users come in on packet to the gateway, which can than send the rest over the mesh network. BBS servers would also be equaly true.

Or if one wanted to you could run IP over AX.25 and assign radio an IP address based on the Mesh Node "lan" segment.

I saw Gene WB9COY gave a great demonstration of this using the DSTAR highspeed 1.2GHz radios pluged into the LAN port of a mesh node and relaying the access over to a laptop on the other side of the DSTAR stream.

The mesh is a network, nothing we do on the mesh side will generally limit one from plugging pieces into the network on the LAN segment and interfacing them.

One can even treat the VHF/UHF radio with TNC as a network serial port and use it on remote ends of the network. Ive seen demonstrations where a central TNC is shared to multiple computers on a mesh so that any single one can grab it and send information or receive information from the VHF channel.

So the mesh routing protocol is at this time in no means eliminates the usefulness of these radios to still serve a purpose in many ways its actually helping them as in the case of Winlink one could quickly gather data on slow speed channels and relay it a distance over a high speed link, and than drop it back on a "slow channel" when it gets to its destination.

KA9Q
So the mesh routing protocol

So the mesh routing protocol is at this time in no means eliminates the usefulness of these radios to still serve a purpose in many ways its actually helping them as in the case of Winlink one could quickly gather data on slow speed channels and relay it a distance over a high speed link, and than drop it back on a "slow channel" when it gets to its destination.

It's a problem if you'd like the routing to work seamlessly over all these links, including the slow ones, because of the high overhead of the frequent OLSR "hello" packets on the slow links. As I said you can avoid this if you limit OLSR to just WiFi (and Ethernet)-like networks where you've got lots of capacity and run some less intensive routing protocol on the slow links, but then you need something else to handle inter-net routing between those two intra-networks. And you have to designate the nodes on both networks that will be the gateways between them.

Sure, no matter what routing you use you're always better off planning as much as you can in advance instead of relying on chance. But there's still something very seductive about being to throw together a collection of links and nodes of all kinds and have them automatically figure out the best network that can be constructed from what's available, without having some person (i.e., skilled network wizard) do this manually. A network wizard who's probably going to be out of town if/when the time comes.

 

KG6JEI
In the samples I gave the

In the samples I gave the links could have fully routable IP's without any need to run routing over them including samples where it actually was done.

I'm not sure trying to use a packet link for a backhaul (where it needs routing data to flow over it) is likely to be wise and I see a lot of problem with it being able to handle the demands of an AREDN network as the entire intent of AREDN is to build MODERN networks to help augment the networks that can no longer keep up with the demands of served agencies.

KA9Q
I'm not sure I understand

I'm not sure I understand what you mean by "packet link" and "backhaul". If you mean 1200 baud AX.25, yeah, that's pretty useless in this day and age. I mean, the Bell 202 modem was already obsolete when we standardized AX.25 in 1982!

But I know of some forthcoming VHF/UHF digital radios operating at medium data rates (tens to hundreds of kb/s), fast enough to support the kinds of applications you're envisioning for AREDN (http text and some stills, VoIP). Although much slower than backbone WiFi gear on 13 cm and above, they are more adaptable to mobile and portable operation with omnidirectional antennas and not-quite line-of-sight paths. In other words, I'm thinking of using them as edge or feeder networks to the regional backbone, not necessarily as backhauls between several regional backbones (unless there's no alternative).

Because these nodes will be mobile and may need to relay each other to the nearest fixed backbone access station, some form of automatic routing within that network is obviously very important. You also want them to be able to move seamlessly between backbone access points, so either they will have to speak the same routing protocol as the backbone or the backbone should just provide simple, transparent link-level connectivity.

I'm beginning to think a lot about that second option. Because the backbone nodes are stationary, with engineered point-to-point links, it isn't really necessary for them to run, internally, a routing protocol like OLSR designed for networks that change much more rapidly.

So why not just use link-level bridging (switching)? One big advantage of this approach is the simplicity and versatility of a transparent link-level network. Among other things, you immediately get multicast and IPv6 support for free. I consider both very important. A network-layer routing protocol like OLSR can still run over this network, but it would be carried like any other data traffic. And because the backbone looks like a single, fully connected Ethernet to the network layer, the OLSR Hello updates between IP routers on the edge of this network will be that much smaller.

Redundant links are still OK; just use the STP (Spanning Tree Protocol) to break loops.  It's been in the Ethernet standard for a long time, is very widely used and implemented in switches and in the Linux kernel. There are several revisions of STP, each backward compatible with earlier implementations: IEEE 802.1d, 802.1w and now the 2014 version of 802.1q.

Every two seconds each node transmits a BDPU (Bridge Data Protocol Unit), and unlike OLSR it doesn't grow in size as the network grows. BPDUs are at the link layer so they don't leave the edge of the bridge network (e.g., they don't have to go over lower speed VHF/UHF links).

As in any switched network, packets to MAC addresses whose locations are unknown are flooded to all ports while packets to known addresses are sent only on those links that take them closer to their destination. Because multicast addresses never appear in source fields, multicasting works automatically.

It would probably be possible to run stock firmware on many backbone nodes, alleviating the growing problem of finding specific models and types that can support third-party code. (I notice that AirOS already supports STP). In connection with this, we should consider managed (base station) mode instead of ad-hoc mode for some or all links. Despite the larger physical scale, many backbone links still resemble conventional WiFi networks: a node on a mountaintop talking to several other stations who can't hear each other directly. Just make the mountaintop mode the access point and the other nodes its clients. This immediately eliminates the hidden-terminal problem because, in a managed network, client stations transmit only when polled by the access point.

Another potentially major advantage of managed mode is that many now implement multicast optimization; they snoop on IGMP (multicast membership) messages so they send multicasts only where they're wanted. Many also turn a multicast that would otherwise be transmitted at low speed with no acknowledgement into several acknowledged unicasts, one to each client, that can be sent at the fastest speed the client can support.

Yes, there are some disadvantages too. Traceroute no longer shows the complete physical path through the backbone, so some alternate method will have to be provided. The users will have to somehow learn the IP addresses of each bridge node and send them SNMP queries to read the bridge tables. And I'm sure there are other disadvantages that will come out. But I think it's worth a serious look.

KG6JEI
I suggest attending the San

I suggest attending the San Diego Mesh Workging Grouo meetings in San Diego a lot of this had already been discussed and decided for San Diego.

*note* This post is submitted as a member of the San Diego Mesh Working group and not as an AREDN member. The comments in this post are my own and do not necessarily reflect the opinion of the AREDN community.

 

 

KA9Q
And I do indeed attend those

And I do indeed attend those meetings, but this has not come up for discussion at any of those I have attended.

Are there any minutes or other records of the rationales for the various design decisions taken?

 

KA9Q
Or, instead of formal

Or, instead of formal rationale document(s), were any experiments conducted and empirical data collected on the various alternatives before the current architecture was adopted as optimal for its stated purpose?

 

K6AH
K6AH's picture
SD Network Design

Phil,

Your backbone design recommendation reiterates the design diagram I shared with you a couple month ago... and is indeed the model the Southwestern AREDN network (the California-wide network I described at Microwave Update this weekend) intends to follow. The design has long since been posted on this forum.

Andre, K6AH

 

KA9Q
Andre, you're actually the

Andre, you're actually the one who planted the seed that got me thinking. In one of your presentations IIRC you said that you would run your 3.3 GHz nodes, at least initially, with AirOS in bridged/switched mode. I wondered why you did this and thought it would be better if they ran the mesh code like everything else.

But then I got to thinking about it and began to wonder if you weren't on the right track. In a stable, engineered network with point-to-point links you don't really need the complexity of something like OLSR. Bridging may be enough; STP will break any forwarding loops. The nodes nearer the edge can still run OLSR over this, and its routing tables can be smaller.

The main disadvantage I see is that STP breaks loops by simply turning off links until the loops disappear; the paths are not always optimal. (Turned-off links will be turned back on automatically if another link happens to fail). This means traffic might not take the most optimal route. E.g., if you have nodes A, B and C in a closed ring and the STP breaks the link between A and C, then while traffic from A to B and from B to C will go direct, traffic from A to C will go via B instead of going direct. This is usually OK in fast Ethernet networks but possibly not in a slower radio network. But you can assign priorities to the links to influence which ones the STP will disable first when needed to break loops. E.g., in my ABC example, if the nodes are roughly in a straight line so the path AC is much longer (and probably slower) than AB and BC, then you can set things so that the STP will break AC unless one of the other links fails.

I'd be happy to explain this further if you're interested. I did a lot of work on bridging and spanning trees back at Bellcore in the 1980s. My boss, the late Dave Sincoskie, invented the concept of multiple spanning trees that eventually became the IEEE VLAN standard.

 

Theme by Danetsoft and Danang Probo Sayekti inspired by Maksimer