A presentation by Dino Farinacci and Liming Wei (Cisco). First, the requirements for a Multicast Routing (MR) monitoring protocol were identified. The issues are:
The motivation and properties of the desired mechanism:
Partitioning the tasks
Fault Detection
Fault isolation, using mtrace, rsh, traceroute, etc.
Protocol components
Messages
Reliability
Protection against Attacks
A presentation given by Van Jacobson on behalf of Dan Massey (UCLA) and Bill Fenner (Xerox PARC).
A multicast router has 5500 routes, of which 67 change more than 60 times per hour; 7950 distinct sources are also present.
The data collection methodology was then presented. The observations were made at UCLA, which is a leaf with respect to the global Mbone topology.
Route instability is not caused by link instability alone; other causes include:
The best-known Mbone bug is the buffer overflow problem on neighbors of Cisco routers.
DVMRP routers refresh route information every minute. The current routing table consists of 5500+ routes and can require as many as 80 packets. Cisco routers send the entire table, all 80 packets, at once. This often results in a buffer overflow (and thus lost packets) at neighboring routers. Mrouted instances are particularly vulnerable to socket buffer overflow. When the same update is lost three times in a row because the buffer overflows deterministically, the downstream neighbor declares the route down.
While this can cause a flap for any route advertised via the Cisco, the most common case is for local routes to flap. The Cisco sends transit DVMRP routes first, followed by routes injected from the unicast table. The last packet(s), which correspond to the routes local to the Cisco, are the most likely to be dropped. Typically, a packet carrying the Cisco's local route is received at an upstream mrouted, which accepts the route and advertises it to its neighbors. Subsequent refresh messages from the Cisco are dropped due to buffer overflow, and the mrouted times out the route. Eventually a packet carrying the route gets through, and the mrouted re-advertises the route.
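The flap cycle described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the receiver drains nothing while a burst arrives, so packets beyond the free buffer slots are dropped from the tail, and the route is assumed to be known before the first round. The figures of 80 packets and three missed refreshes come from the text; the buffer sizes and the function name route_states are hypothetical.

```python
UPDATE_PACKETS = 80   # packets in one full-table update (figure from the text)
MISSED_LIMIT = 3      # consecutive lost refreshes before the route is declared down

def route_states(packet_index, slots_per_round, burst_size=UPDATE_PACKETS):
    """Return "up"/"down" for one route after each refresh round.

    packet_index: position in the burst of the packet carrying the route
                  (0 is sent first, burst_size - 1 last).
    slots_per_round: free receive-buffer slots at the neighbor in each
                  round (hypothetical values; assumes no draining during
                  the burst, so the tail of the burst is dropped).
    """
    missed = 0
    states = []
    for slots in slots_per_round:
        # Only the first min(burst_size, slots) packets fit in the buffer.
        delivered = packet_index < min(burst_size, slots)
        missed = 0 if delivered else missed + 1
        states.append("down" if missed >= MISSED_LIMIT else "up")
    return states

# A local route carried in the last packet, with a 64-slot buffer that
# happens to have room for the whole burst only in the fourth round.
local_route = route_states(79, [64, 64, 64, 80, 64, 64, 64])

# A transit route carried early in the burst never misses a refresh.
transit_route = route_states(10, [64, 64, 64, 64])
```

In this model the route in the last packet goes down after three lost refreshes, comes back when one update gets through, and then goes down again, while the route sent early in the burst stays up throughout.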
If counting to infinity occurs and exacerbates the problem, the result can be well over 60 flaps per hour. The Suspect Routes link lists some sites that may suffer from this problem. The path up to the mrouted/Cisco pair remains completely stable, and non-Cisco neighbors of the mrouted remain stable, but routes advertised by the Cisco flap consistently.
There is no workaround at this time, but work is underway. Bigger socket buffers in mrouted 3.9beta would help, but some operating systems do not allow them. An eventual fix is for the Cisco to spread out its updates rather than blast them. Additional measures may reduce the information sent, for example by sending a netmask only when it changes. Attempts to increase buffering at neighboring routers are also ongoing, but many systems simply cannot keep up. This problem has also been observed between two Cisco routers; it is not limited to Cisco/non-Cisco interactions.
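The difference between blasting and spreading out an update can be sketched as follows. This is not the Cisco or mrouted implementation, only an illustration under assumptions: the update is paced evenly over half of the one-minute refresh interval, and the neighbor drains its buffer at a constant rate. The names paced_schedule and peak_backlog, the spread fraction, and the drain rate of 10 packets per second are all hypothetical.

```python
def paced_schedule(num_packets, interval_s, spread_fraction=0.5):
    """Return send offsets in seconds that spread the packets evenly
    over spread_fraction of the refresh interval (an assumed policy)."""
    if num_packets <= 0:
        return []
    gap = interval_s * spread_fraction / num_packets
    return [i * gap for i in range(num_packets)]

def peak_backlog(offsets, drain_rate_pps):
    """Return the largest number of packets queued at a receiver that
    drains at a constant drain_rate_pps (a simplifying assumption)."""
    backlog = 0.0
    peak = 0.0
    last_t = 0.0
    for t in offsets:
        # Drain whatever the receiver processed since the last arrival.
        backlog = max(0.0, backlog - drain_rate_pps * (t - last_t))
        backlog += 1          # the packet that just arrived
        peak = max(peak, backlog)
        last_t = t
    return peak

blast = [0.0] * 80                 # all 80 packets sent at once
paced = paced_schedule(80, 60)     # 80 packets over 30 s of a 60 s interval

blast_peak = peak_backlog(blast, drain_rate_pps=10)
paced_peak = peak_backlog(paced, drain_rate_pps=10)
```

Under these assumptions the blast needs room for all 80 packets at the neighbor, while the paced update never queues more than one packet, which is why spreading the update reduces the dependence on large socket buffers.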
See also http://ganef.cs.ucla.ac.uk/ masseyd/Route.