A wide-ranging panel discussion at the TERENA Networking Conference considered the stability of the Internet routing system at all levels, from technology to regulation. The conclusion seemed to be that at the moment the Internet is stable because two systems, technical and human, compensate effectively for each other's failings. While improvements to increase stability may be possible, care must be taken that they do not disrupt the current balance or introduce new ways for it to fail.
One of the concerns at the technical level is that as the Internet grows, the number of routing updates will also grow, perhaps beyond the capacity of routing equipment to handle them. Amund Kvalbein from Simula has been studying records of route announcements between 2003 and 2010, a period when both the number of IP address prefixes and the number of autonomous network operators responsible for them more than doubled. An initial plot of the number of routing updates per day is very noisy and suggests a greater rate of increase; however, most of this turns out to be due to three types of event:
- Duplicate route announcements, which add no information to what has already been seen. These do not present a significant threat to stability;
- Spikes caused by ‘large events’, where a significant number of prefixes change route at the same time, either as a result of a technical incident (e.g. a cable cut) or a change in routing policy. These are the events that the routing protocols are designed to handle, and they appear to be doing it well; in particular these spikes are local in effect and do not spread across the Internet;
- Changes in level, caused by a small group of prefixes becoming unstable so that their routing ‘flaps’, producing a rapid, continuous stream of updates that may persist for months if the originating networks are not well managed. Fortunately such networks rarely carry much traffic, so their overall effect on the Internet is low.
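The three event types above could, in principle, be separated out of a raw update trace along the following lines. This is a minimal sketch, not the method used in the Simula study: it assumes a simplified trace of `(day, prefix, attributes)` tuples, and the thresholds are purely illustrative.

```python
from collections import Counter

def classify_updates(trace, spike_threshold=1000, flap_threshold=100):
    """Roughly separate a BGP update trace into duplicate announcements,
    spike days ('large events') and persistently flapping prefixes.
    Illustrative only; real analyses work on full MRT dumps."""
    last_seen = {}          # prefix -> most recently announced attributes
    duplicates = 0
    per_day = Counter()     # day -> number of updates that day
    per_prefix = Counter()  # prefix -> number of updates for that prefix

    for day, prefix, attrs in trace:
        if last_seen.get(prefix) == attrs:
            duplicates += 1  # re-announcement adds no new information
        last_seen[prefix] = attrs
        per_day[day] += 1
        per_prefix[prefix] += 1

    spike_days = {d for d, n in per_day.items() if n > spike_threshold}
    flappers = {p for p, n in per_prefix.items() if n > flap_threshold}
    return duplicates, spike_days, flappers
```

Removing the duplicates, the spike days and the flapping prefixes from such a trace would leave the baseline growth curve that the study found to be slower than the growth in prefixes or networks.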
Once these three types of event are removed from the trace, the rate of growth of routing updates is slower than the growth of either prefixes or networks. Growth in route advertisements does not, therefore, seem to be a threat to Internet stability.
David Freedman from Claranet explained how human actions deal with these problems. Major networks routinely stress test new equipment before using it in their production networks, subjecting it to higher levels of traffic than are expected to occur. As a result, many more technical problems are identified before they can affect the Internet. Networks’ technical staff increasingly know and trust one another, through Network Operator Groups, peering forums and mailing lists. This allows many problems to be resolved by a quick phone call. Major disruptions – whether caused by earthquakes, wars or terrorism – will often see competitors working together to protect the Internet rather than their short-term commercial interests. David noted that human networking may resolve some problems more effectively than technology. Technical measures to deal with flapping routes can easily replace a degraded service to users with no service at all, whereas using human contacts may well be able to fix the problem at source. Economic theories about the commons suggest that such cooperative approaches will only be stable if there is the possibility of punishment for those who do not cooperate. David confirmed that networks that are a regular source of problems may be threatened with a loss of peering (which would mean they had to pay more for transit connectivity).
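The trade-off David describes is visible in BGP route flap damping (RFC 2439), the standard technical response to flapping: a router accumulates a penalty each time a route flaps and, once the penalty crosses a threshold, suppresses the route entirely until the penalty has decayed. A minimal sketch of that penalty mechanism, with illustrative parameter values rather than any particular vendor's defaults:

```python
import math

# Illustrative damping parameters (real routers make these configurable).
FLAP_PENALTY = 1000.0
SUPPRESS_LIMIT = 2000.0   # above this, the route is suppressed
REUSE_LIMIT = 750.0       # below this, a suppressed route is re-advertised
HALF_LIFE = 900.0         # seconds for the penalty to decay by half

def decayed(penalty, elapsed):
    """Exponentially decay an accumulated penalty over `elapsed` seconds."""
    return penalty * math.exp(-math.log(2) * elapsed / HALF_LIFE)

class DampedRoute:
    def __init__(self):
        self.penalty = 0.0
        self.suppressed = False

    def flap(self):
        """Record one flap; suppress the route if the penalty is too high."""
        self.penalty += FLAP_PENALTY
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True   # degraded service becomes *no* service

    def tick(self, elapsed):
        """Decay the penalty; re-advertise once it falls below the reuse limit."""
        self.penalty = decayed(self.penalty, elapsed)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False
```

The suppression step is exactly the failure mode noted above: users behind a damped route get nothing at all, where a phone call to the originating network might have fixed the flapping at source.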
Thomas Haeberlen (ENISA) looked at how regulators can help to improve Internet stability. First he noted that new technical approaches often change sources of instability rather than removing them: for example RPKI makes it harder to deliberately hijack the routing system but also increases the risk that mistakes in handling certificates will result in valid routes being rejected. Also, any new technology must work with what already exists on the Internet and be capable of being deployed gradually, in both technical and economic terms. Aiming to improve the whole Internet by regulation is almost certainly too ambitious, particularly as there are wide variations between different countries’ approaches and attitudes to regulation. Instead national regulators should concentrate on measuring and improving the resilience of those parts of the Internet that are most important for their citizens. Preserving reliable access to national banks and national broadcasters during an incident may be both more valuable and more tractable. A rough initial estimate of resilience, and whether there are obvious single points of failure, can be obtained from public BGP information. This needs to be supplemented by organisations’ private information about non-public peerings, routing policies, bandwidth, etc., which may reveal that the network is either more or less resilient than the public information suggests. Bottlenecks and dependencies also need to be checked both above and below the network connectivity layer: do two apparently resilient routes actually share the same cable or duct, or do Internet services rely on connectivity to particular applications or datacentres? Such dependencies and potential failures may not be apparent to individual providers, who cannot combine private information from multiple sources. ENISA plans to work with national regulators to test this approach.
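A first-pass check of the kind Thomas describes – spotting obvious single points of failure in public BGP data – might reduce to intersecting the AS paths observed towards a critical prefix from many vantage points. A minimal sketch, assuming the paths have already been collected (for example from public route collectors); the AS numbers in the usage example are invented:

```python
def single_points_of_failure(as_paths):
    """Return the ASes that appear on *every* observed path to a prefix.
    Any such AS is a potential single point of failure at the BGP layer.
    Note the caveats from the text: private peerings, routing policy and
    shared cables or ducts below this layer can make the real picture
    either better or worse than public data suggests."""
    if not as_paths:
        return set()
    common = set(as_paths[0])
    for path in as_paths[1:]:
        common &= set(path)
    return common
```

The origin AS will always appear in the result, since every path ends there; the interesting finds are any single upstream ASes that all traffic must traverse.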