The Internet has become a lifeline-grade utility.
Our health, safety, and financial security depend on reliable and consistent availability of Internet services.
Yet over the years we have given relatively little consideration to actually having a reliable and consistently available Internet.
We are to a large extent flying the Internet on good luck and the efforts of unheralded people often working with tools from the 1980s.
As we wrap the Internet with security walls and protective thorns, maintenance and repair work is becoming increasingly difficult to accomplish in a reasonable period of time, or even at all.
With the increasing inter-dependency between the Internet and our other lifeline-grade utilities — such as power, water, telephone, and transportation — outages or degradations of any one of these systems can easily propagate and cause problems in other systems. Recovery can be difficult and of long duration; significant human and economic harm may ensue.
Although we can hope that things will improve as the Internet matures, outages, degradations, and attacks can, and will, occur. And no matter how much we prepare and no matter how many redundant backup systems we have, equipment failures, configuration errors, software flaws, and security penetrations will still happen.
The oft quoted line, “the Internet will route around failure”, is largely a fantasy.
When we designed the ARPAnet and similar nets in the 1970s we did have in mind that parts of the net would be vaporized and that packet routing protocols would attempt — notice that word “attempt” — to build a pathway around the freshly absent pieces.
Today’s Internet is less dynamic than the old ARPAnet; today’s Internet is more “traffic engineered”, and subject to peering and transit agreements than the old ARPAnet. Although the possibility of dynamically routing around path problems remains, that possibility is constrained.
Today’s Internet is far more intricate than the ARPAnet. Today’s Internet services are often complicated aggregations of inter-dependent pieces. For example, web browsing depends upon more than mere packet routing; it depends upon a well operating domain name service, upon well operating servers for the many side-loads that form a modern web page, and upon compatible levels of cryptographic algorithms. Streaming video or music, and even more so interactive gaming or conversational voice, requires not only packet connectivity but also fast packet delivery with minimal latency, variation of latency (jitter), and packet loss.
As anyone can attest, today’s Internet service quality varies from day to day.
When the Internet was less ingrained into our lives, network service wobbles were tolerable. Today they are not.
Problems must be detected and contained; the causes ascertained and isolated; and proper working order restored.
Individually and as a society we need strong assurance that we have means to monitor the Internet to detect problems, to isolate those problems, and to deploy repairs. Someone is going to need adequate privileges to watch the net; to run diagnostic tests; and to make configuration, software, and hardware changes.
However, we do not have that strong assurance.
And the few assurances we do have are becoming weaker due to the deployment of ever thicker, stronger, and higher security barriers.
Simply put: Our ability to keep the net running is being compromised, impeded, and blocked by the deployment of ever stronger security measures.
This is a big problem. It is a problem that is going to get worse. And solutions are difficult because we can not simply relax security protections.
This paper describes this problem in greater detail, speculates what we might be able to do about it, and offers a few suggestions.
We have all seen the movie 2001. The computer, HAL, has gone awry. HAL and the ship could be repaired and recovered, but HAL is refusing to let Dave back aboard to make those repairs.
A similar situation happens when something in the Internet goes awry. (I am excluding from this discussion the too frequent situation in which the user has mis-installed or mis-configured a home router, the user is doing something incorrectly, or the user does not know the correct vocabulary to describe what they perceive as a problem.)
When there is a network problem that problem is on the inside, somewhere. The diagnostic and repair people are on the outside. Lying between the problem and the repair team are security barriers. Detecting and repairing Internet problems is often rather like trying to cook a meal without being able to see into the kitchen, much less enter it, much less slice or mix the ingredients, add spices, or touch the stove.
There are plenty of things that can cause Internet problems. Software may be obsolete or incorrectly implemented, hardware or power may have failed, components may be misconfigured, a security penetration may have affected operations, etc. The Internet really does abide by Murphy’s Law that if something can go wrong, it will, and at the worst possible time.
We Should Be, But Are Not, Designing The Internet To Be Managed, Diagnosed, or Repaired
As a community our approach to Internet Engineering, from the beginning, has been more concerned with “making it work” than “how do we detect and repair failures”.
That imbalance remains with us to this day.
If we want a reliable Internet then we must give more emphasis to means to detect, constrain, expose, isolate, and repair problems.
Railroad and aircraft engineers understand through long experience that even the most unlikely of failures will occur and that backup systems themselves may fail.
Why are traffic lights red and green? Once upon a time, when railroads were new, red lamps were used to tell train drivers to stop and white lamps were used to indicate when it was safe to proceed. Every now and then the red lens would fall off and the red lamp would emit white light, causing a train to proceed when it should have stopped. The answer was to use red and green lenses so that a white light became an indication of a fault, and the train driver would stop. (There is a subsequent story of why the “green” color used today is rather blue.)
We built the Internet on almost the opposite belief — that the store and forward packet machinery of the Internet will simply “route around failures”. We quickly learned how wrong we were:
In the early days of the ARPAnet the memory of one of the Interface Message Processors (IMPs) developed a bad spot, causing that IMP to announce that it was the best path to everywhere — so eventually all of the traffic of the ARPAnet was sent to the ill IMP, where that traffic silently vanished. Not only did the network not route around a failure, it routed everything to the failure.
We have learned that not only does the Internet sometimes not “route around” failures but that there are entire classes of problems that have nothing to do with packet forwarding. These are things like intermittent communication link noise, queueing delay and congestion, out of date or malicious routing information, old or inadequately built software, etc.
Often the first (and too often the only) person to notice that something has gone wrong on the Internet is an end user who only perceives that what he wants to do is not happening. Such a person almost always has no idea of the underlying cause and often has no good way to describe and report the problem.
Some of us are beginning to use the net beyond its zone of safety. For instance, I’ve seen a video of a surgeon doing remote control surgery from 50km away and trusting that the 5G link in the middle wouldn’t have any hiccups.
The Internet’s routing system is somewhat robust. It can often recover from clear outages, such as failure of a packet forwarding path. But that recovery can take time. It takes time to detect an outage and for the disparate elements of that level of the Internet to “converge” and begin to use a new path. Sometimes this recovery process itself can go south; for example, in “route flapping” the net endlessly flip-flops between multiple solutions. (This can, in turn, cause a backup mechanism, route-flap damping, to come into play. This can lock a failed path into place, preventing use of working alternatives, until the damping period expires.)
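The damping mechanism just mentioned can be sketched as a simple penalty model, loosely in the style of BGP route-flap damping (RFC 2439). This is a minimal illustration; the numeric thresholds here are invented for the example, not any router vendor’s defaults:

```python
class FlapDamper:
    """Toy model of route-flap damping: each flap adds a penalty,
    the penalty decays exponentially over time, and a route whose
    penalty exceeds the suppress threshold is ignored until the
    penalty decays below the reuse threshold."""

    def __init__(self, penalty_per_flap=1000, suppress=2000,
                 reuse=750, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress      # penalty above this: suppress route
        self.reuse = reuse            # penalty below this: route usable again
        self.half_life = half_life    # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True    # flapping too much: lock it out

    def usable(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False   # penalty has decayed: reusable
        return not self.suppressed

d = FlapDamper()
d.flap(0); d.flap(10); d.flap(20)   # three quick flaps within 20 seconds
print(d.usable(30))                  # False: route is suppressed even if now stable
print(d.usable(30 + 3 * 900))        # True: penalty decayed below the reuse threshold
```

Note how the suppression persists after the route has stabilized: the path stays locked out until the damping penalty expires, which is exactly the double-edged behavior described above.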
We have a lot of work to do to improve our means to detect and repair Internet problems. This note will discuss some of those later.
However, nearly every measure we might consider will face resistance because nearly every measure will require some sort of carefully tailored door through our ever growing walls of security.
It may well come to pass that we have to reconsider some long and deeply held opinions about how we use the net, about privileges, and about centralized authorities to regulate who can exercise these privileges.
Network management, maintenance, and repair are often held in low esteem. Yet these are areas that deserve a great deal of respect. These tasks can often require a deep knowledge of how the network works at all levels, from the physical all the way up through applications; an understanding of what causes could create observed symptoms (effects); and the expertise to apply tools and change network configurations without making situations worse.
Internet Repair 101
Internet repair is invasive. Internet repair comes with a significant risk of error and damage.
Internet repair requires elevated privilege to reach into the net to observe, to test, and to manipulate.
Network diagnosis and repair are like surgery.
Often the tools we need to diagnose and repair the net are sharp and intrusive, like surgical scalpels. We can’t denature these tools - that would be like forcing a surgeon to operate with a dull butter knife.
Because our tools are potentially dangerous they must be used with expertise and care.
Network repair generally involves several steps, often applied iteratively:
- Observe the symptoms.
- Develop a theory (or several theories) of what could cause the observed symptoms.
- Run tests to evaluate the theories.
- Deploy corrective measures.
- Determine whether the corrective measures actually work.
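The steps above can be sketched as an iterative loop. This is pseudocode-in-Python: the callables are stand-ins for human judgment and real diagnostic tooling, not an actual repair system.

```python
def repair_loop(observe, theorize, test, deploy, verify, max_rounds=10):
    """Iterate the diagnose-and-repair cycle until symptoms clear.
    Each argument is a callable standing in for one of the steps."""
    for _ in range(max_rounds):
        symptoms = observe()
        if not symptoms:
            return True                  # nothing (or nothing further) wrong
        theories = theorize(symptoms)    # candidate causes for the symptoms
        confirmed = [t for t in theories if test(t)]   # isolation tests
        for cause in confirmed:
            deploy(cause)                # apply a corrective measure
        if not verify():
            continue                     # fix didn't hold: iterate again
    return False                         # gave up: time to escalate

# A toy run: one misconfigured element that the loop finds and fixes.
state = {"dns_broken": True}
ok = repair_loop(
    observe=lambda: ["lookups failing"] if state["dns_broken"] else [],
    theorize=lambda s: ["bad resolver config"],
    test=lambda t: state["dns_broken"],
    deploy=lambda c: state.update(dns_broken=False),
    verify=lambda: not state["dns_broken"],
)
print(ok)   # True: symptoms cleared after one round
```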
Every one of these steps may require access deep into parts of the net that are generally off-limits, where sensitive data may be flowing, and where changes might have significant external ramifications.
It is not uncommon for these steps to disrupt normal operations, exacerbate problems, or create new problems.
Internet pathologies are often found in the interaction between distantly related pieces that may lie in different, perhaps mutually suspicious, administrative realms. Consequently those who are engaged in diagnosis and repair may be distant from one another and may not have credentials to freely exchange data or reach into one-another’s networks.
Network repair is often a race against the clock; there is not a lot of time to follow complex administrative procedures.
Security barriers often get in the way of legitimate repair efforts. Security barriers seem to become even taller and thicker when repair people make an error or otherwise need to back out quickly.
Trust, Privilege, and Balancing Risks Against Needs
The Internet is neither owned by nor operated by any single entity. The glue that holds the Internet together is made of enlightened self interest in the mutual benefits of connectivity, of generally accepted technical standards and methods of operation, a lot of trust and constructive interaction between operators, and the “robustness principle”: “be conservative in what you send, be liberal in what you accept.”
This glue is still strong, but it is losing strength.
Operators trust one another less as they increasingly become competitors. Laws, particularly laws regarding anti-competitive practices, privacy, and security, may further constrain cooperation and trust.
Rather than the grand shared experiment of the past, the once open, smooth, almost seamless net has become more closed, granular, and suspicious.
That means that keeping end-to-end flows running across the network is becoming more difficult at exactly the same time that we are becoming more dependent on those flows.
As should be apparent from the preceding sections, keeping the net running smoothly demands that certain people and their tools have privileges and authority to reach quickly into the net, often across security barriers, to take measurements, run tests, and change configurations.
That’s a lot of privilege.
The net and its users can be badly damaged if that privilege is exercised without care or without knowledge.
We should fear abuse or mistake. We should fear those whose goal is to use maintenance and repair as a cover for malicious activities.
Yet we need these tasks to be done, we need the net to keep operating.
How do we balance the needs against the risks?
Has Anyone Done This Before?
When it comes to keeping a network running we have a model of “doing it right”: the old AT&T, “Ma Bell”.
Ma Bell’s voice telephone network was the great communications technology of the era. Indeed, a significant portion of the traffic of today’s Internet is still carried over wires and facilities constructed by AT&T or other telephone companies.
Ma Bell anticipated failures; the engineers designed rock solid equipment with multitudes of test points. With the ARPAnet we explicitly anticipated major failures, but had few test points. With the Internet and in the name of efficiency we have backed away from the self-healing dynamics of the ARPAnet in favor of path routing based more on negotiated contractual agreements and traffic engineering. And up at the application level we have come to depend on continuous connectivity - too often the failure of a TCP connection means the abrupt end of an application session.
In the era of Ma Bell it was common practice for nearly every device to have a local and a remote loopback mode. This is lacking on today’s Internet. Yes, we do have a limited form of remote loopback in the form of ICMP Echo request/response (“ping”), but ICMP Echo is often blocked or rate limited by intervening devices.
And “ping” is a very coarse tool. It is a tool to perform the most basic of connectivity tests. It tells us almost nothing about the health of the path over which those ping packets passed. We do not learn whether the path is solid or on the verge of overload; we learn nothing about the queue lengths in switching devices.
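Even when ping-style probes do get through, their raw RTT samples must be reduced to latency, jitter, and loss figures before they say anything about path health; plain ping output does none of this. A minimal sketch of that reduction (the smoothed jitter estimate follows the running-average style used by RTP in RFC 3550; the sample data is invented):

```python
def path_health(rtts_ms):
    """Summarize a list of RTT samples (None = lost probe) into
    latency, jitter, and loss figures. Bare connectivity tests
    report none of the queueing or stability behavior these hint at."""
    received = [r for r in rtts_ms if r is not None]
    loss = 1 - len(received) / len(rtts_ms)
    latency = sum(received) / len(received)
    # Smoothed mean of successive RTT deltas (RFC 3550 style): a rough
    # stand-in for the queueing variation that raw ping output hides.
    jitter = 0.0
    for prev, cur in zip(received, received[1:]):
        jitter += (abs(cur - prev) - jitter) / 16
    return {"latency_ms": latency, "jitter_ms": jitter, "loss": loss}

# A path on the verge of overload: rising, unstable RTTs and a lost probe.
samples = [20, 21, 20, 55, 90, None, 130, 40, 22, 21]
print(path_health(samples))
```

A sequence of "10 packets transmitted, 9 received" tells an operator almost nothing; the latency spread and jitter growth in the summary above is where the early signs of overload actually live.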
When it came to operations and repair Ma Bell had an advantage over those of us who try to keep the Internet running - Ma Bell had an almost closed system. Ma Bell used their own equipment; ran their network entirely under their own control; and allowed access only to their own employees who used Ma Bell’s own, approved and trusted tools.
In other words, telephone repair teams of the Telco era were prepared with identifiers, credentials, and approved tools and methods. There was an overall administrative framework that created a foundation for one Telco repair person to trust others, even if they had never previously worked with (or even knew) those others. So when doing maintenance or repairs there was a well established process to deal with security barriers.
We do not have anything comparable for the Internet. There is no universal identifier or credential that believably says to all: “This person is capable and trustworthy, give him/her access.” The flow of diagnostic information across administrative boundaries (such as between network providers) is subject to competitive constraints. At best we have informal relationships, such as between people who have learned to trust one another through previously working together or participation in organizations such as the North American Network Operators Group (NANOG).
So, What Do We Need To Make The Internet More Repairable Without Harmfully Reducing Security?
Making the Internet more repairable will require that we relax some of our security so that trusted and trustworthy people in disparate organizations and often distant from one another can look for problems, run isolation tests, and deploy corrective measures.
It will be difficult, however, to find the right balance between repairability and security.
(It is not expected that person A in organization A will be allowed to directly reach into organization B’s network. However, cooperation between A and B needs to be facilitated and, when appropriate, barriers to access and operational data can be temporarily reduced without simultaneously inviting attacks from the outside.)
Network operators will almost certainly need to be incentivized, and perhaps compelled, to pay adequate attention to matters of maintenance and repair, particularly in a world requiring inter-vendor cooperation to resolve end-to-end difficulties.
Here’s my list of things we need to do:
Database of Network Pathology
There’s a lot of informal and anecdotal knowledge about network failures and how to deal with them.
We ought to assemble that knowledge into a database of network pathology.
This would be more than a mere list; it would map symptoms to potential causes. It would describe diagnostic steps to be taken to discriminate among the possible causes.
Such a database ought to be ever growing.
Software tools could be developed to lead repair people or eventually support fully or partially automated detection, isolation, or repair. (It should be recognized that an automated repair system that goes awry could quickly make an utter mess of the Internet.)
Another possibility is to let one of the new Large Language Model AI systems graze on such a database and see what it comes up with.
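A sketch of how such a pathology database might map symptoms to candidate causes and discriminating tests. The entries here are invented illustrations, not an authoritative taxonomy:

```python
# Each entry maps an observed symptom to candidate causes, and each
# cause to a diagnostic step that helps confirm or rule it out.
PATHOLOGY_DB = {
    "names resolve slowly or not at all": {
        "resolver unreachable": "query an alternate resolver directly",
        "broken DNSSEC chain": "retry the query with validation disabled",
        "stale delegation": "walk the delegation from the root by hand",
    },
    "bulk transfers stall mid-stream": {
        "path MTU black hole": "probe with varying packet sizes, DF bit set",
        "middlebox dropping idle flows": "compare behavior with keepalives on",
        "congestion / queue overflow": "sample RTT growth under load",
    },
}

def candidate_tests(symptom):
    """Return (cause, diagnostic-step) pairs for a reported symptom."""
    return sorted(PATHOLOGY_DB.get(symptom, {}).items())

for cause, step in candidate_tests("bulk transfers stall mid-stream"):
    print(f"{cause}: {step}")
```

Even a structure this simple hints at the tooling possibilities: a front-end that walks a repair person from symptom to discriminating test, with each resolved incident feeding a new entry back into the database.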
Repairability Impact Assessments
Here in the US we live in a world where nearly every construction project must pass an Environmental Impact review.
Perhaps we ought to suggest, if not require, that security policies and barriers be accompanied by a review of the impact upon maintenance and repair.
Every Internet protocol standard ought to contain a section that addresses failure modes; histories of misinterpretation and mis-implementation; and tools and procedures to evaluate proper operation, diagnose problems, limit the spread of side effects of mis-operation, and make repairs.
Failure Reporting and Analysis
We ought to establish a formal, empowered body to investigate and report upon significant network failures.
This could be something like the US National Transportation Safety Board (NTSB).
Of course discretion would be necessary to embargo reports until we could be sure that they are not roadmaps guiding those who wish to cause harm.
“In Case of Emergency Break Glass” Boxes
We’ve all seen fire axes and other emergency gear under an “In Case of Emergency Break Glass” sign.
Network operations staff need something like this when they encounter security barriers. (This does not mean that operations staff can bypass those barriers with impunity - opening such a box ought to trigger appropriate security alarms.)
The Internet has few convenient walls on which to hang “In Case of Emergency” boxes. And if such things were in public places on the net we could expect that they would be put to ill use by network vandals.
Operators could publish lists of barriers with appropriate contact information. But such lists would likely be seen as menus of opportunity, an invitation to attackers. So dissemination of any such list would necessarily have to be limited. I don’t know how to do this except that I suspect DRM techniques might be useful to create something akin to the “This tape will self-destruct in five seconds” of the Mission Impossible shows.
However, my own personal experience indicates that at times emergency repair measures must take precedence over security.
As I write I am reminded of a time when we had only minutes to fix a network outage with thousands of people waiting. To get things running we had to jury rig a cable from hither to yon. But there was a steel wall in the way - a security measure. We didn’t ask permission, we just picked up a convenient pair of large ball peen hammers and battered our way through.
Ready-To-Go Disaster Recovery Teams
Disaster recovery is the extreme form of Internet management, diagnosis, and repair. And even more than typical diagnosis and repair scenarios, disasters require cross-organizational coordination.
Disaster recovery is always chaotic. Pre-written plans can provide a degree of order.
In its most basic form, disaster recovery plans could be written in advance; actors and organizational roles designated; resources pre-allocated (or at least identified); and a legal framework created.
Perhaps even more intensive structures could be created; these could be Internet analogs to things like the US Federal Emergency Management Agency (FEMA).
Relaxed Legal Liability During “Network Emergencies”
A limited set of governmental authorities could have the power to declare “network emergencies”.
During such an emergency certain legal constraints and liabilities could be relaxed so that recovery efforts could work with increased focus and flexibility.
Standardized Legal Agreements, NDAs, and Insurance
It would be difficult to establish the structures we desire if we could not define them.
Such is the purpose of legal agreements: to establish a common, agreed-upon understanding of who the parties are, what are their duties, and what are their obligations.
Non Disclosure Agreements (NDAs) define the shape and thickness of the cloud of secrecy under which a system operates. The concept of an NDA has recently become somewhat tarnished, even considered evil. However a proper NDA can work at least as much as a prophylactic, providing guidance to let people know what is expected, as it does as a punitive device to deter transgressions.
Insurance acts as a financial backstop. Too many people think of insurance as a form of absolution and permission to act negligently (or worse). However, good insurance that clearly states what is covered and, perhaps more importantly, what is not covered, can serve to nudge network maintenance and repair people to take greater care.
In the homeowners insurance industry, some states of the US have mandated standard form agreements with standardized attachments for additional coverage. Thus, for instance, a homeowner can buy a fairly well standardized HO-3 base policy with standardized riders.
Standardized legal provisions make it harder for maintenance-shy providers to hide behind pages of tortured, impenetrable legalese.
Clear and Hard Lines of Demarcation Between Users and Providers
At present the lines of control and responsibility between Internet users and providers are often very unclear.
Telephone providers have long had explicit demarcations - often physically contained in a well constructed box with a provider side and a customer side. But as we move towards Internet services things get more fuzzy. Cable providers reach into our homes at least as far as a DOCSIS certified modem, but perhaps over cables that are not controlled by the cable provider. And when we move into the realms of wireless connectivity and into higher level protocols (IP, TCP, and applications) the borders are even more vague.
Many problems and security vulnerabilities can be in these administrative no-man’s lands.
Additional clarification of the boundaries of ownership, responsibility, and administration could benefit us all. This, however, may require some standardization efforts to define physical, electrical, and network protocol demarcation points.
TLS - Transport Layer Security - is an example of a technology with many players and fuzzy boundaries. Too frequently users see applications emit messages that refer to TLS (or its predecessor, SSL) problems. TLS (with its many versions), the system of certificates, and the hierarchy of certificate authorities are complex and ever changing. Standards organizations sometimes “deprecate” deployed digest and cryptographic algorithms, thus orphaning and stranding network services and user applications that are not running the utterly “latest and best” code. As a consequence the problem afflicting users is not a failure in any particular component of the net. Rather, the user’s problem is an emergent property arising out of dissonance between clients and servers or other parts of the net infrastructure. Often no one is “wrong” (unless running older code is “wrong”). This makes it difficult to pin the tail onto the correct donkey and resolve the user’s problem.
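A minimal illustration of that kind of emergent failure, using invented version sets rather than a real handshake: a client and a server that each work perfectly well elsewhere can still have no protocol version in common, so the “fault” lies in neither endpoint.

```python
def negotiate(client_versions, server_versions):
    """Pick the highest protocol version both sides support, as a TLS
    handshake does; return None when there is no overlap at all."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None

# An old client that was never updated past TLS 1.1 ...
old_client = ["TLSv1.0", "TLSv1.1"]
# ... meets a server whose operators deprecated everything below 1.2.
modern_server = ["TLSv1.2", "TLSv1.3"]

print(negotiate(old_client, modern_server))   # None: the handshake fails
print(negotiate(["TLSv1.2"], modern_server))  # a current client works fine
```

Both parties behave exactly as configured; the failure exists only in their combination, which is why the resulting error messages point at no fixable component.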
Professional Structure and Status For Network Maintenance and Repair Personnel
Few topics will elicit groans faster than a suggestion that network operations and repair staff be professionalized.
I, myself, hold a professional certificate (in law) and am subject to strong rules of professional conduct. It is my experience that professional certifications can work well when applied with rigor and overseen by a body with enforcement powers.
I am aware that professional status is not a perfect system. And, like all aspects of security, there ought to be several layers of different kinds of protection, with professional certification being but one of those layers. Professional certificates ought not be a blank-check master key that opens all networks.
I personally have not found some of the corporate-issued certifications to be of more than limited value.
Moreover, there are many people who have deep expertise with network operations and repair who would not be willing to acquire a professional certification. We don’t want to lose access to their expertise. How this may be achieved is an open question.
Explicit Call-Outs For Management, Diagnostics, and Repair in Service Level Agreements
It is not uncommon for large network customers to enter into a Service Level Agreement (SLA) with network providers.
SLAs usually are somewhat quantitative and specify fairly gross metrics such as bandwidth or availability. That last item, availability, is useful but typically only represents availability of that provider’s service, not end-to-end service.
Network contracts and SLAs ought to be expanded to include provisions that obligate the provider to meet present and future industry best practices of inter-provider cooperation. (There could be a mirror provision that obligates the customer to provide adequate access to diagnostic and repair points.)
Third Party Beneficiary Status of Users
In the United States, and much of the rest of the world, civil (non-criminal) responsibility and liability is often established through contracts.
Frequently some duties and obligations in those contracts are treated as merely decorative and are not practiced by the parties, often to save money, and almost always to the detriment of end users.
End users of network services, the people and entities most hurt by network outages, often have no power to enforce terms in those contracts.
Third party beneficiary status can be accorded to end users in the various contracts that bind network and service providers to one another. Such status generally allows an end user to bring a complaint to a court and request that the parties to the contract abide by the terms of that contract.
Of course, most corporations and their lawyers resist putting third party beneficiary designations into contracts. It may, therefore, be necessary to extend our laws so that such terms are imputed into new and existing contracts between network service providers.
Contracts between providers are often treated as proprietary and not disclosed beyond the directly involved parties. Users often have no access to these agreements and may not even be aware of their existence. If there were third party beneficiary provisions in these agreements it is unclear how the end users would know what rights they might have.
Regulatory Body Oversight
The Internet is a product of the late 1960s. That era was one in which government oversight, and even government in general, were held in low esteem. That attitude has carried forth to this day. Many would-be government and non-government regulators appear to consider “hands off the Internet” to be a virtue.
Yet, oversight and regulation can be useful and constructive. It is better, for instance, to have a well established procedure, performed with deep expertise, to evaluate the effectiveness and safety of drugs; we would not want such policy to be performed by a lone, uninformed judge with no expertise in the field.
Here in the United States we have a plethora of administrative agencies that could have a role in assuring that our networks adhere to rational balances between repairability and security. Setting aside the fact that regulatory agencies often are captured by those they purport to regulate, it could be better to have a single agency take on the role.
The three most obvious candidates in the US would be NTIA, the FCC, or the FTC.
There is also a role to be played by the Public Utility Commissions of the various states.
Pre-Established Access Credentials (Especially Credentials That Cross Administrative Boundaries)
One of the major messages of this paper is that improving the reliability of the Internet will involve a lot more access to parts of the net that may be hidden behind security and institutional walls.
It would be unreasonable to expect that any network operator will throw open its doors to anyone who walks up and says “I am here to run some tests”, much less to someone who says “I am here to configure or repair your equipment.”
At a very minimum, cross-operator interactions, in the absence of known personal relationships, will require some form of verifiable identification issued by a trusted authority.
Legal Liability For Inadequate Care For Network Management, Diagnosis, and Repair
Security is all the rage, the public and government are demanding “more, more security”.
On the other hand, concerns that the network actually is maintainable and repairable barely catch the eye, and rarely catch any budget dollars, of corporate management.
Legal liability is a useful means to change that imbalance - it is the business equivalent of the apocryphal 2x4 to the forehead.
Few things draw the attention of a corporate C-Suite faster than the risk of paying damages to customers and users.
When it comes to liability to users for negligent flaws or errors, the software and networking industries have long done the Dance of Avoidance.
The choreography of this dance calls for disclaimers of liability, disclaimers of warranties, dollar limits, coerced arbitration, anti-consumer choice of laws and jurisdiction, and encouragement of a belief that “software and networks are different”.
That dance may finally be coming to an end.
Whatever eventual balance between security and maintainability is established, it will mean we build more fences around the Internet, draw more lines, add more administrative headaches, and perhaps create a priesthood of people who have elevated authority and power to reach into the depths of the Internet.
Costs will increase. But hopefully this will be offset by improved availability and reliability of the net.
 This note does not address some ancillary, but important, issues arising out of increased security such as customer lock-in and discouragement of innovation.
 During the 1970s I worked for the US military establishment and the idea that parts of the net would be destroyed by nuclear blasts was very much part of the conversation, and survival of communications very much part of our goal.
 An earlier collection of these ideas may be found in my 2003 presentation From Barnstorming to Boeing - Transforming the Internet Into a Lifeline Utility:
 Open the Pod bay doors, please, HAL. (YouTube video clip)
 China completes world’s first 5G remote surgery in test on animal (YouTube video clip)
 The Robustness Principle is coming under stress because being “liberal in what you accept” may open the door to security attacks.
 There are some tools that are better than ping when trying to expose the health of the underlying path. There is, of course, the “traceroute” family of tools, whether those use UDP or TCP connections (or erroneously use ICMP echo probes). And there was “pathchar” and its subsequent re-implementation by Bruce Mah as “pchar”. I looked at the problem from an insider’s perspective and came up with a very rough draft of a “Fast Path Characterization Protocol”.
 Troubleshooting tip: If you are trying to elicit signs of life from a device on your local LAN then rather than using “ping”, use “arping”. Ping uses ICMP, which requires a working IP stack. ARP works at a lower level and may be active on the target even if its IP stack has somehow gone astray.
 AT&T’s dominance, hubris, and resistance to outside interference was so pronounced that it became part of Hollywood movies such as the 1967 film “The President’s Analyst” where AT&T became the evil nemesis, TPC - “The Phone Company”.