Protecting Critical Systems in Unbounded Networks

Robert J. Ellison, David A. Fisher, 
Richard C. Linger, Howard F. Lipson, 
Thomas A. Longstaff, Nancy R. Mead
 

Society is growing increasingly dependent on large-scale, highly
distributed systems that operate in unbounded network environments.
Unbounded networks, such as the Internet, have no central administrative
control and no unified security policy. Furthermore, the number and
nature of the nodes connected to such networks cannot be fully known.
Despite the best efforts of security practitioners, no amount of
hardening can assure that a system that is connected to an unbounded
network will be invulnerable to attack. The discipline of survivability
can help ensure that such systems can deliver essential services and
maintain essential properties such as integrity, confidentiality, and
performance, despite the presence of intrusions.

 

The New Network Paradigm: Organizational Integration

From their modest beginnings some 20 years ago, computer networks have
become a critical element of modern society. These networks not only
have global reach; they also affect virtually every aspect of human
endeavor. Networked systems are principal enabling agents in business,
industry, government, and defense. Major economic sectors, including
defense, energy, transportation, telecommunications, manufacturing,
financial services, health care, and education, all depend on a vast
array of networks operating on local, national, and global scales. This
pervasive societal dependence on networks magnifies the consequences of
intrusions, accidents, and failures, and amplifies the critical
importance of ensuring network survivability. 

A new network paradigm is emerging. Networks are being used to achieve
radical new levels of organizational integration. This integration
obliterates traditional organizational boundaries and integrates local
operations into components of comprehensive, network-based business
processes. For example, commercial organizations are integrating
operations with business units, suppliers, and customers through
large-scale networks that enhance communication and services. These
networks combine previously fragmented operations into coherent
processes open to many organizational participants. This new paradigm
represents a shift from bounded networks with central control to
unbounded networks.

Unbounded networks are characterized by distributed administrative
control without central authority, limited visibility beyond the
boundaries of local administration, and a lack of complete information
about the entire network. At the same time, organizations' dependence on
networks is increasing, and the risks and consequences of intrusions and
compromises are amplified.

The Internet is an example of an unbounded environment with many
client-server network applications. A public Web server and its clients
may exist within many different administrative domains on the Internet.
Many business-to-business Web-based e-commerce applications depend on
conventions within a specific industry segment for interoperability.
Within the Internet, there is little distinction between insiders and
outsiders. Everyone who chooses to connect to the Internet is an
insider, whether or not they are known to a particular subsystem. This
characteristic is the result of the desire, and modern necessity, for
connectivity. A company cannot survive in a highly competitive industry
without easy and rapid access to its customers, suppliers, and partners.

More and more, a company's partners on one project are its competitors
on the next, so defining and maintaining trust becomes extremely
complex. Trust relationships are continually changing, and in
traditional terms may be highly ambiguous. Trust is especially difficult
to establish in the presence of unknown users from unknown sources
outside one's own administrative control. Legitimate users and attackers
are peers in the environment and there is no method to isolate
legitimate users from the attackers. In other words, there is no way to
bound the environment to legitimate users using only a common
administrative policy.

 

Expanding the Traditional View of Security

The natural escalation of offensive threats versus defensive
countermeasures has demonstrated time and again that no practical
systems can be built that are invulnerable to attack. Despite the
industry's best efforts, there can be no assurance that systems will not
be breached. Thus, the traditional view of information-systems security
must be expanded to encompass the specification and design of
survivability behavior that helps systems survive in spite of attacks.
Only then can systems be created that are robust in the presence of
attack and able to survive attacks that cannot be completely repelled.

In short, the nature of contemporary system development dictates that
even hardened systems can and will be broken. Survivability solutions
should be incorporated into both new and existing systems to help them
avoid the potentially devastating effects of compromise and failure as a
result of attack.

 

The Definition of Survivability

We define survivability as the capability of a system to fulfill its
mission, in a timely manner, in the presence of attacks, failures, or
accidents. The term system is used in the broadest possible sense, to
include networks and large-scale systems of systems. In particular, the
focus of survivability is on unbounded networked systems where
traditional security precautions are inadequate.

The term mission refers to a set of very high-level requirements or
goals. Missions are not limited to military settings; any successful
organization or project must have a vision of its objectives, whether
they are expressed implicitly or as a formal mission statement.
Judgments as to whether or not a mission has been fulfilled are
typically made in the context of external conditions that may affect the
achievement of that mission's goals. For example, assume that a
financial system shuts down for 12 hours during a period of widespread
power outages caused by a hurricane. If the system preserves the
integrity and confidentiality of its data and resumes its essential
services after the period of environmental stress is over, the system
can reasonably be judged to have fulfilled its mission. However, if the
same system shuts down unexpectedly for 12 hours under normal conditions
(or under relatively minor environmental stress) and deprives its users
of essential financial services, the system can reasonably be judged to
have failed its mission, even if data integrity and confidentiality are
preserved.

It is important to recognize that it is the mission fulfillment that
must survive, not any particular subsystem or system component. Central
to the notion of survivability is the capability of a system to fulfill
its mission, even if significant portions of the system are damaged or
destroyed. Survivable system is often used as a shorthand term for a
system with the capability to fulfill a specified mission in the face of
attacks, failures, or accidents. Again, it is the mission, not a
particular portion of a system, that must survive. 

 

Characteristics of Survivable Systems

Essential services are defined as the functions of a system that must
be maintained when the environment is hostile, or when failures or
accidents occur that threaten the system.

Central to the delivery of essential services is the capability of a
system to maintain essential properties (i.e., specified levels of
integrity, confidentiality, performance, and other quality attributes).
Thus, it is important to define minimum levels of quality attributes
that must be associated with essential services. For example, a launch
of a missile by a defensive system cannot be effective if the system's
performance is slowed to the point that the target is out of range
before the system can launch.

The capability to deliver essential services (and maintain the
associated essential properties) must be sustained even if a significant
portion of the system is incapacitated. Furthermore, this capability
should not be dependent on the survival of a specific information
resource, computation, or communication link. In a military setting,
essential services might be those required to maintain an overwhelming
technical superiority, and essential properties may include integrity,
confidentiality, and a level of performance sufficient to deliver
results in less than one decision cycle of the enemy. In the public
sector, a survivable financial system is one that maintains the
integrity, confidentiality, and availability of essential information
and financial services, even if particular nodes or communication links
are incapacitated because of an intrusion or accident, and that recovers
compromised information and services in a timely manner. The financial
system's survivability might be judged by using a composite measure of
the disruption of stock trades or bank transactions (i.e., a measure of
the disruption of essential services).

Key to the concept of survivability, then, is the identification of
essential services, and the essential properties that support them,
within an operational system. There are typically many services that can
be temporarily suspended while a system deals with an attack or other
extraordinary environmental condition. Such a suspension can help
isolate areas that have been affected by an intrusion and can free
system resources to deal with the intrusion's effects. The overall
function of a system should adapt to preserve essential services.

The capability of a survivable system to fulfill its mission in a timely
manner is thus linked to its ability to deliver essential services in
the presence of an attack, accident, or failure. Ultimately, mission
fulfillment must survive, not any portion or component of the system. If
an essential service is lost, it could in some cases be replaced by
another service that supports mission fulfillment in a different but
equivalent way. However, we still believe that the identification and
protection of essential services is an important part of a practical
approach to building and analyzing survivable systems. As a result, we
define essential services to include alternate sets of essential
services (perhaps mutually exclusive) that need not be simultaneously
available. For example, a set of essential services to support power
delivery may include both the distribution of electricity and the
operation of a natural gas pipeline. 
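
The idea of alternate sets of essential services can be illustrated with
a minimal sketch. It assumes that the mission counts as supported
whenever every service in at least one alternate set is available. The
service names are hypothetical, and reading the power-delivery example
as two interchangeable alternates is our assumption for illustration.

```python
from typing import FrozenSet, Iterable, Set

# Hypothetical alternate sets of essential services; each set alone is
# assumed to be sufficient to support the mission.
ALTERNATE_ESSENTIAL_SETS = [
    frozenset({"electricity_distribution"}),
    frozenset({"gas_pipeline_operation"}),
]


def mission_supported(available: Set[str],
                      alternates: Iterable[FrozenSet[str]]) -> bool:
    """Return True if all services of at least one alternate set are available."""
    return any(alt <= available for alt in alternates)


# Electricity distribution is lost, but the gas pipeline still operates.
print(mission_supported({"gas_pipeline_operation", "billing"},
                        ALTERNATE_ESSENTIAL_SETS))
# Only a non-essential service remains.
print(mission_supported({"billing"}, ALTERNATE_ESSENTIAL_SETS))
```

The check never asks whether any one particular service survived, which
mirrors the point that it is mission fulfillment that must survive.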

 

Developing Survivability Solutions 

Survivability solutions are best understood as risk-management
strategies that first depend on an intimate knowledge of the mission
being protected. The mission focus expands survivability solutions
beyond purely independent ("one size fits all") technical solutions,
even if those technical solutions are broad-based and extend beyond
traditional computer security to include fault tolerance, reliability,
usability, and so forth. Risk-mitigation strategies first and foremost
must be created in the context of a mission's requirements (prioritized
sets of normal and stress requirements), and must be based on "what-if"
analyses of survival scenarios. Only then can we look toward generic
software engineering solutions based on computer security, software
quality attribute analyses, or other strictly technical approaches to
support the risk-mitigation strategies.

Hence, survivability depends not only on the selective use of
traditional computer-security solutions, but also on the development of
effective risk-mitigation strategies that are based on scenario-driven
"what-if" analyses and contingency planning. "Survival scenarios"
positing a wide range of cyber-attacks, accidents, and failures aid in
the analyses and contingency planning. However, to reduce the
combinatorics inherent in creating representative sets of survival
scenarios, these scenarios focus on adverse effects rather than causes.
Effects are also of more immediate situational importance than causes,
because an organization will likely have to deal with (and survive) an
adverse effect long before a determination is made as to whether the
cause was an attack, an accident, or a failure. Awaiting the outcome of
a detailed post-mortem to determine the cause, before acting to mitigate
the effect, is out of the question when an organization is dealing with
the survival of most modern, mission-critical applications.
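
The reduction in combinatorics can be made concrete with a small sketch.
Only the three cause categories (attack, accident, failure) come from
the discussion above. The effect names and contingency plans are
hypothetical placeholders, not results of any actual analysis.

```python
from itertools import product

# The three cause categories named in the text.
CAUSES = ["attack", "accident", "failure"]

# Hypothetical adverse effects, invented for illustration.
EFFECTS = ["data_corrupted", "service_unavailable", "link_down"]

# Enumerating every (cause, effect) pair multiplies the scenario count.
cause_effect_scenarios = list(product(CAUSES, EFFECTS))

# Hypothetical contingency plans keyed by effect alone. The cause is
# deliberately absent, because it may be unknown when action is needed.
CONTINGENCY_PLANS = {
    "data_corrupted": "restore from last verified backup",
    "service_unavailable": "fail over to standby site",
    "link_down": "reroute traffic over alternate link",
}


def respond(effect: str) -> str:
    """Select a mitigation from the observed effect, without waiting for a cause."""
    return CONTINGENCY_PLANS[effect]


print(len(cause_effect_scenarios), "cause-effect pairs versus",
      len(EFFECTS), "effect-based scenarios")
print(respond("link_down"))
```

With realistic numbers of causes and effects the gap widens quickly,
which is one reason to organize survival scenarios by effect.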

 

Developments at the CERT® Coordination Center

The CERT® Coordination Center (CERT/CC) is developing a survivable
network analysis (SNA) method to evaluate the survivability of systems
in the context of attack scenarios. Also under development is a
survivable systems simulator that will support analysis, testing, and
evaluation of survivability solutions in unbounded networks.

The SNA method permits assessment of survivability strategies at the
architecture level. Steps in the SNA method include 

  1. system mission and architecture definition
  2. identification of essential services and corresponding essential
     architecture components
  3. generation of intrusion scenarios and corresponding compromisable
     architecture components
  4. survivability analysis of architectural softspots that are both
     essential and compromisable

Intrusion scenarios play a key role in the method. SNA results are
summarized in a survivability map that links recommended survivability
strategies for resistance, recognition, and recovery to the system
architecture and requirements. Results of applying the SNA method to a
subsystem of a large-scale, distributed healthcare system have been
summarized. Future studies will involve the application of the SNA
method to proposed and existing distributed systems for government,
defense, and commercial organizations.
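
The softspot step of the method can be sketched as a set intersection:
a component is a softspot if it supports at least one essential service
and is reachable in at least one intrusion scenario. This is a minimal
illustration under that assumption, not the SNA tooling itself. All
service, scenario, and component names are hypothetical, and the
survivability map is shown only as an empty skeleton to be filled in by
analysts.

```python
from typing import Dict, Set


def find_softspots(essential_services: Dict[str, Set[str]],
                   intrusion_scenarios: Dict[str, Set[str]]) -> Set[str]:
    """Return components that are both essential and compromisable."""
    essential = set().union(*essential_services.values())
    compromisable = set().union(*intrusion_scenarios.values())
    return essential & compromisable


# Hypothetical essential services mapped to the components they rely on.
essential_services = {
    "retrieve_treatment_plan": {"records_db", "auth_server"},
    "update_treatment_plan": {"records_db", "auth_server", "app_server"},
}

# Hypothetical intrusion scenarios mapped to the components they can reach.
intrusion_scenarios = {
    "stolen_credentials": {"auth_server", "web_portal"},
    "mail_gateway_exploit": {"mail_gateway"},
}

softspots = find_softspots(essential_services, intrusion_scenarios)

# Skeleton of a survivability map: one entry per softspot, with empty
# slots for resistance, recognition, and recovery strategies.
survivability_map = {
    spot: {"resistance": [], "recognition": [], "recovery": []}
    for spot in sorted(softspots)
}
print(survivability_map)
```

Components that are essential but not compromisable, or compromisable
but not essential, drop out of the analysis, which keeps attention on
the part of the architecture where the two concerns overlap.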

The survivable systems simulator is based on a new methodology called
"emergent algorithms." Emergent algorithms produce global effects
through cooperative local actions distributed throughout a system. These
global effects (which "emerge" from local actions) can support system
survivability by allowing a system to fulfill its mission, even though
the individual nodes of the system are not survivable. Emergent
algorithms can provide solutions to survivability problems that cannot
be achieved by conventional means. The survivable systems simulator will
allow stakeholders to visualize the effects of specific cyber-attacks,
accidents, and failures on a given system or infrastructure. The goal is
to enable "what-if" analyses and contingency planning based on simulated
walkthroughs of survivability scenarios.
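
To give a feel for how a global effect can emerge from local actions,
consider the toy sketch below. It is not taken from the survivable
systems simulator and is not claimed to be one of its algorithms. It
assumes a simple local rule chosen for illustration: in each round,
every node keeps the largest value visible among itself and its
immediate neighbors. The ring topology and the readings are
hypothetical.

```python
from typing import Dict, List


def local_max_round(values: Dict[str, int],
                    neighbors: Dict[str, List[str]]) -> Dict[str, int]:
    """One synchronous round: each live node keeps the largest value it sees locally."""
    return {
        node: max([values[node]] +
                  [values[n] for n in neighbors[node] if n in values])
        for node in values
    }


def run_until_stable(values: Dict[str, int],
                     neighbors: Dict[str, List[str]]) -> Dict[str, int]:
    """Repeat local rounds until no node changes its value."""
    while True:
        updated = local_max_round(values, neighbors)
        if updated == values:
            return values
        values = updated


# Hypothetical five-node ring; no node knows the whole topology.
ring = {"a": ["e", "b"], "b": ["a", "c"], "c": ["b", "d"],
        "d": ["c", "e"], "e": ["d", "a"]}
readings = {"a": 3, "b": 7, "c": 1, "d": 4, "e": 2}

converged = run_until_stable(readings, ring)

# Node "c" is lost; the remaining nodes still reach agreement because
# the rest of the ring stays connected.
survivors = {n: v for n, v in readings.items() if n != "c"}
converged_after_loss = run_until_stable(survivors, ring)
print(converged, converged_after_loss)
```

Agreement on a single value is reached without any coordinator, and the
loss of one node does not prevent the remaining nodes from agreeing, as
long as they remain connected.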

 

For Additional Information

This column is based on the following publications, which contain
additional information about this topic.

R. J. Ellison, D. A. Fisher, R. C. Linger, H. F. Lipson, T. A. Longstaff,
and N. R. Mead, "Survivability: Protecting Your Critical Systems," IEEE
Internet Computing, November/December 1999.

H. F. Lipson and D. A. Fisher, "Survivability--A New Technical and
Business Perspective on Security," Proceedings of the 1999 New Security
Paradigms Workshop, September 21-24, Association for Computing
Machinery, 1999.

 

About the Authors

Robert J. Ellison is a member of the technical staff in the Networked
Systems Survivability Program at the SEI. He is currently involved in
the study of survivable systems architectures. He has previously led SEI
efforts in software development environments and CASE tools. He has a
PhD in mathematics from Purdue University.

David A. Fisher is currently leading a research effort in new approaches
for survivability and security in information-based infrastructures at
the SEI's CERT Coordination Center. From 1973-75, Fisher served as
program manager in the Advanced Technology Program (ATP) at the National
Institute of Standards and Technology (NIST). He earned a PhD in computer
science at Carnegie Mellon University, an MSE from Moore School of
Electrical Engineering at the University of Pennsylvania, and a BS in
mathematics from Carnegie Institute of Technology (now Carnegie Mellon).

Richard C. Linger is a senior member of the technical staff in the
Networked Systems Survivability Program at the SEI, where he is
developing methods for analysis and design of survivability for
large-scale infrastructure systems. Before joining the SEI, he was a
senior technical staff member at IBM, where he co-developed cleanroom
software engineering technology for development of ultra-reliable
software systems. He is an adjunct professor at the Carnegie Mellon
Heinz School of Public Policy and Management and the Carnegie Mellon
School of Computer Science.

Howard F. Lipson has been a computer security researcher at the SEI's
CERT Coordination Center for more than seven years. He has played a
major role in extending security research at the SEI into the new realm
of survivability. Earlier, Lipson was a computer scientist at AT&T Bell
Labs, where he did exploratory development work on programming
environments, executive information systems, and integrated network
management tools. He holds a PhD in computer science from Columbia
University.

Thomas A. Longstaff is currently leading research in network security at
the SEI. As a member of the CERT Coordination Center, he has been
investigating topics related to information survivability and critical
national infrastructure protection. Before coming to the SEI, he was the
technical director at the Computer Incident Advisory Capability (CIAC)
at Lawrence Livermore National Laboratory in Livermore, California. He
completed a PhD at the University of California, Davis in software
environments. He received a BA in physics and mathematics from Boston
University, and an MS in computer science from the University of
California, Davis.

Nancy R. Mead is currently leading a research effort in Survivable
Network Architectures at the SEI, and is an adjunct professor in the
Master of Software Engineering program, Carnegie Mellon University. She
is involved in the study of survivable systems requirements and
architectures and the development of professional infrastructure for
software engineers. Before joining the SEI, Mead was a senior technical
staff member at IBM Federal Systems, where she spent most of her career
in development and management of large real-time systems. She has a BA
and MS in mathematics from New York University and a PhD in mathematics
from Polytechnic Institute of New York.

