Chapter 20 Network Reliability and Availability

20.1 Introduction

Chapter 17 described the network requirements of certain classifications of services that can be offered over broadband FDM distribution networks. Among those are transmission quality, effective bandwidth, service reliability (the probability that a system will survive without interruption for a defined period), outage rate (the average rate at which service interruptions occur), and availability (the percentage of time that service is available). Transmission quality was treated in detail in Chapters 10–16, and effective bandwidth was covered in Chapter 18. This chapter will deal with the calculation of network reliability and service availability. It will also discuss how those parameters vary as a function of the topology of the distribution system, among other factors. Finally, it will explore the difference between true availability and that experienced by users of a particular service.

Service interruptions can result from a variety of causes, including equipment failure, commercial power problems, interfering signals, or blocking due to inadequate circuit capacity, as discussed in Chapter 17. This chapter will deal with outages caused by either equipment or powering. Chapter 4 dealt with data traffic engineering, while Chapter 16 dealt with upstream interference issues.

20.2 History and Benchmarking

Logically, service availability should be just that — the net availability to the end user, irrespective of the cause of any interruptions. Historically, however, different industries have used different measures of their own service integrity.

20.2.1 Telephony

As we discussed in Chapter 17, the most widely quoted service availability requirement is 99.99% for local, wired, voice telephony service. It is important to review, however, how the exchange carriers define this parameter:

The service availability objective for the subscriber loop is 99.99%, which corresponds to 0.01% unavailability or 53 min/yr maximum downtime. This objective, incorporating all network equipment between the local switch and the network interface (excluding the local switch and customer premises equipment), refers to the long-term average for service to a typical customer.1

This is not the same as end user service availability. It excludes, for instance, availability of dial tone from the switch, for whatever reason, or availability of interswitch or long-distance circuits. It also excludes any wiring within the home or terminal equipment problems. Though this may be understandable from a local network operator’s point of view, the history of the cable television industry has been that in-home problems are the largest single cause of service outages.2

Second, this is a goal for performance that is averaged over both the entire customer base and over time. Thus, a significant minority of customers may have consistently unreliable service, yet the carrier meets the service goal if the majority of customers have few problems.

Third, carriers do not count “unavailability due to the loss of both primary and back-up powering”3 in cases where there is powered equipment between the central office and home, such as fiber-in-the-loop (FITL) configuration. The importance of this exception depends, obviously, on the integrity of the field powering systems. It has been the experience of cable television network operators, however, that utility power interruptions cause more outages than all equipment failure classifications combined.

Finally, availability includes not only failure rates but also the time required to restore service. The problem is that, though a cut telephone drop wire (as an example) causes service to be unavailable, the network operator will typically not be aware of the outage until the customer tries to use the telephone and discovers the line is dead, then reports the outage to the phone company. The phone company calculates the length of the outage, for purposes of measuring compliance with the availability goal, from when it is reported, not from when it actually begins.4 Arguably, therefore, the actual outage times may far exceed the reported times. As the traditional telephone network evolves by deploying fiber and performance-monitoring equipment closer to homes, this disparity will presumably decrease.

20.2.2 Cable Television

Although the cable television industry has not historically had formal goals covering availability or outage rates, CableLabs undertook a study of field reliability and subscriber tolerance of outages under the auspices of its Outage Reduction Task Force. Its report, published in 1992,5 is discussed in detail in Chapter 17. In summary, the report set as a failure rate target that no customer should experience more than 2 outages in any 3-month period, with a secondary standard of no more than 0.6 outage per month per customer.

As with the telephony goal, it is important to review what is included in the definition of an outage:

Only multicustomer outages are included, thereby excluding all drop, NID, set top terminal, or in-house wiring problems.

Single-channel outages are included, thereby including headend processing equipment.

Only signal outages, not degraded reception, are included.

All elements of signal distribution between headend and tap are included, including the effects of commercial power availability.

Although the language refers only to outages that are “experienced by customers,” the software distributed with the report calculates all outages, and therefore, it can be assumed that customer awareness of outages is not important to the definition, even though the goal was based on customer experience.

The “no customer shall experience” clause in the goal clearly indicates that a worst-case, not average, analysis is called for.

Based on an estimated mean time to restore service of 4 hours, CableLabs translated 0.6 outage per month into a minimum acceptable availability of 99.7%.
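
That translation can be checked with a few lines of arithmetic; the sketch below assumes, as CableLabs did, 0.6 outage per month and a 4-hour restoration time (the code itself is only illustrative):

# Converting an outage-rate goal into an approximate availability figure,
# assuming each outage lasts the 4-hour mean restoration time.
outages_per_month = 0.6
mttr_hours = 4.0
outage_hours_per_year = outages_per_month * 12 * mttr_hours   # 28.8 hours
availability = 1 - outage_hours_per_year / 8760               # about 0.9967, i.e., roughly 99.7%
print(round(availability, 4))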

The equivalence between customer-perceived individual outages and distribution system outages is tenuous at best. On the one hand, some outages may be detected and repaired without some subscribers being aware of the outage; on the other hand, individual subscribers may have in-home problems that will not be included in the preceding definition. Nevertheless, CableLabs chose 0.6 multi-customer outage per month as its benchmark for distribution system outage modeling.6

In the field measurements portion of its study, the task force gathered actual field failure analysis reports from some of its members and used those to predict failure rates of certain classifications of components (discussed further later) and to develop models that could be used to predict the failure rates of arbitrary network architectures. The accuracy of the raw data used in this type of analysis is limited by the analytic skills of the field technicians and by the completeness of their reports. For instance, it is not obvious that a technician would think to report a brief outage caused by a routine maintenance procedure or by a brief power outage that did not cause a formal trouble call to be entered into the dispatch system. Second, the gathered data is necessarily representative of typical cable systems of that time and did not include modern HFC systems. Finally, cable systems (at least in 1992) routinely reported only distribution system outages, not individual customer outages, in their reports.

20.2.3 Cable Television Versus Telephony Reliability and Availability Definitions

Clearly, the CableLabs and Bellcore definitions of reliability and availability are not equivalent, and the numbers should not be compared directly. Table 20.1 illustrates approximately what each industry's outage definition includes, compared with the elements that actually affect customer service.

Table 20.1 Comparison of Bellcore and CableLabs outage definitions


As a final note, availability is traditionally the most important parameter applied to telephone services, though CableLabs found that the rate at which outages occurred was the most important factor to television viewers, with the length of the outage (and thus availability) being of only secondary importance.7

20.3 Definitions and Basic Calculations

It is important to have a clear understanding of the terminology used in calculating reliability-related parameters. Thus, the following definitions will be used throughout this chapter.

Component failure rate. The failure rate of a component or system is defined as the statistical probability of failure of any one of a similar group of components in a given time interval. The measured period should be contained within the normal service life of the device — that is, failure rate is not intended to measure time to wear-out, but rather “midlife” random failures. For the purposes of this chapter, we will generally express failure rates on a yearly basis.

Failure rates of individual electronic components (capacitors, resistors, semiconductors, and so on) are generally determined by their manufacturers using carefully designed tests that are conducted on a sample of products selected at random from the production line.

Failure rates of system-level components such as amplifiers, taps, connectors, and set top terminals can be determined theoretically based on the failure rates of their internal electronic parts and the way in which they are interconnected, or they can be determined based on actual failure data from field-installed units. We have a strong preference for the latter since the installed failure rate is influenced not only by design but also by craft issues, environmental conditions, and how the unit is powered.

When field observations are used to determine failure rate, the appropriate formula is


λ = k / (nt)   (20.1)


where

λ = the annual failure rate

k = the number of observed failures

n = the number of items in the test sample

t = the observation time in years

The failure rate can be converted to other units as required. For instance, the failure rate per hour is the annual failure rate divided by 8,760.
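
As a worked illustration of Equation (20.1), the short Python sketch below computes annual and hourly failure rates from invented field data (the sample size and failure count are hypothetical):

# Hypothetical field data: 42 failures observed among 1,000 installed
# amplifiers over a 2-year observation window.
k = 42       # observed failures
n = 1000     # items in the sample
t = 2.0      # observation time in years

annual_failure_rate = k / (n * t)                 # Equation (20.1)
hourly_failure_rate = annual_failure_rate / 8760  # conversion to a per-hour rate
print(annual_failure_rate, hourly_failure_rate)   # 0.021 per year, about 2.4e-6 per hour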

System failure rate — simple nonredundant connection. Where n components are connected in such a way that the failure of any one of them will lead to a failure of some defined system or function (such as a subscriber service outage or a single-channel failure), and the failures do not have a common cause, the resultant failure rate of the system is given by


λsystem = λ1 + λ2 + … + λn   (20.2)


where

λsystem = the net system failure rate

λi = the failure rate of component i

n = the number of components, any of whose failure will cause a system failure

Note that this formula is not exact since it assumes that none of the failures overlap in time. Thus, it is accurate only when the failure rates and repair times are such that multiple, independent, simultaneously occurring outages are statistically insignificant. Generally, this is a good assumption for well-designed and managed cable systems.

System reliability — simple nonredundant connection. System reliability is the probability that the system will not fail during some specific period. In general, reliability can be calculated once the failure rate is known using


R(t) = e^(−λt)   (20.3)


where λ and t are in consistent units (that is, failure rate per year and years or failure rate per hour and hours).

The result of this calculation may be less than intuitive. For instance, if the annual failure rate of a given system were 1 (meaning that if we measured a large number of similar systems, we would detect a number of failures per year equal to the number of systems), the probability of any single system surviving one year without a failure is e^(−1) = 0.37.
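
The short sketch below reproduces that result and applies Equations (20.2) and (20.3) to a small, hypothetical series-connected chain (the component list and failure rates are invented for illustration):

import math

# Probability of surviving one year when the annual failure rate is 1 (Equation (20.3))
print(math.exp(-1.0))   # about 0.37

# Hypothetical series chain: node receiver, two amplifiers, and a power supply.
# Any single failure interrupts service, so the rates simply add (Equation (20.2)).
rates = [0.01, 0.05, 0.05, 0.03]           # assumed annual failure rates
lambda_system = sum(rates)
reliability_one_year = math.exp(-lambda_system * 1.0)   # Equation (20.3) over one year
print(lambda_system, reliability_one_year)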

Where components or subsystems of known reliability are connected in a configuration where the failure of any component leads to a system failure, the net system reliability can be calculated using


Rs(t) = R1 × R2 × … × Rn   (20.4)


where

Rs(t) = the net system reliability over period t
R1, R2,…, Rn = the reliabilities of the individual components, measured over the same period

Mean time between failure (MTBF). The predicted time between failures is simply the inverse of the failure rate, or


MTBF = 1/λ   (20.5)


where MTBF and λ are in consistent units.

Generally, this chapter will express MTBF values in hours. When manufacturers provide predicted MTBF data, it is also generally given in those units. As with failure rate, MTBF is not a measure of useful service life. In fact, many components have an MTBF that exceeds their wear-out time.

Mean time to restore (MTTR). When a failure does occur, the mean time between failure and restoration of the defined function is the MTTR. For the purposes of this chapter, MTTR values will usually be expressed in hours.

Note that MTTR is sometimes used to express mean time to repair rather than mean time to restore. We use the latter definition here to distinguish between when service is restored and when the failed component is actually repaired. Since it is more common in cable television to restore service after a failure by swapping a good for a failed network component, this is more accurate when our goal is to determine service reliability.

Availability (A). The ratio of time that a service is available for use to total time is known as availability. It can be calculated from MTBF and MTTR (expressed in consistent units) using


A = MTBF / (MTBF + MTTR)   (20.6)


Unavailability (U). Unavailability is the ratio of time that a service is unavailable to total time and is, by definition, equal to 1 − A.

Outage time (TU). Outage time is the amount of time that the network is unavailable during a defined period. In general, the outage time during some period t is equal to the unavailability multiplied by t. It is common to express outage time in minutes per year, which can be calculated as follows:


TU = (1 − A) × 525,600 minutes/year   (20.7)
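
The relationships among failure rate, MTBF, MTTR, availability, and yearly outage time (Equations (20.5) through (20.7)) can be tied together in a few lines; the failure rate and MTTR below are arbitrary values chosen only to show the arithmetic:

annual_failure_rate = 0.5   # assumed failures per year
mttr_hours = 4.0            # assumed mean time to restore, in hours

mtbf_hours = 8760 / annual_failure_rate                 # Equation (20.5), expressed in hours
availability = mtbf_hours / (mtbf_hours + mttr_hours)   # Equation (20.6)
outage_minutes_per_year = (1 - availability) * 525_600  # Equation (20.7)
print(availability, outage_minutes_per_year)            # about 0.99977 and 120 minutes/year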


20.4 Effects of Redundant Network Connections

The preceding simple calculations apply to situations in which the failure of any single component leads to the failure of the defined system or function. In a redundant connection, either of two (or more) components can provide a function should the other(s) fail. Thus, the system user experiences a failure only if the redundantly connected components simultaneously fail. As discussed in the previous chapter, a common example in fiber architectures is a ring configuration where signals can travel by either of two cable routes between headend and hubs.

Unavailability. When two network sections are connected in a redundant configuration, the equivalent unavailability is the product of the unavailabilities of the two individual sections. Since, for each section, the unavailability is simply the number of outage minutes per year divided by the total minutes in a year, the equivalent unavailability is


UR = (TU1 / 525,600) × (TU2 / 525,600)   (20.8)


where

UR = the net unavailability
TU1 and TU2 = the yearly outage times (in minutes) of the redundantly connected sections

Converting this back into equivalent yearly outage minutes for the combined section:


TUR = (TU1 × TU2) / 525,600   (20.9)


Failure rate. The equivalent failure rate is the sum of the failure rate of the first feeding segment multiplied by the probability of the second section failing before the first can be restored, plus the failure rate of the second section multiplied by the probability of the first section failing during an outage in the second section.

The probability of section one failing during an outage in the redundantly connected section two is


PF1 = 1 − e^(−λ1 × MTTR2/8,760)   (20.10)


where

PF1 = the probability of failure of section one
λ1 = the failure rate of section one per year
MTTR2 = the period (in hours) over which the probability of failure is to be measured, namely, the MTTR for a failure in network section two

Thus, the net failure rate when two sections join in a redundant feed is


λR = λ1 × PF2 + λ2 × PF1   (20.11)
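
A minimal sketch of Equations (20.8) through (20.11) follows; the per-route outage times, failure rates, and restoration times are invented, and the probability of one route failing during the other route's repair is computed in the exponential form consistent with Equation (20.3):

import math

# Assumed values for the two routes of a redundant (ring) fiber feed.
tu1, tu2 = 60.0, 90.0       # yearly outage minutes of each route
lam1, lam2 = 0.5, 0.75      # annual failure rates of each route
mttr1, mttr2 = 2.0, 2.0     # restoration times, in hours

tu_net = tu1 * tu2 / 525_600                 # Equation (20.9): about 0.01 outage minute/year
p_f1 = 1 - math.exp(-lam1 * mttr2 / 8760)    # Equation (20.10)
p_f2 = 1 - math.exp(-lam2 * mttr1 / 8760)
lam_net = lam1 * p_f2 + lam2 * p_f1          # Equation (20.11)
print(tu_net, lam_net)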


20.5 Absolute Versus User-Perceived Parameters

The preceding equations yield reliability-related performance parameters for the network, independent of usage. For applications that require continuous use of the network, the calculated failure rates and availability of the network are the same as those experienced by a user. For most services, however, usage of the network by individuals is only partial. For instance, most families watch television about 4 hours per day, whereas telephone users may access the public switched telephone network for only a half hour daily.

The importance of partial usage is that some outages will occur and service be restored before all potential users are affected. Thus, the average network reliability, as experienced by occasional users, will be higher than for full-time users. However, the relationship between the network outage rate and availability and that experienced by occasional users is complex and affected not only by average aggregate usage time but also by the pattern of usage.

As an example, if a television subscriber watches 4 straight hours of television every evening, but at no other time, and the network MTTR is 1 hour, the probability is that he or she will experience slightly more than one-sixth of all outages. On the other hand, a person who uses the same network to make a 1-minute phone call every 6 minutes (say, a 24-hour business using a telephone modem to do credit card verifications) will experience every outage (and, in fact, will perceive a greater outage rate, since the restoration time for each outage will affect 10 attempted phone calls) even though his or her total daily usage time is the same.

A further complication is in the user’s response to a failure and how we quantify the modified usage pattern. For example, assume that a user normally averages one phone call every 4 hours. If the user attempts to make a call and finds the phone not working, he or she may try again every few minutes until successful. If the outage lasts an hour, the user will have experienced many more failures than during his or her normal usage pattern. At the other extreme, a television viewer who would normally watch an evening’s programming may, upon discovering an outage, abandon viewing for the entire evening and thus not be aware that the outage lasted only a few minutes. Again, the viewer’s perception of the network’s availability is affected by his or her changed usage pattern. A legitimate question for reliability modeling is whether these failure-induced retries are counted as independent failures.

For a hypothetical fiber-to-the-curb, fully status-monitored network, one researcher has calculated that a telephone user making 10-minute calls spaced 100 minutes apart (roughly a 9% utilization factor of the network) would experience 30.2 of the network’s 53 minutes of unavailability every year.8 If any portion of the network is unmonitored, the outage time experienced by the customer will increase and could, in fact, exceed the Bellcore-defined outage time for the network!

For cases where usage patterns are not modified as a direct result of outages, we can estimate perceived outage rates and service availability as follows.9

First, define the usage pattern for a service in terms of an average call length, TC, and an average call cycle time, TS (measured from the start of one call to the start of the next call). To keep things simple, all times will be expressed in hours.

The probability of a service failure during a call period TS can be calculated from Equation (20.3) (the probability of failure is 1 minus the reliability):


PF(TS) = 1 − e^(−λ × TS)   (20.12)


where TS is in hours and

λ = the service failure rate per hour

If we multiply the probability of failure per call cycle by a scaling factor that takes into account both call length and service restoration time, we can calculate the probability of failure per attempted usage of the network:


PA = [1 − e^(−λ × TS)] × (TC + MTTR) / TS   (20.13)


where TC and MTTR are both in hours.

The total number of attempts to use the service per year can be calculated using


NA = 8,760 / TS   (20.14)


from which we can calculate the user-experienced yearly failure rate:


λuser = NA × PA   (20.15)


The average length of experienced outages is


TO = (TC × MTTR) / (TC + MTTR)   (20.16)


so that the total experienced outage time per year is


TU = λuser × TO   (20.17)


Finally, the experienced availability (that is, 1 minus the ratio of outage time to attempted usage time) can be calculated using


AU = 1 − (λuser × TO) / (NA × TC)   (20.18)


As an example, assume that a network fails once every 3 months (MTBF = 2,190 hours, λ = 0.000457 per hour) and that the MTTR is 2 hours. Based on this data, an application that used the network continuously would experience an outage rate of 4 per year, a yearly outage time of 8 hours, and an availability of 0.999088.

A television viewer using the network 4 hours per day (TC = 4, TS = 24) would experience a yearly outage rate of 0.9953, a yearly outage time of 1.327 hours, and an effective availability of 0.999091. Thus, although using the network only one-sixth of the time, the viewer experiences roughly one-fourth the outage rate but nearly the same effective availability as a 24-hour user.

A telephony user who makes a 10-minute call every 4 hours (TC = 0.16667, TS = 4) would experience a yearly outage rate of 2.1665, a yearly outage time of 0.333 hour, and an effective availability of 0.999087. Thus, even though the telephone user makes use of the network for a lower percentage of the time, he or she experiences about the same effective availability but a higher failure rate than the television viewer because of his or her more frequent attempts at access.

As can be seen, network effective availability — the probability that service is available — is quite insensitive to usage patterns, whereas experienced outage rate is closely tied to both call length and call rate.
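
The behavior described above is easy to reproduce; the sketch below is a minimal Python implementation of Equations (20.12) through (20.18) (the function name and structure are ours), applied to the worked examples:

import math

def perceived_performance(lam_per_hour, mttr, t_call, t_cycle):
    # Valid where t_call + mttr < t_cycle (see endnote 9).
    p_cycle = 1 - math.exp(-lam_per_hour * t_cycle)      # Equation (20.12)
    p_attempt = p_cycle * (t_call + mttr) / t_cycle      # Equation (20.13)
    attempts_per_year = 8760 / t_cycle                   # Equation (20.14)
    outages_per_year = attempts_per_year * p_attempt     # Equation (20.15)
    avg_outage_length = t_call * mttr / (t_call + mttr)  # Equation (20.16)
    outage_hours_per_year = outages_per_year * avg_outage_length             # Equation (20.17)
    availability = 1 - outage_hours_per_year / (attempts_per_year * t_call)  # Equation (20.18)
    return outages_per_year, outage_hours_per_year, availability

lam = 0.000457   # per-hour failure rate (one failure every 3 months)
print(perceived_performance(lam, 2.0, 4.0, 24.0))   # TV viewer: ~0.995/yr, ~1.33 h/yr, ~0.99909
print(perceived_performance(lam, 2.0, 1/6, 4.0))    # telephone user: ~2.17/yr, ~0.33 h/yr, ~0.99909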

20.6 Network Analysis

Once the basics of reliability are understood, they can be applied to the analysis of actual and proposed networks. The steps to analysis include breaking the network into manageable sections, properly assigning the components to the sections, determining the failure rates of individual network components, estimating the service restoration times for failures at various network levels, and calculating the various reliability parameters. Though the principles of the process will be described here, in practice, the calculations are much more easily performed using a computer, which also allows the engineer to assess the effects of changes in topology, operating practices, and component reliability.

20.6.1 Logical Subdivision of the Network for Analysis

The first step is subdividing the network into segments. The degree to which this is required depends on the level of analysis required. For instance, if the only parameter of interest is the failure rate and availability as experienced by the most distant subscriber, then the many branches in the coaxial distribution system can be ignored and only the longest cascade analyzed in a single calculation.10 As a minimum, however, any network branches that form redundant structures must be separated, analyzed separately, and then used as variables in the equations in Section 20.3 to determine the effective performance of the redundant structure.

For those desiring a more thorough analysis, a separate analysis of each physically separate branch and calculation of reliability data at every customer port will allow additional data such as total failures and the performance as experienced by the average as well as by the most disadvantaged customers. It also allows analysis of the contributions of various network sections to overall failure rate and unavailability. The analysis given at the end of this chapter was done using a large spreadsheet that allowed a very detailed analysis.

In order to limit the amount of data to be analyzed, it is common to analyze only a typical subsection of a large network. In the case of an all-coaxial network, this might be a single trunk, whereas in the case of an HFC network, it is likely to consist of a single fiber-fed node, along with the fiber trunking structure feeding that node.

Figure 20.1 shows a typical example. In this case, a sheath ring fiber structure is used between the headend and the node receiver, while coax cascades of up to five amplifiers are used beyond the node. Should only the performance to the last user be required, then analysis of all but the longest cascade can be omitted. Only the downstream optical structure is shown in the Figure. See Section 20.6.6 for a discussion of the calculation of two-way structures.


Figure 20.1 Topology of typical HFC node and sheath ring trunk.

Figure 20.2 shows the logical division of this network into segments for analysis. Note that the number assigned to each section identifies a unique structure, whereas the letter denotes the instance of that structure. Thus, all four of the segments numbered five are similar, and only a single analysis of that structure is required. Since a single headend optical transmitter feeds both downstream fibers between the headend and node, it is part of segment 1A. Separate optical receivers are used in the node, however, so each of the segments numbered 2 includes both the fiber and connected node receiver.


Figure 20.2 Division of network into segments for analysis.

20.6.2 Assigning Network Components to Subdivisions

Sometimes a network component's effect on transmission integrity is determined by something other than its physical placement in the network. Power supply 1 in Figure 20.2 is physically located in section 5A, but powers the node and the first amplifier in both sections 5A and 5B. Therefore, for purposes of calculating reliability parameters, it should be included in section 3A. Similarly, power supply 2 is physically located in section 6A, but must be included in section 4A since it powers the amplifier that feeds sections 7A and 6A, and its failure would affect the signals in all the downstream branches.

With the components properly assigned to network segments, based on their effect on operation, the components in each section should then be listed in a column as a first step to segment analysis.

20.6.3 Determining Component Failure Rates

In order to estimate the performance of the defined sections, the field failure rate of each network component type must be determined. Manufacturers' design analyses, based on the MTBF of the individual components used and their temperature/voltage stress levels, often have only a limited relationship to failures of field-installed equipment, where such factors as voltage transients and the skill levels of field personnel may dominate inherent failure rates. Also, a failure of a series-connected component may or may not cause a loss of service to all downstream subscribers. For instance, a tap port may fail, causing a loss of service to one customer but not disrupting the through-signal to subsequent taps.

With those caveats, it is instructive to examine failure rates of a few key components.

Commercial power. CableLabs' task force recorded average commercial power failure rates of 30% (all failure rates are on a yearly basis unless otherwise noted), meaning that the probability of a commercial power interruption at any given power supply is 0.3 each year. Since this data is based on outages caused by power failures, it is to be presumed that this figure relates to unprotected commercial power outages rather than all power outages. Also, given that their data was derived from reported system outages, it can be presumed that many short interruptions were not included in the data.

By contrast, the Network Reliability report quotes from a study showing average commercial power unavailability of 370.2 minutes per year.11 They did not report on the rate at which outages occurred, however. The net unavailability with varying durations of standby power capacity is shown in Table 20.2.

Table 20.2 Average commercial power unavailability in the United States


From this data, it is readily apparent that even standby battery capacity of several times the 2 hours normally designed into cable television standby power supplies will result in unprotected outages well in excess of Bellcore’s 53-minute availability guideline. Thus, normal standby powering will need to be supplemented by other means (such as built-in generators or status monitoring combined with deployable mobile generators) for high-availability services.

It is the experience of one of the authors, however, that the reliability of commercial power varies widely from location to location so that Table 20.2 can be taken only as a measure of average conditions. Accurate network reliability predictions must depend on locally valid commercial power data unless networks are so hardened against commercial power outages as to make the issue irrelevant.

Power supplies. Independent of the availability of commercial power is the reliability of the power supplies themselves. CableLabs' data has suggested a failure rate of only 3%. Though such a rate may be achievable by a nonstandby power supply, it is the industry's common experience that standby power supplies are often much less reliable (though the overall reliability of commercial power plus standby power supply will exceed that of a nonstandby supply by itself). Standby supply reliability will be dramatically affected by the age and condition of the batteries, as well as by the level of routine maintenance in the system.

Amplifiers. CableLabs found field failures occurring in trunk amplifiers at the rate of 10%, whereas failures in the simpler line extenders occurred at a 2% annualized rate. “Distribution” amplifiers used in modern HFC networks have a complexity somewhere between these extremes, and 5% is often used in modeling HFC networks though upgraded designs may justify lower values. Werner of TCI has quoted “empirically derived” failure rates for distribution amplifiers as low as 1.75%.12 Merk has suggested that operators use failure rates of 0.5% for trunk amplifiers and 0.15% for line extenders, based solely on warranty repair data.13 This range of values illustrates how difficult it is to be confident of failure rates and how competent engineers can arrive at rates that vary by more than 10:1.

Optical transmitters. Werner and Merk are consistent in reporting 2.3% downstream transmitter failure rates. If CableLabs’ failure rates for other headend equipment of 20–30% are to be believed, then these values are suspiciously low. On the other hand, the base of equipment that went into the latter study varied in age, whereas optical transmitters are likely to be much newer given the relatively recent widespread rollout of HFC systems.

Werner rates upstream transmitters identically to downstream, whereas Merk suggests that upstream transmitters experience a failure rate of only 0.9%. Given that most upstream transmitters operate at much lower powers, do not require a thermoelectric cooler, and do not require a high-power driver, there is some logic for believing a lower upstream value.

Optical receivers. When evaluating optical receivers, it is important to distinguish between the receiver itself and the entire optical node. Since nodes often include full trunk amplifier stations, it is advisable to rate the receiver independently of its host station unless it is a stand-alone unit. Werner’s value here is 1.7%, whereas Merk uses 0.7%, with both authors rating upstream and downstream receivers essentially identical.

Passive devices (taps, power inserters, splitters, directional couplers). MTBF data supplied to CableLabs by manufacturers suggested that passive device failure rates should be on the order of 1%. CableLabs’ field data suggested that actual numbers were closer to 0.1% — a rare case where actual experience was superior to predicted. Werner’s data is even more optimistic at 0.04%. Merk suggests the use of values between 0.07% and 0.26%, depending on the device.

Coaxial connectors. Neither CableLabs nor Merk suggests failure rates for coaxial connectors though CableLabs’ “cable span” failure rate may be a composite of the actual cable and the attached connectors. Werner suggests a failure rate of 0.01%. Although this may seem insignificant, it must be remembered that the signal must pass through at least two connectors for every series-connected device in the system so that a cascade of 5 amplifiers, 1 power inserter, 2 splitters, and 25 taps will include 66 connectors.

The preceding values are for the connectors used for solid-sheathed aluminum cable. From experience, the failure rate (as measured by outages and associated trouble calls) for drop connectors is much higher. Based on actual experience operating numerous systems, it is suggested that 0.25% be used pending better industry statistics. As with distribution connectors, a typical drop system will include several series-connected F connectors. Among all the components considered, the greatest actual field variation may be in the performance of drop connectors owing to both craft sensitivity and the wide range of available connector quality choices.

Fiber-optic cable. The modeling program provided with CableLabs’ report suggests the use of 3% failure rates for fiber cable spans (regardless of length). Merk suggests the use of Bellcore data of 0.44% per mile, whereas Werner suggests the use of 0.44% (presumably per span). Hamilton-Piercy suggests that Bellcore’s data on fiber failures is equivalent to 0.3% per span per year but uses his company’s experiential value of 1.5% per span.14 Clearly, these are widely varying numbers that reflect several variables.

For one thing, the Bellcore data was based on analysis of "FCC-reportable" outages only, meaning those that affect at least 30,000 customers.15 Thus, most instances of fiber damage on the subscriber side of switches went unreported, whereas the reported failures occurred in telephone interoffice cables, which are placed in well-protected trenches because of the high volumes of traffic they carry. Fiber serving small pockets of customers is more economically installed and more likely to be damaged as a result. For example, most of the damaged cables reported by telephone companies were located 4–5 feet underground, well below the depths used in most cable television construction.

Second, the lengths of the spans considered vary from 5 miles or less for typical suburban HFC architectures to 50 miles or more for long-distance telephone links. Finally, available failure data may have been difficult to correlate with the total length of the failed cable.

Given that most fiber cable failures are due to dig-ups and should increase proportionally with distance, a value of 0.1% per mile is suggested pending better data. This is roughly equivalent to Hamilton-Piercy’s value if 15-mile average link distances are assumed, or Werner’s value if 4.4-mile spans are assumed.

Coaxial cable. Coaxial distribution cable might be thought to suffer similar failure rates. CableLabs’ field data, however, suggests 3% annual failures. Even if it is assumed that spans are a half-mile long (about the longest expected between trunk amplifiers), this would be the equivalent of 6% per year per mile, or 40 times the optical cable failure rate. Werner’s 1996 data suggests the use of 0.23% — less than one-tenth the failure rate. Based on the fact that fiber cable is generally armored and therefore both stronger and not as subject to rodent damage, it seems reasonable to use a value of 0.2% per mile for distribution coaxial cable or double the risk factor for fiber cable.

Drop cable is not as well protected, nor is it made to the same quality standards as distribution cable. Analysis of trouble call rates and causes suggests that annual failure rates of drop cables may be in the 1–2% region.

Set top terminals and other operator-supplied terminal equipment. The failure rates of terminal devices will obviously vary considerably with complexity and age. New, nonaddressable terminals will have very low failure rates, whereas first-generation addressable products often failed at several percent per month. Few authors who have modeled cable television failure rates have included terminal data since CableLabs’ definition excludes individual subscriber outages. Based on field failure data from several systems over a number of years, we suggest that a default value of 7% be used, lacking more device-specific data. For purposes of analysis, this failure rate is taken to include nonhardware failures such as the failure of an addressable terminal to receive an enabling command, data entry errors that result in unintended de-authorizations, and power failures that cause the terminal to lose its authorization status for some time after power restoration. From the subscribers’ viewpoint, these are still service outages.

Active network interface devices (NIDs). Where services terminate at the outside wall of residences, the equipment is subjected to greater weather exposure but may be rigidly mounted and is not subject to handling damage. Bellcore suggests that NIDs containing active termination equipment, whether for video or telephony service, should have a net unavailability of about 26 minutes per year.16 If it is assumed that individual customer outages are restored in a mean time of 8 hours, this is equivalent to a 5.4% yearly failure rate.

Headend/rack-mounted equipment. CableLabs’ initial data indicated very high (20–30%) failure rates among headend equipment. In general, however, a failure rate of 5%, averaged over all classifications of headend equipment, is more in line with operators’ experience and will be used in our example calculations.

In the case of telephony, headend units are required to interface between the ports on the switch and the HFC distribution system. Although the term often has a more precise meaning in different contexts, host digital terminal (HDT) will be used to refer to all the headend equipment between the switch and broadband distribution system, considered as a subsystem. Although Bellcore has not predicted the failure rate of HDTs, they have assigned 10 minutes of annual unavailability to this subsystem. On the basis of an assumed 0.5-hour MTTR, the HDT is assigned an annual failure rate of 33%.
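
Both conversions follow from dividing the allocated yearly outage minutes by the assumed restoration time; a minimal check (the function is ours, the allocations and MTTR values are those quoted above):

def implied_annual_failure_rate(outage_minutes_per_year, mttr_hours):
    # Failure rate implied by an unavailability allocation and an assumed MTTR.
    return outage_minutes_per_year / (mttr_hours * 60)

print(implied_annual_failure_rate(26, 8))    # NID: about 0.054, or 5.4% per year
print(implied_annual_failure_rate(10, 0.5))  # HDT: about 0.33, or 33% per year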

In summary, there is reasonable industry agreement on some average failure rates but widely diverging data on others. As the cable industry gains more experience with reliability as an essential element of network design, better data will likely become available for predicting performance. We have a bias toward carefully collected field data derived from systems operating in conditions similar to the system to be analyzed (regional differences in such factors as lightning strikes and the availability of laws requiring utility locates before commencing underground construction can have a major effect on outages). For the purposes of the example discussed later in this chapter, the annual failure rates shown in Table 20.3 will be used.

Table 20.3 Typical individual network component annual failure rates


20.6.4 Estimating MTTR

Although the elapsed time between failures is a function of network architecture and the quality of the components used, the time required to restore service after an outage occurs is almost entirely within the operator’s control and is a function of staffing levels and training in addition to performance-monitoring systems.

Total time to restore service consists of four time segments:17 (1) the time between the start of the outage and when the operator is aware of it, (2) the time for crews to respond to the outage location, (3) the time for the technician to identify the cause of the outage, and (4) the time to repair or replace the failed item and restore service.

Werner quotes average MTTR values varying from 66 to 243 minutes, depending on the diligence of the system operator. Though CableLabs did not put forth an industry goal, they did quote from a Warner Cable guideline calling for 2-hour maximum MTTR for damaged fiber-optic cables.18 Data provided by the telephone industry to the FCC on fiber repairs showed that, in most cases, service was restored within a 2–4- or 4–6-hour window.19

Historically, many cable systems used manual trouble call routing, staffed during business hours only (except for major outages, storms, and so on), and operated many headends without any regularly present technicians. As larger, regional systems offering a wide variety of services have evolved, 24-hour, 7-day staffing of headends is no longer uncommon, and field service staffs often work 2 shifts, with on-call technicians for late-night situations.

As an economic choice, the response to outages is proportional to the number of customers affected. Large outages inevitably clog the customer service telephone lines and may lead to bad publicity since customers already affected by loss of service are doubly inconvenienced by being unable to report the situation.

Based on experience from a number of systems, the MTTR values in Table 20.4 will be used in the example discussed later in the chapter. To the extent that operators do better or worse than these values (which may be overly pessimistic for a well-organized system), the availability will vary from that calculated.20 Factors such as traffic conditions, dispersal of personnel, training, availability of repair materials, and communications systems and utilization will have a major impact on MTTR.

Table 20.4 Typical cable system MTTR values

Failure Level                                MTTR
Headend (24-hour, 7-day staffing)            0.5 hour
Fiber trunk cables                           2 hours
Optical node components                      1 hour
Distribution system coaxial components       4 hours
Individual customer drop or components       8 hours with remote monitoring;
                                             12 hours without remote monitoring (customer call-in delay)

20.6.5 Special Considerations Within Headends and Hubs

When Bellcore set up the availability standards for telephony, they specifically excluded the switch and everything on the network (nonsubscriber) side of the switch. CableLabs, on the other hand, specifically included headends to the extent that a single-channel outage is counted the same as a total system outage.

Those wishing to model cable system reliability factors may wish to consider some interim approach that accounts for all critical common equipment (such as combining amplifiers and headend power) but only some percentage of individual channel-processing equipment. Without such an approach, it can readily be shown that the failure rate of a modern 750-MHz fully loaded headend will be theoretically very high, yet the average viewer will be unaware of those failures if they happen to occur in channels he or she is not watching.

In previously published studies, one of the authors has proposed that viewers, in a typical viewing session, are exposed significantly to 10 channels (that is, they access and view programming on 10 channels in an evening of typical viewing), and he therefore modeled headend reliability based on the sum of the critical common equipment, plus all the equipment required to create 10 channels (satellite antennas, receivers, modulators, and so forth).21 In the example discussed later, it has been assumed that the viewer is exposed to the potential failure of the approximately 30 channel-specific pieces of equipment needed to process the 10 channels viewed, plus 3 common combining amplifiers. For other services, different calculations of headend equipment reliability might be appropriate. As with equipment failure rates, there is little statistically valid field information on the number of channels viewed by typical subscribers.
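
As a rough illustration of how that exposure translates into a headend failure rate, the sketch below applies the series combination of Equation (20.2) to the assumed equipment counts, using the 5% average headend failure rate suggested in Section 20.6.3 (the uniform rate is a simplifying assumption):

# Viewer assumed exposed to about 30 channel-specific units (for the 10 channels
# watched) plus 3 common combining amplifiers, each at a 5% annual failure rate.
units_exposed = 30 + 3
avg_unit_failure_rate = 0.05
headend_exposure_rate = units_exposed * avg_unit_failure_rate   # about 1.65 failures/year
print(headend_exposure_rate)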

A complete analysis of video headend-caused subscriber outages should include not only signal-processing equipment but control equipment, for instance, the telemetry generators that send enabling signals to terminals, addressability control computers, and (for two-way terminals) upstream data receivers.

20.6.6 Special Considerations for Two-Way Systems

Most of the distribution system is shared between upstream and downstream signals, including amplifier stations (though a detailed analysis might assign different failure rates to upstream and downstream modules within the station), power supplies, passive devices, and the cable itself. The optical portion of the plant, however, almost always uses separate fibers for upstream and downstream communications, and always uses separate upstream and downstream optical transmitters and receivers.

When modeling the optical trunking segments, therefore, both upstream and downstream transmitters and receivers are critical to two-way applications and must be included. Assuming the fibers share a common cable, however, the reliability of a pair of fibers will not differ significantly from a single fiber, and so the number would be the same as for one-way transmission. When calculating the failure rate of the headend for a two-way application, it is also necessary to include equipment that processes both incoming and outgoing communications.

20.6.7 Calculation of Failure Rates and Availability

Once the structure of the network is defined and the field failure rates and repair times of the equipment used are known, the next step is to calculate the total failure rate of each segment. This is simply the sum of the failure rates of each of the elements in the segment (Equation (20.2)). Next, using Equations (20.6) and (20.7), it is possible to calculate the availability of each section and the total yearly minutes of outage.

If the only result desired is total system failure rate and availability to the most affected subscriber, the failure rates and outage minutes of each of the critical series-connected sections can be added. The net availability can then be calculated using Equation (20.7). Additional data can be derived from the analysis, if desired, by, for instance, calculating the number of subscribers affected by each outage, calculating the outage rate and availability to each subscriber, and/or calculating the cost of repair for each class of failure. Although this level of analysis is much more complex, it yields a significant amount of additional information about the performance of the network, including

Reliability and availability to the average, as well as most affected, customer

The total number of outages and the total number of customer outages (the summation of each outage multiplied by the number of customers affected)

The total cost of outage repair

The distribution of outages by size

This additional information is very helpful in making design trade-offs that include not only the cost of construction but also the required customer service staffing levels and predicted repair costs.22
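
A minimal sketch of the segment roll-up described above, using invented segment data; each series-connected segment on the path to the most affected subscriber carries its own failure rate and MTTR, and the results simply accumulate:

# Invented example: (annual failure rate, MTTR in hours) for the series-connected
# segments between the headend and the most distant subscriber.
segments = {
    "headend":         (1.65, 0.5),
    "fiber trunk":     (0.10, 2.0),
    "optical node":    (0.12, 1.0),
    "coax cascade":    (0.60, 4.0),
    "drop + terminal": (0.30, 8.0),
}

total_failure_rate = sum(lam for lam, _ in segments.values())             # Equation (20.2)
outage_minutes = sum(lam * mttr * 60 for lam, mttr in segments.values())  # expected outage minutes/year
availability = 1 - outage_minutes / 525_600                               # Equation (20.7), rearranged
print(total_failure_rate, outage_minutes, availability)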

20.7 Analysis of a Typical HFC Network

There have been many published analyses of the reliability and availability of specific network configurations. One of the most comprehensive compared 13 different architectures and found availability values ranging from 99.7% to 99.99% based on 0.3% fiber link failure rates and 10% amplifier failure rates (failure rates for other components were not given). Using 1.5% fiber link failures and 5% amplifier failures, two architectures were found to exceed 99.98% availability: a ring/ring/star and a double star.23

As an illustration of attainable reliability factors, as well as the degree of information available from a comprehensive analysis of failure rate data, we give a complete analysis of a typical HFC cable system.

20.7.1 Architecture

The analyzed system is typical of an early 1990s upgrade. The downstream bandwidth extends to 550 MHz, whereas the return bandwidth is limited to 30 MHz. Although it is logically a simple single star network, several nodes are served from each large fiber-optic cable leaving the headend, and the analysis accounts for this shared risk. Figure 20.3 is a simplified diagram of a typical node. As with Figure 20.1, the tap configuration is not shown though each of the dashed lines contains taps (and sometimes splitters and/or directional couplers). Even though not shown in the Figure, the taps and branching are included in the analysis. The distribution system extending from each node passes approximately 2,000 homes, with a basic penetration rate of 70%. The total number of homes served from the headend is 150,000, split among 75 similar nodes.


Figure 20.3 Simplified schematic diagram of analyzed system.

The coaxial amplifier cascade beyond the node is limited to 4, and the entire node distribution system contains 53 amplifiers. Three power supplies are required to power all the active devices, with a maximum power supply cascade of two. The total number of series-connected taps in any one distribution leg is about 20.

The initial analysis is based on the use of a generator, but no uninterruptible power supply (UPS), at the headend and field standby power supplies with 2-hour battery capacity. It is assumed that this results in a 30-second headend outage every time the commercial power fails (three times per year) until the generator “kicks in,” and that the field standby power supplies have the effect of reducing the field failure rates to 50% per year in each location, with 1 hour of unprotected outage when the batteries do run down (based on dispatching a crew with portable generators as a result of customer-reported outages). It is assumed that there is no status monitoring of power supplies that would have allowed crews with portable generators to back-power supplies before the batteries expire.

20.7.2 Video Services Performance

The network was analyzed for delivery of conventional cable television video services. It was assumed that the viewer was “exposed” to the reliability of 30 pieces of headend channel-processing equipment plus 3 headend combining amplifiers during a typical viewing session, as discussed in Section 20.6.5.

A typical home drop system was assumed, including a set top terminal. No monitoring of the home equipment was assumed so that subscribers were exposed to all failures occurring beyond the tap (since they wouldn’t be known to the network operator until reported by the user), but only a portion of the failures earlier in the network, as discussed earlier under user-perceived reliability factors.

The outage rate per year for the most affected customer is 6.11. Though there are no specific standards for telephony service, this is close to CableLabs' suggested maximum of 0.6 per month for video service. (Even though the CableLabs guideline did not include drop and premises elements, this analysis does since subscriber satisfaction was measured on overall outage rate independent of cause.) Figure 20.4 shows the relative contribution of various network sections to this rate. As can be seen, powering is by far the largest contributor, though headend outages also have a proportionately larger effect on the outage rate than they do on unavailability. The large power contribution reflects the use of a generator, but no UPS, at the headend, so that customers experience a short outage each time the power fails at the headend.


Figure 20.4 Relative network section contributions to outage rate.

The network availability to the most affected customer was calculated to be 0.99951, equivalent to yearly service outage time of about 4.3 hours. This unavailability is six times better than CableLabs’ initial goal for cable television service.

As Figure 20.5 shows, the largest contributor to unavailability is also powering, followed by the drop, headend, coaxial distribution network, and fiber trunking. The large contribution from powering reflects both the limitations of battery capacity and the lack of status monitoring. (In Section 20.7.4, we will see the effect of adding power supply status monitoring.) Although the drop hardware is reasonably reliable, its contribution to total outage time is disproportionately large since the MTTR is much higher than for outages affecting multiple customers. The size of the headend contribution is affected primarily by the large number of moderately reliable units involved in producing a multichannel video lineup. Nevertheless, the headend contribution is proportionately less than for outage rate since the MTTR is much less.


Figure 20.5 Relative network section contributions to outage hours.

A third useful result of the failure analysis is the distribution of multisubscriber outages on the basis of the number of customers affected. This analysis is important because it allows the network operator to size his or her customer service operation knowing the expected volume of outage-related customer calls. Figure 20.6 shows the outage size distribution for the network analyzed.


Figure 20.6 Outage distribution by number of affected customers.

20.7.3 Wired Telephony Services Performance

The network was also analyzed as a delivery means for wired telephony. In that case, there was no exposure to headend video equipment, and in accordance with Bellcore methodology, the switch reliability was not included though the interface between switch and distribution system (the host digital terminal) was. In the case of telephony, the drop system included only the drop itself, plus a network interface device (NID), on the assumption that the in-house wiring and analog instruments should not be included in accordance with Bellcore practice. The failure rates assigned to the HDT and NID resulted in 10 and 26 minutes per year of downtime in accordance with Bellcore’s allocations in TR-NWT-000418.24

When analyzed in this telephony configuration, the outage time is 26% less than for video services, resulting in an availability of 0.999636, whereas the number of outages is 27% less (4.82 per year). Analysis of the causes of the remaining outages shows that power-related issues dominate all others. Although the unavailability is about 3.6 times worse than Bellcore standards for wired telephone service, they are not comparable because of the factors not included in the Bellcore outage definition (this analysis, for instance, includes powering and measures outages from actual occurrence, not from when they are reported).

20.7.4 Effect of Improved Powering on Telephony Performance

If a UPS is added to the headend and status monitoring to the field supplies (so that crews can deliver temporary generators before the batteries lose their charge), then all power-related outages can be eliminated except those due to the reliability of the powering equipment itself. In that case, the telephone service availability improves to 0.99976, and the outage rate declines dramatically to 0.82 per year.25 Having said that, however, it must be pointed out that the cable industry’s experience with early status-monitoring equipment was often that it was less reliable than the equipment being monitored.

As Figure 20.7 shows, powering is still a significant contributor to network unavailability but is less important relative to the drop (dominated by the NID failure rate and slow repair time) and the distribution network. Figure 20.8 shows that the number of large outages has dropped dramatically as a result of improved powering.


Figure 20.7 Reduced effect on unavailability of powering with headend UPS and status monitoring.


Figure 20.8 Reduction in number of large outages with improved powering.

It is clear from this analysis that achieving a true 0.9999 availability to the consumer, should that be required, requires several fundamental changes in this network:

Improved HDT and NID reliability — their Bellcore-assigned outage time allocations consume 68% of the 53-minute total allowed.

Hardened and more reliable powering. The field power supplies installed by most telcos for their HFC networks were extremely reliable (and expensive) by traditional cable television standards.

Shorter cascades of both coaxial equipment and power supplies. In general, telco-rated HFC networks are powered by a single power supply located at the node and have active equipment cascades beyond the node of only two to three.

Reliable status monitoring throughout the network. Since the time required to become aware of a failure and then to analyze its cause is a material part of total MTTR, extensive status monitoring plays an essential part in reducing repair times.

Note that the fiber trunking portion of the network was a minor contributor to unavailability in all cases, even in this simple star network. Thus, use of various ring topologies will reduce the incidence of large outages (sometimes an important consideration) but will not significantly reduce network unavailability in this network. As other factors are brought under control, the fiber contribution will become more important.

20.8 Summary

The estimation of network availability, failure rate, and other reliability-related factors is very straightforward, if somewhat tedious, given a topology, known component failure rates, and repair times. Unfortunately, a great deal of the effort needed to produce accurate reliability predictions is involved in developing believable component and commercial power failure rates. Once a logical model of a proposed network is constructed and entered into a spreadsheet for analysis, it is simple to test the effect of different component choices and operating practices.

The preliminary analysis done in this chapter illustrates that modern HFC networks are easily capable of achieving low customer-experienced outage rates and unavailability for video services.

HFC networks that are designed properly for wired telephone services are capable of achieving the required availability, whereas lower-cost networks may be entirely adequate for various levels of video and data services. It is the function of the network engineer to design the most cost-effective network to carry the desired services.

Endnotes

1. Bellcore TA-NWT-00909, Issue 2, Generic Requirements for FITL Systems Availability and Reliability Requirements, December 1993, p. 13-1. Bell Communications Research, Piscataway, NJ.

2. Brian Bauer, In-Home Cabling for Digital Services: Future-Proofing Signal Quality and Minimizing Signal Outages, 1995 SCTE Conference on Emerging Technologies, Orlando, FL, pp. 95–100. SCTE, Exton, PA.

3. Bellcore TR-NWT-000418, Issue 2, Reliability Assurance for Fiber Optic Systems System Reliability and Service Availability, December 1992, p. 31. Bell Communications Research, Piscataway, NJ.

4. Network Reliability Council, Final Report to the Federal Communications Commission, Reliability Issues: Changing Technologies Working Group, New Wireline Access Technologies Subteam Final Report, February 1996, p. 13.

5. CableLabs, Outage Reduction, September 1992. CableLabs, Louisville, CO.

6. Ibid., p. II-5.

7. Ibid., p. V-10.

8. Network Reliability Council, op. cit., p. 15.

9. Private communication from Andrew Large. The formulas are valid for cases where Tc + MTTR < Ts.

10. This statement is not true when considering signal as opposed to hardware reliability in the upstream direction since ingress and noise from all branches will affect the reliability of communications on any given branch. Also, as will be seen, we are talking about logical, not physical, cascade here, and the powering configuration may require the modifications to what we would normally consider the cascade order. Finally, electrical short circuits in network branches that are not part of the signal cascade can nevertheless affect communications.

11. Network Reliability Council, op. cit., p. 16. The figures are quoted from the paper by Allen L. Black and James L. Spencer entitled An Assessment of Commercial AC Power Quality: A Fiber-in-the-Loop Perspective, published in the Intelec '93 Proceedings.

12. Tony Werner and Pete Gatseos, Ph.D., Network Availability Consumer Expectations, Plant Requirements and Capabilities of HFC Networks, 1996 NCTA Technical Papers. NCTA, Washington, DC, pp. 313–324.

13. Chuck Merk and Walt Srode, Reliability of CATV Broadband Distribution Networks for Telephony Applications — Is It Good Enough?, 1995 NCTA Technical Papers. NCTA, Washington, DC, pp. 93–107.

14. Nick Hamilton-Piercy and Robb Balsdon, Network Availability and Reliability. Communications Technology, July 1994, pp. 42–47.

15. Network Reliability Council, Network Reliability: A Report to the Nation, Compendium of Technical Papers. National Engineering Consortium, Chicago, June 1993, pp. 1–32.

16. Bellcore TA-NWT-00909, Issue 2, Generic Requirements for FITL Systems Availability and Reliability Requirements, December 1993, p. 13-5-6. Bell Communications Research, Piscataway, NJ.

17. Tony Werner and Pete Gatseos, op. cit., p. 318.

18. CableLabs, op. cit., p. V-16.

19. Network Reliability Council, Network Reliability: A Report to the Nation, Compendium of Technical Papers, National Engineering Consortium, Chicago, June 1993, p. 22.

20. One large MSO reports systemwide MTTRs ranging from 1.64 to 2.39 hours for multiple customer outages. The measured systems are all HFC with various sizes of nodes and wide deployment of both fiber node and coaxial amplifier status monitoring. All headends have UPS plus generator backup. Private correspondence from Nick Hamilton-Piercy.

21. David Large, User-Perceived Availability of Hybrid Fiber-Coax Networks, 1995 NCTA Technical Papers. NCTA, Washington, DC, pp. 61–71.

22. David Large, Reliability Model for Hybrid Fiber-Coax Systems, Symposium Record, 19th International Television Symposium and Exhibition. Montreux, Switzerland, June 1995, pp. 860–875.

23. Nick Hamilton-Piercy and Robb Balsdon, op. cit.

24. Bellcore TR-NWT-000418, Table 4, Loop Downtime Allocations — FITL Systems, December 1992, p. 37. Bell Communications Research, Piscataway, NJ.

25. The average monthly video availability among cable systems comprising one large MSO (that has widely deployed status monitoring) varied from 0.9992 to 0.9996 over a 10-month period. Private correspondence from Nick Hamilton-Piercy.
