Analysis, San Diego, CA, January 1996.
An Approach to Verification and Validation
of a Reliable Multicasting Protocol
John R. Callahan and Todd L. Montgomery
{callahan,tmont}@cerc.wvu.eduNASA Software IV&V Facility
100 University DriveFairmont, WV 26554
304-367-8235, 304-367-8211 (fax)
Abstract: This paper describes the process of implementing a complex communications protocolthat provides reliable delivery of data in multicast-capable, packet-switching telecommunicationnetworks. The protocol, called the Reliable Multicasting Protocol (RMP), was developedincrementally using a combination of formal and informal techniques in an attempt to ensure thecorrectness of its implementation. Our development process involved three concurrent activities:(1) the initial construction and incremental enhancement of a formal state model of the protocolmachine; (2) the initial coding and incremental enhancement of the implementation; and (3)model-based testing of iterative implementations of the protocol. These activities were carried outby two separate teams: a design team and a V&V team. The design team built the first version ofRMP with limited functionality to handle only nominal requirements of data delivery. This initialversion did not handle off-nominal cases such as network partitions or site failures. Meanwhile,the V&V team concurrently developed a formal model of the requirements using a variant ofSCR-based state tables. Based on these requirements tables, the V&V team developed test casesto exercise the implementation. In a series of iterative steps, the design team added newfunctionality to the implementation while the V&V team kept the state model in fidelity with theimplementation. This was done by generating test cases based on suspected errant or off-nominalbehaviors predicted by the current model. If the execution of a test in the model andimplementation agreed, then the test either found a potential problem or verified a requiredbehavior. However, if the execution of a test was different in the model and implementation, thenthe differences helped identify inconsistencies between the model and implementation. In eithercase, the dialogue between both teams drove the co-evolution of the model and implementation.We have found that this interactive, iterative approach to development allows software designersto focus on delivery of nominal functionality while the V&V team can focus on analysis of off-nominal cases. Testing serves as the vehicle for keeping the model and implementation in fidelitywith each other. This paper describes (1) our experiences in developing our process model; and(2) three example problems found during the development of RMP. Although RMP has providedour research effort with a rich set of test cases, it also has practical applications within NASA. Forexample, RMP is being considered for use in the NASA EOSDIS project due to its significantperformance benefits in applications that need to replicate large amounts of data to many networksites. The RMP source code and documentation are currently available on the WWW viahttp://research.ivv.nasa.gov/.
This work is supported by NASA Cooperative Agreement NCCW-0040 under supervision of theNASA Independent Software Verification and Validation (IV&V) Facility, Fairmont, WV.
1.0 Introduction
Much work has been done in the area of verifying that implementations of communicationprotocols conform to their specifications [1,2]. Conformance is usually verified through extensivetesting of an implementation in which tests are derived directly from the protocol specification. Ifan implementation behaves in a manner predicted by the protocol specification, then theimplementation is said to conform to the specification. If not, then an error exists in theimplementation of the protocol. Although this method does not formally verify that a protocolspecification and an implementation are consistent, it represents the state-of-the-practice in thisdomain of software development.
This paper describes our experiences while trying to formally specify and implement a complexcommunications protocol that provides reliable delivery of data in multicast-capable, packet-switching telecommunications networks. The protocol specification, called the ReliableMulticasting Protocol (RMP), was developed concurrently with its implementation. Theimplementation was developed incrementally using a combination of formal and informaltechniques in an attempt to ensure the correctness of its implementation with respect to theevolving protocol specification. We found that many formal methods did not help us in thedevelopment of the protocol specification nor its implementation. We concluded that the best usesfor formal methods in our situation was in the specification of the protocol requirements and thegeneration of tests derived from the specifications applied to prototype versions of the softwareduring development.
One of the primary goals of our effort was to achieve high-fidelity between the specification andimplementation during development. High-fidelity means that the specification model andimplementation agree regarding the behavior of the protocol. We felt that if fidelity was not aprimary concern, then there existed the strong possibility that the specification and theimplementation would diverge in behavior. This would render analysis of any formal specificationmodel irrelevant in the development and maintenance of the software since such analysis wouldoffer little assurance that the actual code behaved in an identical manner.
Our development process involved two teams: a design team and a verification and validation(V&V) team. These two teams worked in an iterative, interactive fashion that allowed the designteam to focus on nominal behaviors of the software while the V&V team examined off-nominalbehaviors. The task of the design team was (1) to specify the protocol in terms of mode tables and(2) implement the protocol in C++ as specified by the mode tables. The task of the V&V teamwas to (1) analyze the consistency and completeness of the mode tables by analyzing \"paths\"through the mode tables and (2) generate tests from the mode tables for suspect conditions.Suspect conditions include those paths identified in the mode table model as being deadlock,livelock, or potential sources of unexpected behaviors. The V&V team used the requirementsmode model to identify cases that were considered by the design team to be unusual or virtuallyimpossible. In retrospect, these cases were the source of several errors in the specification andimplementation of RMP.
We use the terms \"verification and validation\" in a different context from their typical usagebecause of our bipartite, prototyping development process. In our case, the term \"verification\"
1
refers to activities that help in the identification of off-nominal behaviors of the software based onanalysis of the specification model. We use the term \"validation\" to refer to activities that involvetesting the implementation for properties based on potential problems revealed throughverification analysis.
The protocol specification as expressed in the mode tables helped us organize and structure testswhile developing implementation prototypes. Testing formed the dialogue by which the two teamscommunicated about the intended behavior of the protocol and its implementation. This paperrelates our experiences in developing our approach and describes details of our model-basedtesting methods. We do not claim to have \"formally verified and validated\" the RMP specificationand its implementation, but rather we have developed a strategy and process by which theevolution of RMP is enhanced by testing and verification. Our approach has been to study theproblems that have occurred during development, testing, and operation of RMP. Through a post-mortem analysis of problems, we are trying to find methods that may have discovered problemsearlier in the development lifecycle.
2.0 The Reliable Multicasting Protocol (RMP)
The Reliable Multicasting Protocol (RMP) [3] is based on an algorithm originally developed forreliable delivery of data in broadcast-capable, packet-switching networks [4]. The originalalgorithm, which we call the Token Ring Protocol (TRP), allows sites in a packet-switchingnetwork to establish a token ring for distributing responsibility for acknowledgments. A singletoken is passed from site to site around the ring and only the holder of the token (called thecurrent token site) can acknowledge certain data packets. RMP has high-performancecharacteristics because acknowledgments themselves are multicast to all other token ring sites.When a site gets the token (i.e., it becomes the current token site), it multicasts anacknowledgment if and only if it has seen all data packets since the last acknowledgment itreceived. The token is passed in the multicast acknowledgment packet. The acknowledgmentpacket includes the source and sequence numbers of data packets it is acknowledging. This allowseach site to detect if any packets are missing. A site will use negative acknowledgments to requestretransmission of any missing packets. When all packets since the last acknowledgment receivedhave been received by the current token site, then that site can multicast its acknowledgment andthus pass the token to the next site on the ring. When a token site sends an acknowledgment, it isassumed that all data packets since it last held the token have been received by all sites.
The sender of a packet assumes that all messages since it last had the token have been received bythe other sites within a requested quality of service (QoS) level. A packet is marked delivered ifand only if it satisfies its QoS level of delivery. The QoS level allows for resilience of the protocolin the presence of site failures and network partitions. In the case of failures, the token ringreforms itself around the failed site. In the presence of persistent failures, the application programusing RMP must decide to degrade the QoS level or try again.
RMP differs from previous reliable broadcast protocols like TRP in that an acknowledgmentpacket may acknowledge an arbitrary number of data packets. Previous protocols specified thateach data and acknowledgment packets have a one-to-one relationship. This dramatically
2
Event OrderingData(A,1)Data(B,1)
Ack((A,1),(B,1),A,1)Nack(C,1,3)Ack(NULL,C,4)
Data(B,1) [retransmit]Data(A,2)
Ack((A,2),B,5)
CNack(C,1,3)Ack((A,2),B,5)MulticastMediaAData(A,1)Ack(NULL,C,4)Data(A,2)BData(B,1)
Ack((A,1),(B,1),A,1)Data(B,1)
Figure 1: An example of an RMP token ring and events
improves throughput in networks with sporadic losses and allows an application to tradeoffperformance and quality of service requirements.
Each site in a token ring maintains a data structure called an Ordering Queue (OrderingQ) inwhich acknowledgments and data packets are organized based on timestamps. An OrderingQueue is consistent if and only if there are no missing data packets for pending acknowledgments.A missing packet will appear as an empty slot in the OrderingQ that must be filled. When a sitebecomes the token site, all empty slots in the OrderingQ since the last acknowledgment receivedmust be filled. For example, in Figure 1 we show 3 sites of a token ring and a global sequence ofevents. No site has complete knowledge of this sequence. It is only shown to illustrate a possiblescenario. Next to each site is a list of the messages sent by that site. First, site A sends a datapacket signified as Data(A,1) where the first parameter is the sending site and the second is thesequence number of the message. Sequence numbers are unique to individual sites. Second, site Bsends a data packet (Data(B,1)). The initial token site is site B who then acknowledges both datapackets and passes the token to site A. The Ack((A,1),(B,1),A,1) message contains a list ofsource identifiers and sequence numbers for two packets, followed by the next token site and thetimestamp of the acknowledgment.
We assume that site C missed the data packet Data(B,1). Table 1 shows a snapshot of theOrderingQ data structure at site C after it receives the Ack((A,1),(B,1),A,1) message. Uponreceiving this acknowledgment, site C realizes it has missed the Data(B,1) message that should fillthe third slot of the OrderingQ. It knows this because the Data(B,1) packet is listed in the Ackmessage from B. Each slot in an OrderingQ corresponds to a timestamp whether explicit in thecase of Ack messages or implicit in the case of Data packets. Site C will multicast a Nackmessage to request the data packet to fill the one slot in its OrderingQ at timestamp 3.
After a period during which no data packets are transmitted, Site A will time-out andsubsequently send a multicast NULL Ack packet with timestamp 4. This passed the token to siteC. Site B responds to the Nack by retransmitting the Data(B,1) message. The sequence numberidentifies this message uniquely to distinguish it from new messages. After the retransmission ofData(B,1), site A multicasts another data packet with sequence number 2 as Data(A,2). Since site
3
PacketAckDataDataTimestamp123Token Pass or DataB -> A(A,1)missing
Number of Packets2
Table 1: Ordering Queue for Site C with empty slot
PacketAckDataDataAckAckData
Timestamp123456
Token Pass or DataB -> A(A,1)(B,1)A -> CC -> B(A,2)
Number of Packets2
01
Table 2: Final Ordering Queue for Site C
C’s OrderingQ is consistent, it multicasts an acknowledgment of the Data(A,2) packet and passesthe token to site B. Table 2 shows the final configuration of site C’s OrderingQ.
3.0 Approaches to Verification and Validation of RMP
A formal proof of correctness for the original protocol specification exists [5], but we also wantedto ensure a high degree of fidelity between the specification and implementation of the protocol.To achieve this fidelity, we adopted a mode-based, tabular approach based on a variant of SCR-based tables [6] to express the protocol specification instead of the axiomatic approach in theoriginal proof. Table 3 shows a small portion of the protocol specification tables for RMP. Thefirst column shows the current mode. A mode is a superstate that encapsulates a larger set ofspecific states of an implementation [7]. While an implementation may change specific variablesand thus move from state to state, the mode may remain unchanged until a major event andcondition occur. Modes allow the specification to view states of the protocol machine at anappropriate level of abstraction for our analysis. Mode names in Table 3 include TokenSite (thesite holds the token), NotTokenSite (the site does not hold the token), and Getting (the site holdsthe token, but must retrieve missing packets). The second column specifies the event whichincludes the arrival of a packet (data or acknowledgment (ACK)) or a time-out alarm. The thirdcolumn specifies the condition under which a mode transition will occur given the event. In Table3, we show conditions including checks for consistency of the Ordering Queue and checks to seeif an incoming acknowledgment packet names this site as the new token site. We consideredusing condition tables [8] but our approach is currently sufficient for our protocol specification.The fourth column specifies the new mode if the event and condition are true. Finally, the fifthcolumn specifies the action that takes place upon the mode transition. An action includes variablesettings, conditions, and output events.
We used model checking to explore potential problems in the requirements mode model and usedtesting to explore suspect cases in the implementation. These tests helped verify that theimplementation had the same behavior as the specification in specific cases. We tried severaldifferent specification methods for RMP including PVS [9], Murphi [10], SMV [11], and SPIN
4
Current ModeNotTokenSiteNotTokenSiteNotTokenSite
Event/AlarmAckAckAck
Condition
OrderingQ consistentand named token siteNot named token siteOrderingQ inconsistentand named token siteOrderingQ consistentOrderingQ consistent
New ModeTokenSiteAction-
GettingTokenSiteData
Pass Alarm
NotTokenSite-GettingSend Nack
for missingpackets
TokenSite-PassingSend Null
Ack
Table 3: Fragment of RMP specification mode tables
[12]. We settled on a modified version of Murphi since (1) it was amenable to our tabularspecifications and (2) includes temporal logic operators for verification of liveness, deadlock, andinvariant properties of the specification. Tests were generated by hand from suspect cases andadded to the test suite based on analysis of the Murphi models of the RMP specifications.This type of approach to analysis played a major role in our effort even though we hoped thatformal methods would reduce the need for testing. We discovered, however, that testing did nothelp us verify the protocol after its completion but rather it helped us to discover problems duringthe concurrent specification and implementation. We built a test scaffold for RMP by creating alow-level network stub and used testing as the vehicle for keeping our evolving implementationand specification in fidelity with each other. The code was annotated with debugging statementsthat produced a trace of events and conditions. Such traces were compared against thespecification tables to validate the behavior of the implementation relative to the formal model.This approach proved to be very useful since the formal model helped us organize our test suiteand provided an abstract model we could analyze.
We built the protocol specification and its implementation concurrently because pragmaticconstraints of implementing the protocol had a feedback effect on the protocol specification.Performance requirements, programming language peculiarities, and other pragmatic aspects ofthe implementation forced us to consider changes to the requirements during implementation. Weadopted an iterative approach to development because we expected these types of problems tooccur. The design team built the first version of RMP with limited functionality to handle onlynominal requirements of data delivery. This initial version did not handle off-nominal cases suchas network partitions or site failures. Meanwhile, the V&V team concurrently developed theMurphi model of the requirements using the existing mode tables. Based on these requirementstables, the V&V team developed test cases to exercise the implementation. In a series of iterativesteps, the design team added new functionality to the implementation while the V&V team keptthe Murphi state model in fidelity with the implementation. This was done by generating test casesbased on suspected errant or off-nominal behaviors predicted by the current model. If theexecution of a test in the model and implementation agreed, then the test either found a potentialproblem or verified a required behavior. However, if the execution of a test was different in themodel and implementation, then the differences helped identify inconsistencies between the modeland implementation. In either case, the dialogue between both teams drove the co-evolution of themodel and implementation.
5
4.0 Lessons Learned
Most of the problems found in the RMP specifications and implementation were caused byincomplete requirements where it was assumed that certain conditions could not occur butactually did occur in practice. Sometimes, the implementation was coded before the specificationwas updated if a pragmatic consideration made such a change expedient in the code. Other times,we explored solutions in the tables before coding it. Again, the testing between the specificationand implementation during incremental development helped reveal these problems much earlierthan if the process had been more linear.4.1 The Perpetual Getting Problem
As shown in Table 3, a site will transition from NotTokenSite mode to TokenSite mode if theOrderingQ is consistent. If the OrderingQ is not consistent, then the site will enter the Gettingmode while retrieving missing packets. Once the OrderingQ is consistent, the site will transitionfrom Getting mode to the TokenSite mode. This fact was correctly specified in our mode tables,but the implementation was incorrect because a portion of code for the Getting mode did notcheck for consistency of the OrderingQ. The implementation livelocked in the Getting mode in thecase of missing packets.
We were able to discover the problem during analysis for livelock modes using temporalassertions. A pessimistic analysis yielded potential off-nominal paths in the specification. Underideal operating conditions of the protocol, no site should have to enter the Getting mode since noloss occurs under ideal conditions. Indeed, the problem was not discovered in testing on a LocalArea Network where there was no loss of packets unless the network was congested (a rarecondition). Subsequently, no sites ever entered the Getting mode to retrieve missing packets. Themode specifications do not explicitly model the loss of a packet, rather the condition of aninconsistent OrderingQ is an off-nominal behavior when a site becomes the token site. Weconstructed a test case for this scenario and found the problem in the implementation.4.2 The Time-To-Live Problem
RMP relies on an unreliable IP Multicasting layer [13] in which packets have a time-to-live (TTL)field that controls their propagation in Wide Area Networks. At each router, the TTL field of apacket is decremented by 1 and checked to see if it is above or below the router threshold. If theTTL is above the threshold, the router forwards the multicast packet. If not, the packet is notforwarded. This allows control of the propagation of multicast packets to local, national, andworld-wide distribution.
RMP extends the original TRP work by allowing for the initial formation and subsequentmodifications to the token ring membership list during execution. RMP allows sites to join andleave the token ring dynamically. Our implementation, however, overlooked the fact that tokenrings sites can be local to one another (i.e., at low TTL values), but new sites can be very faraway (i.e., at high TTL values). When the far site tries to join a ring, the far site will not see anymessages due to their low TTL values. Subsequently, when the ring fails to pass the token to the
6
far site. This failure will trigger a reformation of the ring to exclude the far site. This situation canrepeat itself ad infinitum as long as the far site keeps trying to join the ring.
Time-to-live information was not included in the mode specifications. Therefore, no analysis ofthe formal model could have revealed this problem and we could not construct a test for thiscondition from the model. We feel, however, that this problem could have been detected duringimplementation when the design team needed to fill in the TTL field of the packets. The designersshould have noted that the requirements are silent on how to fill-in the TTL field of any packetconstructed. This silence invites a designer to make inconsistent assumptions about the behaviorof the protocol machine.
4.3 The Leaving Ring Timestamp Problem
When a token site tries to leave a ring in a controlled fashion (i.e., rather than an abrupt sitefailure), it must wait until the token completes a cycle of the remaining ring sites before actuallyleaving the ring. The reason for this restriction is due to the fact that the departing site may holdpackets that are missing at other sites. If the departing site leaves too soon, then some empty slotsin the Ordering Queues of other sites cannot be filled.
The specifications incorrectly stated that a site may leave the ring when it has seen N timestampedpackets where N is the number of site remaining on the token ring. The problem with thisapproach is that any intervening data packet can fill a timestamp slot causing the departing site toexit the ring before all remaining sites have acknowledged. We incorrectly assumed a one-to-onerelationship between timestamps and acknowledgment packets. As a result, the ring is wedged ina livelock state because sites cannot fill some empty slots in their Ordering Queues.
The problem was found through direct analysis of the formal model and testing revealed theproblem in the implementation. It took unusual conditions, however, to reproduce this problem inpractice because the network had to be congested before the behavior appeared. The formalmodel produced a suspect path and the corresponding test produced a livelock condition. We feelthat this problem was easily revealed by analysis of the formal model. In addition, the formalmodel helped structure exploration of test conditions during the resolution of the problem after itsinitial discovery.
5.0 Conclusions
We do not claim that RMP has been \"verified and validated\" to the extent that it is totally correct,rather that we have developed a technique that strengthens analysis and testing in the long-termdevelopment of our software. Short term problems did occur, but they helped us to evolve aspecification model in high-fidelity with an implementation. Co-evolution of the formalspecification model and the implementation was the most useful result of our study. Our techniqueallowed our two teams to structure their tests and other analysis activities. Their activitiessupported each other in the development of the implementation and refinement of thespecifications.
7
In the future, we will continue to use RMP as a testbed problem and explore new specificationand analysis techniques that complement incremental software development activities. We arecontinuing to evolve the specifications even though the software has been released in a Beta testversion. This type of release scheme limits the use of RMP to non-critical projects and helps usexplore operational problems. When a problem in operation does occur, we are using the modetables to trace where the problem occurred. This has been useful in understanding problems,finding why problems were or were not detected earlier, and refining the specificationincrementally.
References
[1] Luo, G., G. v. Bochmann, and A. Petrenko, Test Selection Based on Communicating Nondeterministic Finite-State Machines Using a Generalized Wp-Method, IEEE Transactions on Software Engineering, Volume 20,Number 2, February 1994, pp. 149-162.
[2] Sidhu, D. P. and T.K. Leung, Formal Methods for Protocol Tetsing: A Detailed Study, IEEE Transactions onSoftware Engineering, Volume 15, Number 4, April 1989, pp. 413-426.
[3] Montgomery, T., Design, Implementation, and Verification of the Reliable Multicasting Protocol, M.S. Thesis,West Virginia University, December 1994.
[4] Chang, J.M. and N.F. Maxemchuk, Reliable broadcast protocols, ACM Transactions on Computer Systems,Volume 2, Number 3, August 1984, pp. 251-273.
[5] Yodaiken, V. and K. Ramamritham, Verification of a Reliable Net Protocol, Formal Techniques in Real-Timeand Fault-Tolerant Systems, January 1992, pp. 193-215.
[6] Heninger, K.L., Specifying software for complex systems: New techniques and their application, IEEETransactions on Software Engineering, Volume 6, Number 1, January 1980.
[7] Jahanian, F. and A. K. Mok, Modechart: A Specification Language for Real-Time Systems, IEEE Transactionson Software Engineering, Volume 20, Number 12, December 1994, pp. 933-947.
[8] Leveson, N. G., M. P. E. Heimdahl, H. Hildreth, and J. D. Reese, Requirements Specification for Process-Control Systems, IEEE Transactions on Software Engineering, Volume 20, Number 9, September 1994, pp. 684-707.
[9] S. Owre, N. Shankar, and J. M. Rushby, User Guide for the PVS Specification and Verification (Beta Release), Computer Science Laboratory, SRI International, 1991.
System
[10] D. Dill, A. Drexler, A. Hu, and C. Yang, Protocol Verification as a Hardware Design Aid, IEEE Conferenceon Computer Design: VLSI in Computers and Processors, IEEE Computer Society Press, October 1992.
[11] J. Burch, E. Clarke, D. Long, K. McMillan, and D. Dill, Symbolic Model Checking for Sequential CircuitVerification, IEEE Transactions on Computer-Aided Design, Volume 13, Number 4, April 1994.
[12] G. J. Holzmann and D. Peled, An improvement in formal verification, Proceedings of the 7th InternationalConference on Formal Description Techniques, FORTE 94, Berne, Switzerland, October 1994.
[13] Deering S., E., Multicasting Routing in Internetworks and Extended LANs, ACM SIGCOMM ’88Symposium, August 1988.
8
因篇幅问题不能全部显示,请点此查看更多更全内容