Techniques for fault-tolerant computing are disclosed that require neither fault-tolerant hardware nor a fault-tolerant operating system. The techniques employ a monitor daemon, which is implemented as one or more user processes, and a fault-tolerant library, which can be bound into application programs. A user process executing on ordinary hardware under an ordinary operating system is made fault tolerant by registering it with the monitor daemon. The degree of fault tolerance can be controlled by means of the fault-tolerant library. The library includes a function which defines portions of a user process's memory as critical memory, a function which copies the critical memory to persistent storage, and a function which restores the critical memory from persistent storage. The monitor daemon monitors fault-tolerant processes, and when such a process hangs or crashes, the daemon restarts it. When the techniques are employed in a multi-node system, the monitor daemon on each node monitors one other node in addition to the processes on its own node. In addition, the monitor daemon may maintain copies of the state of fault-tolerant processes running at least on the monitored node. When the monitored node fails, the monitor daemon starts, on its own node, those processes from the monitored node for which it holds state. When a node leaves or rejoins the multi-node system, the node that each monitor daemon monitors is automatically redetermined for the new configuration of the multi-node system.
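
The three library functions described above (define critical memory, copy it to persistent storage, restore it) can be illustrated with a minimal Python sketch. It is not the library disclosed in the patent: the class name `CriticalMemory`, its methods `critical`, `checkpoint`, and `restore`, the pickle file format, and the `run_demo` helper are all assumptions made for illustration. The sketch only shows how a restarted process could resume from its last checkpoint rather than from its initial state.

```python
import os
import pickle
import tempfile


class CriticalMemory:
    """Hypothetical stand-in for the three library functions (names assumed)."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self._critical = {}  # name -> value designated as critical

    def critical(self, name, value):
        """Define (or update) a piece of process state as critical memory."""
        self._critical[name] = value

    def get(self, name):
        """Read back a critical value by name."""
        return self._critical[name]

    def checkpoint(self):
        """Copy the critical memory to persistent storage.

        Writing to a temporary file and renaming it means a crash during
        the write leaves the previous checkpoint intact.
        """
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self._critical, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)

    def restore(self):
        """Restore critical memory from storage; return False if no checkpoint exists."""
        try:
            with open(self.checkpoint_path, "rb") as f:
                self._critical = pickle.load(f)
        except FileNotFoundError:
            return False
        return True


def run_demo():
    """Simulate a crash and restart within one interpreter (illustrative only)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.ckpt")

        first = CriticalMemory(path)
        first.critical("requests_served", 41)
        first.checkpoint()
        first.critical("requests_served", 42)  # never checkpointed, so lost in the "crash"

        # A restarted process starts with empty memory and restores from storage.
        second = CriticalMemory(path)
        restored = second.restore()
        return restored, second.get("requests_served")
```

In this sketch, how often the application calls `checkpoint` determines how much work is lost after a restart, which is one way an application could control its degree of fault tolerance through the library.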

Apparatus and method for fault-tolerant computing
Application Date: May 8, 1996
Publication Date: May 5, 1998
Yennun Huang
Gordon E Nelson
Jeffrey M Weinick
Donald P Dinella
Lucent Technologies
G06F 11/08
G06F 11/00
