程序代写案例-SCC 311

Dr. Barry Porter
SCC.311: Fault Tolerance
SCC 311 | Dr. Barry Porter
Overview for today
• Introduction to fault tolerance: concepts and termino
logy
• Failure detection for fail-stop and Byzantine failures
• Server replication schemes for fault tolerance (active vs. passive)
SCC 311 | Dr. Barry Porter
Terminology
• We need to be able to differentiate different kinds of problem in fault
tolerance, so we define specific terms:
• Failure: inability of a system (or subsystem) to perform its required function
• Error: transition of internal state into an invalid state
• Fault: the cause of an error
Not all faults and errors cause a failure, but all failures and errors are caused by a fault
SCC 311 | Dr. Barry Porter
Terminology
• Types of fault:
• Omission: a specific response / expected event does not arrive
• Crash: system stops – a kind of omission where all responses fail to arrive
• Timing: a response arrives outside of its expected window (early or late)
• an omission fault (above) can be seen as a timing fault with infinite time
• Byzantine (or arbitrary): a response occurs but with unexpected / invalid /
malicious contents
Easy
to
detect
Hard
to
detect
SCC 311 | Dr. Barry Porter
Propagation of faults → errors →
failures
Error Error
Error Failure
Internal
fault
External
fault
Not all faults and errors cause a failure, but all failures and errors are caused by a fault
SCC 311 | Dr. Barry Porter
How is a service performing?
• At a high level we can model a service in terms of its availability and
its reliability
Availability
Readiness to offer a service (service responds when requested)
Reliability
Continuity of correct service (service operates without failures)
Examples:
system crashes for 1s every hour: 99.9% available but unreliable
system never crashes but undergoes maintenance once per week: 100% reliable but 98% available)
SCC 311 | Dr. Barry Porter
How is a service performing?
• We can also model quality of service on a scale
• Here the notion of a failure varies
• A web server takes 5 minutes to respond with the correct reply
• This is functionally correct, but is it an acceptable level of service?
• A service exhibits graceful degradation if it avoids total failure with
potentially reduced service
• This is a key principle which gives dev ops / administrator teams time to react
to a failure without loss of all service to users
SCC 311 | Dr. Barry Porter
How common are failures?
• Modern data centres use cheap, commodity server hardware; as a
result, machine failures are common
Google experiences over 1,000 total server failures per year,
with thousands of hard drive failures and multiple power
distribution unit failures per data centre
Microsoft sees an average of 5 network device (switch/router)
total hardware failures per day in its datacentres, with a large
number of transient failures
SCC 311 | Dr. Barry Porter
Common fault tolerance approaches
Replication
Run multiple copies of a service on different hardware units,
either for availability of reliability
N-version design
Design a system in multiple different ways, making it less likely
that all versions will experience the same error
Checkpointing and operation logs
Save the state of a system periodically so that we can recover
from a failure by re-loading the most recent checkpoint
Fault tolerance can generally be enhanced with heterogeneity: e.g. of hardware, software, physical location;
distributed systems are very well placed to take advantage of this effect
SCC 311 | Dr. Barry Porter
Coming up...
• Failure detection (crash-stop and Byzantine)
• Replication (passive and active)
SCC 311 | Dr. Barry Porter
Failure detection in distributed
systems
• The impossibility of detecting crash failures in a distributed system is a
key defining proof which affects the design of fault-tolerant systems[1]
• This proof says that the only way to detect a crash failure is to ask a
computer if it is still alive, and wait for a reply
• However, it is impossible to decide at what point we have waited long enough
to declare a computer has crashed, vs. being slow / busy / network delays
We typically therefore wait for a set amount of time to declare a computer failed, but
be prepared to be wrong if we later get a valid response from that computer
helloooo... ?
[1] Impossibility of distributed consensus with one faulty process, Fischer, Lynch, Paterson,
1985
SCC 311 | Dr. Barry Porter
Failure detection in distributed
systems
• In some cases temporary network errors or partitions can cause
disagreement about the crash status of remote computers
• Server A may thing that server B has crashed, but server C thinks B is alive
• If A and C talk to each other about B, they're going to disagree...
helloooo... ?
helloooo... ?
hi!
A
C
B
SCC 311 | Dr. Barry Porter
Byzantine failures
• If a server crashes we can detect its failure with a timeout and design
a protocol to continue as normal afterwards
• If a server starts sending us garbage data or malicious data, this can
disrupt the orderly flow of a communication protocol and cause an
entire distributed system to enter erroneous states
• These conditions are much harder to detect than a server crash
SCC 311 | Dr. Barry Porter
Byzantine failures
• Handling this kind of failure is often called the Byzantine Generals problem
• based on a fictional story developed by Leslie Lamport to help explain the
distributed systems consensus problem
SCC 311 | Dr. Barry Porter
Byzantine failures
• Handling this kind of failure is often called the Byzantine Generals problem
• based on a fictional story developed by Leslie Lamport to help explain the
distributed systems consensus problem
We need to coordinate!
SCC 311 | Dr. Barry Porter
Byzantine failures
• Handling this kind of failure is often called the Byzantine Generals problem
• based on a fictional story developed by Leslie Lamport to help explain the
distributed systems consensus problem
We need to coordinate!
All communication is
by message-passing,
where messages can
take any amount of time
to arrive, and may be
corrupted in transit
SCC 311 | Dr. Barry Porter
Byzantine failures
• Handling this kind of failure is often called the Byzantine Generals problem
• based on a fictional story developed by Leslie Lamport to help explain the
distributed systems consensus problem
We need to coordinate!
"Get cheese!"
"Don't go!"
"Don't go!"
A malicious actor can
exploit message-passing
to send different things
to different nodes; this
can cause catastrophic
failures
Malicious
actor
SCC 311 | Dr. Barry Porter
Byzantine failures
• Handling this kind of failure is often called the Byzantine Generals problem
• it is now known that solving this problem for n malicious or misbehaving computers
requires 3n+1 computers (this is proven to be necessary and sufficient[1])
• with this many computers, and only n malicious ones, we can solve the problem if
everyone tells everyone else everything that they know
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
go
go
stop
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
go
go
stop
B: Ago
C: Ago
D: Astop
Each node notes what it
has heard from every
other node
A is malicious; there are 3n + 1 = 4 nodes in total
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
go
go
stop
C D
go
go
B D B D
go go
stop
stop
B: Ago, Cgo, Dstop
C: Ago, Bgo, Dstop
D: Astop, Bgo, Cgo
When each node has
heard from every other
node in the group, we
make a decision based
on the majority view
(if any)
A is malicious; there are 3n + 1 = 4 nodes in total
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
go
go
B: Ago
C: Ago
D: Astop
What if we hear nothing
for a long time?
- we time-out & use a
“non” value from that
node (we assume good
node are responsive)
A is malicious; there are 3n + 1 = 4 nodes in total
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
go
go
C D
go
go
B D B D
go go


B: Ago, Cgo, D
C: Ago, Bgo, D
D: A, Bgo, Cgo
When each node has
heard from every other
node in the group, we
make a decision based
on the majority view
(if any)
A is malicious; there are 3n + 1 = 4 nodes in total
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
g
s
s
B: Ag
C: As
D: Ag
A and D are malicious; there are 3n + 1 = 7 nodes in total
E F G
s g g
E: As
F: Ag
G: As
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
g
s
s
A and D are malicious; there are 3n + 1 = 7 nodes in total
E F G
s g g
Imagine node D sends
alternate g/s to every
other node; now we
have a problem!
(our majority at this
point is conflicting)
We use n+1 rounds of
communication to
share everything that
everyone knows,
allowing us to isolate
bad actors
R1
R2
R3
B: Ag, Cs, Dg, Es, Fg, Gs
C: As, Bg, Ds, Es, Fg, Gs
D: Ag, Bg, Cs, Es, Fg, Gs
E: As, Bg, Cs, Dg, Fg, Gs
F: Ag, Bg, Cs, Ds, Es, Gs
G: As, Bg, Cs, Dg, Es, Fg
SCC 311 | Dr. Barry Porter
Byzantine failures
[1] Reaching Agreement in the Presence of Faults, Marshall, Shostak, Lamport, 1980
A
B C D
g
s
s
B: Ag, Cs, Dg, Es, Fg, Gs
C: As, Bg, Ds, Es, Fg, Gs
D: Ag, Bg, Cs, Es, Fg, Gs
A and D are malicious; there are 3n + 1 = 7 nodes in total
E F G
s g g
E: As, Bg, Cs, Dg, Fg, Gs
F: Ag, Bg, Cs, Ds, Es, Gs
G: As, Bg, Cs, Dg, Es, Fg
Imagine node D sends
alternate g/s to every
other node; now we
have a problem!
(our majority at this
point is conflicting)
We use n+1 rounds of
communication to
share everything that
everyone knows,
allowing us to isolate
bad actors
R1
R2
R3
In this new round, B will
send its entire state
vector to everyone else,
and everyone else will
do the same
SCC 311 | Dr. Barry Porter
Replication
• We can use replication for either fault tolerance or performance
• For fault tolerance, passive and active replication are the most common styles
• Passive replication uses a primary replica to process all requests, and one or
more backup replicas kept up to date by the primary
• Active replication uses a group of replicas while all process every request –
combined with a strategy to keep all members of the group up to date
SCC 311 | Dr. Barry Porter
Replication // General Client Client
Front
End
Replica Replica
Replica
Client
Issue Request: client request sent to
one or more replicas
Coordination: replicas agree on
request ordering and yes/no to execute
Execution: replicas perform request
(assuming agreement to do so)
Agreement: replicas reach consensus
on request outcome
Response: replica(s) reply to client
request
1
2
3
4
5
SCC 311 | Dr. Barry Porter
Passive Replication
• We use a "front end" node as the
contact point for clients
• If the primary replica fails, the front
end will direct requests to one of
the other replicas as a new primary
• The "coordination", "execution", and
"agreement" phases are all decided
by the primary replica alone
Client Client
Front
End
Replica Replica
Primary
replica
Client
Supports crash failures
SCC 311 | Dr. Barry Porter
Passive Replication
• In passive replication we never show the client a response until
we are sure that the state of our replica set is consistent
Client
Backup
Primary
replica 1
2
C
In this example, the client sees a state which is inconsistent with what it was last told
SCC 311 | Dr. Barry Porter
Passive Replication
• In passive replication we never show the client a response until
we are sure that the state of our replica set is consistent
Client
Backup
Primary
replica 1
2
C
Here we make sure the backup replicas are up to date before sending a response; slower, but gives consistent view
SCC 311 | Dr. Barry Porter
Passive Replication
• What if the primary crashes before responding to the client?
Client
Backup
Primary
replica 1
2
C
Client may re-send request, believing the first one failed; again this causes an inconsistent client/server view
resend request
x3
SCC 311 | Dr. Barry Porter
Passive Replication
• What if the primary crashes before responding to the client?
• Using monotonically increasing request IDs lets us check...
Client
Backup
Primary
replica 1
2
C
Now the client and server views remain consistent...but what about multiple clients?
SCC 311 | Dr. Barry Porter
Passive Replication
• What if the primary crashes before responding to the client?
• Using monotonically increasing request IDs lets us check...
• ...and using unique client IDs allows us to differentiate clients
Client
Backup
Primary
replica 1
2
C
SCC 311 | Dr. Barry Porter
Active Replication
• The front end service sends each
client request to all replicas
• This requires a reliable group
communication service to ensure
that each request actually reaches
every replica and in correct order
• Front end checks for agreement between
all replicas on response value
• We can use this approach to vote
on replica replies to check for errors
Client Client
Front
End
Replica
Client
ReplicaReplica
Supports crash failures Supports Byzantine failures
SCC 311 | Dr. Barry Porter
Active Replication
• With multiple clients, we need to
ensure that requests arrive at each
replica in the same order
• We use a total ordering group
communication scheme to achieve this
• All replicas should produce an identical
response to reach request; if not,
something has gone wrong and
can take a majority vote
Client Client
Front
End
Replica
Client
ReplicaReplica
Supports crash failures Supports Byzantine failures
SCC 311 | Dr. Barry Porter
Replication comparison
Passive Active
Communication overhead Low High
Processing overhead Low High
Recovery overhead High Low
Fault model Crash fault only Byzantine faults
Less expensive
Less complex
More expensive
More complex
SCC 311 | Dr. Barry Porter
Summary
• We've covered an introduction to fault tolerance concepts, from fault
propagation to failure detection
• Discussed Byzantine fault detection in detail, and two different
replication schemes used for fault tolerance
SCC 311 | Dr. Barry Porter
Further reading
• Section 7.5 (Distributed Commit) and 7.6 (Recovery) of Tanenbaum &
van Steen; Sections 16 & 17 of Coulouris & al
• Chapter 7. Fault Tolerance of Tanenbaum & van Steen; Chapter 18
Coulouris & et. al

欢迎咨询51作业君
51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: abby12468