What is a fault-tolerant system?

Asked 7 years, 1 month ago

When talking about high-scale systems, one often hears about fault-tolerant systems. See, for example, the description of the Elixir programming language:

Elixir is built on top of the Erlang VM, known for running low-latency, distributed, and fault-tolerant systems, while also being used successfully in web development and embedded software. [adapted and translated from elixir-lang.org]

But I don’t understand what it means for a system to have this feature. Does it mean that the system can self-heal from failures? What kinds of failures should a fault-tolerant system cover? And what strategies are used to build one?

1 answer

by Maniero • **444,682** points · Answer 1 · 2018-10-13T20:53:01+00:00

It is a characteristic of systems that can keep operating more or less normally despite failures in some of the parts needed for their operation.

The way to obtain this characteristic varies and may involve hardware or software solutions. Tolerance is almost always obtained through some kind of backup, replication, mirroring, or other redundancy, and it generally also includes monitoring and escalation of actions when something goes wrong. One thing that helps is writing robust code that is always prepared for a failure and can do something useful when one occurs. But remember that tolerance often comes from the infrastructure adopted.
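As a rough illustration of redundancy combined with basic monitoring, here is a minimal sketch in Python. The `Replica` class and the `send_with_failover` function are hypothetical names invented for this example, and a real setup would usually get this behavior from the infrastructure (load balancers, clustered databases and so on) rather than from application code.

```python
class Replica:
    """Hypothetical stand-in for one redundant copy of a service."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send_with_failover(replicas, request):
    """Try each replica in order; return (response, names of replicas that failed)."""
    failed = []
    for replica in replicas:
        try:
            return replica.handle(request), failed
        except ConnectionError:
            # A real system would alert or escalate here; this sketch only records it.
            failed.append(replica.name)
    raise RuntimeError(f"all replicas failed: {failed}")


replicas = [Replica("primary", healthy=False), Replica("mirror")]
response, failed = send_with_failover(replicas, "GET /status")
print(response)  # mirror handled GET /status
print(failed)    # ['primary']
```

The caller still gets an answer even though the primary is down, and the list of failed replicas is the hook where monitoring and escalation would plug in.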

In software solutions, the usual approach is to anticipate problems and prevent them from happening or, once they have occurred, to have some way to retry or to fall back to another path that delivers the desired result. A simple system that detects errors in the software and handles them can already be considered fault tolerant at some level. We usually only use the term when everything is resolved without direct human intervention.
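The retry-then-fall-back idea described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `call_with_retry`, `flaky`, `always_broken` and `cached` are hypothetical names, and a production version would typically catch narrower exception types and use a growing delay between attempts.

```python
import time


def call_with_retry(primary, fallback, attempts=3, delay=0.0):
    """Call primary up to `attempts` times; if all attempts fail, use fallback."""
    for attempt in range(1, attempts + 1):
        try:
            return primary()
        except Exception:
            # Pause before the next attempt, but not after the last one.
            if attempt < attempts:
                time.sleep(delay)
    return fallback()


calls = {"count": 0}


def flaky():
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("temporary failure")
    return "fresh result"


def always_broken():
    raise TimeoutError("permanent failure")


def cached():
    """Alternative path that still delivers something useful."""
    return "cached result"


print(call_with_retry(flaky, cached))          # fresh result
print(call_with_retry(always_broken, cached))  # cached result
```

In both cases the caller receives a usable result without any human stepping in, which is the minimum usually expected before calling something fault tolerant.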

In general this tolerance is limited, and each system should spell out in which cases operation can continue normally. There are always levels at which it fails, and the more types of failure the system needs to tolerate, the more complex it becomes. In some cases it can only be tolerant with a lot of replication across different parts of the world. In others it is enough that, if one software component fails, another takes over the work or delivers some useful result anyway.

That is why the term is often used as marketing when the level of tolerance is not specified.

There is no guarantee that tolerance always allows normal operation, only that the system does not stop completely. In some cases delivering the result is not even the goal; simply not stopping is already a good goal.

It is important that anything that happens in the middle of a failure can be reversed or contained without contaminating other parts.
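One simple way to picture this kind of containment is to run a risky operation on a copy of the data and only keep the copy if everything succeeds. The sketch below assumes in-memory state in a plain dictionary, and `apply_atomically`, `broken_transfer` and `good_transfer` are hypothetical names made up for the example. Real systems usually rely on database transactions or similar mechanisms for this.

```python
import copy


def apply_atomically(state, operation):
    """Run operation on a copy of state; return (new_state, succeeded)."""
    draft = copy.deepcopy(state)
    try:
        operation(draft)
    except Exception:
        # The half-finished draft is discarded, so the original stays clean.
        return state, False
    return draft, True


def broken_transfer(accounts):
    accounts["alice"] -= 50
    raise RuntimeError("crashed before crediting bob")


def good_transfer(accounts):
    accounts["alice"] -= 50
    accounts["bob"] += 50


accounts = {"alice": 100, "bob": 0}

accounts, ok = apply_atomically(accounts, broken_transfer)
print(ok, accounts)  # False {'alice': 100, 'bob': 0}

accounts, ok = apply_atomically(accounts, good_transfer)
print(ok, accounts)  # True {'alice': 50, 'bob': 50}
```

The failed transfer leaves no trace: the debit that happened before the crash never reaches the real data.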

Some mechanisms are very sophisticated, complex and expensive.

There are no tools that do this by magic, as some might wish. Sure, you can pay for a service that gives you something ready-made, but you will never get it without great effort from someone.

For all these reasons it is hard to talk about specific types of failures and strategies: each solution takes shape according to the type of system.