# Testability of VLSI

# Lecture 14 Fault Tolerant VLSI Design

By Dr. Sanjay Vidhyadharan

ELECTRICAL

ELECTRONICS

COMMUNICATION

### **Error Detection**

1. An error must first be detected before it can be rectified (tolerated).
2. Even if a detected error is not rectified, a warning should be generated as a *safety* measure.
3. The key to error detection is redundancy. The three classes of redundancy are:
    - Physical (sometimes referred to as "spatial")
    - Temporal
    - Information

### **Physical Redundancy**

Dual modular redundancy (DMR) with a comparator provides excellent error detection: it detects all errors except those due to design bugs, errors in the comparator, and unlikely combinations of simultaneous errors.
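As a rough software analogy of DMR with a comparator, the sketch below runs two replicas on the same inputs and flags any disagreement. The names `dmr`, `good_adder`, `faulty_adder`, and `MismatchError` are hypothetical and chosen only for illustration; real DMR is implemented in hardware.

```python
class MismatchError(Exception):
    """Raised when the two replicas disagree."""


def dmr(module_a, module_b, *args):
    """Run two replicas on the same inputs and compare their outputs."""
    out_a = module_a(*args)
    out_b = module_b(*args)
    if out_a != out_b:  # this comparison plays the role of the comparator
        raise MismatchError(f"replicas disagree: {out_a!r} vs {out_b!r}")
    return out_a


def good_adder(a, b):
    return a + b


def faulty_adder(a, b):
    # Assumed fault model for the demo: bit 2 of the sum is flipped.
    return (a + b) ^ 0b100
```

With one faulty replica, `dmr(good_adder, faulty_adder, 3, 4)` raises `MismatchError`. If both replicas fail identically, as in `dmr(faulty_adder, faulty_adder, 3, 4)`, the wrong result passes unnoticed, which is the simultaneous-error case listed above.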


[1] Fault Tolerant Computer Architecture by Daniel J Sorin


### **Temporal Redundancy**

Temporal redundancy requires a unit to perform an operation twice, one execution after the other, and then compare the results.

The total time is doubled unless the unit is pipelined.

Unlike physical redundancy, temporal redundancy has no extra hardware or power cost (once again ignoring the comparator).
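The idea can be sketched in software as follows. This is an illustration only: `make_adder` and `run_twice` are hypothetical names, and the transient fault is injected artificially on one chosen call.

```python
from typing import Callable, Tuple


def make_adder(fault_on_call: int = 0) -> Callable[[int, int], int]:
    """Return an adder; on call number `fault_on_call` it flips bit 0 (0 = never)."""
    calls = 0

    def adder(a: int, b: int) -> int:
        nonlocal calls
        calls += 1
        result = a + b
        if calls == fault_on_call:
            result ^= 1  # injected transient error
        return result

    return adder


def run_twice(unit: Callable[[int, int], int], a: int, b: int) -> Tuple[int, int, bool]:
    """Perform the same operation twice on one unit and compare the results."""
    first = unit(a, b)
    second = unit(a, b)
    return first, second, first == second
```

A fault that hits only the second execution makes the two results differ, so `run_twice(make_adder(fault_on_call=2), 2, 3)` reports a mismatch.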

### **Information Redundancy**

Information redundancy adds redundant bits to a datum so that an error affecting it can be detected. The simplest example is an odd or even parity check.
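A minimal sketch of a single-bit parity check, assuming data words are held as Python integers. The function names `parity_bit` and `is_consistent` are illustrative.

```python
def parity_bit(data: int, even: bool = True) -> int:
    """Return the parity bit that makes the total number of 1s even (or odd)."""
    ones = bin(data).count("1")
    bit = ones % 2  # 1 if the data has an odd number of 1s
    return bit if even else bit ^ 1


def is_consistent(data: int, stored_parity: int, even: bool = True) -> bool:
    """Recompute the parity of `data` and compare it with the stored bit."""
    return parity_bit(data, even) == stored_parity
```

A single flipped bit changes the parity and is detected. Two flipped bits restore the parity, so a single parity bit does not detect them.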


Classification of Code:

Examples, by Hamming distance (HD):

- HD4: single-error correcting (SEC) and double-error detecting (DED), known as SECDED.
- HD3: can either correct single errors *or* detect single or double errors, but cannot do both.

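For a Hamming code, the number of redundant bits $r$ must satisfy

$$2^{r} \ge m + r + 1$$

where

- $m$ is the number of bits in the original data,
- $r$ is the number of redundant bits to be added, and
- $n = m + r$ is the final number of bits in the coded data string.

Example, $m = 4$:

- $r = 1$: $2 \ge 4 + 1 + 1$ does not hold.
- $r = 2$: $4 \ge 4 + 2 + 1$ does not hold.
- $r = 3$: $8 \ge 4 + 3 + 1$ holds, so three redundant bits are needed.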
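The search for the smallest $r$ can be written as a short loop. This is a sketch; `min_redundant_bits` is an illustrative name.

```python
def min_redundant_bits(m: int) -> int:
    """Smallest r satisfying 2**r >= m + r + 1 for m data bits."""
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r
```

For example, `min_redundant_bits(4)` returns 3, matching the worked example, and `min_redundant_bits(64)` returns 7.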




[1] Fault Tolerant Computer Architecture by Daniel J Sorin


Example: data 0011, even parity.

| Bit position | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|--------------|---|---|---|---|---|---|---|
| Contents     | 0 | 0 | 1 | P<sub>3</sub> | 1 | P<sub>2</sub> | P<sub>1</sub> |
| Coded word   | 0 | 0 | 1 | 1 | 1 | 1 | 0 |

Each parity bit is chosen so that its group has even parity:

- P<sub>1</sub> (positions 1, 3, 5, 7): 0, 1, 1, 0
- P<sub>2</sub> (positions 2, 3, 6, 7): 1, 1, 0, 0
- P<sub>3</sub> (positions 4, 5, 6, 7): 1, 1, 0, 0

A parity bit checks every position whose binary index has the corresponding bit set:

| Position | 2<sup>2</sup> | 2<sup>1</sup> | 2<sup>0</sup> | P<sub>1</sub> (2<sup>0</sup>) | P<sub>2</sub> (2<sup>1</sup>) | P<sub>3</sub> (2<sup>2</sup>) |
|----------|---|---|---|---|---|---|
| 1        | 0 | 0 | 1 | 1 |   |   |
| 2        | 0 | 1 | 0 |   | 2 |   |
| 3        | 0 | 1 | 1 | 3 | 3 |   |
| 4        | 1 | 0 | 0 |   |   | 4 |
| 5        | 1 | 0 | 1 | 5 |   | 5 |
| 6        | 1 | 1 | 0 |   | 6 | 6 |
| 7        | 1 | 1 | 1 | 7 | 7 | 7 |

Assuming erroneous data is read back, the recomputed parity checks form an error code (P<sub>3</sub>, P<sub>2</sub>, P<sub>1</sub>) that gives the position of the faulty bit:

- Error code (1, 1, 1) -> 7: the bit at position 7 is in error.
- Error code (0, 0, 1) -> 1: the bit at position 1 is in error.

Example application: writing to and reading from memory.
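The example can be reproduced with a short sketch of the 7-bit code, using the same bit positions (parity bits at positions 1, 2, and 4; data bits at 3, 5, 6, and 7). The function names are illustrative, and index 0 of the list is an unused placeholder so that `word[i]` is the bit at position `i`.

```python
def encode(d7: int, d6: int, d5: int, d3: int) -> list:
    """Build the 7-bit even-parity code word; word[i] is the bit at position i."""
    p1 = d3 ^ d5 ^ d7  # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7  # covers positions 2, 3, 6, 7
    p3 = d5 ^ d6 ^ d7  # covers positions 4, 5, 6, 7
    return [0, p1, p2, d3, p3, d5, d6, d7]  # index 0 is unused


def error_code(word: list) -> int:
    """Recompute the three checks; the result is the faulty position (0 = none)."""
    c1 = word[1] ^ word[3] ^ word[5] ^ word[7]
    c2 = word[2] ^ word[3] ^ word[6] ^ word[7]
    c3 = word[4] ^ word[5] ^ word[6] ^ word[7]
    return (c3 << 2) | (c2 << 1) | c1


def correct(word: list) -> list:
    """Flip the bit indicated by the error code, assuming at most one bit is wrong."""
    fixed = list(word)
    position = error_code(word)
    if position:
        fixed[position] ^= 1
    return fixed
```

Encoding the data 0011 with `encode(0, 0, 1, 1)` gives the coded word 0011110 (positions 7 down to 1). Flipping the bit at position 7 or position 1 yields error codes 7 and 1 respectively.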

#### **Functional Units**

*General Techniques*. To detect errors in a functional unit, we could simply treat the unit as a black box and use physical or temporal redundancy.

Another general approach to functional unit error detection is to use *arithmetic codes*.

Example: to check the addition $A + B = C$, scale both operands by 10 and compute $10A + 10B = 10C$. If the result is a multiple of 10, it is taken to be error-free. An error that causes the adder to produce a result that is not a multiple of 10 is detected.
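A minimal sketch of this check, using the multiplier 10 from the example. The names `encode` and `checked_add` are illustrative, and the adder is passed in as a function so that a fault can be injected for the demonstration.

```python
K = 10  # code multiplier taken from the example above


def encode(x: int) -> int:
    """Scale an operand so that every valid code word is a multiple of K."""
    return K * x


def checked_add(a_enc: int, b_enc: int, adder) -> int:
    """Add two encoded operands and verify the sum is still a multiple of K."""
    total = adder(a_enc, b_enc)
    if total % K != 0:
        raise ArithmeticError(f"adder result {total} is not a multiple of {K}")
    return total // K  # decode
```

A fault that shifts the sum by a multiple of 10 would still pass this check, so the scheme detects only errors that break divisibility.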


#### **Multipliers**

$$A \times B = C \;\rightarrow\; [(A \bmod M) \times (B \bmod M)] \bmod M = C \bmod M$$

With an appropriate choice of $M$, the modulus operation can be performed with little hardware.

```
6 X 12 = 72
6 Mod5 X 12 Mod5 = 72 Mod5
1 X 2 = 2
```
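The same residue check can be sketched as follows, with $M = 5$ as in the example. The name `residue_checked_multiply` is illustrative, and the multiplier is passed in as a function so that a fault can be injected.

```python
def residue_checked_multiply(a: int, b: int, multiplier, modulus: int = 5) -> int:
    """Multiply a and b, then compare the residue of the product with the
    product of the residues, as in the identity above."""
    product = multiplier(a, b)
    expected = ((a % modulus) * (b % modulus)) % modulus
    if product % modulus != expected:
        raise ArithmeticError(
            f"residue mismatch: {product} mod {modulus} != {expected}"
        )
    return product
```

For 6 × 12 the expected residue is 2, so a faulty product of 73 (residue 3) is rejected. A faulty product that differs from 72 by a multiple of 5 would not be caught.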

#### **Tightly Lockstepped Redundant Cores**

Physical redundancy is used to replicate a core (DMR or TMR). Results are compared after every instruction, or perhaps less frequently. The frequency of comparison determines the maximum error detection latency.





IBM System/390 is a discontinued mainframe


#### **Redundant Multithreading Without Lockstepping**

Simultaneously multithreaded (SMT) cores such as the Intel Pentium 4 provided an opportunity for low-cost redundancy. An SMT core with T thread contexts can execute T software threads at the same time. If an SMT core has fewer than T useful threads to run, then using otherwise idle thread contexts to run redundant threads provides cheap error detection.


#### **Dynamic Verification of Invariants**

Rather than replicate a piece of hardware or a piece of software, another approach to error detection is dynamic verification. At runtime, added hardware checks whether certain invariants are being satisfied. These invariants are true for all error-free executions and thus dynamically verifying them detects errors. The key to dynamic verification is identifying the invariants to check.

*Control Logic Checking:* For a given instruction, some of the control signals are always the same. To detect errors in these control signals, logic is added to compute a fixed-length signature of them, and the core compares this signature to a prestored signature for that instruction.
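The comparison against a prestored signature might look like the sketch below. Everything specific here is an assumption made for illustration: the opcode names, the control-signal bit vectors, and the XOR-folding signature function are not taken from the scheme described above.

```python
def signature(signals, width: int = 8) -> int:
    """Fold a sequence of control-signal bits into a fixed-length signature."""
    value = 0
    for bit in signals:
        value = (value << 1) | (bit & 1)
    mask = (1 << width) - 1
    sig = 0
    while value:
        sig ^= value & mask  # XOR successive width-bit chunks together
        value >>= width
    return sig


# Hypothetical fixed control signals per instruction (illustrative values).
EXPECTED_SIGNALS = {
    "add": (1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
    "load": (1, 1, 0, 0, 1, 0, 0, 1, 1, 0),
}

# Signatures computed ahead of time, standing in for the prestored ones.
PRESTORED = {op: signature(bits) for op, bits in EXPECTED_SIGNALS.items()}


def control_signals_ok(opcode: str, observed_signals) -> bool:
    """Compare the runtime signature with the prestored one for this opcode."""
    return signature(observed_signals) == PRESTORED[opcode]
```

With this folding function, any single flipped control bit changes exactly one signature bit and is therefore detected.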

Data Flow Checking:

*Watchdog Processors*. Most of the invariant checkers we have discussed so far have been tightly integrated into the core. A watchdog processor is a simple coprocessor that watches the behavior of the main processor and detects violations of invariants. A typical watchdog shares the memory bus with the main processor. The invariants checked by the watchdog.




*Using Software to Detect Hardware Errors*. SWAT used mostly simple hardware checks with a little additional software.

Original Code

    add r1, r2, r3       // r1 = r2 + r3
    xor r4, r1, r5       // r4 = r1 XOR r5
    store r4, 0($r6)     // Mem[$r6] = r4

Code with EDDI-like Redundancy

    add r1, r2, r3       // r1 = r2 + r3
    add r11, r12, r13    // r11 = r12 + r13
    xor r4, r1, r5       // r4 = r1 XOR r5
    xor r14, r11, r15    // r14 = r11 XOR r15
    bne r4, r14, error   // if r4 != r14, goto error
    store r4, 0($r6)     // Mem[$r6] = r4

#### **Error Detection in Caches and Memory**

In most computers, the levels of the memory hierarchy below the L1 caches, including the L2 cache and memory, are protected with ECC. The L1 cache is protected either with EDC (as in the Pentium 4, UltraSPARC IV, and Power4) or with ECC (as in the AMD K8 and Alpha 21264).

The choice of error codes represents an engineering tradeoff. Using EDC on an L1 cache, instead of ECC, leads to a smaller and faster L1 cache. However, with only EDC on the L1, the L1 must be write-through so that the L2 has a valid copy of the data if the L1 detects an error. The write-through L1 consumes more L2 bandwidth and power than a write-back L1.


#### **Detecting Errors in Addressing**


Consider the case where a core accesses a memory with address B, and the memory erroneously provides it with the correct data value *at address C*. Even with EDC, this error will go undetected because the data value at address C is error-free.
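The text above does not say how such an error is caught. One commonly described remedy is to fold the address into the check bits, so that data returned from the wrong location fails the check. The sketch below assumes a single parity bit as the EDC. The `Memory` class and its `actual_addr` fault-injection parameter are hypothetical.

```python
def parity(x: int) -> int:
    """Even-parity bit of a non-negative integer."""
    return bin(x).count("1") % 2


class Memory:
    """Toy memory whose check bit covers both the data and its address."""

    def __init__(self):
        self.cells = {}

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = (data, parity(data) ^ parity(addr))

    def read(self, addr: int, actual_addr=None):
        """Read `addr`; `actual_addr` models an addressing fault in which a
        different location answers the request."""
        source = addr if actual_addr is None else actual_addr
        data, check = self.cells[source]
        ok = check == (parity(data) ^ parity(addr))
        return data, ok
```

With only one check bit, this detects a wrong address only when its parity differs from that of the requested address. A stronger code over the address bits would be needed to catch the rest.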




### **Self-Repair**






Design of Fault Tolerant Adders: A Review, Ghashmi H. Bin Talib · Aiman H. El-Maleh · Sadiq M. Sait




The two-pair two-rail checker receives two pairs of inputs, (x0, y0) and (x1, y1), where x0 = y0' and x1 = y1'. The outputs of the checker are also in two-rail form, z0 = z1'.
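The behavior can be sketched with one common textbook gate-level realization of the checker. These specific output equations are an assumption, since the text above does not give them.

```python
def two_rail_checker(x0: int, y0: int, x1: int, y1: int):
    """Two-pair two-rail checker (assumed textbook realization).

    z0 = x0*y1 + y0*x1
    z1 = x0*x1 + y0*y1
    """
    z0 = (x0 & y1) | (y0 & x1)
    z1 = (x0 & x1) | (y0 & y1)
    return z0, z1


def is_two_rail(a: int, b: int) -> bool:
    """A pair is a valid two-rail code word when its bits are complementary."""
    return a != b
```

With these equations, the outputs are complementary exactly when both input pairs are complementary. If either pair carries equal values, the outputs become equal and signal the error.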




P. Kumar, R.K. Sharma / Engineering Science and Technology, an International Journal 19 (2016) 1465–1472


#### **Fault Tolerant Adders with Reversible Gates**

#### **Reversible Gates**


Figure 1: Feynman Gate

When B is zero, the gate acts as a copying gate or a buffer where both the output lines contain the input A. When B is one, the complement of A is obtained at the output Q.
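The behavior described above corresponds to outputs P = A and Q = A XOR B, which can be sketched as follows. The function name `feynman` and the output name P are illustrative; the text names only the output Q.

```python
def feynman(a: int, b: int):
    """Feynman gate: P = A, Q = A XOR B."""
    return a, a ^ b
```

The gate is reversible because each of the four input combinations maps to a distinct output pair, and applying the gate to its own outputs recovers the original inputs.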




**Design of Fault Tolerant Adders: A Review**, Ghashmi H. Bin Talib · Aiman H. El-Maleh · Sadiq M. Sait, Arabian Journal for Science and Engineering, https://doi.org/10.1007/s13369-018-3556-9


#### References

1. Fault Tolerant Computer Architecture by Daniel J Sorin
