Scientific Activities

 

The scientific topics that have been addressed cover the various aspects of test and reliability of System-on-Chip. The main list of addressed topics is reported below.

Test and Reliability of Systems Based on Approximate Computing

 

In recent decades, the demand for computing efficiency has grown constantly. On the one hand, the relevance of new-generation, power-hungry applications keeps increasing; on the other hand, low-power portable devices are increasingly deployed in the consumer market. Therefore, new computing paradigms are necessary to cope with the competing requirements introduced by modern technologies. In recent years, several studies on Recognition, Mining and Synthesis (RMS) applications have been conducted. A very interesting peculiarity has been identified, namely the intrinsic resiliency of those applications. Such a property allows RMS applications to be highly tolerant to errors. This is due to different factors, such as the noisy data processed by these applications, the non-deterministic algorithms used, and the possible non-unique answers. These properties have been exploited by a new and increasingly established computing paradigm, namely Approximate Computing (AxC).

AxC cleverly profits from the intrinsic resiliency of RMS applications to achieve gains in terms of power consumption, run time, and/or chip area. Indeed, by introducing selective relaxations of non-critical specifications, some parts of the target computing system can be simplified, at the cost of a slight accuracy reduction. Additionally, AxC is able to target different layers of computing systems, from hardware to software. In this work, we focus on Approximate Integrated Circuits (AxICs), which are the outcome of applying AxC at the hardware level, specifically to Integrated Circuits (ICs). In particular, we focus on IC functional approximation. Functional approximation has been employed in the last few years to design efficient AxICs (in terms of area, timing, and power consumption) by systematically modifying the IC functional behavior, thus introducing controlled errors. To measure the error produced by an AxIC, several error metrics have been proposed in the literature. As a consequence of the increasing relevance of AxICs, it becomes important to address the new challenges of testing such circuits. In this respect, some previous works drew attention to the challenges that functional approximation entails for testing procedures. At the same time, opportunities come with IC functional approximation. More specifically, the concept of an acceptable circuit changes: while conventionally a circuit is good if its responses are never different from the expected ones, in the AxIC context some unexpected responses might still be acceptable, according to the maximum threshold of the error metric. Therefore, some acceptable defects may be left undetected, ultimately leading to a production yield gain (i.e., the percentage of acceptable circuits, among all fabricated circuits, increases).

In recent years, several works have been presented to classify AxIC faults into non-redundant and ax-redundant (i.e., catastrophic and acceptable, respectively), according to an error threshold (i.e., the maximum tolerable amount of error). As a result of this classification, two lists of faults are obtained (i.e., non-redundant and ax-redundant). Consequently, the Automatic Test Pattern Generation (ATPG) targets only the non-redundant faults. The obtained tests prevent catastrophic failures from occurring, by detecting non-redundant defects. However, to actually achieve the expected yield gain, test patterns must avoid detecting the ax-redundant faults. Otherwise, defective yet acceptable AxICs are rejected, resulting in some yield loss. In this work, we provided the following contributions. First, we showed the conventional testing problems that prevent the achievement of the expected yield gain. Then, we proposed a new test application technique to actually achieve the expected yield gain. The technique is applicable regardless of the specific error metric and of the specific test pattern generation technique used.
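As an illustration, the classification step described above can be sketched as follows; the fault names and the per-fault error values are invented for the example, and the error measure stands in for whichever error metric is adopted.

```python
# Hypothetical sketch of the AxIC fault classification step: faults whose
# worst-case error stays within the threshold are ax-redundant (acceptable);
# the others are non-redundant (catastrophic) and must be targeted by ATPG.

def classify_faults(fault_errors, threshold):
    """Split faults into non-redundant (error > threshold) and
    ax-redundant (error <= threshold) sets."""
    ax_redundant = {f for f, e in fault_errors.items() if e <= threshold}
    non_redundant = set(fault_errors) - ax_redundant
    return non_redundant, ax_redundant

# Worst-case error introduced by each fault (illustrative values).
errors = {"f1": 0, "f2": 3, "f3": 12, "f4": 1, "f5": 40}
non_red, ax_red = classify_faults(errors, threshold=4)
print(sorted(non_red))  # faults the ATPG must target
print(sorted(ax_red))   # faults the test patterns should leave undetected
```

The second list is the one that matters for the yield gain: any pattern that accidentally detects an ax-redundant fault rejects an acceptable circuit.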

This work has been carried out by Marcello Traiola (Ph.D. student – LIRMM). It has been presented at the DDECS 2017 International Conference [C15] and at the DATE 2020 International Conference [C24]. Further developments and contributions related to this work have been presented at the EWDTS International Symposium [C17] and at the WAPCO international workshop [In4], have been published in the IEEE Transactions on Nanotechnology journal [J7], and will appear in the Proceedings of the IEEE [J10].

Design of Fault Tolerant Architectures

 

The selection of the ideal trade-off between reliability improvement and cost of a fault-tolerant architecture employed for hardening depends on the safety level required by the criticality of the application, the environmental radiation level, and the technology used. It is important to strike a balance that best suits the design cost budget and the acceptable error rate constraint. This balance can be achieved by applying a selective hardening technique that allows the designer to create an optimized architecture by selecting only the most sensitive circuit parts to be hardened. This improves the error rate at acceptable area and power overheads.

To achieve selective hardening, two steps are necessary. Firstly, we must identify the most sensitive circuit parts according to their contribution to the overall error rate. Secondly, we must apply an error detection scheme to the selected sensitive parts. In the literature, most selective hardening approaches focus on improving the vulnerability analysis methodology. Moreover, they exploit existing error detection architectures. Accurately estimating circuit-level vulnerability generally requires the consideration of three masking effects (logical, electrical and latching-window). These masking effects prevent transient pulses from getting latched in flip-flops. However, simulating the three masking effects requires massive computational effort. Therefore, alternative techniques have been explored in the literature. Some works consider all three masking effects and rely on approximate abstract models. Others resort to only one or two masking effects to identify the circuit elements with the highest impact on the soft error rate. For the specific case of arithmetic circuits, some authors developed fault-tolerant adders using error-detecting schemes. These solutions achieve area gains of about 25% compared to full duplication schemes, but are restricted to adder circuits.

Recently, we have proposed a very fast, low-computational-effort method that helps select the most sensitive parts of a logic design and identify the degree of hardening necessary to fulfill the design cost (in terms of area and power) and soft-error reliability constraints. Based on this very fast reliability analysis, called structural susceptibility analysis, we also proposed a selective hardening technique using the previously presented HyTFT (Hybrid Transient Fault Tolerant) architecture. By reducing the number of output nodes of the CL (Combinational Logic) copy compared with a full version of the circuit, this selective hardening approach not only reduces the size of the comparator but also significantly reduces the size of the duplicated CL copy in a vulnerability-aware manner. Even if the use of this structural susceptibility analysis leading to the HyTFT architecture has proven to be more efficient in terms of area and power consumption than a full duplication scheme, it does not consider any error metrics. Our work during this year focuses on proving that AxC (Approximate Computing) can lead to a better trade-off when used in a duplication scheme. To prove it, we use arithmetic circuits, as they have quantifiable error metrics such as the EP (Error Probability) and the WCE (Worst-Case Error), which measure the frequency and the magnitude of the errors, respectively.
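To make the two metrics concrete, the following minimal sketch (our own example, not one of the circuits used in the study) computes the EP and the WCE of a simple approximate 8-bit adder that truncates the two least-significant bits of each operand before adding:

```python
# Illustrative only: exhaustively compare an exact 8-bit adder with a
# truncation-based approximate adder to compute EP and WCE.

def exact_add(a, b):
    return a + b

def approx_add(a, b, cut=2):
    # Functional approximation: drop the 'cut' least-significant bits of
    # each operand; the result is therefore always a multiple of 2**cut.
    return ((a >> cut) + (b >> cut)) << cut

errors = [abs(exact_add(a, b) - approx_add(a, b))
          for a in range(256) for b in range(256)]
ep = sum(e != 0 for e in errors) / len(errors)  # Error Probability
wce = max(errors)                               # Worst-Case Error
print(f"EP = {ep:.3f}, WCE = {wce}")            # prints: EP = 0.938, WCE = 6
```

For this toy adder the error equals the sum of the truncated low-order bits, so the WCE is 3 + 3 = 6, while the EP is high because only operand pairs with both low-order parts at zero are exact.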

In this study, we analyze the impact of the selective hardening technique proposed earlier by comparing different duplication techniques implemented in an error detection architecture suitable for arithmetic circuits. We explore four different duplication scenarios: i) a full duplication scheme; ii) a reduced duplication scheme based on the structural susceptibility analysis; iii) a reduced duplication scheme based on the logical weights of the arithmetic circuit outputs; and iv) a reduced duplication scheme based on an approximated structure from a public benchmark suite composed of arithmetic circuits. Note that all the considered scenarios are built independently of the workload. Experimental results achieved on adders and multipliers demonstrate the benefit of using approximate structures as the duplication scheme, since both area overhead and power consumption are reduced compared to a full duplication scheme, while maintaining good levels on the error metrics. Note that the arithmetic circuits used as case studies in our experiments (8-bit adders and 8- to 16-bit multipliers) are relatively small compared to the comparator needed to build the duplication scheme. Consequently, considering the area and power overhead of the comparators would negatively affect the reliability comparisons between the four considered scenarios. For this reason, all experiments have been done without considering the area and power overhead due to the comparators. This may slightly bias the results from a quantitative point of view, but it does not jeopardize the main conclusion about the benefit of using approximate structures as the duplication scheme. Moreover, to corroborate the experimental results, we ran a set of simulation-based gate-level transient fault injections. They show that using approximate structures as the duplication scheme offers a better reliability level compared to the other considered duplication scenarios.

This work has been performed by Bastien Deveautour (Ph.D. student – LIRMM). It has been published in JETTA – Journal of Electronic Testing, Springer, in 2017 [J4] and in December 2019 [J9].

Computing-in-Memory Architecture

 

Today’s computing devices are based on CMOS technology, which is subject to the well-known Moore’s Law predicting that the number of transistors in an integrated circuit doubles approximately every two years. Despite the advantages of technology shrinking, we are facing the physical limits of CMOS. Among the multiple challenges arising from technology nodes smaller than 20 nm, we can highlight the high leakage current (i.e., high static power consumption), reduced reliability, a complex manufacturing process leading to low production yield, a complex testing process, and extremely costly masks.

Additionally, the expected never-ending increase in performance is indeed no longer a reality. Looking in more detail, the classical computer architectures, either von Neumann or Harvard, separate the computational element (i.e., the CPU) from the storage element (i.e., the memory). Therefore, data have to be transferred from the memory to the computational element in order to be processed, and then transferred back to be stored. The main problem of this paradigm is the bottleneck due to the data transfer time, limited by the bandwidth.

Many new technologies are under investigation, among them the memristor is a promising one. The memristor is a non-volatile device able to act as both storage and information processing unit. It introduces many advantages: CMOS process compatibility, lower manufacturing cost, zero standby power, nanosecond switching speed, great scalability, high density and non-volatile capability. Thanks to its inner nature (i.e., computational as well as storage element), the memristor is exploited in different kinds of applications, such as neuromorphic systems, non-volatile memories and computing architectures for data-intensive applications.

In this work, we have developed an automatic tool for mapping a given Boolean function onto a memristor crossbar. Thanks to this tool, we are now able to analyze two kinds of logic synthesis (i.e., 2-level and multi-level). The experimental results can serve as a reference (benchmark) for comparing future works.
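As a rough illustration of the kind of figure the 2-level analysis produces, here is a hypothetical cost model (our own simplification, not the actual tool), assuming one crossbar row per product term plus one row for the OR stage, and one column per literal plus an output column:

```python
# Hypothetical crossbar footprint estimate for a sum-of-products mapping.
# The row/column accounting is an illustrative assumption, not the exact
# mapping scheme implemented by the tool.

def sop_crossbar_size(products, n_vars):
    """products: list of product terms, each a set of literals such as
    {'a', "b'"} (a variable or its complement)."""
    rows = len(products) + 1  # one row per product term, +1 for the OR stage
    cols = 2 * n_vars + 1     # true/complement columns, +1 output column
    return rows, cols

# f = a.b + a'.c over the variables {a, b, c}
rows, cols = sop_crossbar_size([{"a", "b"}, {"a'", "c"}], n_vars=3)
print(rows, cols)
```

Comparing such counts between 2-level and multi-level realizations of the same function is the kind of benchmarking the tool enables.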

This work has been performed by Marcello Traiola (Ph.D. student – LIRMM), with the contribution of Umberto Ferrandino (Master Student – University of Naples) and Dario Manone (Master Student – Politecnico di Torino). Related publications are [C12, C14, C16].

Intra-Cell Defect Test and Diagnosis

 

The ever-increasing growth of the semiconductor market results in an increasing complexity of digital circuits. Smaller, faster, cheaper and lower-power devices are the main challenges in the semiconductor industry. The reduction of transistor size and the latest packaging technologies (i.e., System-On-a-Chip, System-In-Package, Through Silicon Via 3D Integrated Circuits) allow the semiconductor industry to meet these challenges. Although producing such advanced circuits can benefit users, the manufacturing process is becoming finer and denser, making chips more prone to defects. In modern deep submicron technologies, systematic defects are becoming more frequent than random defects.

Today, systematic defects appear not only in the cell interconnections, but also inside the cell itself (intra-cell defects). In the literature, existing works prove that these defects can escape classical test solutions. However, although previous work already proved that classical test sets lead to a low coverage of intra-cell defects, none of it characterizes the applied test set from the diagnostic point of view. Basically, the question is how effective the applied test set is at diagnosing such defects.

In this work, we have developed a defect grading tool able to characterize a given test set with respect to intra-cell defect coverage and diagnosability. This tool is composed of two main parts: (1) the library cell characterization and (2) the deductive fault simulator engine.

Stefano Bernabovi from the Politecnico di Torino has carried out this work. Stefano Bernabovi was a master student who spent 6 months at LIRMM in the context of the LAFISI. The work carried out by Stefano has been presented at the DDECS International Symposium [C1].

Scan-Chain Intra-Cell Defects Grading

 

For reasons similar to those mentioned above, although previous work already proved that classical test sets lead to a low coverage of intra-cell defects, none of it has investigated the issues related to scan-chain testing in the presence of intra-cell defects.

Usually, the scan chain test is performed by applying a so-called shift test. A toggle sequence “00110011…” is shifted into the scan chain and shifted out. The applied sequence produces all possible transitions at the input of each scan flip-flop. In this way, the correctness of the shift operations is verified, and the detection of stuck-at and transition faults in the scan flip-flop interconnections is also guaranteed. Moreover, the above sequence can cover the detectable intra-cell defects of each scan flip-flop.
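A minimal simulation (our own sketch; the chain length and sequence length are illustrative) shows why the toggle sequence exercises every value pair at each flip-flop input:

```python
# Shift the "0011..." toggle sequence through a small scan chain and record
# the (old, new) value pairs observed at each scan flip-flop input.

def input_transitions(sequence, chain_len):
    chain = [0] * chain_len                  # scan FF outputs, reset to 0
    seen = [set() for _ in range(chain_len)]
    prev_inputs = [None] * chain_len
    for bit in sequence:
        inputs = [bit] + chain[:-1]          # value presented to each FF
        for i, v in enumerate(inputs):
            if prev_inputs[i] is not None:
                seen[i].add((prev_inputs[i], v))
        prev_inputs = inputs
        chain = inputs                       # shift: each FF captures its input
    return seen

toggle = [int(b) for b in "0011" * 4]        # "0011001100110011"
for i, s in enumerate(input_transitions(toggle, chain_len=3)):
    print(i, sorted(s))
```

Each flip-flop input ends up seeing all four pairs (0,0), (0,1), (1,0) and (1,1), which is exactly the property the shift test relies on.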

Despite the fact that the shift test is widely used in practice, it has been proven that some intra-cell defects can escape detection, because the scan chain test is applied only when the flip-flops are in test mode. It is thus mandatory to analyze and quantify the intra-cell defect escapes to further develop meaningful test solutions.

In this work, we have proposed an evaluation of the effectiveness of the ATPG test patterns in terms of coverage of intra-cell defects affecting scan flip-flops. Experimental results have shown that a meaningful test solution has to be developed to improve the overall defect coverage of scan chain testing.

Aymen Touati from LIRMM has carried out this work. Aymen Touati is a PhD student who spent 1 month at the Politecnico di Torino in the first part of 2015. This work has been presented at the DTIS International Conference [C4].

Power Aware Test of Integrated Circuits

 

Nowadays, semiconductor product design and manufacturing are affected by the continuous CMOS technology scaling. At the same time, high operation speed and high frequency are mandatory requirements, while power consumption is one of the most significant constraints, and not only due to the large diffusion of portable devices. This influences not only the design of devices, but also the choice of appropriate test schemes, which have to deal with production yield, test quality and test cost.

Testing for performance, required to catch timing or delay faults, is mandatory, and it is often implemented through at-speed testing. Usually, performance testing involves high frequencies and high switching activity (SA), thus triggering significant power consumption. As a consequence, performance testing may produce yield loss due to over-stress, e.g., when a good chip is damaged during testing. Yield loss may also occur when a good chip is declared faulty during test (again due to over-stress). Hence, reducing test power is mandatory to minimize the risk of yield loss; however, some experimental results have also proved that too much test power reduction might lead to test escapes and create reliability problems because of the under-stress of the circuit under test (CUT).

Hence, it is crucial to know exactly the maximum peak power consumption achievable by the CUT under normal operating conditions. In this way, the designer can check whether it can be tolerated by the technology, and the test engineer can develop proper stimuli to force the CUT to work in these conditions.

In this work, we have proposed to speed up a previously proposed automatic functional pattern generation approach by exploiting an intelligent and fast power estimator, based on mimetic learning, for increasing the functional peak power of CPU cores. To speed up the framework, we proposed a fast power estimator to be used in conjunction with the evolutionary algorithm (EA), thus reducing the evaluation time from weeks to days. In particular, we proposed a method for fast power estimation based on neural networks, whose inputs are the switching activities per gate type and whose output is the estimated power.

With the proposed approach, we are not trying to develop a new method for computing the exact power consumption, because power evaluators from several vendors already exist. Instead, we are proposing a method to quickly estimate the peak power consumption; therefore, the method may sometimes miscompute the power, but it is generally able to correctly drive the EA to generate peak-power-effective patterns. The approach can reduce the time from weeks to days, providing a functional peak power consumption measure effective for mapping test power. The main novelty of the proposed strategy lies in its ability to improve the automatic functional pattern generation framework by inserting a feed-forward neural network (FFNN)-based external power evaluator that is faster than commercial ones.
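The estimator idea can be sketched with a toy example. The network below is our own minimal illustration, trained on synthetic data from a made-up linear power model (four gate types with assumed per-type costs), not on the industrial flow or its real training set:

```python
import numpy as np

# Toy FFNN regression: map per-gate-type switching activities to a power
# estimate. Data, per-type costs and network size are illustrative.
rng = np.random.default_rng(0)
n_types = 4                                  # e.g. NAND, NOR, XOR, DFF
w_true = np.array([1.0, 0.8, 2.5, 3.0])      # assumed per-type power cost
X = rng.random((200, n_types))               # switching activity per type
y = X @ w_true                               # "measured" power (synthetic)

# One hidden ReLU layer, trained with plain gradient descent on the MSE.
W1 = rng.normal(0, 0.5, (n_types, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1));       b2 = np.zeros(1)
lr = 0.05
for _ in range(3000):
    h = np.maximum(X @ W1 + b1, 0)           # forward pass
    pred = h @ W2 + b2
    err = pred - y[:, None]
    loss = (err ** 2).mean()
    g_pred = 2 * err / len(X)                # backward pass
    gW2 = h.T @ g_pred; gb2 = g_pred.sum(0)
    g_h = g_pred @ W2.T * (h > 0)
    gW1 = X.T @ g_h;    gb1 = g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
print(f"final training MSE: {loss:.4f}")
```

Once trained, evaluating such a network per candidate test program is essentially a couple of matrix products, which is what makes it a fast inner-loop substitute for a full power simulation.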

We validated the proposed methodology using the Intel 8051, synthesized with a 65 nm industrial technology. The training process of the FFNN required two days. Moreover, the final generation using the trained FFNN reduced a single evaluation time by more than 60% with respect to commercial tools, while always identifying the test program points where the peak power was maximized.

Mauricio De Carvalho from the Politecnico di Torino has carried out this work. Mauricio De Carvalho was a PhD student who spent 6 months at LIRMM in the context of the LAFISI. The work carried out by Mauricio has been published in the ASP Journal of Low Power Electronics [J1].

Fault Tolerance and Reliability of Microprocessor Cores

 

CMOS technology scaling allows the realization of more and more complex systems, reduces production costs and optimizes performance and power consumption. Today, each CMOS technology node faces reliability problems, whilst there is currently no alternative technology as effective as CMOS in terms of cost and efficiency. Therefore, it becomes essential to develop methods that can guarantee a high robustness of future CMOS technology nodes against transient and permanent faults.

A high integration density affects the robustness of a circuit during its operation as well as during its manufacturing. The smaller size of transistors makes the circuit more vulnerable to soft errors, where devices are not permanently damaged. Moreover, a high integration density causes a high defect density, which results in hard errors and a lower manufacturing yield.

To increase the robustness of future CMOS circuits and systems, the use of fault-tolerant architectures is one possible solution. In fact, these architectures are commonly used to tolerate on-line faults, i.e. faults that appear during the normal functioning of the system, irrespective of their transient or permanent nature. They use redundancy, i.e. the property of having spare resources that perform a given function and tolerate faults in the combinational and/or sequential parts of the circuit. These techniques are generally classified according to the type of redundancy used. Basically, three types of redundancy are considered: information, temporal and hardware.

In this work, we have first proposed a low-level error-detection and correction scheme for pipelined microprocessor cores used in high-dependability real-time applications. The work focuses on dealing with transient (SET and timing) and permanent faults occurring in the combinational logic part of pipelined microprocessors. The fast reconfiguration and rollback scheme mitigates the effects of faults with negligible performance overhead (i.e., in the worst case, four additional clock cycles). The proposed fault-tolerant architecture uses three types of redundancy: information redundancy for error detection, temporal redundancy for transient error tolerance and hardware redundancy for permanent error correction. Similar to TMR, this architecture consists of implementing three times the combinational logic part of the circuit. However, only two of them are running in parallel in functional mode. As a case study, we have hardened the combinational parts of a MIPS microprocessor based on the proposed fault-tolerant approach. Experiments were performed using gate-level simulations, showing that in terms of area our approach is only 5% more costly with respect to TMR, while offering a significant power improvement of about 40%. In addition, our approach offers full protection against transient and permanent faults in the Combinational Logic (CL) blocks of the MIPS microprocessor.

Imran Wali from LIRMM has carried out this work. Imran Wali was a PhD student who spent 1 month at the Politecnico di Torino in the first part of 2014. The goal of his visit was to validate his work using the fault injection facilities developed by the Politecnico di Torino.

After that, we have proposed a Hybrid Fault-Tolerant (HyFT) architecture that targets robustness, manufacturing yield and power consumption of logic circuits. Combining information, timing and hardware redundancy, this solution targets both hard errors and SETs tolerance in the combinational part of logic circuits. Similar to a TMR solution, this architecture consists of implementing three times the combinational part of the original logic circuit. However, only two of them run in parallel. A Finite State Machine (FSM) is used to select two CL copies to be active. It changes the architecture configuration with respect to the error detected by a comparator.

Although effective in tolerating transient and permanent faults, this fault tolerant architecture suffered from timing constraints related to the comparison-window, i.e. the time where the computed results from two running CL copies are compared for error detection. This constraint, which imposes that no paths in the combinational logic can be shorter than the comparison-window, implies unnecessary area overhead and thus power consumption.

So, we have addressed some design and timing optimization issues and analyzed their impact on the resulting area overhead and power consumption. We also verified that the proposed modifications do not affect the fault tolerance capability. Experiments were performed by synthesizing the different designs and using gate-level simulations, showing that the proposed modifications offer around 65% area reduction, about 55% power saving and 87% less performance overhead compared to the initial design, without any penalty on the fault tolerance capability.

Imran Wali from LIRMM has carried out this work. It has been presented at the IOLTS International Symposium [C6].

Delay Test of Integrated Circuits

 

Nowadays, electronic products present various issues that become increasingly important with CMOS technology scaling. High operation speed (and thus high frequency) is a mandatory requirement. These needs influence not only the design of devices, but also the choice of appropriate test schemes, which have to trade off production yield, test quality and test cost. Due to the advances in manufacturing technologies and the more aggressive clocking strategies used in modern designs, more and more defects lead to failures that can no longer be modeled by classical stuck-at faults. Numerous actual failures exhibit timing or parametric behaviors, which are not represented by stuck-at faults. Such failures have to be taken into account during the test process in order to reach an acceptable DPM (Defects per Million).

Testing for performance, required to catch timing or delay faults, is therefore mandatory and is often done through at-speed scan testing for logic circuits. At-speed scan testing consists of using a rated (nominal) system clock period between launch and capture for each delay test pattern, while a longer clock period is normally used for scan shifting (load and unload cycles). The most widely used fault models targeting timing-related failures are the Transition Fault Model and the Gate Delay Fault Model. The Transition Fault Model is a qualitative delay fault model. It assumes that the delay at the fault site is large enough to cause a logic failure. The main advantage of the Transition Fault Model is that it does not require explicitly considering the delay size during fault simulation. Conversely, the Gate Delay Fault Model is a quantitative delay fault model, since a delay size has to be defined (or assumed) in advance. This fault model is more accurate than the Transition Fault Model, but the need to take the delay size into account makes fault simulation and test generation harder.

In the literature, several works have addressed the problem of determining the size of delay faults detected by a given test set. The basic idea behind such works is to consider an extended set of timing information to be exploited during the fault simulation (e.g., initial and final logic values, earliest arrival and latest stabilization times). The common drawback of such approaches is the need to increase the fault simulation complexity by including timing information.

In this work, we have developed a methodology aimed at representing a Gate Delay Fault as a set of Transition Delay Faults in the propagation paths of the affected net. The proposed equivalence introduces a significant advantage with respect to other methods considering the Gate Delay Fault effects through timing simulation; in fact, by considering Transition Delay Faults only, we shift the analysis to a qualitative level of abstraction, instead of explicitly considering the delay size and the delay effect on the circuit. The proposed ATPG flow is consequently less complex and requires fewer CPU resources compared to existing flows. The set of Transition Faults identified as equivalent to a Gate Delay Fault also depends on the sensitization path, and changes, even for the same delay, according to the incoming path from a primary input to the considered gate. Therefore, the proposed technique consists in the analysis of the circuit, considering both sensitization and propagation paths. An ATPG flow suitable for Gate Delay Fault detection has been developed.

Unlike other approaches existing in the literature, we introduced a preliminary step with respect to conventional fault simulation. By using the proposed methodology, we produced the fault list used for performing the ATPG process, which accounts only for Transition Delay Faults (i.e., without adding timing information). Moreover, a classification of the delay size ranges detected by the generated patterns has been obtained as a by-product of the performed analysis.

Natale Pipitone from the Politecnico di Torino has carried out this work. Natale Pipitone was a Master student who spent 6 months at LIRMM in the context of the LAFISI. The work has been presented at the DTIS International Conference [C5].

Functional Test of Microprocessor Cores

 

Nowadays, electronic products present various issues that become more important with CMOS technology scaling and the concurrent demand for both high operation speed and high frequency. Testing for performance, required to catch timing or delay faults, is therefore mandatory and often implemented through at-speed structural scan testing for digital circuits. Considering at-speed scan testing, two different schemes are used in practice: Launch-off-Shift (LOS) and Launch-off-Capture (LOC). They consist of using a rated (nominal) system clock period between launch and capture for each delay test pattern, while a longer clock period is normally used for scan shifting (load/unload cycles).

At-speed structural scan testing may lead to excessive power consumption that can either damage the Circuit Under Test (CUT) or lead to yield loss. Reducing the power during testing is a well-known technique, but reducing the power consumption too much may lead to test escape phenomena.

Therefore, to cope with the above issues, we have to tune the test power depending on the functional power of the device itself. Hence, knowledge of the actual functional power is mandatory. In this work, we focused on power-aware testing of microprocessor cores, and we proposed a functional test program generator for such cores. The generator aims at maximizing the power consumption of the target microprocessor; hence, the generated programs are good candidates to accurately estimate the functional power limits (i.e., to avoid both over- and under-test).

Since the generated functional test programs maximize the power consumption, they are definitely characterized by a high switching activity. In other words, they could also be good candidates for delay fault testing. In this work, we therefore also investigated the impact of re-using the available functional test programs for exploring a global test solution that maximizes the delay fault coverage while satisfying the power consumption constraints. In particular, we showed how these functional tests can be applied to improve the transition delay fault coverage. We propose to re-use the DfT circuitry already present in the CUT, thus avoiding any further modifications of the device. Basically, we intend to map the functional test programs into a structural scan testing scheme (i.e., LOC or LOS). The result is a test scheme applied to the circuit through the existing scan chains. Finally, we combine the functional test with the structural one, to prove that it is possible to maximize the delay fault coverage while respecting the power consumption constraints for microprocessor core testing.

Aymen Touati from LIRMM has carried out this work, with the contribution from Alessandro Guerriero (Master student – Politecnico di Torino) who spent 6 months at LIRMM from April to October 2015. The work has been presented at the IEEE NATW International Workshop [C2], at the EDAA DATE International Conference [C3] and in the JCSC international journal [J5].

Reliability Analysis of Deep Neural Networks

 

Deep Learning models, and in particular Convolutional Neural Networks (CNNs), are currently among the most intensively and widely used predictive models in the field of machine learning. In most cases, CNNs provide very good results for complex tasks such as object recognition in images/videos, drug discovery, and natural language processing, up to playing complex games.

One of the peculiar characteristics of CNNs is their inherent resilience to errors, due to their iterative nature and learning process. Thus, these techniques are now widely used in safety-critical applications such as autonomous driving. In order to use an electronic device in a safety-critical application, its reliability must be evaluated; in particular, the probability that a fault may cause a failure is computed. The reliability analysis and its evaluation are regulated by standards depending on the application domain (e.g., IEC 61508 for industrial systems, DO-254 for avionics, ISO 26262 for automotive).

Usually, in-field test solutions have to be embedded and activated in mission mode to detect possible permanent faults before they produce any failure. Examples of such test solutions are Design for Testability techniques (e.g., BIST), functional self-test approaches (e.g., Software-Based Self-Test), or a combination of both. Independently of the adopted test solution, the key point is the fault coverage achieved with respect to the adopted fault model(s). Fault coverage is computed as the ratio between the number of faults detected by the test solution and the number of possible faults; a higher fault coverage ensures a higher level of safety. The fault list includes all possible faults except those that cannot produce any failure in the operational mode. In the ISO 26262 terminology, these faults are called “Safe Faults Application Dependent” (SFAD). Identifying SFAD is crucial, because it allows removing them from the fault list and focusing the test effort only on faults leading to application failures.
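The fault coverage computation with SFAD pruning can be sketched as follows; the fault counts are illustrative only:

```python
def fault_coverage(detected, total_faults, safe_faults):
    """Fault coverage after pruning safe faults (SFAD) from the fault list:
    only faults able to produce an operational failure remain in the
    denominator."""
    testable = total_faults - safe_faults
    return detected / testable

# Illustrative numbers: 10000 possible faults, 1200 identified as SFAD,
# 8360 detected by the test solution -> coverage 8360/8800 = 95.0%.
fc = fault_coverage(detected=8360, total_faults=10000, safe_faults=1200)
```

Note how removing the 1200 safe faults raises the coverage figure from 83.6% to 95.0% without any change to the test itself, which is precisely why SFAD identification matters.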

The goal of this work is to characterize the impact of permanent faults affecting a CNN by means of a fault injection campaign on the darknet open-source DNN framework. In the literature, few works target the reliability of DNNs, and only for soft errors (i.e., bit flips). In our study, we considered the injection of permanent faults, with the final goal of evaluating the SFAD distribution and of classifying the criticality of a CNN prediction deviation according to the corrupted layer. The main contributions of this work with respect to the state of the art are the following:

  • Fault injection is carried out at the software layer, in order to be independent of any potential HW architecture finally running the CNN;
  • Fault effects are classified as SFAD (masked), Safe, and Unsafe, according to several criteria;
  • Fault injection allows identifying the most sensitive layers, for which a safety mechanism may be purposely devised.
Two different CNN topologies, LeNet and YOLO, have been characterized using the general darknet framework. Experimental results show limited computational costs to achieve a good accuracy in the classification of faulty behaviors.
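A minimal sketch of one injection-and-classification step is given below, assuming a permanent stuck bit modeled as a flip in a stored 32-bit weight and a classification rule in the spirit of the SFAD/Safe/Unsafe split. The tolerance value and the top-1 criterion are simplifying assumptions, not the exact criteria used in the study:

```python
import struct

def bit_flip(value, bit):
    """Flip one bit of a 32-bit float (models a stuck bit in a weight)."""
    (i,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return out

def classify(golden_scores, faulty_scores, tol=1e-3):
    """Compare the faulty run against the golden (fault-free) run:
    masked if the output scores are unchanged, Safe if the top-1
    prediction survives, Unsafe if the prediction itself changes."""
    if all(abs(g - f) < tol for g, f in zip(golden_scores, faulty_scores)):
        return "SFAD (masked)"
    top = max(range(len(golden_scores)), key=golden_scores.__getitem__)
    ftop = max(range(len(faulty_scores)), key=faulty_scores.__getitem__)
    return "Safe" if top == ftop else "Unsafe"
```

A full campaign repeats this per weight, per bit position, and per layer, which is what makes it possible to rank layers by sensitivity.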

This work has been carried out by Annachiara Ruospo (Ph.D. student – Politecnico di Torino). It has been presented at the IEEE LATS symposium 2019 [C22].

Dependable Systems for Space Applications

 

The target of this work was the design of a scheduler to manage the test of several kinds of memories embedded in the MT-CUBE satellite developed at the University of Montpellier. The scheduler is a crucial device in the experimental board of the satellite, because it allows all experiments to be carried out while respecting timing, data communication, and power constraints. On board the satellite, the system uses a single bus for all communications, which is an additional reason to carefully tune the scheduler.

The work was divided into five contributions:

  1. Management of the tests via scheduler
  2. Interaction between the scheduler and test interfaces
  3. Hardening of registers
  4. Implementation of other functionalities, such as hardened buffers
  5. Testing of the FPGA (communication via serial link)

The tests were run on memories of various types. The scheduler mainly consists of two entities that govern the static and the dynamic behavior of the memories during the tests. An independent entity manages the timing of the scheduler, i.e., starting and stopping the tests. The scheduler takes into account the latest actions performed on the memories and decides which new tests to carry out next. After a stop, the scheduler uses the elapsed test time to determine the percentage of completion of the tests, in order to manage the wake-up action.
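The completion tracking and wake-up decision can be sketched as follows; the threshold value and the resume-vs-restart policy are assumptions for illustration, not the actual on-board behavior:

```python
class TestScheduler:
    """Toy sketch of the wake-up logic: given a nominal test duration and
    the time elapsed before each stop, compute the completion percentage
    and decide what to do on wake-up."""

    def __init__(self, nominal_duration_s):
        self.nominal = nominal_duration_s
        self.elapsed = 0.0

    def stop(self, elapsed_s):
        """Record the test time elapsed before this stop."""
        self.elapsed += elapsed_s

    def completion(self):
        """Fraction of the test completed so far, capped at 100%."""
        return min(self.elapsed / self.nominal, 1.0)

    def on_wake_up(self, resume_threshold=0.10):
        # Assumed policy: resume only if enough progress was made before
        # the stop; otherwise restart the test from scratch.
        if self.completion() >= 1.0:
            return "done"
        return "resume" if self.completion() >= resume_threshold else "restart"
```

The point of tracking completion across stops is that a test interrupted by power or bus constraints need not be rerun in full after every wake-up.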

The system has been installed on the MT-CUBE satellite; for this reason, the cosmic particles present in the space environment can cause errors, especially in the internal registers of the FPGA we used. In particular, particles hitting the registers may change the value of some bits. Such unexpected changes of register bits could cause serious problems to the system. Thus, one of our targets was to detect and correct these events through error detection/correction codes.
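As an example of such a code, a Hamming(7,4) scheme protects 4 data bits with 3 parity bits and corrects any single bit flip, the typical effect of a particle strike on a register. The Python sketch below only illustrates the coding scheme; the actual hardening is implemented directly in FPGA logic:

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword
    laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Decode a 7-bit codeword, correcting any single bit flip
    (e.g., a single-event upset in a register)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the upset bit
    return [c[2], c[4], c[5], c[6]]  # extract the data bits
```

Decoding recomputes the three parity checks; the resulting syndrome directly encodes the position of the flipped bit, so correction is a single XOR.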

The implementation on board has been done using Xilinx software tools, in particular PlanAhead and ISE iMPACT.

Cosimo Lupo from Politecnico di Torino carried out this work. He was a Master student who spent 8 months at LIRMM in the context of the LAFISI. The work he carried out has been implemented in a satellite whose launch is scheduled for the end of 2016.