STMicroelectronics and the Verimag laboratory in Grenoble are starting a collaboration on hardened circuits and critical real-time applications. STMicroelectronics provides its expertise as a hardware manufacturer, in particular for hardened circuits usable in, e.g., spacecraft equipment. Verimag was founded in 1993 by J. Sifakis, who received the 2007 Turing Award for his pioneering work on model checking. Verimag is also the birthplace of Lustre, a programming language for real-time applications, which is the core of the SCADE tool; SCADE has become a de facto standard in the avionics industry. Verimag provides its expertise in building and validating critical software and systems able to meet the requirements of certification authorities.
The PhD will be funded by STMicroelectronics. The student will work partly at STMicroelectronics, and partly at Verimag, under the supervision of F. Maraninchi.
How To Apply: send an email to florence.maraninchi@imag.fr, including a motivation letter, a detailed CV, transcripts of your master's degree, and recommendation letters from previous advisors. All documents should be in PDF format.
General context of the PhD:
The influence of particles on hardware circuits has to be taken into account for space applications, but also for aircraft and even cars. Under the influence of a particle, one or several bits can be changed in memory, or the behavior of some combinational logic can temporarily deviate from its specification. This can have severe consequences on the software that runs on this hardware. Hardware manufacturers have developed several mitigation techniques, either at the physical level or by applying redundancy principles at the various levels of logical design. Hardening methods can also be applied in the various layers of the software, where the redundancy can be spatial (two computations of the same thing on two cores) or temporal (two successive computations of the same thing). Both hardware and software methods range from the mere detection of faults to the correction of one or several faults. The final users of hardware/software systems in the space, avionics or automotive industries require confidence in the way faults are tolerated or recovered in the system, and in the probability of severe failures resulting from these faults. This is required by certification processes such as the ISO 26262 standard in the automotive safety domain, or by independent certification authorities.
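As an illustration of the software-level redundancy principles mentioned above, here is a minimal sketch in C (compute_step is a hypothetical applicative computation, not part of the project): duplication with comparison detects a transient fault, while triplication with bit-wise majority voting corrects one.

  #include <stdint.h>
  #include <stdbool.h>

  /* Hypothetical applicative computation; stands for one control step. */
  extern uint32_t compute_step(uint32_t input);

  /* Detection by temporal redundancy: compute twice and compare. */
  bool compute_with_detection(uint32_t input, uint32_t *out) {
      uint32_t a = compute_step(input);
      uint32_t b = compute_step(input);
      *out = a;
      return a == b;   /* false: a transient fault was detected */
  }

  /* Correction by triplication: compute three times, majority-vote each bit. */
  uint32_t compute_with_correction(uint32_t input) {
      uint32_t a = compute_step(input);
      uint32_t b = compute_step(input);
      uint32_t c = compute_step(input);
      return (a & b) | (a & c) | (b & c);   /* bit-wise majority vote */
  }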
Fault injection techniques:
One technique for assessing the tolerance of a hardware/software system to particle-induced faults is fault injection. The idea is to induce a fault in the circuit, and to trace its effect through the various layers, up to the observable behavior of the application. Faults can be masked on their way up for a variety of reasons: by logic masking, by the redundancy mechanisms implemented for that purpose, or for reasons inherent to the way the system works, such as applicative masking (for instance, corrupting a bit in a memory location that will never be read again, or that is about to be overwritten before its next use). Fault injection can be done physically, or by simulation. The observation of the system submitted to faults, and the understanding of the way faults propagate, are key questions in this domain. One important question is the exact perimeter of the experiment: do we take a circuit, together with the full software stack, and submit it to physical particles? Or do we have to "cut" the stack at one or more levels, and provide stubs for the missing parts? At the top of the stack, the problem comes from the fact that customers do not usually provide their own applications, but instead require a qualification of the hardware for any application. The question then is how to provide a stub that behaves like a typical application. At the bottom of the stack, if we want to perform quick simulations, we need a model of the way physical particle hits on the circuit translate into logical phenomena (bit flips, or errors in the logic).
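To make the principle concrete, here is a minimal sketch of fault injection by simulation, assuming a simplified memory model (all names are illustrative, not the project's actual tooling): a single-event upset is modeled as flipping one uniformly chosen bit.

  #include <stdint.h>
  #include <stdlib.h>

  #define MEM_WORDS 1024

  /* Simplified model of a RAM block in a simulated platform. */
  static uint32_t memory[MEM_WORDS];

  /* Inject a single-event upset: flip one uniformly chosen bit. */
  void inject_bit_flip(void) {
      size_t word = (size_t)rand() % MEM_WORDS;
      unsigned bit = (unsigned)rand() % 32;
      memory[word] ^= (uint32_t)1u << bit;
  }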
Fault injection, real-time operating systems and critical applications:
One point that has attracted little research attention so far is the fact that the applications are very often implementations of control laws, in the form of a set of threads running on a real-time operating system (RTOS). This is the topic of the proposed PhD. It includes several interesting questions:
1) The propagation of a fault in the hardware has to be observed up to the behavior of the control problem, not only up to the values computed by the implementation, since the fault may affect the algorithm more deeply, e.g., its control flow or timing. Typically, control laws are designed in such a way that sampling errors are tolerated, which means that an error considered very serious at the software level could matter much less once we keep in mind that what counts is the effect of the control law (see the sketch after this list). This could lead to a new classification of faults.
2) If we need to cut the stack and replace the applications by stubs, the nature of a "typical" application has to be studied at the level of the control engineering problems.
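The following toy sketch illustrates point 1 under simplifying assumptions (a trivial integrator plant and an arbitrary PI tuning, both chosen for illustration only): a single grossly corrupted sensor sample, which would count as a severe error at the software level, produces only a transient deviation that the closed loop absorbs.

  #include <stdio.h>

  int main(void) {
      double setpoint = 1.0, state = 0.0, integral = 0.0;
      const double kp = 0.5, ki = 0.1, dt = 0.01;
      for (int step = 0; step < 2000; step++) {
          double sample = state;
          if (step == 500)
              sample = 100.0;               /* injected faulty sample */
          double error = setpoint - sample;
          integral += error * dt;
          double command = kp * error + ki * integral;
          state += command * dt;            /* trivial plant: an integrator */
      }
      printf("final state: %f\n", state);   /* back near the setpoint */
      return 0;
  }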
The requirements of the ISO 26262 standard adopted by the automotive industry lead to a generalization of points 1 and 2 above: given a characterization of faults (in the form of probabilities, fault rates and other metrics, or fault trees) at one level of the stack, how do we compute a characterization of the resulting faults at an upper level of the stack?
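As a purely illustrative instance of such a computation (all numbers are assumptions, not measured data): if a memory block of N = 10^6 bits exhibits a raw upset rate of lambda = 10^-10 upsets/bit/hour, and if fault-injection experiments show that a fraction m = 0.9 of the upsets are masked before reaching the application, then the resulting application-level error rate is roughly lambda x N x (1 - m) = 10^-5 errors/hour. The question above is how to perform this kind of computation for richer characterizations (fault trees, distributions over fault types, timing effects).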
All these problems can be studied first through simulations. In order to study the behavior of the upper levels (typically the RTOS and the applications) in a simple setting, one solution is to run them on a simulated hardware platform. A probabilistic fault model is needed in order to simulate the faults; it can be the result of a prior study of the lower levels (e.g., physical fault injection in hardware blocks). Another topic of interest would be to analyze the code of the RTOS and the structure of a "typical" application, in order to identify the points in the software for which an error would be the most serious. This could serve as a test requirement for the hardware, to determine whether such a serious situation can indeed occur.
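A probabilistic fault model can be as simple as the following sketch (the per-cycle upset probability is an assumed placeholder, to be calibrated from lower-level studies; inject_bit_flip is the injector sketched earlier):

  #include <stdlib.h>

  #define UPSET_PROBABILITY 1e-6   /* per simulated cycle; assumed value */

  void inject_bit_flip(void);      /* the injector sketched earlier */

  /* Called once per simulated cycle of the platform. */
  void maybe_inject_fault(void) {
      if ((double)rand() / RAND_MAX < UPSET_PROBABILITY)
          inject_bit_flip();
  }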