Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extension enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. These errors are Synchronous External
Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts. The |EHF| document mentions various
`error handling use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and its terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of consecutive records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of
these notifications, the RAS framework can iterate through and probe the error
records for errors, and invoke the appropriate handler for each error found.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their
arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.
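
The probe-then-handle flow described in this section can be illustrated with a
small, self-contained model. This is only a sketch under stated assumptions:
``struct demo_err_record``, ``demo_probe``, ``demo_handler``, and
``demo_dispatch`` are hypothetical names invented for illustration, the handler
signature is simplified, and none of this is the |TF-A| implementation or its
headers.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for the framework types; NOT the TF-A definitions. */
    struct demo_err_record;

    typedef int (*demo_probe_t)(const struct demo_err_record *info,
                    int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_err_record *info,
                    int probe_data);

    struct demo_err_record {
        demo_probe_t probe;
        demo_handler_t handler;
        const int *status;      /* assumed aux data: one mock status word per record */
        int num_records;
    };

    /* Records the index passed to the most recent handler call. */
    static int demo_last_handled = -1;

    /* Probe: report the first record whose mock status word is non-zero. */
    static int demo_probe(const struct demo_err_record *info, int *probe_data)
    {
        for (int i = 0; i < info->num_records; i++) {
            if (info->status[i] != 0) {
                *probe_data = i;    /* pass the record index to the handler */
                return 1;
            }
        }
        return 0;
    }

    /* Handler: receives the value that the probe stored in probe_data. */
    static int demo_handler(const struct demo_err_record *info, int probe_data)
    {
        (void)info;
        demo_last_handled = probe_data;
        return 0;
    }

    /* Probe every entry; invoke the handler of each entry that reports an error. */
    static int demo_dispatch(const struct demo_err_record *records, size_t count)
    {
        int handled = 0;

        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            assert(records[i].probe != NULL && records[i].handler != NULL);
            if (records[i].probe(&records[i], &probe_data) != 0) {
                records[i].handler(&records[i], probe_data);
                handled++;
            }
        }
        return handled;
    }

In this model, ``demo_dispatch`` returns the number of entries for which a
handler ran; a caller could treat a return value of zero as "no registered
record claimed the error".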

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
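
The reason for the sorting requirement can be seen in a minimal binary search
sketch. The type ``struct demo_ras_interrupt`` and both function names below are
hypothetical stand-ins chosen for illustration; the real ``struct
ras_interrupt`` layout and the framework's lookup code are defined by |TF-A| and
may differ.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in; NOT the TF-A definition of struct ras_interrupt. */
    struct demo_ras_interrupt {
        unsigned int intr_number;
        const void *err_record;     /* would point to the error record info */
        void *cookie;
    };

    /* Verify the precondition: strictly increasing interrupt numbers. */
    static int demo_is_sorted(const struct demo_ras_interrupt *table, size_t count)
    {
        for (size_t i = 1; i < count; i++) {
            if (table[i - 1].intr_number >= table[i].intr_number)
                return 0;
        }
        return 1;
    }

    /* Binary search over the sorted table; returns NULL if intr is not listed. */
    static const struct demo_ras_interrupt *
    demo_find_interrupt(const struct demo_ras_interrupt *table, size_t count,
                        unsigned int intr)
    {
        size_t lo = 0;
        size_t hi = count;

        assert(table != NULL || count == 0);
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr)
                return &table[mid];
            if (table[mid].intr_number < intr)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

Each iteration halves the remaining range, so the lookup cost grows
logarithmically with the number of registered interrupts. An unsorted table
would break this search silently, which is why a check such as
``demo_is_sorted`` can be useful during platform bring-up.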

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions that are part of Armv8.4 introduced new architectural
features to deal with Double Fault conditions, specifically the ``NMEA`` and
``EASE`` bits in the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which doesn't return.
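
As a purely conceptual illustration of detecting a Double Fault in software,
the sketch below tracks whether an error is already being handled. It is a
simplified model under loud assumptions: ``demo_ea_enter``, ``demo_ea_exit``,
and the single ``demo_ea_in_progress`` flag are hypothetical, real firmware
would keep such state per CPU, and this is not a description of how |TF-A|
itself records the condition.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>

    enum demo_ea_outcome {
        DEMO_EA_PROCEED,        /* first error: normal handling may start */
        DEMO_EA_DOUBLE_FAULT    /* error arrived while another was in flight */
    };

    /* Hypothetical state; a real system would track this per CPU. */
    static bool demo_ea_in_progress;

    /* Called on entry to error handling. */
    static enum demo_ea_outcome demo_ea_enter(void)
    {
        if (demo_ea_in_progress) {
            /* Real firmware would invoke a fatal handler here and not return. */
            return DEMO_EA_DOUBLE_FAULT;
        }
        demo_ea_in_progress = true;
        return DEMO_EA_PROCEED;
    }

    /* Called once handling of the current error has completed. */
    static void demo_ea_exit(void)
    {
        assert(demo_ea_in_progress);
        demo_ea_in_progress = false;
    }

The model only works because a nested error is observed immediately; if
exceptions were masked during handling, the second error would be deferred and
the flag check alone could not see it.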

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation. That is, for interrupts, priority
management is implicit; for non-interrupt exceptions, it is explicit, using the
`EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*