[ag-automation] [OSADL help] Automatic shutdown after patching PREEMPT-RT to Kernel 3.0.32

Sun Jun 3 21:24:48 CEST 2012

Bojie,

> I'm new in OSADL and have used the PREEMPT-RT patch to realize some
> real-time features for my own tests. The kernel I use is 3.0.32 and
> the patch rt52.
> 
> I've configured the kernel carefully according to the instructions
> and notes on the website of OSADL and got really good results in
> testing its performance like latency.
> 
> However, the biggest problem for me is that the system will directly
> shutdown after running for a while (maybe less than 30 min). I was
> wondering whether it was caused by the hardware overheating
> protection. Because I've disabled almost all the power management
> services (like APM and ACPI) in kernel and for ACPI I only let
> itsself enabled when all the sub options disabled(like fan, cpu, etc)
> according to avoid extra latencies.
Before starting to comment on your request, I would like to make sure
that my answer is not misunderstood. Despite all the following
explanations, suggestions and references, please note that a notebook is
not designed to be used as an industrial control system, a banking
transaction server, an audio recording device or any other application
that requires real-time capabilities. For some time now, we have been
using the participants' notebooks in training classes and workshops on
real-time embedded systems, and we install - as one of the exercises - a
real-time kernel on these notebooks. This has given us an overview of
the real-time behavior of at least a hundred different notebooks.
Although we worked pretty hard on them to obtain good real-time data,
only about 20% to 30% of the notebooks are able to deliver them. Thus,
as a general rule, forget notebooks in the context of real-time. They
are not made for it.

This said, however, a notebook may serve as a useful example to study
the various mechanisms that prevent real-time behavior. In addition,
some industrial PCs are based on chipsets for mobile systems, since low
power consumption and low heat generation are also good for embedded
systems. They allow fan-less systems to be designed in a hermetically
closed housing that can be used, for example, in a dust-exposed
environment. Making such systems real-time compliant may require
knowledge that can be obtained with notebooks.

In order to better study and document the technicalities of real-time in
notebooks, we have added another rack to our QA farm and populated it
with a number of different notebooks. For the time being, there are six
notebooks in rack #6 (3 x Acer, 1 x Lenovo, 1 x IBM, 1 x Dell) ->
https://www.osadl.org/?id=865#c6802. Without additional specific
measures, only one of them (Acer Aspire 5734Z with Intel Pentium M dual
core @2300 MHz) has excellent real-time capabilities. The other two Acer
notebooks also produce acceptable latency plots after playing around
with kernel parameters (see below), but they have a couple of outliers
per week which we have not yet tracked down any further. If you buy
yourself a more prestigious notebook such as a Lenovo or a Dell and you
pay more for it in the hope to get more, then you get more indeed. This
added value is called System Management Interrupt (SMI) and, apparently,
the more you pay, the longer and the more frequent the SMIs are, which
makes the notebook completely useless for any real-time application.
Despite a lot of effort, we were unable to get a reasonable real-time
behavior on our Lenovo (IBM) and Dell notebooks. To illustrate the
output of the hwlatdetect utility on a system that suffers from SMIs, we
are running hwlatdetect twice daily for 30 minutes on these notebooks.
The text output is made available along with the other farm profile data
- as you can see, the Dell notebook is the winner.

How should the PREEMPT_RT kernel be configured to get the lowest
latencies on a notebook? In contrast to your procedure, i.e. first
disabling all power management and then measuring the latency, we would
propose first running the PREEMPT_RT kernel as you would on any ordinary
PC. If you are lucky, you are already done, as we were in the case of
the Acer Aspire.
To investigate the origin of the unexpected shutdowns you are
experiencing, you may wish to install the sensors package and run
sensors-detect. You may then log the temperature data using something like
  while true
  do
    sensors | logger
    sleep 60
  done
and inspect the system logfile after recovering from a crash. If the
system in fact becomes overheated, try to re-enable power management and
only disable those parts of it that affect latency.

If the system has unacceptable latencies with the entire power
management in place, we would recommend that you
- inspect /sys/devices/system/cpu/cpu*/cpuidle/state*/time,
- inspect /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor,
- enable kernel-internal latency histograms,
- run hwlatdetect and
- run cyclictest with the -fb<threshold> option.

C States
If /sys/devices/system/cpu/cpu*/cpuidle/state*/time has data in any
state other than the first one, check the latencies in
/sys/devices/system/cpu/cpu*/cpuidle/state*/latency and limit the C
state to the one with an acceptable latency, e.g. by using the kernel
parameter processor.max_cstate=[1-6]. If unsure, try
processor.max_cstate=1 and check the latency.
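
As a rough sketch of the inspection described above (not part of any
OSADL tooling; the function name list_cstates and its optional directory
argument are made up for illustration), the following prints the name,
exit latency and accumulated time of every idle state. It assumes that
the kernel exposes the usual name, latency and time files per state, the
latter two in microseconds:

```shell
#!/bin/sh
# Hypothetical helper: list every cpuidle state of every CPU.
# Usage: list_cstates [cpu_base_dir]   (default: /sys/devices/system/cpu)
list_cstates() {
    base=${1:-/sys/devices/system/cpu}
    for state in "$base"/cpu[0-9]*/cpuidle/state[0-9]*
    do
        # Skip the unexpanded glob if cpuidle is not available.
        [ -r "$state/time" ] || continue
        # Derive "cpuN" from the path below the base directory.
        cpu=${state#"$base"/}
        cpu=${cpu%%/*}
        printf '%s %s name=%s latency_us=%s time_us=%s\n' \
            "$cpu" "${state##*/}" \
            "$(cat "$state/name")" \
            "$(cat "$state/latency")" \
            "$(cat "$state/time")"
    done
}
```

Calling list_cstates without an argument reads the live system. A state
that shows both a non-zero time and a latency above your limit is a
candidate for exclusion via processor.max_cstate.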

Throttling
If your scaling governor is anything other than "performance", try
  cd /sys/devices/system/cpu
  for i in cpu[0-9]*
  do
    echo performance >$i/cpufreq/scaling_governor
  done
This normally decreases the latency in proportion to the increase in
frequency.
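
To check that the setting was actually applied on every core, something
like the following may help. It is only a sketch; check_governors and
its optional directory argument are illustrative names, not an existing
tool:

```shell
#!/bin/sh
# Hypothetical helper: print the scaling governor of every CPU and
# return non-zero if any of them is not "performance".
# Usage: check_governors [cpu_base_dir]  (default: /sys/devices/system/cpu)
check_governors() {
    base=${1:-/sys/devices/system/cpu}
    status=0
    for gov in "$base"/cpu[0-9]*/cpufreq/scaling_governor
    do
        # Skip the unexpanded glob if cpufreq is not available.
        [ -r "$gov" ] || continue
        # Derive "cpuN" from the path below the base directory.
        cpu=${gov#"$base"/}
        cpu=${cpu%%/*}
        current=$(cat "$gov")
        echo "$cpu: $current"
        [ "$current" = performance ] || status=1
    done
    return $status
}
```

Run as "check_governors || echo 'not all cores on performance'" after
the loop above; note that the function also returns success when no
cpufreq directory exists at all.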

Idle processing
Several mechanisms in addition to C states are available to reduce power
consumption during idle processing, but they also delay the
responsiveness of the system. You may use the kernel parameter idle= to
study their effect on the latency. The extreme setting is idle=poll - if
your system is susceptible to overheating, this option will most
probably cook it in no time. But if it still has the same latencies as
before, they probably are not related to idle processing. Other options
are idle=nomwait, idle=mwait etc.
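
To confirm which of these parameters the running kernel was actually
booted with, the kernel command line can be inspected. A minimal sketch,
assuming the parameters of interest are idle= and processor.max_cstate=
as discussed above; show_latency_params and its optional file argument
are hypothetical names:

```shell
#!/bin/sh
# Hypothetical helper: print the idle= and processor.max_cstate=
# parameters found on the kernel command line, one per line.
# Returns non-zero if neither parameter is present.
# Usage: show_latency_params [cmdline_file]   (default: /proc/cmdline)
show_latency_params() {
    cmdline=${1:-/proc/cmdline}
    # Split the command line into one parameter per line, then filter.
    tr ' ' '\n' < "$cmdline" | grep -E '^(idle|processor\.max_cstate)='
}
```

If the function prints nothing, the kernel is running with its default
idle handling and no C state limit from the command line.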

Other methods to investigate latencies
The kernel-internal latency histograms, hwlatdetect and cyclictest with
the -fb<threshold> option are used to search for specific origins of
latencies. Explaining how all of this works probably goes beyond the
scope of this posting. In case your system no longer overheats but you
see execution gaps in your traces that would explain the measured
latencies, then you may send the traces along with system and boot data
to our email address latency-fighters at osadl.org. We may have a look
and see what we can do.

Hope this helps,
	-Carsten.