[ag-automation] neuer Mitleser

Thomas Gleixner tglx at osadl.org
Fri Apr 21 22:48:57 CEST 2006


Hi Jan,

On Tue, 2006-04-11 at 11:41 +0200, Jan Kiszka wrote: 
> [sorry for the late reply - busy...]

me too :)

> From a higher point of view, the basic idea of RTDM is threefold:
> 
> 1. Keeping the interface a hard-RT driver has to provide limited and
>    targeted. This aims at reducing the amount of code one has to
>    maintain with respect to both logical correctness and determinism.
> 
>    In more detail, RTDM focuses on basic POSIX I/O services (read /
>    write / ioctl) and the socket interface (recvmsg / sendmsg &
>    friends). Invocations are passed to driver handlers separated
>    according to the caller's scheduling policy (RT or non-RT). An RTDM
>    driver can register dedicated handlers for each context, providing,
>    e.g., resource allocation policies depending on the criticality of
>    the caller or rejecting wrong usages.

I don't see why this is a generally necessary functional split. When the
driver needs different handling depending on the user's scheduling
policy, then let the driver do that part.
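
Just to make concrete what we are talking about, here is a purely
illustrative sketch (none of these names are the real RTDM or kernel
declarations): variant A is the RTDM-style split with two registered
handlers, variant B is a single handler where the driver itself branches
on the caller's scheduling policy:

#include <linux/sched.h>	/* for current and rt_task() */

struct my_dev;

int do_request_deterministic(struct my_dev *dev, unsigned int req, void *arg);
int do_request_relaxed(struct my_dev *dev, unsigned int req, void *arg);

/* Variant A: two entry points, dispatched by the framework. */
static int my_ioctl_rt(struct my_dev *dev, unsigned int req, void *arg)
{
	return do_request_deterministic(dev, req, arg);
}

static int my_ioctl_nrt(struct my_dev *dev, unsigned int req, void *arg)
{
	return do_request_relaxed(dev, req, arg);
}

/* Variant B: one entry point; the driver branches where it matters. */
static int my_ioctl(struct my_dev *dev, unsigned int req, void *arg)
{
	if (rt_task(current))	/* caller runs with RT priority */
		return do_request_deterministic(dev, req, arg);

	return do_request_relaxed(dev, req, arg);
}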

> 2. Providing a driver development interface for the time-critical part
>    that is independent of the underlying RTOS (and its potential
>    modifications over time). This concept was (re-)born in the days
>    when Preempt-rt was far from being in sight. So the current
>    realisation may require a careful rebalancing when including
>    Preempt-rt support.

I'm not yet convinced that a "fit every OS" driver approach is the
right way to go. Looking at the various existing implementations of
"fits everything" drivers, it's not unlikely that this will become a
nightmare.

> Item 1 should be straightforward to realise over Preempt-rt, as there
> are already deterministic system entries for the RTDM service set; only
> adaptation of the handler signatures is required. Also item 2 will mostly
> just demand name mapping (for mutexes e.g.) or minor wrapping.
> 
> But regarding mainline integration of this layer: well, if this is going
> to be the ultimate criterion, RTDM may fail. Without preempt-rt in the
> kernel, it doesn't make sense to think about this at all.

Why not? If the design is good and there is actually a benefit from
having such a subsystem, then it makes complete sense to give it a
try. If it's single purpose and unusable when -rt is switched off, it's
indeed debatable.

> But even with
> preempt-rt, I'm a bit sceptical that the current kernel community will
> like the idea of including a layer which aims at keeping code compilable
> also against something outside their yard.

That's no argument. There is a lot of code in mainline which is kept
compatible with non-Linux implementations. The criterion is usefulness.

> While I'm perfectly fine with letting code flow together with the kernel
> development to achieve best average efficiency, I'm not convinced to do
> the same with stuff used for critical applications. Therefore RTDM.
> That's my view, but maybe I (and other sharing my view) will be proven
> wrong in the next years. :)

Well, you know my POV. 

Granted, it's easier to add single-purpose solutions in order to
circumvent limitations in the generic code, but I doubt that this is a
good general approach. I still think that it's better to spend time on
solving such problems in the existing code, or on proving that the
separate implementation is necessary and useful for others too. Nobody
will have any objections if there is a solution to improve the performance
of e.g. the networking stack. This might as well include some extra
shortcuts for critical applications. Automation is not so different from
other demanding application fields. Restricting the view to the
requirements of automation is fundamentally wrong.

> But first, let's keep the integration out of focus. I'm sure we can more
> easily agree on the specification part, i.e. the device driver profiles.
> CAN is one such example where we are well on the way to achieving a
> compatible programming model, independent of the underlying
> implementation details.

As long as the programming model is generic enough and does not covertly
depend on non-obvious assumptions, e.g. syscall splits.
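
For reference, the socket-style model used by the socketcan patches -
which, as far as I can see, is what the RTDM CAN profile mirrors - looks
roughly like this from userspace. Sketch only; header and constant names
are as in the socketcan proposal and may differ in detail:

/* Receive one raw CAN frame on can0 via the socket interface. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/can.h>
#include <linux/can/raw.h>

int main(void)
{
	struct sockaddr_can addr;
	struct ifreq ifr;
	struct can_frame frame;
	int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);

	if (s < 0)
		return 1;

	/* resolve the interface index of can0 */
	strncpy(ifr.ifr_name, "can0", IFNAMSIZ);
	ioctl(s, SIOCGIFINDEX, &ifr);

	memset(&addr, 0, sizeof(addr));
	addr.can_family = AF_CAN;
	addr.can_ifindex = ifr.ifr_ifindex;
	bind(s, (struct sockaddr *)&addr, sizeof(addr));

	/* read one frame and print its identifier */
	if (read(s, &frame, sizeof(frame)) == sizeof(frame))
		printf("id 0x%x dlc %d\n", frame.can_id, frame.can_dlc);

	close(s);
	return 0;
}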

> As you said, this depends on your scenario. I can very well imagine
> large, complex applications (telco...) where your definitely smart
> optimisations help a lot. But I can also imagine scenarios (tight
> control loops e.g.) where you'd rather spend a few percent of efficiency
> for guarantees.

There is a wide range between those two types (telco and the tight
control loop). Also, tight control loops where this might matter are
usually lockless. OTOH we have a critical control application (migrated
from OS9) which uses locks heavily. Giving up the uncontended / no-waiters
optimization would hurt badly.

I had a short look at the xenomai implementation and I really wonder
how you will achieve pthread_mutex handling across processes (pshared)
with priority inheritance support - including mutexes shared between rt
and non-rt processes. This is a basic requirement for many POSIX-based
applications and frameworks. It's not a question of whether such a
scenario is desirable from an engineering and design POV; it's simply a
fact that it is necessary if you claim POSIX compliance.
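
To spell out what I mean, this is the standard pattern such applications
and frameworks rely on: a mutex living in shared memory with
PTHREAD_PROCESS_SHARED and PTHREAD_PRIO_INHERIT set. Rough sketch, error
handling omitted, /demo_lock is just a made-up name; compile with
-lpthread -lrt:

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *mtx;
	int fd = shm_open("/demo_lock", O_CREAT | O_RDWR, 0600);

	ftruncate(fd, sizeof(*mtx));
	mtx = mmap(NULL, sizeof(*mtx), PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

	/* process-shared mutex with priority inheritance */
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(mtx, &attr);

	/* Any process mapping /demo_lock - RT or not - can now take
	 * this lock and expects PI boosting semantics. */
	pthread_mutex_lock(mtx);
	pthread_mutex_unlock(mtx);
	return 0;
}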

Bluntly, without thinking too much about it: it does not work; you can't
do priority inheritance across domains without a major hack all over the
place.

That's one of my main criticisms of the pseudo-domain concept. It
pretends to be safe through separation and flexible through resource
sharing at the same time. There is no way to get both without major
restrictions. You claim POSIX compliance, but put self-defined
restrictions on its usage.

How do you justify that a POSIX-compliant RT application has to be
modified in order to work on such a system, and that the user has to
redesign and rewrite the code?

I'm not saying that domains are bad, but you have to state the
restrictions clearly upfront and stay away from general compliance
claims. For Joe User it has to be made entirely clear that the claimed
conformance has severe restrictions:
- works only inside a single domain
- has incomplete functionality and needs modifications to the code

This makes it simply unusable for already existing large code bases:
multi-process user applications and code based on large frameworks like
ACE/TAO.

IMNSHO the whole concept of pseudo domains is broken by design and will
never lead to a satisfying solution. Either you have an all-in-one OS or
real, physically separated domains, which enforce a completely different
design of the application/system.

> > If you trap into the -ENOMEM situation in the locking code, then your
> > system has reached its limits anyway.
> 
> That's easy to claim, but I don't see why a well-configured Linux box
> shouldn't keep its time-critical jobs alive and working even under
> memory pressure.

Oh well. There is a huge difference between memory pressure and the
point where you get ENOMEM in the locking code. Once that hits, there is
serious trouble on the way. When ENOMEM happens in such a scenario, your
complete system - xenomai or preempt-rt - is rendered unusable.
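
(For completeness: a "well configured" box in this sense means the
time-critical job locks and pre-faults its memory up front - the usual
mlockall() pattern, roughly sketched below. That is an entirely different
problem than running into ENOMEM inside the locking code.)

/* Standard RT setup: lock all current and future memory of the
 * process and pre-fault the stack, so memory pressure elsewhere
 * does not turn into page faults in the critical path. */
#include <string.h>
#include <sys/mman.h>

#define PREFAULT_STACK (64 * 1024)	/* illustrative size */

static void prefault_stack(void)
{
	unsigned char dummy[PREFAULT_STACK];

	memset(dummy, 0, sizeof(dummy));
}

int rt_memory_setup(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return -1;

	prefault_stack();
	return 0;
}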

> Bugs are everywhere, e.g. memory leaks in
> not-that-well-reviewed non-RT applications.

We have discussed this before, and it is still a strawman argument. Bugs
are bugs, and the fact that in theory the RT application stays alive does
not change this and does not make any bug less dangerous.

Worse, you might even fool inexperienced users into believing that they
don't have to audit the complete system, because the OS will take care
that the RT application survives.

Taken to its full extent, such a statement suggests that even buggy,
unreviewed device drivers for non-rt related equipment are harmless. They
are not. Hitting a kernel BUG from one of those might kill the system
completely. The experience we've had in the course of the preempt-rt
development was exactly this: apparently working drivers broke due to
well-hidden race conditions and other problems covered by the vanilla
Linux behaviour. I said behaviour, not semantics. The bugs were present
in vanilla too, but so subtle that they were almost impossible to
trigger.

I don't see any reason why xenomai should not be affected in the same
way.

As long as a domain concept does not provide complete physical
separation of the domains, such statements are moot and misleading.

Pseudo domains as provided by xenomai are just as vulnerable to buggy
and malicious code in the non-rt part as single-domain implementations,
despite all the sacrosanctness claims made in the course of this and
previous discussions.

Sorry, I really do not understand what you want to achieve with such
arguments. I'd actually like to watch the fun you'll have with an
experienced audit engineer when you make such claims.

I would seriously appreciate it if we could restrict this to pure
technical discussions again and keep FUD out.

> Ok, I recently noticed that
> the out-of-memory manager of 2.6 doesn't make this goal easier, but it
> still remains feasible and worth achieving.

I'd be grateful if you found a sane solution to fix the oom-killer
itself, rather than just adding an "it works for xenomai" hack. :)

	tglx