Define and document expected behavior of API calls in various contexts #18970

pabigot · 2019-09-06T12:05:55Z

This issue is intended to cover the concerns raised in #1960, #3694, #6184, #17014, and related lack of clarity and consistency.

Proposal: We should have:

Definitions of terms; see Draft terminology related to API behavior definitions #20959
A policy for when (kernel and driver) API functions are thread-safe/reentrant. Some are (to varying degrees), some are not.
At the individual function level we should clearly document:
- Can a call block?
- Any situation where the call doesn't block but also will not return for a very long time (e.g. chip-level flash erase)
- Can the function be invoked from an ISR? What behavior is guaranteed in that situation?
- Are the operations of the function atomic?
- The context in which any callback is invoked.
- Any conditions that must be satisfied before the function is invoked, e.g. holding a lock.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

Example policies might be:

All driver functions must be invokable from ISRs, but-EWOULDBLOCK shall be returned in situations where the requested operation cannot be successfully completed in that context.
All GPIO manipulation functions must be atomic.

If we can define the normal behavior at either a system or driver-specific level, only the exceptions need be documented at the individual function level.

The text was updated successfully, but these errors were encountered:

mbolivar · 2019-11-19T21:44:38Z

thread-safe: (TBD) a function is thread-safe if its behavior is correct when invoked from multiple threads at the same time

reentrant: (TBD) a function is reentrant if its behavior is correct when it is invoked by (indirect) recursion from the same thread

+1 to these.

interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead.

This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

By contrast, "might sleep" makes it clearer that the caller's thread might get scheduled out, and thus further hints the caller must be running in thread context.

At the individual function level we should clearly document:

Can a call block?

So this would become e.g.:

/**
 * @brief Enable a clock controlled by the device
 *
 * On success, the clock is enabled and ready when this function
 * returns.
 *
 * This function :ref:`might-sleep`.
 *
 * @param dev Device structure whose driver controls the clock.
 * @param sys Opaque data representing the clock.
 * @return 0 on success, negative errno on failure.
 */
static inline int clock_control_on(struct device *dev,
				   clock_control_subsys_t sys)

Where the Sphinx might-sleep ref would link directly to a common definition for the term.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine
whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

+1 to k_is_atomic() instead of k_is_in_isr().

mbolivar · 2019-11-19T22:15:54Z

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:

The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves.

pabigot · 2019-11-20T03:25:28Z

interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

We need to be able to say succinctly "Unless otherwise specified all API functions are interrupt-safe" and expect people to know what that means. A specific example would be the GPIO API, where it is not clear to implementers that, because GPIO write functions may be invoked from ISRs, read-modify-write code like:

u32_t out = gpio->OUT;
gpio->OUT ^= (out & ~mask) | (value & mask);

must be wrapped in a spin-lock to be interrupt-safe. Most implementations don't do that, even in the ongoing GPIO rewrite, in part because we couldn't get agreement in #19252 on what "interrupt-safe" meant and whether it was important to document that expectation before people started implementing drivers.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen
I would suggest "might sleep" (or "may sleep") instead.
This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

My intent was to identify functions that will not transfer control from a cooperative thread. "blocking" is a poorly chosen term. Scrap it and start over.

Zephyr uses several terms including "Waiting", "Suspended", and "Ready" to identify started non-terminated non-running threads. The transitions we're interested in are "suspend" and "wait". Let's use "yield" for "suspends or waits".

There are three cases:

The thread invoking a given function will not yield;
The thread invoking a given function may yield;
The thread invoking a given function will yield.

So we can do:

yields (thread): A thread yields when it invokes an operation that transitions it to Suspended or Waiting.
pre-empted: A thread is pre-empted when it transitions to Ready without having yielded.
yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.
non-yielding (function): A function is non-yielding when it is not yielding.

It's non-yielding that we care about, because those functions guarantee, when invoked by a cooperative thread, that no other cooperative thread will be allowed to run before the function returns. (Absent bugs like #20255.)

I'm not sure exactly how functions should be annotated with respect to their behavior, but I would want it to be terse and consistent.

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Yes, but that relies on having API to provide information about context to callees, which we probably don't have. To be considered later.

mbolivar · 2019-11-20T16:39:50Z

A specific example would be the GPIO API

Excellent example, thank you.

Summarizing the status:

thread-safe: OK as-is
re-entrant: OK as-is
interrupt-safe: OK as-is
atomic: OK as-is
blocking: scrap it, need a different term
k_is_atomic(): no, not yet (and to clarify: in my mind, not ever, being convinced by the LWN article)

There are three cases:

* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.

non-yielding (function): A function is non-yielding when it is not yielding.

With the given definition of "yield" ("suspends or waits"), I agree these are the desired semantics.

I also agree that defining terms in accordance with documented thread states and related APIs is a great approach.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

This is an evolving document used to come to consensus on the set of defined terms necessary to precisely document the expected behavior of API calls in various contexts. It's in Markdown rather than reStructured Text because that's most convenient for the editor. This commit archives the original proposal in issue zephyrproject-rtos#18970. Signed-off-by: Peter Bigot <[email protected]>

See mbolivar comment at zephyrproject-rtos#18970 (comment) and subsequent discussion. Add Background section to reference and summarize key concepts from existing documentation, particularly for thread state and priority and execution context. Replaced "blocking" with "yielding" for precision and consistency with Zephyr thread terminology. Add Commentary subsections that give a bit of explanation for why something is important or how concepts relate to each other. Signed-off-by: Peter Bigot <[email protected]>

pabigot · 2019-11-24T22:19:02Z

I've integrated the discussion so far into this document which is available for review through #20959 so people can comment on specific concerns in context. I think that'll be a bit more convenient then chained comments on this issue.

There are three cases:
* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.
Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

No. The operation of yielding is something that directly affects threads. The impact on functions is derived.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

A problem is that traditionally Zephyr has used "sleep" to specifically mean "wait for a timeout to occur". I would rather that "suspend" was more general, but currently it specifically references waiting without an event.

carlescufi · 2020-05-19T16:23:36Z

API meeting 19th May:

Plan of action to document the expected behavior of public API functions in Zephyr:

Define the format to be used in the header files (doc: add alias for properties of functions #25411)
Begin with a slice of the kernel, for example k_pipe and k_mutex (TBD). @pabigot volunteers to do this and then have it reviewed by all.
Evaluate any changes required to the property list (or their definitions)
Complete the kernel
Drivers and thread-safety discussion. Where should this be documented (in a single spot for all drivers minus exceptions or function-by-function using funcprops).
Assign each subsystem to their respective maintainers, keep a table that is updated with the current state of the process

During the 2.4 development cycle, begin each API meeting with a status update on the function properties.

pabigot · 2020-06-30T00:38:46Z

Here's my assessment of k_mutex and k_pipe:

function	syscall	reschedule	sleep	no-wait	isr-ok	async
k_mutex_init	YES	no	no	yes	yes
k_mutex_lock	YES	yes	YES	YES	NO*
k_mutex_unlock	YES	YES	no	yes	NO*
k_pipe_init	no	no	no	yes	yes
k_pipe_cleanup	no	no	no	yes	yes
k_pipe_alloc_init	YES	no	no	yes	yes
k_pipe_put	YES	yes	YES	YES	yes
k_pipe_get	YES	yes	YES	YES	yes
k_pipe_block_put	no	yes	YES	no	no	YES
k_pipe_read_avail	YES	yes	YES	no	no
k_pipe_write_avail	YES	yes	YES	no	no

Some notes on the table

Upper case indicates a primary attribute setting. Lower case indicates a derived setting. Derived values include:
- reschedule is yes if sleep is YES; it is YES only in some cases where sleep is no
- sleep defaults to no. Whether suspending in a function like k_sem_give() that is reschedule is something that should be sleep is open for clarification.
- no-wait is yes if not sleep. It may be YES or no if sleep
- isr-ok is yes if either no-wait or not sleep. The two NO are an inconsistency that may be removed via enhance k_mutex to be ISR safe #25678
- async is rare and only one of these functions has it.
syscall is just informational, not one of the standard function attributes
I've ignored pre-kernel-ok which is probably generally equivalent to isr-ok, but deep inspection is required to ensure it doesn't do something that isn't available until the main thread has started.
I've ignored supervisor which also requires inspection. I assume anything that's syscall is not supervisor, but the others may or may not be.

Determining the behavior of the API is not trivial. Some of the above may be wrong. Some were surprises, like the functions to read the number of spaces available for read and write in k_pipe, which can block for an unbounded period due to an internal call to an allocation function implemented as k_stack_pop() with a K_FOREVER wait.

I've identified 181 functions in kernel.h that could be categorized. There are some interesting inconsistencies where some functions are syscalls but other seemingly parallel ones are not. Some of the convenience wrappers like k_fifo and k_lifo are inconsistently implemented.

This is gonna be really unpleasant.

carlescufi · 2020-12-01T17:32:28Z

@andyross and @andrewboie could you please review the contents of this table? The terminology is further explained here:
https://docs.zephyrproject.org/latest/reference/terminology.html

nashif · 2020-12-02T00:59:33Z

@pabigot was reading through all of this and was wondering why we use "sleep" (https://docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

andrewboie · 2020-12-02T08:00:19Z

@pabigot was reading through all of this and was wondering why we use "sleep" (docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

AFAIK yes

pabigot · 2020-12-02T12:33:53Z

The terminology was introduced in #21678 based on analysis in #20969. The term "sleep" ultimately came from here.

To precisely describe what "put the caller to sleep" means to a user it became "a voluntary transition to a non-running active thread state".

The documented thread states relevant to "sleep" are transition to either "waiting" and "suspended". Technically it would also cover "ready" but AFAIK there is no API that allows voluntary direct transition from running to ready.

So the terminology does not reflect any internal kernel state names that implement the thread state machine. It's perhaps not too late to change, but I suspect "sleeps" will be more clear to people than "pends".

pabigot · 2020-12-02T12:45:39Z

Shorter: "sleeps" was intended to also describe a function that invokes k_thread_suspend(_current); which would instead be _THREAD_SUSPENDED.

update per zephyrproject-rtos#18970 (comment) Signed-off-by: Peter Bigot <[email protected]>

pabigot added the Enhancement Changes/Updates/Additions to existing features label Sep 6, 2019

pabigot mentioned this issue Sep 6, 2019

drivers: ISR-friendly driver APIs #6184

Closed

pabigot added the area: API Changes to public APIs label Sep 6, 2019

This was referenced Sep 11, 2019

Convert GPIO drivers to new GPIO API #18530

Closed

[TOPIC-GPIO] miscellaneous API documentation updates #19252

Merged

pabigot mentioned this issue Nov 24, 2019

Draft terminology related to API behavior definitions #20959

Closed

This was referenced Nov 28, 2019

Document where APIs can be called from using doxygen #21061

Closed

reschedule not done after mutex unlock #20802

Closed

kernel/mutex: Fix races, make unlock rescheduling #20919

Merged

Reschedule points are currently undocumented #21111

Closed

pabigot mentioned this issue Dec 12, 2019

conditions required for safe call of kernel operations from interrupts #21341

Closed

pabigot mentioned this issue Jan 3, 2020

doc: reference: add discussion of terms that define API behavior #21678

Merged

pabigot mentioned this issue May 4, 2020

[RFC] Device: generalizing and improving control features (sync, lock and timeout) #24511

Closed

ABOSTM mentioned this issue May 6, 2020

drivers: STM32 dualcore concurrent register access protection with HSEM #24862

Merged

carlescufi mentioned this issue May 19, 2020

i2c: Drivers are not thread safe #3694

Closed

JordanYates mentioned this issue Aug 28, 2020

UART RTT devicetree & async support #27704

Merged

pabigot mentioned this issue Nov 30, 2020

Add API context page #30115

Closed

pabigot mentioned this issue Jan 18, 2021

doc: kernel: document general policy for Zephyr without threads #29338

Closed

pabigot added a commit to pabigot/zephyr that referenced this issue Mar 8, 2021

fixup 222826c

3c47070

update per zephyrproject-rtos#18970 (comment) Signed-off-by: Peter Bigot <[email protected]>

carlescufi assigned nashif Jul 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Define and document expected behavior of API calls in various contexts #18970

Define and document expected behavior of API calls in various contexts #18970

pabigot commented Sep 6, 2019 •

edited

Loading

mbolivar commented Nov 19, 2019

mbolivar commented Nov 19, 2019

pabigot commented Nov 20, 2019

mbolivar commented Nov 20, 2019

pabigot commented Nov 24, 2019

carlescufi commented May 19, 2020 •

edited by gmarull

Loading

pabigot commented Jun 30, 2020

carlescufi commented Dec 1, 2020

nashif commented Dec 2, 2020 •

edited

Loading

andrewboie commented Dec 2, 2020

pabigot commented Dec 2, 2020

pabigot commented Dec 2, 2020

Define and document expected behavior of API calls in various contexts #18970

Define and document expected behavior of API calls in various contexts #18970

Comments

pabigot commented Sep 6, 2019 • edited Loading

mbolivar commented Nov 19, 2019

mbolivar commented Nov 19, 2019

pabigot commented Nov 20, 2019

mbolivar commented Nov 20, 2019

pabigot commented Nov 24, 2019

carlescufi commented May 19, 2020 • edited by gmarull Loading

pabigot commented Jun 30, 2020

carlescufi commented Dec 1, 2020

nashif commented Dec 2, 2020 • edited Loading

andrewboie commented Dec 2, 2020

pabigot commented Dec 2, 2020

pabigot commented Dec 2, 2020

pabigot commented Sep 6, 2019 •

edited

Loading

carlescufi commented May 19, 2020 •

edited by gmarull

Loading

nashif commented Dec 2, 2020 •

edited

Loading