Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define and document expected behavior of API calls in various contexts #18970

Open
pabigot opened this issue Sep 6, 2019 · 12 comments
Open

Define and document expected behavior of API calls in various contexts #18970

pabigot opened this issue Sep 6, 2019 · 12 comments
Assignees
Labels
area: API Changes to public APIs Enhancement Changes/Updates/Additions to existing features

Comments

@pabigot
Copy link
Collaborator

pabigot commented Sep 6, 2019

This issue is intended to cover the concerns raised in #1960, #3694, #6184, #17014, and related lack of clarity and consistency.

Proposal: We should have:

  • Definitions of terms; see Draft terminology related to API behavior definitions #20959
  • A policy for when (kernel and driver) API functions are thread-safe/reentrant. Some are (to varying degrees), some are not.
  • At the individual function level we should clearly document:
    • Can a call block?
    • Any situation where the call doesn't block but also will not return for a very long time (e.g. chip-level flash erase)
    • Can the function be invoked from an ISR? What behavior is guaranteed in that situation?
    • Are the operations of the function atomic?
    • The context in which any callback is invoked.
    • Any conditions that must be satisfied before the function is invoked, e.g. holding a lock.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

Example policies might be:

  • All driver functions must be invokable from ISRs, but-EWOULDBLOCK shall be returned in situations where the requested operation cannot be successfully completed in that context.
  • All GPIO manipulation functions must be atomic.

If we can define the normal behavior at either a system or driver-specific level, only the exceptions need be documented at the individual function level.

@pabigot pabigot added the Enhancement Changes/Updates/Additions to existing features label Sep 6, 2019
@pabigot pabigot added the area: API Changes to public APIs label Sep 6, 2019
@mbolivar
Copy link
Contributor

  • thread-safe: (TBD) a function is thread-safe if its behavior is correct when invoked from multiple threads at the same time

  • reentrant: (TBD) a function is reentrant if its behavior is correct when it is invoked by (indirect) recursion from the same thread

+1 to these.

  • interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead.

This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

By contrast, "might sleep" makes it clearer that the caller's thread might get scheduled out, and thus further hints the caller must be running in thread context.

At the individual function level we should clearly document:

Can a call block?

So this would become e.g.:

/**
 * @brief Enable a clock controlled by the device
 *
 * On success, the clock is enabled and ready when this function
 * returns.
 *
 * This function :ref:`might-sleep`.
 *
 * @param dev Device structure whose driver controls the clock.
 * @param sys Opaque data representing the clock.
 * @return 0 on success, negative errno on failure.
 */
static inline int clock_control_on(struct device *dev,
				   clock_control_subsys_t sys)

Where the Sphinx might-sleep ref would link directly to a common definition for the term.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine
whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

+1 to k_is_atomic() instead of k_is_in_isr().

@mbolivar
Copy link
Contributor

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:

The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves.

@pabigot
Copy link
Collaborator Author

pabigot commented Nov 20, 2019

  • interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

We need to be able to say succinctly "Unless otherwise specified all API functions are interrupt-safe" and expect people to know what that means. A specific example would be the GPIO API, where it is not clear to implementers that, because GPIO write functions may be invoked from ISRs, read-modify-write code like:

u32_t out = gpio->OUT;
gpio->OUT ^= (out & ~mask) | (value & mask);

must be wrapped in a spin-lock to be interrupt-safe. Most implementations don't do that, even in the ongoing GPIO rewrite, in part because we couldn't get agreement in #19252 on what "interrupt-safe" meant and whether it was important to document that expectation before people started implementing drivers.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead.
This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

My intent was to identify functions that will not transfer control from a cooperative thread. "blocking" is a poorly chosen term. Scrap it and start over.

Zephyr uses several terms including "Waiting", "Suspended", and "Ready" to identify started non-terminated non-running threads. The transitions we're interested in are "suspend" and "wait". Let's use "yield" for "suspends or waits".

There are three cases:

  • The thread invoking a given function will not yield;
  • The thread invoking a given function may yield;
  • The thread invoking a given function will yield.

So we can do:

  • yields (thread): A thread yields when it invokes an operation that transitions it to Suspended or Waiting.

  • pre-empted: A thread is pre-empted when it transitions to Ready without having yielded.

  • yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.

  • non-yielding (function): A function is non-yielding when it is not yielding.

It's non-yielding that we care about, because those functions guarantee, when invoked by a cooperative thread, that no other cooperative thread will be allowed to run before the function returns. (Absent bugs like #20255.)

I'm not sure exactly how functions should be annotated with respect to their behavior, but I would want it to be terse and consistent.

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Yes, but that relies on having API to provide information about context to callees, which we probably don't have. To be considered later.

@mbolivar
Copy link
Contributor

A specific example would be the GPIO API

Excellent example, thank you.

Summarizing the status:

  • thread-safe: OK as-is
  • re-entrant: OK as-is
  • interrupt-safe: OK as-is
  • atomic: OK as-is
  • blocking: scrap it, need a different term
  • k_is_atomic(): no, not yet (and to clarify: in my mind, not ever, being convinced by the LWN article)

There are three cases:

* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

  • yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.
  • non-yielding (function): A function is non-yielding when it is not yielding.

With the given definition of "yield" ("suspends or waits"), I agree these are the desired semantics.

I also agree that defining terms in accordance with documented thread states and related APIs is a great approach.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

pabigot added a commit to pabigot/zephyr that referenced this issue Nov 24, 2019
This is an evolving document used to come to consensus on the set of
defined terms necessary to precisely document the expected behavior of
API calls in various contexts.

It's in Markdown rather than reStructured Text because that's most
convenient for the editor.

This commit archives the original proposal in issue zephyrproject-rtos#18970.

Signed-off-by: Peter Bigot <[email protected]>
pabigot added a commit to pabigot/zephyr that referenced this issue Nov 24, 2019
See mbolivar comment at
zephyrproject-rtos#18970 (comment)
and subsequent discussion.

Add Background section to reference and summarize key concepts from
existing documentation, particularly for thread state and priority and
execution context.

Replaced "blocking" with "yielding" for precision and consistency with
Zephyr thread terminology.

Add Commentary subsections that give a bit of explanation for why
something is important or how concepts relate to each other.

Signed-off-by: Peter Bigot <[email protected]>
@pabigot
Copy link
Collaborator Author

pabigot commented Nov 24, 2019

I've integrated the discussion so far into this document which is available for review through #20959 so people can comment on specific concerns in context. I think that'll be a bit more convenient then chained comments on this issue.

There are three cases:

* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

No. The operation of yielding is something that directly affects threads. The impact on functions is derived.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

A problem is that traditionally Zephyr has used "sleep" to specifically mean "wait for a timeout to occur". I would rather that "suspend" was more general, but currently it specifically references waiting without an event.

@carlescufi
Copy link
Member

carlescufi commented May 19, 2020

API meeting 19th May:

Plan of action to document the expected behavior of public API functions in Zephyr:

  • Define the format to be used in the header files (doc: add alias for properties of functions #25411)
  • Begin with a slice of the kernel, for example k_pipe and k_mutex (TBD). @pabigot volunteers to do this and then have it reviewed by all.
  • Evaluate any changes required to the property list (or their definitions)
  • Complete the kernel
  • Drivers and thread-safety discussion. Where should this be documented (in a single spot for all drivers minus exceptions or function-by-function using funcprops).
  • Assign each subsystem to their respective maintainers, keep a table that is updated with the current state of the process

During the 2.4 development cycle, begin each API meeting with a status update on the function properties.

@pabigot
Copy link
Collaborator Author

pabigot commented Jun 30, 2020

Here's my assessment of k_mutex and k_pipe:

function syscall reschedule sleep no-wait isr-ok pre-kernel-ok async supervisor
k_mutex_init YES no no yes yes
k_mutex_lock YES yes YES YES NO*
k_mutex_unlock YES YES no yes NO*
k_pipe_init no no no yes yes
k_pipe_cleanup no no no yes yes
k_pipe_alloc_init YES no no yes yes
k_pipe_put YES yes YES YES yes
k_pipe_get YES yes YES YES yes
k_pipe_block_put no yes YES no no YES
k_pipe_read_avail YES yes YES no no
k_pipe_write_avail YES yes YES no no

Some notes on the table

  • Upper case indicates a primary attribute setting. Lower case indicates a derived setting. Derived values include:
    • reschedule is yes if sleep is YES; it is YES only in some cases where sleep is no
    • sleep defaults to no. Whether suspending in a function like k_sem_give() that is reschedule is something that should be sleep is open for clarification.
    • no-wait is yes if not sleep. It may be YES or no if sleep
    • isr-ok is yes if either no-wait or not sleep. The two NO are an inconsistency that may be removed via enhance k_mutex to be ISR safe #25678
    • async is rare and only one of these functions has it.
  • syscall is just informational, not one of the standard function attributes
  • I've ignored pre-kernel-ok which is probably generally equivalent to isr-ok, but deep inspection is required to ensure it doesn't do something that isn't available until the main thread has started.
  • I've ignored supervisor which also requires inspection. I assume anything that's syscall is not supervisor, but the others may or may not be.

Determining the behavior of the API is not trivial. Some of the above may be wrong. Some were surprises, like the functions to read the number of spaces available for read and write in k_pipe, which can block for an unbounded period due to an internal call to an allocation function implemented as k_stack_pop() with a K_FOREVER wait.

I've identified 181 functions in kernel.h that could be categorized. There are some interesting inconsistencies where some functions are syscalls but other seemingly parallel ones are not. Some of the convenience wrappers like k_fifo and k_lifo are inconsistently implemented.

This is gonna be really unpleasant.

@carlescufi
Copy link
Member

@andyross and @andrewboie could you please review the contents of this table? The terminology is further explained here:
https://docs.zephyrproject.org/latest/reference/terminology.html

@nashif
Copy link
Member

nashif commented Dec 2, 2020

@pabigot was reading through all of this and was wondering why we use "sleep" (https://docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

@andrewboie
Copy link
Contributor

@pabigot was reading through all of this and was wondering why we use "sleep" (docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

AFAIK yes

@pabigot
Copy link
Collaborator Author

pabigot commented Dec 2, 2020

The terminology was introduced in #21678 based on analysis in #20969. The term "sleep" ultimately came from here.

To precisely describe what "put the caller to sleep" means to a user it became "a voluntary transition to a non-running active thread state".

The documented thread states relevant to "sleep" are transition to either "waiting" and "suspended". Technically it would also cover "ready" but AFAIK there is no API that allows voluntary direct transition from running to ready.

So the terminology does not reflect any internal kernel state names that implement the thread state machine. It's perhaps not too late to change, but I suspect "sleeps" will be more clear to people than "pends".

@pabigot
Copy link
Collaborator Author

pabigot commented Dec 2, 2020

Shorter: "sleeps" was intended to also describe a function that invokes k_thread_suspend(_current); which would instead be _THREAD_SUSPENDED.

pabigot added a commit to pabigot/zephyr that referenced this issue Mar 8, 2021
update per zephyrproject-rtos#18970 (comment)

Signed-off-by: Peter Bigot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs Enhancement Changes/Updates/Additions to existing features
Projects
None yet
Development

No branches or pull requests

5 participants