
The Coroutine Rewrite Issue #2377

Closed
andrewrk opened this issue Apr 29, 2019 · 39 comments
Labels: accepted, breaking, proposal

@andrewrk
Member

This issue is one of the main goals of the 0.5.0 release cycle. It is blocking stable event-based I/O, which is blocking networking in the standard library, which is blocking the Zig Package Manager (#943).

Note: Zig's coroutines are and will continue to be "stackless" coroutines. They are nothing more than a language-supported way to do Continuation Passing Style.

Background:

Status quo coroutines have some crippling problems.

The Plan

Step 1. Result Location Mechanism

Implement the "result location" mechanism from the Copy Elision proposal (#287). This makes it so that every expression has a "result location" where the result will go. For example, if you were to do this:

test "example result location" {
    var w = hi();
}

const Point = struct {
    x: f32,
    y: f32,
};

fn hi() Point {
    return Point{
        .x = 1,
        .y = 2,
    };
}

What actually ends up happening is that hi gets a secret pointer parameter which is the address of w and initializes it directly.

Step 2. New Calling Syntax

Next, instead of coroutines being generic across an allocator parameter, they will use the result location as the coroutine frame. So the callsite can choose where the memory for the coroutine frame goes.

pub fn main() void {
    var x = myAsyncFunction();
}

In the above example, the coroutine frame goes into the variable x, which in this example is in the stack frame (or coroutine frame) of main.

The type of x is @Frame(myAsyncFunction). Every function will have a unique type associated with its coroutine frame. This means the memory can be manually managed, for example it can be put into the heap like this:

const ptr = try allocator.create(@Frame(myAsyncFunction));
ptr.* = myAsyncFunction();

@Frame could also be used to put a coroutine frame into a struct, or in static memory. It also means that, for now, it won't be possible to call a function that isn't comptime known. E.g. function pointers don't work unless the function pointer parameter is comptime.

The @Frame type will also represent the "promise"/"future" (#858) of the return value. The await syntax on this type will suspend the current coroutine, putting a pointer to its own handle into the awaited coroutine, which will tail-call resume the awaiter when the value is ready.

Next Steps

From this point the current plan is to start going down the path of #1778 and try it out. There are some problems to solve.

  • Does every suspend point in a coroutine represent a cancellation point? Does it cascade (if a coroutine which is awaiting another one gets canceled, does it cancel the target as well)?
  • How do defer and errdefer interact with suspension and cancellation? How does resource management work if the cleanup is after a suspend/cancellation point?
  • Is cancellation even a thing? Since the caller has ownership of the memory of the coroutine frame, it might not be necessary to support in the language. Coroutines could simply document whether they had to be awaited to avoid resource leaks or not.
  • Do we want to try to solve generators?

This proposal for coroutines in Zig gets us closer to a final iteration of how things will be, but you can see it may require some more design iterations as we learn more through experimentation.

@andrewrk added the breaking, proposal, and accepted labels Apr 29, 2019
@andrewrk added this to the 0.5.0 milestone Apr 29, 2019
@andrewrk
Member Author

Once this is complete, it will be time to revisit this work-in-progress code: #1848

@justinbalexander
Contributor

justinbalexander commented Apr 29, 2019 via email

andrewrk added a commit that referenced this issue Apr 30, 2019
This is work-in-progress because it's blocked by coroutines depending on
the Allocator interface, which will be solved with the coroutine rewrite
(#2377).

closes #2292
@adrusi

adrusi commented May 1, 2019

Since this issue touches on the topic of generators, I'll share my implementation of generators in Zig 0.4.0. https://gist.github.com/adrusi/0dfebca7ff79a8e9ff0c94259d89146d

The gist is that generators can be defined like:

const Gen = Generator(struct {
    slice: []i32,

    pub async fn generator(ctx: *@This(), item: *i32) !void {
        for (ctx.slice) |x| {
            item.* = x;
            suspend;
        }
    }
});

And used like:

var gen = Iterator.init(allocator, Gen.Args {
    .slice = []i32{ 1, 2, 3, 4, 5 },
});
defer gen.deinit();
while (try gen.next()) |item| {
    warn("{}\n", item);
}

I think that unless prettier generator support comes along for free with the coroutine rewrite, we should try a library approach like this first, and only move toward language support for generators if it ends up being insufficient.

(Currently I have an issue that forces .next() to return anyerror!?Item instead of properly inferring the errorset. The compiler fails an assertion if I try to track the errorset. I'll create a separate issue when I can figure out how to get the same failure with less code. #2398)

@jido

jido commented May 12, 2019

Next, instead of coroutines being generic across an allocator parameter, they will use the result location as the coroutine frame.

Does that mean that the whole frame stays allocated as long as the returned value is alive?

I believe the two should have different lifetimes.

@andrewrk
Member Author

The coroutine frame's lifetime is manually managed by the caller. If you keep reading you can see the example of how to give them different lifetimes by putting the frame on the heap.

For an event-based I/O function calling another one (and using the result), this will work nicely because the coroutine frame of the callee will exist inside the coroutine frame of the caller.

@Matthias247

Regarding the cancellation question: I did a writeup for Rust async functions on the impact of requiring async functions to always be cancellable: https://gist.github.com/Matthias247/ffc0f189742abf6aa41a226fe07398a8
Maybe it can help you.

TLDR:

  • A requirement of synchronous cancellation impacts the ability to express completion-based APIs (e.g. IOCP on Windows, io_uring on Linux, and lots of hypervisor or OS related APIs) via coroutines, since those APIs might not support synchronous cancellation.
  • Implicit returns at suspend points might be unexpected, and resemble exceptions in that they are hidden control flow.
  • But obviously the ability to unconditionally cancel at least all supported operations, without needing to worry about whether the implementor ignored cancellation capabilities, also has its benefits.

andrewrk added a commit that referenced this issue Jun 26, 2019
Rather than fixing regressions with deprecated coroutines, I'm going
to let them regress more until #2377 is solved.
@komuw

komuw commented Jul 2, 2019

"Why Continuations are Coming to Java" - https://www.infoq.com/presentations/continuations-java/

@mikdusan
Member

mikdusan commented Jul 2, 2019

<andrewrk> I'm thinking of calling them something other than "coroutines" 
<andrewrk> when people hear that word they think of "light threads" with stacks, but they're actually stackless, which confuses people 
<andrewrk> how about if they were called "continuations" 
<andrewrk> really they're just a way to do Continuation Passing Style that is supported by the language 
<andrewrk> rather than "zig has stackless coroutines" maybe it would be clearer to say "zig has continuation-passing-style syntax" 
<andrewrk> that would also hint that the feature is available in freestanding mode (which it is)

how about coframes indicating that's all you get - 1 frame

@andrewrk
Member Author

I've started working on this in the rewrite-coroutines branch, and making good progress.

Here's a simple example:

var x: i32 = 1;

export fn entry() void {
    const p = async simpleAsyncFn();
}
fn simpleAsyncFn() void {
    x +%= 1;
    suspend;
    x +%= 1;
}

Notice that no async calling convention decorator is required on simpleAsyncFn.

; Function Attrs: nobuiltin nounwind
define void @entry() #2 !dbg !35 {
Entry:
  %p = alloca %"@Frame(simpleAsyncFn)", align 8
  %0 = getelementptr inbounds %"@Frame(simpleAsyncFn)", %"@Frame(simpleAsyncFn)"* %p, i32 0, i32 0, !dbg !44
  store i64 0, i64* %0, !dbg !44
  %1 = call fastcc i64 @simpleAsyncFn(%"@Frame(simpleAsyncFn)"* %p), !dbg !44
  call void @llvm.dbg.declare(metadata %"@Frame(simpleAsyncFn)"* %p, metadata !39, metadata !DIExpression()), !dbg !45
  ret void, !dbg !46
}

; Function Attrs: nobuiltin nounwind
define internal fastcc i64 @simpleAsyncFn(%"@Frame(simpleAsyncFn)"* nonnull) unnamed_addr #2 !dbg !47 {
AsyncSwitch:
  %1 = getelementptr inbounds %"@Frame(simpleAsyncFn)", %"@Frame(simpleAsyncFn)"* %0, i32 0, i32 0, !dbg !51
  %2 = load i64, i64* %1, !dbg !51
  switch i64 %2, label %BadResume [
    i64 0, label %Entry
    i64 1, label %GetSize
    i64 2, label %Resume
  ], !dbg !51

Entry:                                            ; preds = %AsyncSwitch
  %3 = load i32, i32* @x, align 4, !dbg !52
  %4 = add i32 %3, 1, !dbg !54
  store i32 %4, i32* @x, align 4, !dbg !54
  %5 = getelementptr inbounds %"@Frame(simpleAsyncFn)", %"@Frame(simpleAsyncFn)"* %0, i32 0, i32 0, !dbg !55
  store i64 2, i64* %5, !dbg !55
  ret i64 undef, !dbg !55

Resume:                                           ; preds = %AsyncSwitch
  %6 = load i32, i32* @x, align 4, !dbg !56
  %7 = add i32 %6, 1, !dbg !57
  store i32 %7, i32* @x, align 4, !dbg !57
  %8 = getelementptr inbounds %"@Frame(simpleAsyncFn)", %"@Frame(simpleAsyncFn)"* %0, i32 0, i32 0, !dbg !58
  store i64 3, i64* %8, !dbg !58
  ret i64 undef, !dbg !58

BadResume:                                        ; preds = %AsyncSwitch
  tail call fastcc void @panic(%"[]u8"* @1, %builtin.StackTrace* null), !dbg !51
  unreachable, !dbg !51

GetSize:                                          ; preds = %AsyncSwitch
  ret i64 8, !dbg !51
}

If you look at this gist you can compare the above code example with the equivalent from master branch and see how much simpler this is.

One thing I figured out is that we can support async function pointers (when the function is not comptime-known). How it works is that the frame size is not comptime known. So you wouldn't be able to put the memory on your stack. Something like:

const size = @asyncFrameSize(asyncFnPtr); // runtime known asyncFnPtr; runtime known result
var bytes = try some_allocator.alloc(u8, size);
const frame_ptr = @asyncCall(bytes, asyncFnPtr, arg1, arg2, ...);

If you look at the above LLVM IR code, you can see the GetSize basic block. Every async function which could potentially be called virtually (e.g. as a function pointer) would have this, and this is how the @asyncFrameSize would be implemented, by passing 1 as the resume_index and using the return value.

You can see that calling async function pointers is a lot less convenient syntactically than calling them when the function is comptime known. I think this is actually a good thing. There are plenty of reasons to prefer calling comptime-known functions, and the inconvenience here merely reveals the actual problems of virtual async functions. However, the fact that it is still possible allows interfaces such as an asynchronous output stream API to be made.


One design question to answer regarding async functions is: "do we want async function frames to be movable?"

If we make the answer "no" then that allows result locations to work for async functions which are called directly:

const result = myAsyncFunction(); // initializes result directly even though this is an async function call

But it means that copying an async function frame to somewhere else and then resuming it is not allowed. Semantically even functions called with async syntax could have a result location, but I don't know how it would be expressed syntactically:

var frame, const result = async someFunction();

// result is undefined here

await frame;

// now `result` has the return value of someFunction()

The alternative is that the result is always inside the async function frame, and await accesses it. This would mean an important difference between async functions and normal functions: for async functions, return would always be making a copy. Given that the goal is sort of to erase the differences between async functions and normal functions, I think it will be worth exploring at least making the first example have no-copy semantics.

@andrewrk
Member Author

andrewrk commented Jul 23, 2019

Without #2761, here's a sketch of what async function semantics would look like, with result location support for the return value:

fn parseFile(allocator: *Allocator, path: []const u8) anyerror!BigStruct {
    // ...

    const bytes = try allocator.alloc(u8, 100);
    errdefer allocator.free(bytes);

    // ...

    return BigStruct {
        // ...
        .bytes = bytes,
        // ...
    };
}

pub fn readFoo() anyerror!BigStruct {
    // frame is stack allocated and provided explicitly
    // result location is implicitly stack allocated
    var frame1 = async parseFile(allocator, "foo.txt");
    errdefer cancel frame1;

    // frame is heap allocated and provided explicitly
    const frame2 = try allocator.create(@Frame(parseFile));
    errdefer allocator.destroy(frame2);
    // result location heap allocated and provided explicitly
    const result2 = try allocator.create(anyerror!BigStruct);
    errdefer allocator.destroy(result2);

    frame2.* = async<result2> parseFile(allocator, "bar.txt");
    errdefer cancel frame2;

    // result2 is undefined here

    const payload1 = try await frame1;
    // `cancel frame1` is now a no-op
    errdefer allocator.free(payload1.bytes);

    _ = await frame2;
    // `cancel frame2` is now a no-op
    // result2 is available now
    const payload2 = try result2.*;
    // alternately, we could have done: const payload2 = try await frame2;
    defer allocator.free(payload2.bytes);

    compare(payload1, payload2);

    // The `cancel` control flow here from the errdefers checks the `completed`
    // flag of each frame and become no-ops in this case. Canceling an already
    // awaited async function is harmless. When canceling a frame that has not
    // already been canceled or awaited, it runs the errdefers that are in scope
    // at the return statement.

    return payload1;
}

With this, I can answer all the questions from the Next Steps section above:

  • Does every suspend point in a coroutine represent a cancellation point? Does it cascade (if a coroutine which is awaiting another one gets canceled, does it cancel the target as well)?
  • Is cancellation even a thing? Since the caller has ownership of the memory of the coroutine frame, it might not be necessary to support in the language. Coroutines could simply document whether they had to be awaited to avoid resource leaks or not.

I am considering two options, both viable. One is that every suspend point in an async function is a cancellation point. It would cascade. With this option, cancel is blocking, and suspending in a defer/errdefer never occurs. This matches master branch semantics.

The other option is that there are no cancellation points. cancel itself represents a suspend point. It's the same thing as await, except once the function gets to the return, it runs the errdefers along with the defers. cancel is allowed in defer/errdefer, and suspending inside defer/errdefer is allowed since suspension points do not represent cancellation points and are therefore guaranteed to be resumed. Cancellation points could be added manually by having the async function return a possible error.

I am leaning towards the second option because it is simpler, and offers better performance and less bloated binaries.

  • How do defer and errdefer interact with suspension and cancellation? How does resource management work if the cleanup is after a suspend/cancellation point?

When the async function gets to a return statement, it knows whether it is dealing with a cancel or an await. In the former case it runs errdefers and defers. In the latter case it populates the result location and then runs only defers.

  • Do we want to try to solve generators?

I think we do, but it might be a separate concept, in addition to async functions. I'm keeping the use case in mind as I work through this issue.

The main things we need for generators are:

  • generator frames can be created without calling the function. they simply initialize resume_index to 0. var generator = @generator(range, 1, 10); Syntax is not decided; this is just an example.
  • each resume of a generator has a result location that the generator places the result into. var n = resume generator;
  • generators themselves could be async, in which case "resuming" a generator would be a suspend point itself.
  • we probably want type safety for generators. an async/await operation is fundamentally different than using a generator.
  • return x; in a generator desugars to yield x; @panic("generator resumed after completion");. This allows things like try to work, and return null; to indicate the generator is finished.
  • whether yield is distinct from suspend is something I'm still considering.

One design question to answer regarding async functions is: "do we want async function frames to be movable?"

The answer here is pretty clearly "no". Async function frames are already not movable if they get casted to anyframe->T because that is a pointer type. The @asyncCall function, necessary for calling async function pointers, will return an anyframe->T type.


The zig code above explores result location support for the return value. However that is probably too complicated for a first pass, and it does not solve how to have something like anyframe->i32 (which would be the return type of @asyncCall) to represent a pointer to any async function frame with the ability to await or cancel it. As a first implementation, I plan to put the return value inside the coroutine frame so that we can have this.

Here is the zig code example simplified in this way:

fn parseFile(allocator: *Allocator, path: []const u8) anyerror!BigStruct {
    // ...

    const bytes = try allocator.alloc(u8, 100);
    errdefer allocator.free(bytes);

    // ...

    return BigStruct {
        // ...
        .bytes = bytes,
        // ...
    };
}

pub fn readFoo() anyerror!BigStruct {
    // frame is stack allocated and provided explicitly
    // result location is inside frame1
    var frame1 = async parseFile(allocator, "foo.txt");
    errdefer cancel frame1;

    // frame is heap allocated and provided explicitly
    // result location is inside frame2
    const frame2 = try allocator.create(@Frame(parseFile));
    errdefer allocator.destroy(frame2);

    frame2.* = async parseFile(allocator, "bar.txt");
    errdefer cancel frame2;

    const payload1 = try await frame1;
    errdefer allocator.free(payload1.bytes);

    const payload2 = try await frame2;
    defer allocator.free(payload2.bytes);

    compare(payload1, payload2);

    return payload1;
}

@andrewrk
Member Author

andrewrk commented Jul 24, 2019

I started implementing this as a switch, and I think that was indeed the cleaner option, except for one thing which I just realized: generating async functions as separate functions requires one less piece of state in the frame.

When implementing async functions with a switch, there is a resume_index field in the frame, which starts at 0 and increments at every suspend point. The async function is code-generated as a single function which immediately switches on the resume index. To call an async function, the callsite sets resume_index to 0 before calling the async function.

When implementing async functions with function splitting, there is a resume_fn field in the frame, which is a pointer to the next function to be called. Each suspend point stores the next function pointer in the frame before returning. To call an async function, the callsite does not need to set any state in the frame; it simply calls the first function. (It does need to set the awaiter handle to 0 (same as in the switch implementation), so the frame cannot be truly undefined memory.)

One of the main use cases of async functions is an event loop, which has a set of suspended async functions and wants to resume them when appropriate. It has pointers to their frames (the anyframe type) and must obtain the function pointer to call. And so the switch implementation still needs the function pointer to the async function in the frame. However with the function splitting implementation, the resume_fn field suffices. To summarize: because the anyframe type allows the resume keyword to resume the function, it must be possible to resume an async function based on a pointer to its frame. And so the function splitting approach requires fewer fields in the frame.

Function splitting should also be better for performance in this situation: a virtual function call is roughly equivalent to the big switch statement, but when resuming via anyframe, the function splitting implementation incurs one virtual call, while the switch implementation incurs the same virtual call plus the extra switch.

So I'm pretty sure the better approach here is function splitting. However for implementation simplicity's sake I will continue with the switch implementation in the stage1 compiler. We can consider changing the implementation to a function splitting approach once the rest of the concepts are proven, or we can leave it alone and be better informed for the self-hosted compiler.

@andrewrk
Member Author

Regarding function splitting vs switch:

There is one thing I did not consider above, which is the implementation of @frameSize. Currently, with the switch implementation, resume_index of 1 is reserved for getting the frame size. This is quite simple to implement: set resume_index = 1 before calling the async function and then the frame size is returned as the return value.

With function splitting, a different strategy is required. I believe there is a good way to do this: with Prefix Data. Each async function has Prefix Data which is the frame size. This means that @frameSize can look up the frame size based on the function pointer.

This prefix data implementation is so clean that it's a better option to use even for a switch implementation of async functions.

@andrewrk
Member Author

Regarding anyframe, anyframe->T, and @Frame(function):

What this coroutine rewrite brings to the table is this new type @Frame(function), which represents not just any async function frame, but the frame for a specific function. And, importantly, it is complete: the size is known and it can be used in structs, parameters, local variables, etc.

The @Frame(foo) type can be used when doing an async function call of the foo function. A pointer of this type, (*@Frame(foo)) can be implicitly casted to anyframe->T (where T is the return type of foo) or anyframe.

anyframe->T represents an in-progress async function call, which can be used as the operand to await, resume, or cancel. When awaited, it returns type T. anyframe->T can be implicitly casted to anyframe.

anyframe also represents an in-progress async function call, but only resume and cancel are available.

One thing I'm trying to figure out is how to represent this in the type system, e.g. @typeInfo. anyframe and anyframe->T are easy to represent: they're the same type in @typeInfo and the payload type is optional.

But what about @Frame(func)? It could also be represented as a frame type, the same as anyframe and anyframe->T, but with an optional function. However, it's also different in that it's not a pointer type, but the anyframe types are pointer types. If we wanted to try to make them the same type, we could make the anyframe types be more like opaque types, where the size isn't known, but they're not pointers, and you'd use, e.g. *anyframe and *anyframe->T in practice, just like * is necessary to use with opaque types.

Another option is to allow the anyframe and anyframe->T types to remain pointer types, and @Frame(func) remain a non-pointer type, but still represent them as the same type. Code using @typeInfo and the internal compiler implementation would have to distinguish carefully based on whether a function was part of the type.

Finally, one reasonable option might be simply having the Frame and AnyFrame types be different types.

An unsolved problem is how to represent @Frame(func) in @typeInfo, because there's no type that can be used to store the function. E.g. if it were func: async fn() void then this prototype only works for functions with no parameters and a void return type. Perhaps some types having fields is reasonable after all: @Frame(func).function == func.

@emekoi
Contributor

emekoi commented Jul 27, 2019

will @Frame solve #157 then?

@andrewrk
Member Author

It's definitely related. I'm purposefully avoiding the recursion issue with these coroutine changes to keep the complexity under control, but indeed this is a step towards safe recursion.

@andrewrk
Member Author

andrewrk commented Aug 1, 2019

A couple updates:

Function splitting vs big switch statement

I went ahead and rewrote the rewrite-coroutines branch using function splitting rather than a big switch statement. This resulted in a bit more implementation complexity but produced simpler generated code: smaller, faster binaries with lower memory requirements (the reasons for this are described above in this issue).

Cancellation

I'm confident about how to do cancellation now. This is the second option described above, with some of the details ironed out.

suspend no longer represents a cancellation point. According to the semantics of the language, suspend is always followed with a resume, and it is a resource leak to not resume a suspended async function.

cancel itself represents a suspend point. It's the same thing as await, except once the function gets to the return, it runs the errdefers along with the defers. cancel is allowed in defer/errdefer, and suspending inside defer/errdefer is allowed since suspension points do not represent cancellation points and are therefore guaranteed to be resumed in correct code.

Cancellation points can be added manually by using @isCancelled() and having the async function return a possible error, like this:

if (@isCancelled()) {
    return error.Cancelled; // this is not a special error code, this is just an example
}

@isCancelled is allowed in non-async functions and always returns false. Whether this will be comptime or runtime known is yet to be determined.

@daurnimator
Contributor

cancel itself represents a suspend point.

How does one cancel? Is it e.g. a method on the frame type? someframe.cancel()

@andrewrk
Member Author

andrewrk commented Aug 2, 2019

How does one cancel? Is it e.g. a method on the frame type? someframe.cancel()

It's currently a keyword. You can see an example in #2377 (comment)

@daurnimator
Contributor

You can see an example in #2377 (comment)

In that example cancel cancels a frame.
I guess I misunderstood what you meant by "cancel itself represents a suspend point"? I took that to mean that adding cancel means that the running coroutine may be cancelled at that point.

@andrewrk
Member Author

andrewrk commented Aug 2, 2019

cancel is exactly the same thing as await, except it sets the cancel bit.

Once a function has been awaited, the resources and side-effects that it allocated/performed are now the caller's responsibility.

When an async function returns, if the cancel bit is set, it runs all the errdefers, which means the caller never has to take responsibility for the resources and side-effects that the function allocated/performed. This makes e.g. try still usable in between async and await.

@isCancelled() exposes the cancel bit.

@kristoff-it
Member

kristoff-it commented Aug 2, 2019

This design seems great!

Is it correct to say that @isCancelled() can change state only after a suspend? If that's the case, then maybe it would be worth thinking up some syntax for suspend, something similar to:

fn foo() void {
  // ...
  suspend {
    // only executes if @isCancelled() == true
  }
  // ...
} 

This way we implicitly make it clear when it's a good time to check whether the coroutine got cancelled, so that people don't get confused and check at random points or forget to refactor after moving a suspend.

@andrewrk
Member Author

andrewrk commented Aug 3, 2019

Is it correct to say that @isCancelled() can change state only after a suspend?

Good question. In a single-threaded build, this is true, because only while suspended is it possible for other code to do anything such as perform a cancel.

However in a multi-threaded build, here's an example of how this can work:

  • two I/O operations complete at roughly the same time, resuming 2 async functions simultaneously
  • one of the async functions is running and is not suspended
  • the second one performs cancel on the first one.

cancel immediately and atomically sets a bit which can be read via @isCancelled() at any time.

suspend is oblivious to whether or not the cancel bit is set; it doesn't care.

There is one possible improvement that can be made, which would be to try to avoid doing the atomic rmw more than once in the event of a cancel. @isCancelled() requires an atomic read of the awaiter/cancel-bit, and so does return. However, if the cancel bit is set, atomic ops are no longer required (and the awaiter has already been loaded). With some creative syntax, there may be a way to re-use the same awaiter/cancel-bit which is obtained from @isCancelled() for use in the return. It could be something such as:

cancelpoint {
    // any code can go here. it runs if the cancel bit is set.
    // `return` and `try` within this scope are generated without atomic ops
}

And to change your mind about whether to return, or just to learn if the cancel bit is set, you could label the cancel point:

var is_cancelled = false;
label: cancelpoint {
    break :label true;
}

Maybe that could have an else:

const is_cancelled = label: cancelpoint break :label true else false;

This would be uncommon though; most code would want to do something like this:

cancelpoint return error.Cancelled;

@daurnimator
Contributor

@kristoff-it the suspend may be in a called function

fn foo() void {
    set_things_up();
    suspend;
}
fn bar() !void {
    foo();
    if (@isCancelled()) {
        return error.Cancelled;
    }
    expensiveOperation();
}

@andrewrk
Member Author

andrewrk commented Aug 3, 2019

I just realized a neat idea: if the async function frames have the following header:

extern struct {
    prev_frame_ptr: usize, // pointer to the caller's frame (whether it be stack or async)
    return_address: usize, // the address of the next instruction at the callsite
}

Then the exact same stack trace walking code will actually be able to print a stack trace of async function calls in addition to normal function calls. Async functions effectively have two stack traces that can be printed:

  • The stack trace that represents where all the async foo() calls are made. And when it gets to the first async call, it seamlessly continues walking up the normal function stack.
  • The stack trace of where the functions were resumed from (resume foo), which probably comes from the event loop.

The former is more generally useful when developing an application; the latter is useful when working on the event loop code. One could select which stack trace to get by passing either @ptrToInt(@frame()) (the async frame address, or the stack frame address for a non-async function) or @frameAddress() (always the stack frame address) as the "first" stack frame to dumpCurrentStackTrace.

When this is paired with error return traces, zig's debugging facilities regarding async/await are going to be unbeatable.

Imagine that catch unreachable caught an error during an async function. All of the following things would be available:

  • "attempt to unwrap error"
  • the name of the error that was unexpectedly caught
  • a stack trace leading from main() to the event loop to the resume to the point in the async function where it got resumed
  • a stack trace going back through all the async calls to the first async function
  • an error return trace, which goes back through all the await and async calls.

So if your async and await are far apart, this debugging information would connect the dots for you to tell a complete story about what went wrong.

@fengb commented Aug 11, 2019

If we want any function to be coroutineable, we might have potential resource leaks:

fn foo() !*u8 {
  const result = try std.heap.direct_allocator.create(u8);
  result.* = 0;
  return result;
}

The naive analysis shows no memory leak because the allocation is immediately returned, but in async mode there’s a second error path at the implicit cancel point.

An even more subtle variant is one where there’s no error in the body, but we need to release:

fn bar() Resource {
  return Resource.acquire();
}

This function shouldn’t even be able to fail, so errdefer doesn’t really make sense...

@andrewrk

I'm glad you brought this up.

Here's how I see this playing out. Firstly, these functions can be modified to be cancelable:

fn foo() !*u8 {
    const result = try std.heap.direct_allocator.create(u8);
    errdefer std.heap.direct_allocator.destroy(result);
    result.* = 0;
    return result;
}
fn bar() Resource {
    const result = Resource.acquire();
    errdefer result.release();
    return result;
}

I'll be the first to admit, it's a bit counter-intuitive that this would be necessary. On the other hand, writing code this way does better document how resources work for the caller, and makes it easier to make modifications to the functions without forgetting to add resource cleanup.

Here's a related proposal that I rejected a while ago: #782
Maybe it's worth reviving this in a slightly modified form.

Another thing we can do is add a nocancel attribute:

nocancel fn foo() !*u8 {
  const result = try std.heap.direct_allocator.create(u8);
  result.* = 0;
  return result;
}
nocancel fn bar() Resource {
  return Resource.acquire();
}

Then if you tried to cancel frame; where @typeOf(frame) == @Frame(foo) or @typeOf(frame) == nocancel anyframe->Resource you'd get a compile error: "attempt to cancel non-cancelable function".

It would be a compile error to implicitly cast nocancel fn to fn but it would be allowed to implicitly cast fn to nocancel fn.

With these examples, however, I do think the first suggestion would be best: make them both properly cancelable.


In practice, I foresee some resource leakage happening, just like we occasionally see memory leaks in manually memory managed programs. But these are fixable bugs, with plenty of tooling to help identify the leaks, and the flip side is that this allows us to employ advanced resource management techniques. For example, maybe in a big section of code, cancel is never used and memory leaks every time an error occurs. However, this section of code uses an arena allocator, and everything is cleaned up at the end, so it's all ok.

@mikdusan

how about cancel being an expression with type ?ReturnType:

fn bar() Resource {
    return Resource.acquire();
}

var f = async bar();
if (cancel f) |r| {
    r.release();
}

@andrewrk

how about cancel being an expression with type ?ReturnType:

I like where you're going with this, but I think it can get more complicated. Consider this example:

const std = @import("std");
const Allocator = std.mem.Allocator;

const simulate_fail_download = false;
const simulate_fail_file = false;

pub fn main() void {
    _ = async amainWrap();
    resume global_file_frame;
    resume global_download_frame;
}

fn amainWrap() void {
    amain() catch |e| {
        std.debug.warn("{}\n", e);
        std.process.exit(1);
    };
}

fn amain() !void {
    const allocator = std.heap.direct_allocator;
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer cancel download_frame;

    var file_frame = async readFile(allocator, "something.txt");
    errdefer cancel file_frame;

    const file_text = try await file_frame;
    defer allocator.free(file_text);

    const download_text = try await download_frame;
    defer allocator.free(download_text);

    std.debug.warn("download_text: {}\n", download_text);
    std.debug.warn("file_text: {}\n", file_text);
}

var global_download_frame: anyframe = undefined;
fn fetchUrl(allocator: *Allocator, url: []const u8) ![]u8 {
    global_download_frame = @frame();
    const result = try std.mem.dupe(allocator, u8, "this is the downloaded url contents");
    errdefer allocator.free(result);
    suspend;
    if (simulate_fail_download) return error.NoResponse;
    std.debug.warn("fetchUrl returning\n");
    return result;
}

var global_file_frame: anyframe = undefined;
fn readFile(allocator: *Allocator, filename: []const u8) ![]u8 {
    global_file_frame = @frame();
    const result = try std.mem.dupe(allocator, u8, "this is the file contents");
    errdefer allocator.free(result);
    suspend;
    if (simulate_fail_file) return error.FileNotFound;
    std.debug.warn("readFile returning\n");
    return result;
}

@andrewrk

The design challenge here, specifically, is to make it so that return error.Foo; is valid to put on any line. e.g. try should work everywhere.

@andrewrk

With @mikdusan proposal above, I think it would look like this:

    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer cancel (download_frame) |result| if (result) |r| allocator.free(r) else |_| {}

    var file_frame = async readFile(allocator, "something.txt");
    errdefer cancel (file_frame) |result| if (result) |r| allocator.free(r) else |_| {}

Not amazing looking. And at this point, why do we even need cancel? Here's how to do it without cancel:

fn amain() !void {
    const allocator = std.heap.direct_allocator;
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var awaited_download_frame = false;
    errdefer if (!awaited_download_frame) {
        if (await download_frame) |r| allocator.free(r) else |_| {}
    }

    var file_frame = async readFile(allocator, "something.txt");
    var awaited_file_frame = false;
    errdefer if (!awaited_file_frame) {
        if (await file_frame) |r| allocator.free(r) else |_| {}
    }

    const file_text = try await file_frame;
    awaited_file_frame = true;
    defer allocator.free(file_text);

    const download_text = try await download_frame;
    awaited_download_frame = true;
    defer allocator.free(download_text);

    std.debug.warn("download_text: {}\n", download_text);
    std.debug.warn("file_text: {}\n", file_text);
}

Even less amazing looking... but it does remove the need for an entire feature.

@kristoff-it commented Aug 14, 2019

It seems to me that the problem lies mainly in the interaction of two issues

  1. a function might be participating in a complex resource allocation / release scheme where it returns to the caller something the function itself has allocated
  2. a cancel will prevent a function from returning a result value
    1. in a synchronous function it's a problem because the author might not be aware of this possibility
    2. an async func might be canceled at various points in its lifecycle, so depending on when it gets canceled, it will have a different set of resources to free

nocancel doesn't seem a good solution because of point 2.i: I think the most common case would be that people are not aware of this surprising consequence of how async / await works in Zig. If the author does know about that feature, then they might as well just make the function correct, which should normally be fairly easy, because all intentional leaks must be tied to output values. Given this point, I think a keyword with the inverse meaning would work better.

I also understand that restricting people by default is not something Zig likes to do, so here's a second idea to complement the first, based on the concept of considering cancel a particular type of early return.

fn bar() Resource {
    const result = Resource.acquire();
    errdefer result.release();
    return result;
}

The previous function supports cancellation but, as stated already by Andrew, it's counter-intuitive, since no error can happen after .acquire(). Additionally, a quick look at a more complex function doesn't really tell you if the author knew about cancellation or not, since errdefers are pretty common. But I think that's an easy fix with a small change to the current semantics:

fn bar() Resource {
    const result = Resource.acquire();
    canceldefer result.release();
    return result;
}

A canceldefer only runs when a function is cancelled. When you see a canceldefer you know that the author is aware of cancellation and that they took proper care in writing the function. The Zig compiler could do the same type of inference: if a function is using async / await or contains a canceldefer, then it's cancellable, otherwise, it needs to be marked explicitly.

Now, the straightforward way of naming the explicit keyword would be cancellable, but it might be interesting to try and think outside of the box a little:

noleaks fn bar(allocator: *Allocator) void {
    var in_file = try os.File.openRead(allocator, source_path);
    defer in_file.close();
}

A noleaks keyword has almost the same implication as cancellable, but it's a more general concept and a nice hint to readers. I think people would be more interested in knowing that a function doesn't leak, more than knowing it can be canceled (one almost always cares about managing resources and only occasionally cares about cancellability).

Adding noleaks to a function that contains a canceldefer should result in an error.

  1. A function with a canceldefer is a function that wants to hand out resources and that is aware of cancellability.
  2. A noleaks function is a function that does not intend to hand out any resource ever.
  3. A function with neither canceldefer nor noleaks makes no promises about its intentions.

Please note how by making canceldefer and noleaks mutually exclusive we made the latter become a stronger statement than just cancellable. This also made cancellable inference more effective because all the non-noleaks functions that can be cancelled must have at least one canceldefer. This still won't allow users to use the async notation on functions like math.sqrt, but that might even be a good thing.

A few more considerations:

  1. I think canceldefer should be a thing because errdefer is about unwanted happenings on our level of the stack, or lower; while being cancelled is about unwanted things happening somewhere above our frame.
  2. I think erring on the side of caution is important in this case, as I stated above I think nocancel would not work well because the default would allow any leaky function to be canceled. I think noleaks is better than a hypothetical doesleak for the same reason.
  3. To mitigate the amount of erroneous caution, looking for canceldefer or the other async/await-related keywords seems a good starting point for both compilers and humans. Maybe by tagging functions as constructors and destructors one could further minimize mistakes (by automatically inferring that a function doesn't leak if every ctor is matched by a dtor), but I would be curious to know how much improvement we would get over this heuristic.
  4. Inferring cancellability, as presented in this comment, should also be resilient to code changes to a reasonable degree
    1. If a function has one canceldefer and loses it, then it will stop being cancellable. This might be an unexpected change, but it's easy to discover this problem via a test (which should already have been present to test cancellability, since the author explicitly added a cancellability-related keyword). Mistakes could happen, but normal, thoughtful programming might be good enough.

    2. noleaks is a bit trickier. One could add the keyword, forget about it, and then change the function in a way that makes it hand-off resources to the caller. I think the name is explicit enough so that people won't easily leave it if they decide to pursue the dark arts, but if we want more safety, here's a couple of ideas:

      1. if a test invokes a noleaks function and the test then leaks, make the test fail. Not sure if that is even possible for non-trivial cases.
      2. Require noleaks functions to have in their body a noleakassert that asserts in the appropriate way that all resources have been taken care of. In debug mode it's a runtime panic if the assert fails or if no noleakassert runs when the function gets cancelled. A bit overkill, in my opinion.

      Anyway, if a function is changed to make it leak, it will probably get a canceldefer added to it sooner or later, and the compiler will complain about it, resulting in a fix.

  5. One last argument in favor of preferring noleaks over nocancel is code translated from C.
  6. In the beginning it could be good to only infer cancellability from canceldefer and cancelpoint, and only later relax the rule to also include async, await, suspend, resume.

@andrewrk commented Aug 14, 2019

Thank you @kristoff-it for this detailed proposal.

I've thought long and hard about this problem, and consulted with @thejoshwolfe, and I feel confident with the path forward. And... it's the "even less amazing looking" solution above. That is:

  • I will remove the cancel syntax. No more concept of canceling.

That's it. The rules become very simple: If you async something, you must await it, and you must do that exactly once. Any kind of "cancellation" concept would be purely implemented in userland.

So that means the way to write the example would be:

fn amain() !void {
    const allocator = std.heap.direct_allocator;
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var awaited_download_frame = false;
    errdefer if (!awaited_download_frame) {
        if (await download_frame) |r| allocator.free(r) else |_| {}
    }

    var file_frame = async readFile(allocator, "something.txt");
    var awaited_file_frame = false;
    errdefer if (!awaited_file_frame) {
        if (await file_frame) |r| allocator.free(r) else |_| {}
    }

    awaited_file_frame = true;
    const file_text = try await file_frame;
    defer allocator.free(file_text);

    awaited_download_frame = true;
    const download_text = try await download_frame;
    defer allocator.free(download_text);

    std.debug.warn("download_text: {}\n", download_text);
    std.debug.warn("file_text: {}\n", file_text);
}

It's verbose, but try is available to use at any point. The nesting is limited. There are no hidden allocations or control flow. Advanced usage of resources, such as arenas, is possible.

Here is another way to write this example:

fn amain() !void {
    const allocator = std.heap.direct_allocator;

    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var file_frame = async readFile(allocator, "something.txt");

    const file_text_res = await file_frame;
    defer if (file_text_res) |x| allocator.free(x) else |_| {}

    const download_text_res = await download_frame;
    defer if (download_text_res) |x| allocator.free(x) else |_| {}

    const file_text = try file_text_res;
    const download_text = try download_text_res;

    std.debug.warn("download_text: {}\n", download_text);
    std.debug.warn("file_text: {}\n", file_text);
}

I think this would actually be a less recommended way to do it, because between the first async and the last defer, it is invalid to use try. It does look a bit less noisy though.


Anyway, removing cancel solves all the design considerations:

  • If readFile or fetchUrl are normal, blocking functions, then @Frame(readFile) or @Frame(fetchUrl) become simple wrapper types for the return type of the respective function. The entire function call completes at the async call site, and the "frame" just has the return value. At the await, it's a copy of the return value to the result location of the await.
  • Async functions no longer have to stop at return when they beat the await. They can continue to run defers and suspend when nothing is left to do. An await on an async function which has already returned does not even have to resume it; it simply copies the value and moves on.
  • errdefer continues to mean only running defers when an error is returned. No surprising new behavior here.

These semantics are certainly simple, and while requiring verbosity, will be easy to understand, debug, and maintain. Let's move forward with this plan, and collect data on usage patterns to inform what, if anything, needs to be modified.

@kyle-github commented Aug 14, 2019

@andrewrk, nice. I like how clean it is.

Cancelling an async function seemed like it was going to be a tar pit. Not just for resource leaks, but for maintaining data invariants. For instance, what if you had an async function that did something to a linked list. If I could cancel it in the middle of unlinking a node, I might be left with a corrupt data structure.

I wonder if there is a clean way to do something like await but without blocking? I.e. I can see a situation where you have something like:

var doc = async fetchUrl(allocator, "http://www.example.com");
defer if (doc) |d| allocator.free(d) else |_| {}
var done = false;

while(!done) {
    if(doc) |_| { done = true; } else |_| { spinProgressSpinner(); }
}

... use doc ...

That's totally made up and I am overloading the error handling forms. The idea is that I do not want to block on the if. I want to get back some sort of "error" or at least "not ready" condition so that I can do something else while I wait. It is a non-blocking wait.

@kyle-github

Hmm, that example could be easier:

while(! @isReady(doc)) {
    spinProgressSpinner();
}

@kyle-github commented Aug 15, 2019

I reread through all the comments on this issue as well as many of the related issues. I feel a bit confused. In no particular order:

Marking functions async at the call site but not in the definition. Can I just slap "async" on any function call? I like that idea but it means that the function in question would need to be compiled twice. Once for the non-async case and once for the async case. I think I am missing something here :-(

Using suspend outside of an async function. At least one of the examples above shows suspend being called in a function that is not the async function but one that it calls. How does that get implemented? If I have:

fn notAsync(x:i32) i32 {
   var y = 42;
   suspend;  // <-- how does this work??
   return x+y;
}

fn asyncFunc(a:i32) {
    var b = notAsync(a) + 10;
    return b;
}

fn main(...) {
    var aFuture = async asyncFunc(13);
    var aVal = await aFuture;
    return aVal;
}

How does the suspend in notAsync() work? That isn't an async function. So there's no hidden frame, right? Where does suspend save the execution address into?

Where's the stack? Assuming that the calling pattern I have above is possible, where is the stack on which the activation record for notAsync() is allocated? If I call other functions from within my async function, they need to have activation records somewhere. Right?

Perhaps I am getting confused with protothreads. I have used those before and the C version is kind of a pain because a protothread cannot suspend in a called normal function. I also had a C++ version and I got around that by making a protothread class for each function with an overloaded () operator so I could call the instances like functions. All smoke and mirrors, but the code was pretty clean.

Sorry if these are stupid questions. I must be really missing something...

Note that Gabriel Kerneis implemented continuation passing in C using a transform which was a lot like creating protothreads (or Zig-style) continuations in C. It is an interesting read. Their results are quite nice and it was fairly clean: Continuation Passing C.

@daurnimator

(answers are based on my understanding; not necessarily correct)

Can I just slap "async" on any function call?

Yes

I like that idea but it means that the function in question would need to be compiled twice. Once for the non-async case and once for the async case.

Not exactly: an async call of a function that never uses @frame or suspend is just a normal call with a different result location.
But for a function that calls suspend (or a function that calls a function that calls suspend) yes it could be compiled twice. I don't think this is a problem.

@kyle-github

Thanks for the response, @daurnimator!

I like that idea but it means that the function in question would need to be compiled twice. Once for the non-async case and once for the async case.

Not exactly: an async call of a function that never uses @frame or suspend is just a normal call with a different result location.
But for a function that calls suspend (or a function that calls a function that calls suspend) yes it could be compiled twice. I don't think this is a problem.

OK... But won't you have to recompile the whole call chain from the tip of the async call all the way down to where the suspend is called? That is not necessarily a problem, but it seems like it could lead to some surprises when your code size increases suddenly.

And now I am wondering what suspend is for. It returns no value that I can see. From the comments above, I must call await exactly once for each async function. If there was just async and await the symmetry is clear. But with suspend and resume it is less clear. Can I call resume until the function stops? What if I call resume too many times and the function is trying to return?

Are there two things being conflated here? One is having functions that execute asynchronously, but are otherwise just functions. They get called. They return. In this case, they get called asynchronously and then when we want to, we wait for them to finish. suspend and resume seem like they are more geared toward reentrant functions like generators, but it is not clear to me how the two kinds of thing mesh.

@andrewrk

Done with the merge of #3033

@karrick commented Aug 16, 2019

wow, congratulations! this is huge. watching the video now
