Server-side Blazor Production and Reliability #10472

danroth27 · 2019-05-22T22:55:54Z

Edit: @rynowak hijacking top post for great justice

Summary

This issue tracks doing all the needed work to support server-side Blazor in production.

I plan to dig into the following areas and for each area assess the current state, make recommendations and log bugs, and write documentation and guidance.

Working in order:

Error handling
Logging and Diagnostics
Network Reliability
Resiliency to app recycling
Scale out

In the background the team will be fixing any high priority issues that come up.

Error Handling

This includes the possible causes and categories of errors that can occur in a server-side Blazor application, and how application developers should be prepared to deal with them.

Make error handling explicit on boundaries between framework and user code (lifecycle methods, rendering)

We divide unhandled exceptions into two categories:

Exception thrown with an observer (event handler)
Exception thrown without an observer (during rendering)

Exceptions that are thrown as a result of an event handler (observed) are not always bugs. For example, it might be reasonable for a component to throw an exception in response to invalid data. We think that logging is good enough for these cases.

Make sure errors thrown on event handlers are logged on the server
Make sure client-side code cannot see unsanitized exception details

For unobserved exceptions, these are generally thrown during the rendering process and can corrupt state. We should not attempt to recover or reuse the circuit if an exception is thrown while rendering.

Tear-down/dispose crashed circuits and disconnect the client.
~~Notify the client that the circuit has crashed (ideally with UI).~~

UPDATE: Most of this is done. The part that isn't is "displaying an error UI on the client", which is not planned for 3.0.
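The two categories above can be modeled roughly as follows. This is a language-neutral sketch written in TypeScript, not Blazor's implementation: the `Circuit` class, its methods, and the log messages are all hypothetical names chosen for illustration. It shows an observed exception being logged on the server with only a generic message returned to the client, and an exception during rendering disposing the circuit.

```typescript
// Hypothetical model of a circuit; none of these names come from Blazor itself.
type CircuitState = "active" | "disposed";

class Circuit {
  private state: CircuitState = "active";
  readonly serverLog: string[] = [];

  isActive(): boolean {
    return this.state === "active";
  }

  // Observed exception: thrown by an event handler.
  // Log the details on the server and keep the circuit alive.
  dispatchEvent(handler: () => void): string | null {
    if (this.state !== "active") throw new Error("circuit is disposed");
    try {
      handler();
      return null;
    } catch (err) {
      this.serverLog.push(`event handler failed: ${String(err)}`);
      // Only a sanitized, generic message is returned for the client.
      return "An error occurred while handling the event.";
    }
  }

  // Unobserved exception: thrown while rendering.
  // State may be corrupt, so the circuit is disposed instead of reused.
  render(renderFn: () => string): string | null {
    if (this.state !== "active") throw new Error("circuit is disposed");
    try {
      return renderFn();
    } catch (err) {
      this.serverLog.push(`render failed, disposing circuit: ${String(err)}`);
      this.state = "disposed";
      return null;
    }
  }
}
```

In this model, any call made after a failed render throws, which mirrors the recommendation not to recover or reuse a circuit once rendering has failed.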

Logging and Diagnostics (Handled as part of #11792)

This includes logging and diagnostics of communications between server-side .NET code and client-side JS code, as well as any significant events on the server side. This can also include DiagnosticSource, EventSource, and EventCounters. We will likely make a prioritized list in this area and draw a cutline. My assumption is that the priority here is around the network ingress/egress.

We need to dial up the amount of logging we can produce on both the server and client, and make it possible for developers to diagnose and report issues that we can take action upon using logs.

We need logging for entry/exit/results of all Hub calls on both the client and server. Some of this is provided by SignalR, but we need to log the relevant data at an appropriate level.
Where we're using JS interop for fundamental framework concerns, we need to make the diagnostic information first-class. One way to do this is by converting JS interop to a hub method.
We need to add logging and diagnostics for JS Interop.

Network Reliability

This covers the reliability of the SignalR connection, the ability to resume a circuit, and the ability for the browser to reconnect without data loss.

We have some user-reported issues that we are acting upon here, but we need to identify a strategy for testing reliability, and ideally this would dovetail with our other E2E testing strategy.

When a client disconnects, rendering updates will be queued on the server and delivered in order once the client reconnects. [Blazor server-side] Limit the amount of queued pending renders #11964
~~When a message fails to send on the client, the message will be queued and delivered in order.~~ This is going to apply to ACKs only, and we've decided not to do anything for JS Interop.
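As an illustration of queuing render updates while the client is disconnected, with a cap on how many may be pending, here is a minimal sketch. It is not Blazor's implementation: `PendingRenderQueue`, `RenderBatch`, and the `maxPending` limit are hypothetical names, and the sketch assumes the client acknowledges batches in order.

```typescript
// Hypothetical names throughout; a model of the idea only.
interface RenderBatch {
  id: number;
  payload: string;
}

class PendingRenderQueue {
  private readonly unacked: RenderBatch[] = [];
  private readonly maxPending: number;
  private nextId = 1;

  constructor(maxPending: number) {
    this.maxPending = maxPending;
  }

  // Queue a batch. Returns false once the limit is reached so the caller
  // can decide what to do (for example, stop rendering for this client).
  enqueue(payload: string): boolean {
    if (this.unacked.length >= this.maxPending) return false;
    this.unacked.push({ id: this.nextId++, payload });
    return true;
  }

  // The client acknowledged everything up to and including `id`.
  acknowledge(id: number): void {
    let head = this.unacked[0];
    while (head !== undefined && head.id <= id) {
      this.unacked.shift();
      head = this.unacked[0];
    }
  }

  // On reconnect, resend everything not yet acknowledged, oldest first.
  batchesToResend(): RenderBatch[] {
    return [...this.unacked];
  }
}
```

Bounding the queue matters because a client that never reconnects would otherwise let pending batches accumulate in server memory.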

The below three issues will be handled as part of milestone verification work after Preview 8 CC date (as part of #12196).

Test that a client can disconnect and reconnect multiple times without loss of data.
Test clients with a slow connection/high
Test scenarios with a high interaction rate, with the goal of providing guidance about patterns that do and do not work well for server-side Blazor.

Resource Consumption

This covers understanding and mitigating the causes of excessive resource consumption on the server. Due to Blazor's stateful and connected nature, keeping an eye on how usage patterns can lead to resource exhaustion is important.

Proactively remove a circuit when the user closes a tab. Dispose Circuits on graceful disconnect #12197
~~Proactively stop/start rendering when a user isn't looking at the tab.~~ No plans to do this
~~Deactivate circuits due to inactivity. This could be a sample since it might not fit all use cases.~~ No plans to do this
~~Can we rate-limit the number of connections open (per user/total)? This could use the CircuitHandler if we added the ability to reject a connection.~~ No plans to do this
~~Can we rate-limit events per-connection?~~ No plans to do this
Provide guidance and documentation for understanding resource consumption per-user/per-circuit. Provide guidance for scalability for Blazor AspNetCore.Docs#13294

Resiliency to App Recycling

This covers the set of infrastructure and guidance users will need to build applications that function well when the server is shut down or crashes. This is important because server-side Blazor holds the application state in memory on the server. The default experience is that if the server goes away, so does all of your state that hasn't been persisted to a data store.

Guidance and documentation for how to architect apps that don't rely on keeping all of the state for a workflow in memory (paginated form/wizard).
~~Provide a sample of how components can be notified for a circuit shutdown/load and use that callback to save/load state.~~ No, we wouldn't recommend persisting only on circuit shutdown, as that would be highly unreliable. What if the server goes down unexpectedly? Instead we recommend and have guidance for persisting state frequently, e.g., whenever the user changes that state.
Sample of a component that persists UI state to local storage to be resilient to catastrophic failures.
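A minimal sketch of the "persist on every change" recommendation, written in TypeScript. The names `KeyValueStore`, `WizardState`, `PersistedWizard`, and `MemoryStore` are hypothetical. `KeyValueStore` is a subset of the browser's `Storage` interface, so `window.localStorage` could be passed in a browser; in a server-side Blazor app the storage call would go through JS interop, which this sketch does not model.

```typescript
// Minimal subset of the browser's Storage interface, so the sketch
// also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical UI state for a multi-step form (wizard).
interface WizardState {
  step: number;
  name: string;
}

const initialState: WizardState = { step: 1, name: "" };

// In-memory stand-in for localStorage, for illustration only.
class MemoryStore implements KeyValueStore {
  private readonly items = new Map<string, string>();
  getItem(key: string): string | null {
    const value = this.items.get(key);
    return value === undefined ? null : value;
  }
  setItem(key: string, value: string): void {
    this.items.set(key, value);
  }
}

class PersistedWizard {
  private readonly store: KeyValueStore;
  private readonly key: string;
  private state: WizardState;

  constructor(store: KeyValueStore, key: string) {
    this.store = store;
    this.key = key;
    // Restore previously saved state if there is any.
    const saved = store.getItem(key);
    this.state = saved === null ? { ...initialState } : (JSON.parse(saved) as WizardState);
  }

  // Persist whenever the user changes the state, not only on shutdown.
  update(change: Partial<WizardState>): void {
    this.state = { ...this.state, ...change };
    this.store.setItem(this.key, JSON.stringify(this.state));
  }

  current(): WizardState {
    return { ...this.state };
  }
}
```

Because every `update` writes through to the store, a new `PersistedWizard` created after a crash or a reconnect to a different server starts from the last saved step rather than from the beginning.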

Scale out

This covers the set of steps that are required to deploy server-side Blazor in a scalable way (multiple servers). Since Blazor uses server-side memory to hold state, we expect applications to commonly need multiple servers and a scale-out strategy.

Scale out strategies for server-side Blazor will rely on stickiness provided by the Azure SignalR service.
A non-Azure-based deployment of server-side Blazor will rely on stickiness/affinity being enforced by a load balancer.
Address the scale-out problems caused by data protection (we're using data protection for CircuitIds which introduces the need for external storage).

Known Items

Circuits are not being cleaned up without traffic #9893
Solution for accumulation of Disposable transient services #5496 Investigate accumulation of Disposable transient services
Allow robust reconnects when client do not perform graceful disconnects #8003
Server-side Blazor E2E performance and capacity testing #10449
~~[Blazor server-side] Better server-side limits #9117~~


mkArtakMSFT · 2019-05-23T23:24:01Z

This issue tracks high level work for:

any feature work (if necessary)
guidance / docs

mkArtakMSFT · 2019-05-23T23:26:13Z

We need to first measure the load on some representative app (Blazing Pizza app)

ADefWebserver · 2019-05-23T23:36:59Z

> We need to first measure the load on some representative app (Blazing Pizza app)

You can also find a large server side app at https://github.com/oqtane/oqtane.framework

NTaylorMullen · 2019-06-13T18:01:35Z

The ignitor client is checked in to enable perf and security testing.

/cc @pranavkm @javiercn

mkArtakMSFT · 2019-06-23T20:17:04Z

Moving this to Preview8 as the work is ongoing and will continue in the next milestone.

SteveSandersonMS · 2019-07-01T14:25:45Z

OK, after some further consideration, discussions, and research, here's what I'm proposing should actually be considered or attempted during preview 8, in priority order:

Ensure clients can't cause a global unhandled exception that would terminate the process (either accidentally or deliberately)
- e.g., today they can by making a JSInterop call and passing an unknown object ID
- This is tracked in Ensure server-side Blazor failure modes are correct (e.g., clients can't cause a global unhandled exception) #11791
Expand logging and diagnostics
- Ensure developers can trace all the major things that happen in a server-side Blazor application (e.g., circuits starting/ending) with a special focus on all incoming calls from the client (event notifications, JSInterop calls, renderbatch ACKs)
- This is tracked in Expand Blazor logging and diagnostics #11792
Provide examples/guidance around state management when reconnecting to a different server
- This can include patterns like storing state in localStorage or a database, and addressing the question of whether/how to restore some level of UI state. For server-side state storage, how do we associate the state with the circuit, and how do we ensure we clean it up at the right time?
- Consider whether it makes sense ever to do this on a per-component basis, as opposed to on an app-wide basis
- This is tracked in Guidance for cross-server state management in Blazor #11793
Provide guidance on error handling
- Enumerate all the places we'd expect a developer to write some manual error handling code so we can document and explain them.
- With respect to the different boundaries between framework and application code (lifecycle methods, event handlers, dispose, DI services, rendering, etc.), which of them will terminate the circuit if there's an exception? Which of them will surface the error in a way that naturally appears to the user and/or gets logged?
- This is tracked in Guidance for error handling in Blazor #11794
Feature: Auto-reload if circuit reconnection fails
Feature: proactively terminate circuits if a user disconnects gracefully

The last two "feature" items there are stretch targets. I'm definitely not promising they both get done and shipped for 3.0. It depends on how much other stuff is going on, since they are not committed for inclusion at this stage. It may be that it's sufficient to be able to describe ways of achieving something along those lines in user code (as some server-side Blazor developers already have done for "auto-reload if reconnection fails").
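For the "auto-reload if reconnection fails" idea, one way user code might structure the decision is sketched below. This is an assumption-laden illustration, not a Blazor API: `ReconnectPolicy`, `ReconnectAction`, and `afterFailedAttempt` are hypothetical names, and the exponential backoff is just one reasonable choice. The actual reconnect call and the page reload (for example `location.reload()` in a browser) are left to the caller.

```typescript
// What the client should do next after a failed reconnection attempt.
type ReconnectAction =
  | { kind: "retry"; delayMs: number }
  | { kind: "reload" };

// Hypothetical policy values; tune them for the app.
interface ReconnectPolicy {
  maxAttempts: number;
  baseDelayMs: number;
}

// Decide the next step given how many attempts have failed so far
// (1 after the first failure). Retries back off exponentially; once the
// limit is reached, the page is reloaded to start a fresh circuit.
function afterFailedAttempt(failedAttempts: number, policy: ReconnectPolicy): ReconnectAction {
  if (failedAttempts >= policy.maxAttempts) {
    return { kind: "reload" };
  }
  return { kind: "retry", delayMs: policy.baseDelayMs * 2 ** (failedAttempts - 1) };
}
```

Keeping the decision in a pure function makes the retry limits easy to test separately from the networking code that performs the attempts.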

javiercn · 2019-07-01T14:33:02Z

> Ensure clients can't cause a global unhandled exception that would terminate the process (either accidentally or deliberately)
>
> e.g., today they can by making a JSInterop call and passing an unknown object ID

This overlaps with some of the security stuff. I was compiling the same list to get ready to file issues for them. I guess we can add the individual items to the uber issues and just mark them as complete in both places? (Even if it involves a bit of duplication that way we ensure everything we want for a given aspect is covered).

SteveSandersonMS · 2019-07-01T14:34:21Z

@javiercn OK, that's good to know. Let's discuss in the sync this morning. Maybe we can break out that particular deliverable into its own issue, assign to you if you're already working on it, and then yes mark it complete in both places when done.

BenHayat · 2019-07-09T19:35:30Z

Hello Team;
Please put the final resolutions in the Blazor Docs, so we will have ONE place to go to for docs and solutions, rather than chasing bits and pieces on GitHub.

Thanks!

danroth27 · 2019-07-10T00:03:17Z

@BenHayat Absolutely, that's the plan.

BenHayat · 2019-07-10T00:09:09Z

@danroth27 ;
Thank you Sir!

Please cover the issue that for every user who starts our app, a new instance starts and server capacity becomes limited.
We need to know the limit so we can control the number of users (perhaps allowing only registered users and blocking anonymous users to lower the number of instances). This is one area that we are nervous about.

I'm sure the client-side version will not have this side effect, as our server will be mainly a REST server that can respond to a lot more users.
Thank you Dan...

SteveSandersonMS · 2019-07-15T10:13:20Z

OK, all the bits that we committed to do in preview 8 (#10472 (comment)) are now done. Any small remaining bits are tracked in their own issues (I think @javiercn still has a couple of JS interop-related bits underway, and all the parts I'm doing are done).

I'm unassigning myself now and assigning @mkArtakMSFT for any further triage that might be needed. We may choose to close this issue since it probably doesn't contain remaining value as its items are all covered elsewhere.

mkArtakMSFT · 2019-07-15T16:45:58Z

Closing as everything else is now covered as part of referenced issues.

DrGriff · 2019-09-11T10:07:42Z

This is for server-side Blazor. When can we assume the Client-side version to be considered "production ready"? Thanks

danroth27 · 2019-09-13T16:29:43Z

Hi @DrGriff. We haven't officially announced the roadmap for Blazor WebAssembly yet, but we don't expect to ship it until after we've finished work on .NET Core 3.1, which we previously announced will ship in Nov of this year. After .NET Core 3.1 is done, Blazor WebAssembly will become our main focus and we expect it will be ready for production use in the first half of 2020.

danroth27 added the area-blazor and Components Big Rock labels May 22, 2019

danroth27 added this to the 3.0.0-preview7 milestone May 22, 2019

mkArtakMSFT added the Needs: Spec label May 23, 2019

mkArtakMSFT assigned danroth27 May 23, 2019

mkArtakMSFT added the PRI: 1 - Required label May 23, 2019

mkArtakMSFT assigned javiercn May 23, 2019

mkArtakMSFT assigned rynowak and unassigned javiercn May 24, 2019

danroth27 mentioned this issue May 24, 2019

[Blazor server-side] Better server-side limits #9117

Closed

rynowak mentioned this issue May 30, 2019

Blazor 3.0 Roadmap #8177

Closed

56 tasks

rynowak changed the title ~~Server-side Blazor in production~~ Server-side Blazor Production and Reliability May 30, 2019

mkArtakMSFT mentioned this issue Jun 4, 2019

Process Memory growing in server-side Blazor only on button press. #10623

Closed

rynowak removed the Needs: Spec label Jun 4, 2019

mkArtakMSFT assigned pranavkm and unassigned rynowak and danroth27 Jun 4, 2019

mkArtakMSFT added the cost: XL label Jun 7, 2019

mkArtakMSFT assigned NTaylorMullen Jun 7, 2019

mkArtakMSFT added the Working label Jun 15, 2019

mkArtakMSFT removed this from the 3.0.0-preview7 milestone Jun 23, 2019

mkArtakMSFT assigned SteveSandersonMS Jun 26, 2019

mkArtakMSFT removed the Working label Jun 26, 2019

mkArtakMSFT added the Working label Jul 1, 2019

mkArtakMSFT mentioned this issue Jul 8, 2019

Blazor Server-Side Question & Discussion - # of server instants running... #11945

Closed

SteveSandersonMS assigned mkArtakMSFT and unassigned SteveSandersonMS Jul 15, 2019

mkArtakMSFT closed this as completed Jul 15, 2019

mkArtakMSFT added the Done label and removed the Working label Jul 15, 2019

mkArtakMSFT assigned SteveSandersonMS and unassigned mkArtakMSFT Jul 15, 2019

ghost locked as resolved and limited conversation to collaborators Dec 3, 2019
