Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] render_nunjucks error spike #9150

Open
Pashaminkovsky opened this issue Sep 12, 2024 · 12 comments
Open

[Spike] render_nunjucks error spike #9150

Pashaminkovsky opened this issue Sep 12, 2024 · 12 comments
Assignees
Labels

Comments

@Pashaminkovsky
Copy link
Collaborator

Pashaminkovsky commented Sep 12, 2024

Spike details: https://www.notion.so/pixiebrix/Nunjucks-Error-Spike-0283abe59c6a48ad96f5f84715516ec3?showMoveTo=true&saveParent=true

Datadog dashboards:

test edit

@twschiller twschiller changed the title render_nunjucks error spike [Spike] render_nunjucks error spike Sep 13, 2024
@Pashaminkovsky Pashaminkovsky added the bug Something isn't working label Sep 13, 2024
@mnholtz mnholtz self-assigned this Sep 16, 2024
@mnholtz
Copy link
Collaborator

mnholtz commented Sep 16, 2024

Spent the afternoon looking into this and collected my observations on the notion document: https://www.notion.so/pixiebrix/Nunjucks-Error-Spike-0283abe59c6a48ad96f5f84715516ec3?showMoveTo=true&saveParent=true

I have not been able to identify root cause, but I propose creating a PR tomorrow that

  1. Increases timeout threshold for postMessage, e.g. from 3s to 5s
  2. Throw more specific errors/add additional logging to increase confidence that these are timeout issues, and not issues with the sandbox getting injected onto the page (I don't think the latter is the case, but it would be nice to have to rule-out that case for sure)

More involved solution/future work would be to come up with a different solution for rendering nunjucks templates, an approach I have not researched and have little context on atm

@Pashaminkovsky
Copy link
Collaborator Author

@mnholtz that sounds good I appreciate you looking into it. Re logging – is it possible to have visibility into the success rate of the mod run when the error is thrown?

@mnholtz
Copy link
Collaborator

mnholtz commented Sep 17, 2024

that sounds good I appreciate you looking into it. Re logging – is it possible to have visibility into the success rate of the mod run when the error is thrown?

Discussed over zoom - the answer is that how much of the mod is able to be run/will run on a simple page refresh/button click/etc. is determined on a per-mod basis based on the mod definition. Generally speaking this error is very disruptive to mod execution and would usually result in the user modifying their workflow in some way to overcome the issue

mnholtz added a commit that referenced this issue Sep 18, 2024
…o 5s (#9168)

* increase pRetry timeout in postMessage and injectIframe from 3s -> 5s

* introduce SandboxInjectionError

* throw SandboxInjectionError from getSandbox helper

* add constant for MAX_RETRIES

* refaxtor SandboxInjectionError

* refactor rename SandboxInjectionError -> IframeInjectionError
@mnholtz
Copy link
Collaborator

mnholtz commented Sep 20, 2024

Tentatively putting this back into the backlog until we can get more logging visibility/information from #9168

@mnholtz
Copy link
Collaborator

mnholtz commented Sep 24, 2024

It looks like increasing the timeout to 5s did not make any impact on the errors: https://app.datadoghq.com/logs?query=issue.id%3A6f1c3b50-4bb1-11ef-a033-da7ad0900002%20%2A%3A%22%22%20&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice%2C%40error.causes%5B1%5D&fromUser=true&messageDisplay=inline&refresh_mode=sliding&storage=hot&stream_sort=service%2Casc&viz=stream&from_ts=1726583812591&to_ts=1727188612591&live=true

And the cause is still mixed; doesn't seem to be an issue with the iframe getting injected onto the page (not a single one of these errors was caused by the new IframeInjectionError), but loading.

@mnholtz
Copy link
Collaborator

mnholtz commented Sep 25, 2024

After further investigation with @grahamlangford, we've identified the following additional hypotheses (both of which are documented on the original notion document):

  1. Large payloads sent to the sandbox to handle are causing users to hit the timeout limit for both the messages with the large payloads, and other in-flight messages in the queue
  2. iframes trigger load events even in the case where they've errored on initialization. It's possible that the iframe is erroring for some reason, and we are trying to message a bad iframe instance without knowing

Following this, we have identified the following next steps for a follow-up PR:

  1. Keep a tally of the current in-flight messages and their payload size to include in the error message, with the goal of validating hypothesis no. 1
  2. Retry iframe injection altogether rather than retrying on just messaging the sandbox iframe

mnholtz added a commit that referenced this issue Oct 2, 2024
* move postMessage to sandbox directory

* add try/catch to postMessage

* introduce SandboxTimeoutError

* refactor SandboxTimeoutMessage

* fix constructor

* bust sandbox reference cache on failed ping

* add sandbox timeout error to cache bust logic

* remove sandbox from dom if bad sandbox ping

* add comment

* remove iframe if load timeout occurs in loadSandbox

* fix unit tests

* add unit tests for SandboxTimeoutError

* refactor last test

* add jest.runAllTimers() before promise assertion

* fix postMessage test
@mnholtz
Copy link
Collaborator

mnholtz commented Oct 8, 2024

Adding re-injection in #9208 did not seem to have an effect.

Additional information added to error telemetry did not come through to datadog due to the way datadog selects error properties to report (oversight on my part). Will be fixing this in a follow-up commit.

mnholtz added a commit that referenced this issue Oct 22, 2024
* introduce widgets/ directory

* introduce demo/ directory

* refactor rename FormBuilder -> FormBuilderDemo

* fix sass component reference
@fungairino
Copy link
Collaborator

One avenue for us to pursue is to move the sandbox to the offscreen page to avoid issues related to injecting the iframe during brick execution runtime.

POC:
#9352

@fungairino fungairino self-assigned this Oct 25, 2024
@fungairino
Copy link
Collaborator

Another avenue that requires less of an architecture shift is to pre-emptively inject the sandbox onto the page upon content script initialization rather than waiting until the point where a brick requiring it is executing

@fungairino
Copy link
Collaborator

I merged in some pre-work in the above PR which will make it easier to extend the offscreen page to inject sandbox execution.

Pausing on this for now until we see how the current release does with render_nunjucks errors

@fregante
Copy link
Contributor

fregante commented Nov 7, 2024

Before you moved to the offscreen document, have you seen any initialization errors in the sidebar as well? During the chrome.sidePanel migration we attributed those errors to extension reloads but I don't know if they also happened on production.

@grahamlangford
Copy link
Collaborator

Before you moved to the offscreen document, have you seen any initialization errors in the sidebar as well? During the chrome.sidePanel migration we attributed those errors to extension reloads but I don't know if they also happened on production.

I haven't seen anything in our logging, and we haven't had any issues reported in a long time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants