Smoke tests for production releases #3585

Closed
19 tasks
srtalbot opened this issue May 10, 2024 · 10 comments

Comments

@srtalbot
Contributor

srtalbot commented May 10, 2024

Description:

Add smoke tests so that we know key aspects of the infrastructure and application are working live in production after a release.

Two approaches:

  • Start with: Health check that listens for an event
  • Next step: Can monitor the events with a "heartbeat" form that submits every x minutes to test the application. (Need to find a way to not affect our form submission metrics)
  • How might we test on Production and Staging? Given their difference in use, how do we check both environments? Production to see if things are working, Staging to catch regressions.
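
One possible way to keep a heartbeat form out of the submission metrics, sketched under assumptions: the heartbeat uses a dedicated form whose ID is known, and metrics are computed from records that carry the form ID. The `Submission` type, the `HEARTBEAT_FORM_IDS` set, and the ID in it are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ID of a dedicated heartbeat form (not taken from this issue).
HEARTBEAT_FORM_IDS = frozenset({"heartbeat-form-id"})


@dataclass(frozen=True)
class Submission:
    form_id: str
    submitted_at: str  # ISO 8601 timestamp


def split_submissions(submissions):
    """Separate real submissions from heartbeat submissions.

    Returns (real, heartbeat) so that metrics can count only the first list
    while health checks look only at the second.
    """
    real = [s for s in submissions if s.form_id not in HEARTBEAT_FORM_IDS]
    heartbeat = [s for s in submissions if s.form_id in HEARTBEAT_FORM_IDS]
    return real, heartbeat
```

Filtering on a known form ID keeps the heartbeat visible to health checks without changing how real submissions are counted.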

Acceptance criteria:

  • Form responses are sent to Notify (Health check, check Notify API with a specific template ID or through a CloudWatch metric alarm with specific filters.)
  • Notify's call-back is working and the reliability queue is responding appropriately to the response code. (Health check, check the logs for responses from the call-back. Update the logs to include 200 responses.)
  • Audit, responses, and form logs are being collected, saved, and archived (Health check)
  • Nagware is sending emails (Health check)
  • Check nagware emails on Tuesday when they send next (Health check)
  • 2FA codes are being sent (Both)
  • The dead letter queue is working correctly (No way to check without bringing down the reliability queue)
  • The application performance is running as expected (Health check)
  • Form responses are being submitted via the vault (Both)
  • Forms can be built from scratch (Explore automated test)
  • Forms are being published (Explore automated test)
  • Forms can be deleted (Explore automated test)
  • Create an account (Explore automated test)
  • Reset password (Explore automated test)
  • Change and manage permissions (Manual test from support)
  • Support and Contact Us form send data to Freshdesk (Health check)
  • Uptick in 400 or 500 codes to the application (Health check)
  • Alarms are sending to Slack (Can control the message on the alarm, but trigger a Sev 1 or check the lambda)
  • Alarms are sending to OpsGenie (Can control the message on the alarm, but trigger a Sev 1 or look at the lambda)
@patheard
Member

patheard commented Jun 11, 2024

Started testing a Form submission heartbeat with curl and had no luck with the following:

curl \
    --request GET \
    --location \
    --cookie-jar ./cookies \
    --verbose \
    https://forms-staging.cdssandbox.xyz/en/id/clx95fa1y00049c9wgmzyz2ee

curl \
    --request POST \
    --location \
    --cookie ./cookies \
    --header "Content-Type: application/json" \
    --data '[{"1":"Yes","currentGroup":"start"},"en",{"id":"clx95fa1y00049c9wgmzyz2ee","updatedAt":"Mon Jun 10 2024 18:19:05 GMT+0000 (Coordinated Universal Time)","form":{"groups":{},"layout":[1],"titleEn":"Healthcheck","titleFr":"Contrôle de santé","elements":[{"id":1,"type":"radio","properties":{"choices":[{"en":"Yes","fr":"Oui"},{"en":"No","fr":"Non"}],"titleEn":"Things are working?","titleFr":"Les choses fonctionnent bien ?","validation":{"required":true},"subElements":[],"descriptionEn":"","descriptionFr":"","placeholderEn":"","placeholderFr":""}}],"confirmation":{"descriptionEn":"Form has been submitted!","descriptionFr":"Le formulaire a été soumis !","referrerUrlEn":"","referrerUrlFr":""},"introduction":{"descriptionEn":"Simple healthcheck form to confirm the system is able to handle submissions.","descriptionFr":"Formulaire simple de contrôle de santé pour confirmer que le système est en mesure de traiter les soumissions."},"privacyPolicy":{"descriptionEn":"No personal information will be collected as part of this form submission.","descriptionFr":"Aucune information personnelle ne sera collectée par le biais de ce formulaire."}},"isPublished":true,"securityAttribute":"Unclassified"}]' \
    --verbose \
    https://forms-staging.cdssandbox.xyz/en/id/clx95fa1y00049c9wgmzyz2ee

Will start looking at Puppeteer next.

@patheard
Member

Based on our chat after the App Router release, I'm going to start looking at the following:

  1. Define CloudWatch metric filters for the log events we want to monitor. I will start with form submissions, both from the client and lambda perspective. This may involve minor PRs to add new logging detail.
  2. Create a dashboard that graphs these metrics over time to give a health snapshot.
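
As a rough sketch of step 1, the function below builds the arguments for the CloudWatch Logs `put_metric_filter` call. It assumes the application writes JSON log events with `level` and `msg` fields; those field names, the `Forms/Health` namespace, and the example values are illustrative guesses, not the actual Forms logging format.

```python
def submission_metric_filter(log_group: str, metric_name: str,
                             level: str, message: str) -> dict:
    """Build keyword arguments for CloudWatch Logs `put_metric_filter`.

    The `level` and `msg` JSON field names are assumptions about the
    log format, not something confirmed in this issue.
    """
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}Filter",
        # JSON filter pattern: match events where both fields equal the given values.
        "filterPattern": f'{{ ($.level = "{level}") && ($.msg = "{message}") }}',
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": "Forms/Health",  # hypothetical namespace
                "metricValue": "1",    # count one per matching log event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   params = submission_metric_filter("/aws/lambda/Submission",
#                                     "FormSubmissions", "info", "Form submitted")
#   boto3.client("logs").put_metric_filter(**params)
```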

Once that's taken care of, we can look at adding more metrics and alarms.

@patheard
Member

A service health dashboard has been started in Staging to graph metrics:
https://ca-central-1.console.aws.amazon.com/cloudwatch/home?region=ca-central-1#dashboards/dashboard/Forms-System-Health

@patheard
Member

patheard commented Jun 18, 2024

The service health dashboard and custom metrics were released today as part of v3.10.0. However, after more investigation, we realized that the custom metrics would cost more than they were worth.

These metrics will be removed as part of the v3.10.2 release and the service health dashboard will be updated to generate its graphs via CloudWatch Logs Insights queries.
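
For illustration, a dashboard graph backed by a Logs Insights query instead of a custom metric might be assembled roughly like this. The log group name and the matched message text are placeholders, and the widget layout values are arbitrary; only the `ca-central-1` region comes from the dashboard link above.

```python
import json


def submissions_query(log_group: str, message: str, bin_minutes: int = 5) -> str:
    """Logs Insights query counting matching log events per time bin."""
    return (
        f"SOURCE '{log_group}' "
        f'| filter @message like "{message}" '
        f"| stats count(*) as submissions by bin({bin_minutes}m)"
    )


def log_widget(title: str, query: str, region: str = "ca-central-1") -> dict:
    """A CloudWatch dashboard widget of type `log` drawn as a time series."""
    return {
        "type": "log",
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "query": query,
            "view": "timeSeries",
        },
    }


def dashboard_body(widgets: list) -> str:
    """Serialize widgets into the JSON body expected by `put_dashboard`."""
    return json.dumps({"widgets": widgets})
```

The resulting string could be passed as `DashboardBody` to the CloudWatch `put_dashboard` call, or kept in Terraform as the dashboard body.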

@patheard
Member

  • Service health dashboard has been updated to use CloudWatch Logs Insights queries and will go to prod with the next infra release. It will still need an app release before it's fully populated (needs the new app log statements).
  • Working on submission lambda invocation health check alarms. Testing with a core hours approach and an anomaly detection approach.
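
A minimal sketch of what a core hours check could look like, assuming the alarm should only fire when no submissions arrive during hours when people are expected to be using forms. The hours, the weekday rule, and the function names are assumptions for illustration; the real check would presumably be a CloudWatch alarm rather than application code.

```python
from datetime import datetime, time, timezone

# Assumed core hours, expressed in UTC. These values are illustrative only.
CORE_START_UTC = time(13, 0)
CORE_END_UTC = time(22, 0)


def in_core_hours(moment: datetime) -> bool:
    """True on weekdays between the assumed core hours (timezone-aware input)."""
    utc = moment.astimezone(timezone.utc)
    is_weekday = utc.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and CORE_START_UTC <= utc.time() < CORE_END_UTC


def should_alarm(moment: datetime, invocations: int, minimum: int = 1) -> bool:
    """Alarm only when invocations fall below the minimum during core hours."""
    return in_core_hours(moment) and invocations < minimum
```

Outside core hours a quiet period is treated as normal, which avoids alarms overnight and on weekends at the cost of not detecting outages then.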

@patheard
Member

Proof-of-concept showing how a form submission can be triggered via a Lambda function using Playwright:
https://github.com/patheard/playwright-lambda

@patheard
Member

After releasing the anomaly detection alarms to Prod as part of the v3.11.0 release, it is unlikely we'll be able to use AWS's expected invocations as an alarm:

[Image: graph of invocations against the anomaly detection band]

  • gray band is the expected range of invocations
  • red line below the gray band would be alarm triggers

The above graph is at 4 standard deviations.
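
To illustrate one way a wide band can make a "too few invocations" alarm ineffective, here is a deliberately simplified stand-in that builds the band from a plain mean and standard deviation. This is not CloudWatch's model, and whether it explains the graph above is an assumption; the sample numbers in the usage note are made up.

```python
from statistics import mean, pstdev


def expected_band(history, deviations=4.0):
    """Return (lower, upper) as mean plus or minus N standard deviations."""
    centre = mean(history)
    spread = pstdev(history)
    return centre - deviations * spread, centre + deviations * spread


def lower_alarm_possible(history, deviations=4.0):
    """A below-the-band alarm can only fire if the lower bound is above zero,
    because an invocation count can never be negative."""
    lower, _ = expected_band(history, deviations)
    return lower > 0
```

With low, bursty traffic such as `[0, 2, 5, 1, 12, 3, 0, 7]` the lower bound at 4 standard deviations is negative, so zero invocations would still sit inside the band. With steady traffic such as `[100, 102, 98, 101, 99]` the lower bound stays above zero and a drop to zero would be caught.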

@srtalbot
Contributor Author

Health checks are available now. We probably need to implement a heartbeat form to limit false positives.

@patheard
Member

patheard commented Aug 6, 2024

Set up email submission heartbeat in prep for next week.

@patheard
Member

patheard commented Aug 9, 2024

The proof-of-concept repo now includes the EventBridge rule that would trigger the form submission on a schedule. In testing it submitted my Staging test form every 15 minutes as expected.
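
For reference, a 15 minute schedule of this kind can be described with an EventBridge rate expression. The sketch below only builds the arguments for the `put_rule` and `put_targets` calls; the rule name, target ID, and ARN are placeholders, and the proof-of-concept may well define the rule differently (for example in Terraform).

```python
def heartbeat_schedule(rule_name: str, lambda_arn: str, every_minutes: int = 15):
    """Build arguments for EventBridge `put_rule` and `put_targets`.

    Rate expressions use the singular unit for a value of 1
    ("rate(1 minute)") and the plural otherwise ("rate(15 minutes)").
    """
    unit = "minute" if every_minutes == 1 else "minutes"
    rule = {
        "Name": rule_name,
        "ScheduleExpression": f"rate({every_minutes} {unit})",
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        # "heartbeat-lambda" is a hypothetical target ID.
        "Targets": [{"Id": "heartbeat-lambda", "Arn": lambda_arn}],
    }
    return rule, targets


# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   events = boto3.client("events")
#   rule, targets = heartbeat_schedule("forms-heartbeat", "<lambda ARN>")
#   events.put_rule(**rule)
#   events.put_targets(**targets)
```

The Lambda function also needs a resource policy that lets `events.amazonaws.com` invoke it, which is not shown here.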

Next steps

To get this into Forms, the following will be needed:

  1. Add an ECR. This could just be added to Staging since the target form URL it submits is configurable.
  2. Build and push the Docker image.
  3. Add the remainder of the heartbeat resources from the proof-of-concept.

⚠ Consideration

I don't think we should put the heartbeat lambda code in a public repo associated with Forms, since it gives people an example of how to submit forms automatically. It could easily live in a private repo that builds and pushes to the ECR managed by forms-terraform.

@Abi-Nada Abi-Nada removed the App label Aug 15, 2024