
[EPIC][Error Handling] Test Run Page Error Handling Improvements #2040

Closed
13 of 16 tasks
olha23 opened this issue Feb 23, 2023 · 15 comments
Labels
epic Epic

Comments


olha23 commented Feb 23, 2023

While reviewing different user sessions, the team has identified multiple areas of opportunity in the error handling messaging shown when executing a test run.

Currently, a test run goes through three significant steps:

  1. Trigger execution
  2. Trace fetching
  3. Test spec execution

Each step has its own set of success and failure scenarios that need to be appropriately displayed to the user.
Today, Tracetest uses only two fields from the test run to validate possible errors.

  • lastErrorState, which contains the string info for the last known error.
  • state, which controls the status of the test.

This was a good starting point, but it is no longer sufficient for the clients (CLI/UI) to display enough information for the user to understand how to fix potential problems, or to give good feedback on what the server side is executing at any given time.
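
For reference, a rough sketch of the run fields clients have to work with today; only `state` and `lastErrorState` are named above, everything else here is illustrative:

```ts
// Illustrative client-side view of a test run; only `state` and `lastErrorState`
// are taken from the description above, the rest is assumed for the example.
interface TestRunErrorInfo {
  state: string; // controls the status of the test
  lastErrorState?: string; // string info for the last known error
}
```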

With that in mind, we have identified a matrix of possible scenarios based on the test run state and results, and what we should display to the user in each case.

Test Run Flow Chart

flowchart TD
    A[Run] --> B[Created]
    B --> C[Resolve Trigger Vars]
    C --> D[Execute Trigger]
    D --> ET{Is Successful Trigger}
    ET -->|Yes| E[Queue Polling]
    ET -->|NO| ES[Set State to Failed]
    ES --> Q
    E --> F[Execute Polling Job]
    F --> G[Fetch Trace from Data Store]
    G --> H{Trace Exists}
    H -->|No| I{timed out config reached}
    H -->|Yes| J{Has the span # changed}
    J -->|Yes| G
    J -->|No| K[Trace is ready]
    I -->|No| G
    I -->|Yes| L[Trace fetch failed]
    K --> O[Generating Outputs]
    O --> P[Running Test Specs]
    P --> Q[Finish]
    L --> ES

State Matrix for Test Runs

| | CREATED | TRIGGERING | CONNECTING_TO_DATA_STORE | POLLING_TRACE | GENERATING_OUTPUTS | RUNNING_TEST_SPECS | FINISHED |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Successful | Run page | Trigger response data (body, timing, headers) | Signal of a successful connection to the data store | Trace | Outputs | Test spec results | Trigger/Trace/Test |
| Failed | Failed page | Breakdown of the trigger problem (DNS connection, queue connection, auth problems) | Breakdown of issues, similar to the test connection endpoint | Breakdown of the trace fetching, with the reason for the error | Warning that the output generation failed, and the reason why | Failed test specs | Global failed state |
| In Progress | Loading state | Loading state with trigger steps | Loading state | Similar to the server output (polling iteration #, # of spans, reason for next iteration) | Loading state | Loading state | Loading state |
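
For clients that branch on these states, the matrix columns map naturally to a union type; a minimal sketch (the state names come from the matrix, the type itself is an assumption):

```ts
// Run states taken from the matrix above.
type TestRunState =
  | "CREATED"
  | "TRIGGERING"
  | "CONNECTING_TO_DATA_STORE"
  | "POLLING_TRACE"
  | "GENERATING_OUTPUTS"
  | "RUNNING_TEST_SPECS"
  | "FINISHED";

// Each state renders one of the three row variants from the matrix.
type RunOutcome = "successful" | "failed" | "inProgress";
```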

Tickets and Tasks

Follow-up release

Nice to have

  • [Error Handling] Event Log text version
  • [Error Handling] Mode bar live status

Mockups

https://www.figma.com/file/LBN4SKVPq3ykegrPKbHT2Y/0.8-0.9-Release-Tracetest?node-id=1994-32394&t=5M47CI4J8VFbgit2-0


xoscar commented Feb 23, 2023

@olha23 I think the efforts for this issue should be combined with "Improve diagnostics returned to a user when executing a test trigger" #1788

We also need to consider allowing users to:

  1. See the response information in case it exists (body, headers, status codes)
  2. Show an error from the trace and test modes
  3. Include the Backend changes as part of the process

CC: @kdhamric

xoscar self-assigned this Mar 1, 2023

olha23 commented Mar 3, 2023

@xoscar @kdhamric can we make some progress on this? We need to implement this because we get stuck with failed-test issues a lot of the time.


kdhamric commented Mar 3, 2023

Yes yes. This is our second highest priority, only behind knocking out the configuration work, which I want the team to swarm on as it is blocking other activity. If we get to a spot where @jorgeepc or @xoscar do not have an area where they can contribute to the config changes, we will want to focus on this.


olha23 commented Mar 3, 2023

@kdhamric I added additional mockups for trace and test mode for the failed cases. I need your help with the copy there. I think we may need to point to some test troubleshooting docs.
https://www.figma.com/file/LBN4SKVPq3ykegrPKbHT2Y/0.8-0.9-Release-Tracetest?node-id=1944%3A29192&t=Ud0w5WjFpCktO9Tx-1


kdhamric commented Mar 3, 2023

I left some notes. Agree that we need an 'I did not get my trace, how do I troubleshoot it' page. Plan on there being a help link in your message that says 'we did not get the trace successfully' - we will work on adding that next week.


xoscar commented Mar 3, 2023

Added some comments; if we are in the clear about the config stuff, I will start working on this Monday morning!


xoscar commented Mar 6, 2023

Hello everyone, here's my take on what should be added to the test run page to improve the user experience.

Acceptance Criteria:
AC1
As a user looking at the test run page
And I just ran the test
And the test failed in the initial trigger request (HTTP, gRPC, etc.)
I should be able to see a breakdown of the error and the steps that occurred prior to it

AC2
As a user looking at the test run page
And I just ran a test and the initial request worked as expected
And the app is trying to fetch the trace
I should be able to see a description of what the app is doing in the background, things like:

  1. Which polling retry it is on
  2. What state the polling is in (waiting, polling, failed)
  3. Recent errors or reasons why a new poll was triggered (even if the trace was already found)

AC3
As a user looking at the test run page
And I just ran a test and the initial request worked as expected
And the app failed to fetch the trace
I should be able to see a proper error description of what happened and what was done to try to fetch the trace
And I should be able to see the initial request/response details

The idea is to give users easier ways to debug what's happening within the system, whether we found a problem or something else is going on. This can also help them tweak their polling settings to get the best results.

CC: @olha23


kdhamric commented Mar 6, 2023

On AC2, it would be nice to show progress on gathering info on the trace. Maybe show the '# of spans' received so far?

If you see it getting some spans, you know two things:

  • the trace data store connection is working
  • how far along you are, if you know what 'normal' is


olha23 commented Mar 7, 2023

xoscar changed the title Implement 'help on failed test' response [EPIC] Test Run Page Error Handling Improvements Mar 13, 2023
xoscar added epic Epic and removed frontend labels Mar 13, 2023
xoscar changed the title [EPIC] Test Run Page Error Handling Improvements [EPIC][Error Handling] Test Run Page Error Handling Improvements Mar 13, 2023
jorgeepc self-assigned this Mar 24, 2023

xoscar commented Mar 24, 2023

Technical Details

The main goal of this epic is to provide a better experience for new and experienced Tracetest users, focusing on displaying more information so users can better understand what the app is doing after running a test.

Event Log System

Currently, the test run process uses a web-based system to communicate updates to the clients at checkpoints we have defined, where key parts of the run are reported.

In this case, we'll be leveraging that same idea and extending it to provide even more information, separating the checkpoints into an events entity where we can store everything that happens while executing a test run.


Events will have a generic structure that can be used to define a basic event type based on stage and description. Here's a class diagram describing the base event structure and how we can structure the more specific event types:

classDiagram
    Event <|-- DataStoreConnection
    Event <|-- Polling
    Event <|-- Output
    Event : string type
    Event : enum[trigger, trace, test] stage
    Event : string description
    Event : date created_at
    Event : string test_id
    Event : string run_id

    class DataStoreConnection{
      DataStoreTestConnection info
    }

    class Polling{
      int number_of_spans
      int iteration_number
      string reason_of_next_iteration
      boolean is_complete
    }

    class Output{
      string warning
      string error
      string output_name
    }
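
A minimal TypeScript sketch of how clients could model these events, following the class diagram above (the discriminated-union encoding and field spellings are assumptions):

```ts
// Base event structure from the class diagram.
interface BaseEvent {
  type: string;
  stage: "trigger" | "trace" | "test";
  description: string;
  createdAt: string; // ISO timestamp
  testId: string;
  runId: string;
}

// Specific event payloads matching the subclasses in the diagram.
interface DataStoreConnectionEvent extends BaseEvent {
  info: unknown; // DataStoreTestConnection result; its shape is not defined in this issue
}

interface PollingEvent extends BaseEvent {
  numberOfSpans: number;
  iterationNumber: number;
  reasonOfNextIteration: string;
  isComplete: boolean;
}

interface OutputEvent extends BaseEvent {
  warning?: string;
  error?: string;
  outputName: string;
}

type TestRunEvent = DataStoreConnectionEvent | PollingEvent | OutputEvent;
```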

The advantage of a setup like this is that we can create and register different types of events depending on what we need: if we have to add more trigger types or different polling mechanisms, we can add new event types for the new processes.

Another key point is that this will give us the full event log for any test run; if we decide to export it or display it fully as a text-based log, that can be another way of helping users understand what happened.


Example DB structure

HTTP Trigger Unreachable Host Event

{
  "type": "UNREACHABLE_HOST",
  "description": "The host (http://localhost:8081) is unreachable",
  "stage": "TRIGGER",
  "definition": "{}"
}

HTTP/gRPC Docker Host Machine Mismatch

{
  "type": "DOCKER_HOST_MISMATCH",
  "description": "We identified Tracetest is running in docker compose, to connect to the host machine use the `host.docker.internal` hostname. For more information see https://docs.docker.com/docker-for-mac/networking/#use-cases-and-workarounds",
  "stage": "TRIGGER",
  "definition": "{}"
}

Polling Iteration

{
  "type": "POLLING_ITERATION",
  "description": "Polling iteration",
  "stage": "TRACE",
  "definition": "{\"number_of_spans\": 1, \"iteration_number\": 2, \"reason_of_next_iteration\": \"\", \"is_complete\": true}"
}
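
Since `definition` is stored as a serialized JSON string, clients would parse it according to the event `type`; a rough sketch (the helper and row shape are assumptions based on the examples above):

```ts
// Assumed shape of a stored event row, per the examples above.
interface StoredEvent {
  type: string;
  stage: string;
  description: string;
  definition: string; // serialized JSON whose shape depends on `type`
}

// Hypothetical helper: parse the type-specific payload out of the definition string.
function parseDefinition(event: StoredEvent): Record<string, unknown> {
  // Events like UNREACHABLE_HOST carry an empty definition ("{}"),
  // while POLLING_ITERATION carries span/iteration details.
  return JSON.parse(event.definition || "{}");
}
```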


xoscar commented Mar 24, 2023

Frontend Rendering Factory


The clients should only render the events that are required, when they are required; the FE should only display the components based on the selected mode and test run status.
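
A rough TypeScript sketch of that rendering factory (the registry and renderer signatures are assumptions, not an existing Tracetest API):

```ts
// Hypothetical factory: maps event types to render functions and falls back
// to a generic renderer for event types without a dedicated component.
type RunEvent = { type: string; stage: string; description: string };
type Renderer = (event: RunEvent) => string;

const renderers: Record<string, Renderer> = {
  UNREACHABLE_HOST: e => `Trigger failed: ${e.description}`,
  DOCKER_HOST_MISMATCH: e => `Hint: ${e.description}`,
  POLLING_ITERATION: e => `Polling... ${e.description}`,
};

function renderEvent(event: RunEvent): string {
  const render = renderers[event.type] ?? ((e: RunEvent) => e.description);
  return render(event);
}
```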

Examples

Scenario 1

The HTTP test trigger failed to reach the host because the wrong Docker host was used

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - UNREACHABLE_HOST
  • Trigger - DOCKER_HOST_MISMATCH

Scenario 2

The HTTP trigger returned a successful response
The poller found a full trace after 3 tries

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_SUCCESS
  • Test - TEST_SPEC_EXECUTION

Scenario 3

The HTTP trigger returned a successful response
The poller found an issue when connecting to the data store

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_FAILURE

Scenario 4

The HTTP trigger returned a successful response
The poller timed out after X number of tries

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_TIMEOUT
  • Trace - POLLING_FAILURE
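
One way clients could drive the per-stage UI from event streams like the ones above is to look at the latest event for each stage; a rough sketch (the suffix-based rules are assumptions, not defined in this issue):

```ts
// Hypothetical status derivation from the latest event type seen for a stage.
type StageStatus = "loading" | "success" | "failed";

function stageStatus(latestEventType?: string): StageStatus {
  if (!latestEventType) return "loading";
  if (latestEventType.endsWith("_SUCCESS")) return "success";
  if (latestEventType.endsWith("_FAILURE") || latestEventType.endsWith("_TIMEOUT")) return "failed";
  return "loading"; // e.g. POLLING_ITERATION keeps the stage in progress
}
```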


xoscar commented Mar 24, 2023

Exploratory

Instead of having a "custom" event solution, we could generate an internal OTel trace for each test run that keeps track of every event that happens. Clients would then need a way to visualize it based on attributes similar to what we would store in an event, such as stage and type (span name).

Requirements for this to work:

  1. Not using external data stores (internal DB only)
  2. This shouldn't be complex to develop
  3. Generate spans based on event logs
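
Given those constraints, a rough sketch of turning event logs into spans with the OpenTelemetry API (the tracer name and attribute keys are assumptions):

```ts
import { trace } from "@opentelemetry/api";

// Hypothetical: emit one span per test run event, carrying stage/type as attributes
// so clients can filter and visualize them much like the custom event log.
const tracer = trace.getTracer("tracetest-run-events");

function recordEventAsSpan(event: {
  type: string;
  stage: string;
  description: string;
  runId: string;
}): void {
  const span = tracer.startSpan(event.type, {
    attributes: {
      "tracetest.stage": event.stage,
      "tracetest.run_id": event.runId,
      "tracetest.description": event.description,
    },
  });
  span.end();
}
```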


xoscar commented Mar 24, 2023

TODO:

  • Add a suffix to event types to identify the event level (info, error, warning, etc.)
  • Define the initial event types per stage, e.g. PARSING_ERROR should be part of the trigger and test stages (granular) @xoscar @jorgeepc
  • Work on the initial OpenAPI specs (read-only endpoint for events and a new WebSocket subscription) @mathnogueira
  • Nice to have: a text version of the events


xoscar commented Mar 24, 2023

Future troubleshooting features:

  1. Be able to provide suggestions/recommendations to users based on events (errors).
  2. Allow clients to take actions depending on event types


xoscar commented Apr 10, 2023

Closing in favor of #2331

xoscar closed this as completed Apr 10, 2023