
[EPIC][Error Handling] Test Run Page Error Handling Improvements #2040

Closed
13 of 16 tasks
olha23 opened this issue Feb 23, 2023 · 15 comments
Labels
epic Epic

Comments


olha23 commented Feb 23, 2023

While reviewing different user sessions, the team has identified multiple areas of opportunity in the error handling messaging shown when executing a test run.

Currently, a test run goes through three significant steps:

  1. Trigger execution
  2. Trace fetching
  3. Test spec execution

Each step has its own set of success and failure scenarios that need to be appropriately displayed to the user.
Today, Tracetest uses only two fields from the test run to validate possible errors.

  • lastErrorState, which contains the string info for the last known error.
  • state, which controls the status of the test.

This was a good starting point, but it is no longer sufficient for the clients (CLI/UI) to display enough information for the user to understand how to fix potential problems, or to give good feedback on what the server side is executing at any given time.
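
For reference, a rough sketch of the run fields clients have to work with today; only `state` and `lastErrorState` are named above, everything else here is illustrative:

```ts
// Illustrative client-side view of a test run; only `state` and `lastErrorState`
// are taken from the description above, the rest is assumed for the example.
interface TestRunErrorInfo {
  state: string; // controls the status of the test
  lastErrorState?: string; // string info for the last known error
}
```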

With that in mind, we have identified a matrix of possible scenarios based on the test run state and results, and what we should display to the user in each case.

Test Run Flow Chart

flowchart TD
    A[Run] --> B[Created]
    B --> C[Resolve Trigger Vars]
    C --> D[Execute Trigger]
    D --> ET{Is Successful Trigger}
    ET -->|Yes| E[Queue Polling]
    ET -->|NO| ES[Set State to Failed]
    ES --> Q
    E --> F[Execute Polling Job]
    F --> G[Fetch Trace from Data Store]
    G --> H{Trace Exists}
    H -->|No| I{timed out config reached}
    H -->|Yes| J{Has the span # changed}
    J -->|Yes| G
    J -->|No| K[Trace is ready]
    I -->|No| G
    I -->|Yes| L[Trace fetch failed]
    K --> O[Generating Outputs]
    O --> P[Running Test Specs]
    P --> Q[Finish]
    L --> ES

State Matrix for Test Runs

| | CREATED | TRIGGERING | CONNECTING_TO_DATA_STORE | POLLING_TRACE | GENERATING_OUTPUTS | RUNNING_TEST_SPECS | FINISHED |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Successful | Run page | Trigger response data (body, timing, headers) | Signal of a successful connection to the data store | Trace | Outputs | Test spec results | Trigger/Trace/Test |
| Failed | Failed page | Breakdown of the trigger problem (DNS connection, queue connection, auth problems) | Breakdown of issues, similar to the test connection endpoint | Breakdown of the trace fetching, with the reason for the error | Warning that the output generation failed, and the reason why | Failed test specs | Global failed state |
| In Progress | Loading state | Loading state with trigger steps | Loading state | Similar to the server output (polling iteration #, # of spans, reason for next iteration) | Loading state | Loading state | Loading state |
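
For clients that branch on these states, the matrix columns map naturally to a union type; a minimal sketch (the state names come from the matrix, the type itself is an assumption):

```ts
// Run states taken from the matrix above.
type TestRunState =
  | "CREATED"
  | "TRIGGERING"
  | "CONNECTING_TO_DATA_STORE"
  | "POLLING_TRACE"
  | "GENERATING_OUTPUTS"
  | "RUNNING_TEST_SPECS"
  | "FINISHED";

// Each state renders one of the three row variants from the matrix.
type RunOutcome = "successful" | "failed" | "inProgress";
```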

Tickets and Tasks

Follow-up release

Nice to have

  • [Error Handling] Event Log text version
  • [Error Handling] Mode bar live status

Mockups

https://www.figma.com/file/LBN4SKVPq3ykegrPKbHT2Y/0.8-0.9-Release-Tracetest?node-id=1994-32394&t=5M47CI4J8VFbgit2-0


xoscar commented Feb 23, 2023

@olha23 I think the efforts for this issue should be combined with "Improve diagnostics returned to a user when executing a test trigger" #1788

We also need to consider allowing users to:

  1. See the response information in case it exists (body, headers, status codes)
  2. Show an error from the trace and test modes
  3. Include the Backend changes as part of the process

CC: @kdhamric

xoscar self-assigned this Mar 1, 2023

olha23 commented Mar 3, 2023

@xoscar @kdhamric can we make some progress on this? We need to implement this because we get stuck with failed-test issues a lot of the time.


kdhamric commented Mar 3, 2023

Yes yes. This is our second highest priority, only behind knocking out the configuration work, which I want the team to swarm on as it is blocking other activity. If we get to a spot where @jorgeepc or @xoscar do not have an area where they can contribute to the config changes, we will want to focus on this.


olha23 commented Mar 3, 2023

@kdhamric I added additional mockups for trace and test mode for the failed cases. I need your help with the copy there. I think we may need to point to some test troubleshooting docs.
https://www.figma.com/file/LBN4SKVPq3ykegrPKbHT2Y/0.8-0.9-Release-Tracetest?node-id=1944%3A29192&t=Ud0w5WjFpCktO9Tx-1


kdhamric commented Mar 3, 2023

I left some notes. Agree that we need an 'I did not get my trace, how do I troubleshoot it' page. Plan on there being a help link in your message that says 'we did not get the trace successfully' - we will work on adding that next week.


xoscar commented Mar 3, 2023

Added some comments; if we are in the clear about the config stuff, I will start working on this Monday morning!


xoscar commented Mar 6, 2023

Hello everyone, here's my take on what should be added to the test run page to improve the user experience.

Acceptance Criteria:
AC1
As a user looking at the test run page
And I just ran the test
And the test failed in the initial trigger request (HTTP, gRPC, etc.)
I should be able to see a breakdown of the error and the steps that occurred prior to it

AC2
As a user looking at the test run page
And I just ran a test and the initial request worked as expected
And the app is trying to fetch the trace
I should be able to see a description of what the app is doing in the background, things like:

  1. Which polling retry it is on
  2. What state the polling is in (waiting, polling, failed)
  3. Recent errors or reasons why a new poll was triggered (even if the trace was already found)

AC3
As a user looking at the test run page
And I just ran a test and the initial request worked as expected
And the app failed to fetch the trace
I should be able to see a proper error description of what happened and what was done to try to fetch the trace
And I should be able to see the initial request/response details

The idea is to give users easier ways to debug what's happening within the system, whether we found a problem or something else is going on. This can also help them tweak their polling settings to get the best results.

CC: @olha23


kdhamric commented Mar 6, 2023

On AC2, it would be nice to show progress on gathering info on the trace. Maybe show the '# of spans' received so far?

If you see it getting some spans, you know two things:

  • the trace data store connection is working
  • how far along you are, if you know what 'normal' is


olha23 commented Mar 7, 2023

xoscar changed the title Implement 'help on failed test' response [EPIC] Test Run Page Error Handling Improvements Mar 13, 2023
xoscar added epic Epic and removed frontend labels Mar 13, 2023
xoscar changed the title [EPIC] Test Run Page Error Handling Improvements [EPIC][Error Handling] Test Run Page Error Handling Improvements Mar 13, 2023
jorgeepc self-assigned this Mar 24, 2023

xoscar commented Mar 24, 2023

Technical Details

The main goal of this epic is to provide a better experience for new and experienced Tracetest users, focusing on displaying more information so users can better understand what the app is doing after running a test.

Event Log System

Currently, the test run process uses a web-based system to communicate updates to the clients at checkpoints we have defined, where key parts of the run are reported.

In this case, we'll be leveraging that same idea and extending it to provide even more information, separating the checkpoints into an events entity where we can store everything that happens while executing a test run.


Events will have a generic structure that can be used to define a basic event type based on stage and description. Here's a class diagram describing the base event structure and how we can structure the more specific event types:

classDiagram
    Event <|-- DataStoreConnection
    Event <|-- Polling
    Event <|-- Output
    Event : string type
    Event : enum[trigger, trace, test] stage
    Event : string description
    Event : date created_at
    Event : string test_id
    Event : string run_id

    class DataStoreConnection{
      DataStoreTestConnection info
    }

    class Polling{
      int number_of_spans
      int iteration_number
      string reason_of_next_iteration
      boolean is_complete
    }

    class Output{
      string warning
      string error
      string output_name
    }
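
A minimal TypeScript sketch of how clients could model these events, following the class diagram above (the discriminated-union encoding and field spellings are assumptions):

```ts
// Base event structure from the class diagram.
interface BaseEvent {
  type: string;
  stage: "trigger" | "trace" | "test";
  description: string;
  createdAt: string; // ISO timestamp
  testId: string;
  runId: string;
}

// Specific event payloads matching the subclasses in the diagram.
interface DataStoreConnectionEvent extends BaseEvent {
  info: unknown; // DataStoreTestConnection result; its shape is not defined in this issue
}

interface PollingEvent extends BaseEvent {
  numberOfSpans: number;
  iterationNumber: number;
  reasonOfNextIteration: string;
  isComplete: boolean;
}

interface OutputEvent extends BaseEvent {
  warning?: string;
  error?: string;
  outputName: string;
}

type TestRunEvent = DataStoreConnectionEvent | PollingEvent | OutputEvent;
```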

The advantage of a setup like this is that we can create and register different types of events depending on what we need: if we have to add more trigger types or different polling mechanisms, we can add new event types for the new processes.

Another key point is that this will give us the full event log for any test run; if we decide to export it or display it fully as a text-based log, that can be another way of helping users understand what happened.


Example DB structure

HTTP Trigger Unreachable Host Event

{
  "type": "UNREACHABLE_HOST",
  "description": "The host (http://localhost:8081) is unreachable",
  "stage": "TRIGGER",
  "definition": "{}"
}

HTTP/gRPC Docker Host Machine Mismatch

{
  "type": "DOCKER_HOST_MISMATCH",
  "description": "We identified Tracetest is running in docker compose, to connect to the host machine use the `host.docker.internal` hostname. For more information see https://docs.docker.com/docker-for-mac/networking/#use-cases-and-workarounds",
  "stage": "TRIGGER",
  "definition": "{}"
}

Polling Iteration

{
  "type": "POLLING_ITERATION",
  "description": "Polling iteration",
  "stage": "TRACE",
  "definition": "{\"number_of_spans\": 1, \"iteration_number\": 2, \"reason_of_next_iteration\": \"\", \"is_complete\": true}"
}
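
Since `definition` is stored as a serialized JSON string, clients would parse it according to the event `type`; a rough sketch (the helper and row shape are assumptions based on the examples above):

```ts
// Assumed shape of a stored event row, per the examples above.
interface StoredEvent {
  type: string;
  stage: string;
  description: string;
  definition: string; // serialized JSON whose shape depends on `type`
}

// Hypothetical helper: parse the type-specific payload out of the definition string.
function parseDefinition(event: StoredEvent): Record<string, unknown> {
  // Events like UNREACHABLE_HOST carry an empty definition ("{}"),
  // while POLLING_ITERATION carries span/iteration details.
  return JSON.parse(event.definition || "{}");
}
```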


xoscar commented Mar 24, 2023

Frontend Rendering Factory


The clients should only render the events that are required, when they are required; the FE should only display the components based on the selected mode and test run status.
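
A rough TypeScript sketch of that rendering factory (the registry and renderer signatures are assumptions, not an existing Tracetest API):

```ts
// Hypothetical factory: maps event types to render functions and falls back
// to a generic renderer for event types without a dedicated component.
type RunEvent = { type: string; stage: string; description: string };
type Renderer = (event: RunEvent) => string;

const renderers: Record<string, Renderer> = {
  UNREACHABLE_HOST: e => `Trigger failed: ${e.description}`,
  DOCKER_HOST_MISMATCH: e => `Hint: ${e.description}`,
  POLLING_ITERATION: e => `Polling... ${e.description}`,
};

function renderEvent(event: RunEvent): string {
  const render = renderers[event.type] ?? ((e: RunEvent) => e.description);
  return render(event);
}
```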

Examples

Scenario 1

The HTTP test trigger failed to reach the host because the wrong Docker host was used

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - UNREACHABLE_HOST
  • Trigger - DOCKER_HOST_MISMATCH

Scenario 2

The HTTP trigger returned a successful response
The poller found a full trace after 3 tries

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_SUCCESS
  • Test - TEST_SPEC_EXECUTION

Scenario 3

The HTTP trigger returned a successful response
The poller found an issue when connecting to the data store

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_FAILURE

Scenario 4

The HTTP trigger returned a successful response
The poller timed out after X number of tries

Expected events

  • Trigger - CREATED
  • Trigger - RESOLVE_VARS_SUCCESS
  • Trigger - TRIGGER_SUCCESS
  • Trace - DATA_STORE_TEST_CONNECTION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_ITERATION
  • Trace - POLLING_TIMEOUT
  • Trace - POLLING_FAILURE
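
One way clients could drive the per-stage UI from event streams like the ones above is to look at the latest event for each stage; a rough sketch (the suffix-based rules are assumptions, not defined in this issue):

```ts
// Hypothetical status derivation from the latest event type seen for a stage.
type StageStatus = "loading" | "success" | "failed";

function stageStatus(latestEventType?: string): StageStatus {
  if (!latestEventType) return "loading";
  if (latestEventType.endsWith("_SUCCESS")) return "success";
  if (latestEventType.endsWith("_FAILURE") || latestEventType.endsWith("_TIMEOUT")) return "failed";
  return "loading"; // e.g. POLLING_ITERATION keeps the stage in progress
}
```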


xoscar commented Mar 24, 2023

Exploratory

Instead of having a "custom" event solution, we could generate an internal OTel trace for each test run that keeps track of every event that happens. Clients would then need a way to visualize it based on attributes similar to what we would store in an event, such as stage and type (span name).

Requirements for this to work:

  1. Not using external data stores (internal DB only)
  2. This shouldn't be complex to develop
  3. Generate spans based on event logs
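
Given those constraints, a rough sketch of turning event logs into spans with the OpenTelemetry API (the tracer name and attribute keys are assumptions):

```ts
import { trace } from "@opentelemetry/api";

// Hypothetical: emit one span per test run event, carrying stage/type as attributes
// so clients can filter and visualize them much like the custom event log.
const tracer = trace.getTracer("tracetest-run-events");

function recordEventAsSpan(event: {
  type: string;
  stage: string;
  description: string;
  runId: string;
}): void {
  const span = tracer.startSpan(event.type, {
    attributes: {
      "tracetest.stage": event.stage,
      "tracetest.run_id": event.runId,
      "tracetest.description": event.description,
    },
  });
  span.end();
}
```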


xoscar commented Mar 24, 2023

TODO:

  • Add a suffix to event types to identify the event level (info, error, warning, etc.)
  • Define the initial event types per stage, e.g. PARSING_ERROR should be part of the trigger and test stages (granular) @xoscar @jorgeepc
  • Work on the initial OpenAPI specs (read-only endpoint for events and a new WebSocket subscription) @mathnogueira
  • Nice to have: a text version of the events


xoscar commented Mar 24, 2023

Future troubleshooting features:

  1. Be able to provide suggestions/recommendations to users based on events (errors).
  2. Allow clients to take actions depending on event types


xoscar commented Apr 10, 2023

Closing in favor of #2331

xoscar closed this as completed Apr 10, 2023