
Mitigate the impact of outliers on the fee rate statistics #394

Closed
Keith-CY opened this issue Aug 2, 2023 · 24 comments
Labels: documentation (Improvements or additions to documentation)

Keith-CY (Member) commented Aug 2, 2023

[screenshot: fee rate samples chart]

Some high-fee-rate samples appear around 30~50 s, which makes the average fee rate of the slow stage higher than that of the fast stage.

A filter could be applied to the data as follows (see the sketch below):

  1. if a sample in a lower speed stage has a higher fee rate than a sample in a higher speed stage, it should be removed
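A minimal sketch of this rule, assuming each sample carries a fee rate (shannons/kB) and a speed-stage index derived from its confirmation time; the type and function names are illustrative, not the explorer's actual code:

```ts
interface FeeRateSample {
  feeRate: number          // shannons/kB
  confirmationTime: number // seconds
  stage: number            // 0 = fastest stage, larger = slower
}

function dropCrossStageOutliers(samples: FeeRateSample[]): FeeRateSample[] {
  const maxStage = Math.max(...samples.map(s => s.stage))
  // For each stage, record the lowest fee rate seen in any faster stage.
  const minFeeOfFasterStages: number[] = []
  let runningMin = Infinity
  for (let stage = 0; stage <= maxStage; stage++) {
    minFeeOfFasterStages[stage] = runningMin
    const feesAtStage = samples.filter(s => s.stage === stage).map(s => s.feeRate)
    if (feesAtStage.length > 0) runningMin = Math.min(runningMin, ...feesAtStage)
  }
  // A slower sample that pays more than some faster sample is wiped out.
  return samples.filter(s => s.feeRate <= minFeeOfFasterStages[s.stage])
}
```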

Any thoughts, @Danie0918 @Sven-TBD?

Keith-CY added the documentation label on Aug 2, 2023
Danie0918 (Contributor) commented:

How is this anomalous data generated? If there is no reference value, data denoising can be used to identify the anomalous data and clean it up.

Keith-CY (Member, Author) commented Aug 2, 2023

> How is this anomalous data generated? If there is no reference value, data denoising can be used to identify the anomalous data and clean it up.

They are collected from real-world data, and I think they are reasonable: there aren't many transactions in the pool, so ordering is not obviously impacted by the fee rate.

Danie0918 (Contributor) commented:

Perhaps we can avoid this by sorting by fee and calculating the average elapsed time. For example, take the top 50% of fees and calculate the fast time, take the bottom 50% and calculate the slow time, and take the average of all fees and calculate the average time.
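A rough sketch of this split, assuming each sample pairs a fee rate with an elapsed confirmation time in seconds; the names are made up for illustration:

```ts
interface Sample { feeRate: number; elapsed: number }

const average = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length

function estimateTimesByFeeHalves(samples: Sample[]) {
  const sorted = [...samples].sort((a, b) => b.feeRate - a.feeRate) // highest fee first
  const half = Math.ceil(sorted.length / 2)
  return {
    fastTime: average(sorted.slice(0, half).map(s => s.elapsed)), // top 50% of fees
    slowTime: average(sorted.slice(half).map(s => s.elapsed)),    // bottom 50% of fees
    averageTime: average(sorted.map(s => s.elapsed)),             // all fees
  }
}
```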

Keith-CY (Member, Author) commented Aug 3, 2023

> Perhaps we can avoid this by sorting by fee and calculating the average elapsed time. For example, take the top 50% of fees and calculate the fast time, take the bottom 50% and calculate the slow time, and take the average of all fees and calculate the average time.

But that cannot pick out the outliers. In this case,

[screenshot: fee rate samples chart]

the highest fee rates appear around 30~50 s, so their average time is about 40 s, while the lowest fee rates appear around 10~30 s, so their average time is about 20 s.

The highest fee rates are still mapped to the longer elapsed time.

Danie0918 (Contributor) commented Aug 3, 2023

> Perhaps we can avoid this by sorting by fee and calculating the average elapsed time. For example, take the top 50% of fees and calculate the fast time, take the bottom 50% and calculate the slow time, and take the average of all fees and calculate the average time.
>
> But that cannot pick out the outliers. In this case, the highest fee rates appear around 30~50 s, so their average time is about 40 s, while the lowest fee rates appear around 10~30 s, so their average time is about 20 s.
>
> The highest fee rates are still mapped to the longer elapsed time.

Typically, a higher fee means a shorter confirmation time, but data from different periods cannot be referenced together because of the peaks and valleys in trading activity. Perhaps we could add a time limit to the nearly 10,000 sampled transactions, such as within the last hour, so that the data is closer to the current situation.

Danie0918 (Contributor) commented:

The data is real but inaccurate due to the time span; simply filtering out the data with large deviations would not correctly reflect the actual situation. So we can consider adding a limit on the reference data: only take transaction data within the last 1,000 blocks, and, to avoid gaps caused by having no data, set a default value of 2,000 shannons/kB within 10 seconds.
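A sketch of the proposed limit, assuming samples record the block in which the transaction was committed and the current tip block number is known; only the 1,000-block window and the 2,000 shannons/kB / 10 s default come from the comment above, the rest is illustrative:

```ts
interface TxSample { feeRate: number; confirmationTime: number; committedBlock: number }

// Fallback so the chart never shows a gap when there is no recent data.
const DEFAULT_SAMPLE: TxSample = { feeRate: 2000, confirmationTime: 10, committedBlock: 0 }

function recentSamples(samples: TxSample[], tipBlockNumber: number): TxSample[] {
  const windowed = samples.filter(s => tipBlockNumber - s.committedBlock < 1000)
  return windowed.length > 0 ? windowed : [DEFAULT_SAMPLE]
}
```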

By the way, this restriction applies only to the fee-rate-tracker reference chart; the other data charts are not adjusted, because the other charts are statistics while this one is a summary chart used for reference.

[screenshot: fee rate tracker chart]

Keith-CY (Member, Author) commented:

> The data is real but inaccurate due to the time span; simply filtering out the data with large deviations would not correctly reflect the actual situation. So we can consider adding a limit on the reference data: only take transaction data within the last 1,000 blocks, and, to avoid gaps caused by having no data, set a default value of 2,000 shannons/kB within 10 seconds.
>
> By the way, this restriction applies only to the fee-rate-tracker reference chart; the other data charts are not adjusted, because the other charts are statistics while this one is a summary chart used for reference.

LGTM, but IMO the count of samples should be related to the count of transactions instead of blocks. What about limiting the count of samples to 10 * TPS, so it becomes dynamic based on on-chain activity?

Danie0918 (Contributor) commented:

> LGTM, but IMO the count of samples should be related to the count of transactions instead of blocks. What about limiting the count of samples to 10 * TPS, so it becomes dynamic based on on-chain activity?

That is also workable; the block limit is mainly meant to bound the timeframe and prevent long time spans when transaction activity is low.

Keith-CY (Member, Author) commented:

> LGTM, but IMO the count of samples should be related to the count of transactions instead of blocks. What about limiting the count of samples to 10 * TPS, so it becomes dynamic based on on-chain activity?
>
> That is also workable; the block limit is mainly meant to bound the timeframe and prevent long time spans when transaction activity is low.

When will we handle this? It's a bit bothersome to users.

Keith-CY (Member, Author) commented:

BTW, the solution mentioned above only shrinks the time frame of the exceptional status. The algorithm still needs to be optimized to remove outliers from the samples.

Danie0918 (Contributor) commented:

> BTW, the solution mentioned above only shrinks the time frame of the exceptional status. The algorithm still needs to be optimized to remove outliers from the samples.

According to the latest data, the anomalies are obviously large (more than 10 times the normal value), so we can add a suitable cutoff to filter them, for example filtering out data above 100,000 shannons/kB.

So we can start with the following program (a sketch follows the list):

1. Filter out data above 100,000 shannons/kB.
2. Limit the count of samples to 10 * TPS.
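A sketch of this transitional program, assuming a hardcoded cutoff of 100,000 shannons/kB and a sample cap of 10 * TPS applied to the most recent samples; the field and helper names are illustrative:

```ts
interface Sample { feeRate: number; confirmationTime: number }

const MAX_FEE_RATE = 100_000 // shannons/kB

function selectSamples(samples: Sample[], tps: number): Sample[] {
  const maxCount = Math.max(1, Math.round(10 * tps))
  return samples
    .filter(s => s.feeRate <= MAX_FEE_RATE) // 1. drop abnormally large fee rates
    .slice(-maxCount)                       // 2. keep at most 10 * TPS of the latest samples
}
```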

Keith-CY (Member, Author) commented Sep 14, 2023

> BTW, the solution mentioned above only shrinks the time frame of the exceptional status. The algorithm still needs to be optimized to remove outliers from the samples.
>
> According to the latest data, the anomalies are obviously large (more than 10 times the normal value), so we can add a suitable cutoff to filter them, for example filtering out data above 100,000 shannons/kB.
>
> So we can start with the following program:
>
> 1. Filter out data above 100,000 shannons/kB.
> 2. Limit the count of samples to 10 * TPS.

Rule 1,

> 1. Filter out data above 100,000 shannons/kB.

uses a hardcoded value that will be imprecise in most cases. For example, when there is a jam on chain, the fee rates may all be greater than 100,000 shannons/kB, and all samples would be filtered out.


I would suggest filtering outliers by the following rule instead (sketched below):

  1. Order samples by confirmation time, from short to long.
  2. Check every two adjacent samples; if the longer-confirmation one uses a higher fee rate, it should be removed from the samples.

By doing so, the filtered fee rates increase monotonically with confirmation speed, i.e. they never increase as the confirmation time grows.
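A minimal sketch of this rule, assuming samples are { feeRate, confirmationTime } pairs; instead of a strictly pairwise check, a single pass compares each sample with the last kept one, which produces the monotone result described above:

```ts
interface Sample { feeRate: number; confirmationTime: number }

function enforceMonotonicFeeRates(samples: Sample[]): Sample[] {
  // 1. Order samples by confirmation time, from short to long.
  const sorted = [...samples].sort((a, b) => a.confirmationTime - b.confirmationTime)
  const kept: Sample[] = []
  for (const sample of sorted) {
    const last = kept[kept.length - 1]
    // 2. A later-confirmed sample paying a higher fee rate than the previously
    //    kept one is treated as an outlier and dropped.
    if (last !== undefined && sample.feeRate > last.feeRate) continue
    kept.push(sample)
  }
  return kept
}
```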

Keith-CY (Member, Author) commented Sep 14, 2023

For rule 2,

> 2. Limit the count of samples to 10 * TPS.

in case the TPS is very low, which means the on-chain bandwidth is enough for most transactions, we expand the samples with dummy 1,000 shannons/kB entries to, say, 1,000 samples in total.

By doing so, the fee rate will stay close to 1,000 shannons/kB when there aren't many transactions.


I would suggest using 100 as the threshold based on current activity (see the sketch below).
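A sketch of the padding idea, assuming a minimum sample count of 100 (the threshold suggested above) and dummy entries at 1,000 shannons/kB; the dummy confirmation time is an illustrative placeholder:

```ts
interface Sample { feeRate: number; confirmationTime: number }

const MIN_SAMPLE_COUNT = 100
const DUMMY_SAMPLE: Sample = { feeRate: 1000, confirmationTime: 10 }

function padWithDummySamples(samples: Sample[]): Sample[] {
  const missing = Math.max(0, MIN_SAMPLE_COUNT - samples.length)
  // When on-chain activity is low, the dummy samples pull the recommendation
  // toward the minimal fee rate.
  return samples.concat(Array.from({ length: missing }, () => ({ ...DUMMY_SAMPLE })))
}
```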

Keith-CY (Member, Author) commented:

A temporary solution (nervosnetwork/ckb-explorer-frontend@672233c) was submitted as a hotfix to avoid exceptional samples in the production environment.

Danie0918 (Contributor) commented:

> 1. Filter out data above 100,000 shannons/kB.

This is a transitional program, and the values can be adjusted as appropriate.

> I would suggest filtering outliers by the following rule instead:
>
> 1. Order samples by confirmation time, from short to long.
> 2. Check every two adjacent samples; if the longer-confirmation one uses a higher fee rate, it should be removed from the samples.
>
> By doing so, the filtered fee rates increase monotonically with confirmation speed.

Removing long, high-fee samples based on confirmation-time sorting may remove normal data if the busyness of transactions on the chain changes while sampling over a uniform interval.


Returning to the question at hand, I think removing data noise is an appropriate solution (a smoothing sketch follows the list):

  1. Mean-value denoising: eliminate noise by calculating the average of the data over time and comparing it to each data point.
  2. Convolution denoising: eliminate noise by smoothing the data with a convolution kernel.
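A minimal sketch of option 2, smoothing the fee rate series with a uniform moving-average kernel; the kernel size is an arbitrary example:

```ts
function smoothFeeRates(feeRates: number[], kernelSize = 5): number[] {
  const half = Math.floor(kernelSize / 2)
  return feeRates.map((_, i) => {
    // Average the values inside the window centered on index i.
    const window = feeRates.slice(Math.max(0, i - half), Math.min(feeRates.length, i + half + 1))
    return window.reduce((sum, v) => sum + v, 0) / window.length
  })
}
```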

> I would suggest using 100 as the threshold based on current activity.

Regarding data sampling, 100 is set as the threshold, but a time limit, such as within 1 hour, also needs to be added; too large a time span of data would not reflect the current situation in a timely manner.

Keith-CY (Member, Author) commented Sep 14, 2023

> Removing long, high-fee samples based on confirmation-time sorting may remove normal data if the busyness of transactions on the chain changes while sampling over a uniform interval.

This filter adopts the logic that miners use: in general, a tx with a higher fee rate is mined first, which means a tx confirmed later should not, theoretically, come with a higher fee rate.


> Mean-value denoising: eliminate noise by calculating the average of the data over time and comparing it to each data point.

Removing data above or beneath specific values won't remove the outliers, because the outliers may all sit inside the valid range.

Say the original samples are as follows:

[chart: original fee rate samples]

By removing values above or beneath specific thresholds, it becomes:

[chart: samples after thresholding]

It still fluctuates, and the fee rate at longer confirmation times can still be high.

> Convolution denoising: eliminate noise by smoothing the data with a convolution kernel.

Same as above: if the algorithm only smooths the original curve, the trend won't be fixed.


> Regarding data sampling, 100 is set as the threshold, but a time limit, such as within 1 hour, also needs to be added; too large a time span of data would not reflect the current situation in a timely manner.

The suggestion at #394 (comment) covers 2 aspects:

  1. a minimal count of samples;
  2. inserting dummy samples when real-world samples are not enough.

If the TPS is very low, say 2 transactions/minute, which is similar to having no transactions within 1 hour, many dummy samples at 1,000 shannons/kB will be inserted to keep the trend close to the minimal fee rate.

Keith-CY (Member, Author) commented:

An interquartile-range (IQR) filter was added by nervosnetwork/ckb-explorer-frontend#1411.
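For reference, a generic interquartile-range filter looks roughly like the sketch below; this is not necessarily the exact implementation in ckb-explorer-frontend#1411:

```ts
// Drop values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
function filterByIQR(values: number[]): number[] {
  if (values.length === 0) return []
  const sorted = [...values].sort((a, b) => a - b)
  const quantile = (q: number): number => {
    const pos = (sorted.length - 1) * q
    const base = Math.floor(pos)
    const rest = pos - base
    const next = sorted[base + 1] ?? sorted[base]
    return sorted[base] + rest * (next - sorted[base])
  }
  const q1 = quantile(0.25)
  const q3 = quantile(0.75)
  const iqr = q3 - q1
  return values.filter(v => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr)
}
```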

Keith-CY (Member, Author) commented:

The lowest fee rate should be recommended when the on-chain bandwidth is not fully occupied, so I would suggest adding a new strategy as follows (sketched after the list):

  1. set a threshold for the acceptable duration;
  2. get the average duration of several low-fee-rate samples;
  3. if the average duration <= threshold, set the low fee rate as the recommended fee rate.
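A sketch of this strategy, assuming the "several low fee rates" are the N cheapest recent samples; N, the names, and the choice of returning the group's highest fee rate are illustrative assumptions:

```ts
interface Sample { feeRate: number; confirmationTime: number }

function recommendLowFeeRate(
  samples: Sample[],
  acceptableDuration: number, // 1. the threshold, in seconds
  lowSampleCount = 10,
): number | undefined {
  // 2. Take the cheapest samples and average their confirmation durations.
  const cheapest = [...samples].sort((a, b) => a.feeRate - b.feeRate).slice(0, lowSampleCount)
  if (cheapest.length === 0) return undefined
  const avgDuration = cheapest.reduce((sum, s) => sum + s.confirmationTime, 0) / cheapest.length
  // 3. If the cheapest transactions still confirm quickly enough, recommend their fee rate.
  return avgDuration <= acceptableDuration ? cheapest[cheapest.length - 1].feeRate : undefined
}
```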

Keith-CY (Member, Author) commented:

> The lowest fee rate should be recommended when the on-chain bandwidth is not fully occupied, so I would suggest adding a new strategy as follows:
>
> 1. set a threshold for the acceptable duration;
> 2. get the average duration of several low-fee-rate samples;
> 3. if the average duration <= threshold, set the low fee rate as the recommended fee rate.

The threshold will be set to 60 s.

Keith-CY (Member, Author) commented:

> The lowest fee rate should be recommended when the on-chain bandwidth is not fully occupied, so I would suggest adding a new strategy as follows:
>
> 1. set a threshold for the acceptable duration;
> 2. get the average duration of several low-fee-rate samples;
> 3. if the average duration <= threshold, set the low fee rate as the recommended fee rate.
>
> The threshold will be set to 60 s.

This will be updated by nervosnetwork/ckb-explorer-frontend@a92790d.

The strategy is slightly tweaked: the threshold is updated dynamically from the average block time and is set to 2 * average block time.
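A one-line sketch of the tweaked threshold, assuming the average block time is available in seconds (the function name is illustrative):

```ts
// Acceptable confirmation duration: twice the average block time instead of a fixed 60 s.
const acceptableDuration = (avgBlockTimeSeconds: number): number => 2 * avgBlockTimeSeconds
```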

Keith-CY (Member, Author) commented:

> The lowest fee rate should be recommended when the on-chain bandwidth is not fully occupied, so I would suggest adding a new strategy as follows:
>
> 1. set a threshold for the acceptable duration;
> 2. get the average duration of several low-fee-rate samples;
> 3. if the average duration <= threshold, set the low fee rate as the recommended fee rate.
>
> The threshold will be set to 60 s.

That means the low fee rates are considered ideal if they get transactions committed within 2 blocks.

Keith-CY (Member, Author) commented:

This is already on mainnet and testnet. It's hard to test unless we send numerous transactions on testnet.

FrederLu commented:
> This is already on mainnet and testnet. It's hard to test unless we send numerous transactions on testnet.

A large amount of data can be generated on testnet in a short time, with 1,501 pieces of data per block, but no change in the fee rate has been observed yet.

Keith-CY (Member, Author) commented May 20, 2024

> This is already on mainnet and testnet. It's hard to test unless we send numerous transactions on testnet.
>
> A large amount of data can be generated on testnet in a short time, with 1,501 pieces of data per block, but no change in the fee rate has been observed yet.

This is impacted by issue #665.

There are always hundreds of transactions with a 1,000 shannons/kB fee rate in the history, even though all recent transactions come with a high fee rate.
