mysampler slow for busy channels #19

Closed
apcarp opened this issue Sep 27, 2023 · 2 comments


apcarp commented Sep 27, 2023

The mysampler algorithm is very slow on channels like ILM0I07Lc. Running mysampler on it with 200 one-hour samples took ~20 seconds to complete, and the original query using 3000 one-hour samples took ~2 minutes. This appears to be because the sampling algorithm's run time is largely determined by how long it takes to stream the channel history: this channel has 75 million updates over that four-month period. The web-based myquery/interval and the command line myget both also take ~2 minutes to stream this data range.

The command line mySampler application handles the longer query within a few seconds. After speaking with Chris, I learned this is because it performs a SQL query per channel per sample (something like SELECT ... WHERE time < sample_time ORDER BY time DESC LIMIT 1). This scales poorly with a large number of channels and a large number of samples (e.g., 100 PVs * 100 samples = 10,000 queries), but it works very well in this scenario since it skips the cost of streaming all the data.
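For reference, a minimal sketch of one such per-sample lookup, reusing the table number and a timestamp from the query later in this thread; the column names are assumptions, not the actual mySampler implementation:

-- Hypothetical per-channel, per-sample lookup: fetch the most recent
-- event strictly before the given sample time.
SELECT time, value
    FROM table_423655
    WHERE time < 440506126644019200  -- sample_time in the archiver's native units
    ORDER BY time DESC
    LIMIT 1;

A query like this runs once per sample time per channel, so the cost grows with channels * samples rather than with the number of archived events, which is why it wins on busy channels.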

I see three possible solutions:

  1. Do what mySampler CLI does (multiple queries). This is what we did in myquery prior to the new mySamplerStream.
  2. Take a hybrid approach. Inspect one or more small sections of the channel history to be sampled. If it's a busy channel, use the multiple-query strategy; if not, use the current single-query streaming strategy.
  3. Try a single query that filters the data prior to streaming. An alternative way to do the sampling is to construct a new column for the time period in question that is floor((event_timestamp - sample_start) / sample_interval), group on these integers, and take the last event from each bin, since it holds the value in effect at the following sample time. The integers correspond to the index of the sample bin each event belongs in (see the worked example below). It will likely be simpler to grab the prior point to use as the first value than to apply this algorithm to data before sample_start. The ResultSet of this can be fed into the existing mySamplerStream, which handles all of the corner cases of sampling.

I will try the 3rd option manually to see if it looks viable.
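For a concrete (made-up) illustration of the bin arithmetic in option 3: with sample_start = 0 and sample_interval = 10, events at t = 3, 7, and 12 land in bins floor(3/10) = 0, floor(7/10) = 0, and floor(12/10) = 1. The last event in bin 0 (t = 7) is the value still in effect at sample time 10, which is exactly what the sampler should report.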


apcarp commented Sep 28, 2023

I tested an implementation of option 3 above. It's slower than the current approach, and it also seems to scale worse in general (~8 million rows took ~30 seconds and 36 million rows took >10 minutes, while streaming 36 million rows took ~1.5 minutes).

Here is the example query that I tested. Someone else could reuse it by substituting the appropriate table number and the start, end, and sample interval values. Let me know if someone better with SQL sees any obvious improvements.

-- Start time 440506126644019200 and sample interval 966367641600 are in
-- the archiver's native time units; substitute your own table and values.
WITH binned_table AS (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY bin ORDER BY time DESC) AS rn
    FROM (
        -- Assign each event to its sample bin.
        SELECT *, FLOOR((time - 440506126644019200) / 966367641600) AS bin
        FROM table_423655
        WHERE time >= 440506126644019200 AND time < 440830826171596800
    ) AS t
)
-- Keep only the most recent event in each bin.
SELECT * FROM binned_table WHERE rn = 1;
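A guess (unverified) as to why this loses to streaming: the window function forces the server to materialize and sort/partition every event in the range before it can return anything, so the query still touches all of the rows and adds a sort pass on top, whereas the streaming approach reads the rows once in time order.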


apcarp commented Dec 13, 2023

Fixed in v6.1.0

apcarp closed this as completed Dec 13, 2023