mysampler slow for busy channels #19

Closed
apcarp opened this issue Sep 27, 2023 · 2 comments


apcarp commented Sep 27, 2023

The mysampler algorithm is very slow on channels like ILM0I07Lc. Running mysampler on it with 200 one-hour samples took ~20 seconds to complete, and the original query using 3000 one-hour samples took ~2 minutes. This appears to be because the sampling algorithm's run time is largely determined by how long it takes to stream the channel history: this channel has 75 million updates over that four-month period. The web-based myquery/interval and the command line myget both also take ~2 minutes to stream this data range.

The command line mySampler application handles the longer query within a few seconds. After speaking with Chris, I learned this is because it performs a SQL query per channel per sample (something like SELECT ... WHERE time < sample_time ORDER BY time DESC LIMIT 1). This scales poorly with a large number of channels and a large number of samples (e.g., 100 PVs * 100 samples = 10,000 queries), but it works very well in this scenario since it skips the cost of streaming all the data.
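For reference, a minimal sketch of one such per-sample lookup, reusing the table number and a timestamp from the query later in this thread; the column names are assumptions, not the actual mySampler implementation:

-- Hypothetical per-channel, per-sample lookup: fetch the most recent
-- event strictly before the given sample time.
SELECT time, value
    FROM table_423655
    WHERE time < 440506126644019200  -- sample_time in the archiver's native units
    ORDER BY time DESC
    LIMIT 1;

A query like this runs once per sample time per channel, so the cost grows with channels * samples rather than with the number of archived events, which is why it wins on busy channels.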

I see three possible solutions:

  1. Do what mySampler CLI does (multiple queries). This is what we did in myquery prior to the new mySamplerStream.
  2. Take a hybrid approach. Inspect one or more small sections of the channel history to be sampled. If it's a busy channel, use the multiple-query strategy; if not, use the current single-query streaming strategy.
  3. Try a single query that filters the data prior to streaming. An alternative way to do the sampling is to construct a new column for the time period in question that is floor((event_timestamp - sample_start) / sample_interval), group on these integers, and take the last event from each bin, since it holds the value in effect at the following sample time. The integers correspond to the index of the sample bin each event belongs in (see the worked example below). It will likely be simpler to grab the prior point to use as the first value than to apply this algorithm to data before sample_start. The ResultSet of this can be fed into the existing mySamplerStream, which handles all of the corner cases of sampling.

I will try the 3rd option manually to see if it looks viable.
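For a concrete (made-up) illustration of the bin arithmetic in option 3: with sample_start = 0 and sample_interval = 10, events at t = 3, 7, and 12 land in bins floor(3/10) = 0, floor(7/10) = 0, and floor(12/10) = 1. The last event in bin 0 (t = 7) is the value still in effect at sample time 10, which is exactly what the sampler should report.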


apcarp commented Sep 28, 2023

I tested an implementation of option 3 above. It's slower than the current approach, and it also seems to scale worse in general (~8 million rows took ~30 seconds and 36 million rows took >10 minutes, while streaming 36 million rows took ~1.5 minutes).

Here is the example query that I tested. Someone else could reuse it by substituting the appropriate table number and the start, end, and sample interval values. Let me know if someone better with SQL sees any obvious improvements.

-- Start time 440506126644019200 and sample interval 966367641600 are in
-- the archiver's native time units; substitute your own table and values.
WITH binned_table AS (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY bin ORDER BY time DESC) AS rn
    FROM (
        -- Assign each event to its sample bin.
        SELECT *, FLOOR((time - 440506126644019200) / 966367641600) AS bin
        FROM table_423655
        WHERE time >= 440506126644019200 AND time < 440830826171596800
    ) AS t
)
-- Keep only the most recent event in each bin.
SELECT * FROM binned_table WHERE rn = 1;
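A guess (unverified) as to why this loses to streaming: the window function forces the server to materialize and sort/partition every event in the range before it can return anything, so the query still touches all of the rows and adds a sort pass on top, whereas the streaming approach reads the rows once in time order.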


apcarp commented Dec 13, 2023

Fixed in v6.1.0

apcarp closed this as completed Dec 13, 2023