The mysampler algorithm is very slow on channels like ILM0I07Lc. Running mysampler on it with 200 one-hour samples took ~20 seconds to complete, and the original query using 3000 one-hour samples took ~2 minutes to complete. This appears to be because the sampling algorithm's run time is largely determined by how long it takes to stream the channel history: this channel has 75 million updates over that four-month period. The web-based myquery/interval and the command line myget both also take ~2 minutes to stream this data range.
The command line mySampler application handles the longer query within a few seconds. After speaking with Chris, this is because it performs one SQL query per channel per sample (something like SELECT ... WHERE time < sample_time ORDER BY time DESC LIMIT 1). This scales poorly with a large number of channels and a large number of samples (e.g., 100 PVs * 100 samples = 10,000 queries), but it works very well in this scenario since it skips the cost of streaming all the data.
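For reference, a minimal sketch of what that per-sample lookup looks like, borrowing the table name from the tested query further down (the exact SQL mySampler issues may differ):

SELECT *
FROM table_423655
WHERE time <= 440506126644019200  -- the sample time of interest
ORDER BY time DESC
LIMIT 1;

Run once per channel per sample, each lookup only ever touches a single row (assuming time is indexed), which is why it sidesteps the streaming cost entirely.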
I see three possible solutions:
1. Do what the mySampler CLI does (multiple queries). This is what we did in myquery prior to the new mySamplerStream.
2. Take a hybrid approach. Inspect one or more small sections of the channel history to be sampled; if it's a busy channel, use the multiple-query strategy, and if not, use the current single-query strategy (a sketch of the probe follows this list).
3. Try a single query that filters the data prior to streaming. An alternative way to do the sampling is to construct a new column for the time period in question, floor((event_timestamp - sample_start) / sample_interval), group on these integers, and take the last event from each bin, since it holds the value in effect at the following sample time. The integer is simply the number of the sample bin the event belongs to (a worked example follows this list). It will likely be simpler to grab the prior point to use as the first value than to apply this algorithm to data before sample_start. The ResultSet of this query can be fed into the existing mySamplerStream, which handles all of the corner cases of sampling.
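As a sketch of the busyness probe for option 2 (the probe window here is a placeholder, not a measured value; the table name is borrowed from the tested query below):

SELECT COUNT(*) AS n
FROM table_423655
WHERE time >= 440506126644019200                 -- probe start
  AND time <  440506126644019200 + 966367641600; -- e.g., one sample interval's worth of history

If n extrapolated over the full range is large, fall back to per-sample queries; otherwise stream as today.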
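To make the binning in option 3 concrete with made-up numbers: with sample_start = 0 and sample_interval = 10, events at times 3, 7, and 12 land in bins floor(3/10) = 0, floor(7/10) = 0, and floor(12/10) = 1, and the last event in bin 0 (time 7) is the value in effect at sample time 10.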
I will try the 3rd option manually to see if it looks viable.
I tested an implementation of 3) above. It's slower than the current approach, and it also seems to scale worse in general: ~8 million rows took ~30 seconds and 36 million rows took >10 minutes, while streaming took ~1.5 minutes for the same 36 million rows.
Here is the example query that I tested. Someone else could reuse it by substituting the appropriate table number, start, end, and sample interval values. Let me know if someone better with SQL sees any obvious improvements.
WITH binned_table AS (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY bin ORDER BY time DESC) AS rn
    FROM (SELECT *, FLOOR((time - 440506126644019200) / 966367641600) AS bin
          FROM table_423655
          WHERE time >= 440506126644019200 AND time < 440830826171596800) AS t
)
SELECT * FROM binned_table WHERE rn = 1;
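Two notes for anyone reusing this, based on my reading of the constants (treat these as assumptions): if MYA's 64-bit timestamps are Unix seconds shifted left 28 bits, then 966367641600 is 3600 << 28, i.e. a one-hour sample interval, and the start/end bounds span exactly 336 hourly bins (two weeks). Also, ROW_NUMBER() is a window function, so the query needs MySQL 8.0+ or MariaDB 10.2+.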