[New Check] Entire column of Timeseries is NaN #229
base: dev
Conversation
Just going to wait until #231 is merged, then I can refactor it to use the util function.
Codecov Report

@@           Coverage Diff            @@
##             dev     #229      +/-  ##
==========================================
+ Coverage   94.87%   95.01%   +0.13%
==========================================
  Files          17       17
  Lines         936      962      +26
==========================================
+ Hits          888      914      +26
  Misses         48       48
nwbinspector/checks/time_series.py
Outdated
subframe_selection = np.unique(
    np.round(np.linspace(start=0, stop=time_series.data.shape[0] - 1, num=nelems)).astype(int)
)
Looks like you could move this to line 110
Using the clever logic we devised from the tables PR now (direct slicing with a 'by' amount calculated from the shape).
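For reference, a minimal sketch of what that direct-slicing idea could look like; the helper name subsample_rows and the default cap are assumptions for illustration, not the merged implementation:

import numpy as np

# Hypothetical sketch: read at most roughly nelems evenly spaced rows by
# computing a stride (the "by" amount) from the dataset's shape, instead of
# building an explicit index array with np.linspace.
def subsample_rows(data, nelems=200):
    n_frames = data.shape[0]
    by = max(n_frames // nelems, 1)  # stride so that roughly nelems rows are read
    return data[::by]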
    )
)
elif n_dims == 2:
    for col in range(time_series.data.shape[1]):
what if there are 1000s of columns?
3 SpikeGLX probes would hit that amount... Not that I've seen anyone do more than 2 currently, but we can think towards the future.
Could be faster if we did the np.isnan calculation vectorized over the entire extracted array of [subselected_rows, all_cols]...
The same case could be made for the check_table_cols_nan for DynamicTables (there could be thousands of columns there that we iterate over), but I get that the TimeSeries data is likely to be much larger.
If you're suggesting to subselect over columns as well here, I'd just point out that could ultimately reduce the accuracy: if there are only a small number of channels/ROIs that contain nothing but NaN at every time point, the likelihood of those offending columns being selected may not be very high. So I'd actually prefer limiting the nelems of the row data selection to examine fewer data values per row of each column...
Now, that's aside from the other idea I have for pulling data chunk-wise whenever possible and fully utilizing each chunk's data (described below).
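A rough sketch of that vectorized variant, assuming the subframe_selection index array from the snippet quoted above; the helper name find_all_nan_columns is made up for illustration:

import numpy as np

def find_all_nan_columns(data, subframe_selection):
    # One h5py read of [subselected rows, all columns], then a single
    # vectorized reduction instead of a per-column Python loop.
    subframe = data[subframe_selection, :]  # shape: (len(subframe_selection), n_cols)
    return np.flatnonzero(np.isnan(subframe).all(axis=0))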
nwbinspector/checks/time_series.py
Outdated
if not all(np.isnan(time_series.data[:nelems, col]).flatten()):
    continue

if all(np.isnan(time_series.data[subframe_selection, col]).flatten()):
Is there a way of modifying this code to minimize the number of h5py dataset reads?
Yeah, we could take the approach of pulling out and checking data in packets of 'chunks' (would be much more efficient for streaming).
In the worst case, as it currently stands, the slicing might span nelems total chunks for each column, which might be cached for a bit on the HDF5 end, but it's unlikely that's much benefit at normal data scales. Either way is definitely a bit inefficient.
One idea could be to check if the time_series being read is chunked, and if it is, alter our interpretation of nelems from spanning nelems individual data values to spanning some smaller number of chunks throughout the dataset, while utilizing all of the data retrieved from those chunks.
This is also all separate from the first early exit condition of this entire check, which only examines the first nelems consecutive data values; those are fairly unlikely to span multiple chunks unless the chunk_shape is something extreme like by-frame for all channels.
I can think more about the total data usage this function might require and report some approximate numbers. But yeah, we might need to be a bit more restrictive than usual with the nelems subsetting; the interaction with chunking could give some poor performance in practice.
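To make the chunk-wise idea concrete, here is a hedged sketch for the 2D case. It relies only on standard h5py Dataset attributes (.chunks, .shape); the function name, the max_chunks cap, the assumption of float data, and the row-only chunk handling are all illustrative choices, not a proposed final design:

import numpy as np

def columns_all_nan_chunkwise(dset, max_chunks=10):
    # Hypothetical: if the HDF5 dataset is chunked, sample a few whole
    # row-chunks spread across the dataset and use every value they contain,
    # so each read lines up with chunk boundaries on disk.
    if dset.chunks is None:
        return None  # contiguous dataset; fall back to value-wise sampling
    rows_per_chunk = dset.chunks[0]
    n_row_chunks = -(-dset.shape[0] // rows_per_chunk)  # ceiling division
    chunk_ids = np.unique(
        np.round(np.linspace(0, n_row_chunks - 1, num=min(max_chunks, n_row_chunks))).astype(int)
    )
    all_nan = np.ones(dset.shape[1], dtype=bool)
    for c in chunk_ids:
        block = dset[c * rows_per_chunk : (c + 1) * rows_per_chunk, :]
        all_nan &= np.isnan(block).all(axis=0)
    return np.flatnonzero(all_nan)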
Solves half of #155
Only applies to 1D and 2D TimeSeries data; any higher than that and the multi-axis structure could itself contain information. E.g., when writing an ImageSeries (frames x width x height x 3 color channels), if a single (x, y) pixel is NaN for all frames and color channels, you'd still want to keep it in the dataset to maintain the video structure.
But for 1D data, if the entire data is NaN, I can't think of a reason why that data would even be written to the NWBFile. Likewise for 2D data, it would be suggested, for example on an ElectricalSeries, to simply not write those channels that never specify any actual data.
Note that it is fairly common to find blocks of NaN data interspersed within TimeSeries; the subframe_selection method thus attempts to sample an even spread of the nelems over the range of the number of frames.
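For illustration, here is the even spread that logic produces on a small assumed example (1000 frames, nelems=5), mirroring the np.linspace snippet quoted above:

import numpy as np

n_frames, nelems = 1000, 5
subframe_selection = np.unique(
    np.round(np.linspace(start=0, stop=n_frames - 1, num=nelems)).astype(int)
)
print(subframe_selection)  # [  0 250 500 749 999]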