[New Check] Entire column of Timeseries is NaN #229
base: dev
Conversation
Just going to wait until #231 is merged, then I can refactor it to use the util function.
Codecov Report

@@           Coverage Diff            @@
##             dev     #229      +/-  ##
==========================================
+ Coverage   94.87%   95.01%   +0.13%
==========================================
  Files          17       17
  Lines         936      962      +26
==========================================
+ Hits          888      914      +26
  Misses         48       48
nwbinspector/checks/time_series.py
Outdated
subframe_selection = np.unique(
    np.round(np.linspace(start=0, stop=time_series.data.shape[0] - 1, num=nelems)).astype(int)
)
Looks like you could move this to line 110
Using the clever logic we devised from the tables PR now (direct slicing with a 'by' amount calculated from the shape).
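For reference, a minimal sketch of what that direct-slicing idea could look like; the helper name subsample_rows and the default cap are assumptions for illustration, not the merged implementation:

import numpy as np

# Hypothetical sketch: read at most roughly nelems evenly spaced rows by
# computing a stride (the "by" amount) from the dataset's shape, instead of
# building an explicit index array with np.linspace.
def subsample_rows(data, nelems=200):
    n_frames = data.shape[0]
    by = max(n_frames // nelems, 1)  # stride so that roughly nelems rows are read
    return data[::by]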
    )
)
elif n_dims == 2:
    for col in range(time_series.data.shape[1]):
what if there are 1000s of columns?
3 SpikeGLX probes would hit that amount... Not that I've seen anyone do more than 2 currently, but we can think towards the future.
Could be faster if we did the np.isnan calculation vectorized over the entire extracted array of [subselected_rows, all_cols]...
The same case could be made for the check_table_cols_nan for DynamicTables (there could be thousands of columns there that we iterate over), but I get that the TimeSeries data is likely to be much larger.
If you're suggesting to subselect over columns as well here, I'd just point out that could ultimately reduce the accuracy: if there are only a small number of channels/ROIs that contain nothing but NaN at every time point, the likelihood of those offending columns being selected may not be very high. So I'd actually prefer limiting the nelems of the row data selection to examine fewer data values per row of each column...
Now, that's aside from the other idea I have for pulling data chunk-wise whenever possible and fully utilizing each chunk's data (described below).
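A rough sketch of that vectorized variant, assuming the subframe_selection index array from the snippet quoted above; the helper name find_all_nan_columns is made up for illustration:

import numpy as np

def find_all_nan_columns(data, subframe_selection):
    # One h5py read of [subselected rows, all columns], then a single
    # vectorized reduction instead of a per-column Python loop.
    subframe = data[subframe_selection, :]  # shape: (len(subframe_selection), n_cols)
    return np.flatnonzero(np.isnan(subframe).all(axis=0))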
nwbinspector/checks/time_series.py
Outdated
if not all(np.isnan(time_series.data[:nelems, col]).flatten()):
    continue

if all(np.isnan(time_series.data[subframe_selection, col]).flatten()):
Is there a way of modifying this code to minimize the number of h5py dataset reads?
Yeah, we could take the approach of pulling out and checking data in packets of 'chunks' (would be much more efficient for streaming).
In the worst case, as it currently stands, the slicing might span nelems total chunks for each column, which might be cached for a bit on the HDF5 end, but it's unlikely that's much benefit at normal data scales. Either way is definitely a bit inefficient.
One idea could be to check if the time_series being read is chunked, and if it is, alter our interpretation of nelems from spanning nelems individual data values to spanning some smaller number of chunks throughout the dataset, while utilizing all of the data retrieved from those chunks.
This is also all separate from the first early exit condition of this entire check, which only examines the first nelems consecutive data values; those are fairly unlikely to span multiple chunks unless the chunk_shape is something extreme like by-frame for all channels.
I can think more about the total data usage this function might require and report some approximate numbers. But yeah, we might need to be a bit more restrictive than usual with the nelems subsetting; the interaction with chunking could give some poor performance in practice.
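To make the chunk-wise idea concrete, here is a hedged sketch for the 2D case. It relies only on standard h5py Dataset attributes (.chunks, .shape); the function name, the max_chunks cap, the assumption of float data, and the row-only chunk handling are all illustrative choices, not a proposed final design:

import numpy as np

def columns_all_nan_chunkwise(dset, max_chunks=10):
    # Hypothetical: if the HDF5 dataset is chunked, sample a few whole
    # row-chunks spread across the dataset and use every value they contain,
    # so each read lines up with chunk boundaries on disk.
    if dset.chunks is None:
        return None  # contiguous dataset; fall back to value-wise sampling
    rows_per_chunk = dset.chunks[0]
    n_row_chunks = -(-dset.shape[0] // rows_per_chunk)  # ceiling division
    chunk_ids = np.unique(
        np.round(np.linspace(0, n_row_chunks - 1, num=min(max_chunks, n_row_chunks))).astype(int)
    )
    all_nan = np.ones(dset.shape[1], dtype=bool)
    for c in chunk_ids:
        block = dset[c * rows_per_chunk : (c + 1) * rows_per_chunk, :]
        all_nan &= np.isnan(block).all(axis=0)
    return np.flatnonzero(all_nan)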
Solves half of #155
Only applies to 1D and 2D TimeSeries data; any higher than that and the multi-axis structure could itself contain information. E.g., when writing an ImageSeries (frames x width x height x 3 color channels), if a single (x, y) pixel is NaN for all frames and color channels, you'd still want to keep it in the dataset to maintain the video structure.
But for 1D data, if the entire data is NaN, I can't think of a reason why that data would even be written to the NWBFile. Likewise for 2D data, it would be suggested, for example on an ElectricalSeries, to simply not write those channels that never specify any actual data.
Note that it is fairly common to find blocks of NaN data interspersed within TimeSeries; the subframe_selection method thus attempts to sample an even spread of the nelems over the range of the number of frames.
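For illustration, here is the even spread that logic produces on a small assumed example (1000 frames, nelems=5), mirroring the np.linspace snippet quoted above:

import numpy as np

n_frames, nelems = 1000, 5
subframe_selection = np.unique(
    np.round(np.linspace(start=0, stop=n_frames - 1, num=nelems)).astype(int)
)
print(subframe_selection)  # [  0 250 500 749 999]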