Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] 1900 anonymization date too early for 32-bit integers storing as days back from Jan 1, 1970 #360

Closed
alexrockhill opened this issue Nov 6, 2019 · 19 comments

Comments

@alexrockhill
Copy link
Contributor

alexrockhill commented Nov 6, 2019

See mne-tools/mne-python#7016 for a full discussion. Basically, the 32-bit integers cover dates from 1901 to 2038 relative to the starting date used of Jan 1, 1970. It seems to me if BIDS specifications were relaxed to 1925, then we could have this discussion again in 2038 when probably much will be different and it may well be irrelevant due to changing integer storage types.

EDIT: This is an issue with the FIF specification info['meas_date'] as days relative to Jan 1, 1970.

cc @agramfort @larsoner @jasmainak @bloyl.

@jasmainak
Copy link
Collaborator

jasmainak commented Nov 6, 2019

@alexrockhill for convenience of other people, can you please link to the part of the specification that asks for this?

also, is fif the only format that has this issue or do we have this for other formats too? what are the earliest dates one can store with other formats like Brainvision, EDF etc?

I'm sure @robertoostenveld will have an opinion here

@alexrockhill
Copy link
Contributor Author

alexrockhill commented Nov 6, 2019

Definitely, sorry!

https://github.com/bids-standard/bids-specification/blob/7030451090f19bcf15e49749714e204d9eadd5b1/src/02-common-principles.md


Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real 
dates from shifted dates always use year 1900 or earlier when including shifted years. For longitudinal 
studies please remember to shift dates within one subject by the same number of days to maintain the 
interval information. Example: 1867-06-15T13:45:30

@alexrockhill
Copy link
Contributor Author

alexrockhill commented Nov 6, 2019

This is the header of a .vmrk file of my data

Brain Vision Data Exchange Marker File, Version 1.0

[Common Infos]
Codepage=UTF-8
DataFile=AS_HC_00001_AR.eeg

[Marker Infos]
; Each entry: Mk<Marker number>=<Type>,<Description>,<Position in data points>,
; <Size in data points>, <Channel number (0 = marker is related to all channels)>
; Fields are delimited by commas, some fields might be omitted (empty).
; Commas in type or description text are coded as "\1".
Mk1=New Segment,,1,1,0,20190710134912023581
Mk2=Stimulus,S 64,910387,1,0

Looks like they are stored like 20190710134912023581 from which I would deduce that there are 4 spaces for the year, two for the month etc. which should be able to represent any date in the foreseeable future.

@alexrockhill
Copy link
Contributor Author

alexrockhill commented Nov 6, 2019

This is the header for a biosemi .bdf. That first string of numbers looks like a date to me, probably already anonymized?

ˇBIOSEMI
07.05.1812.49.4720736   24BIT                                       2782    1       80  Fp1             AF7             AF3             F1              
F3              F5              F7              FT7             FC5             FC3             FC1             C1              C3              C5    
          T7              TP7             CP5             CP3             CP1             P1              P3              P5              P7           
   P9              PO7             PO3             O1              Iz              Oz              POz             Pz              CPz             Fpz             \Fp2             AF8             AF4             AFz             Fz              F2              F4              F6              F8              FT8   
          FC6             FC4             FC2             FCz             Cz              C2              C4              C6              T8            
  TP8             CP6             CP4             CP2             P2              P4              P6              P8              P10             PO8
             PO4             O2              EXG1            EXG2            EXG3            EXG4            EXG5            EXG6       
     EXG7            EXG8            GSR1            GSR2            Erg1            Erg2            Resp            Plet            Temp            
Status          Active Electrode                                

@effigies
Copy link
Collaborator

effigies commented Nov 6, 2019

It's not clear to me (from having very lightly skimmed the linked issue) that this is actually a problem. BIDS doesn't really concern itself with the header contents of files. Why not zero these out, and store necessary dates in TSV files, in ISO-8601 format?

If relative times within a file are needed, then any usable offset that anonymizes dates seems reasonable and outside the scope of BIDS.

@alexrockhill
Copy link
Contributor Author

@effigies such a simple solution, I really should have thought of that, thanks for looking it over!

@alexrockhill
Copy link
Contributor Author

alexrockhill commented Nov 6, 2019

Actually thinking about it, until MNE changes to match, it might be confusing to users if the date in their FIF header is different than the date in the sidecar. I know that's a small point, but I also think the difference between 1900 and 1925 is small too. I don't think anyone has any digitized neuroimaging data before 1925... :)

@agramfort
Copy link
Collaborator

For EEG/MEG data we need to decide how we proceed to avoid making mistakes. I think it's error prone to have dates in header that don't match sidecar files. As the bids datasets are written with custom packages that can export BIDS dataset we need to see how to proceed. For example if you have an MEG fif file you would get different objects if you do:

raw = mne_bids.read_raw_bids('xxx.fif') # would use sidecar
vs
raw = mne.io.read_raw_fif('xxx.fif') # would use header

if we can have a way to avoid this mistake by allowing dates between 1900 and 1925 (or something) that would be great. Note that you have data in openneuro such as https://openneuro.org/datasets/ds000117/versions/1.0.3 that have been "anonymized" with mne but therefore cannot store dates following the bids standard.

@effigies
Copy link
Collaborator

effigies commented Nov 7, 2019

I agree that raising from 1900 to 1925 is probably not a big deal, and I'm not aware of anybody using a smaller time_t than 32 bits, so we're unlikely to need to keep revisiting this. Possibly there's code that will warn "These dates don't look anonymized", but that's relatively easily fixed.

Might be worth getting some input from early BIDSers about why 1900 was chosen, in case there is some reason besides roundness to prefer it to 1925. I don't see anything on the mailing list, so it presumably happened in person. cc @jbpoline @tyarkoni @poldrack


It still seems a little strange to me that this is a problem for BIDS and not a quirk of one file format. I think the thing I'm missing is whether the datetimes are significant in the analysis of an individual file. If not, then having a specific value that indicates the date has been anonymized (all zeroes or all ones seem good candidates) seems sufficient, but I really lack the experience of analyzing EEG/MEG data. I suppose that from the above comments, it looks like time of acquisition is considered essential metadata in all formats? How is it commonly used in analysis?

In the context of BIDS-aware MRI tools, we have made an explicit policy of preferring sidecar data for repetition time, phase-encoding direction, and slice timing, when these disagree with the NIfTI headers, and the reason is that these are much more amenable to manual inspection and correction. I believe the validator warns if they do not match AND the NIfTI header fields are set.

But, again, if it makes life substantially easier, I don't see any reason not to move to 1925.

@jasmainak
Copy link
Collaborator

you might want to anonymize both the birthday and the measurement date but still keep their relative difference to preserve information such as age. That's why I think it's not preferable to "zero out" the dates.

the second argument for me is that almost no MEG/EEG software at the moment uses BIDS sidecars to read in files. They are certainly useful pieces of information to enrich the data, but losing them shouldn't make the file unusable.

@effigies
Copy link
Collaborator

effigies commented Nov 7, 2019

If you're trying to capture both birthday and acquisition in int32, aren't you going to run into problems when you scan someone 25 years old or older? If tomorrow I were to scan someone born in 1925, how would I encode them? If their birth date is set to Jan 1, 1902, then their scan date would be some time in 1996. No matter what you do, something is going to be out of accordance with BIDS anonymization rules.

Which is fine, as long as that's understood. BIDS doesn't care about most fields inside the various formats.

As I suggested in my first post, any necessary differences could be encoded as intervals, rather than hard dates. If all that matters are within-subject intervals, then set the birthday to 0 and any acquisitions to real_time - birthday. If between-subject intervals matter, then I suppose the zero-point becomes the birthday of the oldest subject. (I'm using 0 here demonstratively. You could just as easily use 1 << 31 as the reference point.)

They are certainly useful pieces of information to enrich the data, but losing them shouldn't make the file unusable.

This is what I meant about me lacking the EEG/MEG context. I don't know what data are critical for interpreting an individual file. To me, dates seem like throw-away information 90% of the time. If I want them, it would be in the context of a whole dataset.

@robertoostenveld
Copy link
Collaborator

BIDS specifies how datetime is to be dealt with in TSV/JSON sidecars, but (afaik) it only relates to the scans.tsv sidecar file (please correct me if I am wrong). Dealing with potentially identifying information in the raw data files falls outside the scope of the BIDS specification (*). Also: BIDS is not per see abut sharing data with the world, it is also for internal archiving, where subject identifiers can be identifying and dates relevant. But that said... it is an issue in one of the most desired use cases, which is sharing more data. Each raw file format has its own specifics, e.g. I was not aware of the datetime being represented in the vmrk files, nor in the bdf files. But e.g. describ[80] in nifti could in principle also contain identifying information (e.g. there is no guarantee that dcm2niix is used in the conversion).

*) however, it would be nice to have a BIDS app for data de-identification (not only defacing, but also scrubbing dates from headers and touching all files to ensure that file creation timestamps are not informative). Or to make this part of the tools that help in creating/curating BIDS datasets, e.g.implementing this in the data2bids function, as we already have done with the cfg.deface option.

For our MOUS datasets - largely prepared prior to BIDS-MEG being finished, and prior to the tooling being at the more comfortable level it is now: I do not recommend the following example any more - we documented this for CTF (see at the end of section 3, also note the extra script to deal with a bug/limitation in the CTF software) and for NBS Presentation log files in sourcedata (see section 4).

Regarding discrepancies between sidecar files (JSON and TSV) and data files: I have always been working under the (tacit) assumption that the sidecar files should (where possible) correct/overrule the original data files. E.g. trigger sequences in a binary STIM channel can appear "enriched" in the events.tsv sidecar (see e.g. issues listed here). Idem for good/bad channels in channels.tsv (not flagged in the original header either). So for synchronization purposes I would use the datetime in the sidecar files (e.g. as explained in the example that leads to this).

I would prefer the (binary) headers to remain interpretable and consistent with the sidecars. I realize that in remove_ctf_datetime and hence in the MOUS dataset I used 1970 as the date. I would also be fine with 1925, or any other date. I think that specifically the dates being the same for all subjects in the dataset would signal to me that they are not to be interpreted as absolute.

Perhaps we should also document somewhere (for the time being, i.e. until we are confident that tools are in place to handle this properly) what identifying information might be present in what raw file file. Perhaps on the starter kit?

@poldrack
Copy link

poldrack commented Nov 7, 2019 via email

@chrisgorgo
Copy link
Contributor

side note:
Re "BIDS app for data de-identification" - this might be of interest https://github.com/PeerHerholz/BIDSonym you might want to talk with @PeerHerholz

@satra
Copy link
Collaborator

satra commented Nov 7, 2019

anything that is stored in an inode can be destroyed by many operations (one example below). so on the filesystem side, there are many ways to change the inode values. for anonymization, i think we are talking about content as well timestamps, and both could be altered.

$ stat /om2/user/satra/projects/embarc.hf5
  File: ‘/om2/user/satra/projects/embarc.hf5’
  Size: 25768927        Blocks: 50336      IO Block: 1048576 regular file
Device: 4bh/75d Inode: 206204797473  Links: 1
Access: (0664/-rw-rw-r--)  Uid: (48529/   satra)   Gid: (33368/  gablab)
Access: 2019-11-07 12:42:06.396302196 -0500
Modify: 2018-02-27 19:59:34.041540637 -0500
Change: 2018-02-27 19:59:34.041540637 -0500
 Birth: -
$ cp /om2/user/satra/projects/embarc.hf5 .
$ stat embarc.hf5 
  File: ‘embarc.hf5’
  Size: 25768927        Blocks: 49152      IO Block: 4194304 regular file
Device: de6f9686h/3731854982d   Inode: 144115490502015372  Links: 1
Access: (0664/-rw-rw-r--)  Uid: (48529/   satra)   Gid: (    0/    root)
Access: 2019-11-07 12:42:06.000000000 -0500
Modify: 2019-11-07 12:42:06.000000000 -0500
Change: 2019-11-07 12:42:06.000000000 -0500
 Birth: -

@effigies
Copy link
Collaborator

effigies commented Nov 7, 2019

I think we may have veered a bit off course. I don't think any filesystem-level metadata is under consideration.

@jasmainak
Copy link
Collaborator

f you're trying to capture both birthday and acquisition in int32, aren't you going to run into problems when you scan someone 25 years old or older? If tomorrow I were to scan someone born in 1925, how would I encode them? If their birth date is set to Jan 1, 1902, then their scan date would be some time in 1996. No matter what you do, something is going to be out of accordance with BIDS anonymization rules.

Talking to @LorenzE I realized this is not a problem. Because the measurement date is stored in Unix time -- so it encodes the number of seconds since 1st January 1970 whereas the birthday is stored in Julian days and you store the number of days since 1st January 4713 BC. Thus int32 is able to encode the birthday with no problem. So if we move the recommended date to 1925, it appeases the fif format. Moving to a newer format would be a longer plan / vision but I think this is an easy fix that can resolve some problems for now.

Then regarding this:

Regarding discrepancies between sidecar files (JSON and TSV) and data files: I have always been working under the (tacit) assumption that the sidecar files should (where possible) correct/overrule the original data files

and:

BIDS doesn't care about most fields inside the various formats.

Is this documented somewhere? I actually always assumed the opposite. That the sidecars always match what is in the headers of different file formats. This is because the validator checks for this and there is even a flag to ignore the NIFTI headers. This was not implemented for MEG/iEEG/EEG because of technical limitations (no library to read these file formats in javascript).

The other issue is that now you mix file formats and data curation. Imagine if the birthday is also stored in another metadata file that takes precedence over the header, if you wanted to share just one file (for e.g., to report a bug), you will have to share the entire BIDS directory instead of just one file.

@agramfort
Copy link
Collaborator

agramfort commented Nov 10, 2019 via email

@sappelhoff
Copy link
Member

closed in #363

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants