Incorporating non-datalogger data into processing pipeline #213

Open
bpbond opened this issue Oct 17, 2024 · 6 comments
@bpbond
Member

bpbond commented Oct 17, 2024

@Fausto2504 @nickdward request:

Ben and Steph, for the AquaTROLLs specifically we have had issues in the past with data not migrating from the troll to the Campbell logger. This is also the case for some of our YSI Exo sondes in the surface water. In these cases, the data is generally stored internally on the sondes and we can recover it.

So in some cases we have a complete dataset downloaded from the sonde, but the L1 automated pipeline would not be retrieving this data. Do we want to address this at some stage in the pipeline, or would that perhaps be considered L2 data? In other words, at what stage of the pipeline would it be desirable to say “looks like loggernet missed all of 2022, let’s replace it with the data downloaded directly from the sonde”?

I want to think through in detail how this would work. As an example, here are the fields in a datalogger "WaterLevel600A" file versus the sample raw file:

| Datalogger file | AquaTROLL raw file |
| --- | --- |
| TIMESTAMP | Date Time |
| Aquatroll_IDA(1) | Device Id (in file header) |
| Barometric_Pressure600A | 16: Barometer (16) mmHg (22) |
| Temperature600A | 13: Temperature (1) °C (1) |
| Actual_Conductivity600A | 7: Actual Conductivity (9) µS/cm (65) |
| Specific_Conductivity600A | 8: Specific Conductivity (10) µS/cm (65) |
| Salinity600A | 9: Salinity (12) PSU (97) |
| TDS600A | 12: Total Dissolved Solids (13) ppt (114) |
| Water_Density600A | 11: Water Density (14) g/cm³ (129) |
| Resistivity600A | 10: Resistivity (11) ohm-cm (81) |
| pH600A | 1: pH (17) pH (145) |
| pH_mV600A | 2: pH mV (18) mV (162) |
| pH_ORP600A | 3: ORP mV (19) mV (162) |
| RDO_concen600A | 4: Dissolved Oxygen (20) mg/L (117) |
| RDO_perc_sat600A | 5: Dissolved Oxygen (21) %sat (177) |
| RDO_part_Pressure600A | 6: Partial Pressure Oxygen (30) Torr (26) |
| Pressure600A | 17: Pressure (2) psi (17) |
| Depth600A | 18: Depth (3) cm (34) |
| Voltage_Ext600A | 14: External (32) V (163) |
| Battery_Int600A | 15: Battery (33) % (241) |

@bpbond
Member Author

bpbond commented Oct 17, 2024

Minor problem: timestamps

The datalogger (TIMESTAMP) and instrument (Date Time) timestamps won't be the same. We can't do anything about that, though.

Minor problem: flagging origin

Do we want a new F_DLG L1 flag indicating the origin of the data?
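One way such a flag could be populated during backfilling is sketched below. This is only an illustration: the function name `backfill`, the single `value` field, and the encoding (1 = datalogger, 0 = instrument file) are all assumptions, and it presumes the two sources have already been aligned to common timestamps.

```python
def backfill(logger_rows, instrument_rows):
    """Merge two {timestamp: value} dicts, preferring datalogger values.

    Each output row carries an F_DLG flag. The encoding used here
    (1 = datalogger, 0 = instrument file) is an assumption.
    """
    merged = []
    for ts in sorted(set(logger_rows) | set(instrument_rows)):
        if ts in logger_rows:
            # Datalogger data exists for this timestamp: keep it.
            merged.append({"TIMESTAMP": ts, "value": logger_rows[ts], "F_DLG": 1})
        else:
            # Gap in the datalogger record: fill from the instrument file.
            merged.append({"TIMESTAMP": ts, "value": instrument_rows[ts], "F_DLG": 0})
    return merged
```

Preferring the datalogger value whenever both exist keeps the existing L1 output unchanged and only touches the gaps.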

Major problem: ID matching

We need to match Device S/N = xxxxxxx in the Aquatroll600 file header to Compass_CRC_UP_303 or whatever, i.e. the datalogger site and plot code. This is easy to do, but fragile if anything changes. [edit: see @Fausto2504 comment below re google sheet.]
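A minimal sketch of the header matching, assuming the serial number appears on a line of the form `Device S/N = <digits>`. The lookup table and its site/plot/troll values are invented for illustration; a real table would be loaded from a maintained file.

```python
import re

# Hypothetical lookup table: the location values below are made up.
SERIAL_TO_LOCATION = {
    "848067": {"site": "CRC", "plot": "UP", "troll": "A"},
}

# \d+ rather than a fixed width, so serials of any length match.
SERIAL_PATTERN = re.compile(r"Device S/N\s*=\s*(\d+)")

def find_serial(header_lines):
    """Return the first serial number found in the header, or None."""
    for line in header_lines:
        match = SERIAL_PATTERN.search(line)
        if match:
            return match.group(1)
    return None

def locate_device(header_lines, table=SERIAL_TO_LOCATION):
    """Map a file header to its location; fail loudly if unknown."""
    serial = find_serial(header_lines)
    if serial is None:
        raise ValueError("no 'Device S/N' line found in file header")
    if serial not in table:
        raise KeyError(f"serial {serial} is not in the lookup table")
    return table[serial]
```

Raising on an unknown serial, rather than guessing, is one way to surface the fragility early: a swapped or new instrument stops the read instead of silently landing in the wrong plot.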

Major problem: name matching

We need to match column names from the raw instrument files to those used by @roylrich 's logger code. I can think of two ways to do this:

  • Option 1: do it when reading in the files, i.e. in code with hard-coded matching. This has the advantage that it's very early in the pipeline, so the resulting L0 data will look just like the datalogger data. The disadvantage is that it buries crucial logic in code, as opposed to centralized configuration files.
  • Option 2: do it via the design table. The advantage of this is it keeps crucial information in one place; the big disadvantage is that it will double the number of entries for ALL AquaTROLL600A entries...ugh!
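As a rough illustration of keeping the name matching in a configuration file rather than in code, here is a sketch using a small two-column mapping. The file layout (`raw_name`, `logger_name`) and the function names are assumptions; the raw names are copied from the table above, and the actual header strings in an instrument file may differ.

```python
import csv
import io

# Hypothetical mapping file contents; a real file would list every field.
MAPPING_CSV = """raw_name,logger_name
Date Time,TIMESTAMP
13: Temperature (1) °C (1),Temperature600A
18: Depth (3) cm (34),Depth600A
"""

def load_mapping(text):
    """Parse the mapping CSV into a {raw_name: logger_name} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["raw_name"]: row["logger_name"] for row in reader}

def rename_record(record, mapping):
    """Rename the keys of one data row; fail loudly on unmapped fields."""
    unknown = [name for name in record if name not in mapping]
    if unknown:
        raise KeyError(f"unmapped raw fields: {unknown}")
    return {mapping[name]: value for name, value in record.items()}
```

A separate per-instrument mapping file like this sits between the two options: the logic stays out of the code, and the design table does not need duplicate entries.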


@roylrich

@bpbond,
I would like to talk about this. Date time for AQ units should come in as a separate variable from the timestamp when connected. The issue, as I understand it, is backfilling when we only have AQ data. Maybe we can calculate an offset between the two that creates the definitive timestamp (or a new definitive timestamp) for the dataset: fill from the Loggernet timestamp unless it is missing, then use the other source plus the offset?
We have this issue for EXO sondes and other instruments, so it would be worth getting a common strategy.
My hunch is that we want to do it in the L1 processing pipeline, but not having a Loggernet timestamp will be an issue for naming and checks.
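The offset idea might look something like the sketch below. The function names are hypothetical, and using the median of the differences is an assumption made here (it tolerates a few bad rows); a mean or a time-varying offset would be alternatives.

```python
from datetime import datetime, timedelta
from statistics import median

def estimate_offset(pairs):
    """Median of (logger - instrument) over rows that have both timestamps.

    `pairs` is a list of (logger_ts_or_None, instrument_ts) tuples.
    """
    diffs = [(lg - inst).total_seconds() for lg, inst in pairs if lg is not None]
    if not diffs:
        raise ValueError("no overlapping rows to estimate an offset from")
    return timedelta(seconds=median(diffs))

def fill_timestamps(pairs):
    """Use the logger timestamp where present, else instrument time + offset."""
    offset = estimate_offset(pairs)
    return [lg if lg is not None else inst + offset for lg, inst in pairs]
```

Note that this only works when some rows have both timestamps; a file with no overlap at all raises an error, which matches the concern about having no Loggernet timestamp to anchor on.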

@bpbond
Member Author

bpbond commented Oct 17, 2024

Thanks, Roy, and I agree it would be good for you and me and @stephpenn1 to discuss.

@Fausto2504

@bpbond
Are we doing a "pre-processing" step to reshape the internal data the way we want, or gathering ideas on how to code the matching of columns with different names?

I think a flag of file origin is a good idea!

Not sure yet how to do the ID matching. One idea is to use the device serial number. In row 6 of the internal data we find "Device S/N = 848067" (note for coding: we have a unit with a 7-digit number). We have a spreadsheet, which we use for flagging when a sensor is out of water, that relates serial number to site and zone (location): https://docs.google.com/spreadsheets/d/1XbIt8gsOWaLBmpzUmnupKzt92GocHBMooD-MNyvijIY/edit?gid=0#gid=0

@bpbond
Member Author

bpbond commented Oct 18, 2024

@Fausto2504 ah THANK YOU, I had forgotten about that spreadsheet.

That is exactly what we will need! Serial number -> Site, plot, and troll A/B/C.

@bpbond
Member Author

bpbond commented Oct 18, 2024

Summary:

  • I have tested out the ideas above in "Testing: reading raw instrument data" #216 and the approach works well; I don't see any fundamental problems
  • Per the comments above, we need two files (one mapping ID or serial number to site, plot, A/B/C; the other mapping field names) for each instrument type (AquaTROLL, Exo, etc)
  • A flag in the final L1 data indicating data source seems like a good idea
  • The timestamp issue remains unresolved
  • @stephpenn1 @roylrich I have put a meeting on our calendars to discuss this. I had trouble finding anything sooner than two weeks from today; seems like Steph is busy/out quite a bit next week
  • OK, thanks!

@bpbond bpbond closed this as completed Oct 27, 2024
@bpbond bpbond reopened this Oct 27, 2024
@bpbond bpbond added this to the v1-2 milestone Oct 27, 2024
@bpbond bpbond added the enhancement New feature or request label Oct 27, 2024