Incorporating non-datalogger data into processing pipeline #213

bpbond · 2024-10-17T10:21:13Z

Ben and Steph, for the AquaTROLLs specifically we have had issues in the past with data not migrating from the troll to the Campbell logger. This is also the case for some of our YSI Exo sondes in the surface water. In these cases, the data are generally stored internally on the sondes and we can recover them.

So in some cases we have a complete dataset downloaded from the sonde, but the L1 automated pipeline would not retrieve this data. Do we want to address this at some stage in the pipeline, or would that perhaps be considered L2 data? In other words, at what stage of the pipeline would it be desirable to say “looks like LoggerNet missed all of 2022, let’s replace it with the data downloaded directly from the sonde”?

I want to think through in detail how this would work. As an example, here are the fields in a datalogger "WaterLevel600A" file versus the sample raw file:

| Datalogger file | AquaTROLL raw file |
| --- | --- |
| TIMESTAMP | Date Time |
| Aquatroll_IDA(1) | Device Id (in file header) |
| Barometric_Pressure600A | 16: Barometer (16) mmHg (22) |
| Temperature600A | 13: Temperature (1) °C (1) |
| Actual_Conductivity600A | 7: Actual Conductivity (9) µS/cm (65) |
| Specific_Conductivity600A | 8: Specific Conductivity (10) µS/cm (65) |
| Salinity600A | 9: Salinity (12) PSU (97) |
| TDS600A | 12: Total Dissolved Solids (13) ppt (114) |
| Water_Density600A | 11: Water Density (14) g/cm³ (129) |
| Resistivity600A | 10: Resistivity (11) ohm-cm (81) |
| pH600A | 1: pH (17) pH (145) |
| pH_mV600A | 2: pH mV (18) mV (162) |
| pH_ORP600A | 3: ORP mV (19) mV (162) |
| RDO_concen600A | 4: Dissolved Oxygen (20) mg/L (117) |
| RDO_perc_sat600A | 5: Dissolved Oxygen (21) %sat (177) |
| RDO_part_Pressure600A | 6: Partial Pressure Oxygen (30) Torr (26) |
| Pressure600A | 17: Pressure (2) psi (17) |
| Depth600A | 18: Depth (3) cm (34) |
| Voltage_Ext600A | 14: External (32) V (163) |
| Battery_Int600A | 15: Battery (33) % (241) |

bpbond · 2024-10-17T10:43:36Z

Minor problem: timestamps

The datalogger (TIMESTAMP) and instrument (Date Time) timestamps won't be the same. We can't do anything about that, though.

Minor problem: flagging origin

Do we want a new F_DLG L1 flag indicating the origin of the data?

Major problem: ID matching

We need to match Device S/N = xxxxxxx in the Aquatroll600 file header to Compass_CRC_UP_303 or whatever, i.e. the datalogger site and plot code. This is easy but fragile to any changes. [edit: see @Fausto2504 comment below re google sheet.]
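To make the ID-matching step concrete, here is a minimal sketch in Python (used only for illustration; it is not the pipeline's own code). It assumes the raw file header contains a line of the form `Device S/N = <digits>` and that the serial-to-location mapping lives in a lookup table. The table contents, the serial number, and the function names `find_serial` and `locate` are all hypothetical.

```python
import re

# Hypothetical lookup table. In practice this would be read from a maintained
# configuration file; the serial number and location below are made up.
SERIAL_TO_LOCATION = {
    "0000001": {"site": "CRC", "plot": "UP", "troll": "A"},
}

# Accepts any number of digits, so 6- and 7-digit serials both match.
SERIAL_PATTERN = re.compile(r"Device S/N\s*=\s*(\d+)")


def find_serial(header_lines):
    """Return the first serial number found in the header lines, or None."""
    for line in header_lines:
        match = SERIAL_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


def locate(header_lines, lookup=SERIAL_TO_LOCATION):
    """Map a raw file header to its location entry, failing loudly if unknown."""
    serial = find_serial(header_lines)
    if serial is None:
        raise ValueError("no 'Device S/N = ...' line found in file header")
    if serial not in lookup:
        raise KeyError(f"serial {serial} is not in the lookup table")
    return lookup[serial]
```

Failing loudly on an unknown serial is a deliberate choice here: if a unit is swapped or the table goes stale, the file is rejected rather than silently assigned to the wrong site.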

Major problem: name matching

We need to match column names from the raw instrument files to those used by @roylrich 's logger code. I can think of two ways to do this:

Option 1: do it when reading in the files, i.e. in code with hard-coded matching. This has the advantage that it's very early in the pipeline, so the resulting L0 data will look just like the datalogger data. The disadvantage is that it buries crucial logic in code, as opposed to centralized configuration files.
Option 2: do it via the design table. The advantage of this is it keeps crucial information in one place; the big disadvantage is that it will double the number of entries for ALL AquaTROLL600A entries...ugh!
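A possible middle ground between the two options is to keep the name mapping as data (for example a small per-instrument file) and apply it when the raw file is read. The Python sketch below illustrates the idea only. It assumes the raw column headers appear exactly as listed in the table above; the three entries shown, the `FIELD_MAP` name, and the `rename_fields` function are illustrative, not the project's actual configuration.

```python
# Illustrative subset of a raw-name -> datalogger-name mapping. A real mapping
# would be loaded from a configuration file, one per instrument type.
FIELD_MAP = {
    "Date Time": "TIMESTAMP",
    "13: Temperature (1) °C (1)": "Temperature600A",
    "9: Salinity (12) PSU (97)": "Salinity600A",
}


def rename_fields(rows, field_map=FIELD_MAP):
    """Rename columns in each row dict according to field_map.

    Returns the renamed rows plus the set of raw column names that had no
    mapping, so that unmapped fields are reported instead of silently lost.
    """
    renamed = []
    unmapped = set()
    for row in rows:
        out = {}
        for name, value in row.items():
            if name in field_map:
                out[field_map[name]] = value
            else:
                unmapped.add(name)
        renamed.append(out)
    return renamed, unmapped
```

The rows could come from something like `csv.DictReader` once the file header has been skipped. Because the mapping is plain data, it can sit next to the other configuration files without doubling the entries in the design table.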

roylrich · 2024-10-17T15:16:47Z

@bpbond,
I would like to talk about this. Date time for AQ units should come in as a separate variable from the timestamp when connected. The issue, as I understand it, is backfilling when we only have AQ data. Maybe we can calculate an offset between the two that creates the definitive timestamp (or new definitive timestamp) for the dataset: fill from the LoggerNet timestamp unless it is missing, then use the other source plus the offset?
We have this issue for EXO sondes and other instruments, so it would be worth getting a common strategy.
My hunch is that we want to do it in the L1 processing pipeline, but not having a LoggerNet timestamp will be an issue for naming and checks.
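The offset idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: records where both timestamps exist have already been paired up, the offset is taken as the median difference (my choice, to resist outliers; nothing in this thread specifies it), and the function names and the `"logger"` / `"instrument"` source labels are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def estimate_offset(pairs):
    """Estimate the logger-minus-instrument clock offset.

    pairs: (logger_ts, instrument_ts) datetime tuples for records that have
    both timestamps. The median difference is used as the offset.
    """
    diffs = [(logger - inst).total_seconds() for logger, inst in pairs]
    if not diffs:
        raise ValueError("need at least one paired record to estimate offset")
    return timedelta(seconds=median(diffs))


def definitive_timestamp(logger_ts, instrument_ts, offset):
    """Prefer the logger timestamp; otherwise use instrument time plus offset.

    Returns (timestamp, source) so the origin can be carried into a flag.
    """
    if logger_ts is not None:
        return logger_ts, "logger"
    return instrument_ts + offset, "instrument"
```

Returning the source alongside the timestamp means the same step could feed a data-origin flag. A single offset also assumes the two clocks do not drift relative to each other over the backfilled period, which would need checking.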

bpbond · 2024-10-17T15:18:45Z

Thanks, Roy, and I agree it would be good for you and me and @stephpenn1 to discuss.

Fausto2504 · 2024-10-17T21:09:08Z

@bpbond
Are we doing a "pre-processing" step to get the internal data into the form we want, or gathering ideas on how to code the matching of columns that have different names?

I think a flag of file origin is a good idea!

Not sure yet how to do the ID matching. One idea is to use the device serial number. In row 6 of the internal data we find "Device S/N = 848067" (note for coding: we have a unit with a 7-digit number). We have a spreadsheet, used for flagging when a unit is out of water, that relates serial number to site and zone (location): https://docs.google.com/spreadsheets/d/1XbIt8gsOWaLBmpzUmnupKzt92GocHBMooD-MNyvijIY/edit?gid=0#gid=0

bpbond · 2024-10-18T10:01:52Z

@Fausto2504 ah THANK YOU, I had forgotten about that spreadsheet.

That is exactly what we will need! Serial number -> Site, plot, and troll A/B/C.

bpbond · 2024-10-18T10:10:24Z

Summary:

I have tested out the ideas above in "Testing: reading raw instrument data #216" and they work well; I don't see any fundamental problems
Per the comments above, we need two files (one mapping ID or serial number to site, plot, A/B/C; the other mapping field names) for each instrument type (AquaTROLL, Exo, etc)
A flag in the final L1 data indicating data source seems like a good idea
The timestamp issue remains unresolved
@stephpenn1 @roylrich I have put a meeting on our calendars to discuss this. I had trouble finding anything sooner than two weeks from today; seems like Steph is busy/out quite a bit next week
OK, thanks!

bpbond mentioned this issue Oct 17, 2024

Testing: reading raw instrument data #216

Draft

bpbond closed this as completed Oct 27, 2024

bpbond reopened this Oct 27, 2024

bpbond added this to the v1-2 milestone Oct 27, 2024

bpbond added the enhancement label Oct 27, 2024

bpbond mentioned this issue Oct 27, 2024

Need for standardized sensor data not on loggers #157

Closed
