
Evaluate readiness to union data relay data and historical clearinghouse data #423

Open
ian-r-rose opened this issue Oct 11, 2024 · 12 comments
@ian-r-rose (Member)

In #270 we are going to union the data relay data and the historical clearinghouse data. It would be good to have a model (either ad hoc or ongoing) to evaluate our readiness to do so. This would go a long way towards making sure the union is successful when we do it.

A few thoughts on things to check (some of which may overlap in implementation; a rough pandas sketch of a couple of these checks follows the list):

  1. Are there any date gaps?
  2. Are there any time gaps?
  3. Are there any sections of duplicate or overlapping records?
  4. Are there any gaps by Caltrans district (e.g., is District 4 missing for a particular time span)?
  5. Does the number of records match what we expect?
  6. Does the number of file uploads roughly match what we expect? (This one may need different permissions than the TRANSFORMER_DEV role has, so it may not be the highest priority right now.)
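
As a rough sketch of what checks 1 and 4 could look like (all table and column names here are assumptions, not the actual warehouse schema), something along these lines in pandas would flag missing dates and missing district/date combinations:

import pandas as pd

# Sketch only: `relay` stands in for the data relay table, with assumed
# columns `district`, `station_id`, and `sample_timestamp`.
relay = pd.DataFrame(
    {
        "district": [4, 4, 7],
        "station_id": [400001, 400001, 700001],
        "sample_timestamp": pd.to_datetime(
            ["2024-10-01 00:00:03", "2024-10-03 00:00:03", "2024-10-01 00:00:03"]
        ),
    }
)
relay["sample_date"] = relay["sample_timestamp"].dt.date

# Check 1: are any calendar dates missing entirely?
expected_dates = pd.date_range(
    relay["sample_date"].min(), relay["sample_date"].max(), freq="D"
).date
missing_dates = sorted(set(expected_dates) - set(relay["sample_date"]))

# Check 4: which (district, date) combinations have no rows at all?
obs = relay.groupby(["district", "sample_date"]).size().rename("n_obs")
full_index = pd.MultiIndex.from_product(
    [sorted(relay["district"].unique()), expected_dates],
    names=["district", "sample_date"],
)
district_gaps = obs.reindex(full_index, fill_value=0).loc[lambda s: s == 0]

print(missing_dates)   # [datetime.date(2024, 10, 2)] in this toy example
print(district_gaps)   # district/date pairs with zero observations

The same shape of check could run as a scheduled model against Snowflake; the pandas version is just to make the logic concrete.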

@summer-mothwood let's plan to chat through some of these in detail

@jkarpen commented Oct 25, 2024

We'll consider this issue done when Summer has a writeup on findings and appropriate follow-ups (potentially a QC model and specific fixes).

Issues will then be created for the follow-on tasks.

@summer-mothwood (Contributor) commented Oct 30, 2024

After comparing the data relay server data to the clearinghouse data, my conclusion is that we are not ready to union these two datasets. There appears to be a significant amount of data missing from the data relay server, which we will want to resolve before cutting off the clearinghouse pipeline.

My recommendations:
1. Investigate and resolve data pipeline issues causing missing data in the data relay server: new ticket here #453
2. Set up QC checks on the data ingestion steps to alert the team of potential missing data in real time: new ticket here #454
3. Re-open an issue to evaluate readiness to union the datasets after 1) and 2) are complete

Investigation details:

Snowflake Notebook of this analysis: link.

Using a representative sample of station IDs (roughly 300 station IDs per district), I counted the number of observations per day in the clearinghouse dataset (unique by station ID and timestamp) and compared that to the same data in the data relay server. In the graph below, the dark blue line represents the number of observations per day in the clearinghouse dataset, while the light blue line is the number of observations in the data relay server:

[Graph: daily observation counts, clearinghouse (dark blue) vs. data relay server (light blue)]

As you can see, the totals have never matched exactly between the two datasets, and the data relay server consistently has fewer observations than the clearinghouse -- with particularly large gaps starting at the end of September and continuing to today.
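
For context, a minimal sketch of the kind of per-day comparison described above, assuming the clearinghouse and data relay data for the sampled stations have been pulled into dataframes (the column names are guesses, not the real Snowflake schema):

import pandas as pd

def daily_counts(df: pd.DataFrame, label: str) -> pd.Series:
    # One row per unique (station_id, sample_timestamp), counted per calendar day.
    deduped = df.drop_duplicates(subset=["station_id", "sample_timestamp"])
    return deduped.groupby(deduped["sample_timestamp"].dt.date).size().rename(label)

# Usage, given `clearinghouse` and `data_relay` frames for the sampled stations:
#   comparison = pd.concat(
#       [daily_counts(clearinghouse, "clearinghouse"),
#        daily_counts(data_relay, "data_relay")],
#       axis=1,
#   ).fillna(0)
#   comparison["missing_in_relay"] = comparison["clearinghouse"] - comparison["data_relay"]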

Sometimes the missing data is the result of the data relay server missing some (but not all) timestamp values in a given day. For example, this station ID has data in both the clearinghouse and data_relay on October 15th up until 20:00:03, but all timestamps after that for 10/15 are missing from the data_relay server:
[Table: clearinghouse vs. data_relay records for one station on 10/15, with data_relay stopping at 20:00:03]
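
A check along these lines (same assumed column names as the sketch above) could flag that kind of within-day truncation by comparing the last timestamp each source has per station per day:

import pandas as pd

def last_timestamp_per_day(df: pd.DataFrame, label: str) -> pd.Series:
    # Latest observation per station per calendar day.
    day = df["sample_timestamp"].dt.date.rename("sample_date")
    return df.groupby(["station_id", day])["sample_timestamp"].max().rename(label)

# Usage, given `clearinghouse` and `data_relay` frames:
#   last_seen = pd.concat(
#       [last_timestamp_per_day(clearinghouse, "clearinghouse_last"),
#        last_timestamp_per_day(data_relay, "data_relay_last")],
#       axis=1,
#   )
#   truncated_days = last_seen[last_seen["data_relay_last"] < last_seen["clearinghouse_last"]]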

And sometimes the missing data is the result of entire days' worth of data not being uploaded to S3.

While there is a data discrepancy in all districts that should be investigated, District 4 and District 7 have a significant amount of missing data (this graph shows the number of observations in each district from Oct 1st - Oct 26th):

[Graph: observations by district, Oct 1st - Oct 26th]

We noticed that District 7 has not had any data land in S3 since October 7th:

[Graph: District 7 daily observations, with no data after October 7th]

And District 4 has several missing days of data in October:
[Graph: District 4 daily observations, with several missing days in October]

@pingpingxiu-DOT-ca-gov @kengodleskidot @ian-r-rose

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024

@summer-mothwood @ian-r-rose @ZhenyuZhu-Caltrans @kengodleskidot @jkarpen

Updates

Since Oct 7, District 4's json -> parquet conversion has been failing because it runs out of memory:

[Screenshot: District 4 conversion out-of-memory error]

The same thing is happening for District 7:
[Screenshot: District 7 conversion out-of-memory error]

So some of the intermediate json files are exceeding the internal Linux machine's memory capacity.

The mitigation is straightforward:

  1. Near-term: break those json files down into smaller pieces and upload them.
  2. Longer-term: we need a new internal machine, as the existing one is running into memory issues.
  3. We need monitoring on machine memory, as well as on the downstream record counts in Snowflake.

But this does not fully explain the missing station IDs before Oct 7, so more investigation is needed.

@ian-r-rose (Member, Author)

I recommend not using JSON as an intermediate representation; it has very poor memory characteristics. Notice that, judging by the file sizes, parquet is about 100x smaller than JSON.

Also, it looks like the size of the JSON files increased by about 50% on average. Do you know why that might have happened?

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024 via email

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024 via email

@pingpingxiu-DOT-ca-gov (Contributor) commented Nov 4, 2024

A solution for the data loss has been found.

Currently, the following line of code fails to allocate enough memory for larger json files (>2.5 GB):

df = pd.read_json(file_path, lines=True, dtype_backend="pyarrow")

Splitting the big json into smaller parts and reading them separately solves the issue:

dfs = []
# "file_path_portions" is a placeholder for the smaller files produced by
# splitting the original large json file
for file_path_portion in file_path_portions:
    dfs.append(pd.read_json(file_path_portion, lines=True, dtype_backend="pyarrow"))

df = pd.concat(dfs, ignore_index=True)
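
For completeness, here is a hypothetical sketch (not code from the pipeline) of how a large newline-delimited json file could be split into those smaller portions; the chunk size and file naming are placeholders:

def split_ndjson(file_path: str, lines_per_chunk: int = 1_000_000) -> list[str]:
    # Split a newline-delimited json file into part files small enough to read
    # with pd.read_json without exhausting memory. Purely illustrative.
    part_paths, buffer, part_idx = [], [], 0

    def flush(buf: list[str], idx: int) -> str:
        part_path = f"{file_path}.part{idx}"
        with open(part_path, "w") as dst:
            dst.writelines(buf)
        return part_path

    with open(file_path) as src:
        for line in src:
            buffer.append(line)
            if len(buffer) >= lines_per_chunk:
                part_paths.append(flush(buffer, part_idx))
                buffer, part_idx = [], part_idx + 1
    if buffer:
        part_paths.append(flush(buffer, part_idx))
    return part_paths

# e.g. file_path_portions = split_ndjson("d04_big_file.json")  # placeholder name

Note that pandas can also stream a newline-delimited file directly: pd.read_json(..., lines=True, chunksize=N) returns an iterator of smaller frames, which would avoid the on-disk split entirely, assuming the concatenated result still fits in memory.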

I'll go ahead and apply the fix to the data relay and will verify the counts.

@ZhenyuZhu-Caltrans @ian-r-rose @summer-mothwood

@ian-r-rose (Member, Author)

I think a much better solution would be to write the files as parquet in the first place. All of the parquet files for D7 in S3 are 10-20 MB, and basically any machine should be able to handle those. I'm not sure what the constraints are that prevent using parquet as an intermediate storage format.

Do you understand why the files are so large? Even with the poor memory footprint of JSON files, I'm a little surprised that they are 2.5 GB.
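
For reference, a minimal sketch of what writing parquet directly would look like; the output path is a placeholder and the call assumes a parquet engine such as pyarrow is installed:

# Illustrative only: write the intermediate dataframe straight to parquet
# instead of json. The output path is a placeholder.
df.to_parquet("d07_text_station_raw.parquet", index=False)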

@pingpingxiu-DOT-ca-gov (Contributor)

"what the constraints are that prevent using parquet as an intermediate storage format."

  1. If we choose parquet, we need a shared network file drive, and none of our existing shared network drives are dedicated to PeMS.

  2. Also, Parquet is not compatible with Logstash, so we cannot easily evaluate or monitor data quality for parquet data.

@ian-r-rose (Member, Author)

1. If we choose parquet, we need a shared network file drive. None of our current existing shared network drives are dedicated to PeMS.

I know you've talked about this before, but I don't really understand it. Why would changing a file format require a new network drive? If you're writing JSON, couldn't you just write parquet instead?

2. Also, Parquet is not compatible with Logstash, so we cannot easily evaluate or monitor data quality for parquet data.

We can, however, monitor data quality within Snowflake. I don't think that JSON+Logstash is necessarily the best tool for monitoring billions of records hosted as JSON blobs on a single machine (we have a whole scalable data warehouse for that!). One idea we discussed a while ago was to write JSON logs in your pipeline that described what was being done, rather than writing the whole dataset as JSON. Can we revisit that?
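
To illustrate that idea (the field names below are hypothetical, not an agreed schema), each pipeline batch could emit a small JSON log record describing what was processed, rather than carrying the data itself:

import json
from datetime import datetime, timezone

# Hypothetical per-batch log record; every field name here is illustrative.
log_record = {
    "logged_at": datetime.now(timezone.utc).isoformat(),
    "district": 7,
    "source_file": "d07_text_station_raw_2024_10_07.txt",  # placeholder name
    "rows_read": 1_234_567,
    "rows_written": 1_234_567,
    "output_format": "parquet",
    "status": "success",
}
print(json.dumps(log_record))  # one line of JSON that Logstash could ingest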

@pingpingxiu-DOT-ca-gov (Contributor) commented Nov 4, 2024

I would only comment on 1:

Currently we write to Kafka as the intermediate store (in json).

Kafka is an isolated environment (dedicated to PeMS).

And Kafka cannot accept parquet.

@summer-mothwood (Contributor)

Since the work/conversation for #453 has been happening in this ticket, I closed 453, and we'll continue the discussion here. Thank you for letting us know about the inability to save parquet files in Kafka, @pingpingxiu-DOT-ca-gov. Is your solution of batching the json files moving along?

@jkarpen closed this as completed Nov 8, 2024
@jkarpen reopened this Nov 8, 2024