
Evaluate readiness to union data relay data and historical clearinghouse data #423

Open
ian-r-rose opened this issue Oct 11, 2024 · 12 comments
@ian-r-rose (Member)

In #270 we are going to union the data relay data and the historical clearinghouse data. It would be good to have a model (either ad hoc or ongoing) to evaluate our readiness to do so. This would go a long way towards making sure the union is successful when we do it.

A few thoughts on things to check (some of which may overlap in implementation; a rough pandas sketch of a couple of these checks follows the list):

  1. Are there any date gaps?
  2. Are there any time gaps?
  3. Are there any sections of duplicate or overlapping records?
  4. Are there any gaps by Caltrans district (e.g., is District 4 missing for a particular time span)?
  5. Does the number of records match what we expect?
  6. Does the number of file uploads roughly match what we expect? (This one may need different permissions than the TRANSFORMER_DEV role has, so it may not be the highest priority right now.)
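
As a rough sketch of what checks 1 and 4 could look like (all table and column names here are assumptions, not the actual warehouse schema), something along these lines in pandas would flag missing dates and missing district/date combinations:

import pandas as pd

# Sketch only: `relay` stands in for the data relay table, with assumed
# columns `district`, `station_id`, and `sample_timestamp`.
relay = pd.DataFrame(
    {
        "district": [4, 4, 7],
        "station_id": [400001, 400001, 700001],
        "sample_timestamp": pd.to_datetime(
            ["2024-10-01 00:00:03", "2024-10-03 00:00:03", "2024-10-01 00:00:03"]
        ),
    }
)
relay["sample_date"] = relay["sample_timestamp"].dt.date

# Check 1: are any calendar dates missing entirely?
expected_dates = pd.date_range(
    relay["sample_date"].min(), relay["sample_date"].max(), freq="D"
).date
missing_dates = sorted(set(expected_dates) - set(relay["sample_date"]))

# Check 4: which (district, date) combinations have no rows at all?
obs = relay.groupby(["district", "sample_date"]).size().rename("n_obs")
full_index = pd.MultiIndex.from_product(
    [sorted(relay["district"].unique()), expected_dates],
    names=["district", "sample_date"],
)
district_gaps = obs.reindex(full_index, fill_value=0).loc[lambda s: s == 0]

print(missing_dates)   # [datetime.date(2024, 10, 2)] in this toy example
print(district_gaps)   # district/date pairs with zero observations

The same shape of check could run as a scheduled model against Snowflake; the pandas version is just to make the logic concrete.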

@summer-mothwood let's plan to chat through some of these in detail

@jkarpen commented Oct 25, 2024

We'll consider this issue done when Summer has a writeup on findings and appropriate follow-ups (potentially a QC model and specific fixes).

Issues will then be created for the follow-on tasks.

@summer-mothwood (Contributor) commented Oct 30, 2024

After comparing the data relay server data to the clearinghouse data, my conclusion is that we are not ready to union these two datasets. There appears to be a significant amount of data missing from the data relay server, which we will want to resolve before cutting off the clearinghouse pipeline.

My recommendations:
1. Investigate and resolve data pipeline issues causing missing data in the data relay server: new ticket here #453
2. Set up QC checks on the data ingestion steps to alert the team of potential missing data in real time: new ticket here #454
3. Re-open an issue to evaluate readiness to union the datasets after 1) and 2) are complete

Investigation details:

Snowflake Notebook of this analysis: link.

Using a representative sample of station IDs (roughly 300 station IDs per district), I counted the number of observations per day in the clearinghouse dataset (unique by station ID and timestamp) and compared that to the same data in the data relay server. In the graph below, the dark blue line represents the number of observations per day in the clearinghouse dataset, while the light blue line is the number of observations in the data relay server:

[Graph: daily observation counts, clearinghouse (dark blue) vs. data relay server (light blue)]

As you can see, the totals have never matched exactly between the two datasets, and the data relay server consistently has fewer observations than the clearinghouse -- with particularly large gaps starting at the end of September and continuing to today.
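
For context, a minimal sketch of the kind of per-day comparison described above, assuming the clearinghouse and data relay data for the sampled stations have been pulled into dataframes (the column names are guesses, not the real Snowflake schema):

import pandas as pd

def daily_counts(df: pd.DataFrame, label: str) -> pd.Series:
    # One row per unique (station_id, sample_timestamp), counted per calendar day.
    deduped = df.drop_duplicates(subset=["station_id", "sample_timestamp"])
    return deduped.groupby(deduped["sample_timestamp"].dt.date).size().rename(label)

# Usage, given `clearinghouse` and `data_relay` frames for the sampled stations:
#   comparison = pd.concat(
#       [daily_counts(clearinghouse, "clearinghouse"),
#        daily_counts(data_relay, "data_relay")],
#       axis=1,
#   ).fillna(0)
#   comparison["missing_in_relay"] = comparison["clearinghouse"] - comparison["data_relay"]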

Sometimes the missing data is the result of the data relay server missing some (but not all) timestamp values in a given day. For example, this station ID has data in both the clearinghouse and data_relay on October 15th up until 20:00:03, but all timestamps after that for 10/15 are missing from the data_relay server:
[Table: clearinghouse vs. data_relay records for one station on 10/15, with data_relay stopping at 20:00:03]
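
A check along these lines (same assumed column names as the sketch above) could flag that kind of within-day truncation by comparing the last timestamp each source has per station per day:

import pandas as pd

def last_timestamp_per_day(df: pd.DataFrame, label: str) -> pd.Series:
    # Latest observation per station per calendar day.
    day = df["sample_timestamp"].dt.date.rename("sample_date")
    return df.groupby(["station_id", day])["sample_timestamp"].max().rename(label)

# Usage, given `clearinghouse` and `data_relay` frames:
#   last_seen = pd.concat(
#       [last_timestamp_per_day(clearinghouse, "clearinghouse_last"),
#        last_timestamp_per_day(data_relay, "data_relay_last")],
#       axis=1,
#   )
#   truncated_days = last_seen[last_seen["data_relay_last"] < last_seen["clearinghouse_last"]]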

And sometimes the missing data is the result of entire days' worth of data not being uploaded to S3.

While there is a data discrepancy in all districts that should be investigated, District 4 and District 7 have a significant amount of missing data (this graph shows the number of observations in each district from Oct 1st - Oct 26th):

[Graph: observations by district, Oct 1st - Oct 26th]

We noticed that District 7 has not had any data land in S3 since October 7th:

[Graph: District 7 daily observations, with no data after October 7th]

And District 4 has several missing days of data in October:
[Graph: District 4 daily observations, with several missing days in October]

@pingpingxiu-DOT-ca-gov @kengodleskidot @ian-r-rose

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024

@summer-mothwood @ian-r-rose @ZhenyuZhu-Caltrans @kengodleskidot @jkarpen

Updates

Since Oct 7, District 4's json -> parquet conversion has been failing because it runs out of memory:

[Screenshot: District 4 conversion out-of-memory error]

The same thing is happening for District 7:
[Screenshot: District 7 conversion out-of-memory error]

So some of the intermediate json files are exceeding the internal Linux machine's memory capacity.

The mitigation is straightforward:

  1. Near-term: break those json files down into smaller pieces and upload them.
  2. Longer-term: we need a new internal machine, as the existing one is running into memory issues.
  3. We need monitoring on machine memory, as well as on the downstream record counts in Snowflake.

But this does not fully explain the missing station IDs before Oct 7, so more investigation is needed.

@ian-r-rose (Member, Author)

I recommend not using JSON as an intermediate representation; it has very poor memory characteristics. Notice that, judging by the file sizes, parquet is about 100x smaller than JSON.

Also, it looks like the size of the JSON files increased by about 50% on average. Do you know why that might have happened?

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024 via email

@pingpingxiu-DOT-ca-gov (Contributor) commented Oct 31, 2024 via email

@pingpingxiu-DOT-ca-gov (Contributor) commented Nov 4, 2024

A solution for the data loss has been found.

Currently, the following line of code fails to allocate enough memory for larger json files (>2.5 GB):

df = pd.read_json(file_path, lines=True, dtype_backend="pyarrow")

Splitting the big json into smaller parts and reading them separately solves the issue:

dfs = []
# "file_path_portions" is a placeholder for the smaller files produced by
# splitting the original large json file
for file_path_portion in file_path_portions:
    dfs.append(pd.read_json(file_path_portion, lines=True, dtype_backend="pyarrow"))

df = pd.concat(dfs, ignore_index=True)
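
For completeness, here is a hypothetical sketch (not code from the pipeline) of how a large newline-delimited json file could be split into those smaller portions; the chunk size and file naming are placeholders:

def split_ndjson(file_path: str, lines_per_chunk: int = 1_000_000) -> list[str]:
    # Split a newline-delimited json file into part files small enough to read
    # with pd.read_json without exhausting memory. Purely illustrative.
    part_paths, buffer, part_idx = [], [], 0

    def flush(buf: list[str], idx: int) -> str:
        part_path = f"{file_path}.part{idx}"
        with open(part_path, "w") as dst:
            dst.writelines(buf)
        return part_path

    with open(file_path) as src:
        for line in src:
            buffer.append(line)
            if len(buffer) >= lines_per_chunk:
                part_paths.append(flush(buffer, part_idx))
                buffer, part_idx = [], part_idx + 1
    if buffer:
        part_paths.append(flush(buffer, part_idx))
    return part_paths

# e.g. file_path_portions = split_ndjson("d04_big_file.json")  # placeholder name

Note that pandas can also stream a newline-delimited file directly: pd.read_json(..., lines=True, chunksize=N) returns an iterator of smaller frames, which would avoid the on-disk split entirely, assuming the concatenated result still fits in memory.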

I'll go ahead and apply the fix to the data relay and will verify the counts.

@ZhenyuZhu-Caltrans @ian-r-rose @summer-mothwood

@ian-r-rose (Member, Author)

I think a much better solution would be to write the files as parquet in the first place. All of the parquet files for D7 in S3 are 10-20 MB, and basically any machine should be able to handle those. I'm not sure what the constraints are that prevent using parquet as an intermediate storage format.

Do you understand why the files are so large? Even with the poor memory footprint of JSON files, I'm a little surprised that they are 2.5 GB.
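
For reference, a minimal sketch of what writing parquet directly would look like; the output path is a placeholder and the call assumes a parquet engine such as pyarrow is installed:

# Illustrative only: write the intermediate dataframe straight to parquet
# instead of json. The output path is a placeholder.
df.to_parquet("d07_text_station_raw.parquet", index=False)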

@pingpingxiu-DOT-ca-gov (Contributor)

"what the constraints are that prevent using parquet as an intermediate storage format."

  1. If we choose parquet, we need a shared network file drive, and none of our existing shared network drives are dedicated to PeMS.

  2. Also, Parquet is not compatible with Logstash, so we cannot easily evaluate or monitor data quality for parquet data.

@ian-r-rose (Member, Author)

1. If we choose parquet, we need a shared network file drive. None of our current existing shared network drives are dedicated to PeMS.

I know you've talked about this before, but I don't really understand it. Why would changing a file format require a new network drive? If you're writing JSON, couldn't you just write parquet instead?

2. Also, Parquet is not compatible with Logstash, so we cannot easily evaluate or monitor data quality for parquet data.

We can, however, monitor data quality within Snowflake. I don't think that JSON+Logstash is necessarily the best tool for monitoring billions of records hosted as JSON blobs on a single machine (we have a whole scalable data warehouse for that!). One idea we discussed a while ago was to write JSON logs in your pipeline that described what was being done, rather than writing the whole dataset as JSON. Can we revisit that?
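
To illustrate that idea (the field names below are hypothetical, not an agreed schema), each pipeline batch could emit a small JSON log record describing what was processed, rather than carrying the data itself:

import json
from datetime import datetime, timezone

# Hypothetical per-batch log record; every field name here is illustrative.
log_record = {
    "logged_at": datetime.now(timezone.utc).isoformat(),
    "district": 7,
    "source_file": "d07_text_station_raw_2024_10_07.txt",  # placeholder name
    "rows_read": 1_234_567,
    "rows_written": 1_234_567,
    "output_format": "parquet",
    "status": "success",
}
print(json.dumps(log_record))  # one line of JSON that Logstash could ingest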

@pingpingxiu-DOT-ca-gov (Contributor) commented Nov 4, 2024

I would only comment on 1:

Currently we write to Kafka as the intermediate store (in json).

Kafka is an isolated environment (dedicated to PeMS).

And Kafka cannot accept parquet.

@summer-mothwood (Contributor)

Since the work/conversation for #453 has been happening in this ticket, I closed 453, and we'll continue the discussion here. Thank you for letting us know about the inability to save parquet files in Kafka, @pingpingxiu-DOT-ca-gov. Is your solution of batching the json files moving along?

@jkarpen closed this as completed Nov 8, 2024
@jkarpen reopened this Nov 8, 2024