point in time proposal
Signed-off-by: Pawel Leszczynski <[email protected]>
pawel-big-lebowski committed Oct 17, 2022
1 parent cd2c111 commit 85ca8ef
Showing 1 changed file with 34 additions and 0 deletions.
34 changes: 34 additions & 0 deletions proposals/2117-marquez-over-time.md
@@ -0,0 +1,34 @@
# Proposal: Marquez API to retrieve historical data

## Problem

The Marquez data model in PostgreSQL allows extracting a snapshot of any lineage content from the past. This can be very
useful for tracking how jobs, datasets, etc. evolved over time. The feature can be implemented within each Marquez endpoint, such as
those retrieving datasets or jobs. The purpose of this doc is to agree on a consistent API implementation approach.

## Solution

A point in time can be described by a datetime, a dataset version or a job version. In the case of a dataset version or job version, the API should determine
the point in time associated with that version (like the `created_at` of the `dataset_versions` row) and prepare a snapshot based
on the entries older than that.

Examples:

* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=datetime:2011-12-03T10%3A15%3A30%2B01%3A00` where the datetime value is the URL-encoded form of the `ZonedDateTime` `2011-12-03T10:15:30+01:00`
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
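
The three parameter forms above could be parsed roughly as follows. This is only a sketch: the `SnapshotAt` class, its `Kind` enum and the `parse` method are hypothetical names introduced for illustration and are not existing Marquez code. It assumes the value still arrives URL-encoded, exactly as in the examples.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.time.ZonedDateTime;
import java.util.UUID;

// Hypothetical helper (not an existing Marquez class): parses the value of
// the proposed `snapshotAt` query parameter into a typed point-in-time marker.
final class SnapshotAt {
  enum Kind { DATETIME, DATASET_VERSION, JOB_VERSION }

  final Kind kind;
  final ZonedDateTime datetime; // non-null only for DATETIME
  final UUID versionId;         // non-null only for the two version kinds

  private SnapshotAt(Kind kind, ZonedDateTime datetime, UUID versionId) {
    this.kind = kind;
    this.datetime = datetime;
    this.versionId = versionId;
  }

  // Accepts the raw, still URL-encoded value as it appears in the query string.
  static SnapshotAt parse(String rawValue) {
    String decoded = URLDecoder.decode(rawValue, StandardCharsets.UTF_8);
    int sep = decoded.indexOf(':');
    if (sep < 0) {
      throw new IllegalArgumentException("missing type prefix: " + decoded);
    }
    String type = decoded.substring(0, sep);
    String value = decoded.substring(sep + 1);
    switch (type) {
      case "datetime":
        return new SnapshotAt(Kind.DATETIME, ZonedDateTime.parse(value), null);
      case "dataset_version":
        return new SnapshotAt(Kind.DATASET_VERSION, null, UUID.fromString(value));
      case "job_version":
        return new SnapshotAt(Kind.JOB_VERSION, null, UUID.fromString(value));
      default:
        throw new IllegalArgumentException("unknown snapshotAt type: " + type);
    }
  }
}
```

If the web framework has already decoded the query parameter, the `URLDecoder.decode` step should be skipped, because decoding a second time would turn the `+` of the zone offset into a space.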

Note that the requested version id does not have to belong to the requested dataset. A user debugging an issue
with `dataset_x` at version `ver_x` can request a `dataset_y` snapshot at `ver_x`. The request should still be valid.
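
A minimal in-memory sketch of that resolution, under stated assumptions: `SnapshotLookup`, `VersionRow`, `pointInTime` and `snapshotOf` are hypothetical names standing in for a query against `dataset_versions`, and treating the boundary as inclusive (entries created exactly at the point in time are kept) is an assumption this proposal does not settle.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.UUID;

// Hypothetical in-memory stand-in for lookups against `dataset_versions`.
final class SnapshotLookup {
  static final class VersionRow {
    final UUID uuid;
    final String dataset;
    final Instant createdAt;

    VersionRow(UUID uuid, String dataset, Instant createdAt) {
      this.uuid = uuid;
      this.dataset = dataset;
      this.createdAt = createdAt;
    }
  }

  // Step 1: turn a version id (of any dataset) into a point in time.
  static Instant pointInTime(List<VersionRow> rows, UUID versionId) {
    return rows.stream()
        .filter(r -> r.uuid.equals(versionId))
        .map(r -> r.createdAt)
        .findFirst()
        .orElseThrow(() -> new NoSuchElementException("unknown version: " + versionId));
  }

  // Step 2: latest version of the requested dataset created at or before that point.
  // The inclusive boundary is an assumption, not something the proposal specifies.
  static Optional<VersionRow> snapshotOf(List<VersionRow> rows, String dataset, Instant at) {
    return rows.stream()
        .filter(r -> r.dataset.equals(dataset))
        .filter(r -> !r.createdAt.isAfter(at))
        .max(Comparator.comparing((VersionRow r) -> r.createdAt));
  }
}
```

Because the version id is used only to obtain a timestamp, it does not matter which dataset it belongs to; `snapshotOf` returns an empty `Optional` when the requested dataset did not exist yet at that point.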

### Questions

* *Should a response contain a point in time marker?* This would be helpful but can be difficult to incorporate
into existing endpoints, especially if the root node of the JSON response is an array.

* *How to pass the point in time parameter in the URL?* The parameter is a URL-encoded string representation of a `ZonedDateTime`.

* *Can we rely on `createdAt` columns of foreign tables?* I think we should be able to do so, but it's worth checking
if this is the case now. In other words, the `createdAt` column for a dataset field should be the same as `createdAt` in the dataset
and run tables. It cannot be evaluated as `Instant.now()`, as this can lead to errors: a dataset snapshot
can miss a dataset field because the field is considered to be younger.
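
The `createdAt` concern in the last question can be illustrated with a small sketch. `CreatedAtSketch`, `Field` and `visibleFields` are hypothetical names, and the filter assumes a snapshot keeps rows created at or before the requested point in time. A field stamped a few milliseconds after its dataset drops out of a snapshot taken at the dataset's own `createdAt`.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of why related rows should share one timestamp.
final class CreatedAtSketch {
  static final class Field {
    final String name;
    final Instant createdAt;

    Field(String name, Instant createdAt) {
      this.name = name;
      this.createdAt = createdAt;
    }
  }

  // Names of the fields visible in a snapshot at `at`:
  // those created at or before that instant (assumed inclusive boundary).
  static List<String> visibleFields(List<Field> fields, Instant at) {
    return fields.stream()
        .filter(f -> !f.createdAt.isAfter(at))
        .map(f -> f.name)
        .collect(Collectors.toList());
  }
}
```

One way to avoid the mismatch is to capture a single timestamp per write and reuse it for the dataset row and all of its field rows, rather than calling `Instant.now()` separately for each insert.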
