Signed-off-by: Pawel Leszczynski <[email protected]>
1 parent
cd2c111
commit 85ca8ef
Showing 1 changed file with 34 additions and 0 deletions.
@@ -0,0 +1,34 @@
# Proposal: Marquez API to retrieve historical data

## Problem

The Marquez data model in PostgreSQL allows extracting a snapshot of any lineage content from the past. This may be extremely
useful for tracking how jobs, datasets, etc. evolved over time. The feature can be implemented within each Marquez endpoint, such as
those retrieving datasets or jobs. The purpose of this doc is to agree on a consistent API implementation approach.

## Solution

A point in time can be described by a datetime, a dataset version, or a job version. In the case of a dataset version or a job version, the API should find
the point in time correlated with that version (like the `created_at` of the `dataset_versions` row) and prepare a snapshot based
on the entries older than that.

Examples:

* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=datetime:2011-12-03T10%3A15%3A30%2B01%3A00`, where the value is the URL-encoded form of the `ZonedDateTime` `2011-12-03T10:15:30+01:00`
* `/api/v1/namespaces/some-namespace/datasets/some-dataset?snapshotAt=dataset_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`
* `/api/v1/namespaces/some-namespace/jobs/some-job?snapshotAt=job_version:5ca3b37e-4e18-11ed-bdc3-0242ac120002`

Please note that the requested version id does not have to be an id of the requested dataset. A user debugging an issue
with `dataset_x` at version `ver_x` can request a `dataset_y` snapshot at `ver_x`. The request should still be valid.
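
As a rough illustration of the `snapshotAt` formats listed above, the following Java sketch parses a `<type>:<value>` marker and resolves it to a single instant. The class `SnapshotAt`, its methods, and the two lookup functions are hypothetical names introduced only for this example; they are not existing Marquez code, and the lookups merely stand in for queries that would read `created_at` from the version tables. It assumes Java 10 or newer for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;
import java.util.function.Function;

/** Hypothetical holder for a parsed snapshotAt value; not an existing Marquez class. */
final class SnapshotAt {
  enum Kind { DATETIME, DATASET_VERSION, JOB_VERSION }

  final Kind kind;
  final ZonedDateTime dateTime; // non-null only for DATETIME
  final UUID version;           // non-null only for DATASET_VERSION and JOB_VERSION

  private SnapshotAt(Kind kind, ZonedDateTime dateTime, UUID version) {
    this.kind = kind;
    this.dateTime = dateTime;
    this.version = version;
  }

  /** Parses an already URL-decoded value, e.g. "datetime:2011-12-03T10:15:30+01:00". */
  static SnapshotAt parse(String raw) {
    int sep = raw.indexOf(':'); // split on the first colon only; datetimes contain more
    if (sep < 0) {
      throw new IllegalArgumentException("Expected <type>:<value>, got: " + raw);
    }
    String type = raw.substring(0, sep);
    String value = raw.substring(sep + 1);
    switch (type) {
      case "datetime":
        return new SnapshotAt(Kind.DATETIME, ZonedDateTime.parse(value), null);
      case "dataset_version":
        return new SnapshotAt(Kind.DATASET_VERSION, null, UUID.fromString(value));
      case "job_version":
        return new SnapshotAt(Kind.JOB_VERSION, null, UUID.fromString(value));
      default:
        throw new IllegalArgumentException("Unknown snapshotAt type: " + type);
    }
  }

  /** Resolves the marker to an instant; the lookups stand in for created_at queries. */
  Instant resolve(Function<UUID, Instant> datasetVersionCreatedAt,
                  Function<UUID, Instant> jobVersionCreatedAt) {
    switch (kind) {
      case DATETIME:
        return dateTime.toInstant();
      case DATASET_VERSION:
        return datasetVersionCreatedAt.apply(version);
      default: // JOB_VERSION
        return jobVersionCreatedAt.apply(version);
    }
  }

  /** Builds the query value for a datetime snapshot; only the datetime part is encoded. */
  static String encodeDateTime(ZonedDateTime dt) {
    String iso = dt.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    return "datetime:" + URLEncoder.encode(iso, StandardCharsets.UTF_8);
  }
}
```

Because `resolve` depends only on the marker and not on the entity being requested, the same `dataset_version` id could be used for a snapshot of any dataset or job, which matches the note above about requesting `dataset_y` at `ver_x`.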

### Questions

* *Shall a response contain a point in time marker?* This would be helpful but can be difficult to incorporate
into existing endpoints, especially if the root node of the JSON response is an array.

* *How to pass the point in time parameter in the URL?* The parameter is a URL-encoded string representation of `ZonedDateTime`.

* *Can we rely on `createdAt` columns of foreign tables?* I think we should be able to do so, but it's worth checking
if this is the case now. In other words, the `createdAt` column for a dataset field should have the same value as `createdAt` in the dataset
and run tables. It cannot be evaluated as `Instant.now()`, as this can lead to errors: a dataset returned at a snapshot
can miss a dataset field because the field is considered to be younger.
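
The `createdAt` concern in the last question can be illustrated with a small in-memory Java sketch. `Row` and `SnapshotFilter` are hypothetical stand-ins for table rows and a snapshot query, not Marquez classes, and the inclusive comparison (rows created at or before the point in time) is an assumption made for the example.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a table row that carries a createdAt column. */
final class Row {
  final String name;
  final Instant createdAt;

  Row(String name, Instant createdAt) {
    this.name = name;
    this.createdAt = createdAt;
  }
}

/** Hypothetical snapshot filter; a real implementation would be a SQL predicate. */
final class SnapshotFilter {
  /** Returns names of rows created at or before the point in time (assumed inclusive). */
  static List<String> namesAt(List<Row> rows, Instant pointInTime) {
    List<String> names = new ArrayList<>();
    for (Row row : rows) {
      if (!row.createdAt.isAfter(pointInTime)) {
        names.add(row.name);
      }
    }
    return names;
  }
}

class SnapshotFilterExample {
  public static void main(String[] args) {
    Instant eventTime = Instant.parse("2022-10-17T10:00:00Z");

    // Dataset and field share one timestamp: the snapshot sees both rows.
    List<Row> shared = List.of(new Row("dataset", eventTime), new Row("field_a", eventTime));
    System.out.println(SnapshotFilter.namesAt(shared, eventTime));

    // Field stamped a few milliseconds later, as a separate Instant.now() call could do:
    // the snapshot taken at the dataset's createdAt no longer contains the field.
    List<Row> skewed =
        List.of(new Row("dataset", eventTime), new Row("field_a", eventTime.plusMillis(5)));
    System.out.println(SnapshotFilter.namesAt(skewed, eventTime));
  }
}
```

In the second case the field is dropped only because its timestamp was taken slightly later than the dataset's, which is the error the question describes.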