Replies: 4 comments
-
Hey @achantavy, I've recently started experimenting with cartography and was also thinking about possibilities to speed up the sync process, especially in larger, more distributed environments where numerous intel modules are involved. My initial design idea was to decouple cartography as the app layer and neo4j as the db layer, so that multiple sync jobs could run in parallel against the same database. In theory this should greatly reduce the time it takes to complete a full sync, but I'm not sure whether it would play nicely with how the graph is populated, or whether it could cause corruption or errors. Thanks for all the work that's been put into this tool, I love it. Best regards,
-
Hi @smokentar. Curious if you ever tested parallel runs and experienced any graph corruption?
-
Sorry for not replying until now, I was on family leave late last year. I'll write a longer document on cartography deployment considerations, but here are some quick thoughts on things that have worked for us at Lyft:

You will probably want a custom script or something similar to orchestrate the parallel runs. Each job could have a different subset of --aws-requested-syncs defined - for example, one job definition could have "ec2:instance" and the other could have "s3", so that "ec2:instance" and "s3" run in parallel.

The main problem you will encounter with parallel runs of cartography is cleanup jobs: if two jobs sync the same resource types but start at slightly different times, their update tags will differ, causing a race condition where one job cleans up the results of the other. This situation can be painful and difficult to trace.

The solution is to make sure that a given resource type is synced by only one job at a time. You can enforce this with a unit test or an assertion in your orchestrator that inspects the configured jobs and checks that each resource type appears in exactly one of them, so it is impossible for the same resource type to be synced by two concurrent jobs. The test then fails and blocks a PR merge whenever a configuration would allow parallel runs of the same resource type.
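To illustrate the non-overlap check described above, here is a minimal Python sketch. It assumes a hand-rolled orchestrator that invokes the cartography CLI with the --aws-requested-syncs flag mentioned in this thread; the job names, resource-type subsets, and the way jobs are launched are all hypothetical:

```python
# Sketch only: enforce that each AWS resource type is synced by exactly one
# parallel job, so that cleanup jobs never race each other.
import itertools
import subprocess

# Hypothetical job definitions; split resource types however suits your setup.
PARALLEL_JOBS = {
    "job-ec2": ["ec2:instance"],
    "job-s3": ["s3"],
}


def assert_no_overlap(jobs):
    """Use this in a unit test to block PRs that assign the same resource
    type to two different parallel jobs."""
    for (name_a, syncs_a), (name_b, syncs_b) in itertools.combinations(jobs.items(), 2):
        overlap = set(syncs_a) & set(syncs_b)
        assert not overlap, f"{name_a} and {name_b} both sync {sorted(overlap)}"


def run_job(requested_syncs):
    # Assumes the cartography CLI is on PATH; Neo4j connection flags omitted.
    subprocess.run(
        ["cartography", "--aws-requested-syncs", ",".join(requested_syncs)],
        check=True,
    )


if __name__ == "__main__":
    assert_no_overlap(PARALLEL_JOBS)
    # A real orchestrator would launch these concurrently (separate containers,
    # ECS tasks, etc.); they run sequentially here only to keep the sketch short.
    for syncs in PARALLEL_JOBS.values():
        run_job(syncs)
```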
-
Thanks Alex. These are exactly the sort of tips I was looking for. I'll give them a go and share my experience. Appreciate your help.
-
Description:
One shortcoming of Cartography is that to get data into the graph, we must do a full sync of all assets every single time. This is time-consuming and becomes a blocker for choosing Cartography in many time-sensitive workloads.
One approach is to run a full sync to bootstrap the graph, and then have a subscriber process watch for infra creation/update/deletion events and update the graph accordingly. In AWS this would involve watching CloudTrail; other vendors have equivalent products.
Because Cartography supports many different data providers, it will still be necessary to run full syncs in addition to subscriber processes. For example, if we want to sync AWS (which has an event stream) and Okta data (which does not), we would first do a full sync to bootstrap the whole graph, then turn off the AWS full sync, turn on an AWS subscriber, and keep the Okta sync the same. This way we get real-time updates for data providers that expose events we can subscribe to, and we keep the sync as-is for the other data types.
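To make the subscriber idea concrete, here is a rough sketch assuming an SQS queue fed by CloudTrail/EventBridge and the official neo4j Python driver. The queue URL, node label, and property names are illustrative, not necessarily cartography's actual schema or any committed design:

```python
# Rough sketch of an AWS event subscriber: poll a queue of CloudTrail events
# and upsert the affected nodes instead of re-running a full sync.
import json

import boto3
from neo4j import GraphDatabase

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cartography-events"  # hypothetical
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
sqs = boto3.client("sqs")


def handle_event(event):
    # Example: keep EC2 instance nodes current from RunInstances events.
    # A real consumer would also handle terminations, modifications, etc.
    if event.get("eventName") == "RunInstances":
        items = event.get("responseElements", {}).get("instancesSet", {}).get("items", [])
        with driver.session() as session:
            for item in items:
                session.run(
                    "MERGE (i:EC2Instance {id: $id}) SET i.lastupdated = timestamp()",
                    id=item["instanceId"],
                )


while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # Depending on the delivery path you may need to unwrap an
        # EventBridge/SNS envelope before reaching the CloudTrail record.
        handle_event(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```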
One risk of the subscriber model is write failures that leave the graph in an incorrect state. The first iterations of this feature will not be reliable, and we shouldn't hold them to a stringent expectation of correctness until we sort things out (although let's be real here: I think most users of Cartography already treat it as only a "best-effort" view without strict requirements, since we already skip ingesting certain assets when we encounter failure conditions).
I'm opening this feature request to start a discussion and would love to hear other approaches and opinions.