Replies: 4 comments
-
Hey @achantavy, I've recently started experimenting with cartography and was also thinking about possibilities to speed up the sync process, especially in larger, more distributed environments where numerous intel modules are involved. My initial design idea was to decouple cartography as the app layer and neo4j as the db layer, so that multiple sync jobs could run in parallel against the same database. In theory this should greatly reduce the time it takes to complete a full sync, but I'm not sure whether it would play nicely with how the graph is populated, or whether it could cause corruption or errors. Thanks for all the work that's been put into this tool, I love it. Best regards,
-
Hi @smokentar. Curious if you ever tested parallel runs and experienced any graph corruption?
-
Sorry for not replying until now, I was on family leave late last year. I'll write a longer document on cartography deployment considerations, but here are some quick thoughts on things that have worked for us at Lyft:

You will probably want a custom script or something similar to orchestrate the parallel runs. Each job could have a different subset of --aws-requested-syncs defined - for example, one job definition could have "ec2:instance" and the other could have "s3", so that "ec2:instance" and "s3" run in parallel.

The main problem you will encounter with parallel runs of cartography is cleanup jobs: if two jobs sync the same resource types but start at slightly different times, their update tags will differ, causing a race condition where one job cleans up the results of the other. This situation can be painful and difficult to trace.

The solution is to make sure that a given resource type is synced by only one job at a time. You can enforce this with a unit test or an assertion in your orchestrator that inspects the configured jobs and checks that each resource type appears in exactly one of them, so it is impossible for the same resource type to be synced by two concurrent jobs. The test then fails and blocks a PR merge whenever a configuration would allow parallel runs of the same resource type.
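To illustrate the non-overlap check described above, here is a minimal Python sketch. It assumes a hand-rolled orchestrator that invokes the cartography CLI with the --aws-requested-syncs flag mentioned in this thread; the job names, resource-type subsets, and the way jobs are launched are all hypothetical:

```python
# Sketch only: enforce that each AWS resource type is synced by exactly one
# parallel job, so that cleanup jobs never race each other.
import itertools
import subprocess

# Hypothetical job definitions; split resource types however suits your setup.
PARALLEL_JOBS = {
    "job-ec2": ["ec2:instance"],
    "job-s3": ["s3"],
}


def assert_no_overlap(jobs):
    """Use this in a unit test to block PRs that assign the same resource
    type to two different parallel jobs."""
    for (name_a, syncs_a), (name_b, syncs_b) in itertools.combinations(jobs.items(), 2):
        overlap = set(syncs_a) & set(syncs_b)
        assert not overlap, f"{name_a} and {name_b} both sync {sorted(overlap)}"


def run_job(requested_syncs):
    # Assumes the cartography CLI is on PATH; Neo4j connection flags omitted.
    subprocess.run(
        ["cartography", "--aws-requested-syncs", ",".join(requested_syncs)],
        check=True,
    )


if __name__ == "__main__":
    assert_no_overlap(PARALLEL_JOBS)
    # A real orchestrator would launch these concurrently (separate containers,
    # ECS tasks, etc.); they run sequentially here only to keep the sketch short.
    for syncs in PARALLEL_JOBS.values():
        run_job(syncs)
```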
-
Thanks Alex. These are exactly the sort of tips I was looking for. I'll give them a go and share my experience. Appreciate your help.
-
Description:
One shortcoming of Cartography is that to get data into the graph, we must do a full sync of all assets every single time. This is time-consuming and becomes a blocker for choosing Cartography in many time-sensitive workloads.
One approach is to run a full sync to bootstrap the graph, and then have a subscriber process watch for infra creation/update/deletion events and update the graph accordingly. In AWS this would involve watching CloudTrail; other vendors have equivalent products.
Because Cartography supports many different data providers, it will still be necessary to run full syncs in addition to subscriber processes. For example, if we want to sync AWS (which has an event stream) and Okta data (which does not), we would first do a full sync to bootstrap the whole graph, then turn off the AWS full sync, turn on an AWS subscriber, and keep the Okta sync the same. This way we get real-time updates for data providers that expose events we can subscribe to, and we keep the sync as-is for the other data types.
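To make the subscriber idea concrete, here is a rough sketch assuming an SQS queue fed by CloudTrail/EventBridge and the official neo4j Python driver. The queue URL, node label, and property names are illustrative, not necessarily cartography's actual schema or any committed design:

```python
# Rough sketch of an AWS event subscriber: poll a queue of CloudTrail events
# and upsert the affected nodes instead of re-running a full sync.
import json

import boto3
from neo4j import GraphDatabase

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cartography-events"  # hypothetical
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
sqs = boto3.client("sqs")


def handle_event(event):
    # Example: keep EC2 instance nodes current from RunInstances events.
    # A real consumer would also handle terminations, modifications, etc.
    if event.get("eventName") == "RunInstances":
        items = event.get("responseElements", {}).get("instancesSet", {}).get("items", [])
        with driver.session() as session:
            for item in items:
                session.run(
                    "MERGE (i:EC2Instance {id: $id}) SET i.lastupdated = timestamp()",
                    id=item["instanceId"],
                )


while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # Depending on the delivery path you may need to unwrap an
        # EventBridge/SNS envelope before reaching the CloudTrail record.
        handle_event(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```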
One risk of the subscriber model is write failures that leave the graph in an incorrect state. The first iterations of this feature will not be reliable, and we shouldn't hold them to a stringent expectation of correctness until we sort things out (although let's be real here: I think most users of Cartography already treat it as only a "best-effort" view without strict requirements, since we already skip ingesting certain assets when we encounter failure conditions).
I'm opening this feature request to start a discussion and would love to hear other approaches and opinions.