Hello, I would like to open up a discussion to understand more deeply the advantages of using Estuary, especially in contrast to other open-source solutions such as Debezium and Conduit.io. While the comparison chart between Estuary and Debezium provides a good starting point, I'm interested in digging deeper into a few points. In particular, I'm keen to understand more about:

- On-premises deployment
- Schema migrations
- Transforms
Moreover, I'm also interested in understanding how Estuary stacks up against other emerging solutions, such as Conduit.io. If possible, could you elaborate on the distinguishing features or benefits of Estuary when compared to Conduit.io?

Finally, a broader question: Why should teams evaluate or migrate to Estuary? From a developer or business perspective, what are the key selling points that make Estuary stand out among its competitors?

Thank you for taking the time to address these questions. I believe your responses will be invaluable to the community in helping us make informed decisions about our data pipeline solutions.

Best,

---

On-premises deployment: The two main components of Flow's design are the control-plane and the data-plane. The data-plane runs all the captures, materializations, and derivations, and manages all the reads and writes with cloud storage. The control-plane manages things like access controls and coordinates changes to the tasks that are running in the data-plane. Importantly, the control-plane never stores or even sees any of the data from running tasks. And the steady-state operations of the data-plane are completely independent of the control-plane, so all your tasks continue to run even if our control-plane explodes.

Currently, our production environment has only one data-plane (in GCP), but this separation is what will let us support additional data-planes, including ones running in your own environment. The work for this is progressing quietly. At this point, the main holdup is just that we want to focus first on stability and reliability of the platform. We'd like to ensure that it's all running very smoothly before we end up with 100 separate data-planes 😁
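One concrete consequence of this split is that collection data lives in cloud storage that the data-plane reads and writes directly, while the control-plane only coordinates. As a rough, hedged sketch of what that looks like in a catalog spec (the tenant prefix, bucket, and exact field names here are invented for illustration, so treat the docs as authoritative), a storage mapping routes a prefix of collections to a bucket:

```yaml
# Hypothetical storage mapping: route every collection under the
# acmeCo/ prefix to a GCS bucket that the data-plane reads and writes.
# Field names are approximate; see the storage-mapping docs for the
# current, authoritative shape.
storageMappings:
  acmeCo/:
    stores:
      - provider: GCS
        bucket: acmeco-flow-data
        prefix: collections/
```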

---

Schema migrations: I'm glad you asked, because we're presently hard at work making this awesome. We have a few more big features in the works, like continuous schema inference and, most importantly, writing documentation. So I promise to soon give a far better answer to this question in our docs, but I can explain the gist of it here.

Every Flow collection has an associated JSON schema, which we leverage extensively. In addition to validating your data, we also use schemas to, for example, determine what columns we can create for a database materialization. When you create a capture from a relational database, we generate a JSON schema based on each table's columns in the discover operation. These JSON schemas are updated in a draft, which is a set of proposed changes to the tasks that are running in a data-plane. When you publish a draft, Flow first performs tons of validations to ensure that your proposed changes won't break any tasks downstream. For example, say you "discover" (called "refresh" in the UI) 'some database' from the diagram below:

```mermaid
graph LR
  A[some database] --> B
  B{Capture} --> C
  C(Collection) --> D
  D{Materialization} --> E
  E[another database]
```
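Before getting to validation, here's a rough sketch of the kind of collection spec a discover might generate for one table in 'some database'. The table, columns, and key are invented for the example, and real discovered schemas carry more detail than this:

```yaml
# Hypothetical discovered collection for an "orders" table.
# Column names and types are invented; actual discovers emit richer
# schemas than this minimal sketch.
collections:
  acmeCo/orders:
    schema:
      type: object
      properties:
        id: { type: integer }
        customer_id: { type: integer }
        total: { type: number }
        updated_at: { type: string, format: date-time }
      required: [id]
    key: [/id]
```

If you later add or change a column and re-run discover, this schema is updated in a draft, and publishing that draft is what triggers the validations described next.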
When we validate the changes, we run a special validation of each affected task, checking, for example, that the downstream materialization can still handle the collection's updated schema. When a change isn't compatible with an existing destination table, the approach today is to materialize into a new table with a new name. Note that we've also discussed being able to drop existing tables instead of just creating new tables with new names. That's totally doable, but we're uncertain about how important it would be to our users, so I'd love to know if you have opinions here.

There's lots more we could talk about on this topic, but that hopefully gives you a general sense of how we approach the problem. I'll try to update here as we get new docs published on this. In the meantime, you can see our (quite lengthy and unfiltered) discussions on this feature in #1042.

---

Transforms: Transforms are useful in lots of different situations. In general, we seek to work with tools like DBT rather than needing to replace them. In other words, you're never required to do your transformations in Flow. But we encourage people to incrementally migrate their transforms into Flow as they see the benefit for each use case. Some benefits include:

- Derivations run continuously and incrementally, so derived results stay up to date without re-running batch jobs.
- The output of a derivation is just another collection, so it can be materialized to any number of destinations or shared like any other collection.
We've recently been working on improving the UX for creating derivations, and we have lots more plans for that. The CLI + YAML workflow is currently the most fully supported, but we now have nascent support for creating derivations in the UI, which we'll continue to improve.
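To give a feel for what a derivation looks like in the CLI + YAML workflow, here's a rough, hedged sketch. The collection names and the TypeScript module are invented for illustration, and the exact spec shape may differ from what's in the current docs:

```yaml
# Hypothetical derivation: a collection derived from acmeCo/orders
# that keeps only large orders. Names and the module are invented;
# consult the derivations docs for the authoritative spec.
collections:
  acmeCo/large-orders:
    schema: large-orders.schema.yaml
    key: [/id]
    derive:
      using:
        typescript:
          module: large-orders.ts
      transforms:
        - name: fromOrders
          source: acmeCo/orders
          shuffle: any
```

The referenced TypeScript module exports a function that maps each source document to zero or more derived documents, and the derivation runs continuously as new documents arrive in the source collection.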

---

Conduit and most others like it are designed to run an overall pipeline where data flows from one or more sources into one or more destinations. While Flow certainly can move data from a source to a destination, its design is not focused solely on that. The key difference is Flow collections. Each collection represents a real-time data lake backed by cloud storage. All data in Flow is written to and read from collections. This design has a number of benefits:

- The same collection can feed any number of materializations and derivations, so you aren't locked into a single source-to-destination pairing.
- Because collections are durably stored in cloud storage, you can add a new destination later and backfill it from the collection without re-reading the source system.
- Collections are what let you share data within or outside your organization and give people self-service access to materialize it.
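As a rough, hedged sketch of how that looks in a catalog spec (connector images, resource shapes, names, and config files here are invented or approximate; the docs have the real spec), a capture writes into a collection, and any number of materializations can read from that same collection:

```yaml
# Hypothetical spec: one capture feeding a collection, and a
# materialization reading that collection. Names, images, and the
# connector-specific resource fields are illustrative only.
captures:
  acmeCo/postgres-source:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:dev
        config: source-config.yaml
    bindings:
      - resource: { stream: orders }   # connector-specific; shape approximate
        target: acmeCo/orders

materializations:
  acmeCo/to-warehouse:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-snowflake:dev
        config: warehouse-config.yaml
    bindings:
      - source: acmeCo/orders
        resource: { table: orders }    # connector-specific; shape approximate
```

Adding a second destination later is just another materialization with a binding sourced from acmeCo/orders; the collection itself doesn't change.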
The first thing I'll mention here is cost. Flow is a streaming system that processes data incrementally, so if you're comparing it to batch systems, we tend to come out way ahead on price. Flow was designed for efficiency, so we think we can probably lower our prices even more as we grow.

Apart from the things I mentioned above, the main thing I'll point to is that we really designed Flow to be useful in the long term. While Flow certainly can be used for a simple point-to-point data integration (most users start out that way), it can also be useful as your main data platform, or anything in between. You can use it to share data either within or outside of your organization, and give people self-service access to collections to materialize. Or you can just use it to push all your data somewhere else. You can manage your pipelines using YAML and GitOps, or you can use the UI, and it's easy to go back and forth or do both. This type of flexibility is important when you consider how your data, team, and business will change over time in ways that can be difficult to predict. So we try to make tools that are useful right now for the task at hand, while also allowing for things to change over time.

Thanks for the questions @AndryHTC. The answers got a bit long, so I broke them up into separate responses so that we can use separate threads if you have any follow-up questions. Happy to dig deeper into particular topics if you like.

---

Great work, @psFried! I'm really impressed with what you've accomplished. Your efforts deserve a big round of applause. While I'm eagerly anticipating the on-prem version, I must say that you've done an exceptional job with the explanation. Your attention to detail and commitment to excellence are evident in every aspect. Keep up the fantastic work!

---

Hello. Any news about the on-prem version or how we can self-host Estuary Flow?