Why Estuary? A general consideration #1109

AndryHTC · 2023-07-14T15:14:39Z

AndryHTC
Jul 14, 2023

Hello,

I would like to open up a discussion to understand more deeply the advantages of using Estuary, especially in contrast to other open-source solutions such as Debezium and Conduit.io.

While the comparison chart between Estuary and Debezium provides a good starting point, I'm interested in digging deeper into a few points. In particular, I'm keen to understand more about:

On-Premises Deployment: Debezium has an existing on-premises option which seems beneficial for businesses with strict data handling and privacy requirements. I see that Estuary plans to introduce on-premises deployment by Winter 2023. Could you provide any updates or further insights about the progress of this feature?
Schema Migrations: Estuary's automated schema evolution seems promising. How does this work in practice, especially in complex database environments with frequent changes?
Transforms: The availability of Streaming SQL and JavaScript transforms in Estuary is exciting. Could you share some examples or use cases of how these transforms can benefit end-users?

Moreover, I'm also interested in understanding how Estuary stacks up against other emerging solutions, such as Conduit.io. If possible, could you elaborate on the distinguishing features or benefits of Estuary when compared to Conduit.io?

Finally, a broader question: Why should teams evaluate or migrate to Estuary? From a developer or business perspective, what are the key selling points that make Estuary stand out among its competitors?

Thank you for taking the time to address these questions. I believe your responses will be invaluable to the community in helping us make informed decisions about our data pipeline solutions.

Best,
Andrea

Answered by psFried

Jul 17, 2023

Moreover, I'm also interested in understanding how Estuary stacks up against other emerging solutions, such as Conduit.io. If possible, could you elaborate on the distinguishing features or benefits of Estuary when compared to Conduit.io?

Conduit and most others like it are designed to run an overall pipeline where data flows from one or more sources into one or more destinations. While Flow certainly can move data from a source to a destination, it's design is not so focused on solely that. The key difference is Flow Collections. Collections each represent a realtime data lake backed by cloud storage. All data in Flow is written to and read from collections. This design has a number of…

View full answer

psFried · 2023-07-17T14:31:13Z

psFried
Jul 17, 2023
Maintainer

On-premises deployment:

The two main components of Flow's design are the control-plane and data-plane. The data-plane runs all the captures, materializations, and derivations, and manages all the reads and writes with cloud storage. The control-plane manages things like access controls and coordinates changes to the tasks that are running in the data-plane. Importantly, the control-plane never stores or even sees any of the data from running tasks. And the steady-state operations of the data-plane are completely independent of the control-plane, so all your tasks continue to run even if our control-plane explodes.

Currently, our production environment has only one data-plane (in GCP us-central1). Our plan is to allow many separate data-planes to join up with our global control-plane. We'll use this to expand our managed data-planes into different cloud providers and regions, and also to allow users to run their own data-planes just for their own tenant. The design for this will allow for data-planes to operate even without allowing in-bound network access (it only makes out-bound connections to the control-plane), so that it can run in a VPC.

The work for this is progressing quietly. At this point, the main holdup is just that we want to focus first on stability and reliability of the platform. We'd like to ensure that it's all running very smoothly before we end up with 100 separate data-planes 😁

0 replies

psFried · 2023-07-17T14:31:27Z

psFried
Jul 17, 2023
Maintainer

Schema migrations:

I'm glad you asked, because we're presently hard at work making this awesome. We have a few more big features in the works, like continuous schema inference and most importantly writing documentation. So I promise to soon give a far better answer to this question in our docs, but I can explain the gist of it here.

Every Flow collection has an associated JSON schema, which we leverage extensively. In addition to validating your data, we also use schemas to, for example, determine what columns we can create for a database materialization. When you create a Capture from a relational database, we generate a JSON schema based on each table's columns in the discover operation. These JSON schemas are updated in a draft, which is a set of proposed changes to the tasks that are running in a data-plane. When you publish a draft, Flow first performs tons of validations to ensure that your proposed changes won't break any tasks "downstream". For example, when you "discover" (called "refresh" in the UI) 'some database' from the diagram below,
you'd end up with a draft that may update the JSON schema of the collection.

graph LR
    A[some database] --> B
    B{Capture} --> C
    C(Collection) --> D
    D{Materialization} --> E
    E[another database]

When we validate the changes, we run a special Validate operation on each materialization that's sourcing from the affected collection. The materialization connector will then say whether the set of fields derived from the proposed schema is acceptable. For example, if you're materializing to a Postgres database then the connector might reject a schema change that would result in changing an existing column from TEXT to INTEGER. The Flow control-plane is able to then suggest (and automatically apply) an update to the materialization to have it materialize the affected collection into a new table. We currently do this by adding an incrementing version number to your table name (_v2, _v3, etc). Doing that allows the new table to take whatever shape is implied by the new collection schema and materialization configuration.

Note that we've also discussed being able to drop existing tables instead of just creating new tables with new names. That's totally doable, but we're uncertain about how important it would be to our users, so I'd love to know if you have opinions here.

There's lots more we could talk about on this topic, but that hopefully gives you a general sense of how we approach the problem. I'll try to update here as we get new docs published on this. In the meantime, you can see our (quite lengthy and unfiltered) discussions on this feature in #1042.

0 replies

psFried · 2023-07-17T15:00:36Z

psFried
Jul 17, 2023
Maintainer

Transforms:

Transforms are useful in lots of different situations. In general, we seek to work with tools like DBT rather than needing to replace them. In other words, you're never required to do your transformations in Flow. But we encourage people to incrementally migrate their transforms into Flow as they see the benefit for each use case. Some benefits include:

Sometimes streaming transforms can be far more efficient. Consider a query like select sum(price) from orders group by customer_id;, which gets more costly the more orders there are. A streaming transform is able to be much more efficient by maintaining a state with the running total for each customer, and only emitting a document whenever it changes.
Derivations are data products, and can be used to share the data with others. A key benefit is that this allows consumers of the data to materialize it into whatever system they like best, and have it kept up to date automatically.
Derivations support testing as a first-class citizen. See an example here. This also helps with the schema evolution piece, because your tests will automatically be run before publishing any changes that affect your derivation. Thus they can prevent publishing a change that would break your transformations.

We've recently been working on improving the UX for creating derivations, and we have lots more plans for that. The CLI + YAML workflow is currently the most fully supported, but we now have nascent support for creating derivations in the UI, which we'll continue to improve.

0 replies

psFried · 2023-07-17T16:07:25Z

psFried
Jul 17, 2023
Maintainer

Moreover, I'm also interested in understanding how Estuary stacks up against other emerging solutions, such as Conduit.io. If possible, could you elaborate on the distinguishing features or benefits of Estuary when compared to Conduit.io?

Conduit and most others like it are designed to run an overall pipeline where data flows from one or more sources into one or more destinations. While Flow certainly can move data from a source to a destination, it's design is not so focused on solely that. The key difference is Flow Collections. Collections each represent a realtime data lake backed by cloud storage. All data in Flow is written to and read from collections. This design has a number of benefits:

The most obvious benefit of that is that you can add new materializations of a collection whenever you want, and they'll just backfill from the beginning. For example, say you start out materializing to Bigquery, and then you realize you might be happier with Postgres. NBD, just create a new materialization into Postgres. This is also quite useful for materializing to purpose-built destinations like Elasticsearch alongside materializations to a more general-purpose database.
It is far more resilient in the face of failures and schema changes. For example, if a materialization fails, all the other tasks in your pipeline will continue to run normally, and when the materialization starts again it will simply backfill whatever data it missed while it was gone. All tasks in Flow pipelines are decoupled in this way.
The collections themselves can be used as data products!
You can bring your own cloud storage, even with the SAAS service. Flow can use whatever cloud storage bucket you want for your collection data, and our data-lake format uses newline-delimited JSON, so the data truly is yours.
Flow has excellent support for horizontal scaling. It can handle extremely high throughput by splitting tasks into multiple "shards" that each handle a proportion of the data.
Flow natively understands how to reduce multiple documents having the same key. This allows rich expression of your data model, but it also allows for an excellent response to backpressure. For example, if a materialization transaction takes more time than usual, Flow will roll-up more updates into the next transaction. For data sets with frequent updates, this can dramatically reduce the amount of data that needs to be sent over the network.
CDC is actually quite tricky to implement, and we're confident that our implementation will stack up quite well against Debezium or anything else that's out there. We adapted the learnings from Netflix's DBLog to create capture connectors that can perform incremental backfills concurrently with reading the write-ahead log. We've synced multi-terabyte tables with high-frequency updates, without requiring any down time or quiet periods.

Finally, a broader question: Why should teams evaluate or migrate to Estuary? From a developer or business perspective, what are the key selling points that make Estuary stand out among its competitors?

The first thing I'll mention here is cost. Flow is a streaming system that processes data incrementally. So if you're comparing to batch systems, then we tend to come out way ahead price. Flow was designed for efficiency, so we think we can probably lower our prices even more as we grow.

Apart from the things I mentioned above, the main thing I'll point to is that we really designed Flow to be useful in the long term. While Flow certainly can be used for a simple point-to-point data integration (most users start out that way), it can also be useful as your main data platform, or anything in between. You can use it to share data either within or outside of your organization, and give people self-service access to collections to materialize. Or you can just use it to push all your data somewhere else. You can manage your pipelines using YAML and Gitops, or you can use the UI, and it's easy to go back and forth or do both. This type of flexibility is important when you consider how your data, team, and business will change over time in ways that can be difficult to predict. So we try to make tools that are useful right now for the task at hand, while also allowing for things to change over time.

Thanks for the questions @AndryHTC. The answers got a bit long, so I broke them up into separate responses so that we can use separate threads if you have any follow up questions. Happy to dig deeper into particular topics if you like.

0 replies

AndryHTC · 2023-07-17T17:25:01Z

AndryHTC
Jul 17, 2023
Author

Great work,@psFried! I'm really impressed with what you've accomplished. Your efforts deserve a big round of applause. While I'm eagerly anticipating the on-prem version, I must say that you've done an exceptional job with the explanation. Your attention to detail and commitment to excellence are evident in every aspect. Keep up the fantastic work!

1 reply

psFried Jul 17, 2023
Maintainer

Thank you so much for you kind words!

MichalisDBA · 2024-06-18T08:16:56Z

MichalisDBA
Jun 18, 2024

Hello. Any news about the on-prem version or how we can self-host Estuary Flow?

3 replies

dyaffe Jun 18, 2024
Maintainer

Hey @MichalisDBA -- we're getting closer to that. Our efforts have been towards making a privately deployed SaaS version of Estuary Flow and that will be available in just a few weeks.

Is there any chance that would meet your needs? It will be deployed to any cloud you'd like and something that could connect over privatelink (or other cloud equivalents).

If not, we can let you know afterwards when we make the tech we use to deploy each data plane externally available.

MichalisDBA Jun 18, 2024

So you are saying that in a few week we can deploy Estuary on private cloud VPS like Hetzner for example? Will the deployed SaaS be air-gapped?

We really want to evaluate Estuary to see if it fits our needs but we have strong Privacy and ISO rules and we can not expose our data to third parties.

dyaffe Jun 18, 2024
Maintainer

Yes, we actually will be able to deploy on Hetzner. We're using Pulumi and they happen to support that out of the box.

It wouldn't be air gapped because the service would have to send just metadata to our control plane. That said, no actual data would be sent to the control plane -- just aggregate metrics, etc.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why Estuary? A general consideration #1109

{{title}}

Replies: 6 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Why Estuary? A general consideration #1109

AndryHTC Jul 14, 2023

Replies: 6 comments · 4 replies

psFried Jul 17, 2023 Maintainer

psFried Jul 17, 2023 Maintainer

psFried Jul 17, 2023 Maintainer

psFried Jul 17, 2023 Maintainer

AndryHTC Jul 17, 2023 Author

psFried Jul 17, 2023 Maintainer

MichalisDBA Jun 18, 2024

dyaffe Jun 18, 2024 Maintainer

MichalisDBA Jun 18, 2024

dyaffe Jun 18, 2024 Maintainer

AndryHTC
Jul 14, 2023

Replies: 6 comments 4 replies

psFried
Jul 17, 2023
Maintainer

psFried
Jul 17, 2023
Maintainer

psFried
Jul 17, 2023
Maintainer

psFried
Jul 17, 2023
Maintainer

AndryHTC
Jul 17, 2023
Author

psFried Jul 17, 2023
Maintainer

MichalisDBA
Jun 18, 2024

dyaffe Jun 18, 2024
Maintainer

dyaffe Jun 18, 2024
Maintainer