Port MongoDB Source to Java #3428

Closed
sherifnada opened this issue May 15, 2021 · 3 comments · Fixed by #5530

Comments

@sherifnada
Contributor

sherifnada commented May 15, 2021

Tell us about the new integration you’d like to have

MongoDB is a critical source to support. Our current connector was contributed by a user. However, while the implementation is generally high quality, it is written in Ruby, and the Airbyte Core team's proficiencies are Java & Python. This means that we are much slower to implement features & bugfixes due to a lack of proficiency in Ruby. So we'd like to port the connector over to one of our core languages in order to offer better SLA & support.

Describe the alternative you are considering or using

Continue to use current Ruby-based connector

Implementation:

test container to use:
https://www.testcontainers.org/modules/databases/mongodb/

Todo:

  1. Investigate the possibility of using a JDBC driver for MongoDB (https://www.unityjdbc.com/mongojdbc/mongo_jdbc.php). That driver appears to be paid-only, which is not an option for us. Other options to check are https://docs.mongodb.com/datalake/tutorial/jdbc-driver/ and https://search.maven.org/search?q=a:mongodb-jdbc, but it is unclear how well that driver supports all MongoDB deployments; it seems to be specific to Atlas Data Lake.
    If using JDBC turns out to be possible:
  2. Generate new connector and implement connections.
  3. Create unit test
  4. Integration test
  5. Comprehensive:

use existing mongo source

┆Issue is synchronized with this Asana task by Unito

Notes
It seems like the JDBC driver provided by unityjdbc is paid, so we have the same situation here as we had for BigQuery. @DoNotPanicUA is currently working on refactoring the DB sources and their implementation to make core better suited to such cases, so there is no value in starting work on this ticket until #4024 and #1876 are completed. After that we would also need to support the basics for non-JDBC tests.
As this is a non-JDBC and even non-SQL DB, additional work in the core part would also be required.

@sherifnada sherifnada added area/connectors Connector related issues new-connector labels May 15, 2021
@sherifnada sherifnada changed the title Port MongoDB Source to Java or Python Port MongoDB Source to Java May 24, 2021
@sherifnada
Contributor Author

sherifnada commented Jun 11, 2021

Mongo is schemaless, which will be tricky to support. The most difficult problem we need to work out for the MVP release is how to support incremental sync. To support incremental sync, we need to discover the schema in each collection (table). I propose that we sample 10,000 records (or some other number) in each collection to discover the schema.

We should transform the schema as follows:

  1. If a column is a simple property (not object or array) the schema should preserve its type.
  2. If a column is an object or an array, the schema should say that it's an object or array, but make no further attempt to understand its schema.
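
A minimal sketch of the two rules above, assuming sampled documents are already available as plain `Map<String, Object>` values. The class name `SchemaSampler`, the type names it returns, and the fallback to `string` when a field has conflicting types across documents are all assumptions made for illustration; none of them come from this issue or from the existing connector.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: infers a flat field -> type mapping from sampled documents.
final class SchemaSampler {

    /** Maps one sampled value to a JSON-Schema-style type name. */
    static String typeOf(Object value) {
        if (value == null) return "null";
        if (value instanceof Map) return "object";   // rule 2: do not look inside
        if (value instanceof List) return "array";   // rule 2: do not look inside
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Number) return "number";
        return "string"; // strings, dates, ids, ... collapsed for brevity
    }

    /** Combines the type recorded so far with a newly seen type for the same field. */
    static String widen(String known, String seen) {
        if (known.equals(seen) || seen.equals("null")) return known;
        if (known.equals("null")) return seen;
        return "string"; // assumption: conflicting simple types fall back to string
    }

    /** Inspects at most sampleSize documents and returns field name -> type. */
    static Map<String, String> discover(Iterable<Map<String, Object>> docs, int sampleSize) {
        Map<String, String> schema = new LinkedHashMap<>();
        int seen = 0;
        for (Map<String, Object> doc : docs) {
            if (seen++ >= sampleSize) break;
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                schema.merge(field.getKey(), typeOf(field.getValue()), SchemaSampler::widen);
            }
        }
        return schema;
    }
}
```

Calling `SchemaSampler.discover(docs, 10_000)` would match the sample size proposed above. In a real connector the sample could be drawn from the collection itself, for example with MongoDB's `$sample` aggregation stage.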

This connector must support full refresh and incremental sync.

@etsybaev etsybaev assigned etsybaev and unassigned etsybaev Jun 14, 2021
@oustynova oustynova added size/XL and removed size/XL labels Jun 18, 2021
@nathan5280

Past MVP it seems like we could do some additional Normalization work.

  1. If object, scan some number of them to discover their schema and break them out into a child table.
  2. If list, scan some number of them to discover their schema and break them out into a child table. Preserve the index as a new column.

If we could do that recursively it would be awesome. This would take the schemaless mongo and get it into a reasonable normalized schema in a relational model. For reasonably well formed mongo documents this would save a ton of custom DBT transforms.
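
A rough sketch of the recursive break-out described above, again assuming documents arrive as plain `Map<String, Object>` values. The class name `DocumentNormalizer`, the child table naming (`parent_field`), and the bookkeeping columns `_row_id`, `_parent_id`, `_index`, and `value` are hypothetical choices for illustration only, not an existing Airbyte convention.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits one nested document into rows of parent and child tables.
final class DocumentNormalizer {

    /** Returns table name -> rows; nested objects and lists become child tables. */
    static Map<String, List<Map<String, Object>>> normalize(String table, Map<String, Object> doc) {
        Map<String, List<Map<String, Object>>> tables = new LinkedHashMap<>();
        addRow(table, String.valueOf(doc.get("_id")), null, null, doc, tables);
        return tables;
    }

    @SuppressWarnings("unchecked")
    private static void addRow(String table, String rowId, String parentId, Integer index,
                               Object value, Map<String, List<Map<String, Object>>> tables) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("_row_id", rowId);
        if (parentId != null) row.put("_parent_id", parentId); // link back to the parent row
        if (index != null) row.put("_index", index);           // preserve list position
        tables.computeIfAbsent(table, t -> new ArrayList<>()).add(row);

        if (!(value instanceof Map)) {
            row.put("value", value); // list element that is not an object
            return;
        }
        for (Map.Entry<String, Object> field : ((Map<String, Object>) value).entrySet()) {
            String name = field.getKey();
            Object v = field.getValue();
            String childTable = table + "_" + name;
            if (v instanceof Map) {
                addRow(childTable, rowId + "." + name, rowId, null, v, tables);
            } else if (v instanceof List) {
                List<?> items = (List<?>) v;
                for (int i = 0; i < items.size(); i++) {
                    addRow(childTable, rowId + "." + name + "[" + i + "]", rowId, i, items.get(i), tables);
                }
            } else {
                row.put(name, v); // simple property stays on the current row
            }
        }
    }
}
```

This sketch derives the tables from a single document at a time; it does not sample to discover a stable child schema first, and a list nested directly inside another list is stored as-is in the `value` column rather than broken out further.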

@irynakruk irynakruk self-assigned this Jul 5, 2021
@tuliren
Contributor

tuliren commented Jul 5, 2021

Many Mongo users use Mongoose to define the schema and communicate with Mongo. It would be cool if users could just drop in the Mongoose schema for each collection, and the source connector could convert it to JSON Schema.
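
To make the idea concrete, here is a very rough sketch of such a conversion. It assumes the Mongoose schema has already been reduced to a flat field name to SchemaType name map (for example `"createdAt" -> "Date"`); parsing an actual Mongoose schema definition, which is JavaScript, is out of scope here. The class `MongooseSchemaMapper` and the chosen type mapping are illustrative assumptions, not an agreed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps Mongoose SchemaType names to JSON Schema property definitions.
final class MongooseSchemaMapper {

    /** Builds the JSON Schema fragment for a single Mongoose SchemaType name. */
    static Map<String, Object> propertyFor(String mongooseType) {
        Map<String, Object> prop = new LinkedHashMap<>();
        switch (mongooseType) {
            case "String":
            case "ObjectId":
                prop.put("type", "string");
                break;
            case "Number":
                prop.put("type", "number");
                break;
            case "Boolean":
                prop.put("type", "boolean");
                break;
            case "Date":
                prop.put("type", "string");
                prop.put("format", "date-time");
                break;
            case "Array":
                prop.put("type", "array");
                break;
            default:
                prop.put("type", "object"); // assumption: Mixed, Map, and unknown types
        }
        return prop;
    }

    /** Converts a flat field -> SchemaType name map into a JSON Schema object. */
    static Map<String, Object> toJsonSchema(Map<String, String> fields) {
        Map<String, Object> properties = new LinkedHashMap<>();
        fields.forEach((name, type) -> properties.put(name, propertyFor(type)));
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("type", "object");
        schema.put("properties", properties);
        return schema;
    }
}
```

Nested Mongoose schemas, validators, and options such as `required` are ignored in this sketch.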


8 participants