-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(#107): Adds multi-db watcher support #113
Conversation
# Conflicts: # couch2pg/src/db.js # couch2pg/src/index.js
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Because I was messing around with dbt models at the same time, I had trouble running the e2e tests; the counts in the dbt tables didn't match the number of documents generated by csv-to-docs, and so waitForDBT waited forever and timed out. This means that the dbt models were not working and it's valid for tests to fail, but it's unexpected to have changes in one repository break tests in another. I don't have any alternative suggestions, just noting it, there is already a long (and not very productive) discussion of this here.
I also used this branch to add more detailed tests for base models; having dbt dockerized and using csv-to-docs to add test data is much easier to work with than what is in cht-pipeline currently, where I'm getting version and test data issues. But then, do those more detailed tests belong in this repository? maybe instead should do something similar in the cht-pipeline repository...but also don't want to have copy/pasted code that diverges
const opts = { auth: { username: env.COUCHDB_USER, password: env.COUCHDB_PASSWORD, skip_setup: false } }; | ||
const db = new PouchDb(dbUrl(dbName), opts); | ||
|
||
const dbDocs = docs.map(doc => ({ ...doc, _id: `${dbName}-${doc._id}` })); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
${dbName}-${doc._id}
instead of ${doc._id}
adds some weirdness with references, is it really necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :)
Because otherwise both databases would have the same docs. And I want them to have "different" docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure exactly what you mean...like the two couchdb databases, medic and medic_sentinel for example, would have the same docs? if so, its a problem for references, like in the json docs if a contact "aaa" has a parent "bbb", it ends up in the db with id "medic-bbb" but without replacing the reference, "aaa" ends up in postgres with parent id "bbb", which now doesn't exist...
or the "source" (the json docs) and "target" (test couchdb) database have the same docs? in which case why do they need to be different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ummmm lol. welcome to how our PG works.
All CHT databases are copied into a single PG table, this has been the case forever.
The unique field for documents is _id, which should not have conflicts across these CHT databases (ideally - this is sort of naive but this is how people want it), mainly because they are uuids with low conflict rates, and we purposefully NEVER create a doc with the same uuid in two databases, even if two docs are "connected", their _id fields are either sufixed or prefixed with something.
In the case of this test, I wanted to check that both databases have their docs copied to the main pg table. This means that all docs would need to have different _id fields, otherwise they would get overwritten when we dump the _rev field, so I chose to prefix them.
I don't exactly understand the source of confusion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so its just for this test where the same document is in both databases, which is not a situation we would expect in the real world.
the problem is when expanding these tests beyond just counting the number of documents; in postgres the expectation is that if you have a document in couchdb with an _id you expect to be able to find a row the corresponding tables in postgres with that same id. Having to add a prefix is maybe ok, but adds complexity. Then if a document in couchdb refers to another document (e.g. contact with a parent), that needs to translate into a foreign key relationship. And there it breaks completely if the documents have their ids changed but references not changed.
But it is only an issue where expanding these tests, so I guess beyond the scope of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yea, you're right, when we introduce relations between these docs and want to test the DBT model, then the _ids are relevant.
I agree that this would be an issue in a dbt model oriented test, but I structured these as sync copy sanity check tests.
We can easily revisit them when or if we want to change the tests to cover models too instead of just the copying.
I agree with the whole weirdness around e2e tests in this repo depending on another repository. I really struggled not to make this change in this iteration, because I wanted to keep this relatively contained to the watcher + multi db support. I think the simplest solution right now is to have a sort of test-branch for cht-pipeline instead of passing main to e2e tests here. |
# 1.0.0 (2024-09-10) ### Bug Fixes * Change env variables according to cht pipeline updates ([#71](#71)) ([c89aadf](c89aadf)) * Fix numbering ([#50](#50)) ([5c93300](5c93300)) ### Features * **#107:** Adds multi-db watcher support ([#113](#113)) ([279d8f2](279d8f2)), closes [#107](#107) [#107](#107) * **#112:** drop support for multiple copies of every document ([#115](#115)) ([b46f288](b46f288)), closes [#112](#112) [#118](#118) * **#129:** add back automatic pipeline updates ([#130](#130)) ([fc73fd7](fc73fd7)), closes [#129](#129) [#129](#129) [#129](#129) * **#1:** first release ([ff0fedd](ff0fedd)), closes [#1](#1) * **#25:** custom databases ([#33](#33)) ([cd10db0](cd10db0)), closes [#25](#25) * **#78:** full refresh on changed objects, only incremental runs continously ([0869ee9](0869ee9)), closes [#78](https://github.com/medic/cht-sync/issues/78) * add versioning and releases ([a528aba](a528aba)) * bind sequence token path to host for persistence ([#88](#88)) ([e1c3953](e1c3953)) * remove superset container and update Readme ([#64](#64)) ([8acbc93](8acbc93)) * update logstash base image version and update default configuration files ([#61](#61)) ([674582d](674582d)) * update postgres version to 16 ([8bf1e84](8bf1e84))
🎉 This PR is included in version 1.0.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Adds watcher for continuous doc import. Watcher has a timeout of 5 seconds between tries.
Adds multi database support.
Ignores postgres deadlock errors, and adds retry.
Updates e2e tests to:
#107