Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(#107): Adds multi-db watcher support #113

Merged
merged 25 commits into from
Jun 18, 2024
Merged

Conversation

dianabarsan
Copy link
Member

@dianabarsan dianabarsan commented Jun 10, 2024

Adds watcher for continuous doc import. Watcher has a timeout of 5 seconds between tries.
Adds multi database support.
Ignores postgres deadlock errors, and adds retry.

Updates e2e tests to:

  • not use two additional containers to push data to CouchDb, that used python to push data
  • not have data in a zipped file, where it's literally invisible to the developer. instead using scalability csv docs, and using cht-conf to generate documents that get uploaded before the test runs.
  • updates test to wait until DBT processing is complete and checks number of results.

#107

@dianabarsan dianabarsan requested a review from witash June 12, 2024 08:10
Copy link
Contributor

@witash witash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Because I was messing around with dbt models at the same time, I had trouble running the e2e tests; the counts in the dbt tables didn't match the number of documents generated by csv-to-docs, and so waitForDBT waited forever and timed out. This means that the dbt models were not working and it's valid for tests to fail, but it's unexpected to have changes in one repository break tests in another. I don't have any alternative suggestions, just noting it, there is already a long (and not very productive) discussion of this here.

I also used this branch to add more detailed tests for base models; having dbt dockerized and using csv-to-docs to add test data is much easier to work with than what is in cht-pipeline currently, where I'm getting version and test data issues. But then, do those more detailed tests belong in this repository? maybe instead should do something similar in the cht-pipeline repository...but also don't want to have copy/pasted code that diverges

const opts = { auth: { username: env.COUCHDB_USER, password: env.COUCHDB_PASSWORD, skip_setup: false } };
const db = new PouchDb(dbUrl(dbName), opts);

const dbDocs = docs.map(doc => ({ ...doc, _id: `${dbName}-${doc._id}` }));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

${dbName}-${doc._id} instead of ${doc._id} adds some weirdness with references, is it really necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :)
Because otherwise both databases would have the same docs. And I want them to have "different" docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure exactly what you mean...like the two couchdb databases, medic and medic_sentinel for example, would have the same docs? if so, its a problem for references, like in the json docs if a contact "aaa" has a parent "bbb", it ends up in the db with id "medic-bbb" but without replacing the reference, "aaa" ends up in postgres with parent id "bbb", which now doesn't exist...

or the "source" (the json docs) and "target" (test couchdb) database have the same docs? in which case why do they need to be different?

Copy link
Member Author

@dianabarsan dianabarsan Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummmm lol. welcome to how our PG works.
All CHT databases are copied into a single PG table, this has been the case forever.
The unique field for documents is _id, which should not have conflicts across these CHT databases (ideally - this is sort of naive but this is how people want it), mainly because they are uuids with low conflict rates, and we purposefully NEVER create a doc with the same uuid in two databases, even if two docs are "connected", their _id fields are either sufixed or prefixed with something.

In the case of this test, I wanted to check that both databases have their docs copied to the main pg table. This means that all docs would need to have different _id fields, otherwise they would get overwritten when we dump the _rev field, so I chose to prefix them.

I don't exactly understand the source of confusion here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so its just for this test where the same document is in both databases, which is not a situation we would expect in the real world.

the problem is when expanding these tests beyond just counting the number of documents; in postgres the expectation is that if you have a document in couchdb with an _id you expect to be able to find a row the corresponding tables in postgres with that same id. Having to add a prefix is maybe ok, but adds complexity. Then if a document in couchdb refers to another document (e.g. contact with a parent), that needs to translate into a foreign key relationship. And there it breaks completely if the documents have their ids changed but references not changed.

But it is only an issue where expanding these tests, so I guess beyond the scope of this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yea, you're right, when we introduce relations between these docs and want to test the DBT model, then the _ids are relevant.
I agree that this would be an issue in a dbt model oriented test, but I structured these as sync copy sanity check tests.
We can easily revisit them when or if we want to change the tests to cover models too instead of just the copying.

@dianabarsan
Copy link
Member Author

I agree with the whole weirdness around e2e tests in this repo depending on another repository. I really struggled not to make this change in this iteration, because I wanted to keep this relatively contained to the watcher + multi db support.

I think the simplest solution right now is to have a sort of test-branch for cht-pipeline instead of passing main to e2e tests here.
But the best would be to somehow source the DBT models to this repo.

@dianabarsan dianabarsan merged commit 279d8f2 into main Jun 18, 2024
5 checks passed
@dianabarsan dianabarsan deleted the 107-couch2pg-watch branch June 18, 2024 11:44
medic-ci pushed a commit that referenced this pull request Sep 10, 2024
# 1.0.0 (2024-09-10)

### Bug Fixes

* Change env variables according to cht pipeline updates ([#71](#71)) ([c89aadf](c89aadf))
* Fix numbering ([#50](#50)) ([5c93300](5c93300))

### Features

* **#107:** Adds multi-db watcher support ([#113](#113)) ([279d8f2](279d8f2)), closes [#107](#107) [#107](#107)
* **#112:** drop support for multiple copies of every document ([#115](#115)) ([b46f288](b46f288)), closes [#112](#112) [#118](#118)
* **#129:** add back automatic pipeline updates ([#130](#130)) ([fc73fd7](fc73fd7)), closes [#129](#129) [#129](#129) [#129](#129)
* **#1:** first release ([ff0fedd](ff0fedd)), closes [#1](#1)
* **#25:** custom databases ([#33](#33)) ([cd10db0](cd10db0)), closes [#25](#25)
* **#78:** full refresh on changed objects, only incremental runs continously ([0869ee9](0869ee9)), closes [#78](https://github.com/medic/cht-sync/issues/78)
* add versioning and releases ([a528aba](a528aba))
* bind sequence token path to host for persistence ([#88](#88)) ([e1c3953](e1c3953))
* remove superset container and update Readme ([#64](#64)) ([8acbc93](8acbc93))
* update logstash base image version and update default configuration files ([#61](#61)) ([674582d](674582d))
* update postgres version to 16 ([8bf1e84](8bf1e84))
@medic-ci
Copy link

🎉 This PR is included in version 1.0.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants