
Singer Postgres --> Postgres replication demo #2

Closed
wants to merge 6 commits into from

Conversation

sherifnada
Contributor

A demo of replicating data from two Postgres taps (one remote, one local Docker-powered) to a local Dockerized Postgres database.
Some learnings:

Setting up a tap requires:

  1. Creating a python virtual environment for the tap (accomplished via install_connector.sh). This is not strictly required but is strongly recommended.
  2. Supplying a json file which configures the tap. Configuration typically includes credentials, but can also include flags that control behavior.
  3. Configuring which columns to replicate, a two-step process:
    1. Run “Discovery” to generate the tap catalog (a superset of the schema that describes all the “streams” you can pull from the tap). Store the catalog as a JSON file. Not all taps necessarily support discovery, and I’m not certain how a tap is supposed to behave when it doesn’t.
    2. Mutate the catalog file to indicate which streams to subscribe to, as well as their replication mode, i.e. whether to replicate a table incrementally or always replicate the full table. Incremental replication requires keeping state, which seems to be kept in a local file and supplied via the --state flag.
    3. I’ve found that the mutation of the catalog file is finicky for a few reasons:
      1. One must manually edit a long and tangled JSON file.
      2. “Legacy style” taps (the Postgres tap is considered legacy) assume a slightly different catalog file schema than newer taps.
  4. Invoking the tap within its Python virtual environment.
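
The catalog edits in step 3 can be scripted instead of done by hand. A minimal sketch, assuming the common Singer metadata layout (a stream-level metadata entry with an empty breadcrumb carrying `selected` and `replication-method`); the stream id and the tiny in-memory catalog are hypothetical:

```python
import json

def select_stream(catalog: dict, stream_id: str, replication_method: str = "FULL_TABLE") -> dict:
    """Mark one stream as selected in a Singer catalog and set its replication mode."""
    for stream in catalog["streams"]:
        if stream["tap_stream_id"] != stream_id:
            continue
        for entry in stream.get("metadata", []):
            # The stream-level metadata entry is the one with an empty breadcrumb.
            if entry["breadcrumb"] == []:
                entry["metadata"]["selected"] = True
                entry["metadata"]["replication-method"] = replication_method
    return catalog

# Example with a toy catalog (shape follows the Singer convention described above).
catalog = {
    "streams": [
        {
            "tap_stream_id": "public-users",
            "metadata": [{"breadcrumb": [], "metadata": {}}],
        }
    ]
}
select_stream(catalog, "public-users")
print(json.dumps(catalog, indent=2))
```

In practice you would load the discovery output with `json.load`, run a helper like this, and write the result back before invoking the tap with `--catalog`.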

Setting up a target requires:

  1. Creating a virtual env.
  2. Supplying a JSON configuration file.
  3. Optionally supplying a state file.
  4. Piping the tap's stdout into an invocation of the target.
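
The tap-to-target pipe in step 4 can be sketched in Python rather than as a shell one-liner. The executable paths and flags in the commented-out call are illustrative, not taken from this repo; the live demonstration uses echo/cat stand-ins so the plumbing itself can run anywhere:

```python
import subprocess

def run_sync(tap_cmd: list[str], target_cmd: list[str]) -> str:
    """Pipe a tap's stdout into a target, mimicking `tap ... | target ...` in a shell.

    Returns the target's stdout, which is where Singer targets emit their final STATE.
    """
    tap = subprocess.Popen(tap_cmd, stdout=subprocess.PIPE)
    target = subprocess.run(target_cmd, stdin=tap.stdout, capture_output=True, text=True)
    tap.stdout.close()
    tap.wait()
    return target.stdout

# A real invocation would look something like this (paths and config names are hypothetical):
# run_sync(
#     ["tap-env/bin/tap-postgres", "--config", "tap_config.json", "--catalog", "catalog.json"],
#     ["target-env/bin/target-postgres", "--config", "target_config.json"],
# )

# Demonstrate the plumbing with stand-in commands:
out = run_sync(["echo", '{"type": "STATE", "value": {}}'], ["cat"])
print(out)
```

Each command should point at the executable inside its own virtual environment, which is what keeps the tap's and target's dependencies from colliding.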

This example demonstrates piping of data from two different sources: a publicly available remote postgres database, and a local database stood up with docker.

Generate your own catalog files and edit them to do full-table replication, selecting one of the tables to replicate in each database. You might need to tell the target database to persist empty tables in case you try to replicate a table that doesn’t exist yet.

Once you’re done, you can poke around the database; creds are in target_config.json.

To try this out, first run ./first_time_setup.sh, then ./run.sh.

@sherifnada
Contributor Author

sherifnada commented Jul 28, 2020

For the checked-in catalogs, see #3.

@sherifnada sherifnada changed the title Postgres --> Postgres replication demo Singer Postgres --> Postgres replication demo Jul 28, 2020
Contributor

@cgardens cgardens left a comment


nicely done! we're 50% done with this product now 😉

psql2psql/run.sh: outdated review comment (resolved)
psql2psql/run.sh: outdated review comment (resolved)
@cgardens cgardens marked this pull request as draft August 25, 2020 22:14
@sherifnada sherifnada closed this Aug 26, 2020
sherifnada added a commit that referenced this pull request Oct 23, 2020
# This is the 1st commit message:

publish build scans in CI (#691)

# This is the commit message #2:

Add basic tests

# This is the commit message #3:

pleasantries
jrhizor pushed a commit that referenced this pull request Dec 30, 2020
jrhizor pushed a commit that referenced this pull request Dec 31, 2020
* Issue #1353: implement backoff for integration tests

* update code for backoff HTTP error while read Google Sheet

* create Client class for Google Sheets with backoff all methods

* update Google Sheets Source after review #2

* update docker version for google_sheets_source
davinchia added a commit that referenced this pull request May 3, 2021
yaroslav-dudar added a commit that referenced this pull request May 28, 2021
@avirajsingh7 avirajsingh7 mentioned this pull request Aug 3, 2023
4 tasks
marcosmarxm pushed a commit that referenced this pull request Aug 4, 2023
overrided methods that uses the "groups" in SQL
maxi297 added a commit that referenced this pull request Aug 9, 2023
girarda added a commit that referenced this pull request Aug 14, 2023
* [ISSUE #28893] infer csv schema

* [ISSUE #28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE #28893] legacy inference and infer only when needed

* [ISSUE #28893] fix scenario tests

* [ISSUE #28893] using discovered schema as part of read

* [ISSUE #28893] self-review + cleanup

* [ISSUE #28893] fix test

* [ISSUE #28893] code review part #1

* [ISSUE #28893] code review part #2

* Fix test

* formatcdk

* first pass

* [ISSUE #28893] code review

* fix mypy issues

* comment

* rename for clarity

* Add a scenario test case

* this isn't optional anymore

* FIX test log level

* Re-adding failing tests

* [ISSUE #28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* [ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES

* Code review

* Automated Commit - Formatting Changes

* fix unit tests

---------

Co-authored-by: maxi297 <[email protected]>
Co-authored-by: maxi297 <[email protected]>
brianjlai added a commit that referenced this pull request Aug 14, 2023
octavia-approvington pushed a commit that referenced this pull request Aug 15, 2023
rodireich pushed a commit that referenced this pull request Dec 7, 2023
* Update Dockerfile

* Update metadata.yaml

* Update mysql.md
maxi297 pushed a commit that referenced this pull request Jun 11, 2024
create new bigquery source in python
maxi297 added a commit that referenced this pull request Oct 4, 2024