
pg_bulkload on PG13 fails to handle large data duplicates #148

Open
Tyler-DiPentima opened this issue Jul 28, 2023 · 8 comments
@Tyler-DiPentima

Hey,

I am trying to test out pg_bulkload on a sample dataset that has a unique constraint across 3 of its columns. When I test a dataset of ~200 lines, pg_bulkload loads the data fine, and when a re-load is attempted with the same data, all rows are recognized as duplicates and discarded. This was on PG9.6.20.

After upgrading to PG13.3, the same test with the same table and data works the same way. However, when the data size is increased to ~350-400 lines, issues begin to occur. The first load works fine, but instead of the second run being detected as duplicates, the following error is output:

277 Row(s) loaded so far to table mbaf_exclude. 2 Row(s) bad. 0 rows skipped, duplicates will now be removed
ERROR: query failed: ERROR: could not create unique index "index_name"
DETAIL: Key (account_number, areacode, phone)=(083439221 , 954, 2222222) is duplicated.
DETAIL: query was: SELECT * FROM pg_bulkload($1)

The increased data size works as expected on PG9.6.20, which makes me believe this has to do with postgres version compatibility. Any guidance here would be greatly appreciated!
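
For reference, a minimal sketch of a table matching the scenario above; only the column names and the three-column unique constraint come from the error message, and the column types are assumptions:

```sql
-- sketch of the target table; the real column types may differ
CREATE TABLE mbaf_exclude (
    account_number text,
    areacode       text,
    phone          text,
    UNIQUE (account_number, areacode, phone)
);
```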

@mikecaat
Contributor

Thanks. I could reproduce it.

  • 500 lines: ERROR
# current master
$ pg_bulkload -V
pg_bulkload 3.1.20

# generate data
$ for i in `seq 1 500`; do echo "$i,$((i+1)),$((i+2))"; done > /tmp/in.csv

# create table
$ psql -c "SELECT version();"
                                                version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 13.11 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
$ psql -c "CREATE EXTENSION pg_bulkload;"
CREATE EXTENSION
$ psql -c "CREATE TABLE test (id1 int, id2 int, id3 int, UNIQUE(id1, id2, id3));" 
CREATE TABLE

# pg_bulkload control file
$ cat /tmp/test.ctl
# output table
OUTPUT = test

# input data
WRITER = DIRECT
INPUT = /tmp/in.csv
TYPE = CSV
QUOTE = "\""
ESCAPE = \
DELIMITER = ","

# error handling
CHECK_CONSTRAINTS = YES
PARSE_ERRORS = INFINITE
DUPLICATE_ERRORS = INFINITE

# logging
LOGFILE = /tmp/pg_bulkload.log
PARSE_BADFILE = /tmp/parse_badfile.csv
DUPLICATE_BADFILE = /tmp/duplicate_badfile.csv

# test
$ pg_bulkload /tmp/test.ctl 
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        500 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.

$ pg_bulkload /tmp/test.ctl 
NOTICE: BULK LOAD START
ERROR: query failed: ERROR:  could not create unique index "test_id1_id2_id3_key"                              
DETAIL:  Key (id1, id2, id3)=(126, 127, 128) is duplicated. 
DETAIL: query was: SELECT * FROM pgbulkload.pg_bulkload($1)  

  • 200 lines: NO ERROR
$ for i in `seq 1 200`; do echo "$i,$((i+1)),$((i+2))"; done > /tmp/in.csv
$ psql -c "DROP TABLE test; CREATE TABLE test (id1 int, id2 int, id3 int, UNIQUE(id1, id2, id3));" 

$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        200 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.

$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        200 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        200 Rows replaced with new rows.

$ wc -l /tmp/duplicate_badfile.csv
200 /tmp/duplicate_badfile.csv

$ head -n 10 /tmp/duplicate_badfile.csv
1,2,3
2,3,4
3,4,5
4,5,6
5,6,7
6,7,8
7,8,9
8,9,10
9,10,11
10,11,12

@Omegapol

Omegapol commented Jul 31, 2023

Hey, I was also looking at the code and this is what I found.
The duplicate removal code, i.e. this branch:

```c
if (on_duplicate == ON_DUPLICATE_KEEP_NEW)
```

never triggers, because the merge variable is set to -1 (https://github.com/ossc-db/pg_bulkload/blob/8caced46019119e2adf6f3dbdf96b72de9df08e9/lib/pg_btree.c#L489C6-L489C12), and that, in turn, happens because of this piece of code inside BTReaderInit():

```c
		firstid = PageGetItemId(reader->page, P_FIRSTDATAKEY(opaque));
		itup = (IndexTuple) PageGetItem(reader->page, firstid);

		if ((itup->t_tid).ip_posid == 0)
		{
			elog(DEBUG1, "pg_bulkload: failded in BTReaderInit for \"%s\"",
				RelationGetRelationName(rel));
			return -1;
		}
```

I did a quick test with that **condition removed**, which is of course a very naive solution... (I believe the check wasn't there in older versions of pg_bulkload, as it was added in 2020), and the program then performs as expected: it properly identifies duplicates, removes them, and the index gets created properly.
That check is there because of some other issue(s); it is supposed to address crashes. I did a test with PostgreSQL's memory lowered to 1 MB to try to make it crash with that change, but that didn't happen.
The PR that added the check also had a few other changes, so perhaps those other changes address the crashes, or perhaps my dataset was simply too small to make it crash with the condition removed.

@mikecaat
Contributor

mikecaat commented Aug 1, 2023

Hi, thanks for sharing your idea.
I agree that the function is the right starting point for the analysis.

In my understanding, the root cause of the error is that pg_bulkload is not aware of pivot tuples (e.g. in b-tree root and intermediate nodes), whose ip_posid may be 0; they were introduced in the commit (postgres/postgres@8224de4) that supports the INCLUDE clause for indexes from PostgreSQL v11.

When the b-tree has multiple levels, pivot tuples are present and the error occurs. That's why the error only occurred when a lot of data was loaded.

As you said, I assume we need to fix BTReaderInit() to handle the case where the tuple is a pivot tuple. We will make a patch.
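
As an illustration only (not the actual patch), here is a sketch of how the guard quoted above could be made pivot-tuple-aware. It assumes PG13's nbtree.h, which provides BTreeTupleIsPivot(); the variable names are taken from the snippet quoted earlier:

```c
/*
 * Sketch only (not the actual patch): treat ip_posid == 0 as anomalous
 * only for non-pivot tuples.  Pivot tuples (separator keys in the root
 * and intermediate pages, including the "minus infinity" item) may
 * legitimately have ip_posid == 0 since the PG11 suffix-truncation /
 * INCLUDE changes, so they should not make BTReaderInit() bail out
 * with -1 and skip the merge.
 */
firstid = PageGetItemId(reader->page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(reader->page, firstid);

if (!BTreeTupleIsPivot(itup) && (itup->t_tid).ip_posid == 0)
{
	elog(DEBUG1, "pg_bulkload: failed in BTReaderInit for \"%s\"",
		 RelationGetRelationName(rel));
	return -1;
}
```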

@Omegapol

Omegapol commented Aug 4, 2023

@mikecaat thank you for figuring this out!
Do you have an initial ETA for this issue, perhaps?
Multi-column indices are something we use very often, so there is impact. Any help is appreciated!

@mikecaat
Contributor

mikecaat commented Aug 7, 2023

> Do you have an initial ETA for this issue, perhaps?

In addition to v16 support, we plan to release a fix for the issue by January.

IIUC, the workarounds are:

  • to drop the index before loading the data and re-create it after the load (see the sketch below), or
  • to truncate the table before loading the data.
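
For example, a rough sketch of the first workaround, reusing the test table and control file from the reproduction above. It assumes the constraint created by the UNIQUE(...) clause keeps the default name seen in the earlier error; as I understand it, without the unique index pg_bulkload will not filter duplicates during the load, so a manual de-duplication step is sketched before re-creating the constraint:

```
# drop the unique constraint (and its index) before loading
$ psql -c "ALTER TABLE test DROP CONSTRAINT test_id1_id2_id3_key;"

# load the data; duplicates are not filtered while no unique index exists
$ pg_bulkload /tmp/test.ctl

# remove duplicate rows, then re-create the constraint
$ psql -c "DELETE FROM test a USING test b
           WHERE a.ctid < b.ctid
             AND a.id1 = b.id1 AND a.id2 = b.id2 AND a.id3 = b.id3;"
$ psql -c "ALTER TABLE test ADD CONSTRAINT test_id1_id2_id3_key UNIQUE (id1, id2, id3);"
```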

> Multi-column indices are something we use very often, so there is impact. Any help is appreciated!

Yes. In addition, the problem also occurs with single-column indexes.

@Omegapol

Omegapol commented Aug 7, 2023

In our testing we noticed this doesn't occur if we load the data straight into an empty table (and this also lets us eliminate duplicates within the data fed to pg_bulkload, which is part of our use case).
Is there anything that could go wrong when loading into an empty table with indices on it? We haven't observed anything, but perhaps once again it's a matter of how much data we throw at it.

@mikecaat
Contributor

mikecaat commented Aug 8, 2023

> In our testing we noticed this doesn't occur if we load the data straight into an empty table (and this also lets us eliminate duplicates within the data fed to pg_bulkload, which is part of our use case). Is there anything that could go wrong when loading into an empty table with indices on it? We haven't observed anything, but perhaps once again it's a matter of how much data we throw at it.

IIUC, if the table is empty, which means the table's b-tree index doesn't contain any data, the issue doesn't occur. That's one of the workarounds.

@Omegapol

Hello @mikecaat,
Just wanted to check: is the fix for this issue on schedule for January? Has anything changed?

zwyan0 added a commit that referenced this issue Jan 16, 2024
BTReaderInit() assumed that the case (itup->t_tid).ip_posid == 0 was anomalous when searching for the leftmost leaf page of a b-tree, but this is an outdated condition and has been removed.

With the commit (postgres/postgres@8224de4) that added support for the INCLUDE clause in indexes in PostgreSQL v11, new pivot tuples (e.g. in b-tree root and intermediate nodes) were introduced, and their ip_posid can be 0.

Because pg_bulkload had not been updated for this, errors sometimes occurred when running the process on b-trees holding enough data to contain pivot tuples.
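
In terms of the snippet quoted earlier in the thread, the change amounts to dropping that guard. A sketch reconstructed from the description above and the quoted code, not copied from the actual commit:

```diff
 	firstid = PageGetItemId(reader->page, P_FIRSTDATAKEY(opaque));
 	itup = (IndexTuple) PageGetItem(reader->page, firstid);
 
-	if ((itup->t_tid).ip_posid == 0)
-	{
-		elog(DEBUG1, "pg_bulkload: failded in BTReaderInit for \"%s\"",
-			RelationGetRelationName(rel));
-		return -1;
-	}
```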