attempting to GC indexes: clearing index 2: command is too large #61206

Closed
dankinder opened this issue Feb 26, 2021 · 14 comments

Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), O-community (Originated from the community), X-blathers-triaged (blathers was able to find an owner)

@dankinder

Describe the problem

On truncating a table with about 180GB of data, the GC got this error:
attempting to GC indexes: clearing index 2: command is too large: 120227141 bytes (max: 67108864)

This was data we had imported (via IMPORT INTO ... CSV) within the past day.
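
For context, a hypothetical sketch of the general form of that import (the table, column names, and bucket path below are made up, not our actual statement):

IMPORT INTO t (a, b, c)
    CSV DATA ('s3://bucket/path/data.csv?AWS_ACCESS_KEY_ID=...&AWS_SECRET_ACCESS_KEY=...');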

Note: this is using v21.1.0-alpha3 in order to get this fix; otherwise our S3 reads time out.

I have a debug zip exported if you want me to upload it in the support portal.

Environment:

  • CockroachDB version v21.1.0-alpha3
  • Server OS: CentOS 6
@dankinder added the C-bug label on Feb 26, 2021
@blathers-crl

blathers-crl bot commented Feb 26, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl bot added the O-community and X-blathers-triaged labels on Feb 26, 2021
@ajwerner
Contributor

Is this using interleaving somehow?

@ajwerner
Contributor

There is a cluster setting for the max command size; raising it may unstick you, but it carries modest risk, so definitely set it back down afterwards. Fortunately this isn't attempting to do anything totally nuts. The other option is to decrease the max range size for the database to, say, 32 MiB (or for the default zone, if that database is gone now). That's a safer choice, but it will incur more load on the cluster as splitting and merging happens.
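
For concreteness, a hedged sketch of both options: the setting and zone-config names below assume a recent CockroachDB version, the 128 MiB and 32 MiB values are only examples, and mydb stands in for the actual database name.

-- Option 1 (riskier): temporarily raise the maximum raft command size,
-- then put it back once the GC job has finished.
SET CLUSTER SETTING kv.raft.command.max_size = '128MiB';
RESET CLUSTER SETTING kv.raft.command.max_size;  -- after the job completes

-- Option 2 (safer, but more split/merge load): shrink the range size
-- for the affected database (or the default zone).
ALTER DATABASE mydb CONFIGURE ZONE USING range_min_bytes = 16777216, range_max_bytes = 33554432;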

@dankinder
Author

Interestingly it seems like that data still got dumped regardless... so I guess there's nothing about our cluster that needs remediation per se. But I assume you would want to prevent this from happening in the future. Nothing about the dataset was really unusual.

And no, no interleaving. This was a really simple dataset with one table, a few columns with one INT primary key.

@ajwerner
Contributor

"Still got dumped" as in gc happened?

Also, where are you seeing that error? Is it in the logs, or was it returned from the truncate itself?

@dankinder
Author

What I mean is, the number of live bytes dropped dramatically, so seemingly most of the data got cleared if not all of it.

This error is not on the TRUNCATE job, it's on the GC job that followed it, i.e. it's GC for TRUNCATE TABLE <the table> that failed.

If that failure does leave data leftover, will it eventually get cleared in the normal compaction process?

@ajwerner
Contributor

ajwerner commented Mar 1, 2021

It's bad that that job failed. We've had recent discussions about whether we should ever let that job fail.

#55740
#59542
#59788 (comment)

I can help you restart that job. In the meantime, can you grab a copy of its record before it gets deleted by the system? That would be:

SELECT id, status, created, encode(payload, 'hex'), encode(progress, 'hex') FROM system.jobs WHERE id = <relevant job id>;
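
To find the relevant job id to plug into that query, something along these lines should work; this is a hedged sketch assuming the GC job is reported with job_type 'SCHEMA CHANGE GC':

SELECT job_id, job_type, status, error
  FROM [SHOW JOBS]
 WHERE job_type = 'SCHEMA CHANGE GC' AND status = 'failed';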

@ajwerner
Contributor

ajwerner commented Mar 1, 2021

What I don't understand is why the GC job would be sending a large raft command. The clear range operation it uses should be small.

@ajwerner
Contributor

ajwerner commented Mar 1, 2021

Are your keys somehow absolutely gigantic?

@ajwerner
Contributor

ajwerner commented Mar 1, 2021

I think I've got a lead on this one. We'll need to do some manual things to recover the job. Thanks for the bug report!

@ajwerner
Contributor

ajwerner commented Mar 1, 2021

Actually we're still pretty confused. Do you have any more intel on the structure of these tables to share?

@dankinder
Author

Yeah so I can at least say it's an extremely simple table, basically a few INT columns that together are the primary key. No other columns, no indexes.
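
For illustration only (the table and column names here are hypothetical), the shape described is roughly:

CREATE TABLE t (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT NOT NULL,
    PRIMARY KEY (a, b, c)
);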

I just sent a debug zip through the support portal and tagged this issue.

@dankinder
Author

If y'all couldn't find anything and don't intend to investigate further (because it's an alpha), it's okay if you want to close this; we're good as far as our cluster goes.

@ajwerner closed this as completed on Mar 4, 2021
@erikgrinaker
Contributor

I think we've found the cause of this and submitted a fix in #74674.
