
Add documentation for standalone (sparkless) GC #8307

Merged: 20 commits into master on Oct 29, 2024
Conversation

yonipeleg33 (Contributor)

Closes #8306

@yonipeleg33 yonipeleg33 added include-changelog PR description should be included in next release changelog docs Improvements or additions to documentation labels Oct 27, 2024
@yonipeleg33 yonipeleg33 changed the title Add documentation for sparkless GC Add documentation for standalone (sparkless) GC Oct 27, 2024

@yonipeleg33 yonipeleg33 requested review from itaiad200 and a team October 27, 2024 11:59

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed


E2E Test Results - Quickstart

11 passed

@talSofer talSofer (Contributor) left a comment


Thank you, @yonipeleg33, for implementing and documenting this important feature!

I'm blocking until the required steps are added and the MinIO scenario's behavior is clarified.

> Standalone GC is only available for [lakeFS Enterprise]({% link enterprise/index.md %}).

{: .note .warning }
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully and use at your own risk.
Contributor

Suggested change
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully and use at your own risk.
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully before using it.

"at your own risk" sounds a bit harsh and unreliable

Contributor Author

Done

{: .note .warning }
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully and use at your own risk.

## About
Contributor

nit; add a whitespace after every markdown heading

Contributor Author

Done

Standalone GC is a limited version of the Spark-backed GC that runs without any external dependencies, as a standalone docker image.

## Limitations
1. Tested in-lab under the following conditions: <TODO: publish acceptance and performance test results once ready>.
Contributor

Is this section going to include the scale at which sgc was tested?

Contributor Author

Added concrete numbers (left the TODO because I'm waiting to run it on the final version, which is not ready yet).

As an enterprise customer, you should already have a dockerhub token for the `externallakefs` user.
If not, contact us at ___ (TODO: add mail/whatever).

### 2. Login to docker with this token
Contributor

Suggested change
### 2. Login to docker with this token
### 2. Login to dockerhub with this token

Contributor Author

Done


## Installation

### 1. Obtain Dockerhub token
Contributor

nit; I prefer using words rather than bulleted lists in markdown headers, but it's up to you

Suggested change
### 1. Obtain Dockerhub token
### Step 1: Obtain Dockerhub token

Contributor Author

Done

logging:
level: <value> # info,debug...
```
Then, pass it to the program using the `--config path/to/config.yaml` argument.
Contributor

Isn't there a default location it expects? That's OK, just to clarify.

Contributor Author

There is, it's mentioned in the "Command line reference" section:

Flags:

  • -c, --config: config file to use (default is $HOME/.lakefs-sgc.yaml)
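Putting the excerpts above together, a minimal config-file sketch might look like this. The key names (`logging.level`, `cache_dir`) and defaults come from this PR's excerpts; anything beyond that is an assumption, so consult the published docs for the authoritative schema:

```yaml
# Sketch only: lives at $HOME/.lakefs-sgc.yaml by default,
# or is passed explicitly via --config path/to/config.yaml.
logging:
  level: debug                   # info, debug, ...
cache_dir: ~/.lakefs-sgc/data    # matches the --cache-dir flag default
```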

## Limitations
1. Tested in-lab under the following conditions: <TODO: publish acceptance and performance test results once ready>.
2. Horizontal scale is not supported - Only a single instance of `lakefs-sgc` can operate at a time on a given repository.

Contributor

We should add a limitation that says that sgc only implements the mark stage without sweeping, and sweep requires user action

Contributor Author

Done - added a bullet to "Limitations", and a new "Output" section describing this.

docker pull treeverse/lakefs-sgc:tagname
```

## Usage
@talSofer talSofer (Contributor) Oct 27, 2024

Can you please add these two steps here:
3. running the job with example params
4. How to find the output and guidance for how to read it and a CTA to delete the objects manually

Contributor Author

> running the job with example params

I already added an example - take a look at "Example - docker run command"

> How to find the output and guidance for how to read it and a CTA to delete the objects manually

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Contributor Author

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Update: I added a dedicated section for "Deleting marked objects" with the same sentence ^

Currently, `lakefs-sgc` does not provide an option to explicitly set AWS credentials. It relies on the hosting machine
to be set up correctly, and reads the AWS credentials from the machine.

This means you should set up your machine the way AWS expects. \
Contributor

How configurations work for on-prem users who use Minio?

Contributor Author

Done - added "S3-compatible clients" section and example (cc @itaiad200)
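For context, the `grep`-based `docker run` examples quoted later in this conversation read `aws_access_key_id` and `aws_secret_access_key` from the standard AWS shared credentials file (`~/.aws/credentials`), which has roughly this shape. The profile name and values here are placeholders:

```ini
[default]
aws_access_key_id = <access key id>
aws_secret_access_key = <secret access key>
```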

@yonipeleg33 yonipeleg33 (Contributor Author) left a comment

Thanks @talSofer!
I addressed all your comments other than the Minio question, which I need to get back to you about.
PTAL



### Step 1: Obtain Dockerhub token
As an enterprise customer, you should already have a dockerhub token for the `externallakefs` user.
If not, contact us at ___ (TODO: add mail/whatever).
Contributor Author

@talSofer what should I put here?


Contributor Author

Done



`lakefs-sgc run <repository>`

Flags:
- `--cache-dir`: directory to cache read files and metadataDir (default is $HOME/.lakefs-sgc/data/)
Contributor

Isn't this a config value? How can it be both?

Contributor Author

It can 🙂
Using Viper, we bind this config key (from env/file) to a run argument as well

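To make the Viper binding above concrete, here is an illustrative sketch (not `lakefs-sgc` code) of the usual Viper convention, which maps a dotted config key such as `logging.level` to an environment variable carrying the `LAKEFS_SGC` prefix seen in this doc's examples:

```python
def env_var_for(config_key: str, prefix: str = "LAKEFS_SGC") -> str:
    """Map a dotted config key to its Viper-style environment variable name."""
    return prefix + "_" + config_key.replace(".", "_").upper()

print(env_var_for("logging.level"))  # LAKEFS_SGC_LOGGING_LEVEL
print(env_var_for("cache_dir"))      # LAKEFS_SGC_CACHE_DIR
```

The same key can therefore be supplied via the config file, an environment variable, or the bound command-line flag.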
@yonipeleg33 yonipeleg33 (Contributor Author) left a comment

@itaiad200 thanks, awesome review!
PTAL (also @talSofer regarding MinIO)



@itaiad200 itaiad200 (Contributor) left a comment

LGTM
Added a few nitpicking comments, please review them before merging

@@ -149,41 +170,75 @@ docker run \
-e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
Contributor

Suggested change
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakeFS Endpoint URL> \

Contributor Author

Done

Comment on lines 173 to 174
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
Contributor

Suggested change
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakeFS access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakeFS secret key> \

Contributor Author

Done

Comment on lines 173 to 174
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
Contributor

Did we mention anywhere which lakeFS user this is?
A customer who wants to use this would want to grant it the minimal permissions possible.

Contributor Author

Done - Added a "Permissions" section

--network=host \
-v ~/.aws:/home/lakefs-sgc/.aws \
-e AWS_REGION=us-east-1 \
-e AWS_PROFILE=<your profile> \
Contributor

Nit: here and elsewhere, drop the "your" prefix

Contributor Author

Done


To delete the objects marked by the GC, you'll need to read the `deleted.csv` file, and manually delete each address from AWS.

Example bash command to move all the marked objects to a different bucket on S3:
Contributor

I would add a note about playing it safe and moving the objects instead of deleting them

Contributor Author

Done
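As a hedged illustration of that manual sweep step (the function and file content below are hypothetical, not part of `lakefs-sgc`), a script along these lines could turn the `deleted.csv` report (a CSV with a single `address` column, per the discussion in this PR) into `aws s3 mv` commands that move marked objects to a backup bucket instead of deleting them outright:

```python
import csv
import io

def move_commands(report_file, src_bucket, dst_bucket):
    """Build one 'aws s3 mv' command per address in a deleted.csv report."""
    commands = []
    for row in csv.DictReader(report_file):
        addr = row["address"]
        commands.append(f"aws s3 mv s3://{src_bucket}/{addr} s3://{dst_bucket}/{addr}")
    return commands

# Hypothetical report content; a real run would read the file produced by the GC.
report = io.StringIO("address\ndata/abc123\ndata/def456\n")
for cmd in move_commands(report, "my-bucket", "my-backup-bucket"):
    print(cmd)
```

Moving rather than deleting keeps a recovery window in case an object turns out to still be referenced.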

@talSofer talSofer (Contributor) left a comment

Thank you!
I added a couple more comments that I think are worth addressing before merging.


1. Except for the [Lab tests](./standalone-gc.md#lab-tests) performed, there are no further guarantees about the performance profile of the Standalone GC.
2. Horizontal scale is not supported - Only a single instance of `lakefs-sgc` can operate at a time on a given repository.
3. It only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
Contributor

Suggested change
3. It only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
3. Stand-alone GC only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \

Contributor Author

Done

| `cache_dir` | Directory to use for caching data during run | ~/.lakefs-sgc/data | string |
| `aws.max_page_size` | Max number of items per page when listing objects in AWS | 1000 | number |
| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true | boolean |
| `objects_min_age`* | Ignore any object that is last modified within this time frame ("cutoff time") | "6h" | duration |
Contributor

Why do we need this if we have a retention policy? And if it is risky to change, why is it configurable?

Contributor Author

Removed.

It's on top of the retention policy, this configuration exists in the GC as well.
But maybe it's worth not documenting it, to prevent mishaps...
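For clarity, the cutoff semantics being discussed ("ignore any object last modified within this time frame") amount to an age filter applied on top of the retention policy. A minimal sketch with hypothetical names:

```python
from datetime import datetime, timedelta, timezone

def eligible_for_marking(last_modified, now, min_age=timedelta(hours=6)):
    """An object is only considered by GC once it is at least min_age old."""
    return now - last_modified >= min_age

now = datetime(2024, 10, 29, 12, 0, tzinfo=timezone.utc)
print(eligible_for_marking(now - timedelta(hours=7), now))     # True
print(eligible_for_marking(now - timedelta(minutes=30), now))  # False
```

The cutoff guards against marking objects from in-flight uploads that are not yet referenced by any commit.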

```

In this prefix, you'll find 2 objects:
- `deleted.csv` - A CSV with a single `address` column listing all marked objects. Example:
Contributor

Not a docs question, but why is this file called deleted if it contains objects that are only marked for deletion?

Contributor Author

It's aligned with the GC's output
(cc @itaiad200 @Jonathan-Rosenberg - right?)

- `--parallelism`: number of parallel downloads for metadataDir (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)

### Example run commands
Contributor

Suggested change
### Example run commands
## How to Run Standalone GC
### Run Commands

Contributor Author

Done

-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
### Output
Contributor

Suggested change
### Output
### Get the List of Objects Marked for Deletion
The output of an SGC job includes the list of objects marked for deletion. It is located at...

Contributor Author

Done

}
```

### Deleting marked objects
Contributor

Suggested change
### Deleting marked objects
### Delete marked objects

Contributor Author

Done

@yonipeleg33 yonipeleg33 (Contributor Author) left a comment

Thanks @itaiad200 and @talSofer!

Since you approved, I won't ask for a re-review, but you're welcome to take another look


"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<bucket>/*"
Contributor

Too permissive. Should be this, no?

Suggested change
"arn:aws:s3:::<bucket>/*"
"arn:aws:s3:::<storage_namespace>/_lakefs/*"

Contributor Author

Can't use only _lakefs as it needs access to the _data prefix as well.
But you're right that it doesn't need permissions for the entire bucket, only the storage namespace prefix.
Changed accordingly.
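The narrowed scoping agreed on here can be sketched as an IAM policy statement. Only `s3:GetObject` appears in the excerpt above, so treat the action list and the exact ARN shape as assumptions to verify against the final docs:

```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": ["arn:aws:s3:::<storage_namespace>/*"]
}
```

This grants read access only under the repository's storage namespace prefix (covering both the `_lakefs` and `_data` prefixes) rather than the whole bucket.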

@yonipeleg33 yonipeleg33 enabled auto-merge (squash) October 29, 2024 16:06
@yonipeleg33 yonipeleg33 merged commit 10fcb19 into master Oct 29, 2024
38 of 39 checks passed
@yonipeleg33 yonipeleg33 deleted the sgc/documentation branch October 29, 2024 16:33
Successfully merging this pull request may close these issues.

Sparkless GC - Add documentation
3 participants