Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Store the resources in S3 buckets #611

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

sband
Copy link

@sband sband commented May 12, 2023

Initial draft for fixing issue #582

@welcome
Copy link

welcome bot commented May 12, 2023

Thanks so much for opening this pull request and for helping to improve SirixDB 🚀

@sband
Copy link
Author

sband commented May 12, 2023

This is a DRAFT pull request. The implementation is not complete yet. However I would really look out for your suggestions/advice on this, if you see that I am implementing in the wrong direction in terms of the use-case/issue

@JohannesLichtenberger
Copy link
Member

I'm on my phone right now, but it seems to be fine so far, thanks :-)

@JohannesLichtenberger
Copy link
Member

@sband I saw that you're writing to the local file backend as well as to S3. I'd probably simply write to S3 and read from S3. We can combine the storages later on with the combined storage and an async write to the second storage. What do you think?

@sband
Copy link
Author

sband commented May 19, 2023

@sband I saw that you're writing to the local file backend as well as to S3. I'd probably simply write to S3 and read from S3. We can combine the storages later on with the combined storage and an async write to the second storage. What do you think?

Yes. I agree to that. But as far as I know, to write an object in AWS S3, I have to first write it as a file and then put that file object on to S3 storage. Atleast that is what the SDK is supporting (please correct me if that is not the case). So to work around this issue and keeping in mind our discussion in the issue comments, I used the approach of creating the files from the object keyName. These files are created in the system tmp directory - using AmazonS3StorageReader#readObjectDataFromS3 . One thing I forgot was to delete these files from system tmp directory once these were successfully written on to the S3 storage. Do you have any other way where I could directly write to S3 object ? If yes, please let me know - that would be a new learning for me.
Also - yes I should have used AsyncS3Client for writing asynchronously to S3 storage - will make that change. I am yet to complete on the reader part though.

@JohannesLichtenberger
Copy link
Member

Oh, I didn't look into the S3 API documentation, but you seem to be correct. Thus, the bucket name will be the database name, and the objects stored therein will be the Sirix resources (within the database)!?

BTW: I saw that Sebastian used JClouds (usable for different cloud storage) back then. Would an additional IO package make sense?

Also, you can store a local version (not temporary), as it requires the files nonetheless... otherwise, we'd probably almost always combine the local and S3 store, and we'd end up storing temp files and the "real" local folders/files.

@sband
Copy link
Author

sband commented May 19, 2023

I saw that Sebastian used JClouds (usable for different cloud storage) back then. Would an additional IO package make sense?
I have created a package "org.sirix.io.cloud" for that, but the class implementation is empty for now - Since we want to support any cloud platform, there would be additions to CloudPlatform.java and CloudConnectionStorageFactory.java for every new support we want to add. ICloudStorage.java is being used just as a marker interface to identify remote storage - but this may be unnecessary IMO.

Also, you can store a local version (not temporary), as it requires the files nonetheless... otherwise, we'd probably almost always combine the local and S3 store, and we'd end up storing temp files and the "real" local folders/files.
In that case atleast for Amazon S3 storage, we wont need a CombinedStorage class. If this is the case for all other cloud platforms (umm, atleast the well known ones like GCP and Azure) then we wont need CombinedStorage for them either.

@sband
Copy link
Author

sband commented May 23, 2023

@JohannesLichtenberger ,
I have submitted the code combined storage as well. Let me know if I have not included any of your suggestions from our above conversation.
My Pending Action Items for this MR:

  1. To write test cases.

Thanks
Sanket

@JohannesLichtenberger
Copy link
Member

Can you make sure to include the dependency in sirix-core in the build.gradle file?

@JohannesLichtenberger
Copy link
Member

Regarding the combined storage, I think you should make sure to write async, e.g.: https://github.com/sebastiangraf/treetank/blob/master/coremodules/core/src/main/java/org/treetank/io/combined/CombinedWriter.java

I'd also suggest to create a real local file for the S3-based reader/writer instead of the temporary file.

@sband
Copy link
Author

sband commented May 25, 2023

Regarding the combined storage, I think you should make sure to write async, e.g.: https://github.com/sebastiangraf/treetank/blob/master/coremodules/core/src/main/java/org/treetank/io/combined/CombinedWriter.java
Yes , I had discussed with you in one of our conversations that I could use ExecutorsService to start a separate thread for that. Will do.

…l file system without deleting instead of using tmp parition
@sband sband reopened this May 26, 2023
@sband
Copy link
Author

sband commented May 26, 2023

Re-opening, got closed due sync with forked master

@sband
Copy link
Author

sband commented May 31, 2023

@JohannesLichtenberger could you please review, I think I have done all the changes. I looked at the possibility of writing test cases, but except the S3 storage read and write code, everything is same as FileStorageWriter and FileStorageReader, do you still think we need test cases for that ? S3 storage read and write code are pretty much standard API calls. Let me know your opinion on this.

@JohannesLichtenberger
Copy link
Member

The tests seem to fail, but I'm in holidays right now and it would be better to check once I'm home on Sunday.

I'd prefer to add integration tests, however. Can you also use the SirixDB.xml formatter in the project root and reformat your code?

Other than that, thanks a lot for your work :-)

@JohannesLichtenberger
Copy link
Member

Besides, the file based approaches are currently broken, I believe. Better use the FileChannel reader/writer.

@JohannesLichtenberger
Copy link
Member

@sband Hope it makes sense so far :-)

@sband
Copy link
Author

sband commented Jun 5, 2023 via email

@sband
Copy link
Author

sband commented Jun 8, 2023

The tests seem to fail, but I'm in holidays right now and it would be better to check once I'm home on Sunday.

I'd prefer to add integration tests, however. Can you also use the SirixDB.xml formatter in the project root and reformat your code?

Other than that, thanks a lot for your work :-)

Regarding the integration tests, can you help me on this? As far as I understand, for integration tests we would need a working AWS connection which is not available.

@JohannesLichtenberger
Copy link
Member

The second approach with a local s3 looks appealing; https://stackoverflow.com/questions/6615988/how-to-mock-amazon-s3-in-an-integration-test

@JohannesLichtenberger
Copy link
Member

BTW: Currently something fails

@sband
Copy link
Author

sband commented Jun 8, 2023 via email

@JohannesLichtenberger
Copy link
Member

Can you work on the integration tests? Did you already try it with a real account...?

@sband
Copy link
Author

sband commented Jun 13, 2023 via email

@JohannesLichtenberger
Copy link
Member

thank you :-)

@JohannesLichtenberger
Copy link
Member

Currently it doesn't compile

@JohannesLichtenberger
Copy link
Member

@sband do you have some time this week?

@sband
Copy link
Author

sband commented Jun 22, 2023 via email

@JohannesLichtenberger
Copy link
Member

Oh, sorry to hear. Hopefully everything turned out well in the end.

Take your time

@JohannesLichtenberger
Copy link
Member

@sband do you think you'll have some time this week? :-)

@JohannesLichtenberger
Copy link
Member

@sband ping, just like to know if you'll keep on working on the PR 👍

@sband
Copy link
Author

sband commented Jul 9, 2023 via email

@JohannesLichtenberger
Copy link
Member

Oh sure, get well soon! :-) hope I can encourage you to keep on working on Sirix in the future ;-)

@sband
Copy link
Author

sband commented Jul 19, 2023

Hi @JohannesLichtenberger ,
I am back. Resuming from today.

@JohannesLichtenberger
Copy link
Member

Hope you've recovered completely :-) thanks for the heads up

@JohannesLichtenberger
Copy link
Member

@sband you made any progress!?

@JohannesLichtenberger
Copy link
Member

@sband trying to ping

@BitoAgent
Copy link

PR Run Status

  • AI Based Review: Successful
  • Static Analysis: Failure - Failed to execute static code analysis using FBinfer.

PR Analysis

  • Main theme: Integration of AWS S3 storage for resource configurations and IO operations.
  • PR summary: This PR introduces AWS S3 storage support in SirixDB. It modifies the build configuration to include the AWS SDK dependency, adds new classes for S3 storage management, and integrates S3 as an optional storage backend for resources. This allows SirixDB to store and retrieve data from S3 buckets, offering a cloud storage solution alongside the existing file system storage option.
  • Type of PR: Enhancement
  • Score (0-100, higher is better): 85
  • Relevant tests added: Yes
  • Estimated effort to review (1-5, lower is better): 3
    The PR includes significant changes, including new features and external library integration, requiring a thorough review for correctness, error handling, and adherence to best practices.

PR Feedback

  • General suggestions: The integration of AWS S3 storage is a valuable addition to SirixDB, enhancing its scalability and flexibility by enabling cloud storage options. However, it's crucial to ensure that the implementation is robust, secure, and efficient. Consider adding more detailed error handling and logging, especially for network-related operations and AWS SDK interactions. It's also important to ensure that the integration does not introduce performance regressions or security vulnerabilities. Additionally, reviewing the AWS SDK usage for best practices and cost optimization could be beneficial. Lastly, consider the implications of this change on existing deployments and document any required configuration changes or migration steps for users.

Code feedback

file: bundles/sirix-core/build.gradle

  • Suggestions:
  1. Consider specifying a version range or using a variable for the AWS SDK dependency to facilitate easier updates and compatibility checks. [important] Relevant line:In bundles/sirix-core/build.gradle at line 47
+ amazonS3 : 'software.amazon.awssdk:s3:2.20.62'

file: bundles/sirix-core/src/main/java/org/sirix/access/ResourceConfiguration.java

  • Suggestions:
  1. Ensure that AWS credentials are securely managed, possibly supporting environment variables or AWS IAM roles for EC2 instances, in addition to the current profile-based approach. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/access/ResourceConfiguration.java at line 176
+ private final String awsProfile;

file: bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java

  • Suggestions:
  1. Implement retry logic or use the AWS SDK's built-in retry mechanism for handling transient network and service issues. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java at line 186
+ final var storage = new AmazonS3Storage(resourceConf, cache);

file: bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3StorageWriter.java

  • Suggestions:
  1. Consider abstracting the S3 path construction to support different bucket naming conventions and structures. [medium] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3StorageWriter.java at line 319
+ .build();

file: libraries.gradle

  • Suggestions:
  1. Verify compatibility of the added AWS SDK version with other project dependencies to avoid conflicts. [medium] Relevant line:In libraries.gradle at line 47
+ amazonS3 : 'software.amazon.awssdk:s3:2.20.62'

file: bundles/sirix-core/src/test/java/org/sirix/io/cloud/amazon/AWSS3StorageTest.java

  • Suggestions:
  1. Expand test coverage to include failure scenarios, such as network issues or permission errors. [important] Relevant line:In bundles/sirix-core/src/test/java/org/sirix/io/cloud/amazon/AWSS3StorageTest.java at line 76
+\ No newline at end of file

file: bundles/sirix-core/src/main/java/org/sirix/io/combined/CombinedStorage.java

  • Suggestions:
  1. Ensure that data consistency is maintained between local and remote storage, especially in write and truncate operations. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/combined/CombinedStorage.java at line 48
+}

@BitoAgent
Copy link

PR Run Status

  • AI Based Review: Successful
  • Static Analysis: Failure - Failed to execute static code analysis using FBinfer.

PR Analysis

  • Main theme: Integration of Amazon S3 for storage in SirixDB
  • PR summary: This PR introduces the capability to store SirixDB resources in Amazon S3 buckets, enhancing the storage options available. It includes the necessary configurations, implementations for S3 storage interaction, and extends the storage type enumeration to support cloud storage options. Additionally, it integrates tests to ensure the functionality works as expected.
  • Type of PR: Enhancement
  • Score (0-100, higher is better): 85
  • Relevant tests added: Yes
  • Estimated effort to review (1-5, lower is better): 3
    The PR introduces a significant new feature with multiple new classes and modifications to existing ones, necessitating a thorough review for correctness, potential security implications, and adherence to best practices.

PR Feedback

  • General suggestions: The integration of Amazon S3 as a storage option is a valuable addition to SirixDB, offering users more flexibility in how and where they store their data. However, there are several areas where the implementation can be improved for better performance, security, and maintainability. Specifically, attention should be paid to error handling, the use of blocking calls in asynchronous contexts, and ensuring that resources are properly closed or released. Adopting a more modular design for the cloud storage integration could also enhance the maintainability and extensibility of the codebase. Additionally, considering security implications, such as bucket permissions and data encryption, is crucial to protect user data.

Code feedback

file: bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java

  • Suggestions:
  1. Consider using try-with-resources or finally blocks to ensure resources like S3Client are properly closed even in the case of exceptions. This is crucial to avoid resource leaks. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java at line 119
final var storage = new AmazonS3Storage(resourceConf, cache);
  1. Implement error handling for AWS S3 operations to gracefully handle and log errors such as network issues or permission errors. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java at line 130
s3Client.createBucket(bucketRequest);
  1. Validate configuration parameters such as AWS profile, region, and bucket name to ensure they are not null or malformed before use. [medium] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3Storage.java at line 90
this.awsStorageInfo = resourceConfig.awsStoreInfo;

file: bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3StorageWriter.java

  • Suggestions:
  1. Use asynchronous API calls for S3 interactions instead of blocking calls within the CompletableFuture to improve performance and responsiveness. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3StorageWriter.java at line 138
objectFutureResponse.join();
  1. Avoid using System.exit within library code as it can lead to unexpected application termination. Consider throwing exceptions or using other error reporting mechanisms. [important] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/amazon/AmazonS3StorageWriter.java at line 173
System.exit(1);

file: bundles/sirix-core/src/main/java/org/sirix/io/combined/CombinedStorageWriter.java

  • Suggestions:
  1. Ensure that CompletableFuture exceptions are properly handled or logged to avoid silent failures in asynchronous operations. [medium] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/combined/CombinedStorageWriter.java at line 32
CompletableFuture.supplyAsync(() -> remoteStorageWriter.write(pageReadOnlyTrx, pageReference, bufferedBytes));

file: bundles/sirix-core/src/main/java/org/sirix/io/cloud/ICloudStorage.java

  • Suggestions:
  1. Consider defining more specific methods in the ICloudStorage interface to cover common cloud storage operations, enhancing the interface's utility and clarity. [medium] Relevant line:In bundles/sirix-core/src/main/java/org/sirix/io/cloud/ICloudStorage.java at line 5
public interface ICloudStorage extends IOStorage {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants