
Parquet: Support parquet modular encryption #2639

Merged: @rdblue merged 1 commit into apache:master from the support-parquet-encryption branch on May 25, 2023

Conversation

@ggershinsky (Contributor) commented May 26, 2021

Implements #1413

@rdblue (Contributor) commented Jun 12, 2021

I'm approving this to run unit tests. Thanks for working on this, @ggershinsky!

@rdblue (Contributor) commented Jun 12, 2021

@ggershinsky, the doc you linked to is quite long and it isn't clear what part of it explains what you're doing here. Can you provide a quick summary of how to plug into Parquet encryption and what this does? I'm assuming that it provides Parquet's equivalent of an EncryptionManager that gets the file AAD and necessary key material from Iceberg's key_metadata field. Is that correct?

@ggershinsky (Contributor, Author):

> I'm approving this to run unit tests. Thanks for working on this, @ggershinsky!

Thanks @rdblue!

@ggershinsky (Contributor, Author):

> Can you provide a quick summary of how to plug into Parquet encryption and what this does?

Certainly. There are two encryption interfaces in parquet-mr 1.12.0: a low-level one (a direct implementation of the spec; maximum flexibility; no key management) and a high-level one (a layer on top of the low-level interface, with library-local key management tools driven by Hadoop properties). In Iceberg, we'll use the low-level Parquet encryption API directly, because key management will be done by Iceberg in a similar fashion for all formats, and because Iceberg has a centralized manifest capability, which makes key management more efficient than running library-local key tools in each worker process.
Since Iceberg already taps directly into the general (non-encryption) Parquet low-level API, this PR links it to the encryption feature and translates the general column encryption configuration (TBD) into Parquet encryption configuration.
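For context, a minimal sketch of what the write side of that low-level API looks like, assuming a caller that already holds raw key bytes; the class name and key-id string below are hypothetical illustration, not code from this PR:

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.crypto.ParquetCipher;

class LowLevelEncryptionSketch {
  static FileEncryptionProperties encryptionProperties() {
    // A random 128-bit DEK; real key management is left to the caller (Iceberg).
    byte[] footerKey = new byte[16];
    new SecureRandom().nextBytes(footerKey);

    // Opaque key metadata, stored in the file and handed back to the reader's
    // key retriever; Iceberg would carry the wrapped keys in its own manifests.
    byte[] keyMetadata = "hypothetical-key-id".getBytes(StandardCharsets.UTF_8);

    return FileEncryptionProperties.builder(footerKey)
        .withFooterKeyMetadata(keyMetadata)
        .withAlgorithm(ParquetCipher.AES_GCM_V1)
        .build();
  }
}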

> it provides Parquet's equivalent of an EncryptionManager that gets the file AAD and necessary key material from Iceberg's key_metadata field.

Well, here we have the gap that I've described in the other PR. To encrypt data(/delete) files, we need the AES keys: the DEKs, "data encryption keys", which are used to actually encrypt the data and metadata modules (there must be a unique DEK per file/column). But key_metadata is a binary field in the manifest entry for a data(/delete) file that keeps a "wrapped" version of these DEKs (encrypted with master keys, MEKs, in the user's KMS). It doesn't and shouldn't keep raw DEKs. Therefore, sending key_metadata to Parquet file writers (or any other writers) doesn't help. Per my proposal in the last sync, we can reverse this process: generate random DEKs at the (Parquet) writers, use them for encryption, and send them back to the manifest writer in the DataFile/DeleteFile/ContentFile objects. This also fits the current Iceberg model, where manifest entries are written after collecting ContentFile objects from the file writers. At that point, the manifest writer process will contact the KMS to wrap these DEKs and package them into the key_metadata field in the manifest file (for the readers).

In a future version, we might want to generate the DEKs (or get them from the KMS) in the manifest writer process and then distribute them to the data/delete file writers, with a unique DEK per file (or a set of unique DEKs per file, for column encryption). That seems more complicated and a worse fit for the current Iceberg flow; my suggestion is to start with the reverse approach described above.
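To make the proposed reverse flow concrete, a sketch follows; every interface and name in it is a hypothetical illustration, not an actual Iceberg API:

import java.nio.ByteBuffer;
import java.security.SecureRandom;

class ReverseDekFlowSketch {
  // Hypothetical stand-in for the user's KMS: wraps a DEK with a master key (MEK).
  interface KmsClient {
    ByteBuffer wrapKey(ByteBuffer dek, String masterKeyId);
  }

  // Step 1: the (Parquet) file writer generates a random DEK and uses it to
  // encrypt the file's data and metadata modules.
  static ByteBuffer generateDek() {
    byte[] dek = new byte[16]; // AES-128
    new SecureRandom().nextBytes(dek);
    return ByteBuffer.wrap(dek);
  }

  // Step 2: the DEK travels back to the manifest writer in the ContentFile
  // object; the manifest writer wraps it via the KMS and stores the result in
  // the key_metadata field of the manifest entry, for the readers.
  static ByteBuffer wrapForManifest(KmsClient kms, ByteBuffer dek, String masterKeyId) {
    return kms.wrapKey(dek, masterKeyId);
  }
}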

@flyrain (Contributor) commented Jun 22, 2021

Hi @ggershinsky, can you rebase this PR to include #2441?

@ggershinsky (Contributor, Author):

Hi @flyrain, will do, thanks for the notice.

@ggershinsky (Contributor, Author):

Per the comments above, there will be changes in this pull request (and in #2638, #2640). Converting the three to drafts.

@ggershinsky ggershinsky marked this pull request as draft June 27, 2021 10:59
@github-actions github-actions bot added the core label Nov 10, 2021
@liujinhui1994:

@ggershinsky hello, is there any recent progress on this feature? We are looking forward to it.

@ggershinsky (Contributor, Author):

Hi @liujinhui1994, yes, this feature is under active development. Per the community discussions, the encryption PRs will be updated in Q1'22; I'll probably have the commits ready later this month or in February.

@shangxinli (Contributor):

@ggershinsky Let me know when it is ready for review. I would love to review this change.

@ggershinsky (Contributor, Author):

Hi @shangxinli; frankly, the only dependency left is the parquet-mr update. This PR depends on the "uniform encryption" feature, which is already merged in the parquet master but not released yet. We've started this discussion in the last parquet community sync; it's on track. Once a release is cut from the parquet-mr master (as 1.12.3 or 1.13.0 :), this Iceberg PR will be able to pass CI and will be ready for review.
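For reference, a minimal sketch of what that uniform-encryption mode looks like, assuming the withCompleteColumnEncryption() builder method merged into the parquet-mr master at the time (it encrypts all columns with the footer key, with no per-column key list); the class name here is hypothetical:

import org.apache.parquet.crypto.FileEncryptionProperties;

class UniformEncryptionSketch {
  static FileEncryptionProperties uniform(byte[] footerKeyBytes) {
    // Encrypt the footer and every column with the same (footer) key,
    // without enumerating columns; assumes the then-unreleased
    // withCompleteColumnEncryption() method from the parquet-mr master.
    return FileEncryptionProperties.builder(footerKeyBytes)
        .withCompleteColumnEncryption()
        .build();
  }
}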

@shangxinli (Contributor) commented Feb 8, 2022 via email

@ggershinsky ggershinsky marked this pull request as ready for review July 26, 2022 11:58
  parquetEncryptionAlgorithm = ParquetCipher.AES_GCM_V1; // default
  LOG.info("No encryption algorithm specified. Using Parquet default - AES_GCM_V1");
} else {
  EncryptionAlgorithm icebergEncryptionAlgorithm = nativeParameters.encryptionAlgorithm();
Contributor:

Nit: a switch would look simpler than an if block.
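A sketch of that suggestion; the EncryptionAlgorithm type and its values are assumed from this PR's surrounding code, not a published API:

import org.apache.parquet.crypto.ParquetCipher;

// Hypothetical helper illustrating the reviewer's nit: map the Iceberg
// algorithm to a Parquet cipher with a switch instead of an if block.
private static ParquetCipher toParquetCipher(EncryptionAlgorithm algorithm) {
  switch (algorithm) {
    case AES_GCM_V1:
      return ParquetCipher.AES_GCM_V1;
    case AES_GCM_CTR_V1:
      return ParquetCipher.AES_GCM_CTR_V1;
    default:
      throw new IllegalArgumentException("Unsupported encryption algorithm: " + algorithm);
  }
}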

}

ByteBuffer footerDataKey = nativeParameters.fileKey();
if (null == footerDataKey) {
Contributor:

The file key must not be null when we push down encryption, right? If so, can we move this check to NativeFileCryptoParameters#build? Then we won't need to check whether the file key is null every time we convert NativeFileCryptoParameters to a Parquet / ORC config.

Contributor Author:

I believe ORC works with column keys, not file keys. So this is a Parquet-specific check.

Contributor:

Got it, thanks for the explanation.

@@ -139,6 +153,13 @@ private WriteBuilder(OutputFile file) {
} else {
  this.conf = new Configuration();
}
if (file instanceof NativelyEncryptedFile) {
@pvary (Contributor) commented Jan 16, 2023:

Nit: add an empty line between blocks.


try (ParquetFileReader fileReader =
    newReader(
        file, ParquetReadOptions.builder().withDecryption(decryptionProperties).build())) {
Contributor:

Style: it would be better to create the new read options on a separate line.

Contributor Author:

I tried a couple of ways to break the line(s), but the spotless check didn't accept them; spotless apply always returns the code to this form.

Contributor:

I'm not talking about reformatting the line breaks; I mean create a variable:

ParquetReadOptions readOptions =
    ParquetReadOptions.builder().withDecryption(decryptionProperties).build();

try (ParquetFileReader fileReader = newReader(file, readOptions)) {
  ...
}

@rdblue (Contributor) commented Mar 11, 2023

@ggershinsky, this is looking good, but there are a few minor updates needed, and it needs more testing to exercise more of the Iceberg-specific code. You can check code coverage to see whether you're hitting all the areas you've changed.

Also, for the change to remove reflection, here's a diff since I explored that locally:

[blue@work iceberg]$ git diff
diff --git a/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
index af2fb0e80a..ee3a8f50be 100644
--- a/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
+++ b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
@@ -28,20 +28,16 @@ import org.apache.hadoop.conf.Configuration;
 import org.apache.iceberg.Metrics;
 import org.apache.iceberg.MetricsConfig;
 import org.apache.iceberg.Schema;
-import org.apache.iceberg.common.DynConstructors;
-import org.apache.iceberg.common.DynMethods;
 import org.apache.iceberg.io.FileAppender;
 import org.apache.iceberg.io.OutputFile;
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
-import org.apache.parquet.bytes.ByteBufferAllocator;
 import org.apache.parquet.column.ColumnWriteStore;
 import org.apache.parquet.column.ParquetProperties;
-import org.apache.parquet.column.page.PageWriteStore;
-import org.apache.parquet.column.values.bloomfilter.BloomFilterWriteStore;
 import org.apache.parquet.crypto.FileEncryptionProperties;
 import org.apache.parquet.crypto.InternalFileEncryptor;
 import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
 import org.apache.parquet.hadoop.ParquetFileWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.parquet.schema.MessageType;
@@ -50,25 +46,6 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
 
   private static final Metrics EMPTY_METRICS = new Metrics(0L, null, null, null, null);
 
-  private static final DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet =
-      DynConstructors.builder(PageWriteStore.class)
-          .hiddenImpl(
-              "org.apache.parquet.hadoop.ColumnChunkPageWriteStore",
-              CodecFactory.BytesCompressor.class,
-              MessageType.class,
-              ByteBufferAllocator.class,
-              int.class,
-              boolean.class,
-              InternalFileEncryptor.class,
-              int.class)
-          .build();
-
-  private static final DynMethods.UnboundMethod flushToWriter =
-      DynMethods.builder("flushToFileWriter")
-          .hiddenImpl(
-              "org.apache.parquet.hadoop.ColumnChunkPageWriteStore", ParquetFileWriter.class)
-          .build();
-
   private final long targetRowGroupSize;
   private final Map<String, String> metadata;
   private final ParquetProperties props;
@@ -81,7 +58,7 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
   private final OutputFile output;
   private final Configuration conf;
 
-  private DynMethods.BoundMethod flushPageStoreToWriter;
+  private ColumnChunkPageWriteStore pageStore = null;
   private ColumnWriteStore writeStore;
   private long recordCount = 0;
   private long nextCheckRecordCount = 10;
@@ -232,7 +209,7 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
         ensureWriterInitialized();
         writer.startBlock(recordCount);
         writeStore.flush();
-        flushPageStoreToWriter.invoke(writer);
+        pageStore.flushToFileWriter(writer);
         writer.endBlock();
         if (!finished) {
           writeStore.close();
@@ -253,8 +230,8 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
             props.getMaxRowCountForPageSizeCheck());
     this.recordCount = 0;
 
-    PageWriteStore pageStore =
-        pageStoreCtorParquet.newInstance(
+    this.pageStore =
+        new ColumnChunkPageWriteStore(
             compressor,
             parquetSchema,
             props.getAllocator(),
@@ -264,9 +241,7 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
             rowGroupOrdinal);
     this.rowGroupOrdinal++;
 
-    this.flushPageStoreToWriter = flushToWriter.bind(pageStore);
-    this.writeStore =
-        props.newColumnWriteStore(parquetSchema, pageStore, (BloomFilterWriteStore) pageStore);
+    this.writeStore = props.newColumnWriteStore(parquetSchema, pageStore, pageStore);
 
     model.setColumnStore(writeStore);
   }

@rdblue (Contributor) commented Apr 2, 2023

@ggershinsky this looks close. Just style and naming issues left.

@rdblue (Contributor) commented May 22, 2023

@ggershinsky, looks like this is out of date. Can you rebase?

@ggershinsky ggershinsky force-pushed the support-parquet-encryption branch 2 times, most recently from e391461 to 7c877d5 Compare May 24, 2023 07:43
update and clean up

update unitest

clean up

indent clean up

update read conf

style update

style update

isolate from Spark/PME configuration

refactor common encrypted IO classes for unitests

format fixes

spotless apply

use key metadata

address review comments

clean up

post-review changes

post-review changes 2

update method names in tests

separate read options line

conflict resolution

conflict resolution 2

conflict resolution 3

spotless fix

add decryption check
@ggershinsky (Contributor, Author):

@rdblue, the PR is rebased.

@rdblue rdblue merged commit aa1c1ef into apache:master May 25, 2023
@rdblue (Contributor) commented May 25, 2023

Thanks, @ggershinsky!

@ggershinsky ggershinsky deleted the support-parquet-encryption branch May 26, 2023 05:57
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024