Add NaN counter to Metrics and implement in Parquet writers #1641

Merged: 9 commits merged into apache:master on Nov 12, 2020

Conversation

@yyanyy (Contributor, Author) commented Oct 22, 2020

This change adds a NaN counter to the Metrics model and updates it during Parquet writing. I believe it only touches internal models and does not write the new attribute to output files. This is the first step towards implementing the spec change defined in #348.
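At a high level, the value writers count NaN values as they flow through and expose the count as per-field metrics afterwards, roughly along these lines (a simplified sketch, not the exact classes in this PR):

// Simplified sketch of the idea (hypothetical class, not the PR's code): a float writer
// counts NaN values as they are written and exposes the count for the Metrics model.
class FloatNaNCountingWriter {
  private final int id;          // Iceberg field id
  private long nanValueCount = 0;

  FloatNaNCountingWriter(int id) {
    this.id = id;
  }

  void write(float value) {
    if (Float.isNaN(value)) {
      nanValueCount += 1;        // tracked in memory; surfaced via Metrics, not written to the data file
    }
    // ... delegate to the underlying Parquet column writer here
  }

  long nanValueCount() {
    return nanValueCount;
  }
}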

Questions:

  • As mentioned in a comment I highlighted, SparkTableUtil (essentially importSparkTable()) (link) reads metrics from the Parquet footer directly, and thus won't populate NaN counts. If we don't want to accept that limitation, we may need to switch ParquetFileReader.readFooter() to the internal Parquet reader and enable metric collection during reading, but this could be much more expensive than the current approach. Do people have better suggestions?
  • The current change doesn't help with removing NaN from the lower/upper bounds, since the Parquet library doesn't treat NaN specially in its min/max stats. I'm thinking of using the same approach to populate the upper and lower bounds, and wondering if people have better suggestions.

Wanted to submit a draft to gather comments on the approach in general. Will add more tests later.

Comment on lines 75 to 77:

public static Metrics fileMetrics(InputFile file, MetricsConfig metricsConfig, NameMapping nameMapping) {
  try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(file))) {
-   return footerMetrics(reader.getFooter(), metricsConfig, nameMapping);
+   return footerMetrics(reader.getFooter(), Stream.empty(), null, metricsConfig, nameMapping);
Contributor Author:

This has a similar problem to the one I mentioned in the PR description for importing Spark tables: if the file itself is passed in directly, there's not much chance to get the additional metrics tracked by the value writers. Currently fileMetrics is only used by tests. Do people have suggestions on this?

Contributor:

In that case, we should just set the NaN count map to null. I don't think that we want to scan imported files to create these metrics. Also, I believe that we can rely on recent Parquet versions to not produce min or max values that are NaN, so it should be safe to use these as long as we check that they are not NaN.

Contributor Author:

Thank you! I'll leave this as Stream.empty for now.

Regarding relying on recent Parquet versions to not produce min or max values that are NaN, it sounds like that also answers my question in the PR description (i.e. we won't follow this approach to populate the upper and lower bounds). In my current code base it looks like Parquet still gives us NaN as the max; do you happen to have a reference to the Parquet version that handles NaN properly? From a quick search I wasn't able to find it; I noticed this but I suspect it's for something else. I'll look into it more deeply if you don't have it handy.

so it should be safe to use these as long as we check that they are not NaN.

Sorry, just to make sure I understand this correctly: it sounds like only the following three cases will be valid:

  1. v1 table, no NaN counter, min/max could have NaN - use the existing logic, we can't do much about min/max==NaN
  2. v2 table, NaN counter exists, min/max will not be NaN - in this case metrics are produced by iceberg writer
  3. v2 table, no NaN counter, min/max will not be NaN - in this case the file is imported or from this fileMetrics

Then to accommodate (3), the evaluators will have to remember that the absence of a NaN counter doesn't necessarily mean there are no NaN values in the column; but that might be fine since we will need this logic to accommodate (1) as well (unless we implement the evaluators in a way that can differentiate v1/v2 tables; not sure if we want that).
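In evaluator terms that would look roughly like this (an illustrative sketch only; the map and helper names are assumed, not actual evaluator code):

import java.util.Map;

// Illustrative sketch: a missing NaN count means "unknown", not "no NaN values".
static boolean mayContainNaN(Map<Integer, Long> nanCounts, int fieldId) {
  if (nanCounts == null || !nanCounts.containsKey(fieldId)) {
    return true;   // no information: assume the column may contain NaN (v1 or imported files)
  }
  return nanCounts.get(fieldId) > 0;
}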

Contributor:

I don't have a reference for the Parquet fix. I think I was in a sync where it was discussed. Maybe we should generate our own lower/upper bounds for Parquet then.

Float.compare will sort NaN values last, so if we do get max=NaN from Parquet our evaluators should still work as expected. It will just include a much larger range than necessary. If we can generate better stats for table metadata, then that would be ideal.
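For illustration (not part of the PR), Float.compare orders NaN after every other value:

// Quick illustration: Float.compare treats NaN as larger than any non-NaN value.
public class NaNCompareExample {
  public static void main(String[] args) {
    System.out.println(Float.compare(Float.NaN, Float.MAX_VALUE));          // positive: NaN sorts last
    System.out.println(Float.compare(1.0f, Float.NaN));                     // negative
    System.out.println(Float.compare(Float.NaN, Float.POSITIVE_INFINITY));  // positive
  }
}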

Your cases look correct, except that I would say that we expect only max=NaN or min=max=NaN from Parquet. Using the existing logic should be okay.

I agree that the lack of a NaN counter means that the value is unknown. This is the case for all files written to v1 tables. V2 tables will generally have the NaN counter values, but not in all cases (imported files).

Contributor Author:

Thank you! I guess I'll revisit the min/max stats for Parquet after the NaN-related code is mostly done.

@@ -146,8 +146,13 @@ public UTF8String getUTF8String(int ordinal) {

Contributor Author:

The change to this class was mostly to allow using it in TestSparkParquetMergingMetrics. Currently this class is only used for reading rows from metadata tables (in RowDataReader; I think only metadata tables produce DataTask).

In the tests I wanted to convert Record to InternalRow for testing the Spark appender. I was debating whether I should expand this class beyond its current usage or write a new converter. Here are the two things I needed to change to (partially) implement the former:

  1. Null handling, which results in the change to get(): RowDataReader doesn't call get() directly (it uses Dyn reflection to read individual attributes and skips nulls) when converting into another Spark internal row representation (UnsafeRow in this case), so we didn't see an issue. However, when the Spark Parquet writer uses get(), an NPE occurs without this change. Note that even after this change, other uses of this class (e.g. getUTF8String()) are still not null safe, and I wonder if people have an opinion on whether we want full null-safe support across all methods in this class.
  2. The getBinary() change: currently we convert the fixed type to binary for Spark (link), the method I used for creating random records generates the fixed type as byte[] (link), and before this change getBinary() didn't work for the byte[] representation of the fixed type. Alternatively we could wrap the fixed type in the random record generator the same way we do for the binary type. I decided to do the former to allow binary-related types to be more flexible when wrapped in this class, but I guess this comes back to the question of whether we want to evolve this class or create a separate converter.

Contributor:

For #1, I'm fine adding the null check since this isn't used in a high performance code path, but it would be better to have the code that uses this call isNullAt directly because that is the contract for Spark's InternalRow.

For #2, since getBinary is called for both fixed and binary types, I think we do need to check the type of the object. I'd much rather do that using instanceof rather than catching an exception. Can you update it to use struct.get(ordinal, Object.class) and then check the type of the object returned?
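Something along these lines (a rough sketch of the suggestion, not the final code; the ByteBuffer branch and the ByteBuffers helper are assumptions):

@Override
public byte[] getBinary(int ordinal) {
  // Sketch: fetch the value without a target type and branch on its runtime representation.
  // Assumes java.nio.ByteBuffer and org.apache.iceberg.util.ByteBuffers are imported.
  Object bytes = struct.get(ordinal, Object.class);
  if (bytes instanceof byte[]) {
    return (byte[]) bytes;                                // fixed values generated as byte[]
  } else if (bytes instanceof ByteBuffer) {
    return ByteBuffers.toByteArray((ByteBuffer) bytes);   // binary values stored as ByteBuffer
  } else {
    throw new IllegalArgumentException("Unsupported binary representation: " + bytes.getClass());
  }
}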

Contributor Author:

Thanks for the information!

For 1, thanks for the info! I wasn't aware of the contract.

I wasn't sure whether we want to add isNullAt for this specific case though, as I guess the problem comes from a difference in behavior between InternalRow and StructInternalRow, and adding isNullAt might have a larger performance penalty in production.

The behavior difference comes from calling internalRow.get(), which can return null. The NPE eventually comes from struct.get(index, types[index]) in SparkParquetWriters. While an actual InternalRow can return null for a null column, StructInternalRow assumes the underlying get() returns non-null and sometimes performs actions on the value that can lead to an NPE, e.g. calling toString for conversion or returning it as a primitive type directly. Thus SparkParquetWriters was able to call get() fine under normal circumstances, but it couldn't for this specific usage of StructInternalRow in the test.

For 2, that's a better idea, will do this instead.

Contributor:

For 1, let's just add the check. This isn't used in a performance-critical path.

Contributor:

This looks good to me.

@@ -28,4 +30,7 @@
List<TripleWriter<?>> columns();

void setColumnStore(ColumnWriteStore columnStore);

Stream<FieldMetrics> metrics();
Contributor:

Documentation needed. And is the name a bit too generic?

Contributor Author:

I'll add documentation. For the name, I think ParquetValueWriter already implies that it's field-specific, so I guess metrics itself should convey the idea that these are field-specific metrics. Do you have a better suggestion?

assertBounds(1, IntegerType.get(), null, null, metrics);
}

@Test
public void testMetricsForNaNColumns() throws IOException {
@rdblue (Contributor) commented Nov 1, 2020:

There are a few cases to consider with NaN values because comparison with NaN is always false.

Here are a couple of implementations that have issues because they use comparison without checking for NaN:

// max is NaN for values [NaN, 1.0, 1.1]
Float max = null;
for (Float value : values) {
  if (max == null || max < value) {
    max = value;
  }
}

// max ends up NaN when the last value is NaN
Float max = null;
for (Float value : values) {
  max = (max != null && max >= value) ? max : value;
}

Because the failure cases are different, I think we should test a few different cases:

  • A column starts with NaN
  • A column contains NaN in the middle
  • A column ends with NaN
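
For contrast, a NaN-safe variant (a quick sketch, not code from this PR) skips NaN before comparing and counts it separately:

// Sketch only: NaN never participates in the comparison, so the bound stays correct
// regardless of where NaN appears in the values.
static Float nanSafeMax(Iterable<Float> values) {
  Float max = null;
  for (Float value : values) {
    if (value.isNaN()) {
      continue;                // NaN is counted by the NaN counter, not used as a bound
    }
    if (max == null || value > max) {
      max = value;
    }
  }
  return max;                  // null if every value was NaN
}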

Contributor Author:

Thanks for pointing this out! I guess this tests the bounds more than the NaN count, but it will be very helpful when we start to exclude NaN from the upper/lower bounds. Will update.


return fieldMetrics
    .filter(metrics -> {
      String alias = inputSchema.idToAlias(metrics.getId());
Contributor:

Alias isn't what you want to use here. An alias is the file schema's name when an Iceberg schema is converted from a file schema. Parquet and Avro don't allow special characters in field names, or a column's name may have changed after a file is written. In both cases, a file schema's names won't match the schema. The alias map exposes the original file field names for when we need to use them (e.g., get a page reader for a column from the file).

In this case, we want to use the table schema's name, not a file schema's name. That's why we use findColumnName above. That should work here as well.

Contributor Author:

Thank you for the explanation!


@Override
public Stream<FieldMetrics> metrics() {
  if (id != null) {
Contributor:

I think this should always produce the metric. Field IDs are required to write, so we are guaranteed that they are always present (or should fail if one is not). And as long as this is always gathering the metric, we may as well return it.
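Roughly like this (a sketch; the FieldMetrics constructor shape here is assumed, not taken from the PR):

// Sketch: field ids are required at write time, so return the gathered metric unconditionally.
// Assumes java.util.stream.Stream is imported and nanValueCount is tracked by the writer.
@Override
public Stream<FieldMetrics> metrics() {
  return Stream.of(new FieldMetrics(id, nanValueCount));
}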

}
}

private static class DoubleWriter extends UnboxedWriter<Double> {
Contributor:

Same comments as for the float case above.

* It shouldn't be used in production; {@link org.apache.iceberg.parquet.ParquetWriter} is a better alternative.
* @deprecated use {@link org.apache.iceberg.parquet.ParquetWriter}
*/
@Deprecated
Contributor:

+1

@yyanyy marked this pull request as ready for review November 5, 2020 02:10

@yyanyy (Contributor, Author) commented Nov 5, 2020:

Thank you for all the comments! I think I have addressed all the feedback and rebased the change. I have also removed the draft tag from the PR.

@yyanyy yyanyy requested a review from rdblue November 5, 2020 02:15

public abstract InputFile writeRecords(Schema schema, Record... records) throws IOException;
public abstract Metrics getMetrics(Schema schema, Record... records) throws IOException;
Contributor:

This refactor seems to have introduced a lot of changes. Is it needed? Seems like it may just introduce conflicts.

Contributor Author:

I think the main reason for the refactoring is that, before this change, we had writeRecords create an appender and return an InputFile, and then in Parquet-specific tests we used ParquetUtil.fileMetrics to read metrics directly from the InputFile's footer. But now that NaN is tracked during writing, to test NaN we need to test against appender.metrics().
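The testing pattern then looks roughly like this (names are illustrative, not the PR's exact test code):

// Sketch: collect Metrics from the appender itself rather than re-reading the Parquet footer.
// newAppender(schema) is a hypothetical helper that builds a FileAppender<Record>.
FileAppender<Record> appender = newAppender(schema);
try {
  appender.addAll(records);
} finally {
  appender.close();
}
Metrics metrics = appender.metrics();   // NaN counts are available here after close()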

Contributor:

Makes sense! Looks like we need to keep it then.

@rdblue (Contributor) commented Nov 11, 2020:

@yyanyy, thank you for the update! Nothing major to fix. Overall the changes look good, but I found a couple of minor things that might reduce the size of this PR. Thanks!

// behaviors differ due to their implementation of comparison being different.
if (fileFormat() == FileFormat.ORC) {
  assertBounds(1, FloatType.get(), Float.NaN, Float.NaN, metrics);
  assertBounds(2, DoubleType.get(), Double.NaN, Double.NaN, metrics);
Contributor:

NaN as an upper bound should be safe, but NaN as a lower bound may not be. Does this mean we need to fix our evaluators to check for NaN?

Contributor:

@omalley and @shardulm94, FYI. Looks like we are getting unexpected bounds for some ORC cases with NaN.

Contributor Author:

I guess this means that today we may skip an ORC file for predicates that use bounds when the column being evaluated contains non-NaN data but both the upper and lower bounds are NaN, which happens when the field of the first record in the file is NaN. Is my understanding correct? I can create an issue about this.

Contributor:

Yeah, an issue with a test case would be great! Thank you!

Contributor Author:

Forgot to do that yesterday: #1761

@rdblue (Contributor) commented Nov 12, 2020:

Thanks for the update, @yyanyy! I'll merge this.

@rdblue rdblue merged commit 944a437 into apache:master Nov 12, 2020