Core: Add PartitionMap #9194

Merged 1 commit on Dec 7, 2023

aokolnychyi
Contributor

This PR adds PartitionMap, a map that uses a pair of spec ID and partition tuple as keys. It is similar to PartitionSet.

The class will simplify places like DeleteFileIndex, which currently carry the following code that is unrelated to their main logic.

private final Map<Integer, Types.StructType> partitionTypeById;
private final Map<Integer, ThreadLocal<StructLikeWrapper>> wrapperById;
private final Map<Pair<Integer, StructLikeWrapper>, DeleteFileGroup> deletesByPartition;

// use HashMap with precomputed values instead of thread-safe collections loaded on demand
// as the cache is being accessed for each data file and the lookup speed is critical
private Map<Integer, ThreadLocal<StructLikeWrapper>> wrappers(Map<Integer, PartitionSpec> specs) {
  Map<Integer, ThreadLocal<StructLikeWrapper>> wrappers = Maps.newHashMap();
  specs.forEach((specId, spec) -> wrappers.put(specId, newWrapper(specId)));
  return wrappers;
}

private ThreadLocal<StructLikeWrapper> newWrapper(int specId) {
  return ThreadLocal.withInitial(() -> StructLikeWrapper.forType(partitionTypeById.get(specId)));
}

private Pair<Integer, StructLikeWrapper> partition(int specId, StructLike struct) {
  ThreadLocal<StructLikeWrapper> wrapper = wrapperById.get(specId);
  return Pair.of(specId, wrapper.get().set(struct));
}
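
For illustration only, here is a minimal sketch of the idea: a two-level map that looks up a per-spec inner map first and the partition tuple second. It is not the code from this PR. `SimplePartitionMap` is a hypothetical name, and partition tuples are modeled as `List<Object>` rather than Iceberg's `StructLike`, so the `StructLikeWrapper` and `ThreadLocal` handling shown above is deliberately left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for PartitionMap (not the Iceberg class).
// Partition tuples are plain List<Object> values, which already define equals/hashCode.
class SimplePartitionMap<V> {
  // one inner map per spec ID, created on first put
  private final Map<Integer, Map<List<Object>, V>> mapsBySpecId = new HashMap<>();

  V put(int specId, List<Object> partition, V value) {
    return mapsBySpecId.computeIfAbsent(specId, id -> new HashMap<>()).put(partition, value);
  }

  V get(int specId, List<Object> partition) {
    Map<List<Object>, V> partitionMap = mapsBySpecId.get(specId);
    return partitionMap != null ? partitionMap.get(partition) : null;
  }

  int size() {
    // total number of mappings across all specs
    return mapsBySpecId.values().stream().mapToInt(Map::size).sum();
  }
}
```

With this shape, the same tuple under two spec IDs maps to two independent entries, which is the behavior callers such as DeleteFileIndex rely on when they key by spec ID and partition.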

github-actions bot added the core label Dec 1, 2023
aokolnychyi force-pushed the partition-map branch 2 times, most recently from 1e73a91 to 4b0da7f, December 1, 2023 23:11
}

private Map<StructLike, V> newPartitionMap(int specId) {
  PartitionSpec spec = specs.get(specId);
Contributor

What will happen if specs doesn't contain specId?

Contributor Author

We should fail in that case. These maps will be instantiated with table.specs() containing all known specs of the table. I can add a precondition with a better error message but I wasn't worried about that.

Contributor Author

I added a precondition with a proper error message.
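
As a rough sketch of what such a check could look like in plain Java (the actual change may use a different helper and wording): `SpecLookup`, `specFor`, and the message text are assumptions made up for this example, and the spec is modeled as a `String` instead of `PartitionSpec`.

```java
import java.util.Map;

// Hypothetical helper, not the code added in this PR.
class SpecLookup {
  // specs maps spec ID to a spec; a String stands in for PartitionSpec here
  static String specFor(Map<Integer, String> specs, int specId) {
    String spec = specs.get(specId);
    if (spec == null) {
      // fail fast with a descriptive message instead of a later NullPointerException
      throw new IllegalArgumentException("Cannot find spec with ID " + specId);
    }
    return spec;
  }
}
```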


@Override
public V get(Object key) {
  if (key instanceof Pair) {
Member

Optionally, we can throw an NPE on a null key. The API doesn't say we have to, but this map doesn't allow null keys, so we could do it.

Contributor Author

Let me check built-in maps that don't support null keys. We will throw an NPE in put. I am not sure what would be the best behavior on get so I followed PartitionSet.

Contributor Author

@aokolnychyi Dec 5, 2023

Some built-in maps actually throw an exception but mainly because they call hashCode on the provided key. I don't mind adding an extra check to see if the value is null.

Do you think that helps indicate something is not right, @RussellSpitzer?

Contributor Author

We will probably need to do the same for remove.

Member

I was just following the Java API description, which says to throw NPEs if you don't support null keys, but I really don't care either way. I think in our case we have no use case for a null pair, so we should probably just not allow it?

Contributor Author

Yeah, I'll add an exception to get and remove.
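
To make the trade-off in this thread concrete, here is a small sketch using only the JDK. `NullKeyDemo`, `NoNullKeyMap`, and the message text are hypothetical names for illustration, not what was committed. It contrasts `HashMap`, which accepts a null key on lookup, with `ConcurrentHashMap`, which throws, and shows an explicit check on `put`, `get`, and `remove`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

class NullKeyDemo {
  // Hypothetical wrapper that rejects null keys up front on every keyed operation.
  static class NoNullKeyMap<K, V> {
    private final Map<K, V> delegate = new HashMap<>();

    V put(K key, V value) {
      Objects.requireNonNull(key, "Key cannot be null");
      return delegate.put(key, value);
    }

    V get(Object key) {
      Objects.requireNonNull(key, "Key cannot be null");
      return delegate.get(key);
    }

    V remove(Object key) {
      Objects.requireNonNull(key, "Key cannot be null");
      return delegate.remove(key);
    }
  }

  // Returns true if running the action throws a NullPointerException.
  static boolean throwsNpe(Runnable action) {
    try {
      action.run();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  // ConcurrentHashMap does not support null keys and fails on lookup.
  static boolean concurrentHashMapRejectsNullLookup() {
    return throwsNpe(() -> new ConcurrentHashMap<String, String>().get(null));
  }

  // HashMap permits a null key, so the lookup just returns null.
  static boolean hashMapRejectsNullLookup() {
    return throwsNpe(() -> new HashMap<String, String>().get(null));
  }
}
```

The explicit check costs one null comparison per call and surfaces a misuse at the call site rather than silently returning null.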


@Override
public String toString() {
  return partitionMaps.entrySet().stream()
Member

Why aren't we using this.entrySet()?

Contributor Author

@aokolnychyi Dec 5, 2023

We could but I didn't do that for a few reasons:

  • Overhead of constructing entrySet as we need to create PartitionEntry for each mapping.
  • The need to look up PartitionSpec for every pair vs once per partition map right now.

private String toString(PartitionSpec spec, Entry<StructLike, V> entry) {
  StructLike struct = entry.getKey();
  V value = entry.getValue();
  return spec.partitionToPath(struct) + " -> " + (value == this ? "(this Map)" : value);
Member

Not sure I follow this string creation

/foo=1/ -> (this map)

Are we assuming we would get nested PartitionMaps?

Member

I would rather we just ban putting the map into itself?

Contributor Author

Banning would mean adding extra logic on each put, which I'd probably avoid (to be honest, it is highly unlikely to hit this in practice).

I copied this approach from Java AbstractMap that does a similar trick.

public String toString() {
    Iterator<Entry<K,V>> i = entrySet().iterator();
    if (! i.hasNext())
        return "{}";

    StringBuilder sb = new StringBuilder();
    sb.append('{');
    for (;;) {
        Entry<K,V> e = i.next();
        K key = e.getKey();
        V value = e.getValue();
        sb.append(key   == this ? "(this Map)" : key);
        sb.append('=');
        sb.append(value == this ? "(this Map)" : value);
        if (! i.hasNext())
            return sb.append('}').toString();
        sb.append(',').append(' ');
    }
}

Member

We can keep with tradition
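
For readers unfamiliar with the JDK behavior referenced in this thread, a minimal demonstration with a plain `HashMap` (which inherits `toString` from `AbstractMap`). `SelfReferenceDemo` is just an illustrative name.

```java
import java.util.HashMap;
import java.util.Map;

class SelfReferenceDemo {
  static String describe() {
    Map<String, Object> map = new HashMap<>();
    // the map now contains itself as a value
    map.put("self", map);
    // AbstractMap.toString substitutes "(this Map)" for the map itself instead of recursing
    return map.toString(); // "{self=(this Map)}"
  }
}
```

Without the `value == this` check, printing a map that contains itself would recurse until the stack overflows.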

otherMap.put(Pair.of(BY_DATA_SPEC.specId(), Row.of("bbb")), "v2");
map.putAll(otherMap);

assertThat(map)
Member

Would be a good time to also test the equals method

Contributor Author

Added below.

assertThat(map).doesNotContainKey(Pair.of(1, Row.of(1))).doesNotContainValue("value");
assertThat(map.values()).isEmpty();
assertThat(map.keySet()).isEmpty();
assertThat(map.entrySet()).isEmpty();
Member

Good spot to check equals

Contributor Author

Added a separate test case to check both empty and non-empty cases.

map.put(BY_DATA_SPEC.specId(), struct(BY_DATA_SPEC, "data", "aaa"), "value");
assertThat(map)
    .containsEntry(Pair.of(BY_DATA_SPEC.specId(), struct(BY_DATA_SPEC, "data", "aaa")), "value")
    .containsEntry(Pair.of(BY_DATA_SPEC.specId(), Row.of("aaa")), "value");
Member

This might be another good spot to do an "equals" check with two maps with the same specs but different structlikes with the same data?

}

@Test
public void testConcurrencyReadAccess() throws InterruptedException {
Member

nit: Concurrency -> Concurrent

Member

@RussellSpitzer left a comment

Looks good to me, left a few more suggestions but that's at your discretion

@@ -32,7 +33,7 @@
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.types.Types;

-public class PartitionSet implements Set<Pair<Integer, StructLike>> {
+public class PartitionSet extends AbstractSet<Pair<Integer, StructLike>> {
Contributor Author

Done for proper equality.
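
A short sketch of why the base class matters: a class that only implements `Set` inherits identity-based `equals` and `hashCode` from `Object` unless it writes them by hand, while `AbstractSet` supplies content-based versions built on `iterator()` and `size()`. `ArrayBackedSet` below is a hypothetical toy class, unrelated to PartitionSet's internals.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical toy set; it assumes the caller passes distinct elements.
class ArrayBackedSet<E> extends AbstractSet<E> {
  private final List<E> elements;

  @SafeVarargs
  ArrayBackedSet(E... elements) {
    this.elements = Arrays.asList(elements);
  }

  @Override
  public Iterator<E> iterator() {
    return elements.iterator();
  }

  @Override
  public int size() {
    return elements.size();
  }

  // equals and hashCode are inherited from AbstractSet and compare contents,
  // so this set is equal to any other Set holding the same elements
}
```

For example, `new ArrayBackedSet<>("a", "b")` compares equal to a `HashSet` containing `"b"` and `"a"`, and both report the same hash code.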

aokolnychyi merged commit 6a9d3c7 into apache:main Dec 7, 2023
45 checks passed
@aokolnychyi
Contributor Author

Thanks for reviewing, @qqqttt123 @jerqi @RussellSpitzer!

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
4 participants