upgrade to 0.157 #57

Merged
merged 1,219 commits
Dec 2, 2016

Conversation


@dabaitu dabaitu commented Dec 1, 2016

upgrade to 0.157 part 2 of 3

part 1 - remove old twitter event scriber impl
part 2 - upgrade to oss 0.157
part 3 - add new twitter event scriber

martint and others added 30 commits October 16, 2016 11:17
When one side of a join has an effective predicate expression in terms of the field
used in the join criteria (e.g., v = f(k1), with a join criteria of k1 = k2), and
that expression can produce null on non-null input (e.g., nullif, case, if, most of
the array/map functions, etc.), queries can produce incorrect results.

In that scenario, predicate pushdown derives another join condition v = f(k2). Since
f() can produce null on non-null input, it is possible that for some value of k1 that
is equal to k2, f(k2) or f(k1) is null. This causes the join criteria to
evaluate to null instead of true.

A correct derivation, although less useful for predicate pushdown, would be

    k1 = k2 AND ((f(k1) IS NULL AND f(k2) IS NULL) OR f(k1) = f(k2)).

This change prevents the equality inference logic from considering expressions
that may return null on non-null input.
CAST(JSON 'null' AS ...) will also return null
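
For illustration, a minimal SQL sketch of the problematic shape (the tables lhs(k1) and rhs(k2) are hypothetical, and nullif stands in for f):

    -- nullif(x, 0) returns NULL for the non-NULL input 0, so it is one of the
    -- expressions the equality inference logic must now ignore.
    SELECT *
    FROM (SELECT k1, nullif(k1, 0) AS v FROM lhs) l
    JOIN rhs r ON l.k1 = r.k2;

    -- The unsafe derivation adds the join condition v = nullif(k2, 0). For rows
    -- where k1 = k2 = 0, that condition evaluates to NULL rather than TRUE, so
    -- rows the original join would have kept are dropped.
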
This version avoids allocating arrays that are beyond the JVM limit.
Currently only the ordering column is being printed
in the Explain plan output for Window nodes. It is
also desirable to know what ordering is used for each
of those columns.
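
For example, in an illustrative query against the built-in TPC-H connector, the Window node orders by two columns in different directions, so printing only the column names loses information:

    EXPLAIN
    SELECT orderkey,
           rank() OVER (PARTITION BY orderstatus
                        ORDER BY totalprice DESC, orderdate ASC) AS rnk
    FROM tpch.tiny.orders;
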
Other queries could time out because they were abandoned, causing the test
to fail.
Rename fields in ORC dictionary reader to make it clear if the field is
used for the stripe dictionary or row group dictionary.
Always create dictionary blocks in DRWF for columns using a row group
dictionary. This prevents expansion of the dictionary which can create
a very large block.
Simplify the materialization of connectors in ConnectorManager.
Acquire transaction handle in SystemConnector lazily to avoid accessing
the transaction manager during begin transaction.
ArturGajowy and others added 27 commits November 9, 2016 08:12
This fixes a regression from the previous commit.
Test and test utility methods were declared to throw Exception even though
no exception could be thrown.
It is odd that a method receives an already rewritten node (rewrittenNode).
Adding a test for b19d3df
("Fix base for counter in AssignUniqueIdOperator"). Without the mentioned
commit, the added test fails.
This is a rewrite of the partial aggregation pushdown
optimizer to make the code easier to follow and reason
about.

The approach is as follows:
1. Determine whether the optimization is applicable.
   At a minimum, there must be an aggregation on top
   of an exchange.
2. If the aggregation is SINGLE, split it into a FINAL
   on top of a PARTIAL and reprocess the resulting plan.
3. If the aggregation is a PARTIAL, push it underneath
   each branch of the exchange.

We use a couple of tricks to avoid having to juggle
and rename field names as the nodes are rewired:

1. When pushing the partial aggregation through the exchange,
   the names of the outputs of the aggregation are preserved.
2. If the input->output mappings in the exchange are not
   simple identity projections without rename, we introduce
   a projection under the partial aggregation. This helps
   avoid having to rewrite all the aggregation functions
   to refer to new names.

It also fixes a planning issue under certain scenarios
involving aggregation subqueries and partitioned tables.

E.g.,

    SELECT *
    FROM (
        SELECT count(*)
        FROM tpch.tiny.orders
        HAVING count(DISTINCT custkey) > 1
    )
    CROSS JOIN t

where "t" is a partitioned Hive table.
82620d9 caused a regression
when scheduling non-remotely accessible splits, bucketed splits, or
splits when network-aware scheduling was used.
Now that we use a low watermark to trigger scheduling, we don't want to
reserve too much space for splits with network affinity; otherwise the
scheduler may have to run too frequently when splits have little to no affinity.
These connectors use non-canonical types for varchar columns
in TPC-H, so the output doesn't match. Disable the tests for now.
Don't wait for deletion executor if there are no rows to delete.
@billonahill
Collaborator

👍 assuming all tests pass.

@dabaitu
Author

dabaitu commented Dec 2, 2016

all tests pass

@dabaitu dabaitu merged commit 16db1d7 into twitter-forks:twitter-master Dec 2, 2016