
Upgrade to 0.161 #63

Merged
218 commits merged into twitter-forks:twitter-master on Jan 6, 2017
Conversation


@dabaitu dabaitu commented Jan 5, 2017

No description provided.

martint and others added 30 commits November 10, 2016 19:10
flushCache only makes sense in CachingHiveMetastore.
The original purpose of this function was to provide an exception-free
alternative to array and map subscript operators.

This change makes the array version consistent with the function that
operates on maps.
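The consistent behavior can be sketched with a couple of queries (an illustrative sketch, not taken from the PR itself):

```sql
-- The subscript operator fails for out-of-range indices:
SELECT ARRAY[1, 2, 3][5];                            -- error: out of bounds
-- element_at returns NULL instead, now for arrays as well as maps:
SELECT element_at(ARRAY[1, 2, 3], 5);                -- NULL
SELECT element_at(MAP(ARRAY['a'], ARRAY[1]), 'b');   -- NULL
```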
Converting the type to uppercase breaks type equality for row types as
the field names get uppercased.
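For example (a hedged sketch, not from the commit itself), field-name case participates in row type equality, so a declared type must keep its lowercase field names:

```sql
-- Uppercasing the declared type would yield ROW(X INT, Y VARCHAR),
-- which is not equal to ROW(x INT, y VARCHAR):
SELECT CAST(ROW(1, 'a') AS ROW(x INT, y VARCHAR));
```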
This allows Presto to start regardless of host resolution at startup. Previously, Presto would fail to start when any entry in cassandra.contact-points was a hostname (rather than an IP address) that could not be resolved.

This change postpones host resolution until the first query.
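An illustrative catalog configuration that benefits from this change (the hostnames are hypothetical):

```properties
# etc/catalog/cassandra.properties
connector.name=cassandra
# These hostnames are no longer resolved at server startup;
# resolution is deferred until the first query against the catalog.
cassandra.contact-points=cassandra-1.example.com,cassandra-2.example.com
```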
Rename symbols to match the actual columns instead of using the
alphabet. Alphabetic names are error prone and make it hard to merge
patches that add new columns.
Symbol unaliasing for ExchangeNode canonicalizes
symbols that are aliased in source nodes.

Plan after optimization:
presto:default> explain SELECT c.custkey FROM customer c, orders o WHERE c.custkey = o.custkey AND o.orderdate >= DATE '1994-01-01';
                                                                                  Query Plan
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[custkey] => [custkey:bigint]
     - RemoteExchange[GATHER] => custkey:bigint
         - Project => [custkey:bigint]
             - InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, $hashvalue:bigint, custkey_0:bigint, $hashvalue_15:bigint]
                 - RemoteExchange[REPARTITION] => custkey:bigint, $hashvalue:bigint
                     - Project => [custkey:bigint, $hashvalue_14:bigint]
                             $hashvalue_14 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("custkey"), 0))
                         - TableScan[hive:hive:default:customer, originalConstraint = true] => [custkey:bigint]
                                 LAYOUT: hive
                                 custkey := HiveColumnHandle{clientId=hive, name=custkey, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}
                 - RemoteExchange[REPARTITION] => custkey_0:bigint, $hashvalue_15:bigint
                     - Project => [$hashvalue_16:bigint, custkey_0:bigint]
                             $hashvalue_16 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("custkey_0"), 0))
                         - Filter[("orderdate" >= "$literal$date"(BIGINT '8766'))] => [custkey_0:bigint, orderdate:date]
                             - TableScan[hive:hive:default:orders, originalConstraint = ("orderdate" >= "$literal$date"(BIGINT '8766'))] => [custkey_0:bigint, orderdate:date]
                                     LAYOUT: hive
                                     custkey_0 := HiveColumnHandle{clientId=hive, name=custkey, hiveType=bigint, hiveColumnIndex=1, columnType=REGULAR}
                                     orderdate := HiveColumnHandle{clientId=hive, name=orderdate, hiveType=date, hiveColumnIndex=4, columnType=REGULAR}

Plan before optimization:
presto:default> explain SELECT c.custkey FROM customer c, orders o WHERE c.custkey = o.custkey AND o.orderdate >= DATE '1994-01-01';
                                                                                    Query Plan
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[custkey] => [custkey:bigint]
     - RemoteExchange[GATHER] => custkey:bigint
         - Project => [custkey:bigint]
             - InnerJoin[("custkey_8" = "custkey_9")] => [custkey:bigint, custkey_8:bigint, $hashvalue:bigint, custkey_0:bigint, custkey_9:bigint, $hashvalue_16:bigint]
                 - Project => [custkey:bigint, custkey_8:bigint, $hashvalue:bigint]
                     - RemoteExchange[REPARTITION] => custkey:bigint, custkey_8:bigint, $hashvalue:bigint, $hashvalue_14:bigint
                         - Project => [custkey:bigint, $hashvalue_15:bigint]
                                 $hashvalue_15 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("custkey"), 0))
                             - TableScan[hive:hive:default:customer, originalConstraint = true] => [custkey:bigint]
                                     LAYOUT: hive
                                     custkey := HiveColumnHandle{clientId=hive, name=custkey, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}
                 - Project => [custkey_0:bigint, custkey_9:bigint, $hashvalue_16:bigint]
                     - RemoteExchange[REPARTITION] => custkey_0:bigint, custkey_9:bigint, $hashvalue_16:bigint, $hashvalue_17:bigint
                         - Project => [$hashvalue_18:bigint, custkey_0:bigint]
                                 $hashvalue_18 := "combine_hash"(BIGINT '0', COALESCE("$operator$hash_code"("custkey_0"), 0))
                             - Filter[("orderdate" >= "$literal$date"(BIGINT '8766'))] => [custkey_0:bigint, orderdate:date]
                                 - TableScan[hive:hive:default:orders, originalConstraint = ("orderdate" >= "$literal$date"(BIGINT '8766'))] => [custkey_0:bigint, orderdate:date]
                                         LAYOUT: hive
                                         custkey_0 := HiveColumnHandle{clientId=hive, name=custkey, hiveType=bigint, hiveColumnIndex=1, columnType=REGULAR}
                                         orderdate := HiveColumnHandle{clientId=hive, name=orderdate, hiveType=date, hiveColumnIndex=4, columnType=REGULAR}
This reduces the number of unique symbols in the query plan
and allows other optimizations to be applied
(e.g., running multiple joins that operate on the same
partitions, after canonicalization, in the same stage).
The PredicatePushdown optimizer created unnecessary symbols for join
clauses.
Raghav Sethi and others added 27 commits December 12, 2016 12:45
This allows trivial queries to run even when the node is "out of
memory".
Previously they were uploaded from the 'PRODUCT_TESTS' job.
This allows us to keep the logs for restarted jobs.
We recently made a change to the column resolution rules for ORDER BY to
make them compliant with ANSI SQL. In order to ease the transition from
the old semantics, we now add a config option and session property that
controls the behavior.

The session property is "legacy_order_by". The config option is
"deprecated.legacy-order-by".
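For example, the legacy behavior can be re-enabled per session (a sketch based on the property names stated above):

```sql
SET SESSION legacy_order_by = true;
```

or cluster-wide by setting deprecated.legacy-order-by=true in the server's config.properties.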
The arguments were in the wrong order, so the value from
FeaturesConfig was not being used to control the default value.
A recent commit (ec2e897) changed
the way ORDER BY expressions are handled in a way that causes certain
expressions not to be "analyzed" and their types not to be recorded in the
Analysis object.

extractAggregates() looks for aggregations in node.getOrderBy()
and records them for later use by the planner. The new ORDER BY
analyzer processes the rewritten expressions, which have a different
object identity. As a result, the aggregates don't have associated
type and implicit coercion information for the planner to use.

This change makes it so that the aggregates are extracted from
the rewritten expressions.
Due to ec2e897, when analysis fails for certain
expressions, the error is misreported as happening in the SELECT clause
instead of in the ORDER BY clause. This is because the analyzer processes
the rewritten expressions, which contain inlined SELECT expressions and their
original locations.

This change fixes the issue by analyzing the original unmodified expressions
with a synthetic scope built from the output of the SELECT clause
that can delegate resolution to the source scope for missing names (essentially,
it implements the resolution rules per the SQL spec).

One side effect of this change is that queries whose ORDER BY clause references
columns that appear multiple times in the SELECT clause are now considered
invalid due to ambiguous references -- this matches the expected behavior
according to the ANSI spec.
This query shape is no longer valid due to ambiguous
column references.
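An illustrative example of such a query shape, using the tables from the plans above (the aliases are hypothetical):

```sql
-- Previously accepted; now rejected because "k" is ambiguous:
SELECT orderkey AS k, custkey AS k FROM orders ORDER BY k;
-- Fix: use distinct aliases (or ordinal positions):
SELECT orderkey AS k1, custkey AS k2 FROM orders ORDER BY k1;
```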
@billonahill (Collaborator) commented:

👍

@dabaitu dabaitu merged commit 0cf760d into twitter-forks:twitter-master Jan 6, 2017
Yaliang added a commit to Yaliang/presto that referenced this pull request Feb 3, 2017