[SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join

After reading the JIRA issue https://issues.apache.org/jira/browse/SPARK-12520, I double-checked the code.

For example, users can perform an equi-join like this:
  ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- Versions 1.4 and 1.5 have a bug: the code ignores the third parameter (the join type) that users pass, so the join always runs as `Inner`, even when the user specifies another type (e.g., `Outer`).
- After PR #8600, 1.6 no longer has this issue, but the description has not been updated.
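
To make the difference concrete, here is a minimal plain-Python sketch of a name-based equi-join. It models the semantics only and is not Spark's implementation. The `equi_join` helper and the sample rows `people` and `heights` are hypothetical; the rows are chosen to be consistent with the doctest output in this commit. Calling the helper without a join type mimics what 1.4 and 1.5 effectively ran, whatever the user passed.

```python
def _other_columns(rows, key):
    """Collect the non-key column names that appear in `rows`, in order."""
    cols = []
    for row in rows:
        for col in row:
            if col != key and col not in cols:
                cols.append(col)
    return cols


def equi_join(left, right, key, how="inner"):
    """Join two lists of dicts on one shared column name.

    Illustrative model only: supports how="inner" and how="outer"
    (full outer), and emits the key column once, as a name-based join does.
    """
    if how not in ("inner", "outer"):
        raise ValueError("unsupported join type: %r" % (how,))

    left_cols = _other_columns(left, key)
    right_cols = _other_columns(right, key)
    out = []
    matched_right = set()

    for l_row in left:
        hits = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        matched_right.update(hits)
        for i in hits:
            out.append({**l_row, **right[i]})
        if not hits and how == "outer":
            # Keep the unmatched left row, padding right-side columns with None.
            row = dict(l_row)
            row.update({col: None for col in right_cols})
            out.append(row)

    if how == "outer":
        for i, r_row in enumerate(right):
            if i not in matched_right:
                # Keep the unmatched right row, padding left-side columns with None.
                row = {col: None for col in left_cols}
                row.update(r_row)
                out.append(row)
    return out


# Hypothetical sample data, consistent with the doctest output in the diff.
people = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
heights = [{"name": "Tom", "height": 80}, {"name": "Bob", "height": 85}]

# Join type ignored: only the matching row (Bob) survives.
ignored = equi_join(people, heights, "name")

# Join type honored: unmatched rows from both sides are kept.
honored = equi_join(people, heights, "name", "outer")
```

Row order in the sketch follows the input lists; Spark makes no such promise, so compare results as sets or after sorting.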

I plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using an equi-join.
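
One possible shape for such a check, as a sketch only: the function name `check_equi_join_type` and the error wording are assumptions, not the code of the follow-up PR, and it is written for Python 3 (`str` rather than `basestring`).

```python
def check_equi_join_type(on, how):
    """Hypothetical guard for releases where name-based joins are always inner.

    Raises ValueError when `on` is a column name (or a list of names) and
    `how` asks for anything other than an inner join. Join expressions
    (Column objects) are left alone.
    """
    names = [on] if isinstance(on, str) else on
    is_name_join = (
        isinstance(names, (list, tuple))
        and len(names) > 0
        and all(isinstance(col, str) for col in names)
    )
    if is_name_join and how is not None and how.lower() != "inner":
        raise ValueError(
            "Equi-join on column names only supports 'inner' in this "
            "version, got %r" % (how,)
        )


# Passes silently: inner is the only type the buggy code path can honor.
check_equi_join_type("name", "inner")

# Fails loudly instead of silently running an inner join.
try:
    check_equi_join_type("name", "outer")
except ValueError as exc:
    message = str(exc)
```

Failing fast here trades a silent wrong result for an explicit error, which is the behavior the plan above describes.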

Author: gatorsmile <[email protected]>

Closes #10477 from gatorsmile/pyOuterJoin.
gatorsmile authored and davies committed Dec 28, 2015
1 parent 865dd8b commit b8da77e
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion python/pyspark/sql/dataframe.py
@@ -608,13 +608,16 @@ def join(self, other, on=None, how=None):
 :param on: a string for join column name, a list of column names,
 , a join expression (Column) or a list of Columns.
 If `on` is a string or a list of string indicating the name of the join column(s),
-the column(s) must exist on both sides, and this performs an inner equi-join.
+the column(s) must exist on both sides, and this performs an equi-join.
 :param how: str, default 'inner'.
 One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
 >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
 [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
+>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
+[Row(name=u'Tom', height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
 >>> cond = [df.name == df3.name, df.age == df3.age]
 >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
 [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]