Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame #33214

Closed
wants to merge 9 commits into from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jul 5, 2021

What changes were proposed in this pull request?

Currently, inferring nested structs is always using MapType.

This behavior causes an issue because it infers the schema with a value type of the first field of the struct as below:

data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show(truncate=False)
+--------------------------------+
|inside_struct                   |
+--------------------------------+
|{name -> null, payment -> 100.5}|
+--------------------------------+

The "name" became null, but it should've been "Lee".

In this case, we need to be able to infer the schema with a StructType instead of a MapType.

Therefore, this PR proposes adding an new configuration spark.sql.pyspark.inferNestedDictAsStruct.enabled to handle which type is used for inferring nested structs.

  • When spark.sql.pyspark.inferNestedDictAsStruct.enabled is false (by default), inferring nested structs by MapType
  • When spark.sql.pyspark.inferNestedDictAsStruct.enabled is true, inferring nested structs by StructType

Why are the changes needed?

Because always inferring the nested structs by MapType doesn't work properly for some cases.

Does this PR introduce any user-facing change?

New configuration spark.sql.pyspark.inferNestedDictAsStruct.enabled is added.

How was this patch tested?

Added an unit test

@SparkQA
Copy link

SparkQA commented Jul 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45156/

@SparkQA
Copy link

SparkQA commented Jul 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45156/

@SparkQA
Copy link

SparkQA commented Jul 5, 2021

Test build #140644 has finished for PR 33214 at commit a750b17.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Test build #140689 has finished for PR 33214 at commit 0ce96fa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45199/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45202/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45199/

@HyukjinKwon HyukjinKwon changed the title [SPARK-35929][PYTHON] Schema inference of nested structs defaults to map [SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame Jul 6, 2021
@HyukjinKwon
Copy link
Member

Can you also update the PR description?

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45202/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Test build #140691 has finished for PR 33214 at commit 048d222.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Test build #140706 has finished for PR 33214 at commit d81b3d3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45217/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45217/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45230/

@SparkQA
Copy link

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45230/

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise

@itholic
Copy link
Contributor Author

itholic commented Jul 7, 2021

Thanks! :)

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 7, 2021

cc @viirya or @ueshin would you mind taking a quick look please? This is ready for a review.

Comment on lines +1030 to +1034
for key, value in obj.items():
if key is not None and value is not None:
return MapType(_infer_type(key, infer_dict_as_struct),
_infer_type(value, infer_dict_as_struct), True)
return MapType(NullType(), NullType(), True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to log warning if inferred value types are not inconsistent? We can recommend users to use the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! :)
Actually PySpark merging one only handles null cases only (that's called out here) at

def _merge_type(a, b, name=None):
if name is None:
new_msg = lambda msg: msg
new_name = lambda n: "field %s" % n
else:
new_msg = lambda msg: "%s: %s" % (name, msg)
new_name = lambda n: "field %s in %s" % (n, name)
if isinstance(a, NullType):
return b
elif isinstance(b, NullType):
return a
elif type(a) is not type(b):
# TODO: type cast (such as int -> long)
raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
# same type
if isinstance(a, StructType):
nfs = dict((f.name, f.dataType) for f in b.fields)
fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()),
name=new_name(f.name)))
for f in a.fields]
names = set([f.name for f in fields])
for n in nfs:
if n not in names:
fields.append(StructField(n, nfs[n]))
return StructType(fields)
elif isinstance(a, ArrayType):
return ArrayType(_merge_type(a.elementType, b.elementType,
name='element in array %s' % name), True)
elif isinstance(a, MapType):
return MapType(_merge_type(a.keyType, b.keyType, name='key of map %s' % name),
_merge_type(a.valueType, b.valueType, name='value of map %s' % name),
True)
else:
return a

It actually fails for different types (unlike JSON or CSV type inference).
I am not sure what's the ideal behavior for the null case pointed out here though.
Let me separate it from this PR in any event if you're fine.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay to me. Just one minor suggestion.

@SparkQA
Copy link

SparkQA commented Jul 7, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45234/

@SparkQA
Copy link

SparkQA commented Jul 7, 2021

Test build #140719 has finished for PR 33214 at commit 35dacda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 7, 2021

Test build #140723 has finished for PR 33214 at commit 52a9a70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants