[SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame #33214

itholic · 2021-07-05T08:13:43Z

What changes were proposed in this pull request?

Currently, a nested dict is always inferred as a MapType.

This causes a problem because the map's value type is inferred only from the first field of the nested dict, as shown below:

data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show(truncate=False)
+--------------------------------+
|inside_struct                   |
+--------------------------------+
|{name -> null, payment -> 100.5}|
+--------------------------------+

The "name" became null, but it should've been "Lee".

In this case, we need to be able to infer the schema with a StructType instead of a MapType.
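The null above can be reproduced with a simplified, pure-Python model of map-style inference. This is only a sketch: `infer_map_value_type` and `coerce_map` are hypothetical helpers, not PySpark functions, and the rule that a value of the wrong type becomes null is an assumption based on the output shown above.

```python
def infer_map_value_type(d):
    """Return the Python type of the first non-None value.

    This mimics taking a map's value type from a single entry.
    """
    for key, value in d.items():
        if key is not None and value is not None:
            return type(value)
    return type(None)


def coerce_map(d):
    """Keep values that match the inferred value type; null out the rest."""
    value_type = infer_map_value_type(d)
    return {k: (v if isinstance(v, value_type) else None) for k, v in d.items()}


row = {"payment": 100.5, "name": "Lee"}
print(infer_map_value_type(row).__name__)  # float
print(coerce_map(row))  # {'payment': 100.5, 'name': None}
```

Because a map has a single value type, whichever entry comes first decides it, and every entry of another type is lost. A struct avoids this by giving each field its own type.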

Therefore, this PR proposes adding a new configuration, spark.sql.pyspark.inferNestedDictAsStruct.enabled, to control which type is used when inferring nested dicts.

When spark.sql.pyspark.inferNestedDictAsStruct.enabled is false (the default), nested dicts are inferred as MapType.
When spark.sql.pyspark.inferNestedDictAsStruct.enabled is true, nested dicts are inferred as StructType.
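The two modes can be illustrated with a toy inference function. This is a minimal sketch, not the code in python/pyspark/sql/types.py: `infer_type` is a hypothetical stand-in that returns readable type strings instead of Spark DataType objects, and its `infer_dict_as_struct` parameter plays the role of the configuration value.

```python
def infer_type(obj, infer_dict_as_struct=False):
    """Toy schema inference that returns a readable type string."""
    if obj is None:
        return "null"
    if isinstance(obj, dict):
        if infer_dict_as_struct:
            # Struct mode: every key becomes a field with its own type.
            fields = ", ".join(
                "%s: %s" % (k, infer_type(v, infer_dict_as_struct))
                for k, v in obj.items()
            )
            return "struct<%s>" % fields
        # Map mode: the first non-null pair decides key and value types.
        for k, v in obj.items():
            if k is not None and v is not None:
                return "map<%s, %s>" % (
                    infer_type(k, infer_dict_as_struct),
                    infer_type(v, infer_dict_as_struct),
                )
        return "map<null, null>"
    atomic = {bool: "boolean", int: "long", float: "double", str: "string"}
    return atomic.get(type(obj), "unknown")


nested = {"payment": 100.5, "name": "Lee"}
print(infer_type(nested))  # map<string, double>
print(infer_type(nested, infer_dict_as_struct=True))  # struct<payment: double, name: string>
```

In map mode the string value of "name" has no place in the inferred type, while struct mode records a type per field.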

Why are the changes needed?

Because always inferring nested dicts as MapType doesn't work properly in some cases, such as when the values have different types.

Does this PR introduce any user-facing change?

A new configuration, spark.sql.pyspark.inferNestedDictAsStruct.enabled, is added.

How was this patch tested?

Added a unit test.


python/pyspark/sql/types.py

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

python/pyspark/sql/tests/test_types.py

SparkQA · 2021-07-05T09:42:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45156/

SparkQA · 2021-07-05T10:14:49Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45156/

SparkQA · 2021-07-05T12:53:41Z

Test build #140644 has finished for PR 33214 at commit a750b17.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

python/pyspark/sql/tests/test_types.py

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

SparkQA · 2021-07-06T07:07:35Z

Test build #140689 has finished for PR 33214 at commit 0ce96fa.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-06T07:45:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45199/

SparkQA · 2021-07-06T08:00:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45202/

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

SparkQA · 2021-07-06T08:19:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45199/

python/pyspark/sql/tests/test_types.py

HyukjinKwon · 2021-07-06T08:22:59Z

Can you also update the PR description?

SparkQA · 2021-07-06T08:36:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45202/

SparkQA · 2021-07-06T11:46:05Z

Test build #140691 has finished for PR 33214 at commit 048d222.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-06T14:03:43Z

Test build #140706 has finished for PR 33214 at commit d81b3d3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-06T14:52:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45217/

SparkQA · 2021-07-06T15:27:39Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45217/

SparkQA · 2021-07-06T22:51:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45230/

SparkQA · 2021-07-06T23:25:38Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45230/

python/pyspark/sql/tests/test_types.py

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon

Looks good otherwise

itholic · 2021-07-07T00:35:08Z

Thanks! :)


HyukjinKwon · 2021-07-07T00:55:01Z

cc @viirya or @ueshin would you mind taking a quick look please? This is ready for a review.

viirya · 2021-07-07T01:11:35Z

python/pyspark/sql/types.py

+            for key, value in obj.items():
+                if key is not None and value is not None:
+                    return MapType(_infer_type(key, infer_dict_as_struct),
+                                   _infer_type(value, infer_dict_as_struct), True)
+            return MapType(NullType(), NullType(), True)


Do we need to log a warning if the inferred value types are inconsistent? We could recommend that users use the config.

Thanks for the comment! :)
Actually, PySpark's type merging only handles the null case (which is what's called out here), at

spark/python/pyspark/sql/types.py

Lines 1096 to 1133 in 52a9a70

def _merge_type(a, b, name=None):
    if name is None:
        new_msg = lambda msg: msg
        new_name = lambda n: "field %s" % n
    else:
        new_msg = lambda msg: "%s: %s" % (name, msg)
        new_name = lambda n: "field %s in %s" % (n, name)

    if isinstance(a, NullType):
        return b
    elif isinstance(b, NullType):
        return a
    elif type(a) is not type(b):
        # TODO: type cast (such as int -> long)
        raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))

    # same type
    if isinstance(a, StructType):
        nfs = dict((f.name, f.dataType) for f in b.fields)
        fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()),
                                                  name=new_name(f.name)))
                  for f in a.fields]
        names = set([f.name for f in fields])
        for n in nfs:
            if n not in names:
                fields.append(StructField(n, nfs[n]))
        return StructType(fields)

    elif isinstance(a, ArrayType):
        return ArrayType(_merge_type(a.elementType, b.elementType,
                                     name='element in array %s' % name), True)

    elif isinstance(a, MapType):
        return MapType(_merge_type(a.keyType, b.keyType, name='key of map %s' % name),
                       _merge_type(a.valueType, b.valueType, name='value of map %s' % name),
                       True)
    else:
        return a

It actually fails for different types (unlike JSON or CSV type inference).
I am not sure what the ideal behavior is for the null case pointed out here, though.
In any event, let me handle it separately from this PR, if that's fine with you.
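The merge rule described in this reply can be sketched with plain strings standing in for types. This is a simplified model covering atomic types only: `merge_type` and `merge_column` are hypothetical names, and the real `_merge_type` quoted above also recurses into struct, array, and map types.

```python
from functools import reduce


def merge_type(a, b):
    """Toy merge: 'null' yields the other side, equal types pass through,
    and any other combination raises TypeError."""
    if a == "null":
        return b
    if b == "null":
        return a
    if a != b:
        raise TypeError("Can not merge type %s and %s" % (a, b))
    return a


def merge_column(types):
    """Fold the types seen for one column across rows into a single type."""
    return reduce(merge_type, types, "null")


print(merge_column(["null", "double", "double"]))  # double

try:
    merge_column(["double", "string"])
except TypeError as exc:
    print(exc)  # Can not merge type double and string
```

Only nulls are absorbed during merging, so two rows that disagree on a concrete type fail instead of being widened to a common type.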

viirya

Looks okay to me. Just one minor suggestion.

SparkQA · 2021-07-07T01:41:40Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45234/

SparkQA · 2021-07-07T02:28:12Z

Test build #140719 has finished for PR 33214 at commit 35dacda.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-07T05:24:15Z

Test build #140723 has finished for PR 33214 at commit 52a9a70.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-07-07T06:14:07Z

Merged to master

itholic added 3 commits July 5, 2021 16:50

[SPARK-35929] Schema inference of nested structs defaults to map

d7d7032

Merge branch 'master' of https://github.com/apache/spark into SPARK-3…

ed44e01

…5929

Add .pyspark

a750b17

github-actions bot added CORE PYTHON SQL labels Jul 5, 2021

HyukjinKwon reviewed Jul 5, 2021

python/pyspark/sql/types.py

HyukjinKwon reviewed Jul 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 5, 2021

python/pyspark/sql/tests/test_types.py

resolved comments

0ce96fa

HyukjinKwon reviewed Jul 6, 2021

python/pyspark/sql/tests/test_types.py

HyukjinKwon reviewed Jul 6, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

resolved comments

048d222

HyukjinKwon reviewed Jul 6, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon reviewed Jul 6, 2021

python/pyspark/sql/tests/test_types.py

HyukjinKwon reviewed Jul 6, 2021

python/pyspark/sql/tests/test_types.py

HyukjinKwon reviewed Jul 6, 2021

python/pyspark/sql/tests/test_types.py

HyukjinKwon changed the title ~~[SPARK-35929][PYTHON] Schema inference of nested structs defaults to map~~ [SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame Jul 6, 2021

Resolved comments

d81b3d3

Fix scala linter

35dacda

HyukjinKwon reviewed Jul 7, 2021

python/pyspark/sql/tests/test_types.py

HyukjinKwon reviewed Jul 7, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

HyukjinKwon approved these changes Jul 7, 2021

resolved comments

25b20ba

Update sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLC…

52a9a70

…onf.scala

viirya reviewed Jul 7, 2021

viirya approved these changes Jul 7, 2021

HyukjinKwon closed this in 2537fe8 Jul 7, 2021


