[BUG] GpuJsonScan incorrect behavior when parsing dates #9905

Closed
andygrove opened this issue Nov 30, 2023 · 1 comment · Fixed by #9975

@andygrove

Describe the bug
Reading a JSON file that contains dates in yyyy-MM-dd format while setting the dateFormat option to dd/MM/yyyy fails on the GPU with the runtime exception "One or more values is not a valid date". The same read succeeds on the CPU.

Steps/Code to reproduce bug

import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes._

val schema = StructType(Seq(StructField("number", DateType, false)))

val df = spark.read.option("dateFormat","dd/MM/yyyy").schema(schema).json("integration_tests/src/test/resources/dates.json")

df.show(false)

CPU Result

+----------+
|number    |
+----------+
|2020-09-16|
|2020-09-16|
|2020-09-16|
|1581-01-01|
|1583-01-01|
+----------+

GPU Result

java.time.DateTimeException: One or more values is not a valid date
	at com.nvidia.spark.rapids.GpuTextBasedDateUtils$.$anonfun$castStringToDate$3(GpuTextBasedPartitionReader.scala:762)
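
The difference between the two results can be illustrated with a small standalone Python sketch. This is not Spark or plugin code: parse_strict and parse_with_fallback are hypothetical helpers, and the fallback to ISO yyyy-MM-dd parsing is only an assumption that would be consistent with the CPU output above.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical mapping from the Spark-style pattern used in this issue
# to the equivalent strptime directives.
SPARK_TO_STRPTIME = {"dd/MM/yyyy": "%d/%m/%Y"}


def parse_strict(value: str, spark_pattern: str) -> date:
    """Parse with the requested pattern only; raise ValueError on a mismatch."""
    return datetime.strptime(value, SPARK_TO_STRPTIME[spark_pattern]).date()


def parse_with_fallback(value: str, spark_pattern: str) -> Optional[date]:
    """Try the requested pattern, then ISO yyyy-MM-dd; return None if both fail."""
    try:
        return parse_strict(value, spark_pattern)
    except ValueError:
        try:
            # Assumed fallback: accept the default ISO layout as well.
            return date.fromisoformat(value)
        except ValueError:
            return None


# An ISO value does not match dd/MM/yyyy but still parses via the fallback.
print(parse_with_fallback("2020-09-16", "dd/MM/yyyy"))
# A value in the requested layout parses directly.
print(parse_with_fallback("16/09/2020", "dd/MM/yyyy"))
```

Under this model, a reader that only implements the strict path raises on the first ISO-formatted value, which resembles the GPU exception, while a reader with the fallback returns a date for every row.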

Expected behavior

Environment details (please complete the following information)

Additional context

andygrove added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Nov 30, 2023
andygrove changed the title from "[BUG] GpuJsonScan incorrect behavior when parsing dates" to "[BUG] GpuJsonScan incorrect behavior when parsing dates using non-default dateFormat" on Dec 1, 2023
@andygrove

Here is an integration test that highlights some of the issues:

@pytest.mark.parametrize('data_gen', [
    StringGen('[1-9]{4}-[1-3]{1,2}-[1-3]{1,2}', nullable=False),
    StringGen('[1-3]{1,2}-[1-3]{1,2}-[1-9]{4}', nullable=False),
    StringGen('[1-3]{1,2}/[1-3]{1,2}/[1-9]{4}', nullable=False),
])
@pytest.mark.parametrize('schema', [StructType([StructField('value', DateType())])])
@pytest.mark.parametrize('date_format', [
    'yyyy-MM-dd',
    'yyyy/MM/dd',
    'dd-MM-yyyy',
    'dd/MM/yyyy',
    'MM-dd-yyyy',
    'MM/dd/yyyy',
])
@pytest.mark.parametrize('ansi_enabled', [True, False])
def test_json_read_generated_dates(spark_tmp_table_factory, spark_tmp_path, data_gen, schema, date_format, ansi_enabled):

    # create test data with json strings where a subset are valid dates
    # example format: {"value":"3481-1-31"}
    path = spark_tmp_path + '/JSON_DATA'
    with_cpu_session(lambda spark: gen_df(spark, data_gen).write.json(path))

    updated_conf = copy_and_update(_enable_all_types_conf,
        {'spark.sql.ansi.enabled': ansi_enabled,
        'spark.sql.legacy.timeParserPolicy': 'CORRECTED'})

    f = read_json_df(path, schema, spark_tmp_table_factory, { 'dateFormat': date_format })
    assert_gpu_and_cpu_are_equal_collect(f, conf = updated_conf)
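
To see why only a subset of the generated strings are valid dates, here is a rough standalone sketch that checks a few hand-written samples (each one matches one of the StringGen regexes above) against every date_format in the test. It assumes a strict reading of the patterns, with exactly two digits for dd and MM and four for yyyy; Spark's actual parser may accept more than this, and pattern_to_regex and is_valid are hypothetical helpers, not part of the integration test framework.

```python
import re
from datetime import date

# Assumed strict field widths for each pattern letter group.
FIELD_REGEX = {
    "yyyy": r"(?P<y>\d{4})",
    "MM": r"(?P<m>\d{2})",
    "dd": r"(?P<d>\d{2})",
}


def pattern_to_regex(date_format: str) -> re.Pattern:
    """Turn a pattern such as 'dd/MM/yyyy' into a strict regular expression."""
    parts = re.split(r"([-/])", date_format)  # the capture group keeps separators
    return re.compile("".join(FIELD_REGEX.get(p, re.escape(p)) for p in parts))


def is_valid(value: str, date_format: str) -> bool:
    """True if value has the layout of date_format and names a real calendar date."""
    m = pattern_to_regex(date_format).fullmatch(value)
    if m is None:
        return False
    try:
        date(int(m["y"]), int(m["m"]), int(m["d"]))
        return True
    except ValueError:  # e.g. month 13 or day 33
        return False


samples = ["3481-1-31", "1234-12-31", "31-12-1234", "13/11/1999", "33/33/1111"]
formats = ["yyyy-MM-dd", "yyyy/MM/dd", "dd-MM-yyyy",
           "dd/MM/yyyy", "MM-dd-yyyy", "MM/dd/yyyy"]

for fmt in formats:
    print(fmt, [s for s in samples if is_valid(s, fmt)])
```

Each sample is valid under at most one of the six formats in this model, so most combinations of data_gen and date_format exercise the invalid-date path, where the CPU and GPU readers need to agree.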

andygrove self-assigned this on Dec 1, 2023
andygrove changed the title from "[BUG] GpuJsonScan incorrect behavior when parsing dates using non-default dateFormat" to "[BUG] GpuJsonScan incorrect behavior when parsing dates" on Dec 1, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Dec 5, 2023