[BUG] Spark reports a decimal error when creating a lit scalar while generating Decimal(34, -5) data. #9404

Closed · res-life opened this issue Oct 9, 2023 · 3 comments · Fixed by #9405
Labels: bug (Something isn't working)

res-life (Collaborator) commented Oct 9, 2023

Describe the bug
Spark reports the following error when creating a lit scalar while generating Decimal(34, -5) data.

pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38

Steps/Code to reproduce bug

Case 1, failed

Update test_greatest to the following, then run it on Spark 3.1.1.

@pytest.mark.parametrize('data_gen', [DecimalGen(34, -5)], ids=idfn)
def test_greatest1(data_gen):
    num_cols = 20
    s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
    # we want lots of nulls
    gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
        for x in range(0, num_cols)], nullable=False)
    command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)]
    command_args.append(s1)
    data_type = data_gen.data_type
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : gen_df(spark, gen).select(
                f.greatest(*command_args)))

Case 2, passed. Identical except that a DecimalGen(7, 7) is added before DecimalGen(34, -5) in the parametrize list.

@pytest.mark.parametrize('data_gen', [DecimalGen(7, 7), DecimalGen(34, -5)], ids=idfn)
def test_greatest2(data_gen):
    num_cols = 20
    s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
    # we want lots of nulls
    gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
        for x in range(0, num_cols)], nullable=False)
    command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)]
    command_args.append(s1)
    data_type = data_gen.data_type
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : gen_df(spark, gen).select(
                f.greatest(*command_args)))

Case 3, failed. The same parameters as case 2, but with everything after the gen_scalar call commented out.

@pytest.mark.parametrize('data_gen', [DecimalGen(7, 7), DecimalGen(34, -5)], ids=idfn)
def test_greatest3(data_gen):
    num_cols = 20
    s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
    # we want lots of nulls
    # gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
    #     for x in range(0, num_cols)], nullable=False)
    # command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)]
    # command_args.append(s1)
    # data_type = data_gen.data_type
    # assert_gpu_and_cpu_are_equal_collect(
    #         lambda spark : gen_df(spark, gen).select(
    #             f.greatest(*command_args)))

The error is from:

s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
   --  return f.lit(data).cast(data_type) in datagen.py

Expected behavior
Creating the lit scalar for Decimal(34, -5) data should succeed without raising an AnalysisException.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark version: 3.1.1

Additional context
The detailed error is:

________________________ test_greatest3[Decimal(36,-5)] ________________________

data_gen = Decimal(36,-5)

    @pytest.mark.parametrize('data_gen', all_basic_gens + _arith_decimal_gens, ids=idfn)
    def test_greatest3(data_gen):
        num_cols = 20
>       s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))

../../src/main/python/arithmetic_ops_test.py:991: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/data_gen.py:859: in gen_scalar
    v = list(gen_scalars(data_gen, 1, seed=seed, force_no_nulls=force_no_nulls))
../../src/main/python/data_gen.py:855: in <genexpr>
    return (_mark_as_lit(src.gen(force_no_nulls=force_no_nulls), data_type) for i in range(0, count))
../../src/main/python/data_gen.py:833: in _mark_as_lit
    return f.lit(data).cast(data_type)
/home/chongg/progs/sparks/spark-home/python/lib/pyspark.zip/pyspark/sql/functions.py:98: in lit
    return col if isinstance(col, Column) else _invoke_function("lit", col)
/home/chongg/progs/sparks/spark-home/python/lib/pyspark.zip/pyspark/sql/functions.py:58: in _invoke_function
    return Column(jf(*args))
/home/chongg/progs/sparks/spark-home/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = ('xro889', <py4j.java_gateway.GatewayClient object at 0x7fc007704c10>, 'z:org.apache.spark.sql.functions', 'lit')
kw = {}
converted = AnalysisException('decimal can only support precision up to 38', 'org.apache.spark.sql.AnalysisException: decimal can ...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\n', None)

    def deco(*a, **kw):
        try:
            return f(*a, **kw)
        except py4j.protocol.Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38

/home/chongg/progs/sparks/spark-home/python/lib/pyspark.zip/pyspark/sql/utils.py:117: AnalysisException

res-life (Collaborator, Author) commented Oct 9, 2023

One more minimal repro:

$SPARK_HOME/bin/pyspark
Spark 3.1.1
>>> from decimal import Decimal
>>> from pyspark.sql.functions import *
>>> d = Decimal('4.8764759382421948924115781565938778E+39')
>>> lit(d)

Sometimes f.lit(Decimal('4.8764759382421948924115781565938778E+39')) passes, and sometimes it fails.

res-life (Collaborator, Author) commented Oct 9, 2023

It's from #9289 (comment)

res-life (Collaborator, Author) commented Oct 9, 2023

Thanks @pxLi

He found the root cause: in rare cases we did not set spark.sql.legacy.allowNegativeScaleOfDecimal=true when creating a literal scalar.

If the Spark session was already initialized with this config, the cases pass.
If no Spark session has been initialized yet, the config value is false and creating the literal scalar fails.
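To see why the config matters: a value like 4.8764759382421948924115781565938778E+39 fits in a decimal with precision 35 and scale -5, but when negative scales are disallowed the scale is clamped to 0 and the implied trailing zeros become stored digits, so the required precision grows to 40, past Spark's maximum of 38. A rough sketch of that arithmetic (illustrative only, not Spark's actual code):

from decimal import Decimal

# Approximate the precision Spark would need to store a Python Decimal,
# depending on whether negative scales are allowed.
def required_precision(value, allow_negative_scale):
    sign, digits, exponent = value.as_tuple()
    scale = -exponent if allow_negative_scale else max(-exponent, 0)
    # the unscaled integer is coefficient * 10**(exponent + scale)
    return len(digits) + exponent + scale

d = Decimal('4.8764759382421948924115781565938778E+39')
required_precision(d, allow_negative_scale=True)   # 35 -> fits within 38
required_precision(d, allow_negative_scale=False)  # 40 -> exceeds 38

With the config enabled, the same literal is accepted: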

pyspark --conf spark.sql.legacy.allowNegativeScaleOfDecimal=true
>>> spark.sparkContext.getConf().get('spark.sql.legacy.allowNegativeScaleOfDecimal')
'true'
>>> from pyspark.sql.functions import *
>>> from decimal import Decimal
>>> d = Decimal('4.87647593824219489241157815659387781E+39')
>>> lit(d)
Column<'4.87647593824219489241157815659387781E+39'>
>>>
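For code that creates its own session, the equivalent programmatic fix is to set the config on the builder before the first literal is created. A minimal sketch (the config name is from the comment above; the session setup is illustrative):

from decimal import Decimal
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Set the legacy config before the session exists, so that literals with
# negative-scale decimals are allowed.
spark = (SparkSession.builder
    .config('spark.sql.legacy.allowNegativeScaleOfDecimal', 'true')
    .getOrCreate())

f.lit(Decimal('4.87647593824219489241157815659387781E+39'))  # no longer raises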
