State schema tws #12

Closed
wants to merge 925 commits into from

Conversation

ericm-db
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

cloud-fan and others added 30 commits June 7, 2024 10:11
…loadTable

### What changes were proposed in this pull request?

This is a followup of apache#44335 , which missed to handle `loadTable`

### Why are the changes needed?

better error message

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46905 from cloud-fan/jdbc.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…pendencies

### What changes were proposed in this pull request?

The core module shipped with some bundled dependencies, it's better to add LICENSE/NOTICE to conform to the ASF policies.

### Why are the changes needed?

ASF legal compliance

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

pass build

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46891 from yaooqinn/SPARK-48548.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…pported plotting functions

### What changes were proposed in this pull request?
Throw `PandasNotImplementedError` for unsupported plotting functions:
- {Frame, Series}.plot.hist
- {Frame, Series}.plot.kde
- {Frame, Series}.plot.density
- {Frame, Series}.plot(kind="hist", ...)
- {Frame, Series}.plot(kind="hist", ...)
- {Frame, Series}.plot(kind="density", ...)

### Why are the changes needed?
The previous error message is confusing:
```
In [3]: psdf.plot.hist()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1017: PandasAPIOnSparkAdviceWarning: The config 'spark.sql.ansi.enabled' is set to True. This can cause unexpected behavior from pandas API on Spark since pandas API on Spark follows the behavior of pandas, not SQL.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
[*********************************************-----------------------------------] 57.14% Complete (0 Tasks running, 1s, Scanned ...)
---------------------------------------------------------------------------
PySparkAttributeError                     Traceback (most recent call last)
Cell In[3], line 1
----> 1 psdf.plot.hist()

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:951, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
    903 def hist(self, bins=10, **kwds):
    904     """
    905     Draw one histogram of the DataFrame’s columns.
    906     A `histogram`_ is a representation of the distribution of data.
   (...)
    949         >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
    950     """
--> 951     return self(kind="hist", bins=bins, **kwds)

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:580, in PandasOnSparkPlotAccessor.__call__(self, kind, backend, **kwargs)
    577 kind = {"density": "kde"}.get(kind, kind)
    578 if hasattr(plot_backend, "plot_pandas_on_spark"):
    579     # use if there's pandas-on-Spark specific method.
--> 580     return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, **kwargs)
    581 else:
    582     # fallback to use pandas'
    583     if not PandasOnSparkPlotAccessor.pandas_plot_data_map[kind]:

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:41, in plot_pandas_on_spark(data, kind, **kwargs)
     39     return plot_pie(data, **kwargs)
     40 if kind == "hist":
---> 41     return plot_histogram(data, **kwargs)
     42 if kind == "box":
     43     return plot_box(data, **kwargs)

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:87, in plot_histogram(data, **kwargs)
     85 psdf, bins = HistogramPlotBase.prepare_hist_data(data, bins)
     86 assert len(bins) > 2, "the number of buckets must be higher than 2."
---> 87 output_series = HistogramPlotBase.compute_hist(psdf, bins)
     88 prev = float("%.9f" % bins[0])  # to make it prettier, truncate.
     89 text_bins = []

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:189, in HistogramPlotBase.compute_hist(psdf, bins)
    183 for group_id, (colname, bucket_name) in enumerate(zip(colnames, bucket_names)):
    184     # creates a Bucketizer to get corresponding bin of each value
    185     bucketizer = Bucketizer(
    186         splits=bins, inputCol=colname, outputCol=bucket_name, handleInvalid="skip"
    187     )
--> 189     bucket_df = bucketizer.transform(sdf)
    191     if output_df is None:
    192         output_df = bucket_df.select(
    193             F.lit(group_id).alias("__group_id"), F.col(bucket_name).alias("__bucket")
    194         )

File ~/Dev/spark/python/pyspark/ml/base.py:260, in Transformer.transform(self, dataset, params)
    258         return self.copy(params)._transform(dataset)
    259     else:
--> 260         return self._transform(dataset)
    261 else:
    262     raise TypeError("Params must be a param map but got %s." % type(params))

File ~/Dev/spark/python/pyspark/ml/wrapper.py:412, in JavaTransformer._transform(self, dataset)
    409 assert self._java_obj is not None
    411 self._transfer_params_to_java()
--> 412 return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sparkSession)

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1696, in DataFrame.__getattr__(self, name)
   1694 def __getattr__(self, name: str) -> "Column":
   1695     if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]:
-> 1696         raise PySparkAttributeError(
   1697             error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name}
   1698         )
   1700     if name not in self.columns:
   1701         raise PySparkAttributeError(
   1702             error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name}
   1703         )

PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
```

after this PR:
```
In [3]: psdf.plot.hist()
---------------------------------------------------------------------------
PandasNotImplementedError                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 psdf.plot.hist()

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:957, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
    909 """
    910 Draw one histogram of the DataFrame’s columns.
    911 A `histogram`_ is a representation of the distribution of data.
   (...)
    954     >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
    955 """
    956 if is_remote():
--> 957     return unsupported_function(class_name="pd.DataFrame", method_name="hist")()
    959 return self(kind="hist", bins=bins, **kwds)

File ~/Dev/spark/python/pyspark/pandas/missing/__init__.py:23, in unsupported_function.<locals>.unsupported_function(*args, **kwargs)
     22 def unsupported_function(*args, **kwargs):
---> 23     raise PandasNotImplementedError(
     24         class_name=class_name, method_name=method_name, reason=reason
     25     )

PandasNotImplementedError: The method `pd.DataFrame.hist()` is not implemented yet.
```

### Does this PR introduce _any_ user-facing change?
yes, error message improvement

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46911 from zhengruifeng/ps_plotting_unsupported.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…oking initialization of GlobalTempViewManager

### What changes were proposed in this pull request?

It's not necessary to create `GlobalTempViewManager` just to get the global temp database name. This PR updates the code to avoid this, as the global temp database name is just a config.
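
As an illustrative sketch (not the actual patch), the idea is that the name can be read straight from the static config entry `StaticSQLConf.GLOBAL_TEMP_DATABASE` (`spark.sql.globalTempDatabase`) without touching the manager:

```scala
import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}

// Sketch only: the global temp database name is a static config,
// so it can be read without initializing GlobalTempViewManager
// (and without the existence-check RPC that initialization performs).
def globalTempDatabase(conf: SQLConf): String =
  conf.getConf(StaticSQLConf.GLOBAL_TEMP_DATABASE)
```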

### Why are the changes needed?

avoid unnecessary RPC calls to check existence of global temp db

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46907 from willwwt/master.

Authored-by: Weitao Wen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…loadTable and Fix UT

### What changes were proposed in this pull request?
This is a followup of apache#46905, fixing some UTs on GA.

### Why are the changes needed?
Fix UT.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually tested.
Passed GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46912 from panbingkun/SPARK-46393_FOLLOWUP.

Lead-authored-by: panbingkun <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Move a test out of parity tests

### Why are the changes needed?
It is not tested in Spark Classic, so it is not a parity test.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#46914 from zhengruifeng/move_a_non_parity_test.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…uffle

### Why are the changes needed?

Support SPJ one-side shuffle if other side has partition transform expression

### How was this patch tested?

New unit test in KeyGroupedPartitioningSuite

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46255 from szehon-ho/spj_auto_bucket.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to make StreamingQueryListener.spark settable

### Why are the changes needed?

```python
from pyspark.sql.streaming.listener import StreamingQueryListener

class MyListener(StreamingQueryListener):
  def __init__(self, spark):
    self.spark = spark

  def onQueryStarted(self, event):
    pass

  def onQueryProgress(self, event):
    pass

  def onQueryTerminated(self, event):
    pass

MyListener(spark)
```

has been broken since 3.5.0, after SPARK-42941.

### Does this PR introduce _any_ user-facing change?

Yes, end users who implement `StreamingQueryListener` can add `spark` attribute in their implementation.

### How was this patch tested?

Manually tested, and added a unittest.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46909 from HyukjinKwon/compat-spark-prop.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Propagate cached schema in set operations

### Why are the changes needed?
to avoid extra RPC to get the schema of result data frame

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46915 from zhengruifeng/set_op_schema.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…E & ICU collations

### What changes were proposed in this pull request?
String lowercase/uppercase conversion in UTF8_BINARY_LCASE now works using ICU default locale, similar to how other ICU collations currently work in Spark.
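
As a hedged sketch of the interface in question, ICU4J's `UCharacter` exposes locale-aware case mapping; the root locale below is an assumption standing in for whichever locale the patch selects:

```scala
import com.ibm.icu.lang.UCharacter
import com.ibm.icu.util.ULocale

// Sketch: case conversion through ICU instead of the JVM's
// String.toLowerCase/toUpperCase, matching the other ICU collations.
def lowerIcu(s: String): String = UCharacter.toLowerCase(ULocale.ROOT, s)
def upperIcu(s: String): String = UCharacter.toUpperCase(ULocale.ROOT, s)
```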

### Why are the changes needed?
All collations apart from UTF8_BINARY should use the same interface (UCharacter) that utilizes ICU toLowerCase/toUpperCase implementation, rather than mixing JVM & ICU implementations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests and e2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46720 from uros-db/lower-upper-initcap.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… collations

### What changes were proposed in this pull request?
String titlecase conversion under UTF8_BINARY_LCASE and other ICU collations now works using the appropriate ICU default locale for character mapping, and uses ICU `BreakIterator.getWordInstance` to locate boundaries between words.
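
A minimal sketch of that combination (ICU character mapping plus `BreakIterator.getWordInstance` for word boundaries); the root locale is an assumption, not necessarily what the patch uses:

```scala
import com.ibm.icu.lang.UCharacter
import com.ibm.icu.text.BreakIterator
import com.ibm.icu.util.ULocale

// Sketch: titlecase each word, with ICU locating the word boundaries.
def initCapIcu(s: String): String =
  UCharacter.toTitleCase(ULocale.ROOT, s, BreakIterator.getWordInstance(ULocale.ROOT))
```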

### Why are the changes needed?
Similar Spark expressions such as Lower & Upper use the same interface (UCharacter) to perform collation-aware string transformation, and InitCap should offer a consistent way to titlecase strings across the collation space.

### Does this PR introduce _any_ user-facing change?
Yes, InitCap should now work properly for all collations other than UTF8_BINARY.

### How was this patch tested?
New and existing unit tests, as well as existing e2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46732 from uros-db/initcap-icu.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

1. In Connect, when a streaming query name is not specified, `query.name` should return None. Without this patch, it returns an empty string.
2. In classic Spark, one cannot set a streaming query's name to an empty string. This check was missing in Spark Connect; this patch adds it back.

### Why are the changes needed?

Edge case handling.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46920 from WweiL/SPARK-48569-query-name-None.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

- Changed the `ineffectiveRules` variable of the `TreeNode` class to initialize lazily. This will reduce unnecessary driver memory pressure.
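
A minimal sketch of the pattern, assuming a simplified `TreeNode` (the real field type, size, and surrounding code differ):

```scala
import java.util.BitSet

abstract class TreeNodeSketch {
  // Before: a BitSet allocated eagerly for every node in the tree,
  // whether or not any rule is ever marked ineffective on it.
  // After: a lazy val, so the BitSet is allocated on first use only.
  private lazy val ineffectiveRules: BitSet = new BitSet(128)

  def markRuleAsIneffective(ruleId: Int): Unit = ineffectiveRules.set(ruleId)
  def isRuleIneffective(ruleId: Int): Boolean = ineffectiveRules.get(ruleId)
}
```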

### Why are the changes needed?

- Plans with large expression or operator trees are known to cause driver memory pressure; this is one step in alleviating that issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs cover the behavior. Outward-facing behavior does not change.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46919 from n-young-db/ineffective-rules-lazy.

Authored-by: Nick Young <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade `pickle` from 1.3 to 1.5.

### Why are the changes needed?
The new version includes a fix related to [empty bytes object construction](irmen/pickle@badc8fe)

All changes from 1.3 to 1.5 are as follows:

- irmen/pickle@pickle-1.3...pickle-1.5

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46913 from LuciferYang/pickle-1.5.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…ataFrame.select(None)`

### What changes were proposed in this pull request?
The refactor PR apache#45636 changed the error raised by `DataFrame.select(None)` from `PySparkTypeError` to `AssertionError`; this PR restores the previous error.

### Why are the changes needed?
error message improvement

### Does this PR introduce _any_ user-facing change?
yes, error message improvement

### How was this patch tested?
added test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#46930 from zhengruifeng/py_restore_select_error.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

The thread dump display in the UI is not as pretty as before; this is a side effect introduced by SPARK-44863.

### Why are the changes needed?

Restore thread dump display in UI.

### Does this PR introduce _any_ user-facing change?

Yes, it only affects UI display.

### How was this patch tested?

Current master:
<img width="1545" alt="master-branch" src="https://github.com/apache/spark/assets/26535726/5c6fd770-467f-481c-a635-2855a2853633">

With this patch applied:
<img width="1542" alt="Xnip2024-06-07_20-00-38" src="https://github.com/apache/spark/assets/26535726/3998c2aa-671f-4921-8444-b7bca8667202">

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46916 from pan3793/SPARK-48565.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR improves perf for escapePathName, with the algorithm briefly described as follows:
- If a path contains no special characters, we return the original string instead of creating a new StringBuilder to append char by char.
- If a path contains special characters, we locate the index (IDX) of the first special character, initialize the StringBuilder with [0, IDX) of the original string, and apply hexadecimal escaping where necessary starting from IDX.
- An optimized char-to-hex function replaces `String.format`.

In short, this adds a fast path for storage paths, or parts of them, that require no escaping, avoiding a per-character StringBuilder append; a code sketch follows.
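
A sketch of that shape; the character set and two-digit escape format below are placeholders rather than Hive's exact rules:

```scala
object EscapeSketch {
  private val hexDigits = "0123456789ABCDEF".toCharArray

  // Placeholder predicate; the real code uses Hive's set of characters
  // that must be escaped in partition paths.
  private def needsEscaping(c: Char): Boolean =
    c < ' ' || "\"%:=/#\\".indexOf(c) >= 0

  def escapePathName(path: String): String = {
    // Fast path: find the first character that needs escaping.
    var idx = 0
    while (idx < path.length && !needsEscaping(path.charAt(idx))) idx += 1
    if (idx == path.length) return path // identity: no allocation at all

    // Slow path: copy the clean prefix [0, idx) once, escape from idx on.
    val sb = new java.lang.StringBuilder(path.length + 10)
    sb.append(path, 0, idx)
    while (idx < path.length) {
      val c = path.charAt(idx)
      if (needsEscaping(c)) {
        // Optimized char-to-hex instead of String.format("%%%02X", ...).
        sb.append('%').append(hexDigits((c >> 4) & 0xF)).append(hexDigits(c & 0xF))
      } else {
        sb.append(c)
      }
      idx += 1
    }
    sb.toString
  }
}
```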
### Why are the changes needed?

performance improvement for hotspots

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

- new tests in ExternalCatalogUtilsSuite
- Benchmark results (9x faster)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46894 from yaooqinn/SPARK-48551.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…thod

### What changes were proposed in this pull request?
followup of apache#46685, to remove unused helper method

### Why are the changes needed?
method `_tree_string` is no longer needed

### Does this PR introduce _any_ user-facing change?
No, internal change only

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46936 from zhengruifeng/tree_string_followup.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
Introduce collation support for `levenshtein` string expression (pass-through).

### Why are the changes needed?
Add collation support for Levenshtein expression in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to pass collated strings as arguments to the `levenshtein` string function.

### How was this patch tested?
E2e sql tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46788 from uros-db/levenshtein.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…D_GROUPING_EXPRESSION

### What changes were proposed in this pull request?

Following sequence of queries produces `UNSUPPORTED_GROUPING_EXPRESSION` error:
```
create table t1(a int, b int) using parquet;
select grouping(a), dummy from t1 group by a with rollup;
```
However, the appropriate error should point the user to the invalid `dummy` column name.

Fix the problem by deprioritizing unresolved `Grouping` and `GroupingID` nodes in the plan, which would otherwise cause the unwanted error.

### Why are the changes needed?

To fix the described issue.

### Does this PR introduce _any_ user-facing change?

Yes, it displays proper error message to user instead of misleading one.

### How was this patch tested?

Added test to `QueryCompilationErrorsSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46900 from nikolamand-db/SPARK-48556.

Authored-by: Nikola Mandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`.

### Why are the changes needed?
As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on the character level (rather than on the byte level), as reflected in these PRs: apache#46511, apache#46589, apache#46682, apache#46761, apache#46762. In addition, string comparison also works on the character level now, as per the changes introduced in this PR: apache#46700.

### Does this PR introduce _any_ user-facing change?
Yes, what was previously named `UTF8_BINARY_LCASE` collation, will from now on be named `UTF8_LCASE`.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46924 from uros-db/rename-lcase.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE` in leftover tests.

### Why are the changes needed?
Due to a merge conflict, one additional test was using the old collation name.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46939 from uros-db/renaming-fix.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ctionRegistry"

### What changes were proposed in this pull request?

Reverts apache#44976 as it breaks thread-safety

### Why are the changes needed?

Fix thread-safety

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46940 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade `braces` from 3.0.2 to 3.0.3 in ui-test.

The original PR was submitted by `dependabot`: apache#46931

### Why are the changes needed?
The new version fixes vulnerability https://security.snyk.io/vuln/SNYK-JS-BRACES-6838727

- micromatch/braces@9f5b4cf

The complete list of changes is as follows:

- micromatch/braces@3.0.2...3.0.3

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46933 from LuciferYang/SPARK-48582.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?
This PR adds a test for API DropDuplicateWithinWatermark in Python, which was previously missing.

### Why are the changes needed?
Check the correctness of API DropDuplicateWithinWatermark.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed:
```
python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
python/run-tests --testnames pyspark.sql.tests.connect.streaming.test_parity_streaming
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46740 from eason-yuchen-liu/DropDuplicateWithinWatermark_test.

Authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?

Upgrade dropwizard metrics to 4.2.26.

### Why are the changes needed?

There are some bug fixes, as follows:

- Correction for the Jetty-12 QTP metrics by dkaukov in dropwizard/metrics#4181

- Fix metrics for InstrumentedEE10Handler by zUniQueX in dropwizard/metrics#3928

The full release notes:
https://github.com/dropwizard/metrics/releases/tag/v4.2.26

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46932 from wayneguow/codahale.

Authored-by: Wei Guo <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR improves perf for unescapePathName, with the algorithm briefly described as follows (a code sketch appears after the list):
- If a path contains no '%' or contains '%' at `position > path.length-2`, we return the original identity instead of creating a new StringBuilder to append char by char
- Otherwise, we loop with 2 indices, `plaintextStartIdx` which starts from 0 and then points to the next char after resolving `%xx`, and `plaintextEndIdx` which points to the next `'%'`. `plaintextStartIdx` moves to `plaintextEndIdx + 3` if `%xx` is valid, or moves to `plaintextEndIdx + 1` if `%xx` is invalid.
- Instead of using Integer.parseInt with error capture, we identify the high and low characters manually.
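
A sketch of the two-index loop under those rules (names and structure are illustrative):

```scala
object UnescapeSketch {
  // Manual hex-digit decoding instead of Integer.parseInt with error capture.
  private def hexValue(c: Char): Int = c match {
    case d if d >= '0' && d <= '9' => d - '0'
    case l if l >= 'a' && l <= 'f' => l - 'a' + 10
    case u if u >= 'A' && u <= 'F' => u - 'A' + 10
    case _ => -1 // not a hex digit
  }

  def unescapePathName(path: String): String = {
    var percent = path.indexOf('%')
    // Fast path: no '%', or the first '%' is too close to the end
    // to form a "%xx" sequence.
    if (percent < 0 || percent > path.length - 3) return path

    val sb = new java.lang.StringBuilder(path.length)
    var start = 0 // plaintextStartIdx
    while (percent >= 0 && percent <= path.length - 3) { // percent = plaintextEndIdx
      sb.append(path, start, percent)
      val hi = hexValue(path.charAt(percent + 1))
      val lo = hexValue(path.charAt(percent + 2))
      if (hi >= 0 && lo >= 0) {
        sb.append(((hi << 4) | lo).toChar)
        start = percent + 3 // valid %xx: skip past it
      } else {
        sb.append('%')
        start = percent + 1 // invalid %xx: keep the literal '%'
      }
      percent = path.indexOf('%', start)
    }
    sb.append(path, start, path.length)
    sb.toString
  }
}
```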

### Why are the changes needed?

performance improvement for hotspots

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

- new tests in ExternalCatalogUtilsSuite
- Benchmark results (9-11x faster)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46938 from yaooqinn/SPARK-48584.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…compress`

### What changes were proposed in this pull request?
This PR uses `org.apache.commons.io.output.CountingOutputStream` instead of `org.apache.commons.compress.utils.CountingOutputStream` to fix the following compilation warnings related to `commons-compress`:

```
[WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala:308: class CountingOutputStream in package utils is deprecated
Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.deploy.history.RollingEventLogFilesWriter.countingOutputStream, origin=org.apache.commons.compress.utils.CountingOutputStream
[WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala:351: class CountingOutputStream in package utils is deprecated
Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.deploy.history.RollingEventLogFilesWriter.rollEventLogFile.$anonfun, origin=org.apache.commons.compress.utils.CountingOutputStream
```

The fix refers to:

https://github.com/apache/commons-compress/blob/95727006cac0892c654951c4e7f1db142462f22a/src/main/java/org/apache/commons/compress/utils/CountingOutputStream.java#L25-L33

```
/**
 * Stream that tracks the number of bytes read.
 *
 * @since 1.3
 * @NotThreadSafe
 * @deprecated Use {@link org.apache.commons.io.output.CountingOutputStream}.
 */
@Deprecated
public class CountingOutputStream extends FilterOutputStream {
```
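
As an illustration of the swap, the commons-io class is essentially a drop-in replacement; `getByteCount` is its byte-count accessor:

```scala
import java.io.ByteArrayOutputStream
import org.apache.commons.io.output.CountingOutputStream

object CountingSketch {
  def main(args: Array[String]): Unit = {
    // Same wrapping pattern as before, with the non-deprecated class.
    val out = new CountingOutputStream(new ByteArrayOutputStream())
    out.write(Array[Byte](1, 2, 3))
    println(out.getByteCount) // 3
  }
}
```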

### Why are the changes needed?
Cleanup deprecated api usage related to `commons-compress`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46950 from LuciferYang/SPARK-48595.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This pull request optimizes the `Hex.hex(num: Long)` method by removing leading zeros, thus eliminating the need to copy the array to remove them afterward.
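
A sketch of the trick, sizing the output exactly so no trailing `copyOfRange` is needed; it returns raw bytes rather than Spark's `UTF8String` to stay self-contained:

```scala
object HexSketch {
  private val hexDigits = "0123456789ABCDEF".getBytes

  def hex(num: Long): Array[Byte] = {
    // Number of hex digits = ceil(significant bits / 4), at least 1 for zero.
    val bits = 64 - java.lang.Long.numberOfLeadingZeros(num)
    val len = math.max((bits + 3) / 4, 1)
    val value = new Array[Byte](len) // exact size: no leading zeros to trim
    var numBuf = num
    var i = len - 1
    while (i >= 0) {
      value(i) = hexDigits((numBuf & 0xF).toInt)
      numBuf >>>= 4
      i -= 1
    }
    value
  }
}

// new String(HexSketch.hex(255L)) == "FF"
```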
### Why are the changes needed?

- Unit tests added
- Did a benchmark locally (30~50% speedup)

```
Hex Long Tests:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Legacy                                             1062           1094          16          9.4         106.2       1.0X
New                                                 739            807          26         13.5          73.9       1.4X
```

```scala
object HexBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val N = 10_000_000
    runBenchmark("Hex") {
      val benchmark = new Benchmark("Hex Long Tests", N, 10, output = output)
      val range = 1 to 12
      benchmark.addCase("Legacy") { _ =>
        (1 to N).foreach(x => range.foreach(y => hexLegacy(x - y)))
      }

      benchmark.addCase("New") { _ =>
        (1 to N).foreach(x => range.foreach(y => Hex.hex(x - y)))
      }
      benchmark.run()
    }
  }

  def hexLegacy(num: Long): UTF8String = {
    // Extract the hex digits of num into value[] from right to left
    val value = new Array[Byte](16)
    var numBuf = num
    var len = 0
    do {
      len += 1
      // Hex.hexDigits needs to be visible here
      value(value.length - len) = Hex.hexDigits((numBuf & 0xF).toInt)
      numBuf >>>= 4
    } while (numBuf != 0)
    UTF8String.fromBytes(java.util.Arrays.copyOfRange(value, value.length - len, value.length))
  }
}
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Unit tests added; see the benchmark above.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#46952 from yaooqinn/SPARK-48596.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn and others added 11 commits July 2, 2024 09:53
…lumnVector.offheap.enabled's doc field

### What changes were proposed in this pull request?

Followup of apache#42394
```
   * - spark.sql.columnVector.offheap.enabled
     - When true, use OffHeapColumnVector in ColumnarBatch. Defaults to ConfigEntry(key=spark.memory.offHeap.enabled, defaultValue=false, doc=If true, Spark will attempt to use off-heap memory for certain operations. If off-heap memory use is enabled, then spark.memory.offHeap.size must be positive., public=true, version=1.6.0).
     - <value of spark.memory.offHeap.enabled>
     - 2.3.0
```

The doc field should be interpolated with `MEMORY_OFFHEAP_ENABLED.key` instead of `MEMORY_OFFHEAP_ENABLED`. In this PR, we remove the redundant doc text, as it can also be found in `MEMORY_OFFHEAP_ENABLED.defaultValueString`.

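A hedged sketch of the difference, with a stand-in `ConfigEntry` (only its `toString` and `key` matter here):

```scala
object DocInterpolationSketch {
  final case class ConfigEntry(key: String, defaultValue: Boolean) {
    override def toString: String =
      s"ConfigEntry(key=$key, defaultValue=$defaultValue, ...)"
  }
  val MEMORY_OFFHEAP_ENABLED = ConfigEntry("spark.memory.offHeap.enabled", false)

  // Buggy doc: interpolating the entry renders its whole toString.
  val buggyDoc = s"Defaults to $MEMORY_OFFHEAP_ENABLED."
  // Fixed doc: interpolating the key renders just the config name.
  val fixedDoc = s"Defaults to the value of ${MEMORY_OFFHEAP_ENABLED.key}."
}
```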
### Why are the changes needed?

docfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually debugging

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47165 from yaooqinn/minor2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…ior change since Spark 3.4

### What changes were proposed in this pull request?

Add migration guide for `CREATE TABLE AS SELECT...` behavior change.

SPARK-41859 changes the behaviour for `CREATE TABLE AS SELECT ...` from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`:

```
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 1 as col union all select 2 as col;
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 3 as col union all select 4 as col;
select * from test_table;

```
This produces {3, 4} in Spark <3.4.0 and {1, 2, 3, 4} in Spark 3.4.0 and later. This is a silent change in `spark.sql.legacy.allowNonEmptyLocationInCTAS` behaviour which introduces wrong results in the user application.

### Why are the changes needed?
This documents a behavior change starting in Spark 3.4 for `CREATE TABLE AS SELECT`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc build.
### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47152 from asl3/allowNonEmptyLocationInCTAS.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…stack

### What changes were proposed in this pull request?

This PR proposes to fix internal function `_capture_call_site` for filtering out IPython-related frames from user stack.

### Why are the changes needed?

IPython-related frames unnecessarily pollute the user stack, which harms the debuggability of IPython Notebooks.

For example, there are some garbage stacks recorded from `IPython` and `ipykernel` such as:

- `...lib/python3.9/site-packages/IPython/core/interactiveshell.py...`
- `...lib/python3.9/site-packages/ipykernel/zmqshell.py...`

### Does this PR introduce _any_ user-facing change?

No API changes, but the user stack from IPython will be cleaned up as below:

**Before**
<img width="457" alt="Screenshot 2024-07-01 at 3 26 45 PM" src="https://github.com/apache/spark/assets/44108233/67ba8b49-f52f-4a7d-8031-b7272fceb581">

**After**
<img width="456" alt="Screenshot 2024-07-01 at 3 25 07 PM" src="https://github.com/apache/spark/assets/44108233/950035cd-4397-41a5-9664-7040b84ebd6f">

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47159 from itholic/ipython_followup.

Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

Bump Apache Parquet to 1.14.0.

### Why are the changes needed?

Fixes quite a few bugs on the Parquet side: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Using the existing unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46447 from Fokko/fd-bump-parquet.

Authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…rtitionId to state data source

### What changes were proposed in this pull request?

This PR defines two new options, snapshotStartBatchId and snapshotPartitionId, for the existing state reader; both must be provided at the same time. A usage sketch follows the list below.
1. When there is no snapshot file at `snapshotStartBatch` (note there is an off-by-one issue between version and batch Id), throw an exception.
2. Otherwise, the reader should continue to rebuild the state by reading delta files only, and ignore all snapshot files afterwards.
3. Note that if a `batchId` option is already specified, that batchId is the ending batchId, and we should end at that batch.
4. This feature supports state generated by HDFS state store provider and RocksDB state store provider with changelog checkpointing enabled. **It does not support RocksDB with changelog disabled which is the default for RocksDB.**
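
For illustration, a hedged usage sketch of the new options (option names as introduced by this PR; the format name and checkpoint path are assumptions, and an active `spark` session is assumed):

```scala
// Rebuild state for partition 3 starting from the snapshot written at
// batch 10, ignore later snapshots, and stop at batch 20.
val stateDf = spark.read
  .format("statestore")
  .option("snapshotStartBatchId", 10)
  .option("snapshotPartitionId", 3)
  .option("batchId", 20) // optional ending batch
  .load("/tmp/checkpoint") // placeholder checkpoint location
```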

### Why are the changes needed?

Sometimes when a snapshot is corrupted, users want to bypass it when reading a later state. This PR gives users the ability to specify the starting snapshot version and partition. This feature can be useful for debugging purposes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Created test cases for edge cases in the input of the new options. Created a test for the new public function `replayReadStateFromSnapshot`. Created an integration test for the new options against four stateful operators: limit, aggregation, deduplication, and stream-stream join. Instead of generating states within the tests, which is unstable, I prepared golden files for the integration test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46944 from eason-yuchen-liu/skipSnapshotAtBatch.

Lead-authored-by: Yuchen Liu <[email protected]>
Co-authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
@github-actions github-actions bot added the DEPLOY label Jul 2, 2024
@ericm-db ericm-db closed this Jul 10, 2024