[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

spbansal · 2023-12-01T05:39:15Z

Describe the bug
When using %%send-to-spark with a local pandas df which contains '\n' or ' (single quote) character, the command fails with the following error
SyntaxError: invalid syntax
To Reproduce
You can run the following notebook for reproducing this issue

%%local
csv_string="""Column2,Column3
0.5205293969790654,A's and B's
0.2995962535000757,B's_another_example
0.7410351292041173,C's_example
"""
large_df = pd.read_csv(StringIO(csv_string))

%%send_to_spark -i large_df -t df -n large_df

Expected behavior
The dataframe should have been properly send to the remote spark kernel.

Screenshots

Versions:

SparkMagic - 0.20.0
Spark 2.4.3

Additional context
A workaround for this was to base64 encode any string fields before sending to spark. It would be great if this or a different solution could be patched in the repo.

The text was updated successfully, but these errors were encountered:

dosubot · 2023-12-01T05:39:27Z

🤖

Hello @spbansal,

I'm Dosu, a bot here to assist you while a human maintainer is not available. I'm designed to help resolve issues, answer questions, and guide you on how to contribute to our projects. I'm here to help, so let's work together to find a solution to your issue.

Thank you for reporting this issue. It seems like the problem is due to the '\n' and single quote characters not being properly escaped when the pandas dataframe is converted to a JSON string. This can be fixed by modifying the _pandas_df_to_json method in the SendPandasDfToSparkCommand class to properly escape these characters.

Here's a patch that should fix the issue:

from pandas.io.json import dumps

class SendPandasDfToSparkCommand(Command):
    # ...

    def _pandas_df_to_json(self, records):
        return dumps(records, ensure_ascii=False).replace('\n', '\\n').replace("'", "\\'")

This patch modifies the _pandas_df_to_json method to replace newline characters ('\n') with their escaped version ('\n') and single quotes with escaped single quotes ('\''). The dumps function from pandas.io.json is used to convert the dataframe to a JSON string.

Please apply this patch to your local copy of sparkmagic and test it with your use case. If it works as expected, you may want to submit it as a pull request to the sparkmagic repository so that it can be included in future versions of the software.

As for your question about how the '%%send-to-spark' command is implemented in the sparkmagic codebase, it is implemented in the SendToSparkCommand class in the sendtosparkcommand.py file. This class extends the Command class and overrides the execute method. The execute method takes a session as an argument and tries to execute a command that is returned by the to_command method. The to_command method takes the session kind, input variable name, input variable value, and output variable name as arguments. Depending on the session kind, it returns a PySpark, Scala, or R command. These commands are created by the _pyspark_command, _scala_command, and _r_command methods respectively. These methods are marked as abstract and need to be implemented in a subclass. They raise a NotImplementedError if they are not overridden.

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

lepusarticus · 2024-01-17T20:04:59Z

@dosu-bot

Based on the error message, I think that a cell containing double quotes would also be a problem.

I prefer the human suggestion to encode everything in base64 and decode on the other end. It would be safer.

dosubot bot added the kind:bug An unexpected error or issue with sparkmagic label Dec 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

spbansal commented Dec 1, 2023 •

edited

Loading

dosubot bot commented Dec 1, 2023 •

edited

Loading

lepusarticus commented Jan 17, 2024 •

edited

Loading

[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

Comments

spbansal commented Dec 1, 2023 • edited Loading

dosubot bot commented Dec 1, 2023 • edited Loading

Sources

lepusarticus commented Jan 17, 2024 • edited Loading

spbansal commented Dec 1, 2023 •

edited

Loading

dosubot bot commented Dec 1, 2023 •

edited

Loading

lepusarticus commented Jan 17, 2024 •

edited

Loading