Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change cell delimiter? #48

Open
jpmorris opened this issue Jun 30, 2022 · 4 comments
Open

Change cell delimiter? #48

jpmorris opened this issue Jun 30, 2022 · 4 comments

Comments

@jpmorris
Copy link

jpmorris commented Jun 30, 2022

I know it may break earlier code, but is there anyway we can change the cell delimiter to something else besides /n/n?

Two newlines is very common, things shift, and I think this (along with accidently opening as a normal sql txt) seem to be causing an error that my cells are CONSTANTLY collapsing into one monolithic cell. (Also I suspect that it may be introducing merge conflicts when I rebase.)

Wouldn't something like:
--#vscode-sql-notebook

or anything really that would help retain cell formatting after presses, pushed and pulled from repos and opening in other editors.

@cmoog
Copy link
Owner

cmoog commented Jun 30, 2022

seem to be causing an error that my cells are CONSTANTLY collapsing into one monolithic cell.

I suspect this is due to an automated SQL formatter that collapses \n\n tokens. If not, can you share more details for debugging?

something like: --#vscode-sql-notebook

I think this is a good suggestion. One additional consideration: there is some benefit to reducing the introduction of custom syntax because it keeps the UX of opening otherwise plain SQL files as notebooks functional. Specifically, if a user opens an existing SQL file I don't want a broken-by-default experience. Using this new delimiter would mean the default new experience would be one giant monolithic cell.

One potential solution is to deserialize \n\n as a delimiter but then serialize --#vscode-sql-notebook as the delimiter. This has the obvious downside of introducing erroneous diffs. I'd like to maintain the following property: $\text{serialize}(\text{deserialize}(x)) = x$.

@jpmorris thoughts?

@jpmorris
Copy link
Author

jpmorris commented Jun 30, 2022

I'll see what it is so particular about my workflow that I keep running into issues. It's not an easy issue to debug, but if I'm very careful to watch the file for changes as I use git rebase and opening and closing with vscode maybe I'll notice something.

I agree \n\n is the most portable for sure, but unfortunately there is no standard or convention of sql authors. I think \n\n is a good first guess of most authors to designate "this section is a new query or separate 'code block'". But even though that's what most people use, it's probably only 30% (wild guess) of the authors are that rigorous. If most people are like me, you put in 1, 2, 3 or n newlines with comments and all sorts of mess. Moreover, most people who have sql files never intended to use them as cell-based workflow. What you are doing with this databricks/juupyter-style workflow is very avant garde (kudos btw). But most data engineeers I've seen have this horrible copy-to-notebook++ right out of their DB Admin tool of choice workflow. Few, in my estimation, are doing a juypter/zepplin/databricks notebook-style SQL editing. So it seems like the desire to auto-parse any .sql that was never intended for cell-based workflow into cells probably has limited ( but yes real) use.

Yeah I wouldn't have the serialize/deserialize operations be different.

Maybe there's a way to allow for both parsings? Maybe a config in the UI to allow power users to use a standard (or specify their own) delimiter? Then the plugin would try its best to auto-cell any sql file that you throw at it, but it would allow us power users to force their cells where they like with complete control (and hopefully avoid multiple-editor, git issues).

As an aside, if you're not aware, this question of "additional markup" must have been on the minds of the juypter folk too. But they made some poor choices in my mind that broke any version control and other important workflow (just parsing/cating files). So I think your intuition to build on top of standard plain sql files solid. Both Pycharm and VSCode have worked around juypter's file format issues by introducing their own parsing markup. Pycharm calls this scientific mode, and you can annotate any cell in a standard .py file (not .ipynb) you want with #%%

https://blog.jetbrains.com/pycharm/2018/04/pycharm-scientific-mode-with-code-cells/

VSCode Also adopted a similar convention with # %% (not sure how the space differs in both):

https://code.visualstudio.com/docs/python/jupyter-support-py

This allows them to have a cell-based workflow without relying on Jupyter's horrible xml- (or json?)-based file format that is impossible to version control and gets you little (in my mind), since I believe both these workarounds still allow for markdown.

I think they had to create a xml-based file format to retain output. But surely there was a better way to do this. Something that interpolates and segregates input and output and then let the application layer (and not the format) worry about the fancy graphs and output. (I don't see how dynamic code-block insertion doesn't solve all of juypter's fancy output requirements; and any app could take care of that presentation-layer rather than be a format parser.) By segregating input/output you wouldn't give another reason for us who work with personal data another reason to scrap the jupyter format.

Maybe, some things to think about (esp because I see you are getting requests to retain output).

@cmoog
Copy link
Owner

cmoog commented Jul 21, 2022

My apologies for the delayed response. @jpmorris lots of great thinking there, I appreciate the depth.

Maybe there's a way to allow for both parsings? Maybe a config in the UI to allow power users to use a standard (or specify their own) delimiter?

This is a good suggestion, but thinking through the implementation details, I think it would add unnecessary complexity for the user. Consider the UX of opening a shared notebook with a collaborator

  • do they need to coordinate delimiter selection?
  • if the extension tries to auto detect delimiters, what happens when multiple are used in the same document?
  • if I change delimiter selection does it attempt to coerce and overwrite existing delimiters?

Few, in my estimation, are doing a juypter/zepplin/databricks notebook-style SQL editing. So it seems like the desire to auto-parse any .sql that was never intended for cell-based workflow into cells probably has limited ( but yes real) use.

Upon further rumination, I think you're correct. And since a configurable delimiter isn't worth the UX overhead for the reasons listed above, I think the best path forward is to replace the current \n\n with --#sql-cell. This also feels generic enough that it could be adopted by other tools (as compared to --#vscode-sql-notebook).

Of course, this will collapse all existing sql notebooks into a single cell. So long as we call out this breaking change in the changelog, I think it's fine. We can even provide a little bash command in the release notes that will replace all instances of \n\n with --#sql-cell.

@jpmorris
Copy link
Author

Yes if you want to pick a standard then I would say that's better than user customization. I would only say that user-defined delimiter is better keeping it ONLY \n\n Teams have to use all sorts of tools (linters, IDEs, config files, git hooks and ,git line ending handlings, tab vs space coding standards, coding widths, workflows and dockerfiles etc) I don't see this being that different. The team would coordinate on a delimiter and they don't care about other delimiters (so the extension would only need to prefer user-config and no coercion necessary). But, again, if you want to pick a standard that's better than letting the user decide. (But letting the user decide is better than ONLY \n\n.)

Part of the reason the issue you raised is because there's no standard here. If there was a standard then people would mostly abide by that standard. (User-defined delimiter would also probably be unnecessary, or if there were 2 standards--like line endings or tab v space, or even auto-formatters (like python black)--the config would be a simple dropdown like many of the settings in vscode UI.)

I think --#sql-cell sounds good. I was thinking you might consider the current 'standard' (as much as it can be) that I mentioned that is trying to be established by VSCode and Pycharm of using #%% for non-jupyter cell-based notebooks/workflow. But since you'll need to put the two dashes in front to make it a sql comment I don't think it gets you much since you're not REALLY adopting their cell-delimiter standard. You're out in front in this space. Databricks (and I think zepplin) do support SQL notebooks, but all the cell delimiting happens through the UI. They can export in certain formats like html (for presentation) and I think they can export as a more usable format. If you care at all to match ANY standard you could launch their free databricks trial and see if their exported notebooks have some sort of delimiter you can be consistent with. I wouldn't be surprised if it was a propitiatory format or a non-extendable json/xml source though. Or maybe just google to see if ANYONE is doing anything like you and see if they want to coordinate on a standard. If not, then you're in that unique spot that you get to pick a standard.

Thanks for all the hard work and continual development on this. Do you have a tip jar someone can donate to?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants