Multi-line string literals (blocks of lines) #161

Closed
Tronic opened this issue Aug 17, 2019 · 41 comments · Fixed by nim-lang/Nim#15264

Comments

@Tronic

Tronic commented Aug 17, 2019

I would suggest -- instead of, or in addition to Python-style """ literals -- using indented block syntax for multi-line string literals. E.g.

proc foo =
  let str = ":
    Hello
      World!
  stdout.write str

Where str is defined equivalent to

let str = "Hello\l  World!\l"

This syntax avoids the indentation problem with string literals that .unindent attempts to address. Also, for clarity, all string content appears within the block, not on the opening or closing lines as is with """.

The literal terminates as soon as the block ends (i.e. a non-empty line indented less is found), avoiding the need for """ at the end. This also avoids the need to escape double quotes that belong to the string.

Whitespace at the end of any line and empty lines at the end would be omitted (and could be added via escape sequences in the rare cases where needed). Whitespace-only lines in the middle would become simply \l (no matter if there are spaces or not). This removes any ambiguity with source code formatting and makes the intention explicit.

This suggestion proposes that the string block be indented by exactly two spaces (relative to the line with ": in it). Any further initial spaces would become string content.

This could still be used within parentheses or another expression, provided that the continuation of that expression appears less indented than the string content.
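The rules above can be sketched as an ordinary Nim proc operating on raw source lines. This is only an illustration under stated assumptions: `blockLiteral` and `leadingSpaces` are hypothetical names, the proc is not the lexer change itself, and escape sequences are ignored.

```nim
import strutils

proc leadingSpaces(line: string): int =
  ## Number of space characters at the start of `line`.
  while result < line.len and line[result] == ' ':
    inc result

proc blockLiteral(lines: openArray[string], openerIndent: int): string =
  ## Hypothetical helper: `lines` are the source lines following the `":`
  ## opener, `openerIndent` is the indentation of the opener line.
  let margin = openerIndent + 2          # content sits exactly two spaces deeper
  var content: seq[string]
  for line in lines:
    let trimmed = line.strip(leading = false)   # drop trailing whitespace
    if trimmed.len == 0:
      content.add ""                     # whitespace-only line becomes a bare \l
    elif leadingSpaces(line) < margin:
      break                              # less-indented line ends the block
    else:
      content.add trimmed.substr(margin) # extra indentation stays in the string
  while content.len > 0 and content[^1].len == 0:
    discard content.pop()                # empty lines at the end are omitted
  for line in content:
    result.add line
    result.add "\l"

# The example from the proposal: opener indented by 2, content by 4.
doAssert blockLiteral(["    Hello", "      World!", "  stdout.write str"], 2) ==
  "Hello\l  World!\l"
```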

@Tronic
Author

Tronic commented Aug 17, 2019

This is based on standard practices with text file formatting (removal of extra whitespace and adding LF after each line).

Adding \r explicitly at the end of each line completes the CR-LF sequence for Internet protocols (not even Windows needs it in text files anymore).

Any line within the block may be terminated by a backslash. This is useful for splitting otherwise overly long lines on multiple source code lines without adding LFs to the string, and on the last line to prevent the final newline.
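A minimal sketch of the backslash rule, assuming the margin has already been removed from each content line. `joinBlockLines` is a hypothetical name, and the sketch deliberately ignores the corner case of a line ending in an escaped backslash (`\\`).

```nim
import strutils

proc joinBlockLines(lines: openArray[string]): string =
  ## Hypothetical helper: joins content lines with LF, except that a line
  ## ending in a backslash continues on the next line without an LF.
  for line in lines:
    if line.endsWith('\\'):
      result.add line.substr(0, line.len - 2)  # drop the backslash, add no LF
    else:
      result.add line
      result.add "\l"

# A split long line, and a final line with the trailing newline suppressed.
doAssert joinBlockLines(["a long \\", "line", "no final newline\\"]) ==
  "a long line\lno final newline"
```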

@awr1

awr1 commented Aug 17, 2019

This makes the heredoc situation in Nim too complicated IMO. """something goes here""".unindent is a simple and satisfactory solution that needs no further enhancement, plus the fact that "oh, you can use escape sequences now!" makes things weirder to me. If you have too many concerns about lengthy string literals, honestly it's better to just not use heredocs and just staticRead() from a textfile instead.

@Tronic
Author

Tronic commented Aug 18, 2019

The """ hack of Python is problematic precisely because it mixes source code formatting with string contents. Having PyDocs or .unindent "handle" this is far from satisfactory.

import strutils

# Correct output but messed up source code formatting
for i in 1..2:
  stdout.write("""<li>
  Item
</li>
""")

# Incorrect output (Item not indented)
for i in 1..2:
  stdout.write("""
    <li>
      Item
    </li>
    """.unindent)

# Proposed string literal: clean source code that matches output
for i in 1..2:
  stdout.write(
    ":
      <li>
        Item
      </li>
  )

Fixing this in Python would be quite problematic at this time, but Nim as a new language based on indented blocks definitely /should/ get it right.

@Tronic
Author

Tronic commented Aug 18, 2019

@awr1 Escape sequences and whitespace handling are mentioned for completeness. This proposal requires less of them than the current string literals do. Reading from external files is not really a solution. The need for longer string literals (beyond docstrings) is clear and that's why """ literals exist in the first place; their implementation just sucks.

@awr1

awr1 commented Aug 18, 2019

Then IMO the behavior of unindent() is probably incorrect. It should count the whitespace characters before the first non-whitespace character in the string (call that count x), and then remove only the first x whitespace characters from every line in the string.
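A rough sketch of that behavior, written as a separate proc rather than a change to strutils. `unindentByFirstLine` is a hypothetical name, and leaving leading blank lines untouched is an assumption the comment does not spell out.

```nim
import strutils

proc unindentByFirstLine(s: string): string =
  ## Hypothetical helper: the first non-blank line dictates how many
  ## leading whitespace characters (x) are removed from every line.
  var x = -1                             # -1 until the first non-blank line
  var kept: seq[string]
  for line in s.splitLines:
    if x < 0 and not line.isEmptyOrWhitespace:
      x = 0
      while x < line.len and line[x] in Whitespace:
        inc x
    var i = 0                            # remove at most x whitespace characters
    while i < max(x, 0) and i < line.len and line[i] in Whitespace:
      inc i
    kept.add line.substr(i)
  result = kept.join("\l")

# Relative indentation of "Item" survives, unlike with plain unindent.
doAssert unindentByFirstLine("    <li>\l      Item\l    </li>\l    ") ==
  "<li>\l  Item\l</li>\l"
```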

@awr1

awr1 commented Aug 18, 2019

A new function could be probably added to strutils, or you could add a defaulted boolean option to unindent() to avoid breaking API compat. I agree that this problem should be fixed, but the core language should not have to change for it.

@krux02
Contributor

krux02 commented Aug 18, 2019

Generally I like the idea. I never really liked triple string literals as they are messy. Yet I don't like to change the language for this minor annoyance if the workarounds that don't need a language change haven't been fully explored. Scala's solution to this problem is stripMargin

val speech = """Four score and
               |seven years ago""".stripMargin

Another big problem is that I have no idea how to tell my editor (and GitHub and all the other editors out there) that ": is the start of an indentation-block-based string literal.
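For comparison, a stripMargin-style helper could be written in Nim as a plain library proc. This is a sketch loosely modeled on the Scala method, not an existing strutils routine; the name `stripMargin` and the `marginChar` parameter are assumptions made here.

```nim
import strutils

proc stripMargin(s: string, marginChar = '|'): string =
  ## Hypothetical helper: on each line, removes leading whitespace up to and
  ## including `marginChar`; lines without the margin character are kept as-is.
  var kept: seq[string]
  for line in s.splitLines:
    let body = line.strip(trailing = false)   # strip leading whitespace only
    if body.len > 0 and body[0] == marginChar:
      kept.add body.substr(1)
    else:
      kept.add line
  result = kept.join("\l")

doAssert stripMargin("Four score and\l               |seven years ago") ==
  "Four score and\lseven years ago"
```

Like any library approach, this still requires the margin character on every continuation line and does nothing about escaping quotes.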

@Tronic
Author

Tronic commented Aug 18, 2019

I made a quick proof of concept with minimal changes to the lexer. It needs some further work even if accepted into the language (such as a separate lexer token type for this literal).

@Araq
Member

Araq commented Aug 18, 2019

IMHO the syntax should be:

const foo = '''
  string literal here that
  needs no closing quotes

but it's far too late for this. Yet another way to write string literals is the last thing we need. We would need to patch nimpretty and every Nim syntax highlighter out there. And without highlighting support this feature seems to be quite dangerous.

@Tronic
Author

Tronic commented Aug 18, 2019

@krux02 Most editors seem to ship Nim mode already, and would probably update their handling promptly if the language was changed.

Meanwhile, this certainly is a problem because many editors and Github syntax highlighter consider anything that follows to be a string, until the next " appears somewhere else, although even with the current language syntax (with any language out there, really) they should terminate single-quoted string processing at the first newline.

Indentation is not so much a problem; one extra tab press at most, because standard auto-indent behaves well with this literal.

Library solutions cannot work properly because once the string is formed, information about source code indentation is no longer available. Adding another special character to denote margin isn't really helpful. Also, such solutions cannot avoid the need to escape quote marks within the literal, like the string block does.

@Tronic
Author

Tronic commented Aug 18, 2019

In any case, fixing this sort of issue is much better to do at Nim 0.21, a language used by a handful of projects, rather than after 1.0. Using ": as the token also does not affect existing software (although I would like to see """ deprecated and eventually removed entirely -- far prior to 1.0 release). First I considered """: or similar, but that would break existing software. Also, ":, if put on its own line, provides visual cue to where the left margin of string content goes (given that a string block must be indented exactly two spaces, which is already the recommended indentation for Nim).

@awr1

awr1 commented Aug 18, 2019

Can this issue be moved to RFCs?

@Tronic
Author

Tronic commented Aug 19, 2019

Regarding the symbol used to start it, ": directly communicates that it is string and a block but has the disadvantage of being mishandled by existing tools. Something that is not considered to be a start of string would be less invasive, e.g. $: would probably communicate the same thing in Nim context but the content would be seen as code in syntax highlighters, and the colon might trigger smart indentation in some tools (in particular, those based on Python rules).

I am definitely open to this sort of suggestion, although I believe that in the long run the support of current tools should not really be a consideration. The benefit of ": is that it instantly triggers any coder to notice that something unconventional is happening, while with $: that might not be as apparent, and the content being a string would be not at all apparent to non-Nim coders.

@narimiran narimiran transferred this issue from nim-lang/Nim Aug 19, 2019
@narimiran narimiran removed their assignment Aug 19, 2019
@juancarlospaco
Contributor

juancarlospaco commented Aug 19, 2019

YAML already has a very well-known and documented construct for this, so why not just use that?

I think it is awesome that you can use literal JSON in Nim code directly,
so maybe copy that feature of YAML too.
YAML is an open format, and already supported by tons of software.

YAML syntax can be very friendly as the start of a block because it uses :> or :|,
and it can live in the sugar module; after all, that's what sugar is supposed to do.

let variable0 = :>
    YAML like literals.

let variable1 = :|
    YAML like literals.

https://en.wikipedia.org/wiki/YAML#Indented_delimiting
🤔

@SolitudeSF

:| and :> could clash with user-defined operators, while ": can't.
but i don't see why this should be a language change, if all it does is break every syntax highlighter.

@juancarlospaco
Contributor

sugar.`:>` 

then ❔

I agree that I don't feel a huge need for this. 🤷‍♀️

@Tronic
Author

Tronic commented Aug 19, 2019

FWIW, a comment at the end avoids problems with current highlighters without changing anything else (a simple hack - not part of RFC):

    await client.send ":
      HTTP/1.1 200 OK\r
      content-type: text/plain\r
      content-length: 13\r
      \r
      Hello World!
    #"

@juancarlospaco
Contributor

a comment at the end avoids problems
a hack

🤔

@SolitudeSF

@Tronic which highlighters? github doesn't highlight correctly anyway.
in my editor it looks like this:
[image]

which is the correct representation of the current syntax, since " strings can't be multiline.
and no, #" is not a solution even if it worked.

@Tronic
Author

Tronic commented Aug 19, 2019

@SolitudeSF I use this in VSCode. Obviously tools need to be fixed, and that really shouldn't be a big issue. After all, they already manage to handle the mix of different quotation formats & comment parsing, incl. Nim-specific syntax and escape sequences.

@juancarlospaco
Contributor

For stuff like this I just use staticRead 🤷‍♀️

@SolitudeSF

i don't see how this can be trivially fixed, since most editors use regex-based highlighting, which can't have indentation awareness.

@juancarlospaco
Contributor

Too bad you can not do the strformat formatted multi-line literal fmt""" """ in there. 😿

@krux02
Contributor

krux02 commented Aug 19, 2019

@Tronic If editor support can't be provided, I can only reject this feature. What value does it have when virtually no editor will support it, or if it takes years until the editors have a solution for it? Also, I am the one who maintains the Emacs integration at this point; it is not as if Emacs will magically grow support for this feature.

@awr1

awr1 commented Aug 20, 2019

I'll admit I was wrong about unindent() not needing any change, but I would much prefer unindent() to be fixed. It honestly feels way too late in the game for a grammar change like this, especially one that may not be reliably workable with certain editor syntax highlighting engines.

@Tronic
Author

Tronic commented Aug 20, 2019

@SolitudeSF Regex cannot match indent?

\n([ \t]*)[^\n]*":\n(\1  [^\n]*\n|[ \t]*\n)*

matches this string block. Use backward lookup or editor's custom handling of captures, if necessary. Every serious editor implements some sort of recursive matching in addition to basic regex to be able to do parenthesis matching, to handle HTML closing tags etc.

If nimpretty is a concern, I am sure I can quickly patch that as well.

@GULPF
Member

GULPF commented Aug 20, 2019

If this can be properly highlighted with a tmLanguage syntax definition (which VSCode and many other editors use), I would be interested to see how. I think it's impossible, but I don't know for sure. I tested the YAML tmLanguage syntax definition and it seems pretty broken for strings.

@Tronic
Author

Tronic commented Aug 20, 2019

This sort of approach seems to work (tried in VSCode):

"begin": "( *)(\":)$",
"while": "^\\1  ",

I'll have a proper look later.

@Clyybber

@Tronic AFAICT your regex cannot deal with arbitrary indentation. In Nim, indenting with any number of spaces is allowed. If there exists a regex that can work with arbitrary indentation then I will support this feature.

@Tronic
Author

Tronic commented Aug 21, 2019

VSCode highlighter updated to support r": and ": literals. It seems to be working but needs more testing.

@Clyybber Surrounding code may be indented by an arbitrary number of spaces. String block contents must be indented by exactly two spaces relative to the leading line, as discussed in this thread. This is to allow indentation to appear within string content, so any indentation on top of those two spaces is included in the string.

The highlighter marks string content and block indent with separate classes, so that in principle one could style and make the two-space margin visible by CSS effects (not that I recommend doing so).

@Clyybber

Clyybber commented Aug 21, 2019

@Tronic I don't think we should enforce those to be indented by exactly two spaces.
Instead make the first line dictate the indentation, or make the line with the least indentation inside the string block dictate the indentation.
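The second alternative (the least-indented line dictates the margin) can be sketched as a library-style proc. `dedentByLeast` is a hypothetical name; ignoring blank lines when computing the margin, and emitting them as empty lines, are assumptions.

```nim
import strutils

proc dedentByLeast(s: string): string =
  ## Hypothetical helper: removes the smallest indentation found on any
  ## non-blank line from every line.
  var least = int.high
  for line in s.splitLines:
    if not line.isEmptyOrWhitespace:
      var n = 0
      while n < line.len and line[n] == ' ':
        inc n
      least = min(least, n)              # track the shallowest content line
  var kept: seq[string]
  for line in s.splitLines:
    if line.isEmptyOrWhitespace:
      kept.add ""                        # blank lines carry no indentation
    else:
      kept.add line.substr(least)
  result = kept.join("\l")

# The first line is deeper than the second; the least indentation (4) wins.
doAssert dedentByLeast("      Item\l    <li>\l") == "  Item\l<li>\l"
```

Under this rule a string whose every line is meant to start with spaces cannot be expressed without escapes, which is the case the fixed two-space margin is meant to cover.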

@timotheecour
Member

timotheecour commented Aug 21, 2019

@Tronic
relevant discussion: https://forum.nim-lang.org/t/471#23415 (Does Nimrod have a heredoc syntax?)
this RFC would have to compare its merits against heredoc.

pros of heredoc

(as used in D, see https://forum.nim-lang.org/t/471#23415):

  • visually clear (and easy to grep) where string ends
  • copy pasting a string doesn't require re-indenting it; but indenting at same level as code (followed by .unindent) is an option if user prefers to keep their string at block indent
  • works in all cases (a suitable identifier always exists that prevents a clash with given string); see note in https://forum.nim-lang.org/t/471#23415 regarding a non-ambiguous way to terminate the heredoc string that can represent any string even if it doesn't end with \n
  • works better with editors (see below, at least github and sublimetext is ok)

let s = q"EOS
This is a multi-line
heredoc string; no need to re-indentEOS"
echo s

produces: This is a multi-line\nheredoc string; no need to re-indent

@Araq
Member

Araq commented Aug 22, 2019

If the argument is "you can always come up with a delimiter that isn't used" then Nim's triple quotes work just as well:

import strutils

const
  s = """
foobar
UNUSED_DELIM
baz
""".replace("UNUSED_DELIM", "\"\"\"")

Requires no language change and is easier to implement for highlighters as it doesn't involve a regex with backtracking (which is NP complete iirc?)

@krux02
Contributor

krux02 commented Aug 27, 2019

This is how string literals work in c++11, where R"V0G0N( and )V0G0N" act as delimiters.

const char * vogon_poem = R"V0G0N(
             O freddled gruntbuggly thy micturations are to me
                 As plured gabbleblochits on a lurgid bee.
              Groop, I implore thee my foonting turlingdromes.   
           And hooptiously drangle me with crinkly bindlewurdles,
Or I will rend thee in the gobberwarts with my blurlecruncheon, see if I don't.
                (by Prostetnic Vogon Jeltz; see p. 56/57)
)V0G0N";

Not only does it allow specifying arbitrary delimiters that won't clash with the content, it would also allow writing editor extensions that detect such string blocks for syntax highlighting. Then you can have SQL strings, Python strings, etc., all with correct syntax highlighting. Currently Nim has call string literals, for example SQL"""select elephant from africa""". This already works partially, but it won't work that well for embedded Python strings, as """ is a very common Python token.

@timotheecour
Member

yes, that's C++'s version of D's heredoc string I mentioned above in #161 (comment) . Ability to copy paste code without messing with replace to fixup a delimiter to escape (#161 (comment)) is nice. Yes, it's one more thing to learn though.

@Tronic
Author

Tronic commented Aug 28, 2019

@krux02 Theoretically it can overcome the delimiter appearing in content problem. In practice everyone just uses it as another form of """ and complains that R"(...)" looks uglier than the same thing in some other language. As @Araq pointed out, one should not be required to invent unique identifiers. Also, this sort of literal completely fails to address the indentation problem.

Indented-block literals make a clear separation between source code formatting (indent of the block) and string content (any characters within the block). This way clean source code formatting can be preserved without introducing extra whitespace into the string.

For me it is actually really hard to understand how in the 2010s people still design formats with issues that were widely understood and fixed in the 1990s, if not decades earlier. I presume that the argument has always been that "we cannot fix this because of compatibility" and that "it would take years". As I have demonstrated in this thread, fixing it both in the Nim compiler and in popular text editors took only a few hours of work, and frankly I've already spent far more than that here, arguing for it.

@Araq
Member

Araq commented Aug 28, 2019

As I have demonstrated in this thread, fixing it both in the Nim compiler and in popular text editors took only a few hours of work

Well we need to check that. I'm not convinced that popular text editors can be "fixed".

kosz78 pushed a commit to pragmagic/vscode-nim that referenced this issue Sep 30, 2019
Implements highlight for string block literals as discussed in nim-lang/RFCs#161
@krux02
Contributor

krux02 commented Apr 11, 2020

@Tronic I might have changed my opinion about this kind of string literals. They are very valuable for emit and asm statements. They are also very useful for my very own project here.

@Clyybber

With the introduction of strutils.dedent this can now be closed:

import strutils
proc foo =
  let str = dedent """
    Hello
      World!
  """
  stdout.write str

foo()

will print

Hello
  World!

@Araq Araq closed this as completed Sep 22, 2020
@AmjadHD

AmjadHD commented Aug 24, 2022

So, I have to import strutils for this. IMO triple string literals should be dedented by default in 2.0 (like Julia).

@Varriount

@AmjadHD I would suggest writing another proposal for that. I'm even optimistic that it would be accepted, since it's unlikely to cause too much breakage.
