Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the ? format specifier for strings and characters #3656

Merged
merged 21 commits into from
May 18, 2023

Conversation

cpplearner
Copy link
Contributor

@cpplearner cpplearner commented Apr 17, 2023

This implements [format.string.escaped], which is part of WG21-P2286R8 "Formatting Ranges" and modified by WG21-P2713R1 "Escaping Improvements In std::format". Works towards #2919.

To implement this feature, two arrays, __printable_ranges and _Grapheme_Extend_ranges, are added to __msvc_format_ucd_tables.hpp.

  • __printable_ranges represents code points whose General_Category is in the groups L, M, N, P, S (that is, code points that are not from categories Z or C), plus the ASCII space character.
    • Characters outside of these ranges are always escaped, usually using the \u{hex-digit-sequence} format. ([format.string.escaped]/(2.2.1.2.1))
    • It might make sense to store the unmodified General_Category, instead of this invented property. This requires more storage and a new data structure, though.
  • _Grapheme_Extend_ranges represents code points with the Unicode property Grapheme_Extend=Yes.
    • Characters in these ranges are escaped unless they immediately follow an unescaped character. ([format.string.escaped]/(2.2.1.2.2))
    • It would be more space efficient to reuse the existing data for Grapheme_Cluster_Break: Grapheme_Extend=Yes is Grapheme_Cluster_Break=Extend minus Emoji_Modifier=Yes, and Emoji_Modifier=Yes is just 1F3FB..1F3FF. I chose to define a new array for simplicity.

When the literal encoding is not UTF-8, UTF-16, or UTF-32, the set of "separator or non-printable characters" is implementation-defined. In this implementation, the set consists of all characters that correspond to non-printable Unicode code points (that is, code points outside of __printable_ranges, see above). If a character is non-printable, it is translated into \u{XXXX}, where XXXX is the hex value of the Unicode code point (not the original value).

If a code unit sequence cannot be converted to a Unicode scalar value, the \x{XX} escape sequence is used.

@cpplearner cpplearner requested a review from a team as a code owner April 17, 2023 08:22
stl/inc/__msvc_format_ucd_tables.hpp Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Show resolved Hide resolved
stl/inc/format Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej self-assigned this Apr 26, 2023
StephanTLavavej

This comment was marked as resolved.

Comment on lines +1681 to +1685
#if _HAS_CXX23
case '?':
_Cat = _Presentation_type_category::_Escape;
break;
#endif // _HAS_CXX23
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential future improvement:

We could recognize this in c++20 and earlier modes and emit a helpful diagnostic.

stl/inc/__msvc_format_ucd_tables.hpp Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
tests/std/tests/P2286R8_formatting_ranges/test.cpp Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej removed their assignment May 12, 2023
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej
Copy link
Member

FYI @barcharcraz, changes have been pushed since your approval but they were minor so I don't think you need to take another look.

@StephanTLavavej StephanTLavavej self-assigned this May 17, 2023
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@CaseyCarter CaseyCarter merged commit 9fdc896 into microsoft:main May 18, 2023
@CaseyCarter
Copy link
Member

Thanks for implementing this sub-feature, which is not questionable in any way!

@cpplearner cpplearner deleted the p2286r8-formatting-ranges branch May 18, 2023 23:50
JMazurkiewicz added a commit to JMazurkiewicz/STL that referenced this pull request Jul 30, 2023
* This PR implements debug-enabled standard `formatter` specializations,
* Drive-by: rename tests added in microsoft#3656 - use `text_formatting_` prefix instead of `formatting_ranges_` (for consistency),
* Towards microsoft#2919.
JMazurkiewicz added a commit to JMazurkiewicz/STL that referenced this pull request Jul 30, 2023
* This PR implements debug-enabled standard `formatter` specializations,
* Drive-by: rename tests added in microsoft#3656 - use `text_formatting_` prefix instead of `formatting_ranges_` (for consistency),
* Towards microsoft#2919.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cxx23 C++23 feature format C++20/23 format
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants