♻️ REFACTOR: Replace character codes with strings #270

chrisjsewell · 2023-06-01T00:51:02Z

The use of StateBase.srcCharCode is deprecated (with backward-compatibility), and all core uses are replaced by StateBase.src.

Conversion of source string characters to integers representing their Unicode code points is prevalent in the upstream JavaScript implementation, to improve performance.
However, it is unnecessary in Python, and it leads to harder-to-read code and degraded performance (from the conversion during the StateBase initialisation).

StateBase.srcCharCode is no longer populated on initialisation, but is left as an on-demand, cached property to allow backward compatibility for plugins (deprecation warnings are emitted to identify where updates are required).
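As a rough illustration of the on-demand, cached, deprecated property described above, here is a minimal standalone sketch. `SketchState` and its private attribute are hypothetical names invented for this example; this is not the library's actual code, only one plausible way to get the same behaviour.

```python
import warnings
from typing import Optional, Tuple


class SketchState:
    """Hypothetical stand-in for a parser state (not the library's class)."""

    def __init__(self, src: str) -> None:
        self.src = src
        # Nothing is converted up front; the tuple is built on first access.
        self._src_char_code: Optional[Tuple[int, ...]] = None

    @property
    def srcCharCode(self) -> Tuple[int, ...]:
        # Warn on every access so plugin authors can locate the call site.
        warnings.warn(
            "srcCharCode is deprecated, use src instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if self._src_char_code is None:
            # A tuple keeps the cached value read-only for callers.
            self._src_char_code = tuple(ord(ch) for ch in self.src)
        return self._src_char_code


state = SketchState("# hi")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    codes = state.srcCharCode  # first access: converts and caches
```

The sketch does not invalidate the cache if `src` is reassigned later; a fuller version would need to handle that.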

isStrSpace is supplied as a replacement for isSpace, and similarly StateBlock.skipCharsStr/StateBlock.skipCharsStrBack replace StateBlock.skipChars/StateBlock.skipCharsBack.
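For plugin authors, the migration amounts to comparing one-character strings instead of integers. The sketch below uses hypothetical standalone helpers (`is_space_code`, `is_str_space`, `skip_chars_code`, `skip_chars_str`) that only mimic the idea behind the old and new functions named above; their bodies are assumptions for illustration, not the library's implementations.

```python
from typing import Sequence


def is_space_code(code: int) -> bool:
    # Old style: test an integer code point (tab or space).
    return code in (0x09, 0x20)


def is_str_space(ch: str) -> bool:
    # New style: test the character itself.
    return ch in ("\t", " ")


def skip_chars_code(char_codes: Sequence[int], pos: int, code: int) -> int:
    # Old style: advance while the code point at `pos` equals `code`.
    while pos < len(char_codes) and char_codes[pos] == code:
        pos += 1
    return pos


def skip_chars_str(src: str, pos: int, ch: str) -> int:
    # New style: advance while the character at `pos` equals `ch`.
    while pos < len(src) and src[pos] == ch:
        pos += 1
    return pos


src = "### Title"

# Old style needs the whole source converted to code points first.
old_pos = skip_chars_code(tuple(ord(c) for c in src), 0, 0x23)

# New style works on the source string directly.
new_pos = skip_chars_str(src, 0, "#")
```

Both scans stop at the same position; the string version simply avoids the up-front conversion and reads `"#"` rather than `0x23`.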

Co-authored-by: Taneli Hukkinen

codecov · 2023-06-01T00:53:51Z

Codecov Report

Patch coverage: 97.63% and project coverage change: -0.43 ⚠️

Comparison is base (c6754a2) 95.68% compared to head (4eeba11) 95.25%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
- Coverage   95.68%   95.25%   -0.43%     
==========================================
  Files          62       62              
  Lines        3315     3333      +18     
==========================================
+ Hits         3172     3175       +3     
- Misses        143      158      +15

| Flag | Coverage Δ | |
|---|---|---|
| pytests | `95.25% <97.63%> (-0.43%)` | ⬇️ |


| Impacted Files | Coverage Δ | |
|---|---|---|
| markdown_it/ruler.py | `87.59% <50.00%> (-2.65%)` | ⬇️ |
| markdown_it/rules_block/state_block.py | `88.15% <96.15%> (-6.88%)` | ⬇️ |
| markdown_it/common/utils.py | `88.88% <100.00%> (+0.93%)` | ⬆️ |
| markdown_it/helpers/parse_link_destination.py | `96.66% <100.00%> (ø)` | |
| markdown_it/helpers/parse_link_label.py | `100.00% <100.00%> (ø)` | |
| markdown_it/main.py | `87.07% <100.00%> (ø)` | |
| markdown_it/parser_block.py | `91.66% <100.00%> (ø)` | |
| markdown_it/rules_block/blockquote.py | `98.65% <100.00%> (ø)` | |
| markdown_it/rules_block/fence.py | `100.00% <100.00%> (ø)` | |
| markdown_it/rules_block/heading.py | `100.00% <100.00%> (ø)` | |
... and 20 more


executablebooks/markdown-it-py#270 deprecates `srcCharCode` and makes it immutable.

@hukkinj1

## 3.0.0 - 2023-06-03

⚠️ This release contains some minor breaking changes in the internal API and improvements to the parsing strictness.

**Full Changelog**: <executablebooks/markdown-it-py@v2.2.0...v3.0.0>

### ⬆️ UPGRADE: Drop support for Python 3.7

Also add testing for Python 3.11

### ⬆️ UPGRADE: Update from upstream markdown-it `12.2.0` to `13.0.0`

A key change is the addition of a new `Token` type, `text_special`, which is used to represent HTML entities and backslash escaped characters.
This ensures that (core) typographic transformation rules are not incorrectly applied to these texts.
The final core rule is now the new `text_join` rule, which joins adjacent `text`/`text_special` tokens, and so no `text_special` tokens should be present in the final token stream.
Any custom typographic rules should be inserted before `text_join`.

A new `linkify` rule has also been added to the inline chain, which will linkify full URLs (e.g. `https://example.com`), and fixes collision of emphasis and linkifier (so `http://example.org/foo._bar_-_baz` is now a single link, not emphasized).
Emails and fuzzy links are not affected by this.

* ♻️ Refactor backslash escape logic, add `text_special` [#276](executablebooks/markdown-it-py#276)
* ♻️ Parse entities to `text_special` token [#280](executablebooks/markdown-it-py#280)
* ♻️ Refactor: Add linkifier rule to inline chain for full links [#279](executablebooks/markdown-it-py#279)
* ‼️ Remove `(p)` => `§` replacement in typographer [#281](executablebooks/markdown-it-py#281)
* ‼️ Remove unused `silent` arg in `ParserBlock.tokenize` [#284](executablebooks/markdown-it-py#284)
* 🐛 FIX: numeric character reference passing [#272](executablebooks/markdown-it-py#272)
* 🐛 Fix: tab preventing paragraph continuation in lists [#274](executablebooks/markdown-it-py#274)
* 👌 Improve nested emphasis parsing [#273](executablebooks/markdown-it-py#273)
* 👌 fix possible ReDOS in newline rule [#275](executablebooks/markdown-it-py#275)
* 👌 Improve performance of `skipSpaces`/`skipChars` [#271](executablebooks/markdown-it-py#271)
* 👌 Show text of `text_special` in `tree.pretty` [#282](executablebooks/markdown-it-py#282)

### ♻️ REFACTOR: Replace most character code use with strings

The use of `StateBase.srcCharCode` is deprecated (with backward-compatibility), and all core uses are replaced by `StateBase.src`.

Conversion of source string characters to integers representing their Unicode code points is prevalent in the upstream JavaScript implementation, to improve performance.
However, it is unnecessary in Python, and it leads to harder-to-read code and degraded performance (from the conversion during the `StateBase` initialisation).

See [#270](executablebooks/markdown-it-py#270), thanks to [@hukkinj1](https://github.com/hukkinj1).

### ♻️ Centralise indented code block tests

For CommonMark, the presence of indented code blocks prevents any other block element from having an indent of 4 or more spaces.
Certain Markdown flavors and derivatives, such as mdx and djot, disable these code blocks though, since it is more common to use code fences and/or arbitrary indenting is desirable.
Previously, disabling code blocks did not remove the indent limitation, since most block elements had the 3 space limitation hard-coded.
This change centralises the logic of applying this limitation (in `StateBlock.is_code_block`), and only applies it when indented code blocks are enabled.

This allows for e.g.

```md
<div>
  <div>

    I can indent as much as I want here.

  <div>
<div>
```

See [#260](executablebooks/markdown-it-py#260)

### 🔧 Maintenance changes

Strict type annotation checking has been applied to the whole code base, [ruff](https://github.com/charliermarsh/ruff) is now used for linting, and fuzzing tests have been added to the CI, to integrate with Google [OSS-Fuzz](https://github.com/google/oss-fuzz/tree/master/projects/markdown-it-py) testing, thanks to [@DavidKorczynski](https://github.com/DavidKorczynski).

* 🔧 MAINTAIN: Make type checking strict [#](executablebooks/markdown-it-py#267)
* 🔧 Add typing of rule functions [#283](executablebooks/markdown-it-py#283)
* 🔧 Move linting from flake8 to ruff [#268](executablebooks/markdown-it-py#268)
* 🧪 CI: Add fuzzing workflow for PRs [#262](executablebooks/markdown-it-py#262)
* 🔧 Add tox env for fuzz testcase run [#263](executablebooks/markdown-it-py#263)
* 🧪 Add OSS-Fuzz set up by @DavidKorczynski in [#255](executablebooks/markdown-it-py#255)
* 🧪 Fix fuzzing test failures [#254](executablebooks/markdown-it-py#254)

chrisjsewell added 2 commits June 1, 2023 02:34

♻️ REFACTOR: Replace character codes with strings

3c48507

Update profiler.py

19f3907

chrisjsewell added a commit to executablebooks/mdit-py-plugins that referenced this pull request Jun 1, 2023

👌 Make field_list compatible with latest upstream

a0a94ab

executablebooks/markdown-it-py#270 deprecates `srcCharCode` and makes it immutable.

chrisjsewell mentioned this pull request Jun 1, 2023

👌 Make field_list compatible with latest upstream executablebooks/mdit-py-plugins#75

Merged

chrisjsewell added a commit to executablebooks/mdit-py-plugins that referenced this pull request Jun 1, 2023

👌 Make field_list compatible with latest upstream (#75)

2040839

executablebooks/markdown-it-py#270 deprecates `srcCharCode` and makes it immutable.

chrisjsewell linked an issue Jun 1, 2023 that may be closed by this pull request

Improving performance #198

Closed

Update port.yaml

4eeba11

chrisjsewell merged commit f52249e into master Jun 1, 2023

chrisjsewell deleted the ord-to-char branch June 1, 2023 01:45
