Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5$: refactor: Reuse simdjson's number parsing in quick-lint-js #915

Closed
strager opened this issue Jan 22, 2023 · 7 comments
Closed

5$: refactor: Reuse simdjson's number parsing in quick-lint-js #915

strager opened this issue Jan 22, 2023 · 7 comments
Assignees
Labels
for hire Get paid for working on this task: https://quick-lint-js.com/hiring.html good first issue Good for newcomers and C++ beginners performance Slowness or potential optimization

Comments

@strager
Copy link
Collaborator

strager commented Jan 22, 2023

quick-lint-js/port/integer.cpp contains implementations for from_chars functions. These functions either use std::from_chars or std::strtol or a manual (slow) implementation.

We already use simdjson. We should try using simdjson's parse_unsigned and parse_integer functions instead. This might improve performance, but more importantly, it might reduce binary size.

@strager strager added good first issue Good for newcomers and C++ beginners for hire Get paid for working on this task: https://quick-lint-js.com/hiring.html performance Slowness or potential optimization labels Jan 22, 2023
@strager strager changed the title 5$: efactor: Reuse simdjson's number parsing in quick-lint-js 5$: refactor: Reuse simdjson's number parsing in quick-lint-js Jan 22, 2023
@TonyBhargav
Copy link

Hi @strager, I'd like to work on this issue, can you please assign this to me
Thank You

@strager
Copy link
Collaborator Author

strager commented Mar 19, 2023

@TonyBhargav Are you still working on this task?

@msharipov
Copy link
Contributor

@strager Hi, I would like to work on this issue but I could not find the from_chars function in the repo. Has this issue been resolved already?

@strager
Copy link
Collaborator Author

strager commented Feb 17, 2024

@msharipov from_chars is now called parse_integer_exact. Its implementation now lives in src/quick-lint-js/util/integer.cpp:

template <class Char, class Base, class T>
Parse_Integer_Exact_Error parse_integer_exact_generic(
std::basic_string_view<Char> s, T &value) {

@msharipov
Copy link
Contributor

@strager I claim this for-hire task. I expect payment after I complete this task. I will email the quick-lint-js team if I am assigned this task.

@msharipov
Copy link
Contributor

@strager So, I've looked into this issue and I don't see a way to refactor it in a manner that would improve the performance and reduce the size of the compiled binaries. There are two problems:

  1. simdjson does not seem to provide functions for parsing hexadecimal or w_char strings. Thus, it is still necessary to keep the original parse_integer_exact_generic code to call in those cases.
  2. Original implementation and simdjson disagree on what input strings are considered invalid. For example, one of the test cases requires that parsing "123 " returns Parse_Integer_Exact_Error::invalid while parse_integer parses it successfully and returns 123. In a different case, "0777" should be parsed as 777 but simdjson does not allow leading zeroes, so it returns an error. In order to reconcile these discrepancies I had to run additional checks on the input string before passing it to parse_integer which incurs such an overhead that refactoring this code probably hurts more than it helps.

Here's what I came up with:


template <class T>
Parse_Integer_Exact_Error parse_integer_exact(std::string_view s, T &value) {
  if (s.empty()) {
    return Parse_Integer_Exact_Error::invalid;
  }

  if constexpr (std::numeric_limits<T>::max() >
                std::numeric_limits<long long>::max()) {
    return parse_integer_exact_generic<char, Decimal<char>, T>(s, value);
  }

  const char *s_first_digit = &(s.front());
  const char *s_end = &(s.back()) + 1;
  bool negative = false;

  while (s_first_digit < s_end) {
    char c = *s_first_digit;
    if (c >= '1' && c <= '9') {
      break;
    } else if (c == '-') {
      if (negative) {
        return Parse_Integer_Exact_Error::invalid;
      } else {
        negative = true;
      }
    } else if (c == '0') {
      if (negative) {
        return Parse_Integer_Exact_Error::invalid;
      }
    } else {
      return Parse_Integer_Exact_Error::invalid;
    }
    s_first_digit++;
  }

  if (s_first_digit != s_end) {
    for (const char *i = s_first_digit; i < s_end; i++) {
      if (*i < '0' || *i > '9') {
        return Parse_Integer_Exact_Error::invalid;
      }
    }
  } else {
    value = static_cast<T>(0);
    return Parse_Integer_Exact_Error::ok;
  }

  const simdjson::simdjson_result<int64_t> parse_result =
      simdjson::fallback::numberparsing::parse_integer(
          reinterpret_cast<const uint8_t *>(
              static_cast<const char *>(s_first_digit - negative)),
          reinterpret_cast<const uint8_t *>(static_cast<const char *>(s_end)));

  switch (parse_result.error()) {
  case (simdjson::SUCCESS):
    break;
  case (simdjson::INCORRECT_TYPE):
    return Parse_Integer_Exact_Error::out_of_range;
  case (simdjson::NUMBER_ERROR):
    return Parse_Integer_Exact_Error::invalid;
  default:
    QLJS_UNREACHABLE();
  }

  int64_t parse_value = parse_result.value_unsafe();

  static constexpr T result_min = std::numeric_limits<T>::min();
  static constexpr T result_max = std::numeric_limits<T>::max();

  if (parse_value < result_min || parse_value > result_max) {
    return Parse_Integer_Exact_Error::out_of_range;
  }

  value = static_cast<T>(parse_value);
  return Parse_Integer_Exact_Error::ok;
}

Though I doubt that the new code is faster, I haven't run the benchmarks yet. Also, this results in the size of binaries increasing by about 7-10KB each:

< -rwxr-xr-x  1 msharipov msharipov 20064648 Feb 19 12:32 quick-lint-js*
< -rwxr-xr-x  1 msharipov msharipov  5264840 Feb 19 12:32 quick-lint-js-compile-translations*
< -rwxr-xr-x  1 msharipov msharipov  4582400 Feb 19 12:32 quick-lint-js-generate-diagnostic-metadata*
< -rwxr-xr-x  1 msharipov msharipov  4161904 Feb 19 12:32 quick-lint-js-generate-lex-keyword*
< -rwxr-xr-x  1 msharipov msharipov  5033328 Feb 19 12:32 quick-lint-js-generate-trace-sources*
---
> -rwxr-xr-x  1 msharipov msharipov 20071968 Feb 19 13:10 quick-lint-js*
> -rwxr-xr-x  1 msharipov msharipov  5273080 Feb 19 13:10 quick-lint-js-compile-translations*
> -rwxr-xr-x  1 msharipov msharipov  4591008 Feb 19 13:10 quick-lint-js-generate-diagnostic-metadata*
> -rwxr-xr-x  1 msharipov msharipov  4170144 Feb 19 13:10 quick-lint-js-generate-lex-keyword*
> -rwxr-xr-x  1 msharipov msharipov  5041560 Feb 19 13:10 quick-lint-js-generate-trace-sources*

@strager
Copy link
Collaborator Author

strager commented Feb 20, 2024

@msharipov Thanks for the research!

simdjson does not seem to provide functions for parsing hexadecimal or w_char strings.

Ah, good points.

We could work around the lack of wchar_t support by copying the wchar_t string into a char array. But that feels gross...

Original implementation and simdjson disagree on what input strings are considered invalid.

Yeah, these differences are dealbreakers. simdjson's parse_integer really is focused on JSON's syntax; it's not general-purpose.

Also, this results in the size of binaries increasing by about 7-10KB each:

(Note for posterity: We should compare the size of optimized, statically linked builds created by CI. You can create a pull request and download the files at https://c.quick-lint-js.com/builds/COMMITHASH/manual/. Because you are a new contributor, I'll need to approve the CI jobs.)


I think the right thing to do is to keep our implementation and not reuse simdjson's implementation. @msharipov I consider this task complete to my liking. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for hire Get paid for working on this task: https://quick-lint-js.com/hiring.html good first issue Good for newcomers and C++ beginners performance Slowness or potential optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants