Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder ByteSequence constants #364

Closed

Conversation

jmwebservices
Copy link

Fix for #363

@@ -17,27 +17,27 @@
interface ByteSequence
{
/**
* UTF-8 BOM sequence.
* UTF-32 BE BOM sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the re-ordering is not the issue, the issue is in the implementation of bom_match the function should be updated by sorting the constant using their value length.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too thought relying on a definition order in the interface was flaky but it was the path of least resistance. And, the test would guard it from becoming out of order in the future.

@nyamsprod
Copy link
Member

@jmwebservices thanks for the PR. This is a good start but to resolve the issue, the bom_match function should be instead changed to something along these lines

<?php

    function bom_match(string $str): string
    {
        static $list;
        if (null === $list) {
            $list = (new ReflectionClass(ByteSequence::class))->getConstants();

            rsort($list, SORT_NATURAL | SORT_FLAG_CASE);
        }

        foreach ($list as $sequence) {
            if (0 === strpos($str, $sequence)) {
                return $sequence;
            }
        }
        
        return '';
    }

I have no tested the result but this is the way to go, Changing the order in the interface makes it part of a specific implementation and that defeat the purpose of an interface.

@jmwebservices
Copy link
Author

Would you be interested in the following?

  1. Removing the Reflection magic and unrolling the tests so that each BOM sequence is tested individually? I believe it would reduce the LOC and make it clearer.
  2. Operating on the first 4 bytes of $str since strpos will continue checking for the 1st occurrence through the end of $str.

@nyamsprod
Copy link
Member

@jmwebservices

1/ the reflection magic is done for maintenance and speed:

  • if we need to add more constants the bom_match code does not need any update.
  • a static local variable is used = the calculation is done only once, the first time the function is called, on subsequent calls nothing it done or executed since the value are already present and hopefully after your PR sorted 😉

So I would be against changing this part of the code.

2/ some tests should be conducted or some literature (with a link) should be provided to see if substr($long_str, 0, 10) does require or not the string full length to operate.

If the full length is still computed then I see no gain in doing so
If not yes please go ahead.. It's an optimization that can be handy for large CSV file IMHO

@jmwebservices
Copy link
Author

2/ some tests should be conducted or some literature (with a link) should be provided to see if substr($long_str, 0, 10) does require or not the string full length to operate.

If the full length is still computed then I see no gain in doing so
If not yes please go ahead.. It's an optimization that can be handy for large CSV file IMHO

I am getting crazy performance improvements. On my machine, using substr() to strpos() the first 4 bytes takes about 1% of the time than if the entire string is checked. Please verify for yourself.

https://gist.github.com/jmwebservices/3cd9cb273ec52f835a6dd13d62b2bc76

@jmwebservices
Copy link
Author

This optimization may be unccessary since a quick search over the code base indicates that bom_match() is only called once where the passed string is at most 4 bytes.

$this->input_bom = bom_match((string) $this->document->fread(4));

@nyamsprod
Copy link
Member

@jmwebservices are you gonna modify the function or not else I can take over your PR and merge the changes so I can release before the end of the year 🤔🤔😉

@jmwebservices
Copy link
Author

jmwebservices commented Dec 12, 2019

@jmwebservices are you gonna modify the function or not else I can take over your PR and merge the changes so I can release before the end of the year

I was waiting on your reply on my last comment about whether the optimization is needed at all. But yes, please go ahead and modify to address the bug. The end of year has me swamped.

Thanks for your contribution!

@nyamsprod
Copy link
Member

@jmwebservices sorry I thought by your last answer that you were not going to add the optimization since its useless in the CSV codebase 😢 . I'll still mention your contribution in the changelog don't worry about that 😉

@nyamsprod
Copy link
Member

this merge is closed and I'll be using #365 instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants