
Pr/multi record stitch optimisations #17

Merged · 6 commits · May 10, 2021

Conversation

far-blue (Contributor)

This PR addresses some performance issues with large many-to-many datasets.

As an example:

We have three tables: ports, destinations, and a ports-destinations many-to-many linking table. There are ~4500 ports and ~400 destinations, and each port is linked to an average of 1.2 destinations (~80% of ports are in only one destination).

The existing code takes approx. 40 seconds to return a complete JSON structure for this data. Profiling showed approx. 1.2 million calls to RegularRelationship's recordsMatch() method made via OneToMany's stitchIntoRecords() method, plus an additional 1.4 million recordsMatch() calls made via OneToOne's stitchIntoRecords().

The reason is that, for each port object, the code loops through every port-destination linking object and attempts to match it. Because this is a one-to-many relationship the loop cannot exit early, so every linking object must be checked. Once all the linking objects have been found, a further loop finds all the records in the destinations table that relate to each linking record via OneToOne relations.

Each call to recordsMatch() is quick in itself, but ~3 million calls to any class method in PHP will never be fast, especially when __get() magic method calls are involved. The real problem is that the same records are repeatedly matched in a two-level foreach loop, and there is no caching of the values used for matching, so they are recalculated millions of times.

This PR addresses these issues by replacing the stitchIntoRecords() methods of the OneToMany and OneToOne relationship classes. The new methods first make a single pass through the foreign records, calculating for each a hash of the values that record would previously have been matched on in recordsMatch(). The foreign records are then keyed by this hash in an array. Next, a single pass through the native records generates the same hash for each record and looks it up in the hash-keyed foreign array. Where there is a match, the foreign record(s) are attached to the native record exactly as they would have been in the original code.

In our testing, the new methods drop the run time from 40 seconds to 5 seconds for the data set described above.
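To make the idea concrete, here is a hypothetical standalone sketch of the single-pass, hash-bucketed stitching described above. It is deliberately simplified (plain arrays instead of Atlas Record objects, a single join column instead of the `$on` key map, and an invented `stitch()` function name), but the structure mirrors the approach: one pass to bucket the foreign rows by hash, one pass to look each native row up.

```php
<?php
// Illustrative sketch only -- not the actual Atlas.Mapper implementation.
// Replaces an O(native x foreign) nested matching loop with two linear passes.
function stitch(array $nativeRows, array $foreignRows, string $nativeCol, string $foreignCol): array
{
    $sep = "\x1F"; // ASCII 31 ("unit separator")

    // Pass 1: bucket every foreign row under its match hash, computed once.
    $buckets = [];
    foreach ($foreignRows as $foreign) {
        $hash = $foreign[$foreignCol] . $sep;
        $buckets[$hash][] = $foreign;
    }

    // Pass 2: compute the same hash for each native row and look it up.
    $stitched = [];
    foreach ($nativeRows as $native) {
        $hash = $native[$nativeCol] . $sep;
        $native['related'] = $buckets[$hash] ?? [];
        $stitched[] = $native;
    }

    return $stitched;
}

// Usage with the ports/linking-table shape from the example above:
$ports = [['id' => 1, 'name' => 'Dover']];
$links = [['port_id' => 1, 'dest_id' => 9], ['port_id' => 1, 'dest_id' => 12]];
$result = stitch($ports, $links, 'id', 'port_id');
// $result[0]['related'] now holds both linking rows for port 1.
```

Each row's match value is computed exactly once, so total work is proportional to the number of native rows plus foreign rows rather than their product.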

All thoughts and comments to improve this PR are appreciated!

pmjones (Contributor) commented Mar 31, 2021

@far-blue The generateMatchHash() method is interesting -- the IdentityMap does something similar, but only with primary key values, and not by hashing per se but by concatenating with separators.

https://github.com/atlasphp/Atlas.Mapper/blob/1.x/src/Identity/IdentityMap.php#L90

Would that method be suitable for this purpose and faster than hashing? If so, I might be able to expose the IdentityMap from the Mapper class so you can call it.

pmjones (Contributor) commented Apr 1, 2021

Ah, I see that the primary-key specificity would not work; it has to be specific to the $on relationship keys. Even so, the serialization approach may yet be quicker in this scenario. For example:

    protected function generateMatchHash(Record $record, array $cols) : string
    {
        if (empty($cols)) {
            return '';
        }

        $sep = "|\x1F"; // a pipe, and ASCII 31 ("unit separator")
        $serial = $sep;
        $row = $record->getRow();

        foreach ($cols as $col) {
            if ($this->ignoreCase) {
                $col = strtolower($col);
            }

            $serial .= $row->$col . $sep;
        }

        return $serial;
    }

Thoughts?

far-blue (Contributor, Author) commented Apr 3, 2021

I have no issue with concatenation with separators; after all, array keys are internally hashed and bucketed anyhow for key-based referencing. I chose hashing so that no characters in the original values could be mistaken for the special separator value, and so that the tracking of which columns are numeric wasn't mistaken for a column value. If you are comfortable with the ASCII 31 character as a safe separator then I'm fine with that :)

In the example code above I notice firstly that you are lowercasing the column name, not the column value, and secondly that there is no tracking of which columns are numeric. I also don't see the benefit of the | character as part of the separator. How about this version:

protected function generateMatchHash(Record $record, array $cols) : string
{
    $serial = '';
    $sep = "\x1F"; // ASCII 31 ("unit separator")

    if (empty($cols)) {
        return $serial;
    }

    $row = $record->getRow();

    foreach ($cols as $col) {
        $value = $this->ignoreCase ? strtolower($row->$col) : $row->$col;
        $serial .= $value . $sep;
        if (is_numeric($value)) {
            $serial .= $sep;
        }
    }

    return $serial;
}

Here, the lowercasing is on the value, not the col name (mimicking the behaviour of valuesMatch()) and a double separator is used to mark the value as numeric to make sure a numeric and a string don't match.
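To illustrate that last point, here is a hypothetical standalone helper (the `matchSerial()` name and signature are invented for this sketch, not part of the PR) showing what the double-separator encoding produces:

```php
<?php
// Illustrative sketch of the double-separator encoding discussed above.
// A numeric value gets a doubled separator, so numeric and non-numeric
// values can never produce the same serial.
function matchSerial(array $values, bool $ignoreCase = false): string
{
    $sep = "\x1F"; // ASCII 31 ("unit separator")
    $serial = '';

    foreach ($values as $value) {
        $value = (string) $value;
        if ($ignoreCase) {
            $value = strtolower($value);
        }
        $serial .= $value . $sep;
        if (is_numeric($value)) {
            $serial .= $sep; // numeric marker
        }
    }

    return $serial;
}

// The integer 7 and the string '7' are both numeric, so they serialize
// identically, mirroring valuesMatch()'s loose comparison of numerics.
var_dump(matchSerial([7]) === matchSerial(['7']));   // bool(true)
// A numeric and a non-numeric value serialize differently.
var_dump(matchSerial(['7']) === matchSerial(['7x'])); // bool(false)
```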

One possible downside of not hashing is edge-case memory use: while highly unlikely, it is possible that the values of the columns used here could be very large. Why someone would use, for example, a base64-encoded JPEG string as a foreign key value I really don't know, but it is theoretically possible! When used as an array key the value will be hashed, as I mentioned, but the original is also kept.

pmjones (Contributor) commented Apr 3, 2021

If you are comfortable with the ASCII 31 char as a safe separator then I'm fine with that

Well, I have been so far -- it's been fine for the Identity Map work as far as I can tell.

In the example code above I notice firstly that you are lowercasing the col name, not the column value and secondly that there is no tracking of which columns are numeric. I also don't see the benefit of the | char as part of the separator. How about this version:

Yeah, I noticed the lowercasing right after I sent the message and fixed it -- good catch on your part, regardless!

As far as "which columns are numeric" -- since they're getting converted to strings anyway, does it matter? The parts will always be in the same order as $on, so there should be no chance of matching one part against another by accident -- but I may be missing something.

On ASCII 31 plus the | character, the pipe is there so that when you echo the string (for whatever reason) you can actually see the separator; ASCII 31 won't actually display.

One possible downside of not hashing is going to be edge case memory use

Noted. If we run into that here, I'd guess we would see it at Identity Map as well (since $on tends to be primary/foreign keys in the first place).

Unless you have objections (or need to test further), I'll accept this PR and then modify it to use the concatenation approach. Does that sound OK?

far-blue (Contributor, Author) commented Apr 3, 2021

As far as "which columns are numeric" -- since they're getting converted to strings anyway, does it matter? The parts will always be in the same order as $on, so there should be no chance of matching one part against another by accident -- but I may be missing something.

I only worried about the numeric-vs-string comparison because that's what the original code does. The original code relies on RegularRelationship's valuesMatch() method, which includes this:

        // cannot match if one is numeric and other is not
        if (is_numeric($nativeVal) && ! is_numeric($foreignVal)) {
            return false;
        }

If in this situation you don't feel this particular behaviour is important then I have no issue with not bothering :)

Yeah, feel free to accept and then modify the PR as you wish. The core concept of calculating a matching hash once per record is the bit that delivers the performance improvement :)

@pmjones pmjones merged commit 9187e95 into atlasphp:1.x May 10, 2021