-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pr/multi record stitch optimisations #17
Pr/multi record stitch optimisations #17
Conversation
…he official package.
@far-blue The generateMatchHash() method is interesting -- the IdentityMap does something similar, but only with primary key values, and not by hashing per se but by concatenating with separators. https://github.com/atlasphp/Atlas.Mapper/blob/1.x/src/Identity/IdentityMap.php#L90 Would that method be suitable for this purpose and faster than hashing? If so, I might be able to expose the IdentityMap from the Mapper class so you can call it. |
Ah, I see that the primary-key specificity would not work, it has to be specific to the protected function generateMatchHash(Record $record, array $cols) : string
{
if (empty($cols)) {
return '';
}
$sep = "|\x1F"; // a pipe, and ASCII 31 ("unit separator")
$serial = $sep;
$row = $record->getRow();
foreach ($cols as $col) {
if ($this->ignoreCase) {
$col = strtolower($col);
}
$serial .= $row->$col . $sep;
}
return $serial;
} Thoughts? |
I have no issue with concatenation with separators - after all, array keys are internally hashed and bucketed anyhow for key-based referencing. I chose hashing so there were no chars in the original values that could be mistaken for the special separator value and so the tracking of which columns were numeric also wasn't mistaken for a column value. If you are comfortable with the ASCII 31 char as a safe separator then I'm fine with that :) In the example code above I notice firstly that you are lowercasing the col name, not the column value and secondly that there is no tracking of which columns are numeric. I also don't see the benefit of the protected function generateMatchHash(Record $record, array $cols) : string
{
$serial = '';
$sep = "\x1F"; // ASCII 31 ("unit separator")
if (empty($cols)) {
return $serial;
}
$row = $record->getRow();
foreach ($cols as $col) {
$value = $this->ignoreCase ? strtolower($row->$col) : $row->$col;
$serial .= $value . $sep;
if (is_numeric($value)) {
$serial .= $sep;
}
}
return $serial;
} Here, the lowercasing is on the value, not the col name (mimicking the behaviour of One possible downside of not hashing is going to be edge case memory use: while highly unlikely, it is possible that the values of the columns used here could be very large. Why someone would use, for example, a base64 encoded string of a jpeg as a foreign key value I really don't know but it is theoretically possible! When used as an array key the value will be hashed, as I mentioned, but the original is also kept. |
Well, I have been so far -- it's been fine for the Identity Map work as far as I can tell.
Yeah, I noticed the lowercasing right after I sent the message and fixed it -- good catch on your part, regardless! As far as "which columns are numeric" -- since they're getting converted to strings anyway, does it matter? The parts will always be in the same order as $on, so there should be no chance of matching one part against another by accident -- but I may be missing something. On ASCII 31 plus the | character, the pipe is there so that when you echo the string (for whatever reason) you can actually see the separator; ASCII 31 won't actually display.
Noted. If we run into that here, I'd guess we would see it at Identity Map as well (since $on tends to be primary/foreign keys in the first place). Unless you have objections (or need to test further) then I'll accept this PR and then modify it to use the concatenation approach. That sound OK? |
I only worried about the numeric vs string comparison because that's what the original code does. The original code relies on RegularRelationship's // cannot match if one is numeric and other is not
if (is_numeric($nativeVal) && ! is_numeric($foreignVal)) {
return false;
} If in this situation you don't feel this particular behaviour is important then I have no issue with not bothering :) Yeah, feel free to accept and then modify the PR as you wish. The core concept of calculating a matching hash once per record is the bit that delivers the performance improvement :) |
This PR addresses some performance issues with large many-to-many datasets.
As an example:
We have 3 tables - ports, destinations and a ports-destinations many-to-many linking table. We have ~ 4500 ports and ~ 400 destinations and ports are linked to an average of 1.2 destinations (~ 80% of ports are only in 1 destination).
The existing code takes approx. 40 seconds to return a complete json structure based on this data. Profiling showed that approx. 1.2 million calls to RegularRelationship.php's
recordsMatch
method were being made via OneToMany.php'sstitchIntoRecords
method and then an additional 1.4 million calls to RegularRelationship.php'srecordsMatch
method were being made via OneToOne.php'sstitchIntoRecords
method.The reason for this is that for each port object the code loops through every port-destination linking object and attempts to match it. Being a one-to-many call we can't exit early so every relation needs to be checked. Once every linking object has been found there is a further loop to find all the records in the destinations table that relate to the linking record using OneToOne relations.
Each call to recordsMatch is pretty quick but 3 million calls to any class method in php isn't going to be the fastest, esp. when
__get
magic method calls are involved. The real problem is that the same records are being repeatedly matched in a 2-level foreach loop and there's no caching of the values being used to do the matching so they are being recalculated millions of times.This PR addresses these issues by replacing the
stitchIntoRecords
methods for the OneToMany and OneToOne relationship classes. These new methods first make a single pass through the foreign records to calculate a hash of the values that record would previously have been matched on in therecordsMatch
call. The foreign records are then keyed on this hash in an array. Next, a single pass through the native records generates the same hash and then checks in the hash-keyed foreign record array for matches. Where there's a match the foreign record(s) are matched up to the native record just like they would have been in the original code.The result of the new methods is a drop in our testing from 40 seconds to 5 seconds for the described data sets.
All thoughts and comments appreciated to improve this PR!