Skip most long matches in lazy hash table update #2755

Merged · 4 commits · Sep 29, 2021
lib/compress/zstd_lazy.c (45 additions, 15 deletions)
@@ -971,7 +971,7 @@ FORCE_INLINE_TEMPLATE void ZSTD_row_prefetch(U32 const* hashTable, U16 const* ta
  * Fill up the hash cache starting at idx, prefetching up to ZSTD_ROW_HASH_CACHE_SIZE entries,
  * but not beyond iLimit.
  */
-static void ZSTD_row_fillHashCache(ZSTD_matchState_t* ms, const BYTE* base,
+FORCE_INLINE_TEMPLATE void ZSTD_row_fillHashCache(ZSTD_matchState_t* ms, const BYTE* base,
                                    U32 const rowLog, U32 const mls,
                                    U32 idx, const BYTE* const iLimit)
 {
@@ -1011,35 +1011,65 @@ FORCE_INLINE_TEMPLATE U32 ZSTD_row_nextCachedHash(U32* cache, U32 const* hashTab
     }
 }

-/* ZSTD_row_update_internal():
- * Inserts the byte at ip into the appropriate position in the hash table.
- * Determines the relative row, and the position within the {16, 32} entry row to insert at.
+/* ZSTD_row_update_internalImpl():
+ * Updates the hash table with positions starting from updateStartIdx until updateEndIdx.
  */
-FORCE_INLINE_TEMPLATE void ZSTD_row_update_internal(ZSTD_matchState_t* ms, const BYTE* ip,
-                                                    U32 const mls, U32 const rowLog,
-                                                    U32 const rowMask, U32 const useCache)
+FORCE_INLINE_TEMPLATE void ZSTD_row_update_internalImpl(ZSTD_matchState_t* ms,
+                                                        U32 updateStartIdx, U32 const updateEndIdx,
+                                                        U32 const mls, U32 const rowLog,
+                                                        U32 const rowMask, U32 const useCache)
 {
     U32* const hashTable = ms->hashTable;
     U16* const tagTable = ms->tagTable;
     U32 const hashLog = ms->rowHashLog;
     const BYTE* const base = ms->window.base;
-    const U32 target = (U32)(ip - base);
-    U32 idx = ms->nextToUpdate;

-    DEBUGLOG(6, "ZSTD_row_update_internal(): nextToUpdate=%u, current=%u", idx, target);
-    for (; idx < target; ++idx) {
-        U32 const hash = useCache ? ZSTD_row_nextCachedHash(ms->hashCache, hashTable, tagTable, base, idx, hashLog, rowLog, mls)
-                                  : (U32)ZSTD_hashPtr(base + idx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls);
+    DEBUGLOG(6, "ZSTD_row_update_internalImpl(): updateStartIdx=%u, updateEndIdx=%u", updateStartIdx, updateEndIdx);
+    for (; updateStartIdx < updateEndIdx; ++updateStartIdx) {
+        U32 const hash = useCache ? ZSTD_row_nextCachedHash(ms->hashCache, hashTable, tagTable, base, updateStartIdx, hashLog, rowLog, mls)
+                                  : (U32)ZSTD_hashPtr(base + updateStartIdx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls);
         U32 const relRow = (hash >> ZSTD_ROW_HASH_TAG_BITS) << rowLog;
         U32* const row = hashTable + relRow;
         BYTE* tagRow = (BYTE*)(tagTable + relRow);  /* Though tagTable is laid out as a table of U16, each tag is only 1 byte.
                                                        Explicit cast allows us to get exact desired position within each row */
         U32 const pos = ZSTD_row_nextIndex(tagRow, rowMask);

-        assert(hash == ZSTD_hashPtr(base + idx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls));
+        assert(hash == ZSTD_hashPtr(base + updateStartIdx, hashLog + ZSTD_ROW_HASH_TAG_BITS, mls));
         ((BYTE*)tagRow)[pos + ZSTD_ROW_HASH_TAG_OFFSET] = hash & ZSTD_ROW_HASH_TAG_MASK;
-        row[pos] = idx;
+        row[pos] = updateStartIdx;
     }
 }

+/* ZSTD_row_update_internal():
+ * Inserts the byte at ip into the appropriate position in the hash table, and updates ms->nextToUpdate.
+ * Skips sections of long matches as is necessary.
+ */
+FORCE_INLINE_TEMPLATE void ZSTD_row_update_internal(ZSTD_matchState_t* ms, const BYTE* ip,
+                                                    U32 const mls, U32 const rowLog,
+                                                    U32 const rowMask, U32 const useCache)
+{
+    U32 idx = ms->nextToUpdate;
+    const BYTE* const base = ms->window.base;
+    const U32 target = (U32)(ip - base);
+    const U32 kSkipThreshold = 384;
+    const U32 kMaxMatchStartPositionsToUpdate = 96;
+    const U32 kMaxMatchEndPositionsToUpdate = 32;
+
+    if (useCache) {
+        /* Only skip positions when using hash cache, i.e.
+         * if we are loading a dict, don't skip anything.
+         * If we decide to skip, then we only update a set number
+         * of positions at the beginning and end of the match.
+         */
+        if (UNLIKELY(target - idx > kSkipThreshold)) {
+            U32 const bound = idx + kMaxMatchStartPositionsToUpdate;
+            ZSTD_row_update_internalImpl(ms, idx, bound, mls, rowLog, rowMask, useCache);
+            idx = target - kMaxMatchEndPositionsToUpdate;
+            ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
Cyan4973 (Contributor) commented on Aug 26, 2021:

My understanding of this line is limited.
It looks to me like it is both caching hash results and prefetching match positions from the corresponding rows.
However:

  • These rows are, at this stage, not yet updated. I presume that it's the purpose of ZSTD_row_update_internal() to do so. And there are many positions yet to add (kMaxPositionsToUpdate at this point). So is the prefetching helpful? Are these lines still in L1 cache after all these updates?
  • Why is there no equivalent need for smaller matches, whose length is < kMaxPositionsToUpdate?
  • (More generally: what's the saving from storing the hash values into the hashCache?)

Depending on the level of complexity, we may need to take this discussion offline.

Contributor reply:

The hash cache is the hashes of the next 8 positions. It is used to prefetch the hash table rows a few positions in advance. ZSTD_row_update_internal() requires the hash cache to be correctly filled: it consumes hashCache[(ip - base) & 7] to get the hash of ip - base, and re-fills that slot with the hash of (ip + 8 - base).

Normally we process every position, so we only need to fill the cache at the beginning of the block. But now that we are skipping positions, we need to re-fill it whenever we skip.

> (More generally: what's the saving from storing the hash values into the hashCache?)

We need to compute the hash ahead of time in order to prefetch. We've measured both re-computing the hash when we need it and keeping it cached in the hash cache; the hash cache out-performed re-computing the hash.

However, now that we are skipping positions, that calculus changes a little. But skipping is still rare, so I wouldn't expect a big difference.

Cyan4973 (Contributor) replied:

Yeah, this makes sense.
I now realize that I had confused idx with target in my initial reading.
Prefetching the next 8 positions starting from idx is still useful, since they are going to be used in the loop just after.

+        }
+    }
+    assert(target >= idx);
+    ZSTD_row_update_internalImpl(ms, idx, target, mls, rowLog, rowMask, useCache);
+    ms->nextToUpdate = target;
+}
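
To make the effect of the new skipping policy concrete, here is a small standalone sketch. It is not part of the PR: traceUpdate and updateRange are hypothetical stand-ins that only trace which index ranges would be inserted into the hash table, using the constants from the diff above.

#include <stdio.h>

typedef unsigned int U32;

/* Hypothetical stand-in: the real code inserts each position of the range into the row hash table. */
static void updateRange(U32 start, U32 end)
{
    printf("insert positions [%u, %u)\n", start, end);
}

/* Mirrors the skipping policy of ZSTD_row_update_internal() above:
 * small gaps are updated in full; for large gaps only the first 96
 * and the last 32 positions are inserted. */
static void traceUpdate(U32 nextToUpdate, U32 target)
{
    const U32 kSkipThreshold = 384;
    const U32 kMaxMatchStartPositionsToUpdate = 96;
    const U32 kMaxMatchEndPositionsToUpdate = 32;
    U32 idx = nextToUpdate;

    if (target - idx > kSkipThreshold) {
        updateRange(idx, idx + kMaxMatchStartPositionsToUpdate);
        idx = target - kMaxMatchEndPositionsToUpdate;
        /* the real code re-fills the hash cache here before resuming */
    }
    updateRange(idx, target);
}

int main(void)
{
    traceUpdate(1000, 1200);  /* gap of 200 <= 384: all 200 positions inserted */
    traceUpdate(1000, 2000);  /* gap of 1000 > 384: only 96 + 32 positions inserted */
    return 0;
}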

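The review thread above describes the hash cache as a small ring buffer holding the hashes of the next 8 positions: the slot for the current position is consumed, then re-filled with the hash of position + 8 so the corresponding row can be prefetched ahead of time. Below is a rough sketch of that consume-and-refill pattern under stated assumptions: hashAt, prefetchRow, fillHashCache and nextCachedHash are hypothetical stand-ins, not the actual ZSTD_row_fillHashCache / ZSTD_row_nextCachedHash implementations.

typedef unsigned int U32;
enum { kCacheSize = 8 };  /* stand-in for ZSTD_ROW_HASH_CACHE_SIZE */

/* Toy hash and prefetch for this sketch only. */
static U32 hashAt(const unsigned char* base, U32 pos) { (void)base; return pos * 2654435761u; }
static void prefetchRow(U32 hash) { (void)hash; /* e.g. __builtin_prefetch() of the row this hash selects */ }

/* Fill the cache with the hashes of positions [idx, idx + 8) and prefetch their rows. */
static void fillHashCache(U32* cache, const unsigned char* base, U32 idx)
{
    U32 i;
    for (i = 0; i < kCacheSize; ++i) {
        U32 const hash = hashAt(base, idx + i);
        prefetchRow(hash);
        cache[(idx + i) & (kCacheSize - 1)] = hash;
    }
}

/* Return the cached hash for position idx, then replace that slot with the
 * hash of position idx + 8 and prefetch its row, so the cache always covers
 * the next 8 positions. */
static U32 nextCachedHash(U32* cache, const unsigned char* base, U32 idx)
{
    U32 const newHash = hashAt(base, idx + kCacheSize);
    U32 const hash    = cache[idx & (kCacheSize - 1)];
    prefetchRow(newHash);
    cache[idx & (kCacheSize - 1)] = newHash;
    return hash;
}

int main(void)
{
    unsigned char buf[64] = { 0 };
    U32 cache[kCacheSize];
    U32 idx = 0;
    fillHashCache(cache, buf, idx);          /* done once per block, or again after a skip */
    for (; idx < 16; ++idx) {
        U32 const h = nextCachedHash(cache, buf, idx);
        (void)h;  /* in the real code, h selects the row to search / insert into */
    }
    return 0;
}

This is also why the diff calls ZSTD_row_fillHashCache() again after skipping: once positions are skipped, the cached slots no longer correspond to the positions about to be processed, so the cache has to be re-filled at the resume point.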