Adjust the DOH string hash function
The one we're currently using only considers the last five characters plus the least significant bit of the last-but-sixth character, which unsurprisingly generates a lot of many-way collisions. This change seems to give about a 4% reduction in wallclock time for processing li_std_list_wrap.i from the testsuite for Python. The hash collision rate for this example drops from 39% to 0! Closes #2303
This commit is contained in:
parent
cd46d65beb
commit
eaaf893605
2 changed files with 25 additions and 6 deletions
|
|
@ -7,6 +7,17 @@ the issue number to the end of the URL: https://github.com/swig/swig/issues/
|
|||
Version 4.1.0 (in progress)
|
||||
===========================
|
||||
|
||||
2022-09-29: olly
|
||||
#2303 SWIG's internal hash tables now use a better hash function.
|
||||
|
||||
The old hash function only considerd the last five characters
|
||||
plus the least significant bit of the last-but-sixth character,
|
||||
which as you might guess generated a lot of many-way collisions.
|
||||
|
||||
This change seems to give about a 4% reduction in wallclock time
|
||||
for processing li_std_list_wrap.i from the testsuite for Python.
|
||||
The hash collision rate for this example drops from 39% to 0!
|
||||
|
||||
2022-09-29: wsfulton
|
||||
#2303 Type tables are now output in a fixed order whereas previously
|
||||
the order may change with any minor input code change. This shouldn't
|
||||
|
|
|
|||
|
|
@ -180,19 +180,27 @@ static int String_hash(DOH *so) {
|
|||
if (s->hashkey >= 0) {
|
||||
return s->hashkey;
|
||||
} else {
|
||||
char *c = s->str;
|
||||
/* We use the djb2 hash function: https://theartincode.stanis.me/008-djb2/
|
||||
*
|
||||
* One difference is we use initial seed 0. It seems the usual seed value
|
||||
* is intended to help spread out hash values, which is beneficial if
|
||||
* linear probing is used but DOH Hash uses a chain of buckets instead, and
|
||||
* grouped hash values are probably more cache friendly. In tests using
|
||||
* 0 seems slightly faster anyway.
|
||||
*/
|
||||
const char *c = s->str;
|
||||
unsigned int len = s->len > 50 ? 50 : s->len;
|
||||
unsigned int h = 0;
|
||||
unsigned int mlen = len >> 2;
|
||||
unsigned int i = mlen;
|
||||
for (; i; --i) {
|
||||
h = (h << 5) + *(c++);
|
||||
h = (h << 5) + *(c++);
|
||||
h = (h << 5) + *(c++);
|
||||
h = (h << 5) + *(c++);
|
||||
h = h + (h << 5) + *(c++);
|
||||
h = h + (h << 5) + *(c++);
|
||||
h = h + (h << 5) + *(c++);
|
||||
h = h + (h << 5) + *(c++);
|
||||
}
|
||||
for (i = len - (mlen << 2); i; --i) {
|
||||
h = (h << 5) + *(c++);
|
||||
h = h + (h << 5) + *(c++);
|
||||
}
|
||||
h &= 0x7fffffff;
|
||||
s->hashkey = (int)h;
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue