Adjust the DOH string hash function

The one we're currently using only considers the last five characters plus the least significant bit of the last-but-sixth character, which unsurprisingly generates a lot of many-way collisions.

This change seems to give about a 4% reduction in wallclock time for processing li_std_list_wrap.i from the testsuite for Python. The hash collision rate for this example drops from 39% to 0!

Closes #2303
2022-07-07 11:50:00 +12:00
commit eaaf893605
parent cd46d65beb
2 changed files with 25 additions and 6 deletions
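
The collision behaviour described in the commit message can be reproduced with a minimal sketch. The names old_hash and new_hash are hypothetical and not part of SWIG; the sketch assumes a 32-bit hash value (uint32_t), uses a plain loop instead of the unrolled one in String_hash, and casts each character to unsigned char, which matches the source for ASCII input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the pre-change update step, h = h * 32 + c.
 * Each step moves earlier characters 5 bits further left, so after
 * seven more steps a character has been multiplied by 2^35, which is
 * 0 modulo 2^32: it no longer influences the result at all. */
static uint32_t old_hash(const char *s) {
  size_t len = strlen(s);
  uint32_t h = 0;
  size_t i;
  if (len > 50) len = 50;              /* same 50-character cap as the source */
  for (i = 0; i < len; ++i) {
    h = (h << 5) + (unsigned char)s[i];
  }
  return h & 0x7fffffff;               /* keep the key non-negative as an int */
}

/* Hypothetical helper: the post-change update step, h = h * 33 + c
 * (djb2 with seed 0).  33 is odd, so multiplying by it never discards
 * the bits contributed by earlier characters. */
static uint32_t new_hash(const char *s) {
  size_t len = strlen(s);
  uint32_t h = 0;
  size_t i;
  if (len > 50) len = 50;
  for (i = 0; i < len; ++i) {
    h = h + (h << 5) + (unsigned char)s[i];
  }
  return h & 0x7fffffff;
}
```

For example, "get_int_wrapper" and "set_int_wrapper" share their final seven characters, so old_hash returns the same value for both, while new_hash tells them apart.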
--- a/CHANGES.current
+++ b/CHANGES.current
@@ -7,6 +7,17 @@ the issue number to the end of the URL: https://github.com/swig/swig/issues/
 Version 4.1.0 (in progress)
 ===========================

+2022-09-29: olly
+	    #2303 SWIG's internal hash tables now use a better hash function.
+
+	    The old hash function only considered the last five characters
+	    plus the least significant bit of the last-but-sixth character,
+	    which as you might guess generated a lot of many-way collisions.
+
+	    This change seems to give about a 4% reduction in wallclock time
+	    for processing li_std_list_wrap.i from the testsuite for Python.
+	    The hash collision rate for this example drops from 39% to 0!
+
 2022-09-29: wsfulton
            #2303 Type tables are now output in a fixed order whereas previously
            the order may change with any minor input code change. This shouldn't
--- a/Source/DOH/string.c
+++ b/Source/DOH/string.c
@@ -180,19 +180,27 @@ static int String_hash(DOH *so) {
  if (s->hashkey >= 0) {
    return s->hashkey;
  } else {
-    char *c = s->str;
+    /* We use the djb2 hash function: https://theartincode.stanis.me/008-djb2/
+     *
+     * One difference is we use initial seed 0.  It seems the usual seed value
+     * is intended to help spread out hash values, which is beneficial if
+     * linear probing is used but DOH Hash uses a chain of buckets instead, and
+     * grouped hash values are probably more cache friendly.  In tests using
+     * 0 seems slightly faster anyway.
+     */
+    const char *c = s->str;
    unsigned int len = s->len > 50 ? 50 : s->len;
    unsigned int h = 0;
    unsigned int mlen = len >> 2;
    unsigned int i = mlen;
    for (; i; --i) {
-      h = (h << 5) + *(c++);
-      h = (h << 5) + *(c++);
-      h = (h << 5) + *(c++);
-      h = (h << 5) + *(c++);
+      h = h + (h << 5) + *(c++);
+      h = h + (h << 5) + *(c++);
+      h = h + (h << 5) + *(c++);
+      h = h + (h << 5) + *(c++);
    }
    for (i = len - (mlen << 2); i; --i) {
-      h = (h << 5) + *(c++);
+      h = h + (h << 5) + *(c++);
    }
    h &= 0x7fffffff;
    s->hashkey = (int)h;