Hash table lookup is very slow - though only ever iterating through one entry

Question

So I'm writing a hobby programming language interpreter in C. It works but has serious performance problems.

According to gprof, the following function takes 30% of the running time (after excluding the time spent in the profiler itself). 75% of the 30% is spent only inside the function itself, not including its called children.

bool table_get_value_directly(Table* table, Value key, Value* out) { if (table->capacity == 0) { return false; } unsigned long hash; if (!value_hash(&key, &hash)) { FAIL("Couldn't hash"); } int slot = hash % table->capacity; Node* root_node = table->entries[slot]; Node* node = root_node; while (node != NULL) { if (keys_equal(node->key, key)) { *out = node->value; return true; } node = node->next; } return false; }

The function doesn't seem to do anything so slow that it would take up that much of the running time.

Important to note: the while loop in the function never does more than one iteration, because a table in this test case (please see test case description below) never has more than one entry. So it can't be the loop I guess (?).

There's also the option that it's just called a lot, which it is. But according to gprof, the function's average running time does seem to be considerably high compared to all other functions in the system. Please consider self ns/call column of gprof output:

 % cumulative self self total time seconds seconds calls ns/call ns/call name 33.40 24.94 24.94 _mcount_private 22.80 41.96 17.02 __fentry__ 11.16 50.29 8.33 vm_interpret_frame 10.41 58.06 7.77 269775125 28.80 37.77 table_get_value_directly 3.67 60.80 2.74 get_string_from_cache 1.89 62.21 1.41 125659994 11.22 53.29 table_get_cstring_key 1.58 63.39 1.18 266250220 4.43 4.43 object_thread_push_eval_stack 1.38 64.42 1.03 321615299 3.20 4.48 value_compare 1.23 65.34 0.92 266250219 3.46 3.46 object_thread_pop_eval_stack 1.08 66.15 0.81 57173998 14.17 16.91 table_iterate 1.03 66.92 0.77 18455049 41.72 50.03 table_set_value_directly 0.84 67.55 0.63 269775227 2.34 4.82 value_hash 0.74 68.10 0.55 new_stack_frame.constprop.0 0.71 68.63 0.53 107205032 4.94 61.41 cell_table_get_value_cstring_key 0.71 69.16 0.53 18454948 28.72 156.39 cell_table_set_value_cstring_key 0.66 69.65 0.49 144115145 3.40 3.40 hash_string_bounded 0.62 70.11 0.46 144115059 3.19 40.96 table_get 0.51 70.49 0.38 gc_mark_table 0.44 70.82 0.33 181025001 1.82 1.82 cstrings_equal 0.42 71.13 0.31 144114987 2.15 2.15 object_string_copy 0.42 71.44 0.31 144114946 2.15 2.15 object_string_copy_from_null_terminated 0.36 71.71 0.27 107205046 2.52 56.46 cell_table_get_cell_cstring_key 0.36 71.98 0.27 36910241 7.32 28.69 table_free 0.36 72.25 0.27 18454950 14.63 68.96 table_set_cstring_key 0.32 72.49 0.24 load_extension_module 0.24 72.67 0.18 125660109 1.43 1.43 object_hash 0.24 72.85 0.18 18454935 9.75 9.75 vm_call_function 0.23 73.02 0.17 57174015 2.97 3.57 pointer_array_free 0.19 73.16 0.14 92275417 1.52 1.52 table_init 0.19 73.30 0.14 collect_values 0.16 73.42 0.12 18454935 6.50 16.26 vm_call_object 0.13 73.52 0.10 167904869 0.60 0.60 deallocate 0.13 73.62 0.10 18455019 5.42 5.42 object_cell_new 0.12 73.71 0.09 25492005 3.53 4.59 pointer_array_write

So what we have is a function that doesn't look like it does much (again, the loop only ever iterates once), but we still spend a very long time in.

So I would like to hear from those who are good at optimization and might have a clue what this function might do that may cause it to run so slow.

EDIT:

After reading the remarks by @vnp and other comments, adding a few things here.

A. Stats on hash tables in the program

I measured the stats. Please note, table->capacity is the number of root nodes being currently allowed in the base array of the table.

Total times function is called: 18455028 Avg capacity: 8 Max capacity: 32 Avg bucket count: 1 Avg entries count: 1

B. The test case

@vnp mentioned that "table_get_value_directly is called quarter of trillion times [corrected to quarter of a billion]. Either you have a seriously large test case, or the calling code does something seriously wrong".

The test case is the following: I run a recursive fibonacci(35) function in the interpreter programming langauge. This means the fibonacci function should be called 18454929 times.

The code for it is:

fib = { takes n to if n < 0 { print("Incorrect input") } elsif n == 1 { return 0 } elsif n == 2 { return 1 } else { return fib(n - 1) + fib(n - 2) } }

For each of these iterations, it seems we should access the table for the local variable n between 1 and 5 times. That would bring us to approximately 1/8 of a billion... I'm not really sure where the other accesses come from (I will check), but would you say it still looks extremely problematic, or reasonable for this type of test case?

C. The entire hash table code and related files

table.h

#ifndef plane_table_h #define plane_table_h #include "common.h" #include "value.h" #include "pointerarray.h" typedef struct { Value key; Value value; } EntryNew; typedef struct Node { struct Node* next; Value key; Value value; } Node; typedef struct { int capacity; int bucket_count; Node** entries; bool is_memory_infrastructure; bool is_growing; // for debugging size_t num_entries; } Table; struct ObjectString; Table table_new_empty(void); void table_init(Table* table); void table_init_memory_infrastructure(Table* table); void table_set(Table* table, struct Value key, Value value); bool table_get(Table* table, struct Value key, Value* out); void table_set_value_directly(Table* table, struct Value key, Value value); bool table_get_value_directly(Table* table, Value key, Value* out); void table_set_cstring_key(Table* table, const char* key, Value value); bool table_get_cstring_key(Table* table, const char* key, Value* out); bool table_delete(Table* table, Value key); void table_free(Table* table); PointerArray table_iterate(Table* table, const char* alloc_string); void table_print_debug_as_buckets(Table* table, bool show_empty_buckets); void table_debug_print_general_stats(void); #endif

table.c

#include <string.h> #include "table.h" #include "plane_object.h" #include "memory.h" #include "common.h" #include "value.h" /* For debugging */ static size_t times_called = 0; static size_t capacity_sum = 0; static size_t max_capacity = 0; static size_t bucket_sum = 0; static size_t avg_bucket_count = 0; static size_t entries_sum = 0; static size_t avg_entries_sum = 0; static double avg_capacity = 0; /* ..... */ void table_debug_print_general_stats(void) { printf("Times called: %" PRI_SIZET "\n", times_called); printf("Sum capacity: %" PRI_SIZET "\n", capacity_sum); printf("Avg capacity: %g\n", avg_capacity); printf("Max capacity: %" PRI_SIZET "\n", max_capacity); printf("Sum buckets: %" PRI_SIZET "\n", bucket_sum); printf("Avg bucket count: %" PRI_SIZET "\n", avg_bucket_count); printf("Entries sum: %" PRI_SIZET "\n", entries_sum); printf("Avg entries: %" PRI_SIZET "\n", avg_entries_sum); } #define MAX_LOAD_FACTOR 0.75 static void* allocate_suitably(Table* table, size_t size, const char* what) { if (!table->is_memory_infrastructure) { return allocate(size, what); } return allocate_no_tracking(size); } static void deallocate_suitably(Table* table, void* pointer, size_t size, const char* what) { if (!table->is_memory_infrastructure) { deallocate(pointer, size, what); return; } deallocate_no_tracking(pointer); } static bool keys_equal(Value v1, Value v2) { int compare_result = -1; bool compare_success = value_compare(v1, v2, &compare_result); return compare_success && (compare_result == 0); } static void grow_table(Table* table) { DEBUG_MEMORY("Growing table."); if (table->is_growing) { FAIL("grow_table() called while table is already growing."); // For current debugging, remove later } table->is_growing = true; int old_capacity = table->capacity; Node** old_entries = table->entries; table->capacity = GROW_CAPACITY(table->capacity); table->bucket_count = 0; table->num_entries = 0; table->entries = allocate_suitably(table, sizeof(Node*) * table->capacity, "Hash table array"); for (size_t i = 0; i < table->capacity; i++) { table->entries[i] = NULL; } DEBUG_MEMORY("Old capacity: %d. New capacity: %d", old_capacity, table->capacity); for (size_t i = 0; i < old_capacity; i++) { Node* old_entry = old_entries[i]; while (old_entry != NULL) { table_set_value_directly(table, old_entry->key, old_entry->value); Node* current = old_entry; old_entry = old_entry->next; deallocate_suitably(table, current, sizeof(Node), "Table linked list node"); } } DEBUG_MEMORY("Deallocating old table array."); deallocate_suitably(table, old_entries, sizeof(Node*) * old_capacity, "Hash table array"); table->is_growing = false; } void table_init(Table* table) { table->bucket_count = 0; table->capacity = 0; table->is_memory_infrastructure = false; table->is_growing = false; table->entries = NULL; table->num_entries = 0; } void table_init_memory_infrastructure(Table* table) { table_init(table); table->is_memory_infrastructure = true; } void table_set_cstring_key(Table* table, const char* key, Value value) { Value copied_key = MAKE_VALUE_OBJECT(object_string_copy_from_null_terminated(key)); table_set_value_directly(table, copied_key, value); } void table_set_value_directly(Table* table, struct Value key, Value value) { if (table->bucket_count + 1 > table->capacity * MAX_LOAD_FACTOR) { grow_table(table); } unsigned long hash; if (!value_hash(&key, &hash)) { FAIL("Couldn't hash"); } int slot = hash % table->capacity; Node* root_node = table->entries[slot]; Node* node = root_node; if (root_node == NULL) { table->bucket_count++; } while (node != NULL) { if (keys_equal(node->key, key)) { node->value = value; break; } node = node->next; } if (node == NULL) { Node* new_node = allocate_suitably(table, sizeof(Node), "Table linked list node"); new_node->key = key; new_node->value = value; new_node->next = root_node; table->entries[slot] = new_node; table->num_entries++; } } bool table_get_cstring_key(Table* table, const char* key, Value* out) { Value value_key = MAKE_VALUE_OBJECT(object_string_copy_from_null_terminated(key)); return table_get_value_directly(table, value_key, out); } bool table_get_value_directly(Table* table, Value key, Value* out) { if (table->capacity == 0) { return false; } unsigned long hash; if (!value_hash(&key, &hash)) { FAIL("Temporary FAIL: Couldn't hash in table_get."); } int slot = hash % table->capacity; Node* root_node = table->entries[slot]; Node* node = root_node; while (node != NULL) { if (keys_equal(node->key, key)) { *out = node->value; return true; } node = node->next; } /* Temporary: just for stats for this question */ times_called++; capacity_sum += table->capacity; avg_capacity = capacity_sum / times_called; if (table->capacity > max_capacity) { max_capacity = table->capacity; } bucket_sum += table->bucket_count; avg_bucket_count = bucket_sum / times_called; entries_sum += table->num_entries; avg_entries_sum = entries_sum / times_called; /* ....... */ return false; } bool table_delete(Table* table, Value key) { if (table->capacity == 0) { return false; } unsigned long hash; if (!value_hash(&key, &hash)) { return false; } int slot = hash % table->capacity; Node* root_node = table->entries[slot]; Node* node = root_node; Node* previous = NULL; while (node != NULL) { if (keys_equal(node->key, key)) { if (previous != NULL) { previous->next = node->next; deallocate_suitably(table, node, sizeof(Node), "Table linked list node"); } else { table->entries[slot] = node->next; deallocate_suitably(table, node, sizeof(Node), "Table linked list node"); } if (table->num_entries <= 0) { FAIL("table->num_entries <= 0 while deleting entry."); } table->num_entries--; return true; } else { previous = node; node = node->next; } } return false; } PointerArray table_iterate(Table* table, const char* alloc_string) { PointerArray array; pointer_array_init(&array, alloc_string); // TODO: Is it correct to iterate on table->capacity instead of table->count? for (size_t i = 0; i < table->capacity; i++) { Node* node = table->entries[i]; while (node != NULL) { pointer_array_write(&array, node); node = node->next; } } return array; } void table_print_debug_as_buckets(Table* table, bool show_empty_buckets) { for (size_t i = 0; i < table->capacity; i++) { Node* node = table->entries[i]; if (node != NULL || (node == NULL && show_empty_buckets)) { printf("Bucket [%3" PRI_SIZET "Z]", i); while (node != NULL) { printf(" ---> "); printf("<"); value_print(node->key); printf(" : "); value_print(node->value); printf(">"); node = node->next; } printf("\n"); } } } void table_free(Table* table) { PointerArray entries = table_iterate(table, "table free table_iterate buffer"); for (size_t i = 0; i < entries.count; i++) { Node* node = entries.values[i]; deallocate_suitably(table, node, sizeof(Node), "Table linked list node"); } pointer_array_free(&entries); deallocate_suitably(table, table->entries, sizeof(Node*) * table->capacity, "Hash table array"); table_init(table); } Table table_new_empty(void) { Table table; table_init(&table); return table; }

value.h

#ifndef plane_value_h #define plane_value_h #include "bytecode.h" #include "common.h" #include "dynamic_array.h" typedef enum { VALUE_NUMBER, VALUE_BOOLEAN, VALUE_NIL, VALUE_RAW_STRING, VALUE_CHUNK, VALUE_OBJECT, VALUE_ALLOCATION, // Internal VALUE_ADDRESS // Internal } ValueType; typedef struct { const char* data; int length; unsigned long hash; } RawString; typedef struct Value { ValueType type; union { double number; bool boolean; RawString raw_string; Bytecode chunk; struct Object* object; Allocation allocation; uintptr_t address; } as; } Value; #define MAKE_VALUE_NUMBER(n) (Value){.type = VALUE_NUMBER, .as.number = (n)} #define MAKE_VALUE_BOOLEAN(val) (Value){.type = VALUE_BOOLEAN, .as.boolean = (val)} #define MAKE_VALUE_NIL() (Value){.type = VALUE_NIL, .as.number = -1} #define MAKE_VALUE_RAW_STRING(cstring, the_length, the_hash) (Value){.type = VALUE_RAW_STRING, .as.raw_string \ = (RawString) {.data = (cstring), .length = (the_length), .hash = (the_hash)}} #define MAKE_VALUE_OBJECT(o) (Value){.type = VALUE_OBJECT, .as.object = (struct Object*)(o)} #define MAKE_VALUE_CHUNK(the_chunk) (Value){.type = VALUE_CHUNK, .as.chunk = (the_chunk)} #define MAKE_VALUE_ALLOCATION(the_name, the_size) (Value) {.type = VALUE_ALLOCATION, \ .as.allocation = (Allocation) {.name = the_name, .size = the_size}} #define MAKE_VALUE_ADDRESS(the_address) (Value) {.type = VALUE_ADDRESS, .as.address = (uintptr_t) the_address } #define ASSERT_VALUE_TYPE(value, expected_type) \ do { \ if (value.type != expected_type) { \ FAIL("Expected value type: %d, found: %d", expected_type, value.type); \ } \ } while (false) void value_print(Value value); bool value_compare(Value a, Value b, int* output); bool value_hash(Value* value, unsigned long* result); #endif

value.c

#include <stdio.h> #include <string.h> #include <math.h> #include "value.h" #include "plane_object.h" #include "memory.h" void value_print(Value value) { switch (value.type) { case VALUE_NUMBER: { printf("%g", value.as.number); return; } case VALUE_OBJECT: { object_print(value.as.object); return; } case VALUE_BOOLEAN: { printf(value.as.boolean ? "true" : "false"); return; } case VALUE_RAW_STRING: { RawString string = value.as.raw_string; printf("\"%.*s\"", string.length, string.data); return; } case VALUE_CHUNK: { Bytecode chunk = value.as.chunk; printf("< Chunk of size %d pointing at '%p' >", chunk.count, chunk.code); return; } case VALUE_NIL: { printf("nil"); return; } case VALUE_ADDRESS: { printf("%" PRIxPTR , value.as.address); return; } case VALUE_ALLOCATION: { Allocation allocation = value.as.allocation; printf("<Internal: allocation marker of '\%s' size %" PRI_SIZET ">", allocation.name, allocation.size); return; } } FAIL("Unrecognized VALUE_TYPE: %d", value.type); } bool value_compare(Value a, Value b, int* output) { if (a.type != b.type) { *output = -1; return true; } switch (a.type) { case VALUE_NUMBER: { double n1 = a.as.number; double n2 = b.as.number; if (n1 == n2) { *output = 0; } else if (n1 > n2) { *output = 1; } else { *output = -1; } return true; } case VALUE_BOOLEAN: { bool b1 = a.as.boolean; bool b2 = b.as.boolean; if (b1 == b2) { *output = 0; } else { *output = -1; } return true; } case VALUE_OBJECT: { bool objectsEqual = object_compare(a.as.object, b.as.object); if (objectsEqual) { *output = 0; } else { *output = -1; } return true; } case VALUE_NIL: { *output = 0; return true; } case VALUE_ADDRESS: { uintptr_t addr1 = a.as.address; uintptr_t addr2 = b.as.address; if (addr1 == addr2) { *output = 0; } else if (addr1 < addr2) { *output = -1; } else { *output = 1; } return true; } case VALUE_ALLOCATION: { Allocation alloc1 = a.as.allocation; Allocation alloc2 = b.as.allocation; *output = (alloc1.size == alloc2.size) && (strcmp(alloc1.name, alloc2.name) == 0); /* BUG ??? */ return true; } case VALUE_CHUNK: { FAIL("Attempting to compare chunks. Shouldn't happen."); return false; } case VALUE_RAW_STRING: { RawString s1 = a.as.raw_string; RawString s2 = b.as.raw_string; if (cstrings_equal(s1.data, s1.length, s2.data, s2.length)) { *output = 0; } else { *output = -1; } return true; } } FAIL("Couldn't compare values. Type A: %d, type B: %d", a.type, b.type); return false; } bool value_hash(Value* value, unsigned long* result) { switch (value->type) { case VALUE_OBJECT: { unsigned long hash; if (object_hash(value->as.object, &hash)) { *result = hash; return true; } return false; } case VALUE_CHUNK: { FAIL("Since Bytecode values aren't supposed to be reachable directly from user code, this should never happen."); return false; } case VALUE_BOOLEAN: { *result = value->as.boolean ? 0 : 1; return true; } case VALUE_NUMBER: { *result = hash_int(floor(value->as.number)); // TODO: Not good at all, redo this return true; } case VALUE_NIL: { *result = 0; return true; } case VALUE_RAW_STRING: { RawString string = value->as.raw_string; *result = string.hash; return true; } case VALUE_ADDRESS: { *result = hash_int(value->as.address); // Not good at all, but should logically work return true; } case VALUE_ALLOCATION: { return false; } } FAIL("value.c:hash_value - shouldn't get here."); return false; }

pointerarray.h

#ifndef plane_pointerarray_h #define plane_pointerarray_h typedef struct { int count; int capacity; void** values; const char* alloc_string; /* Kind of ugly, purely for debugging */ } PointerArray; void pointer_array_init(PointerArray* array, const char* alloc_string); void pointer_array_write(PointerArray* array, void* value); void pointer_array_free(PointerArray* array); void** pointer_array_to_plain_array(PointerArray* array, const char* what); #endif

memory.h

#ifndef plane_memory_h #define plane_memory_h #include "common.h" typedef struct { const char* name; size_t size; } Allocation; #define GROW_CAPACITY(capacity) (capacity) < 8 ? 8 : (capacity) * 2 size_t get_allocated_memory(); size_t get_allocations_count(); void memory_init(void); void* allocate(size_t size, const char* what); void deallocate(void* pointer, size_t oldSize, const char* what); void* reallocate(void* pointer, size_t oldSize, size_t newSize, const char* what); void* allocate_no_tracking(size_t size); void deallocate_no_tracking(void* pointer); void* reallocate_no_tracking(void* pointer, size_t new_size); void memory_print_allocated_entries(); // for debugging #endif

plane_object.h

#ifndef plane_object_h #define plane_object_h #include <Windows.h> #include "bytecode.h" #include "common.h" #include "table.h" #include "value.h" #include "cell_table.h" typedef enum { OBJECT_STRING, OBJECT_FUNCTION, OBJECT_CODE, OBJECT_TABLE, OBJECT_CELL, OBJECT_MODULE, OBJECT_THREAD, OBJECT_CLASS, OBJECT_INSTANCE, OBJECT_BOUND_METHOD } ObjectType; typedef enum { METHOD_ACCESS_SUCCESS, METHOD_ACCESS_NO_SUCH_ATTR, METHOD_ACCESS_ATTR_NOT_BOUND_METHOD } MethodAccessResult; typedef struct Object { ObjectType type; struct Object* next; CellTable attributes; bool is_reachable; } Object; typedef struct ObjectString { Object base; char* chars; int length; unsigned long hash; } ObjectString; typedef struct ObjectTable { Object base; Table table; } ObjectTable; typedef struct ObjectCode { Object base; Bytecode bytecode; } ObjectCode; typedef bool (*NativeFunction)(Object*, ValueArray, Value*); typedef struct ObjectCell { Object base; Value value; bool is_filled; } ObjectCell; typedef struct ObjectFunction { Object base; char* name; char** parameters; int num_params; bool is_native; CellTable free_vars; union { NativeFunction native_function; ObjectCode* code; }; } ObjectFunction; typedef struct ObjectModule { Object base; ObjectString* name; ObjectFunction* function; HMODULE dll; } ObjectModule; #define THREAD_EVAL_STACK_MAX 255 #define THREAD_CALL_STACK_MAX 255 typedef struct { uint8_t* return_address; ObjectFunction* function; Object* base_entity; CellTable local_variables; bool is_entity_base; bool is_native; bool discard_return_value; } StackFrame; typedef struct ObjectThread { Object base; char* name; struct ObjectThread* previous_thread; struct ObjectThread* next_thread; uint8_t* ip; ObjectFunction* base_function; Value* eval_stack_top; Value eval_stack[THREAD_EVAL_STACK_MAX]; StackFrame* call_stack_top; StackFrame call_stack[THREAD_CALL_STACK_MAX]; } ObjectThread; typedef struct ObjectInstance ObjectInstance; typedef void (*DeallocationFunction)(struct ObjectInstance *); typedef Object** (*GcMarkFunction)(struct ObjectInstance *); typedef struct ObjectClass { /* TODO: Make distinction between plane and native classes clearer. Different types? Flag? Union? */ Object base; char* name; int name_length; ObjectFunction* base_function; size_t instance_size; DeallocationFunction dealloc_func; GcMarkFunction gc_mark_func; } ObjectClass; typedef struct ObjectInstance { Object base; ObjectClass* klass; bool is_initialized; } ObjectInstance; typedef struct ObjectBoundMethod { Object base; Object* self; ObjectFunction* method; } ObjectBoundMethod; ObjectString* object_string_copy(const char* string, int length); ObjectString* object_string_take(char* chars, int length); ObjectString* object_string_clone(ObjectString* original); ObjectString** object_create_copied_strings_array(const char** strings, int num, const char* allocDescription); ObjectString* object_string_copy_from_null_terminated(const char* string); ObjectFunction* object_user_function_new(ObjectCode* code, char** parameters, int numParams, CellTable free_vars); ObjectFunction* object_native_function_new(NativeFunction nativeFunction, char** parameters, int numParams); void object_function_set_name(ObjectFunction* function, char* name); ObjectFunction* make_native_function_with_params(char* name, int num_params, char** params, NativeFunction function); ObjectCode* object_code_new(Bytecode chunk); ObjectTable* object_table_new(Table table); ObjectTable* object_table_new_empty(void); ObjectCell* object_cell_new(Value value); ObjectCell* object_cell_new_empty(void); ObjectClass* object_class_new(ObjectFunction* base_function, char* name); ObjectClass* object_class_native_new( char* name, size_t instance_size, DeallocationFunction dealloc_func, GcMarkFunction gc_mark_func, ObjectFunction* constructor, void* descriptors[][2]); void object_class_set_name(ObjectClass* klass, char* name, int length); ObjectInstance* object_instance_new(ObjectClass* klass); ObjectModule* object_module_new(ObjectString* name, ObjectFunction* function); ObjectModule* object_module_native_new(ObjectString* name, HMODULE dll); ObjectBoundMethod* object_bound_method_new(ObjectFunction* method, Object* self); ObjectThread* object_thread_new(ObjectFunction* function, char* name); void object_thread_push_eval_stack(ObjectThread* thread, Value value); Value object_thread_pop_eval_stack(ObjectThread* thread); void object_thread_push_frame(ObjectThread* thread, StackFrame frame); StackFrame object_thread_pop_frame(ObjectThread* thread); StackFrame* object_thread_peek_frame(ObjectThread* thread, int offset); bool object_compare(Object* a, Object* b); bool object_strings_equal(ObjectString* a, ObjectString* b); void object_free(Object* object); void object_print(Object* o); void object_print_all_objects(void); void object_thread_print(ObjectThread* thread); void object_thread_print_diagnostic(ObjectThread* thread); bool object_hash(Object* object, unsigned long* result); // MethodAccessResult object_get_method(Object* object, const char* method_name, ObjectFunction** out); MethodAccessResult object_get_method(Object* object, const char* method_name, ObjectBoundMethod** out); #define OBJECT_AS_STRING(o) (object_as_string(o)) #define OBJECT_AS_FUNCTION(o) (object_as_function(o)) ObjectString* object_as_string(Object* o); ObjectFunction* object_as_function(Object* o); bool object_value_is(Value value, ObjectType type); void object_set_attribute_cstring_key(Object* object, const char* key, Value value); bool object_load_attribute(Object* object, ObjectString* name, Value* out); bool object_load_attribute_cstring_key(Object* object, const char* name, Value* out); bool load_attribute_bypass_descriptors(Object* object, ObjectString* name, Value* out); /* Internal: only external to be used by some tests */ ObjectInstance* object_descriptor_new(ObjectFunction* get, ObjectFunction* set); ObjectInstance* object_descriptor_new_native(NativeFunction get, NativeFunction set); bool is_instance_of_class(Object* object, char* klass_name); bool is_value_instance_of_class(Value value, char* klass_name); ObjectFunction* object_make_constructor(int num_params, char** params, NativeFunction function); #define VALUE_AS_OBJECT(value, object_type, cast) object_value_is(value, object_type) ? (cast*) value.as.object : NULL #define ASSERT_VALUE_AS_OBJECT(variable, value, object_type, cast, error) \ do { \ variable = VALUE_AS_OBJECT((value), object_type, cast); \ if (variable == NULL) { \ FAIL(error); \ } \ } while (false); #define ASSERT_VALUE_IS_OBJECT(value, object_type, error_message) \ do { \ if (!object_value_is(value, object_type)) { \ FAIL(error_message); \ } \ } while (false); \ #endif

common.h

#ifndef plane_common_h #define plane_common_h #include <stdlib.h> #include <stddef.h> #include <stdint.h> #include <stdbool.h> #include <stdio.h> #include <inttypes.h> #include <assert.h> /* Generally keep these OFF unless you need them specifically */ #define DISABLE_GC 0 // Only set to 1 for debugging purposes when you need the GC to not run #define DEBUG 0 // General debug printing #define DEBUG_TRACE_EXECUTION 0 // Show stack operations #define DEBUG_THREADING 0 #define DEBUG_GC 0 // Show GC operations #define DEBUG_OBJECTS 0 // Show object operations #define DEBUG_MEMORY_EXECUTION 0 // Show low-level memory operations #define DEBUG_SCANNER 0 // Show low level lexing output and such #define DEBUG_PAUSE_AFTER_OPCODES 0 // Wait for user input after each opcode #define DEBUG_TABLE_STATS 0 // Collect statistics on general hash table behavior /* ****************** */ /* Probably leave this ON most of the time during DEV. Disable for release. */ #define GC_STRESS_TEST 0 // Run GC every loop iteration. Used to help GC bugs surface. Obviously really bad for performance /* ************** */ /* Always leave these two ON in DEV. Probably disable for release */ #define MEMORY_DIAGNOSTICS 0 // Usually leave on in dev. Disable for release #define DEBUG_IMPORTANT 1 // Pretty much always leave this on, at least in dev - printing critical diagnosis and such /* **************** */ #if DEBUG #define DEBUG_PRINT(...) do { \ fprintf(stdout, "DEBUG: "); \ fprintf (stdout, __VA_ARGS__); \ fprintf(stdout, "\n"); \ } while (false) #else #define DEBUG_PRINT(...) do {} while(false) #endif #if DEBUG_MEMORY_EXECUTION #define DEBUG_MEMORY(...) do { \ fprintf(stdout, "MEMORY: "); \ fprintf (stdout, __VA_ARGS__); \ fprintf(stdout, "\n"); \ } while (false) #else #define DEBUG_MEMORY(...) do {} while(false) #endif #if DEBUG_THREADING #define DEBUG_THREADING_PRINT(...) do { \ fprintf (stdout, __VA_ARGS__); \ } while (false) #else #define DEBUG_THREADING_PRINT(...) do {} while(false) #endif #if DEBUG_IMPORTANT #define DEBUG_IMPORTANT_PRINT(...) do { \ fprintf (stdout, __VA_ARGS__); \ } while (false) #else #define DEBUG_IMPORTANT_PRINT(...) do {} while(false) #endif #if DEBUG_TRACE_EXECUTION #define DEBUG_TRACE(...) do { \ fprintf (stdout, __VA_ARGS__); \ fprintf (stdout, "\n"); \ } while (false) #else #define DEBUG_TRACE(...) do {} while(false) #endif #if DEBUG_SCANNER #define DEBUG_SCANNER_PRINT(...) do { \ fprintf (stdout, "Scanner: " __VA_ARGS__); \ fprintf (stdout, "\n"); \ } while (false) #else #define DEBUG_SCANNER_PRINT(...) do {} while(false) #endif #if DEBUG_OBJECTS #define DEBUG_OBJECTS_PRINT(...) do { \ fprintf (stdout, __VA_ARGS__); \ fprintf (stdout, "\n"); \ } while (false) #else #define DEBUG_OBJECTS_PRINT(...) do {} while(false) #endif #if DEBUG_GC #define DEBUG_GC_PRINT(...) do { \ fprintf (stdout, __VA_ARGS__); \ fprintf (stdout, "\n"); \ } while (false) #else #define DEBUG_GC_PRINT(...) do {} while(false) #endif // TODO: actual assertion or error mechanisms #define FAIL(...) do { \ fprintf(stdout, "\nFAILING! Reason:'"); \ fprintf(stdout, __VA_ARGS__); \ fprintf(stdout, "'\n"); \ exit(EXIT_FAILURE); \ } while(false) #endif #define PRINTLN(str) do { \ printf("\n"); \ printf(str); \ printf("\n"); \ } while (false) #ifdef _WIN32 # ifdef _WIN64 # define PRI_SIZET PRIu64 # else # define PRI_SIZET PRIu32 # endif #else # define PRI_SIZET "zu" #endif

What is the average collision chain length, what is the average number of loop iterations in each a) successful b) unsuccessful search? There'd be nothing wrong with 2…. 42 would hurt one way, 1.007 the other. — greybeard
– greybeard, Commented Apr 28, 2020 at 4:49
% is spent only inside the function itself, not including its called children Did you control/check inlining? How and to what result? — greybeard
– greybeard, Commented Apr 28, 2020 at 4:51
This question needs an answer from people who are very good at optimization. For that I am assisting the OP with a bounty of 200 pts. — pacmaninbw
– pacmaninbw ♦, Commented Apr 30, 2020 at 20:15
Calling std::unordered_map::find takes about 20ns per call if it only contains a few integers. So it's faster, but not by a lot. Are you sure you're not calling this too often? — sudo rm -rf slash
– sudo rm -rf slash, Commented Apr 30, 2020 at 22:24

G. Sliepen · Accepted Answer · 2020-05-01 19:31:47Z

Code review

I am just going to review table_get_value_directly() for efficiency. However, the findings might or might not be relevant for the performance figures you are getting, since it depends a lot on how the program is compiled (optimization level, whether LTO is used or not), and what data is is actually hashing.

bool table_get_value_directly(Table* table, Value key, Value* out) {

Already at the start there is room for improvement: you can make table const, since this function should not change the contents of the hash tabel. Furthermore, key is copied by value here, it is likely to be more efficient to pass it as a const pointer.

 if (table->capacity == 0) { return false; }

The first thing you are doing is checking whether the capacity is zero. However, if you would ensure that a valid Table always has a non-zero capacity, you can omit this check.

 unsigned long hash; if (!value_hash(&key, &hash)) { FAIL("Couldn't hash"); }

Here is error checking for the case that someone provides a Value that cannot be hashed. It would be better if there was no possibility for an error here. Either make sure all possible Values have a valid hash value, or just return a special value for hash errors. This will cause a lookup in the rest of the code, but then it will discard the result because keys_equal() will return false. Since it is hopefully very unlikely that you call table_get_value_directly() with an unhashable Value, this extra check will not hurt the average performance.

 int slot = hash % table->capacity;

The modulo operation does an integer division, which can be very slow (it is typically tens of CPU cycles, but the exact speed depends on your CPU model). If you ensure that the table capacity is always a power of two, you could instead do a bitwise AND instead, which is very fast, like so:

 int slot = hash & (table->capacity - 1);

If you ensure you store capacity as the actual number minus 1, you can omit the -1 here.

 Node* root_node = table->entries[slot]; Node* node = root_node; while (node != NULL) { if (keys_equal(node->key, key)) { *out = node->value; return true; }

This bit is making a lot of copies. First, keys_equal() makes copies of the two Values. The compiler could optimize that away since it is a static function, however it in turn calls value_compare() which also takes the parameter by value, so copies still have to be made.

Furthermore, once the right node is found, this is making a copy of node->value. It will likely be more efficient to just return a const pointer to node->value.

All this of course depends on how big a Value is, but since it is a large union that includes some non-trivial types, I think it is better to use pointers.

 node = node->next; } return false; }

One issue with your use of a linked list per hash bucket is that, if you were to have many items in each bucket, the above code would have to do pointer lookups for each node in the list. That would be slower than using linear probing. However, your statistics suggest that you only have one entry in there, so this won't matter in this case.

Analysis of the assembler output

I compiled your table_get_value_directly(), as well as an version that I optimized according to what I wrote above, and looked at the generated assembly.

Looking at raw instruction counts, your version has 75 instructions, while the optimized version only uses 34 instructions. But looking more closely, your version results in a div instruction for the modulo (which is quite slow), but it also spends 17 mov and 8 push instructions on the inlined call to keys_equal(), which are there to prepare the arguments of the call to value_compare(). For the line *out = node->value, it seems GCC manages to combine a lot of moves into a few SSE instructions.

The optimized version, which uses (const) pointers to Values where possible, uses much less mov and push instructions, and has no div. So that means less instructions to execute, less memory bandwidth used, and less CPU cycles to spend.

Reason for the large self-time

Proper hash tables are O(1), however that doesn't mean that a lookup will be fast, only that it will not get slower the more entries you have in a hash table. There still is an overhead for doing a hash table lookup. And if the keys you are looking up are very small, then that overhead will be large compared to a call to value_compare(), and thus a relatively large time will be spent in the hash table lookup logic itself.

About performance measurements

Measuring performance is always a bit tricky, since it often is the case that the mere act of instrumenting your code to measure its performance changes the performance.

If you can run your code on Linux, then instead of using gprof, I would consider using Linux perf. It does not need the binary to be instrumented, just ensure it is compiled with debugging symbols. It uses statistical profiling, where it interrupts the program at random times to check at which instruction it is. Use the perf record -g command to capture call graphs, so you can get similar information as with gperf. However, perf record will also capture exact locations within the binary, so you can also use it to zoom in on a function, and see the assembly code annotated with how often each instruction is hit, which can give you hints to which lines of code are the largest contributors to the performance.

For Windows, there are alternatives such as Intel VTune and Visual Studio Profiler that might be able to give similar insights.

Toby Speight · Accepted Answer · 2021-10-24 14:41:43Z

Suspect: Weak hash

GROW_CAPACITY() only makes for powers of 2 in table->capacity ...

#define GROW_CAPACITY(capacity) (capacity) < 8 ? 8 : (capacity) * 2 ... value_hash(&key, &hash) int slot = hash % table->capacity;

... so all that work to make a good hash results in only using the last few bits from value_hash() as code is modding by a power-of-2. The entire quality of the hash is thus dependent of its least significant bits.

If value_hash() is a real good hash, then using any bits is OK. Yet if value_hash() has weaknesses, (say it favors forming even hash values or not uniformly distributed for the keys given to it in its least significant bits), then the later code will call keys_equal() more often than with a good hash due to increased collisions, potentially reducing performance to that of a linked-list. This is a source of inefficiency.

while (node != NULL) { if (keys_equal(node->key, key)) {

To help along weak hash functions, simply use a prime capacity, rather than doubling at each step.

Then slot will depend on all bits of hash.

I recommend using a table of primes just lower than powers of 4 for the capacity.

size_t prime2[capacity_index] = { 3, 13, 61, 251, ... }

Conclusion: Performing % table->capacity with a prime will not harm a good hash function, yet will improve weaker ones and reduce collisions.

[Edit] Hmmm. OP has "though only ever iterating through one entry" so this may not be the case. OP does have "never does more than one iteration" yet that never seems suspicious as that is too perfect.

Use function pointers

For a linear improvement, use a pointer to various hash functions rather than one hash function with a switch().

typedef bool (*)(Value* value, unsigned long* result) func_t; bool value_hash(Value* value, unsigned long* result) { // switch (value->type) { // ... func_t hash_func[VALUE_ADDRESS + 1] = {value_hash_NUMBER, value_hash_BOOLEAN, ... }; return hash_func[value->type](value, result); }

Same for value_compare().

@Aviv Cohn I hope the ideas resulted in noticeable performance improvements. — chux
– chux, Commented May 9, 2020 at 2:46

Stack Exchange Network

Hash table lookup is very slow - though only ever iterating through one entry

2 Answers 2

Code review

Analysis of the assembler output

Reason for the large self-time

About performance measurements

You must log in to answer this question.

Hot Network Questions

Hash table lookup is very slow - though only ever iterating through one entry

2 Answers 2

Code review

Analysis of the assembler output

Reason for the large self-time

About performance measurements

You must log in to answer this question.

Related

Hot Network Questions