Building a Hash Table in C.

Source

A hash table is an associative array, what is an associative array? An associative array helps one store and find things using words or names. One way to imagine an associative array is with laundry. Imagine you have a dresser and two drawers, one for socks and the other for shirts. Only socks can go in the sock drawer and shirts in the shirt drawer. This is a key, value structure. When you need a pair of socks you read the key label "socks" on the drawer and pull out a pair of socks which are the value.

hash_table.h


// we have the sock and the sock drawer
typedef struct {
    char* key;,
    char* value;
} ht_item;
// hash talbe stores an array of pointers to items,
// the size and the count
typedef struct {
    int size;
    int count;
    ht_item** item; // pointer to a pointer of ht_item
} ht_hash_table;

A hash table item `ht_item` is the sock drawer and the pair of socks. Next we have the hash table itself


hash_table.c

Now we move on to building out our hash table in hash_table.c. We need to carve out some memory on the stack for the size of an ht_item which holds key value pairs. In this line of code ht_item* i = malloc(sizeof(ht_item)); we are carving out a chunk of memory on the heap for a pointer of an `ht_item` that will hold a key value pair.

static ht_item* ht_new_item(const char* k, const char* v){

    ht_item* i = malloc(sizeof(ht_item));
    i->key = strdup(k);
    i->value = strdup(v);

}

Next we need a function to initialize the hash table we will be making. To make a hash table we need carve out memory on the heap and have a pointer to it. Then we choose an arbitrary number for the size of a hash table being 53, which is a prime and something we will get into a little later. The count is set to 0 because there is currently nothing in our hash table. This last line t->items = calloc((size_t)ht->size, sizeof(ht_item*)); seems a little complex so lets break it down. Calloc allocates memory on the heap for the number of items, or buckets our hash table will have (size_t)ht->size,. The function calloc stands for contiguous allocation, which allocates memory for an array of elements and initializes all the bytes to zero, where as malloc does not initialize all the bytes. Then we have , sizeof(ht_item*) is used to tell calloc how much memory to allocate for each pointer to a key-value pair.

ht_hash_table* ht_new() {
	ht_hash_table* ht = malloc(sizeof(ht_hash_table));
	ht->size = 53;
	ht->count = 0; 
	ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
	return ht;
}

Next we need an option to delete the hash table itself. The function takes a point to a hash table struct, we iterate through it and if a key-value pair is not empty we use our delete function. Afterwards we free items in the hash table because we used malloc to make them and then we do the same for the hash table itself.

static void ht_del_item(ht_item* i) {
	free(i->key);
	free(i->value);
	free(i);
}
void ht_del_hash_table(ht_hash_table* ht) {
	// we loop through the entire hash table
	for (int i = 0; i < ht->size; i++) {
		// refer to the current item a key-value pair
		ht_item* item = ht->items[i];
		// if it's not null delete it
		if(item != NULL) {
			ht_del_item(item);
		}
	}

	// free items in hash table
	free(ht->items);
	// free the hash table
	free(ht);
}
// create and destroy the hash table
int main() {
	ht_hash_table* ht = ht_new();
	ht_del_hash_table(ht);
}

Now is the part we talk about hashing in general. Im not a big math guy so take what I say here with a grain of salt. In the function we are most concerned with a and m, a is the multiplier or the base, and m is the modulus or the hash table size or number of buckets. For "a" the choice of the multiplier is important, usually it will be a prime number that is larger than the alphabets size, if the function only accepts the english alphabet, which is 26 letters a good prime number would be 53 because it's twice the size of the alphabet. A larger prime number is chosen to reduce the number of collisions and ensure a good spread of hashed values, for example hashing "abc" and "cba" could potentially "collide" or have the same hash value if a prime number like 19 was chosen. For m, m constrains the hashing value and should be equal to the size of the hash table this ensures that the hash values generated will be valid indices within the array we are using.

// function takes in a pointer to our string "key",
// a or the multiplier and m which is the modulus
static int ht_hash(const char* s, const int a, const int m) {
	// initialize the hash so there is no garbage
	long hash = 0;
	// get the length of our key
	const int len_s = strlen(s);
	// loop through the length of our key
	for(int i = 0; i < len_s; i++) {
		// our hash function
		hash += (long)pow(a, len_s - (i+1)) *s[i];
		hash = hash % m;
	}

	return (int)hash;
}