HyperLogLog Variants and Benchmarking

Introduction

This project focuses on implementing and benchmarking a practical family of HyperLogLog algorithms: HL2, HL3, and an enhanced version, HL4. The aim is to develop efficient versions in terms of time and memory usage, accepting rare "accidents" whose impact is negligible in practice.

Guide to Clone, Compile, and Run the HLX Project

This guide explains how to clone the HLX project's Git repository, compile the source code, and run the ./benchmark program with custom or default parameters.

Cloning the Project

Open your terminal.
Clone the HLX repository from GitHub using the git clone command:

git clone git@github.com:cagret/HLX.git cd HLX

Compiling the Project

Ensure you have GCC (the C compiler) installed on your system.
Use the make command to compile the project. This command will utilize the provided Makefile:

make

Running the `./benchmark` Program

You can execute the ./benchmark program using the following command with custom p and q parameters, or use the default parameters if you prefer:

./benchmark -p 15 -q 8 -n 100000

In our HyperLogLog (HLL) implementation, we utilize the XXH64 function from the XXHash library for efficient hashing of data elements. This function is integral to the accuracy and performance of the HLL algorithm. Here's a brief overview of how XXH64 is used:

Function Purpose: XXH64 is a high-speed hashing function that generates a 64-bit hash value. It's known for its exceptional speed and is widely used in scenarios where performance is critical.
Parameters:
- p: This is a pointer to the data you wish to hash. The data can be of any type, such as a string, a number, or a custom structure.
- q (length): The size of the data pointed to by p in bytes. This length tells XXH64 how much data to read and hash from p.
- n: Number of unique elements to generate and insert into the HyperLogLog algorithm for benchmarking tests.
Usage in HLL: The hash value returned by XXH64 is used to determine the register index to update and the rank of the hash, which are key components in estimating the cardinality (the number of unique elements) in a dataset.
The -p option allows you to specify the value of p (e.g., -p 15).
The -q option allows you to specify the value of q (e.g., -q 8).
The -n option allows you to specify the value of n (e.g., -n 100000).

HL2 and HL3: The Basics

HL2 (HyperLogLog 2): An early version of the HyperLogLog algorithm, focusing on basic probabilistic counting with a simple data structure.
HL3 (HyperLogLog 3): An improved version of HL2, offering better accuracy and efficiency in counting distinct elements.

HL4: An Advanced HyperLogLog Variant

HL4: Builds upon HL3 by introducing a layered structure comprising a mega counter, a list of super counters, and a list (of lists) of counters.
Insertion Logic:
If the hash value is lower than the mega counter, the process ends.
If not, the algorithm checks the corresponding super counter.
If the super counter's value is also higher, the algorithm updates the regular counter.

Parameters of HL4

Base: Experimenting with bases 2 and 4, with a potential exploration of base 8.
Counter Size: Restricted to 2 and 4 for implementation reasons. Sizes 1 and 3 are impractical, and 5 approximates a standard sketch.
Super Counter Size: Set to 4, as 2 might result in too few super counters (thus, fewer sub-sketches), and 6 could lead to an excessively large number of sub-sketches.
Number of Super Counters / Sub-Sketches: This parameter is crucial and needs testing to determine the optimal size that balances memory costs and accident probabilities.

Benchmarking

The benchmarking process involves comparing HL2, HL3, and HL4 in terms of memory usage and execution time. The goal is to evaluate the performance trade-offs, especially how the enhancements in HL4 impact the overall efficiency and accuracy.

Conclusion

HL4 represents an innovative approach in the HyperLogLog family, aiming to strike a balance between computational efficiency and accuracy. The benchmarking will shed light on its practical applicability and performance metrics.

Contributions

Contributions are always welcome! Please fork the project and open a pull request with your proposed changes.

License

Distributed under the GNU GENERAL PUBLIC LICENSE Version 3, dated 29 June 2007. See LICENSE for more information.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
benchmark.c		benchmark.c
common_hll.c		common_hll.c
common_hll.h		common_hll.h
hl2.c		hl2.c
hl2.h		hl2.h
hl3.c		hl3.c
hl3.h		hl3.h
hl4.c		hl4.c
hl4.h		hl4.h
xxhash.c		xxhash.c
xxhash.h		xxhash.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HyperLogLog Variants and Benchmarking

Introduction

Guide to Clone, Compile, and Run the HLX Project

Cloning the Project

Compiling the Project

Running the `./benchmark` Program

HL2 and HL3: The Basics

HL4: An Advanced HyperLogLog Variant

Parameters of HL4

Benchmarking

Conclusion

Contributions

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

HyperLogLog Variants and Benchmarking

Introduction

Guide to Clone, Compile, and Run the HLX Project

Cloning the Project

Compiling the Project

Running the ./benchmark Program

HL2 and HL3: The Basics

HL4: An Advanced HyperLogLog Variant

Parameters of HL4

Benchmarking

Conclusion

Contributions

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Running the `./benchmark` Program

Packages