This project focuses on implementing and benchmarking a practical family of HyperLogLog algorithms: HL2, HL3, and an enhanced version, HL4. The aim is to develop efficient versions in terms of time and memory usage, accepting rare "accidents" whose impact is negligible in practice.
This guide explains how to clone the HLX project's Git repository, compile the source code, and run the ./benchmark program with custom or default parameters.
-
Open your terminal.
-
Clone the HLX repository from GitHub using the
git clonecommand:
git clone git@github.com:cagret/HLX.git cd HLX-
Ensure you have GCC (the C compiler) installed on your system.
-
Use the
makecommand to compile the project. This command will utilize the providedMakefile:
makeYou can execute the ./benchmark program using the following command with custom p and q parameters, or use the default parameters if you prefer:
./benchmark -p 15 -q 8 -n 100000In our HyperLogLog (HLL) implementation, we utilize the XXH64 function from the XXHash library for efficient hashing of data elements. This function is integral to the accuracy and performance of the HLL algorithm. Here's a brief overview of how XXH64 is used:
-
Function Purpose:
XXH64is a high-speed hashing function that generates a 64-bit hash value. It's known for its exceptional speed and is widely used in scenarios where performance is critical. -
Parameters:
p: This is a pointer to the data you wish to hash. The data can be of any type, such as a string, a number, or a custom structure.q(length): The size of the data pointed to bypin bytes. This length tellsXXH64how much data to read and hash fromp.n: Number of unique elements to generate and insert into the HyperLogLog algorithm for benchmarking tests.
-
Usage in HLL: The hash value returned by
XXH64is used to determine the register index to update and the rank of the hash, which are key components in estimating the cardinality (the number of unique elements) in a dataset. -
The
-poption allows you to specify the value ofp(e.g.,-p 15). -
The
-qoption allows you to specify the value ofq(e.g.,-q 8). -
The
-noption allows you to specify the value ofn(e.g.,-n 100000).
- HL2 (HyperLogLog 2): An early version of the HyperLogLog algorithm, focusing on basic probabilistic counting with a simple data structure.
- HL3 (HyperLogLog 3): An improved version of HL2, offering better accuracy and efficiency in counting distinct elements.
- HL4: Builds upon HL3 by introducing a layered structure comprising a mega counter, a list of super counters, and a list (of lists) of counters.
- Insertion Logic:
- If the hash value is lower than the mega counter, the process ends.
- If not, the algorithm checks the corresponding super counter.
- If the super counter's value is also higher, the algorithm updates the regular counter.
- Base: Experimenting with bases 2 and 4, with a potential exploration of base 8.
- Counter Size: Restricted to 2 and 4 for implementation reasons. Sizes 1 and 3 are impractical, and 5 approximates a standard sketch.
- Super Counter Size: Set to 4, as 2 might result in too few super counters (thus, fewer sub-sketches), and 6 could lead to an excessively large number of sub-sketches.
- Number of Super Counters / Sub-Sketches: This parameter is crucial and needs testing to determine the optimal size that balances memory costs and accident probabilities.
The benchmarking process involves comparing HL2, HL3, and HL4 in terms of memory usage and execution time. The goal is to evaluate the performance trade-offs, especially how the enhancements in HL4 impact the overall efficiency and accuracy.
HL4 represents an innovative approach in the HyperLogLog family, aiming to strike a balance between computational efficiency and accuracy. The benchmarking will shed light on its practical applicability and performance metrics.
Contributions are always welcome! Please fork the project and open a pull request with your proposed changes.
Distributed under the GNU GENERAL PUBLIC LICENSE Version 3, dated 29 June 2007. See LICENSE for more information.