simhash is a lightweight Go package for generating Simhash tokens and measuring their similarity, based on Moses Charikar's Simhash algorithm. It is well suited to applications such as text deduplication, plagiarism detection, near-duplicate content detection, and fingerprinting.
For detailed usage, check this.
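As background, the Simhash algorithm named above can be sketched in a few lines of standard-library Go. The code below is an illustration only, not this package's implementation: the helper names `hash64`, `fingerprint`, and `similarity` are hypothetical, and FNV-1a is an assumed choice of hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// hash64 maps a feature to 64 bits with FNV-1a (an assumption for this
// sketch; the package may use a different hash function).
func hash64(feature string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(feature))
	return h.Sum64()
}

// fingerprint builds a 64-bit Simhash: every feature votes +weight on each
// bit set in its hash and -weight on each clear bit. Bits whose total is
// positive are set in the result.
func fingerprint(features map[string]int) uint64 {
	var votes [64]int
	for feature, weight := range features {
		h := hash64(feature)
		for i := 0; i < 64; i++ {
			if h&(1<<uint(i)) != 0 {
				votes[i] += weight
			} else {
				votes[i] -= weight
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// similarity returns the percentage of bit positions on which a and b agree.
func similarity(a, b uint64) float64 {
	distance := bits.OnesCount64(a ^ b) // Hamming distance
	return 100 * float64(64-distance) / 64
}

func main() {
	a := fingerprint(map[string]int{"example": 2, "test": 5})
	b := fingerprint(map[string]int{"example": 2, "testcase": 5})
	fmt.Printf("%016X %016X %.2f\n", a, b, similarity(a, b))
}
```

Because each feature votes with its weight, heavily weighted features dominate the fingerprint, and inputs that share most of their weight tend to differ in only a few bits. Two fingerprints that differ in 36 of 64 positions, for example, score (64 - 36) / 64 = 43.75%.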
To get started with simhash, install it using:

    go get github.com/erfanmomeniii/simhash

Next, import it in your application:

    import "github.com/erfanmomeniii/simhash"

The following example demonstrates how to generate Simhash tokens and calculate their similarity:

    package main

    import (
        "fmt"

        "github.com/erfanmomeniii/simhash"
    )

    func main() {
        // Create a new Simhash instance
        s := simhash.NewSimhash()

        // Add features with weights
        s.AddFeature("example", 2)
        s.AddFeature("test", 5)

        // Generate a Simhash token
        token1 := s.GenerateToken()

        // Create another Simhash instance with different features
        s2 := simhash.NewSimhash()
        s2.AddFeature("example", 2)
        s2.AddFeature("testcase", 5)

        // Generate another token
        token2 := s2.GenerateToken()

        // Compute similarity between the two tokens
        similarity := simhash.ComputeSimilarity(token1, token2)

        fmt.Printf("Token1: %s\nToken2: %s\nSimilarity: %.2f\n", token1, token2, similarity)
    }

Output:

    Token1: F9E6E6EF197C2B25
    Token2: FDA981914657B7D1
    Similarity: 43.75

Add features with their weights to the Simhash generator:

    s.AddFeature("example", 5)
    s.AddFeature(12345, 10)

Generate a 64-bit hexadecimal Simhash token based on the added features:

    token := s.GenerateToken()

Calculate the similarity between two Simhash tokens as a percentage derived from their normalized Hamming distance:

    similarity := simhash.ComputeSimilarity(token1, token2)

The AddFeature method accepts the following types:
- Strings: e.g., "example"
- Numbers: integer and floating-point values, e.g., 123 or a float64
- Byte slices: e.g., []byte("example")
- Any other type: Converted using JSON serialization
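To illustrate how a function accepting all of the types listed above might normalize them before hashing, here is a minimal sketch. It is not taken from this package: `featureBytes` is a hypothetical helper, and the way numbers are turned into bytes is an assumption. Only the JSON fallback for other types is described in this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// featureBytes is a hypothetical helper (not part of the simhash API) that
// turns a feature of any supported type into bytes ready for hashing.
func featureBytes(feature interface{}) ([]byte, error) {
	switch v := feature.(type) {
	case string:
		return []byte(v), nil
	case []byte:
		return v, nil
	case int:
		// Assumption: integers are hashed via their decimal text form.
		return []byte(strconv.Itoa(v)), nil
	case float64:
		// Assumption: floats use the shortest exact decimal representation.
		return []byte(strconv.FormatFloat(v, 'g', -1, 64)), nil
	default:
		// Fallback for any other type: JSON serialization.
		return json.Marshal(v)
	}
}

func main() {
	features := []interface{}{"example", 123, 1.5, []byte("example"), map[string]int{"a": 1}}
	for _, f := range features {
		b, err := featureBytes(f)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%T -> %q\n", f, b)
	}
}
```

One consequence of this particular scheme is that the string "123" and the integer 123 map to the same bytes. An implementation that needs to keep them apart could prepend a type tag before hashing.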
Pull requests are welcome! For any changes, please open an issue first to discuss the proposed modification. Ensure tests are updated accordingly.