Skip to content

ErfanMomeniii/simhash

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

go version license version

simhash

simhash is a lightweight Go package for generating Simhash tokens and calculating their similarity using the Moses Charikar Simhash algorithm. It is ideal for applications like text deduplication, plagiarism detection, and near-duplicate content detection and fingerprinting.

For detailed usage, check this.


Documentation

Install

To get started with simhash, install it using:

go get github.com/erfanmomeniii/simhash

Next, include it in your application:

import "github.com/erfanmomeniii/simhash"

Quick Start

The following example demonstrates how to generate Simhash tokens and calculate similarity:

package main import ( "fmt" "github.com/erfanmomeniii/simhash" ) func main() { // Create a new Simhash instance s := simhash.NewSimhash() // Add features with weights s.AddFeature("example", 2) s.AddFeature("test", 5) // Generate a Simhash token token1 := s.GenerateToken() // Create another Simhash instance with different features s2 := simhash.NewSimhash() s2.AddFeature("example", 2) s2.AddFeature("testcase", 5) // Generate another token token2 := s2.GenerateToken() // Compute similarity between the two tokens similarity := simhash.ComputeSimilarity(token1, token2) fmt.Printf("Token1: %s\nToken2: %s\nSimilarity: %f\n", token1, token2, similarity) }

Output:

Token1: F9E6E6EF197C2B25 Token2: FDA981914657B7D1 Similarity: 43.75 

Features

Add Feature

Add features with their weights to the Simhash generator:

s.AddFeature("example", 5) s.AddFeature(12345, 10)

Generate Token

Generate a 64-bit hexadecimal Simhash token based on the added features:

token := s.GenerateToken()

Compute Similarity

Calculate the similarity between two Simhash tokens as a percentage (normalized Hamming distance):

similarity := simhash.ComputeSimilarity(token1, token2)

Supported Feature Types

The AddFeature method accepts the following types:

  • Strings: e.g., "example"
  • Numbers: e.g., 123, float64, etc.
  • Byte slices: e.g., []byte("example")
  • Any other type: Converted using JSON serialization

Contributing

Pull requests are welcome! For any changes, please open an issue first to discuss the proposed modification. Ensure tests are updated accordingly.

About

A lightweight Go package implementing Charikar's Simhash algorithm for generating hash fingerprints and calculating similarity, ideal for deduplication and content fingerprinting

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages