semantic-sh is a SimHash implementation that detects and groups similar texts by leveraging word vectors and transformer-based language models such as BERT.

Requirements:
- fasttext
- transformers
- pytorch
- numpy
- flask
Install from PyPI:

```
$ pip install semantic-sh
```

Import and initialize with the model of your choice:

```python
from semantic_sh import SemanticSimHash

# BERT-based models
sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

# fasttext
sh = SemanticSimHash(model_type='fasttext', dim=300, model_path='/path/to/cc.en.300.bin')

# GloVe
sh = SemanticSimHash(model_type='glove', dim=300, model_path='/path/to/glove.6B.50d.txt')

# word2vec
sh = SemanticSimHash(model_type='word2vec', dim=300, model_path='/path/to/en.w2v.txt')
```

Customize the threshold (default: 0) and hash length (default: 256-bit), and add a stop words list:
```python
sh = SemanticSimHash(model_type='fasttext', key_size=128, dim=300,
                     model_path='path_to_fasttext_vectors.bin', thresh=0.8,
                     stop_words=['the', 'i', 'you', 'he', 'she', 'it', 'we', 'they'])
```

Note: BERT-based models do not require a stop words list.
Hash your texts:

```python
sh.get_hash(['<your_text_0>', '<your_text_1>'])
```
Add your documents to the proper groups:

```python
sh.add_document(['<your_text_0>', '<your_text_1>'])
```
Get all documents in the same group as the given text:

```python
sh.find_similar('<your_text>')
```

Get the Hamming distance between two texts:

```python
sh.get_distance('<first_text>', '<second_text>')
```
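The distance is a Hamming distance over the hash bits: the number of bit positions where the two hashes differ. A minimal illustration on plain integers (illustrative values, not the library's internal code):

```python
# Hamming distance between two hashes = popcount of their XOR.
h1 = 0b101101
h2 = 0b100111
distance = bin(h1 ^ h2).count('1')  # 0b001010 -> 2 differing bits
```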
Get all similar document groups which have more than one document:

```python
for docs in sh.get_similar_groups():
    print(docs)
```
Save added documents, hash function, model and parameters:

```python
sh.save('model.dat')
```
Load all parameters, documents, hash function and model from a saved file:

```python
sh = SemanticSimHash.load('model.dat')
```
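Putting the calls above together, a hypothetical end-to-end session might look like this (the model choice and example texts are illustrative):

```python
from semantic_sh import SemanticSimHash

# Assumes the transformers/pytorch requirements are installed.
sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

sh.add_document(['The cat sat on the mat.',
                 'A cat was sitting on the mat.',
                 'Stock markets fell sharply today.'])

# Documents hashed into the same bucket are returned as similar.
print(sh.find_similar('The cat sat on the mat.'))
print(sh.get_distance('The cat sat on the mat.', 'A cat was sitting on the mat.'))

sh.save('model.dat')  # persist documents, hash function, model and parameters
```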
Easily deploy a simple text similarity engine on the web:

```
$ git clone https://github.com/KeremZaman/semantic-sh.git
```

```
server.py [-h] [--host HOST] [--port PORT] [--model-type MODEL_TYPE]
          [--model-path MODEL_PATH] [--key-size KEY_SIZE] [--dim DIM]
          [--stop-words [STOP_WORDS [STOP_WORDS ...]]]
          [--load-from LOAD_FROM]

optional arguments:
  -h, --help            show this help message and exit

app:
  --host HOST
  --port PORT

model:
  --model-type MODEL_TYPE
                        Type of model to run: fasttext or any pretrained model
                        name from huggingface/transformers
  --model-path MODEL_PATH
                        Path to vector files of fasttext models
  --key-size KEY_SIZE   Hash length in bits
  --dim DIM             Dimension of text representations according to chosen
                        model type
  --stop-words [STOP_WORDS [STOP_WORDS ...]]
                        List of stop words to exclude

loader:
  --load-from LOAD_FROM
                        Load previously saved state
```

You can also serve it with a WSGI server:

```python
from gevent.pywsgi import WSGIServer
from server import init_app

app = init_app(params)  # same params as initializing a SemanticSimHash object

http_server = WSGIServer(('', 5000), app)
http_server.serve_forever()
```

NOTE: The sample code uses gevent, but you can use any WSGI container that works with a Flask app object instead.
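For example, the server might be launched like this (the flag values are illustrative, not defaults):

```
$ python server.py --host 0.0.0.0 --port 5000 --model-type bert-base-multilingual-cased --dim 768
```

The server then exposes the following endpoints.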
POST /api/hash
Return hashes of given documents
Request Body
{ "documents": [ "Here is the first document", "and second document" ] } Response Body
{ "hashes": [ "0x7f636944d8c8", "0x5d134944428a4" ] } POST /api/add
Add given documents and return the hash and custom ID of each document
Request Body
{ "documents": [ "Here is the first document", "and second document" ] } Response Body
{ "documents": [ { "id": 1, "hash": 0x5d134944428a4" }, { "id": 2, "hash": 0x7f636944d8c8" } ] } POST /api/find-similar
Return documents similar to the given text
Request Body
{ "text": "Here is the text" } Response Body
{ "similar_texts": [ "Here is the text", "First text here", "Here is text" ] } POST /api/distance
Return Hamming distance between source and target texts
Request Body
{ "src": "Here is the source text", "tgt": "Target text for measuring distance" } Response Body
{ "distance": 21 } GET /api/similarity-groups
Return buckets having more than one document ID
GET /api/text/<int:id>
Return the document according to its ID
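As a quick sanity check, the endpoints above can be exercised with any HTTP client. A minimal sketch with the `requests` library, assuming the server runs locally on port 5000:

```python
import requests

BASE = 'http://localhost:5000'  # adjust host/port to your deployment

# Hash a batch of documents
r = requests.post(f'{BASE}/api/hash',
                  json={'documents': ['Here is the first document',
                                      'and second document']})
print(r.json()['hashes'])

# Add documents and keep their assigned IDs
r = requests.post(f'{BASE}/api/add',
                  json={'documents': ['Here is the first document']})
print(r.json()['documents'])

# Find texts grouped with a query text
r = requests.post(f'{BASE}/api/find-similar', json={'text': 'Here is the text'})
print(r.json()['similar_texts'])

# Hamming distance between two texts
r = requests.post(f'{BASE}/api/distance',
                  json={'src': 'Here is the source text',
                        'tgt': 'Target text for measuring distance'})
print(r.json()['distance'])
```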
Run the API server on port 4000 with Docker:
```
docker run -ti -p 4000:4000 -v `pwd`/data:/opt/data semantic-sh:latest --port=4000 --model-type=bert-base-multilingual-cased --model-path=/opt/data
```

Or run the API server on port 4000 with docker-compose:
```
docker-compose up -d semantic-sh
```

This is a simplified implementation of SimHash: it creates random vectors and assigns 1 or 0 to each bit of the hash according to the result of the dot product of each of these vectors with the representation of the text.
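The idea can be sketched in a few lines of NumPy. This is a minimal illustration of the random-hyperplane technique, not the library's actual code; `embed` is a hypothetical stand-in for whatever model produces the text representation:

```python
import numpy as np

def simhash(text, embed, key_size=256, dim=768, thresh=0.0, seed=0):
    """Minimal random-hyperplane SimHash sketch (illustrative only)."""
    rng = np.random.default_rng(seed)                 # fixed seed: same hyperplanes every call
    hyperplanes = rng.standard_normal((key_size, dim))  # one random vector per hash bit
    v = embed(text)                                   # text representation, shape (dim,)
    bits = (hyperplanes @ v) > thresh                 # 1 where the dot product exceeds the threshold
    # Pack the bit vector into a single integer hash
    return int(''.join('1' if b else '0' for b in bits), 2)
```

Texts with similar representations fall on the same side of most hyperplanes, so their hashes agree in most bit positions and their Hamming distance is small.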
MIT