I am working with an Elasticsearch index that holds a large number of documents (around 500K). I want to store the n-grams of each document's text in another index. The text is also large: each document contains about two pages. To do this, I compute the term vectors and their counts for each document and store them in the new index, so that I can run aggregation queries on it.
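To make the goal concrete, here is a minimal sketch of how a term vectors response could be flattened into one row per (document, n-gram) for the new index. The row field names `ngram` and `count` are hypothetical, and the sample response is hand-written in the shape the API returns (`docs` → `term_vectors` → field → `terms` → `term_freq`); it is not real data from my index.

```python
from typing import Any, Dict, List


def term_counts(response: Dict[str, Any], field: str = "plain_text") -> List[Dict[str, Any]]:
    """Flatten an _mtermvectors response into one row per (document, term)."""
    rows = []
    for doc in response.get("docs", []):
        # Documents that were not found carry no "term_vectors" key; skip them.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            rows.append({
                "article_id": doc["_id"],
                "ngram": term,                 # hypothetical field name
                "count": stats["term_freq"],   # occurrences in this document
            })
    return rows


# Hand-written sample response, for illustration only.
sample = {
    "docs": [
        {
            "_id": "608467",
            "term_vectors": {
                "plain_text": {
                    "terms": {
                        "machin learn": {"term_freq": 3},
                        "learn model": {"term_freq": 1},
                    }
                }
            },
        },
        {"_id": "608469", "found": False},
    ]
}

for row in term_counts(sample):
    print(row)
```

Each row could then be indexed into the new index, for example with the bulk API.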
The settings of the old index allow me to use the termvectors and mtermvectors APIs. I don't want to send too many requests to the Elasticsearch server in a short amount of time, so I am using the mtermvectors API of the Python client. I am trying to get the term vectors of 25 documents at a time by passing their 25 IDs.
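A minimal sketch of how such a batched call could look. It assumes a client object exposing `mtermvectors(index=..., body=...)` in the style of elasticsearch-py 7.x. The helper names `chunked`, `build_mtermvectors_body`, and `fetch_term_vectors` are hypothetical. The body mirrors the query-string parameters of the sample URL, but sends the IDs in the request body instead of the URL.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 25) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` ids."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def build_mtermvectors_body(ids: Sequence[str], field: str = "plain_text") -> Dict[str, Any]:
    """Build an _mtermvectors request body with the same options as the URL."""
    return {
        "ids": list(ids),
        "parameters": {
            "fields": [field],
            "offsets": False,
            "positions": False,
            "payloads": False,
            "term_statistics": True,
            "field_statistics": True,
        },
    }


def fetch_term_vectors(client: Any, index: str, ids: Sequence[str],
                       batch_size: int = 25) -> Iterator[Dict[str, Any]]:
    """Call mtermvectors once per batch and yield each raw response.

    `client` is assumed to behave like an elasticsearch-py 7.x client.
    """
    for batch in chunked(ids, batch_size):
        yield client.mtermvectors(index=index, body=build_mtermvectors_body(batch))
```

Usage would be along the lines of `for response in fetch_term_vectors(es, "indexname", all_ids): ...`, where `es` and `all_ids` are placeholders for the real client and ID list.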
Sample HTTP URL produced by the mtermvectors call in Python:
http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
Sometimes I get the expected response, but most of the time I get the following error:
Proxy Error
The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.
Reason: Error reading from remote server

Index settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token": ""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

I don't think there is a problem with these settings and this mapping, since I sometimes get the expected response.
Please let me know if you need more information from my side. Any help will be appreciated.