
I need to index three (or more) levels of parent-child relationships. For example, the levels might be an author, a book, and the characters from that book.

However, when indexing more than two levels, there is a problem with has_child and has_parent queries and filters. If I have 5 shards, I get about one fifth of the results when running a has_parent query on the lowest level (characters) or a has_child query on the second level (books).

My guess is that a book is routed to a shard by its parent id and so resides together with its parent (author), but a character is routed to a shard based on the hash of the book id, which does not necessarily match the shard the book was actually indexed on.

This means that the characters of books by the same author do not necessarily reside in the same shard (which rather cripples the whole parent-child advantage).

Am I doing something wrong? How can I resolve this? I really need complex queries such as "which authors wrote books with female characters".

I made a gist showing the problem at: https://gist.github.com/eranid/5299628

The bottom line is that if I have the mapping:

"author" : {
  "properties" : {
    "name" : { "type" : "string" }
  }
},
"book" : {
  "_parent" : { "type" : "author" },
  "properties" : {
    "title" : { "type" : "string" }
  }
},
"character" : {
  "_parent" : { "type" : "book" },
  "properties" : {
    "name" : { "type" : "string" }
  }
}

and a 5-shard index, has_child and has_parent queries do not return complete results.

The query:

curl -XPOST 'http://localhost:9200/index1/character/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        { "has_parent": { "parent_type": "book", "query": { "match_all": {} } } }
      ]
    }
  }
}'

returns only a fifth (approximately) of the characters.

2 Answers


You are correct: a parent/child relationship can only work when all children of a given parent reside in the same shard as the parent. Elasticsearch achieves this by using the parent id as the routing value. This works well for one level, but it breaks on the second and subsequent levels. In a parent/child/grandchild relationship, parents are routed based on their own id, children are routed based on their parent's id (which works), but grandchildren are routed based on the children's ids and so can end up on the wrong shards. To demonstrate with an example, let's assume that we are indexing 3 documents:

curl -XPUT 'localhost:9200/test-idx/author/Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless' -d '{...}'

Elasticsearch uses the value Douglas-Adams to calculate routing for the document Douglas-Adams; no surprise here. For the document Mostly-Harmless, Elasticsearch sees that it has the parent Douglas-Adams, so it again uses Douglas-Adams to calculate routing and everything is good: the same routing value means the same shard. But for the document Arthur-Dent, Elasticsearch sees that it has the parent Mostly-Harmless, so it uses the value Mostly-Harmless as the routing. That value hashes independently of Douglas-Adams, so Arthur-Dent can end up on a different shard from its book and author.
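The routing behaviour described above can be illustrated with a small simulation. This is only a sketch: it assumes routing boils down to "stable hash of the routing value, modulo the number of shards", and it uses zlib.crc32 as a stand-in hash, which is not the function Elasticsearch actually uses. The helper names (shard_for, default_routing, count_misrouted) are made up for this example.

```python
import zlib
from typing import Iterable, Optional

NUM_SHARDS = 5  # matches the 5-shard index in the question


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Toy routing: stable hash of the routing value modulo the shard count.

    crc32 is an assumption for illustration, not Elasticsearch's real hash.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def default_routing(doc_id: str, parent: Optional[str] = None) -> str:
    """Without an explicit routing value, use the parent id if given, else the doc id."""
    return parent if parent is not None else doc_id


def count_misrouted(author_id: str, book_ids: Iterable[str]) -> int:
    """Count books whose characters would be routed away from the author's shard."""
    target = shard_for(author_id)
    return sum(1 for book_id in book_ids if shard_for(book_id) != target)


author_shard = shard_for(default_routing("Douglas-Adams"))
book_shard = shard_for(default_routing("Mostly-Harmless", parent="Douglas-Adams"))
character_shard = shard_for(default_routing("Arthur-Dent", parent="Mostly-Harmless"))

print("author    shard:", author_shard)
print("book      shard:", book_shard)       # always equals the author's shard
print("character shard:", character_shard)  # hash of "Mostly-Harmless", independent of the author
```

The book always lands with the author because both hash the same value. The character hashes the book id instead, so it matches the author's shard only by chance, which is consistent with roughly one fifth of the results showing up on a 5-shard index.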

The solution for this is to explicitly specify routing value for the grandchildren equal to the id of the grandparent:

curl -XPUT 'localhost:9200/test-idx/author/Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless&routing=Douglas-Adams' -d '{...}'
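For hierarchies deeper than three levels, the same idea generalizes: every descendant is routed by the id of the top-most ancestor. A minimal sketch of how a client could compute that value before indexing is shown below. The parent_of lookup table and the root_routing helper are hypothetical names for this example, and the sketch assumes ids are unique across types.

```python
from typing import Dict, Optional


def root_routing(doc_id: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Walk up the parent chain and return the id of the top-most ancestor.

    That id is the value to pass as the explicit routing for doc_id.
    """
    seen = set()
    current = doc_id
    while parent_of.get(current) is not None:
        if current in seen:
            raise ValueError("cycle in parent chain at %r" % current)
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical lookup table: document id -> parent id (None for top-level documents).
parent_of = {
    "Douglas-Adams": None,
    "Mostly-Harmless": "Douglas-Adams",
    "Arthur-Dent": "Mostly-Harmless",
}

for doc_id in parent_of:
    print(doc_id, "-> routing =", root_routing(doc_id, parent_of))
```

With this, the author, the book, and the character all resolve to the routing value Douglas-Adams, so they share a shard regardless of how many levels sit in between.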

Comments

Use the routing parameter on the URL. See the routing section here: elasticsearch.org/guide/reference/api/index_
Thanks. Can I also specify this in the post data somehow? Specifically for bulk_index, where I want to specify routing for each doc?
Yes, you can add a _routing field to the _bulk item. See the routing section of elasticsearch.org/guide/reference/api/bulk
I was wondering if you could clarify how this problem happens - if children are routed to the same shard as the parent, and grandchildren are routed to the same shard as the child, shouldn't "relatives" all end up on the same shard?
I recently struggled with the same problem. I would like to confirm whether parent/child relations with more than 2 levels are an acceptable way of mapping data. It gets my job done with the least redundancy, but are there any significant trade-offs that I should be aware of (other than the search overhead of keeping everything on the same shard)?
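The bulk suggestion from the comments above could be sketched as follows. This builds the newline-delimited body for a _bulk request with per-item _parent and _routing metadata. The underscore-prefixed metadata names follow the comment and the bulk API of that era, so treat them as an assumption and check them against the version you run. The bulk_body helper and the document field values are placeholders for this example.

```python
import json
from typing import Any, Dict, List


def bulk_body(index: str, docs: List[Dict[str, Any]]) -> str:
    """Build a newline-delimited _bulk body with per-item parent and routing."""
    lines = []
    for doc in docs:
        meta = {"_index": index, "_type": doc["type"], "_id": doc["id"]}
        if doc.get("parent") is not None:
            meta["_parent"] = doc["parent"]
        if doc.get("routing") is not None:
            meta["_routing"] = doc["routing"]  # explicit routing overrides the parent id
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(doc["source"]))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


# Placeholder documents; the source fields follow the mapping in the question.
docs = [
    {"type": "author", "id": "Douglas-Adams", "source": {"name": "Douglas Adams"}},
    {"type": "book", "id": "Mostly-Harmless", "parent": "Douglas-Adams",
     "source": {"title": "Mostly Harmless"}},
    {"type": "character", "id": "Arthur-Dent", "parent": "Mostly-Harmless",
     "routing": "Douglas-Adams", "source": {"name": "Arthur Dent"}},
]

body = bulk_body("test-idx", docs)
print(body)
```

The resulting string can be sent as the request body to the _bulk endpoint. Only the character needs an explicit routing value here, since the book is already routed by its parent id.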

For the grandparent docs, use the document's own _id as the _routing. For the parent docs, use the _parent value (the grandparent's _id) as the _routing. For the child docs, use the grandparent's _id as the _routing as well.
