Extracting data from _source document in ElasticSearch results with jq

Question

I am reading this StackOverFlow discussion converting JSON into CSV and it seems great, but I cant get basic jq to work.. I am not sure what I am doing wrong. I have tried the basic thing and I cant crack whats wrong. Here is my ES query in a Shell Script

curl -XGET 'http://es-1:9200/data_latest/customer/_search?pretty' -H 'Content-Type: application/json' -d' { "_source": ["customer_app_version", "customer_num_apps", "customer_name","app_disk_size_bytes","app_memory_capacity_bytes"], "query": { "bool": { "must": [{ "term": { "is_app_customer": { "value": "true" } } }] } }, "aggs": { "Customer_UUID": { "terms": { "field": "customer_uuid", "size": 100 } } } }

' Shell Script Output

{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 6171, "max_score": 1.8510876, "hits": [ { "_index": "data_latest_v1", "_type": "customer", "_id": "0003245-4844-9015-1z2e-d4ae5234rd56", "_score": 1.8510876, "_source": { "customer_app_version": "el7.20150513", "customer_num_apps": 3, "app_memory_capacity_bytes": 405248409600, "customer_name": "Timbuktu Inc", "app_disk_size_bytes": 25117047875604 } }, { "_index": "data_latest_v1", "_type": "customer", "_id": "0003245-4844-9015-1z2e-d4ae5234rd56", "_score": 1.8510876, "_source": { "customer_app_version": "el4.20150513", "customer_num_apps": 34, "app_memory_capacity_bytes": 58923439600, "customer_name": "Bunnies Inc", "app_disk_size_bytes": 36517984275604 } } ] } }

(truncated, but the subset above is syntactically valid)

How do I use jq within the shell script to output the Keys and values in the _source field (nothing else) as a CSV? I know I am asking something that is described in the other discussion, but I tried and could not get it

For example, I added after the ' (end of the above script) I added | jq -r '."customer_name"'

and also tried

| jq -r '.customer_name'

For both I get output like this.

 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 103 13566 100 13566 0 346 507k 13248 --:--:-- --:--:-- --:--:-- 537k null

What am i doing wrong? What do I need to do? would be super helpful if someone can guide me here.

You might want to suppress curl's status output so it doesn't complicate things here. — Charles Duffy
– Charles Duffy, Commented Mar 16, 2017 at 22:44
Also, the output you gave isn't actually well-formed JSON -- it's missing some close elements. Please make sure the data you provide is correct enough to allow folks to test their answers. — Charles Duffy
– Charles Duffy, Commented Mar 16, 2017 at 22:45
Thanks! I will make sure that it is valid Json.. I did call out that the output was being truncated. I am still learning how to suppress some of the output from ES (ex: took, shards etc.. ) — Sam G
– Sam G, Commented Mar 16, 2017 at 23:15

Charles Duffy · Accepted Answer · 2017-03-16 23:40:41Z

To describe in your jq query how to navigate in the document to the data you want to extract might look like the following:

jq -r '.hits.hits[]._source.customer_name'

In this case, the output is:

Timbuktu Inc Bunnies Inc

To generate a key/value CSV, one might use:

jq -r '.hits.hits[]._source | to_entries | .[] | [.key, .value] | @csv'

...with output:

"customer_app_version","el7.20150513" "customer_num_apps",3 "app_memory_capacity_bytes",405248409600 "customer_name","Timbuktu Inc" "app_disk_size_bytes",25117047875604 "customer_app_version","el4.20150513" "customer_num_apps",34 "app_memory_capacity_bytes",58923439600 "customer_name","Bunnies Inc" "app_disk_size_bytes",36517984275604

If you want customer name to be a column of its own, this might instead be:

jq -r '.hits.hits[]._source | .customer_name as $name | del(.customer_name) | to_entries | .[] | [$name, .key, .value] | @csv'

...with output:

"Timbuktu Inc","customer_app_version","el7.20150513" "Timbuktu Inc","customer_num_apps",3 "Timbuktu Inc","app_memory_capacity_bytes",405248409600 "Timbuktu Inc","app_disk_size_bytes",25117047875604 "Bunnies Inc","customer_app_version","el4.20150513" "Bunnies Inc","customer_num_apps",34 "Bunnies Inc","app_memory_capacity_bytes",58923439600 "Bunnies Inc","app_disk_size_bytes",36517984275604

If you're willing to hardcode the column names, consider instead:

jq -r '.hits.hits[]._source | [.customer_name, .customer_app_version, .customer_num_apps, .app_memory_capacity_bytes, .app_disk_size_bytes] | @csv'

with output:

"Timbuktu Inc","el7.20150513",3,405248409600,25117047875604 "Bunnies Inc","el4.20150513",34,58923439600,36517984275604

Thanks for pointing out that I need to traverse the JSON structure. I will go read the other thread on how to generate the CSV I am looking for.
Charles! You are a rock star! super helpful and so nice of you! I was just reading the other thread and started playing with jq and just figure out how to just list the _source section and was about to read how to get to CSV. You already have described the answer. I will go read it now. Thanks a ton! I will update once I am done.
Super helpful! One minor thing is I had to add double quotes to each attribute to retrieve the values. I haven't figured out yet how to get the column names as the first lane. For now, I write the column names into the file before I run the ES script.
Everything I gave here worked for me exactly as given in the answer, so if you had to make changes, I'm curious about the details of exactly what you were running instead and exactly how it failed. If you wouldn't mind uploading a gist with a reproducer that shows the failure, I'd be very interested to see it.
The only change I had to do was add double quotes for this part [."customer_name", ."customer_app_version", ."customer_num_apps", ."app_memory_capacity_bytes", ."app_disk_size_bytes"]

Collectives™ on Stack Overflow

Extracting data from _source document in ElasticSearch results with jq

1 Answer 1

6 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Linked

Related