
I have a large table of data that I would like to convert to JSON, and I am not sure whether a tool like jq, mlr, or similar could do this without my having to resort to my poor awk skills.

Sample table:

    Balance_sheet for AAPL:

                               2023-09-30     2022-09-30     2021-09-30     2020-09-30
    Treasury Shares Number            0.0            NaN            NaN            NaN
    Ordinary Shares Number  15550061000.0  15943425000.0  16426786000.0  16976763000.0

Preferred output:

    {
        "Balance_sheet for AAPL": {
            "Treasury Shares Number": {
                "2023-09-30": "0.0",
                "2022-09-30": "NaN",
                "2021-09-30": "NaN",
                "2020-09-30": "NaN"
            },
            "Ordinary Shares Number": {
                "2023-09-30": "15550061000.0",
                "2022-09-30": "15943425000.0",
                "2021-09-30": "16426786000.0",
                "2020-09-30": "16976763000.0"
            }
        }
    }

The following format would also work, but is less desirable:

    {
        "Balance_sheet for AAPL": {
            "2023-09-30": {
                "Treasury Shares Number": "0.0",
                "Ordinary Shares Number": "15550061000.0"
            },
            "2022-09-30": {
                "Treasury Shares Number": "NaN",
                "Ordinary Shares Number": "15943425000.0"
            },
            "2021-09-30": {
                "Treasury Shares Number": "NaN",
                "Ordinary Shares Number": "16426786000.0"
            },
            "2020-09-30": {
                "Treasury Shares Number": "NaN",
                "Ordinary Shares Number": "16976763000.0"
            }
        }
    }

Does anyone know a sane method of accomplishing this?

  • I would reach for Python or some other higher level language, but perhaps you could make some progress using awk's support for fixed width fields. Commented Dec 19, 2023 at 15:44
  • Does your table use tabs between columns or are they fixed width? Commented Dec 19, 2023 at 16:52
  • @Shawn fixed width, spaces. Commented Dec 19, 2023 at 16:55

3 Answers


I'd use perl:

    $ perl -MJSON::PP -ae '
      if (/^(.*):$/) {$sheet = $1}
      elsif (/^\h+\d/) {$n = (@dates = @F)}
      elsif (/^(.*?)((?:\h+)\H+){$n}$/) {
        $i = -$n;
        $j{$sheet}->{$1} = {map {$_ => $F[$i++]} @dates}
      }
      END {print JSON::PP->new->pretty->encode(\%j)}' your-file
    {
       "Balance_sheet for AAPL" : {
          "Ordinary Shares Number" : {
             "2023-09-30" : "15550061000.0",
             "2020-09-30" : "16976763000.0",
             "2022-09-30" : "15943425000.0",
             "2021-09-30" : "16426786000.0"
          },
          "Treasury Shares Number" : {
             "2020-09-30" : "NaN",
             "2023-09-30" : "0.0",
             "2021-09-30" : "NaN",
             "2022-09-30" : "NaN"
          }
       }
    }

This distinguishes three types of lines in that input, based on regular expressions:

  • those that end in :, which determine the current "sheet" (the keys in the top-level object).
  • lines that start with at least one (+) horizontal whitespace character (\h) followed by a decimal digit (\d), which hold the dates (the keys in the third-level objects); we record them in the @dates array and their number in $n.
  • lines that contain at least $n blank-delimited fields, where the part before the last $n fields makes up the key in the second-level object; we build the third-level object for that key using the @dates as keys and those last $n fields as values.
  • anything else (which in the sample input is just the empty lines) is ignored.
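For readers more at home in Python (as suggested in the comments), the same line classification can be sketched as below. This is not the answer's code, only a rough equivalent under a few assumptions: the function name parse_sheets and the SAMPLE string are made up for illustration, the exact column widths in SAMPLE are a guess, and row labels are rebuilt by joining words with single spaces rather than keeping the original spacing.

```python
import json
import re

# Hypothetical sample; the exact column widths are an assumption.
SAMPLE = """\
Balance_sheet for AAPL:

                           2023-09-30     2022-09-30     2021-09-30     2020-09-30
Treasury Shares Number            0.0            NaN            NaN            NaN
Ordinary Shares Number  15550061000.0  15943425000.0  16426786000.0  16976763000.0
"""


def parse_sheets(text):
    """Classify each line as a sheet title, a date header, or a data row."""
    result = {}
    sheet = None
    dates = []
    for line in text.splitlines():
        title = re.match(r"^(.*):$", line)
        if title:
            # A line ending in ":" names the current sheet.
            sheet = title.group(1)
            result[sheet] = {}
            dates = []
        elif re.match(r"^[ \t]+\d", line):
            # Leading blanks followed by a digit: the date header.
            dates = line.split()
        elif sheet is not None and dates:
            fields = line.split()
            if len(fields) > len(dates):
                # The last len(dates) fields are values; the rest is the label.
                label = " ".join(fields[:-len(dates)])
                values = fields[-len(dates):]
                result[sheet][label] = dict(zip(dates, values))
        # Anything else (for example empty lines) is ignored.
    return result


print(json.dumps(parse_sheets(SAMPLE), indent=4))
```

Because Python dictionaries keep insertion order, the members come out in table order without any extra module. Starting a new title also resets the date list, so several sheets in one file end up under separate top-level keys.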

Note that as the JSON objects are the representations of perl associative arrays, the order of the members within them will be random. You can get a consistent order by setting the canonical flag (JSON::PP->new->pretty->canonical->encode(\%j)) where each object will have members sorted by key.

If it's important that the order of fields in the JSON objects reflects the order in the table, then, as mentioned in perldoc JSON::PP, you can tie those associative arrays to hash implementations that preserve insertion order, for example:

    $ perl -MTie::Hash::Indexed -MJSON::PP -ae '
      BEGIN{tie %j, $m = "Tie::Hash::Indexed"}
      if (/^(.*):$/) {tie my %s, $m; $j{$sheet = $1} = \%s}
      elsif (/^\h+\d/) {$n = (@dates = @F)}
      elsif (/^(.*?)((?:\h+)\H+){$n}$/) {
        tie my %s, $m;
        $i = -$n;
        %s = map {$_ => $F[$i++]} @dates;
        $j{$sheet}->{$1} = \%s
      }
      END {print JSON::PP->new->pretty->encode(\%j)}' your-file
    {
       "Balance_sheet for AAPL" : {
          "Treasury Shares Number" : {
             "2023-09-30" : "0.0",
             "2022-09-30" : "NaN",
             "2021-09-30" : "NaN",
             "2020-09-30" : "NaN"
          },
          "Ordinary Shares Number" : {
             "2023-09-30" : "15550061000.0",
             "2022-09-30" : "15943425000.0",
             "2021-09-30" : "16426786000.0",
             "2020-09-30" : "16976763000.0"
          }
       }
    }

Tie::Hash::Indexed (libtie-hash-indexed-perl Debian package) is one of several such modules that provide ordered hashes.

In case that's important, for a format closer to your expected format with 4-space indentation and spaces after but not before the :s, replace pretty with indent->indent_length(4)->space_after (indent_length(2) for jq-style pretty-printing).
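For comparison only, here is roughly how the same ordering and spacing choices look with Python's standard json module. This is a sketch, not part of the Perl solution; the data dictionary is a cut-down, hypothetical stand-in for the parsed table.

```python
import json

# Hypothetical, shortened stand-in for the parsed table.
data = {
    "Balance_sheet for AAPL": {
        "Treasury Shares Number": {"2023-09-30": "0.0", "2022-09-30": "NaN"},
    }
}

# Insertion order is kept; 4-space indent with a space after ":" only.
ordered = json.dumps(data, indent=4)

# Members sorted by key, comparable to JSON::PP's canonical flag.
canonical = json.dumps(data, indent=4, sort_keys=True)

# 3-space indent with spaces on both sides of ":", like the pretty output above.
perl_like = json.dumps(data, indent=3, separators=(",", " : "))

print(ordered)
print(canonical)
print(perl_like)
```

No ordered-hash module is needed on the Python side, since the built-in dict already preserves insertion order.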


Using any POSIX awk:

    $ cat tst.awk
    BEGIN {
        inStep = 4
        print "{"
    }

    sub(/:$/,"") {
        indent = inStep
        printf "%*s\"%s\": {\n", indent, "", $0
        next
    }

    !numDates && /^[[:space:]]/ {
        numDates = split($0,dates)
        next
    }

    numDates && match($0,"[[:space:]]+([^[:space:]]+[[:space:]]*){"numDates"}$") {
        indent += inStep
        printf "%s%*s\"%s\": {\n", (numItems++ ? ",\n" : ""), indent, "", substr($0,1,RSTART-1)

        indent += inStep
        $0 = substr($0,RSTART,RLENGTH)
        for ( i=1; i<=numDates; i++ ) {
            printf "%*s\"%s\": \"%s\"%s\n", indent, "", dates[i], $i, (i<numDates ? "," : "")
        }

        indent -= inStep
        printf "%*s}", indent, ""

        indent -= inStep
    }

    END {
        printf "\n%*s}\n", indent, ""
        print "}"
    }

    $ awk -f tst.awk file
    {
        "Balance_sheet for AAPL": {
            "Treasury Shares Number": {
                "2023-09-30": "0.0",
                "2022-09-30": "NaN",
                "2021-09-30": "NaN",
                "2020-09-30": "NaN"
            },
            "Ordinary Shares Number": {
                "2023-09-30": "15550061000.0",
                "2022-09-30": "15943425000.0",
                "2021-09-30": "16426786000.0",
                "2020-09-30": "16976763000.0"
            }
        }
    }

If you need to handle multiple "Balance Sheet" blocks then just add this:

        if ( numTables++ ) {
            printf "\n%*s},\n", indent, ""
        }
        numDates = numItems = 0

immediately below the sub() line, e.g. given this input:

    $ cat file2
    Balance_sheet for AAPL:

                               2023-09-30     2022-09-30     2021-09-30     2020-09-30
    Treasury Shares Number            0.0            NaN            NaN            NaN
    Ordinary Shares Number  15550061000.0  15943425000.0  16426786000.0  16976763000.0

    Balance_sheet for foo:

                               2023-09-30     2022-09-30     2021-09-30     2020-09-30
    Treasury Shares Number            0.0            NaN            NaN            NaN
    Ordinary Shares Number  15550061000.0  15943425000.0  16426786000.0  16976763000.0

this script:

    $ cat tst.awk
    BEGIN {
        inStep = 4
        print "{"
    }

    sub(/:$/,"") {
        if ( numTables++ ) {
            printf "\n%*s},\n", indent, ""
        }
        numDates = numItems = 0
        indent = inStep
        printf "%*s\"%s\": {\n", indent, "", $0
        next
    }

    !numDates && /^[[:space:]]/ {
        numDates = split($0,dates)
        next
    }

    numDates && match($0,"[[:space:]]+([^[:space:]]+[[:space:]]*){"numDates"}$") {
        indent += inStep
        printf "%s%*s\"%s\": {\n", (numItems++ ? ",\n" : ""), indent, "", substr($0,1,RSTART-1)

        indent += inStep
        $0 = substr($0,RSTART,RLENGTH)
        for ( i=1; i<=numDates; i++ ) {
            printf "%*s\"%s\": \"%s\"%s\n", indent, "", dates[i], $i, (i<numDates ? "," : "")
        }

        indent -= inStep
        printf "%*s}", indent, ""

        indent -= inStep
    }

    END {
        printf "\n%*s}\n", indent, ""
        print "}"
    }

will produce this output:

    $ awk -f tst.awk file2
    {
        "Balance_sheet for AAPL": {
            "Treasury Shares Number": {
                "2023-09-30": "0.0",
                "2022-09-30": "NaN",
                "2021-09-30": "NaN",
                "2020-09-30": "NaN"
            },
            "Ordinary Shares Number": {
                "2023-09-30": "15550061000.0",
                "2022-09-30": "15943425000.0",
                "2021-09-30": "16426786000.0",
                "2020-09-30": "16976763000.0"
            }
        },
        "Balance_sheet for foo": {
            "Treasury Shares Number": {
                "2023-09-30": "0.0",
                "2022-09-30": "NaN",
                "2021-09-30": "NaN",
                "2020-09-30": "NaN"
            },
            "Ordinary Shares Number": {
                "2023-09-30": "15550061000.0",
                "2022-09-30": "15943425000.0",
                "2021-09-30": "16426786000.0",
                "2020-09-30": "16976763000.0"
            }
        }
    }
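The script never has to look ahead because it writes each separator before an item rather than after it (the numItems++ ? ",\n" : "" idiom). A minimal Python sketch of that separator-first technique is shown below. The helper name emit_object is hypothetical, and unlike the awk script it escapes keys and values through json.dumps, which matters if a label ever contains a double quote or a backslash.

```python
import json


def emit_object(pairs, indent=4, level=1):
    """Build a flat JSON object, writing the comma before every item but the first."""
    pad = " " * (indent * level)
    out = ["{"]
    count = 0
    for key, value in pairs:
        # Separator first: a plain newline for the first item, ",\n" afterwards.
        out.append((",\n" if count else "\n") + pad)
        # json.dumps takes care of quoting and escaping.
        out.append(json.dumps(key) + ": " + json.dumps(value))
        count += 1
    # Close the object at the parent's indentation level.
    out.append("\n" + " " * (indent * (level - 1)) + "}")
    return "".join(out)


print(emit_object([("2023-09-30", "0.0"), ("2022-09-30", "NaN")]))
```

Since nothing is buffered beyond the current item, the same approach works when the rows arrive one at a time, as they do in awk.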

Using Raku (formerly known as Perl_6)

    ~$ raku -MJSON::Fast -e 'my $a = lines[0..1].trim-trailing;  \
             my @a = slurp.map("Date" ~ *).lines.map:  \
                     *.subst(:global, / <alpha>+ % " " /, { .trans(" " => "_") } ).split(/ \s+ /);  \
             @a = [Z] @a; my %h; for 1..^@a[0].elems -> $j {  \
             %h.append: @a.[0][$j] => %(@a.map( { $_.[0] => $_.[$j] } )[1..*]) };  \
             put to-json( $a => %h, :sorted-keys );'  file

Above is an answer coded in Raku, a member of the Perl family of programming languages. Given the limited "Balance Sheet" test file posted by the OP, the sheet pre-processing is correspondingly limited and possibly bespoke. The first few lines of code, which massage the data into a standard format, should therefore be read as giving a flavor of the Raku language; they are not meant to represent a method for handling all balance sheets.

  1. The JSON::Fast Raku module is loaded on the command line.
  2. The first statement reads the first two lines into the $a scalar, trimming off trailing whitespace.
  3. The second statement a) slurps the remainder of the file into memory, b) prepends the "Date" string, and c) breaks everything into lines. d) Each line is then mapped in order to e) substitute each match of the <alpha>+ % " " pattern (alphabetic characters separated by a single space character) so that, in the replacement half, the spaces are translated into _ underscores, thus removing whitespace from the row labels. Finally, f) the data is split on the remaining \s+ whitespace, and this (now rectangular) table is stored in the @a array.
  4. The @a table is [Z] "zip"-transformed such that rows-become-columns and vice versa (take care if using this operator on anything other than rectangular data).
  5. A hash %h is declared.
  6. The number of "Shares" data columns is iterated over. a). A %() anonymous hash containing key-value pairs is created, each pair having a Date (row-label) as key for the corresponding "Shares" data-column value (here, using index [1..*] drops the superfluous "labels"-pairing).
  7. Adding another level, b). the appropriate "Shares" (i.e. column) label becomes the key for the %() anonymous (Date/Shares) hash, which becomes the second-level value. The "Date/Shares" columns-converted-to-key/values are appended to the %h hash.
  8. Finally (really level c at this point), the $a "Balance_Sheet" header is added back-in as key to the %h hash as value, and this (now multi-level) final hash-table is converted to-json(), adding the :sorted-keys named argument, and output.
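The transpose-then-index idea in steps 4 to 7 can be sketched in Python, where zip(*rows) plays the role of [Z]. This is an illustration rather than a translation of the Raku code: the rows list is typed in by hand to stand for the already split table, and the names by_label and by_date are made up. Like [Z], zip stops at the shortest row, so the same caution about non-rectangular data applies.

```python
import json

# Hand-written stand-in for the split, rectangular table (labels already underscored).
rows = [
    ["Date", "2023-09-30", "2022-09-30", "2021-09-30", "2020-09-30"],
    ["Treasury_Shares_Number", "0.0", "NaN", "NaN", "NaN"],
    ["Ordinary_Shares_Number", "15550061000.0", "15943425000.0",
     "16426786000.0", "16976763000.0"],
]

# Rows become columns and vice versa.
columns = [list(col) for col in zip(*rows)]

# First transposed row holds the labels; the rest hold one date each.
header, *date_rows = columns

# Label first, then date: the preferred layout from the question.
by_label = {
    label: {row[0]: row[j] for row in date_rows}
    for j, label in enumerate(header)
    if j > 0  # skip the "Date" label itself
}

# Date first, then label: the alternative layout from the question.
by_date = {row[0]: dict(zip(header[1:], row[1:])) for row in date_rows}

print(json.dumps({"Balance_sheet for AAPL": by_label}, indent=2, sort_keys=True))
```

Once the table is transposed, the date-first layout comes almost for free, since each transposed row already pairs a date with its values.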

Sample Input:

    Balance_sheet for AAPL:

                               2023-09-30     2022-09-30     2021-09-30     2020-09-30
    Treasury Shares Number            0.0            NaN            NaN            NaN
    Ordinary Shares Number  15550061000.0  15943425000.0  16426786000.0  16976763000.0

Sample Input Table after [Z] transform (minus "Balance_Sheet" header line):

    (Date Treasury_Shares_Number Ordinary_Shares_Number)
    (2023-09-30 0.0 15550061000.0)
    (2022-09-30 NaN 15943425000.0)
    (2021-09-30 NaN 16426786000.0)
    (2020-09-30 NaN 16976763000.0)

Final JSON Output:

    {
      "Balance_sheet for AAPL:": {
        "Ordinary_Shares_Number": {
          "2020-09-30": "16976763000.0",
          "2021-09-30": "16426786000.0",
          "2022-09-30": "15943425000.0",
          "2023-09-30": "15550061000.0"
        },
        "Treasury_Shares_Number": {
          "2020-09-30": "NaN",
          "2021-09-30": "NaN",
          "2022-09-30": "NaN",
          "2023-09-30": "0.0"
        }
      }
    }

https://raku.land/cpan:TIMOTIMO/JSON::Fast
https://docs.raku.org/language/hashmap
https://docs.raku.org/
https://raku.org
