Re-order lines and merge others based on a specific criteria

Question

A weak point in my CLI-fu is awk. I could probably solve the following with elaborate scripting, but I'm pretty sure awk is the best tool for the job, and for the life of me I can't figure out the right approach.

Let's say I have a data file like this (Ledger):

2019/05/31 (MMEX948) Gürmar
    Assets:Cash:Marina                     ₺-28,14
    Expenses:Food:Groceries:Meat            ₺28,14
    Assets:Cash:Marina                     ₺-28,14
    Expenses:Food:Groceries:Meat            ₺28,14
    Assets:Cash:Marina                      ₺-3,45
    Expenses:Food:Groceries:Basic            ₺3,45
    Assets:Cash:Marina                     ₺-15,00
    Expenses:Food:Groceries:Produce         ₺15,00

2019/06/01 (MMEX932) A101
    Assets:Cash:Caleb                       $-3.00
    Assets:Cash:Marina                      $-2.50
    Expenses:Food:Groceries:Basic            $5.50

2019/06/01 (MMEX931) Şemikler Pazar Yeri
    Assets:Cash:Marina                     ₺-24,00
    Expenses:Food:Groceries:Basic           ₺24,00
    Assets:Cash:Marina                     ₺-31,00
    Expenses:Food:Groceries:Meat            ₺31,00
    Assets:Cash:Marina                     ₺-65,00
    Expenses:Food:Groceries:Produce         ₺65,00

Each blank line separated paragraph is a transaction, each indented line is a posting, each posting has an account and an amount (separated by at least 2 spaces).

I want two things to happen to this data. I don't care if these happen in the same command or not, it might be easier to do in one pass or two depending on the tool...

All the postings with negative amounts should be arranged after the postings with positive amounts.
Any postings with negative amounts and duplicate accounts should be merged. Ideally the amounts would be summed, but that is really complicated because of currency formats and is not necessary because I can regenerate the amount lines. Removing the amount entirely from merged postings is sufficient so long as no more than one unique account gets merged per pass.

The result should look like this:

2019/05/31 (MMEX948) Gürmar
    Expenses:Food:Groceries:Meat            ₺28,14
    Expenses:Food:Groceries:Meat            ₺28,14
    Expenses:Food:Groceries:Basic            ₺3,45
    Expenses:Food:Groceries:Produce         ₺15,00
    Assets:Cash:Marina

2019/06/01 (MMEX932) A101
    Expenses:Food:Groceries:Basic            $5.50
    Assets:Cash:Marina                      $-2.50
    Assets:Cash:Caleb

2019/06/01 (MMEX931) Şemikler Pazar Yeri
    Expenses:Food:Groceries:Basic           ₺24,00
    Expenses:Food:Groceries:Meat            ₺31,00
    Expenses:Food:Groceries:Produce         ₺65,00
    Assets:Cash:Marina

Notes that make this a little more complicated than just a scan for duplicates:

In the first transaction, there are two different accounts that are duplicated. Only one of them should be merged and cleared (it would be possible to merge both, but only one per pass or I won't be able to fix the amounts).
In the middle transaction there is nothing to merge, but it would be a mistake to blindly clear the amounts from all negative transactions. Since there is no merge it doesn't need to be cleared at all, but could be if that makes it easier to process.

How would I step through this problem in awk? Or if awk isn't the best solution, what is? In most scripting languages (perl, python, zsh) I would parse everything, throw it all into a multi-dimensional array, sort primarily on a regex match of the amount and secondarily alphabetically on the account, then iterate over it for output, always dropping the last amount and merging only the last duplicate (if any).
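For comparison, the parse-everything-then-reorder approach described above might look roughly like this in Python. This is only a sketch under stated assumptions: the function names `parse`, `reorder` and `render` are made up for illustration, a posting counts as negative if its amount contains `-` anywhere, a posting with no amount is treated as non-negative, and the first negative account found to repeat is the one that gets merged and cleared.

```python
import re

def parse(text):
    """Split ledger text into (header, postings); a posting is (account, amount)."""
    transactions = []
    for block in re.split(r"\n\s*\n", text.strip()):
        header, *lines = block.splitlines()
        postings = []
        for line in lines:
            # Account and amount are separated by two or more whitespace characters.
            parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
            postings.append((parts[0], parts[1] if len(parts) > 1 else ""))
        transactions.append((header, postings))
    return transactions

def reorder(postings):
    """Put non-negative postings first and merge at most one duplicated negative account."""
    positive = [p for p in postings if "-" not in p[1]]
    negative = [p for p in postings if "-" in p[1]]
    seen = set()
    merged = None
    for account, _ in negative:
        if account in seen:
            merged = account  # first negative account that repeats (an assumption)
            break
        seen.add(account)
    # Keep every negative posting except those of the merged account.
    result = positive + [p for p in negative if p[0] != merged]
    if merged is not None:
        # One amount-less posting stands in for all merged ones.
        result.append((merged, ""))
    return result

def render(transactions):
    """Format transactions back into blank-line separated ledger text."""
    out = []
    for header, postings in transactions:
        lines = [header]
        for account, amount in reorder(postings):
            lines.append(f"    {account}  {amount}".rstrip())
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"
```

Calling `render(parse(text))` on the contents of the data file would then give the reordered transactions; amounts are never parsed as numbers, so localized formats pass through untouched.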

Note I did work up a way to parse and merge duplicate transactions in Awk the other day:

awk 'NF { if (/^20/) { if (last != $0) print "\n" $0; last = $0 } else { print $0 } }'

But more complicated awk logic is defying me right now.

Caleb · Accepted Answer · 2019-06-27 10:57:48Z

This GNU awk script works for me:

#! /usr/local/bin/awk -f
BEGIN {
    FS = "[[:space:]][[:space:]]+"
}
function dump() {
    for (acct in post) {  # dump unmerged postings of current transaction
        if (post[acct])
            print post[acct];
    }
    if (merged) {  # dump merged posting, if any
        printf "    %s\n", merged
    }
    merged = "";  # clear variables for next round
    delete post;
    txn = "";
}
!NF && txn {  # blank line, end of transaction
    dump();
    print;
    next
}
END {  # end-of-file, print merged postings of last txn
    dump();
}
!txn {  # new transaction
    txn = $0;
    print;
    next
}
{
    acct = $2;
    amt = $3
}
amt ~ /-/ {  # negative amounts, keep for later
    if (acct in post) {  # duplicate entry
        if (!merged || merged == acct) {  # only merge and clear one duplicate account
            post[acct] = "";
            merged = acct;
        }
        else  # tack on to existing record without merging
            post[acct] = post[acct] "\n" $0
    }
    else
        post[acct] = $0
    next
}
1

In action:

~ ./foo.awk foo
2019/05/31 (MMEX948) Gürmar
    Expenses:Food:Groceries:Meat            ₺28,14
    Expenses:Food:Groceries:Meat            ₺28,14
    Expenses:Food:Groceries:Basic            ₺3,45
    Expenses:Food:Groceries:Produce         ₺15,00
    Assets:Cash:Marina

2019/06/01 (MMEX932) A101
    Expenses:Food:Groceries:Basic            $5.50
    Assets:Cash:Marina                      $-2.50
    Assets:Cash:Caleb                       $-3.00

2019/06/01 (MMEX931) Şemikler Pazar Yeri
    Expenses:Food:Groceries:Basic           ₺24,00
    Expenses:Food:Groceries:Meat            ₺31,00
    Expenses:Food:Groceries:Produce         ₺65,00
    Assets:Cash:Marina

I do appreciate the answer and I'll learn a lot about awk from this, but I'd really rather have a solution that did not try to do the math involved, because the number formats are far more complex than this simplistic example. I have different commodities with localized formats, including thousands separators in some that are decimal separators in others, RTL scripts in the units, etc. I'd rather not try to validate a round trip between localized and normalized formats. — Caleb
– Caleb, Commented Jun 27, 2019 at 7:25
@Caleb in that case, how can I determine that the number is negative? Would just checking for - in the number be sufficient? Or do you have negative numbers that use the financial way (50) instead of -50? — muru
– muru, Commented Jun 27, 2019 at 7:29
Good question. And actually yes, matching - anywhere in the amount field is sufficient; that is universal, although its position in the amount changes with localization. — Caleb
– Caleb, Commented Jun 27, 2019 at 7:31
Thanks, and thanks for the code comments! It's actually a lot more convoluted in awk than I expected when I asked this and I'm not sure if I wouldn't have been better off just writing a parsing script, but it does what I wanted and I learned a few things along the way. — Caleb
– Caleb, Commented Jun 27, 2019 at 7:56
@Caleb usually awk ignores leading spaces, but not when you set a custom FS. So $1 and $2 would now actually be $2 and $3 (the first field being empty). — muru
– muru, Commented Jun 27, 2019 at 10:54
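The two points from this comment thread (a `-` anywhere in the amount marks it as negative, and a custom separator leaves an empty first field on indented lines) can be illustrated outside awk as well. The following Python sketch assumes the same two-or-more-whitespace separator as the script's `FS`; the helper names `split_posting` and `is_negative` are hypothetical.

```python
import re

# Same idea as FS = "[[:space:]][[:space:]]+" in the awk script.
FIELD_SEP = re.compile(r"\s\s+")

def split_posting(line):
    """Split a posting line on runs of two or more whitespace characters."""
    return FIELD_SEP.split(line)

def is_negative(amount):
    """Treat any amount containing '-' as negative, wherever the sign sits."""
    return "-" in amount

fields = split_posting("    Assets:Cash:Marina  ₺-28,14")
# fields[0] is the empty string produced by the leading indentation, so the
# account and amount sit at indexes 1 and 2 (awk's $2 and $3).
account, amount = fields[1], fields[2]
```

Because the sign is only searched for, never parsed, the check does not depend on the currency symbol or on which characters serve as decimal and thousands separators.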

Ed Morton · Accepted Answer · 2019-06-26 15:08:31Z

With GNU awk for gensub(), arrays of arrays and sorted_in:

$ cat tst.awk
BEGIN {
    RS=""; FS="\n"; localeDecPt="."; PROCINFO["sorted_in"]="@val_num_desc"
}
{
    delete sum
    print $1
    denom = gensub(/.*([^0-9.,-]).+$/,"\\1",1,$2)
    for (i=2; i<=NF; i++) {
        account = gensub(/[[:space:]]+[^[:space:]]+$/,"",1,$i)
        amount  = gensub(/.*[^0-9.,-](.+)$/,"\\1",1,$i)
        inputDecPt = gensub(/[0-9-]+/,"","g",amount)
        sum[account] += gensub("["inputDecPt"]",localeDecPt,"g",amount)
    }
    for (account in sum) {
        amount = denom gensub("["localeDecPt"]",inputDecPt,"g",sprintf("%0.2f",sum[account]))
        printf "%-*s%*s\n", 40, account, 10, amount
    }
    print ""
}

.

$ awk -f tst.awk file
2019/05/31 (MMEX948) Gürmar
    Expenses:Food:Groceries:Meat            ₺56,28
    Expenses:Food:Groceries:Produce         ₺15,00
    Expenses:Food:Groceries:Basic            ₺3,45
    Assets:Cash:Marina                     ₺-74,73

2019/06/01 (MMEX932) A101
    Expenses:Food:Groceries:Basic            $5.50
    Assets:Cash:Marina                      $-2.50
    Assets:Cash:Caleb                       $-3.00

2019/06/01 (MMEX931) Şemikler Pazar Yeri
    Expenses:Food:Groceries:Produce         ₺65,00
    Expenses:Food:Groceries:Meat            ₺31,00
    Expenses:Food:Groceries:Basic           ₺24,00
    Assets:Cash:Marina                    ₺-120,00

If . isn't the decimal point in your locale then just change localeDecPt="." to whatever it is. If your input amounts contain, say, commas as thousands separators then the code I posted won't work and you should provide input that includes that to test against. I hard-coded the output field widths to 40 and 10 - you can fairly easily calculate the max width of each field and use that instead or use tabs as the OFS and pipe the output to column but it doesn't seem like any of that'd be necessary.
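To see why thousands separators would trip up this approach, here is a rough Python rendering of the separator-inference step. It is a sketch only, assuming the currency symbol has already been stripped from the amount; `infer_separator` and `to_decimal` are illustrative names, not part of the awk script.

```python
import re
from decimal import Decimal

def infer_separator(amount):
    """Mirror gensub(/[0-9-]+/, "", "g", amount): whatever is left after
    dropping digits and minus signs is taken to be the decimal separator."""
    return re.sub(r"[0-9-]+", "", amount)

def to_decimal(amount):
    """Normalize an amount such as '-28,14' to a Decimal.

    Returns None when more than one non-digit character is left over,
    since the decimal separator can then no longer be inferred this way.
    """
    sep = infer_separator(amount)
    if len(sep) > 1:
        return None  # e.g. thousands plus decimal separators: ambiguous
    return Decimal(amount.replace(sep, ".") if sep else amount)
```

With an input like `1.234,56` the leftover string is `.,` rather than a single character, so a blind substitution would no longer produce a valid number; the sketch returns `None` in that case instead of guessing.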

To be honest I don't understand your requirements around what to merge and how to identify duplicates (e.g. why not merge all duplicates in the first transaction and why clear out the amount from one non-duplicate account in the 2nd transaction?) so I just merged the amounts for all duplicates and left the amounts for non-duplicates. If that doesn't work for you then please clarify the requirements in your question.

I guess it has to do with how Ledger does things (From an old overview: lwn.net/Articles/501681 - "The final line lacks a dollar value, so Ledger fills it in to make the transaction balance") — muru
– muru, Commented Jun 26, 2019 at 15:21
As muru suggested, the odd requirement is because I can use hledger ... print -x to easily do the math even across different commodities and number formats to fill in the value of the one missing amount per transaction. I was guessing that letting Ledger take care of this would be less error-prone than trying to handle all the different formats in the data (my current project has 7) and their various units (including RTL scripts!) and separators (space, ', . and , for thousands; . and , for decimal). — Caleb
– Caleb, Commented Jun 27, 2019 at 7:22
