
We have files in which some characters are represented by their decimal (!) ASCII values wrapped in (cid:#), e.g. (cid:104) for h. The string hello is thus represented as (cid:104)(cid:101)(cid:108)(cid:108)(cid:111).

How can I substitute this with the corresponding ascii characters using sed?

Here is an example file:

    $ cat input.txt
    first line
    pre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post
    last line

What I've tried so far is:

    $ x="(cid:104)(cid:101)(cid:108)(cid:108)(cid:111)"
    $ echo $x | sed 's/(cid:\([^\)]*\))/\1/g'
    104101108108111

But we need the output to be hello:

    $ cat output.txt
    first line
    pre hello post
    last line

I'm trying to use printf inside sed, but I cannot figure out how to pass the backreference \1 to printf:

sed 's/(cid:\([^\)]*\))/'`printf "\x$(printf %x \1)"`'/g' 
  • Given your updated question, what is the exact desired output? Note that it is important to provide a minimal reproducible example from the very beginning, since your update invalidates our current answers. Commented Jul 25, 2016 at 9:28
  • You might need to explain why 'using sed' is a requirement. That is much, much more difficult than using a more suitable tool such as awk or perl... Commented Jul 25, 2016 at 9:42

2 Answers

    $ cat input.txt
    first line
    pre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post
    last line
    $ perl -pe 's/\(cid:(\d+)\)/chr($1)/ge' input.txt > output.txt
    $ cat output.txt
    first line
    pre hello post
    last line

Thanks to @123 for suggesting chr($1) instead of sprintf "%c", $1. See chr for documentation.

Reference: Integer ASCII value to character in BASH using printf
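
The printf approach from that reference can be sketched in plain shell. This is only an illustration of the decimal-to-character step, not a full solution for the file: the helper name chr is made up here, and it relies on the shell's printf interpreting \ddd octal escapes in its format string.

```shell
# chr: print the character for a decimal ASCII code (hypothetical helper name)
chr() {
  # The inner printf turns the decimal code into three octal digits (104 -> 150);
  # the outer printf then interprets the resulting \150 escape as "h".
  printf "\\$(printf '%03o' "$1")"
}

# Decode the codes from the question one at a time
for code in 104 101 108 108 111; do
  chr "$code"
done
echo
```

Run as a script, the loop prints hello. The backtick attempt in the question cannot work this way because the shell expands the command substitution before sed ever sees \1, which is why the Perl /e flag is the simpler route.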


8 Comments

In our special case there are also "normal" characters, i.e. not all characters are represented as (cid:#), only some of them. I have edited my original question to show an example file.
You can use chr instead of sprintf, i.e. perl -pe 's/\(cid:(\d+)\)/chr($1)/ge'
@123 thanks :) ... didn't know about that function.. will edit the answer after OP clarifies his requirement
@wolfrevo That isn't going to happen.
@wolfrevo , I don't think that would be possible.. see stackoverflow.com/questions/22544044/…

Using %c you can convert an ASCII code into its corresponding character:

    $ awk 'BEGIN {printf "%c", 104}'
    h

So it is a matter of extracting the numbers from within (cid:XX). I do this by setting FS to ( and looping through the fields:

awk -v FS='(' '{for (i=2; i<=NF; i++) { r=gensub(/cid:([0-9]+)\)/, "\\1", "g", $i); printf "%c", r+0 } }' file 

This uses gensub() to access the captured group, as described in "GNU awk: accessing captured groups in replacement text". It therefore depends on GNU awk.

For your given input it returns:

    $ awk -v FS='(' '{for (i=2; i<=NF; i++) {r=gensub(/cid:([0-9]+)\)/, "\\1", "g", $i); printf "%c", r+0}}' file
    hello
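
Note that the one-liner above prints only the decoded characters, so any text around the (cid:XX) tokens is dropped. For the updated question, where normal text is mixed in, a variant along the following lines might help. It is a sketch rather than a tested drop-in: the function name decode_cid is hypothetical, and it uses only match(), substr() and sprintf(), so it should not need GNU awk.

```shell
# decode_cid: replace every (cid:N) token with the character whose decimal
# code is N, leaving all other text untouched (hypothetical helper name)
decode_cid() {
  awk '{
    out = ""
    # match() sets RSTART and RLENGTH for the leftmost (cid:N) token
    while (match($0, /\(cid:[0-9]+\)/)) {
      # digits sit between the 5-char prefix "(cid:" and the closing ")"
      code = substr($0, RSTART + 5, RLENGTH - 6) + 0
      out = out substr($0, 1, RSTART - 1) sprintf("%c", code)
      # continue scanning the remainder of the line
      $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
  }' "$@"
}

# Example with the input from the question, fed on stdin
printf 'first line\npre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post\nlast line\n' | decode_cid
```

With the sample input this should reproduce the desired output.txt: the first and last lines pass through unchanged and the middle line becomes pre hello post. The function also accepts file names, e.g. decode_cid input.txt > output.txt.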
