Removing non-printable characters using POSIX sed

Question

Files created with roff and other "old-school" tools (for example man pages on many Unix systems) produce bold and underlined text on minimalistic terminals using tricks involving non-printable ASCII characters such as the backspace ^H, for example:

b^Hbo^Hol^Hld^Hd and _^Hu_^Hn_^Hd_^He_^Hr_^Hl_^Hi_^Hn_^He_^Hd

If I wish to convert this into the human-readable plain text "bold and underlined" (ignoring the formatting), I can easily achieve this in vim using something like :%s:\(.\)\b\1:\1:ge | %s:_\b\(.\):\1:ge.

I can also pipe the text through tr -dc and use some of perl's regex magic to look for words that are built entirely of pairs of repeated characters.

However, this looks like something that plain sed should be able to handle, which would make it much cleaner to use in scripts.

Question: Is it possible to do this translation only using POSIX sed, i.e. without using GNU or BSD extensions?

What's giving me trouble here is only the non-printable character ^H (ASCII #8). There's a trick mentioned in Bruce Barnett's Sed - An Introduction, but somehow I was unable to get it to work.

The col command has historically been used to clean up nroff output. Try col -b to remove overstruck characters.
– Mark Plotnick, Commented Jul 17, 2014 at 6:11

Michael Homer · Accepted Answer · 2014-07-17 04:13:22Z

Can you do this only using POSIX sed? Yes:

sed -e 's/.^H//g' < data

where ^H is just a literal backspace character. POSIX sed uses POSIX basic regular expressions, which are defined over bytes - printing characters or not, they don't care, so this behaves the same as if ^H were a letter. There are no extensions involved here. Note that all you really want to do is remove the characters that were backspaced over, so the capturing groups in your example aren't really necessary.

You can type the backspace character in most cases with Ctrl+V Ctrl+H.

I think the latent question you have is "how do I do that in a shell script?", where a literal backspace character can be unpleasant to work with (although vim will quite happily accept that same Ctrl+V Ctrl+H to write one in). This is where the introduction you linked uses tr.

POSIX tr supports various escape characters, including the symbolic \b escape for a backspace character. You can save a backspace character into a variable and substitute that variable into the sed expression above:

BACKSPACE=$(echo x | tr 'x' '\b')
sed -e "s/.$BACKSPACE//g" < data

We just tell tr to replace an x with the backspace character, and give it a single x as input. This works fine on every system I have access to, including Solaris. However, printf is also a POSIX-defined tool, and it supports the same escapes:

BACKSPACE=$(printf '\b')
sed -e "s/.$BACKSPACE//g" < data

This is simpler and more direct than the tr version. Note the double quoting around the sed expression, so that we're not suppressing variable interpolation any more. You could also use command substitution inline to put the printf '\b' in directly if you're only going to use it once, rather than using a variable.
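The inline form mentioned above might look like the following. This is only a sketch: the sample and result variable names are made up for illustration, and the input is generated with printf rather than read from the data file.

```shell
# Build the "bold" sample from the question: each letter is printed,
# backspaced over, and printed again.
sample=$(printf 'b\bbo\bol\bld\bd')

# The command substitution is expanded inside the double quotes, so sed
# receives a literal backspace byte in its script. No variable is needed.
result=$(printf '%s\n' "$sample" | sed -e "s/.$(printf '\b')//g")

printf '%s\n' "$result"   # prints: bold
```

The double quotes matter here for the same reason as with the variable: inside single quotes the $(...) would be passed to sed verbatim.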

We can check that this works with hexdump (or hd):

$ dash
$ hexdump -C data
00000000  62 08 62 6f 08 6f 6c 08  6c 64 08 64 0a           |b.bo.ol.ld.d.|
$ BACKSPACE=$(printf '\b')
$ sed -e "s/.$BACKSPACE//g" < data | hexdump -C
00000000  62 6f 6c 64 0a                                    |bold.|

As desired, the backspace character and the erased preceding character are removed from the output (0a is the terminating newline).
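The same substitution can be tried on both samples from the question. The sketch below wraps it in a small function. The name strip_overstrike is hypothetical, chosen only for this example, and the inputs are generated with printf instead of coming from a file.

```shell
# strip_overstrike is a hypothetical helper name used only in this sketch.
# It reads stdin and deletes every "any character + backspace" pair.
strip_overstrike() {
    bs=$(printf '\b')          # a single literal backspace byte
    sed -e "s/.$bs//g"
}

# Bold: each letter is struck twice. Underline: "_", backspace, letter.
bold=$(printf 'b\bbo\bol\bld\bd\n' | strip_overstrike)
under=$(printf '_\bu_\bn_\bd_\be_\br_\bl_\bi_\bn_\be_\bd\n' | strip_overstrike)

printf '%s and %s\n' "$bold" "$under"   # prints: bold and underlined
```

The substitution keeps whichever character follows the backspace, so input that writes the letter first and the underscore second would come out as underscores.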

Thanks a lot for this very thorough and extremely helpful answer -- and for pointing out that I was being stupid^W^W overcomplicating things. All your proposed solutions work like a charm.
– Simon G., Commented Jul 17, 2014 at 5:18
