Is there an alternative to sed that supports unicode?

Question

For example:

sed 's/\u0091//g' file1

Right now, I have to do hexdump to get hex number and put into sed as follows:

$ echo -ne '\u9991' | hexdump -C
00000000  e9 a6 91                                          |...|
00000003

And then:

$ sed 's/\xe9\xa6\x91//g' file1
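The manual hexdump step can be scripted. This is a rough sketch rather than a recipe tested on every system: it assumes a POSIX printf and od, and the variable names ch and escaped are only illustrative. It turns a character's UTF-8 bytes into the \xHH form used above.

```shell
# U+9991 written as its UTF-8 bytes (e9 a6 91) in octal, so no \u support is needed.
ch=$(printf '\351\246\221')

# od -An -tx1 lists the bytes in hex without an offset column; the sed call
# rewrites each hex pair as \xHH and drops the surrounding spaces.
escaped=$(printf '%s' "$ch" | od -An -tx1 | tr -d '\n' |
  sed 's/ *\([0-9a-f][0-9a-f]\)/\\x\1/g; s/ *$//')

printf '%s\n' "$escaped"
```

With GNU sed the result can then be spliced in as sed "s/$escaped//g" file1. The \x escapes are a GNU extension, so other sed implementations may not accept them.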

Flimm · Accepted Answer · 2016-10-17 16:52:07Z

35

Just use that syntax:

sed 's/馑//g' file1

Or in the escaped form:

sed "s/$(echo -ne '\u9991')//g" file1

(Note that older versions of Bash and some shells do not understand echo -e '\u9991', so check first.)
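Where the shell does not understand \u, one possible workaround is to build the character from its UTF-8 bytes with octal escapes, which POSIX printf accepts. A minimal sketch, assuming the input is UTF-8; the variable names ch and result are made up for the example:

```shell
# Build U+9991 from its UTF-8 bytes (e9 a6 91) using octal escapes,
# which POSIX printf understands even where \u is unsupported.
ch=$(printf '\351\246\221')

# Splice the character into the sed expression and delete every occurrence.
result=$(printf '%s\n' "饥馑荐臻" | sed "s/$ch//g")
printf '%s\n' "$result"
```

Because the pattern is a literal byte sequence, this should behave the same whether sed works on characters or on bytes.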

edited Oct 17, 2016 at 16:52

Flimm


answered Apr 17, 2015 at 8:46

chaos


1

Does sed count 馑 as one character or 3? That is, does echo 馑 | sed s/...// print anything?

– Stack Exchange Broke The Law, Commented Apr 17, 2015 at 11:22
@immibis Since sed has the g modifier, it replaces all occurrences, even when they follow each other. Also, sed should count it as one character: echo -ne "馑" | wc -m gives 1. If you count the bytes (wc -c), it returns 3. Did I understand your question correctly?

– chaos, Commented Apr 17, 2015 at 11:28
I meant: does . mean "one character" or "one byte"?

– Stack Exchange Broke The Law, Commented Apr 17, 2015 at 11:30
@immibis It matches one character, hence echo 馑 | sed s/...// gives me 馑 (nothing is replaced).

– chaos, Commented Apr 17, 2015 at 11:33
4

@chaos: It works under en_US.UTF-8, but doesn't under C.

– choroba, Commented Apr 17, 2015 at 12:28
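The locale dependence discussed in these comments can be checked directly. A small sketch, assuming a sed that honours LC_ALL and that a locale named en_US.UTF-8 is installed (if it is not, the second sed call will likely behave like the C locale); the variable names are illustrative:

```shell
ch=$(printf '\351\246\221')   # U+9991 as UTF-8 bytes e9 a6 91

# The byte count does not depend on the locale: always 3 for this character.
bytes=$(printf '%s' "$ch" | wc -c | tr -d '[:space:]')

# In the C locale each byte is its own character, so "..." consumes all three.
c_result=$(printf '%s\n' "$ch" | LC_ALL=C sed 's/...//')

# In a UTF-8 locale the three bytes form one character, so "..." cannot match
# and the input passes through unchanged. Assumes en_US.UTF-8 is installed.
utf8_result=$(printf '%s\n' "$ch" | LC_ALL=en_US.UTF-8 sed 's/...//')

printf 'bytes=%s C=[%s] UTF-8=[%s]\n' "$bytes" "$c_result" "$utf8_result"
```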


choroba · Accepted Answer · 2015-04-17 08:50:59Z

17

Perl can do that:

echo 汉典“馑”字的基本解释 | perl -CS -pe 's/\N{U+9991}/Jin/g'

-CS turns on UTF-8 for standard input, output and error.

answered Apr 17, 2015 at 8:50

choroba


9

Perl can do almost anything.....

– wobbily_col, Commented Apr 17, 2015 at 10:49
@wobbily_col Maybe. But it's written in C. So if Perl can do almost anything, then C, as the foundation of Perl, can do anything. As it should be.

– Pryftan, Commented Jan 16, 2020 at 15:51
@wobbily_col In Raku (aka Perl6): echo 汉典“馑”字的基本解释 | raku -pe 's:g/\x9991/Jin/' #OUTPUT 汉典“Jin”字的基本解释.

– jubilatious1, Commented Sep 29, 2021 at 16:40


The Spooniest · Accepted Answer · 2015-04-17 12:54:20Z

A number of versions of sed support Unicode:

Heirloom sed, which is based on "original Unix material".
GNU sed, which is its own codebase.
Plan 9 sed, which has been ported to Unix-like operating systems.

I couldn't find information on BSD sed, which I thought was strange, but I think the odds are good that it supports Unicode too. Unfortunately, there is no standard way to tell sed which encoding to use, so each one does this in its own way.
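GNU sed, at least, takes its encoding from the locale environment, so a first step is usually to see what the current locale implies. A hedged sketch, assuming the POSIX locale utility may or may not be present; the variable names effective and charmap are illustrative:

```shell
# POSIX precedence for the character type category: LC_ALL, then LC_CTYPE, then LANG.
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}

# Ask the system which character encoding that locale implies.
# Falls back to "unknown" where the locale utility is missing or fails.
charmap=$(locale charmap 2>/dev/null) || charmap=unknown
[ -n "$charmap" ] || charmap=unknown

printf 'locale=%s charmap=%s\n' "$effective" "$charmap"
```

Other sed implementations may decide differently, so this only shows what a locale-aware sed would be told.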

UTF-16 is pretty unusable in Unix-based OSes. It's also an abomination that should have never seen the light of day. — Brian Bi
– Brian Bi, Commented Apr 17, 2015 at 19:11
Whether or not they support UTF-16 depends on the implementation, and I'm afraid I don't have that data. I doubt that Plan 9 sed does (the original OS is UTF-8 everywhere), but I can't be sure, and even if it doesn't, the others might. — The Spooniest
– The Spooniest, Commented Apr 17, 2015 at 19:30

Dave Rove · Accepted Answer · 2019-10-02 06:17:54Z

7

With recent versions of BASH, just omit the quotes around the sed expression and you can use BASH's escaped strings. Spaces within the sed expression or parts of the sed expression that might be interpreted by BASH as wildcards can be individually quoted.

$ echo "饥馑荐臻" | sed s/$'\u9991'//g
饥荐臻

answered Oct 2, 2019 at 6:17

Dave Rove


2

This should be the new accepted answer, simple and clean!

– Allen Wang, Commented Nov 6, 2019 at 22:28
2

@AllenWang For reference, the $'...' type of quotes comes from ksh93 in 1993 while the \uxxxx within them comes from zsh in 2003 (inspired from GNU printf). Added in bash in 4.2 in 2010. So unless you're on macos which still comes with 3.2, that answer would have also been valid in 2015 when that question was asked.

– Stéphane Chazelas, Commented Jun 22, 2021 at 5:54


Aryeh Leib Taurog · Accepted Answer · 2018-04-17 18:21:57Z

This works for me:

$ vim -nEs +'%s/\%u9991//g' +wq file1

It’s a drop more verbose than I’d like; here’s a full explanation:

-n disable vim swap file
-E Ex improved mode
-s silent mode
+'%s/\%u9991//g' execute the substitution command
+wq save and exit

I suppose this modifies file1 in-place, is that correct? — gerrit
– gerrit, Commented Jan 10, 2019 at 10:32

Janis · Accepted Answer · 2015-04-17 10:16:32Z

Works for me with GNU sed (version 4.2.1):

$ echo -ne $'\u9991' | sed 's/\xe9\xa6\x91//g' | hexdump -C
$ echo -ne $'\u9991' | hexdump -C
00000000  e9 a6 91

(As another replacement for sed you could also use GNU awk, but it doesn't seem necessary.)
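For completeness, the awk route might look like the sketch below. It assumes UTF-8 input and uses only gsub, so it is not specific to GNU awk; the variable names ch and awk_result are illustrative.

```shell
ch=$(printf '\351\246\221')   # U+9991 as UTF-8 bytes e9 a6 91

# Pass the character in with -v and remove every occurrence on each line.
awk_result=$(printf '%s\n' "饥馑荐臻" | awk -v ch="$ch" '{ gsub(ch, ""); print }')
printf '%s\n' "$awk_result"
```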

jubilatious1 · Accepted Answer · 2021-06-21 21:54:37Z

Using Raku (formerly known as Perl_6)

~$ echo 汉典“馑”字的基本解释 | raku -pe 's:g/\x9991/Jin/;'
汉典“Jin”字的基本解释
~$ echo "饥馑荐臻" | raku -pe s:g/'\x9991'//;
饥荐臻
~$ raku -e 'print "e", "e\x301", "\x000e9";'
eéé
~$ raku -e 'say "e\x301" eq "\x000e9";'
True
~$ echo "Stephane" | raku -pe 's/e/e\x301/;'
Stéphane
~$ echo "Stephane" | raku -pe 's/e/\x000e9/;'
Stéphane

[Rakudo 2020.10; code tested on GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)]

https://raku.org/
