Revisions to sed on cygwin can only replace one character?

Now that I'm home and can look it up, use actual tag name

edit approved Mar 21, 2017 at 0:55

Most probably, that file is encoded in UTF-16, that is, with 2 or 4 bytes per character, probably even with a byte-order mark (BOM) at the beginning.

The characters shown in your sample (all ASCII characters) are each encoded on 2 bytes, the first or second of which (depending on whether it's a big-endian or little-endian UTF-16 encoding) is 0 and the other is the ASCII/Unicode code. The 0 byte is typically invisible on a terminal, so the text appears OK when dumped there, as the rest is just ASCII, but in effect the text contains:

<[NUL]S[NUL]t[NUL]a[NUL]r[NUL]t[NUL]W[NUL]h[NUL]e[NUL]n[NUL]...

You'd need to convert that text to your locale's charset for sed to be able to deal with it. Note that UTF-16 cannot be used as a locale's character encoding on Unix, so you won't find a locale that uses it.

iconv -f utf-16 < Task_01.xml | sed 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' | iconv -t utf-16 > Task_01.xml.out

That assumes the input has a BOM. If not, you need to determine if it's big endian or little endian (probably little endian) and change that utf-16 to utf-16le or utf-16be.
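To check whether a BOM is present before choosing between utf-16, utf-16le and utf-16be, the first two bytes of the file can be inspected. This is only a sketch: `utf16_bom` is a hypothetical helper name, and it looks at nothing beyond those two bytes, so a result of `none` does not prove the file is not UTF-16.

```shell
# Hypothetical helper (not part of any tool): report which UTF-16 BOM,
# if any, a file starts with. Prints "le", "be" or "none".
utf16_bom() {
  # od prints the first two bytes as hex; tr removes the padding around them.
  case $(od -An -tx1 -N2 < "$1" | tr -d '[:space:]') in
    fffe) echo le ;;    # bytes FF FE: little-endian BOM
    feff) echo be ;;    # bytes FE FF: big-endian BOM
    *)    echo none ;;  # no BOM: the byte order has to be determined another way
  esac
}
```

With `le` or `be`, plain `iconv -f utf-16` should pick up the byte order from the BOM; with `none`, pass utf-16le or utf-16be explicitly.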

If the locale's charset is UTF-8, there shouldn't be anything lost in translation even if the text contains non-ASCII characters.

As Cygwin's sed is typically GNU sed, it can also deal with that type of binary input (binary because it contains NUL bytes) by itself, so you can also do something like:

LC_ALL=C sed -i 's/t\x00r\x00u\x00e/f\x00a\x00l\x00s\x00e/g' Task_01.xml
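The byte-level substitution can be tried on a throwaway sample before touching real files. The sketch below assumes GNU sed (the `\x00` escapes and suffix-less `-i` are GNU extensions) and an iconv that knows `utf-16le`; the sample content and the variable names `sample` and `result` are made up for illustration.

```shell
# Build a small little-endian UTF-16 sample (utf-16le writes no BOM).
sample=$(mktemp)
printf '<StartWhenAvailable>true</StartWhenAvailable>\n' |
  iconv -f utf-8 -t utf-16le > "$sample"

# Same byte-level substitution as above: each ASCII letter is followed by a
# NUL byte, so the pattern spells "true" with NULs between the letters.
LC_ALL=C sed -i 's/t\x00r\x00u\x00e/f\x00a\x00l\x00s\x00e/g' "$sample"

# Decode the result back to UTF-8 so it can be read on a terminal.
result=$(iconv -f utf-16le -t utf-8 < "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The pattern does not include the NUL that follows the final `e` (or precedes the initial `t` in big-endian data), which is why the same expression fits either byte order.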

The file command should be able to tell you if the input is indeed UTF-16. You can use sed -n l or od -tc to see those hidden NUL characters. Example of little-endian UTF-16 text with BOM:

$ echo true | iconv -t utf-16 | od -tc
0000000 377 376   t  \0   r  \0   u  \0   e  \0  \n  \0
0000014
$ echo true | iconv -t utf-16 | sed -n l
\377\376t\000r\000u\000e\000$
\000$
$ echo true | iconv -t utf-16 | file -
/dev/stdin: Little-endian UTF-16 Unicode text, with no line terminators

To process several files with zsh/bash/ksh93:

set -o pipefail
for file in ./*.xml; do
  cp -ai "$file" "$file.bak" &&
    iconv -f utf-16 < "$file.bak" |
      sed 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' |
      iconv -t utf-16 > "$file" &&
    rm -f "$file.bak"
done
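Before rewriting a whole directory, it may help to see which files would actually change. A minimal sketch, assuming the files carry a BOM so that `iconv -f utf-16` can decode them; `needs_change` is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if the UTF-16 file named by $1 still contains
# the tag set to true, i.e. the substitution would modify it.
needs_change() {
  iconv -f utf-16 -t utf-8 < "$1" |
    grep -q '<StartWhenAvailable>true</StartWhenAvailable>'
}
```

It could be used as a dry run, for instance `for file in ./*.xml; do needs_change "$file" && printf '%s\n' "$file"; done`, which only lists file names and modifies nothing.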

added 343 characters in body

edited Mar 20, 2017 at 20:34

Stéphane Chazelas

added 589 characters in body

edited Mar 20, 2017 at 17:22

Stéphane Chazelas

answered Mar 20, 2017 at 17:16

Stéphane Chazelas
