2

I'm trying to replace an XML element in 20+ files on Windows using sed and cygwin. The line is:

cd "D:\Backups\Tasks"
sed -i 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' "Task_01.xml"

This replaces nothing. However, if I try:

sed 's~<~[~g' "Task_01.xml" 

It outputs:

[AllowHardTerminate>true[/AllowHardTerminate>
[StartWhenAvailable>true[/StartWhenAvailable>
[RunOnlyIfNetworkAvailable>false[/RunOnlyIfNetworkAvailable>

However, if I try to add just a single character, it just outputs the document as-is:

sed 's~<B~[B~g' "Task_01.xml" 

The above does nothing. What am I doing wrong? Is the chevron a special character or am I misusing sed? Or is it a fault in cygwin?

  • Can't reproduce, even when running from powershell.exe. Might the file contain some hidden characters? Try sed -n l < MyFile.xml to reveal them. Commented Mar 20, 2017 at 16:53
  • I can not reproduce this in Cygwin. Commented Mar 20, 2017 at 16:54
  • I'd bet it's a UTF-16 file. Try iconv -f utf-16 < file.xml | sed... Commented Mar 20, 2017 at 16:55
  • Have you thought of using a different string delimiter character? Like maybe semi-colon (;)? Commented Mar 21, 2017 at 0:51
  • I use ~ since it's just about the only character that I can guarantee isn't in these XML files. Since they're Windows tasks, they store commands which can be ; delimited, so I wouldn't be able to do a find/replace on those. Commented Mar 21, 2017 at 1:05

2 Answers

10

Most probably, that file is encoded in UTF-16, that is, with 2 or 4 bytes per character, probably with a byte-order mark (BOM) at the beginning.

The characters shown in your sample (all ASCII) are encoded on 2 bytes each in UTF-16: one byte is 0 and the other is the ASCII/Unicode code, with the order depending on whether the encoding is big-endian or little-endian. The 0 byte is typically invisible on a terminal, so the text looks fine when dumped there because the rest is just ASCII, but in effect the text contains:

<[NUL]S[NUL]t[NUL]a[NUL]r[NUL]t[NUL]W[NUL]h[NUL]e[NUL]n[NUL]... 
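
To see that effect in isolation, here is a minimal sketch that builds a small UTF-16 sample with iconv and counts the lines matching <S before and after conversion. The sample content and the variable names are made up for illustration; it assumes iconv and a sed that tolerates NUL bytes in its input (GNU sed does).

```shell
#!/usr/bin/env bash
# Build a one-line UTF-16 sample (hypothetical content, not a real task file).
sample=$(mktemp)
printf '<StartWhenAvailable>true</StartWhenAvailable>\n' | iconv -t utf-16 > "$sample"

# Raw bytes: '<' and 'S' are separated by a NUL byte, so '<S' never occurs.
raw_hits=$(LC_ALL=C sed -n '/<S/p' "$sample" | wc -l)

# After conversion to the locale's charset the same pattern matches.
converted_hits=$(iconv -f utf-16 < "$sample" | sed -n '/<S/p' | wc -l)

echo "raw matches: $raw_hits, converted matches: $converted_hits"
rm -f "$sample"
```

If the file really is UTF-16, the raw count should be 0 and the converted count 1, which is the same behaviour as the failing 's~<B~[B~g' attempt in the question.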

You'd need to convert that text to your locale's charset for sed to be able to deal with it. Note that UTF-16 cannot be used as a character encoding in a locale on Unix. You won't find a locale that uses UTF-16 as its character encoding.

iconv -f utf-16 < Task_01.xml | sed 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' | iconv -t utf-16 > Task_01.xml.out 

That assumes the input has a BOM. If not, you need to determine if it's big endian or little endian (probably little endian) and change that utf-16 to utf-16le or utf-16be.
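
One way to pick the right name automatically is to look at the first two bytes of the file. The helper below is only a sketch: guess_utf16 is a hypothetical name, and the BOM-less branches are a heuristic that assumes the file begins with an ASCII character such as <, as XML normally does.

```shell
#!/usr/bin/env bash
# Guess which encoding name to pass to iconv from the first two bytes.
# Hypothetical helper; the BOM-less cases are a heuristic, not a guarantee.
guess_utf16() {
  local head2
  head2=$(od -An -tx1 -N2 < "$1" | tr -d '[:space:]')
  case $head2 in
    fffe|feff) echo utf-16   ;;  # BOM present: iconv works out the byte order
    ??00)      echo utf-16le ;;  # ASCII byte then NUL: little endian, no BOM
    00??)      echo utf-16be ;;  # NUL then ASCII byte: big endian, no BOM
    *)         return 1      ;;  # does not look like UTF-16
  esac
}
```

It could then be used as enc=$(guess_utf16 Task_01.xml) followed by iconv -f "$enc" on the way in and iconv -t "$enc" on the way out. Using the same name in both directions keeps a BOM-less file BOM-less, since utf-16le and utf-16be do not write a BOM on output.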

If the locale's charset is UTF-8, there shouldn't be anything lost in translation even if the text contains non-ASCII characters.

As Cygwin's sed is typically GNU sed, it will also be able to deal with that type of binary (since it contains NUL bytes) input by itself, so you can also do something like:

LC_ALL=C sed -i 's/t\x00r\x00u\x00e/f\x00a\x00l\x00s\x00e/g' Task_01.xml 
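
Note that the command above replaces every occurrence of "true" in the file. To restrict it to one tag, the NUL-interleaved pattern can be generated rather than typed by hand. This is a sketch under assumptions: utf16le_pat is a hypothetical helper, the input must be little-endian UTF-16, and the strings must not contain regex metacharacters, & or the ~ delimiter.

```shell
#!/usr/bin/env bash
# Turn an ASCII string into its little-endian UTF-16 byte pattern for GNU sed,
# e.g. "true" becomes t\x00r\x00u\x00e\x00 (each character followed by a NUL).
# Hypothetical helper; assumes no regex metacharacters, '&' or '~' in the string.
utf16le_pat() {
  printf '%s' "$1" | sed 's/./&\\x00/g'
}

old=$(utf16le_pat '<StartWhenAvailable>true</StartWhenAvailable>')
new=$(utf16le_pat '<StartWhenAvailable>false</StartWhenAvailable>')

# Print the resulting expression; with GNU sed it could be applied as:
#   LC_ALL=C sed -i "s~$old~$new~g" Task_01.xml
printf '%s\n' "s~$old~$new~g"
```

Because each character carries its trailing NUL, the pattern only matches at correctly aligned positions in the file.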

The file command should be able to tell you if the input is indeed UTF-16. You can use sed -n l or od -tc to see those hidden NUL characters. Example of little-endian UTF-16 text with BOM:

$ echo true | iconv -t utf-16 | od -tc
0000000 377 376   t  \0   r  \0   u  \0   e  \0  \n  \0
0000014
$ echo true | iconv -t utf-16 | sed -n l
\377\376t\000r\000u\000e\000$
\000$
$ echo true | iconv -t utf-16 | file -
/dev/stdin: Little-endian UTF-16 Unicode text, with no line terminators

To process several files with zsh/bash/ksh93:

set -o pipefail
for file in ./*.xml; do
  cp -ai "$file" "$file.bak" &&
    iconv -f utf-16 < "$file.bak" |
    sed 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' |
    iconv -t utf-16 > "$file" &&
    rm -f "$file.bak"
done
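
A commenter reports that cp -a failed under Cygwin with "failed to preserve ownership". As a possible variant that avoids cp altogether, the sketch below writes the converted text to a temporary file and only overwrites the original if the whole pipeline succeeded. replace_in_utf16 is a hypothetical name, and the files are assumed to be UTF-16 with a BOM.

```shell
#!/usr/bin/env bash
# Sketch: apply one sed expression to several UTF-16 files (assumed to start
# with a BOM). The result goes to a temporary file first, so a failed
# conversion leaves the original untouched.
replace_in_utf16() (
  set -o pipefail            # runs in a subshell, so the caller's options are unchanged
  expr=$1
  shift
  for file in "$@"; do
    tmp=$(mktemp) || exit 1
    if iconv -f utf-16 < "$file" | sed "$expr" | iconv -t utf-16 > "$tmp"; then
      cat "$tmp" > "$file"   # overwrite the contents in place, no ownership changes
    else
      printf 'skipped %s: conversion failed\n' "$file" >&2
    fi
    rm -f "$tmp"
  done
)
```

It would be called with the expression first and the files after it, for example replace_in_utf16 's~<StartWhenAvailable>true</StartWhenAvailable>~<StartWhenAvailable>false</StartWhenAvailable>~g' ./*.xml. Unlike the loop above, it keeps no .bak copy, so take a backup first if the files matter.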
  • Yes yes yes! I noticed that when I did sed 's/\</[/g' file.xml that it changed every tag to [P[a[r[e[n[t> or something like it. It must have been picking up the null characters and replacing them. The XML file does indeed have a BOM and it is a Windows-generated file. I'll try these changes when I get home. Commented Mar 20, 2017 at 18:04
  • Of course, the whole point of using sed was so that I could replace the tag in all of the files in a single folder, and this means I'd have to do one file at a time so I might as well use Notepad2 at that point. I don't know if a UTF-16 compatible find and replace program exists for Windows. Commented Mar 20, 2017 at 18:13
  • I ran the for loop and got cp: failed to preserve ownership for './Task_01.xml.bak': Permission denied for every single file I tried to run it on (changed extension to .bak instead of .back because I have a registry key for .bak to open in Notepad2). It also doesn't seem to have changed the tag within the file. Commented Mar 21, 2017 at 0:52
  • I removed the -a flag from the cp command and it worked! Thanks a lot. I'll be sure to use this with all of the mass updates that I have to make to XML files. Commented Mar 21, 2017 at 1:03
2

Place your sed command in a file, say sed.cmds, and then invoke sed as:

sed -i -f "sed.cmds" "MyFile.xml" 

Also try changing the delimiter to _, for example:

s_<BooleanTag>true</BooleanTag>_<BooleanTag>false</BooleanTag>_g
