Why might sed not make any change to a file?

Question

Short question:

Why might sed not make any change to a file and is there any way to check?

Long question:

I've tried running a sed command that had always worked with my files before. I learned it here back in September. Each quarter I get 4 huge files with a bunch of white space and a column that should be one but is split into two. I run the following command to trim the white space and merge the 41st and 42nd columns:

sudo sed -i -e 's/ \{1,\}"/"/g' -e 's/" \{1,\}/"/g' -e 's/","//41' original_file.txt

For the first time yesterday, nothing happened at all. It waits about 3 seconds and then nothing happens, whereas typically it should take 20-30 minutes. I check the file and the spaces are still there. I have 3 times the size of the file still free on the system and twice the file size available in RAM (512GB ram), not that the ram matters, just wanted to throw that in.

I tried writing it to another file using

sudo sed -e 's/ \{1,\}"/"/g' -e 's/" \{1,\}/"/g' -e 's/","//41' original_file.txt > formatted_file.txt

This creates formatted_file.txt but it's completely blank.

Can anyone tell what I'm doing wrong or how to check the issue?

EDIT:

The sample input can be seen on stackoverflow, except that there are over 300 columns.

Are you running the command in a system where the same command has worked before? — hegez
– hegez, Commented Jan 11, 2018 at 2:39
Is it possible that space (0x20) characters have been replaced by tabs (0x09)? You could use \s\s* instead of ` \{1,\}` — Hauke Laging
– Hauke Laging, Commented Jan 11, 2018 at 2:42
@hegez. Yes. It's the same system. This would have been the third time importing the files and running this command on them. — Aunt Jemima
– Aunt Jemima, Commented Jan 11, 2018 at 2:49
Maybe the file format is slightly different and the regexes fail? E.g. are the double quotes there? Is the number of columns >= 42? — NickD
– NickD, Commented Jan 11, 2018 at 3:41

cas · Accepted Answer · 2018-01-12 03:52:56Z

In comments, it was discovered that the input file is in big-endian UTF-16 format rather than plain old 7-bit ASCII or 8-bit extended ASCII. UTF-16 is a 2-bytes-per-character format, and when it is used to encode plain ASCII, each "ASCII" character has 0x00 (a NUL byte, displayed as ^@ by cat -A, less, and other programs) as the first byte of the 2-byte pair (in big-endian; the order is reversed for little-endian). With a NUL sitting between every pair of visible characters, the regexes in the question never match.
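To answer the "is there any way to check" part, the encoding can be inspected directly. The following is a minimal sketch, not part of the original answer: the sample file and the `utf16_bom_kind` helper are hypothetical names made up for illustration, and the helper only looks at the first two bytes for a byte-order mark. Where `file` is installed, `file --mime-encoding yourfile` (suggested in the comments) gives a fuller answer.

```shell
# Hypothetical sample: UTF-16BE BOM followed by "a","b" and a CR-LF line ending.
workdir=$(mktemp -d)
sample="$workdir/sample_utf16.txt"
printf '\376\377\000"\000a\000"\000,\000"\000b\000"\000\r\000\n' > "$sample"

# Report which UTF-16 byte-order mark (if any) a file starts with.
utf16_bom_kind() {
    bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    case "$bom" in
        feff) echo "UTF-16 big-endian" ;;
        fffe) echo "UTF-16 little-endian" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}

utf16_bom_kind "$sample"

# Dump the first bytes; the \0 entries are the NULs that keep the regexes from matching.
od -c "$sample" | head -n 2
```

A file without a BOM can still be UTF-16, so a "no UTF-16 BOM" result is not proof of plain ASCII; the `od -c` dump is the more reliable check.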

The fix is to convert the file to plain ASCII first. For example, instead of using the standard fromdos or a similar utility to convert CR-LF (DOS/Windows line endings) to LF (Unix line endings), you need to do something like the following (GNU sed syntax) to convert the text into a format usable by the remainder of the sed script:

sed -e '1 s/^\xff\xfe\|^\xfe\xff//; s/\x00//g; s/\x0d$//'

This sed script:

strips the 0xfffe or 0xfeff byte-order marker from the beginning of the first line.
removes all NUL characters from all input lines, wherever they occur.
removes the carriage-return (0x0d) character from the end of any line
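As a quick sanity check, the script can be run against a tiny hand-made sample. This is a sketch assuming GNU sed (for the `\xHH` escapes and `\|` alternation). The sample file is hypothetical, and `LC_ALL=C` is my addition so that sed treats the input as raw bytes regardless of the current locale.

```shell
# Hypothetical sample: UTF-16BE BOM followed by "a","b" and a CR-LF line ending.
workdir=$(mktemp -d)
sample="$workdir/sample_utf16.txt"
printf '\376\377\000"\000a\000"\000,\000"\000b\000"\000\r\000\n' > "$sample"

# Strip the BOM, every NUL byte, and the trailing CR (ASCII-only data assumed).
converted=$(LC_ALL=C sed -e '1 s/^\xff\xfe\|^\xfe\xff//; s/\x00//g; s/\x0d$//' "$sample")

printf '%s\n' "$converted"
```

Once the output looks like ordinary quoted, comma-separated text, the substitutions from the question have something to match.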

Note: this is only suitable for UTF-16 encoded text that contains only characters that would otherwise be ASCII. It will completely mangle any UTF-16 text file that contains other kinds of characters (e.g. non-english text).
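For files that may contain non-ASCII characters, decoding the text properly is safer than deleting bytes. The sketch below uses `iconv`, which the answer does not mention. It assumes an `iconv` that accepts the `UTF-16` name and picks the byte order from the byte-order mark (glibc's does), and the file names are hypothetical.

```shell
# Hypothetical sample: UTF-16BE BOM followed by "a","b" and a CR-LF line ending.
workdir=$(mktemp -d)
printf '\376\377\000"\000a\000"\000,\000"\000b\000"\000\r\000\n' > "$workdir/original_utf16.txt"

# Decode UTF-16 to UTF-8, then drop every CR byte (this removes CRs anywhere,
# not only at line ends, which is fine for data that has none inside fields).
iconv -f UTF-16 -t UTF-8 "$workdir/original_utf16.txt" | tr -d '\r' > "$workdir/original_utf8.txt"

cat "$workdir/original_utf8.txt"
```

The converted file can then be fed to the original sed script unchanged.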

Finally, perl has excellent support for text in a variety of common formats, including plain ascii, UTF-8, UTF-16, and more. It has library modules for working with and converting between all formats. It is fairly easy to convert simple sed scripts to perl, so a perl version of the script might be as simple as (untested, but it might even work):

#!/usr/bin/perl
use strict;
use feature 'unicode_strings';

while (<>) {
    s/^\xff\xfe|^\xfe\xff// if ($. == 1);  # strip Byte Order marker from 1st line
    s/\x00//g;     # strip NUL bytes (same ASCII-only caveat as the sed version)
    chomp;         # strip the LF; print adds it back below
    s/\x0d$//;     # strip CR from each end-of-line
    s/ *"/"/g;     # get rid of all spaces immediately before " characters
    s/" */"/g;     # get rid of all spaces immediately after " characters

    # A very primitive split(). Should use a real CSV parser here, like the
    # Text::CSV module which properly copes with embedded quotes and commas etc
    # in string fields. This would also allow proper processing of each field to
    # remove any extra whitespace characters rather than the quick-and-dirty hack of
    # global regexp substitutions above.
    my @fields = split /,/, $_, -1;

    # perl arrays start from zero. This appends the "fake" field 42 onto field 41
    # (dropping the quotes between them, as s/","//41 does) and then removes field 42.
    if (@fields > 41) {
        $fields[40] =~ s/"$//;
        $fields[41] =~ s/^"//;
        $fields[40] .= $fields[41];
        splice(@fields, 41, 1);
    }

    print join(',', @fields), "\n";
}

Old answer that still contains (IMO) useful info:

awk is a better tool for this job than sed.

For example, with GNU awk (gensub() is a GNU awk extension, as are the \s and \S shorthands):

awk '{$0=gensub(/\s*(\S+)/,"\\1",42)}1' original > fixed

That merges columns 41 & 42 by removing any spaces immediately preceding column 42.

To avoid the \s and \S shorthands, use the POSIX classes [[:space:]] instead of \s and [^[:space:]] instead of \S (gensub() itself still needs GNU awk):

awk '{$0=gensub(/[[:space:]]*([^[:space:]]+)/,"\\1",42)}1' original > fixed

Also, depending on the exact nature of the input file, perl may be an even better tool for this job than awk. For example, it has modules for parsing CSV files and working with the individual fields in a CSV record.

BTW, IMO that sed script is horrible, not least because you're using multiple -e args rather than a single sed script with ; as command separator. If you want to use sed then at least use it effectively and efficiently. Your sed script is better written as:

sed -e 's/ \{1,\}"/"/g; s/" \{1,\}/"/g; s/","//41' original > fixed

or even:

sed -e 's/ \{1,\}"/"/g
        s/" \{1,\}/"/g
        s/","//41' original > fixed

You'll still need to fix the bug, but at least you'll have something more readable to debug - which makes it FAR easier to see where the problem might be.
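To see what the three substitutions do, and that the two spellings are equivalent, here is a small sketch on a made-up four-column line. Because the sample has only four columns, it merges columns 2 and 3 with `s/","//2` rather than columns 41 and 42 with `s/","//41`.

```shell
# Made-up input: stray spaces inside and after the quotes, four columns.
line='"a  ","b" ,"c","d"'

# Three separate -e expressions, as in the question.
with_e=$(printf '%s\n' "$line" |
    sed -e 's/ \{1,\}"/"/g' -e 's/" \{1,\}/"/g' -e 's/","//2')

# One script with ; as the command separator.
with_semicolons=$(printf '%s\n' "$line" |
    sed -e 's/ \{1,\}"/"/g; s/" \{1,\}/"/g; s/","//2')

# Both lines print: "a","bc","d"
printf '%s\n%s\n' "$with_e" "$with_semicolons"
```

Running the script on one short line like this is also a cheap way to confirm that the regexes still match the current file's format.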

Also BTW, -i or --in-place isn't as "in place" an edit as you might think. It works by creating a temporary file and then mv-ing it into place afterwards. This breaks anything that requires the inode to remain the same, including hard links.

It's far better to write the changed output to a temporary file (e.g. temp.txt) and then cat temp.txt > original.txt; rm temp.txt - that overwrites the original file with the changed version, while still keeping the same inode.
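The temp-file approach can be sketched as follows. The file names and the one-line sample are hypothetical, and the `&&` chaining is my addition so that the original is only overwritten when sed succeeded.

```shell
workdir=$(mktemp -d)
printf 'a  "b"\n' > "$workdir/original.txt"
ln "$workdir/original.txt" "$workdir/hardlink.txt"   # second name for the same inode

inode_before=$(ls -i "$workdir/original.txt" | awk '{print $1}')

# Write the edited text to a temp file, then copy the bytes back over the original.
sed 's/ *"/"/g' "$workdir/original.txt" > "$workdir/temp.txt" &&
    cat "$workdir/temp.txt" > "$workdir/original.txt" &&
    rm "$workdir/temp.txt"

inode_after=$(ls -i "$workdir/original.txt" | awk '{print $1}')

# The inode is unchanged, so the hard link sees the edited content too.
echo "inode before: $inode_before, after: $inode_after"
cat "$workdir/hardlink.txt"
```

The trade-off is that the overwrite is not atomic: a reader may see a partially written file while `cat` is running, which the rename done by `-i` avoids.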

The script can be further simplified to sed 's/ *"/"/g; s/" */"/g; s/","//41'. But while everything you write is true, I can't force myself to upvote your answer, as it doesn't address the core problem of the question. — Philippos
– Philippos, Commented Jan 11, 2018 at 6:29
the core problem of his question is fixing his input data - he may have xyproblem-ed that into "sed will fix it" --> "my sed script isn't working any more", but the underlying problem remains. — cas
– cas, Commented Jan 11, 2018 at 6:49
Using multiple -e is not going to make any difference in terms of efficiency, compared to semicolons. Internally, they're replaced with newlines. On the other hand, separating sed commands with ; does not always work for some like w, r (and :, b, t, }... in POSIX) — Stéphane Chazelas
– Stéphane Chazelas, Commented Jan 11, 2018 at 7:48
That looks like it might be a UTF-16 file (a format that is commonly used on windows systems, where each character takes 2 bytes...and "ASCII" characters have 0x00 as the high byte). If it is utf-16, that would explain why the sed script doesn't work: those ^@s (a representation of NUL bytes) are actually in the file, they just don't get displayed by default with less or cat. Can you add the output of file --mime-encoding yourfile to your question? (Or just file yourfile if your version of file doesn't support the --mime-encoding option.) — cas
– cas, Commented Jan 11, 2018 at 22:35
If it is utf-16, try adding this sed script before the rest of your sed scripts: '1 s/\xff\xfe//; s/\x00//g; s/\x0d$//'. That will strip the 0xFFFE starting bytes from the first line, as well as all NUL bytes from all lines, and the carriage returns from the end of each line. Note: this will result in garbage output if the text file contains other utf-16 characters (e.g. non-english text), it only works for text files that would otherwise be plain ascii if they weren't encoded as 2-byte UTF-16. Also note: using a language like perl that supports UTF-16 would be better. — cas
– cas, Commented Jan 11, 2018 at 22:54
