In comments, it was discovered that the input file is in big-endian UTF-16 format rather than plain old 7-bit ASCII or 8-bit extended ASCII. UTF-16 uses 2 bytes for most characters (4 for those outside the Basic Multilingual Plane), and if used to encode plain ASCII, each "ASCII" character has 0x00 (a NUL byte, displayed as ^@ by cat -A, less, and other programs) as the first byte of the 2-byte pair (for big-endian; the order is reversed for little-endian).
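To see that byte layout for yourself, here is a minimal sketch, assuming iconv and od are installed. The helper name utf16be_hex and the sample text are made up for illustration:

```shell
# Encode a short ASCII string as big-endian UTF-16 (no BOM) and print
# its bytes as space-separated hex.
utf16be_hex() {
    printf '%s' "$1" \
        | iconv -f ASCII -t UTF-16BE \
        | od -An -tx1 \
        | tr -s ' \n' ' ' \
        | sed 's/^ //; s/ $//'
}

utf16be_hex 'A,B'
```

This should print 00 41 00 2c 00 42: every ASCII byte is preceded by a NUL byte.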
The fix is to convert the file to plain ASCII. For example, instead of using the standard fromdos or a similar utility to convert CR-LF (DOS/Windows line endings) to LF (Unix line endings), you need to do something like the following to convert the text into a format usable by the remainder of the sed script:
sed -E -e '1 s/^(\xff\xfe|\xfe\xff)//; s/\x00//g; s/\x0d$//'
This sed script:
- strips the 0xfffe or 0xfeff byte-order marker from the beginning of the first line.
- removes all NUL characters from all input lines, wherever they occur.
- removes the carriage-return (0x0d) character from the end of any line.
Note: this is only suitable for UTF-16 encoded text that contains only characters that would otherwise be ASCII. It will completely mangle any UTF-16 text file that contains other kinds of characters (e.g. non-English text).
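If the file might contain non-ASCII characters, a real transcoding step is safer than deleting NUL bytes. The following is only a sketch, assuming iconv is available and that the input starts with a byte-order marker (without one, name the byte order explicitly, e.g. -f UTF-16BE). The function name utf16_to_utf8 and the sample data are made up:

```shell
# Decode UTF-16 (byte order taken from the BOM) to UTF-8, then delete
# carriage returns. Note that tr removes every CR, not only the ones at
# the end of a line.
utf16_to_utf8() {
    iconv -f UTF-16 -t UTF-8 | tr -d '\r'
}

# Demo: build a small UTF-16 sample containing a non-ASCII character
# (the bytes \303\251 are "e with acute accent" in UTF-8) and convert it back.
printf 'caf\303\251,"x"\r\n' | iconv -f UTF-8 -t UTF-16 | utf16_to_utf8
```

The output can then be fed to the rest of the sed script in place of the original file.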
Finally, perl has excellent support for text in a variety of common formats, including plain ascii, UTF-8, UTF-16, and more. It has library modules for working with and converting between all formats. It is fairly easy to convert simple sed scripts to perl, so a perl version of the script might be as simple as (untested, but it might even work):
    #!/usr/bin/perl
    use strict;
    use feature 'unicode_strings';

    while (<>) {
        chomp;                                 # drop the LF; print re-adds it below
        s/^\xff\xfe|^\xfe\xff// if ($. == 1);  # strip Byte Order marker from 1st line
        s/\x00//g;                             # strip NUL bytes (ASCII-only text, as with the sed version)
        s/\x0d$//;                             # strip CR from each end-of-line
        s/ *"/"/g;                             # get rid of all spaces immediately before " characters
        s/" */"/g;                             # get rid of all spaces immediately after " characters

        # A very primitive split(). Should use a real CSV parser here, like the
        # Text::CSV module which properly copes with embedded quotes and commas etc
        # in string fields. This would also allow proper processing of each field to
        # remove any extra whitespace characters rather than the quick-and-dirty hack of
        # global regexp substitutions above.
        my @fields = split /,/;

        # perl arrays start from zero, so field 41 is $fields[40] and field 42 is
        # $fields[41]. This appends the "fake" field 42 onto field 41, and then
        # removes field 42. Lines with fewer than 42 fields are left alone.
        if (@fields > 41) {
            $fields[40] .= $fields[41];
            splice(@fields, 41, 1);
        }

        print join(',', @fields), "\n";
    }
Old answer that still contains (IMO) useful info:
awk is a better tool for this job than sed.
For example, with GNU awk (or any other awk that understands PCRE like \s and \S):
awk '{$0=gensub(/\s*(\S+)/,"\\1",42)}1' original > fixed
That merges columns 41 & 42 by removing any spaces immediately preceding column 42.
For non-PCRE awk, use [[:space:]] instead of \s and [^[:space:]] instead of \S:
awk '{$0=gensub(/[[:space:]]*([^[:space:]]+)/,"\\1",42)}1' original > fixed
Also, depending on the exact nature of the input file, perl may be an even better tool for this job than awk. For example, it has modules for parsing CSV files and working with the individual fields in a CSV record.
BTW, IMO that sed script is horrible, not least because you're using multiple -e args rather than a single sed script with ; as command separator. If you want to use sed then at least use it effectively and efficiently. Your sed script is better written as:
sed -e 's/ \{1,\}"/"/g; s/" \{1,\}/"/g; s/","//41' original > fixed
or even:
    sed -e 's/ \{1,\}"/"/g
            s/" \{1,\}/"/g
            s/","//41' original > fixed
You'll still need to fix the bug, but at least you'll have something more readable to debug - which makes it FAR easier to see where the problem might be.
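To make it easier to see what each command in that script does, here is the same script run on a tiny, made-up 4-field record, with the occurrence number changed from 41 to 2 so it fits on one line:

```shell
# Hypothetical sample record with stray spaces around the quotes.
line='"a" , "b","c" ,"d"'

# 1. remove runs of spaces immediately before a "
# 2. remove runs of spaces immediately after a "
# 3. delete only the 2nd occurrence of "," which merges fields 2 and 3
merged=$(printf '%s\n' "$line" | sed -e 's/ \{1,\}"/"/g; s/" \{1,\}/"/g; s/","//2')

printf '%s\n' "$merged"
```

This prints "a","bc","d". The numeric flag on the last s command is what restricts it to a single match; the other separators are left alone.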
Also BTW, -i or --in-place isn't as "in place" an edit as you might think. It works by creating a temporary file and then mv-ing it into place afterwards. This breaks anything that requires the inode to remain the same, including hard links.
It's far better to write the changed output to a temporary file (e.g. temp.txt) and then cat temp.txt > original.txt; rm temp.txt - that overwrites the original file with the changed version, while still keeping the same inode.
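A small sketch of that approach, run in a scratch directory so nothing real is touched. All file names here are made up, and the sed command is just a stand-in for whatever edit you are making:

```shell
# Set up a file with a hard link pointing at it.
dir=$(mktemp -d)
printf 'a ,b\n' > "$dir/original.txt"
ln "$dir/original.txt" "$dir/hardlink.txt"
inode_before=$(ls -i "$dir/original.txt" | awk '{print $1}')

# Write the edited text to a temporary file, then copy its contents back
# over the original. The original is truncated and rewritten, not replaced.
sed -e 's/ ,/,/' "$dir/original.txt" > "$dir/temp.txt" &&
    cat "$dir/temp.txt" > "$dir/original.txt" &&
    rm "$dir/temp.txt"

inode_after=$(ls -i "$dir/original.txt" | awk '{print $1}')
printf 'inode before: %s, after: %s\n' "$inode_before" "$inode_after"
cat "$dir/hardlink.txt"
```

Both inode numbers should be the same, and hardlink.txt should show the edited text because it still refers to the same file.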
Finally, is it possible that some of the space (0x20) characters have been replaced by tabs (0x09)? If so, you could use `\s\s*` instead of ` \{1,\}`.