The immediate thought is wc, but the next, not-so-immediate thought is: does *nix's wc count only *nix line endings (\x0a)?... It seems so.
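A quick check (a sketch; the byte sequences are spelled out with printf escapes) suggests that wc -l really is counting raw 0x0a bytes, regardless of encoding:

```shell
# wc -l counts 0x0a bytes, not logical lines.
printf 'a\r\nb\r\n' | wc -l     # ASCII CRLF text: two 0x0a bytes -> 2
printf 'a\0\r\0\n\0' | wc -l    # UTF-16LE "a\r\n" (61 00 0d 00 0a 00) -> 1
```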
I've semi-wangled my way around it, but I feel there may (must?) be a simpler way than working on a hex dump of the original.
Here is my version, but there is still a mysterious discrepancy in the tallies: wc reports one more 0a than the sum of this script's CRLF and 0a counts.
```shell
file="nagaricb.nag"
echo "Report on CR and LF in UTF-16LE/CR-LF"
echo "====================================="
cat "$file" |        # a useless comment, courtesy of cat
xxd -p -c 2 |
sed -nr '
  /0a../{
    /0a00/!{
      i 0a: embedded in non-newline chars
      b
    }
  }
  /0d../{
    /0d00/!{
      i 0d: embedded in non-newline chars
      b
    }
  }
  /0a00/{
    i LF:   found stray 0a00
    b
  }
  /0d00/{
    N
    /0d00\n0a00/{
      i CRLF: found as normal newline pairs
      b
    }
    i CR:   found stray 0d00
  }' |
sort |
uniq -c
echo " ====="
printf '   %s   wc\n' "$(<"$file" wc -l)"
```

Output:
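For context (a minimal sketch, using the same tools as the script), `xxd -p -c 2` emits one two-byte UTF-16LE code unit per line as four hex digits, which is exactly what the `0d00`/`0a00` patterns in the sed program match against:

```shell
# UTF-16LE "a\r\n" -> one 4-hex-digit line per code unit
printf 'a\0\r\0\n\0' | xxd -p -c 2
# 6100
# 0d00
# 0a00
```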
```
Report on CR and LF in UTF-16LE/CR-LF
=====================================
    125 0a: embedded in non-newline chars
    407 0d: embedded in non-newline chars
  31826 CRLF: found as normal newline pairs
 =====
  31952   wc
```

Is there some more standard/simple way to do this?
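For comparison, one candidate "standard" route (a sketch, assuming iconv is available and the encoding is known) is to transcode to UTF-8 first, so the only remaining 0x0a bytes are genuine newlines, and then count as usual:

```shell
# A 3-line UTF-16LE/CRLF sample, transcoded and then counted
printf 'a\0\r\0\n\0b\0\r\0\n\0c\0\r\0\n\0' |
  iconv -f UTF-16LE -t UTF-8 | wc -l    # -> 3
# For the real file: iconv -f UTF-16LE -t UTF-8 "$file" | wc -l
```

If the file carries a BOM, `-f UTF-16` (no LE/BE suffix) should let iconv pick the byte order up from the BOM itself, at least with glibc's iconv.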
Comments:

- wc (GNU coreutils) 7.4
- wc (GNU coreutils) 8.14
- ਊ (U+0A0A) or ऊ (U+090A)... that's the only time the problem shows itself... My file has 532 such chars.
- A 0a that is not "legitimate", I guess, to fix your script (xx0a doesn't get counted, and 0a0a only counts for one, if I understand it correctly).
- wc (and awk's counting of NR) is out by a further 1... the above script's line-count is the same as shown in emacs... I'm just trying to find a less clumsy way of counting lines in a UTF-16LE/CR-LF file (with BOM, in this case, if that makes a difference).