I have a Perl script that parses data sent to me from a bunch of school districts. I'm adding a new school and have run into a problem I've never faced before. When I do $line = <INPUT>, it slurps up the whole file instead of one line.
If I run file on the file, it returns UTF-8 Unicode text, with CRLF, CR line terminators. All my other files return ASCII text, with CRLF line terminators. I've run it through dos2unix but it still operates as one long string. When I edit it in emacs, it still shows ^M for the line endings.
What can I do to convert these line endings into usable line endings?
Update: The vendor sent me another file with different line endings which still don't work. They report as CRLF, LF. I've extracted a few sample lines.
Here's some snippets from my code:
    $line = <INPUT> if ($schooldistricts{$schooldistrict}{'header'});
    LINE: foreach $line (<INPUT>) {
        next LINE unless ($line =~ /\S/);
        <do stuff>
    }

The file does have a header, which gets stripped off correctly. Then in the foreach loop it reads the first line successfully and then that's it -- it's like the rest of the file is empty.
I tried setting $/ to \r\n\n but then the script does nothing. Same if I try \r\n. Is there a way to definitively see what characters are encoded for the line ending?
Second update: As an experiment, I brought the file into Excel, split it out, and saved it as a tab-delimited file. On the server, I ran dos2unix. The Perl script still won't parse anything after the second line. Running file on it now returns UTF-8 Unicode text, with CRLF line terminators. That's the right line ending, so that leaves Unicode as the issue. Is there something different about how Unicode would encode the line endings?
od -c, it should show CRs as \r and LFs as \n. And you can use the same escapes with e.g. printf; printf 'one\rtwo\nthree\r\n' would print stuff with three different CR/LF-combinations. (Also I'm not sure if you tried the solutions you got in answers.)

awk, sed, ruby, raku, python, etc. also fit the bill?

hexdump -C and tell us what you see for line endings. Additionally, if you think your problem is Unicode-related have a look at: stackoverflow.com/q/13836352/7270649
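
The od -c and printf suggestion from the comments can be tried on a throwaway file before pointing it at the real data. This is only a minimal sketch: the temporary file `sample` and the helper `count_byte` are hypothetical names made up for illustration, and the sample string is the one from the comment, not the vendor's file.

```shell
#!/bin/sh
# Hypothetical helper: count how many times one byte (given as a tr escape
# such as '\r' or '\n') occurs in a file. The trailing tr strips the padding
# that some wc implementations print.
count_byte() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Build a sample with three different terminators: bare CR, bare LF, CRLF.
sample=$(mktemp)
printf 'one\rtwo\nthree\r\n' > "$sample"

# od -c prints every byte; CR appears as \r and LF as \n.
od -c "$sample"

# Summarize the terminator bytes found in the sample.
echo "CR bytes: $(count_byte '\r' "$sample")"
echo "LF bytes: $(count_byte '\n' "$sample")"

rm -f "$sample"
```

Running the same od -c and count_byte calls against the first few hundred bytes of the real file (for example through head -c) should show whether the terminators are CRLF, bare CR, or a mix, which is what `file` is summarizing when it reports "CRLF, CR line terminators".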