
I have a big file (>1.5 GB) that uses '#@#@#' as the row delimiter. I will replace this delimiter with CRLF before I process the file through Informatica. The issue is that the file also contains CR and LF characters, and I need to get rid of them before doing the replacement. I have found a couple of options to do that, but because of the file size I get OutOfMemory exceptions. This is what I have so far:

param (
    [string]$Source,
    [string]$Destination
)
echo $Source
echo $Destination
$Writer = New-Object IO.StreamWriter $Destination
$Writer.Write( [String]::Join("", $(Get-Content $Source)) )
$Writer.Close()

My question is: is there any way to set my row delimiter to '#@#@#' and then read the file row by row to remove the CR and LF characters?
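Conceptually, something along these lines is what I am hoping for - just a sketch, untested on a file of this size, and it deliberately keeps '#@#@#' intact so I can still do the CRLF replacement afterwards:

# Sketch only: read the file by the custom delimiter instead of by line,
# strip the CR/LF characters from each logical row, and write the result
# back out without introducing any new line breaks.
# $Source and $Destination are full paths, as in the script above.
$writer = New-Object IO.StreamWriter $Destination
Get-Content $Source -Delimiter '#@#@#' | ForEach-Object {
    $writer.Write($_.Replace("`r", '').Replace("`n", ''))
}
$writer.Close()

Is something along these lines feasible for a file this size?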

  • Sounds like you need more memory then. I know I can load a 1.5 GB file with 4 GB of RAM on my computer. Most of the utilities I have ever used to remove the CRLF require loading the entire file into memory first before they can remove the CRLF. Hopefully someone comes up with a better solution for you. Commented Dec 21, 2016 at 16:34
  • Thank you, Squashman. Asking for extra memory is my last option. I will see if anyone comes up with a different solution. Thank you. Commented Dec 21, 2016 at 16:48
  • Can you try not using the StreamWriter and maybe not loading it into memory at all? Something like: Get-Content "C:\source.txt" | Foreach-Object {$_.Replace(..) } | Set-Content C:\..Output.txt Commented Dec 21, 2016 at 17:29
  • The input file contains CR and LF characters, but can they also occur as a CR+LF sequence? What do you want CR and LF to be replaced with? Commented Dec 21, 2016 at 18:14
  • This sounds like a job for sed or tr, etc. and Enable true linux shell on windows 10 anniversary edition Commented Dec 21, 2016 at 22:56

2 Answers


OK, my first attempt was unbearably slow. Here is a good solution that was able to process a 1.8 GB file in 2 min 48 sec :-)

I used hybrid batch/JScript, so it runs on any Windows machine from XP onward - no 3rd party exe file is needed, nor is any compilation needed.

I read and write ~1 MB chunks. The logic is actually pretty simple: each chunk is cut just after the last complete #@#@# delimiter it contains, and the remainder is carried over into the next read, so delimiters that happen to span chunk boundaries are still handled correctly.

I replace all \r\n with a single space, and #@#@# with \r\n. You can easily change the string values in the code to suit your needs.

fixLines.bat

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion
if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)
<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul
exit /b

----- End of JScript comment, beginning of normal JScript ------------------*/
var delim='#@#@#', delimReplace='\r\n', nl='\r\n', nlReplace=' ', pos=0, str='';
var delimRegex=new RegExp(delim,"g"), nlRegex=new RegExp(nl,"g");
while( !WScript.StdIn.AtEndOfStream ) {
  str=str.substring(pos)+WScript.StdIn.Read(1000000);
  pos=str.lastIndexOf(delim);
  if (pos>=0) {
    pos+=delim.length;
    WScript.StdOut.Write(str.substring(0,pos).replace(nlRegex,nlReplace).replace(delimRegex,delimReplace));
  } else {
    pos=0;
  }
}
if (str.length>pos) WScript.StdOut.Write(str.substring(pos).replace(nlRegex,nlReplace));

To fix input.txt and write the output to output.txt:

fixLines input.txt output.txt 

To overwrite the original file test.txt:

fixLines test.txt 

Just for kicks, I attempted to process the 1.8 GB file using JREPL.BAT. I didn't think it would work because it must load the entire file into memory. It doesn't matter how much memory is installed in the computer - JScript is limited to 2GB max string size. And I think there are additional constraints that come into play.

jrepl "\r?\n:#@#@#" " :\r\n" /m /x /t : /f input.txt /o output.txt 

It took 5 minutes for the command to fail with an "Out Of Memory" error. And then it took a long time for my computer to recover from the serious abuse of memory.

Below is my original custom batch/JScript solution that reads and writes one character at a time.

slow.bat

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion
if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)
<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul
exit /b

----- End of JScript comment, beginning of normal JScript ------------------*/
var delim='#@#@#', delimReplace='\r\n', nlReplace=' ', read=1, write=2, pos=0, chr;
while( !WScript.StdIn.AtEndOfStream ) {
  chr=WScript.StdIn.Read(1);
  if (chr==delim.charAt(pos)) {
    // Building up a potential delimiter match, one character at a time.
    if (++pos==delim.length) { WScript.StdOut.Write(delimReplace); pos=0; }
  } else {
    // Partial delimiter match failed - emit the characters that were held back.
    if (pos) { WScript.StdOut.Write(delim.substring(0,pos)); pos=0; }
    if (chr=='\n') {
      WScript.StdOut.Write(nlReplace);
    } else if (chr!='\r') {
      WScript.StdOut.Write(chr);
    }
  }
}
// Emit any trailing partial delimiter at end of file.
if (pos) WScript.StdOut.Write(delim.substring(0,pos));

It worked, but it was a dog. Here is a summary of timing results to process a 155 MB file:

slow.bat      3120 sec  (52 min)
jrepl.bat       55 sec
fixLines.bat    15 sec

I verified that all three solutions gave the same result.
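The same chunked algorithm could also be translated into PowerShell. Here is a rough, untested sketch of what that might look like - the parameter names mirror the script in the question, the chunk size must stay larger than the longest logical line, and full paths should be used because .NET's current directory can differ from PowerShell's:

param(
    [string]$Source,
    [string]$Destination
)

$delim        = '#@#@#'
$delimReplace = "`r`n"
$nlReplace    = ' '
$chunkSize    = 1000000   # must exceed the longest logical line

$reader = New-Object IO.StreamReader $Source
$writer = New-Object IO.StreamWriter $Destination
$buf    = New-Object char[] $chunkSize
$rest   = ''

while (-not $reader.EndOfStream) {
    $count = $reader.Read($buf, 0, $chunkSize)
    $str   = $rest + (New-Object string ($buf, 0, $count))
    $pos   = $str.LastIndexOf($delim)
    if ($pos -ge 0) {
        # Flush everything up to and including the last complete delimiter.
        $pos += $delim.Length
        $writer.Write($str.Substring(0, $pos).Replace("`r`n", $nlReplace).Replace($delim, $delimReplace))
        $rest = $str.Substring($pos)
    } else {
        # No complete delimiter in this chunk; carry everything forward.
        $rest = $str
    }
}
# Whatever remains after the last delimiter.
if ($rest.Length -gt 0) {
    $writer.Write($rest.Replace("`r`n", $nlReplace))
}
$reader.Close()
$writer.Close()

I have not benchmarked this version, so I can't say how it compares to the batch/JScript timings above.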


13 Comments

How long is that going to take on 1,500,000,000 bytes? This seems better suited for JREPL if she had enough ram in her computer.
@Squashman - The size is the whole problem - JREPL would require /M option, which means the entire file must fit in memory, and I don't think JREPL can work with such a big file in memory. I imagine performance could be improved a bit by reading a block of data at a time (1mb maybe?), but you still would have to process char by char, and handle delims that span blocks.
This looks like a lot of excessive file I/O, and it seems prone to memory leaks. Why not try something simpler, like chunking the file into larger blocks, or a CSV parser with a custom line ending? Or maybe something like cc.davelozinski.com/c-sharp/…
@SamuelJackson - Memory leaks? I think not! But yes, reading byte by byte may slow it down a bit. I just acknowledged to Squashman that reading blocks might improve performance a bit. But I don't think the performance will suffer that badly. I'm not aware of any existing utilities (CSV or otherwise) that can efficiently read lines with arbitrary line terminators. Any solution will have to logically process byte by byte, but the I/O could be buffered.
@Squashman - I've confirmed that JREPL fails with a 1.8 GB file. Also, my original solution was a complete dog. But the new solution is pretty good!

Conceptually simple and memory-efficient, but slow PowerShell solution:

This PowerShell (v2+) solution is slow, but it is conceptually simple and you shouldn't run out of memory, because the input lines are processed one at a time, using #@#@# as the line delimiter.

Note: This solution combines your two steps:

  • It replaces the original line breaks with a single space each,

  • and it replaces each #@#@# sequence with a newline.

# Create sample input file.
@'
line 1 starts here
and ends here#@#@#line 2 is all on one line#@#@#line 3
spans two lines#@#@#
'@ > file

# Determine the input file.
$inFile = 'file'

# Create the output file.
$outFile = 'out'
$null = New-Item -Type File $outFile

# The custom row delimiter.
$sep = '#@#@#'

Get-Content -Delimiter $sep $inFile | % {
  Add-Content -Value ($_.Replace("`r`n", " ").Replace($sep, '')) $outFile
}

Note:

  • When you use -Delimiter, the specified delimiter is included in each item passed through the pipeline (unlike the default behavior, where the default delimiter (newline) is stripped) - see the short demo after these notes.

  • Add-Content automatically adds a trailing CRLF to its output (in PSv5+, this could be suppressed with -NoNewLine).

  • The approach uses the [string] type's .Replace() method rather than PowerShell's flexible, regex-based -replace operator, because .Replace() performs literal replacements, which are faster (the equivalent command is
    Add-Content -Value (($_ -replace '\r\n', ' ') -replace '#@#@#') $outFile.
    That said, the speed gain is negligible; it's the file I/O part that takes up most of the time).
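A quick way to see the first and third points in action (demo.txt is just a throwaway file name for this illustration):

# The delimiter stays attached to each item that -Delimiter emits:
Set-Content demo.txt -Value 'a#@#@#b#@#@#c'
Get-Content -Delimiter '#@#@#' demo.txt | % { "<$_>" }
# <a#@#@#>
# <b#@#@#>
# <c
# >            <- the last item carries the file's trailing newline

# .Replace() is literal, whereas -replace is regex-based:
'x.y'.Replace('.', '-')    # x-y  (only the literal dot is replaced)
'x.y' -replace '.', '-'    # ---  ('.' matches every character)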


Much faster PowerShell solution with on-demand compilation of C# code

dbenham's clever and elegant batch + JScript solution is significantly faster than the above PowerShell solution.

Here's an adaptation of his approach that uses C# code inside a PowerShell script, compiled on demand.

Compilation is surprisingly quick (on the order of 0.3 seconds on my late-2012 iMac), and the use of compiled code to process the file results in a significant performance gain.
Also note that compilation is only performed once per session, so subsequent invocations do not pay this penalty.
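A minimal way to see that compile-once behavior in a session (the DemoOnce type and its Twice method are made up purely for this illustration, and the exact timings will vary):

$src = 'public static class DemoOnce { public static int Twice(int n) { return 2 * n; } }'

# First call in a session compiles the C# source.
(Measure-Command { Add-Type -TypeDefinition $src }).TotalMilliseconds

# A second call with the identical source is effectively a no-op,
# because PowerShell recognizes the type has already been added.
(Measure-Command { Add-Type -TypeDefinition $src }).TotalMilliseconds

[DemoOnce]::Twice(21)   # -> 42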

Processing a ~ 1 GB file (created by repetition of the contents of the above sample file) with the script printed below yields the following:

Compiling...
Processing file...

Completed:
  Compilation time:     00:00:00.2343647
  File-processing time: 00:00:26.0714467
  Total:                00:00:26.3278546

Execution times in real-world applications will differ based on many factors, but based on @dbenham's timings mentioned in the comments below, the on-demand compilation solution is about twice as fast as the batch + JScript solution.


Source code of the fast PowerShell solution:

# Determine the input and output files.
$inFile = 'file'
$outFile = 'out'

# Get current time stamp for measuring duration.
$dtStart = [datetimeoffset]::UtcNow

# How many characters to read at a time.
# !! Make sure that this is at least as large as the max. input-line length.
$kCHUNK_SIZE = 1000000

Write-Host 'Compiling...'

# Note: This statement performs on-demand compilation, but only
#       on *first* invocation in a given session.
$tsCompilation = Measure-Command {

  Add-Type @"
  using System;
  using System.IO;

  namespace net.same2u.so {
    public static class Helper {

      public static void TransformFile(string inFile, string outFile, string sep) {

        char[] bufChars = new char[$kCHUNK_SIZE];

        using (var sw = new StreamWriter(outFile))
        using (var sr = new StreamReader(inFile))
        {
          int pos = 0; bool eof = false;
          string bufStr, rest = String.Empty;

          while (!(eof = sr.EndOfStream) || rest.Length > 0)
          {
            if (eof)
            {
              bufStr = rest;
            }
            else
            {
              int count = sr.Read(bufChars, 0, $kCHUNK_SIZE);
              bufStr = rest.Length > 0 ? rest + new string(bufChars, 0, count)
                                       : new string(bufChars, 0, count);
            }

            if (-1 == (pos = bufStr.LastIndexOf(sep))) // should only happen at the very end
            {
              sw.Write(bufStr);
              rest = String.Empty;
            }
            else
            {
              pos += sep.Length;
              rest = bufStr.Substring(pos);
              sw.Write(bufStr.Substring(0, pos).Replace(Environment.NewLine, " ").Replace(sep, Environment.NewLine));
            }
          }
        }
      }

    } // class Helper
  }
"@

  if (-not $?) { exit 1 }

}

Write-Host 'Processing file...'

# Make sure the .NET framework sees the same current dir. as PS.
[System.IO.Directory]::SetCurrentDirectory($PWD)

$tsFileProcessing = Measure-Command {
  [net.same2u.so.Helper]::TransformFile($inFile, $outFile, '#@#@#')
}

Write-Host @"

Completed:
  Compilation time:     $tsCompilation
  File-processing time: $tsFileProcessing
  Total:                $([datetimeoffset]::UtcNow - $dtStart)
"@

11 Comments

Add some parallel processing to that script and you should be able to cut the time down dramatically, while preserving memory :) -- upvoted for minimal impact on system resources. Still prone to leaks (as most large file operations are), but overall a workable solution in my opinion (just needs some PP)
Thanks, @SamuelJackson. I assume you're kidding about the parallel processing (do tell me if you're not, and what you had in mind). What leaks could there be?
Scroll down a ways on this site to see the parallel processing examples for reading in c# -- I believe PowerShell has some parallel processing capabilities (but I could be wrong). The memory leaks with manipulating large files come naturally with buffering data (which is what is necessary to 'look' for something to replace with something else. It would almost be faster to do byte replacement, so replace '#@#@#' with '\n ' (5 spaces) as this wouldn't require the initial pointer to move.
Ugh, only marginally better than my original 1 byte at a time batch/JScript solution: 39 min to process a 155 MB file vs 52 min. That still scales to 6.5 hrs for a 1.5 GB file - not very practical. Compare that to my latest batch/JScript solution at 2 min 48 sec for a 1.8 GB file. I imagine you could translate my JScript algorithm into powershell and get a bit more performance.
@dbenham: Thanks for running a benchmark. I knew it wouldn't be fast, but I also didn't expect it to be that slow. PowerShell is the champion of abstraction, which is great, but at the expense of performance, which sometimes makes its use impractical.
