
I have a batch file that checks for duplicate lines in a TXT file (over one million lines, about 13 MB), and it runs for over 2 hours... how can I speed it up? Thank you!!

TXT file (over one million lines):

11
22
33
44
...
44

Existing Batch

setlocal
set var1=*
sort original.txt>sort.txt
for /f %%a in ('type sort.txt') do (call :run %%a)
goto :end

:run
if %1==%var1% echo %1>>duplicate.txt
set var1=%1
goto :eof

:end
  • Use PowerShell? Commented Mar 3, 2017 at 9:06
  • @RogerLipscombe Or no CLI at all Commented Mar 3, 2017 at 9:11
  • I have only tried running it as a BAT file... Could you show me the PowerShell code for that? Commented Mar 3, 2017 at 9:16
  • I'm testing with PowerShell code ( $lines = @(); Get-Content 1.txt | %{ if (($lines -eq $_).length -eq 0) {$lines = $lines + $_}}; $lines > done.txt ) and it is still running after over 45 mins... not done yet (see the sketch after these comments) Commented Mar 3, 2017 at 10:07
  • Get-Content .\example.txt | Group-Object | Where { $_.Count -ne 1 } Commented Mar 3, 2017 at 13:09
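
For context (not from the original thread): the array-based attempt in the comment above rescans the whole $lines array for every input line, which is roughly O(n²) over a million lines. Below is a minimal PowerShell sketch using a HashSet instead, assuming the input is original.txt and the duplicates should go to duplicate.txt (adjust names and paths as needed); each membership test is then close to constant time:

# Hypothetical sketch, not the OP's script: HashSet lookups avoid rescanning an ever-growing array.
$seen = [System.Collections.Generic.HashSet[string]]::new()
$dups = [System.Collections.Generic.HashSet[string]]::new()
# ReadLines streams the file line by line; use a full path if the script is not run from the file's folder.
foreach ($line in [System.IO.File]::ReadLines("$PWD\original.txt")) {
    # Add returns $false when the line was already seen, i.e. the line is a duplicate.
    if (-not $seen.Add($line)) { [void]$dups.Add($line) }
}
$dups | Set-Content duplicate.txt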

4 Answers


This should be the fastest method using a Batch file:

@echo off
setlocal EnableDelayedExpansion
set var1=*
sort original.txt>sort.txt
(for /f %%a in (sort.txt) do (
   if "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt

4 Comments

Since sort works case-insensitively, there might be some duplicates not detected: imagine three lines duplicate, Duplicate, duplicate; your script is not going to report duplicates, unless you add /I to your if query; if the OP wants a case-sensitive approach, sort will not help... (this is not a revenge comment ;-))
@aschipfl: I suppose you are right, although the original code does not have the /I switch and the example data are just numbers... Only the OP can clear this point up. And speaking of revenge, I invite you to review my new solution! ;)
Just to clear this point up: are you saying that your original method took over 2 hr, the PowerShell method took over 45 mins, and my solution took 1 min? Using the same data file? :) I'd appreciate it if you post the times in HH:MM:SS format for all the methods posted here that you have tested...
Ah! And please do an additional test changing this line: sort original.txt>sort.txt to this one: sort original.txt /O sort.txt

This method uses the findstr command as in aschipfl's answer, but in this case each line and its duplicates are removed from the file after being checked by findstr. This method could be faster if the number of duplicates in the file is high; otherwise it will be slower because of the high volume of data manipulated in each pass. Only a test can confirm this point...

@echo off
setlocal EnableDelayedExpansion

del duplicate.txt 2>NUL
copy /Y original.txt input.txt > NUL

:nextTurn
for %%a in (input.txt) do if %%~Za equ 0 goto end
< input.txt (
   set /P "line="
   findstr /X /C:"!line!"
   find /V "!line!" > output.txt
) >> duplicate.txt
move /Y output.txt input.txt > NUL
goto nextTurn

:end

3 Comments

Although I am not sure whether find /V "!line!" should be replaced by findstr /V /X /C:"!line!", I like this method because it does not loop through the text file line by line; +1...
@aschipfl: The findstr command gets the duplicates and outputs them to duplicate.txt. The find command deletes the duplicates and stores the rest of the lines in output.txt. Further details here
Since you want to handle whole lines, findstr /X is needed; find also matches in case the search string is found in the middle of a line...
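
To illustrate the whole-line point from the comments above (a made-up demo, not part of any answer): with a file holding the two lines 22 and 122, plain find "22" matches both lines because it searches for substrings, while findstr /X /C:"22" matches only the line that is exactly 22.

@echo off
rem Hypothetical demo: build a two-line file containing 22 and 122
> demo.txt  echo 22
>> demo.txt echo 122

rem Substring match: prints both 22 and 122
find "22" demo.txt

rem Whole-line match: prints only 22
findstr /X /C:"22" demo.txt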
This may be faster - I don't have your file to test against:

@echo off
setlocal enabledelayedexpansion
set var1=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" echo %%a
  set "var1=%%a"
 )
)>"u:\q42574625_2.txt"

GOTO :EOF

I used a file named q42574625.txt containing some dummy data for my testing.

It's not clear whether you want only one instance of a duplicate line or not. Your code would produce 5 "duplicate" lines if there were 6 identical lines in the source file.

Here's a version which will report each duplicated line only once:

@echo off
setlocal enabledelayedexpansion
set var1=*
set var2=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" IF "!var2!" neq "%%a" echo %%a&SET "var2=%%a"
  set "var1=%%a"
 )
)>"u:\q42574625.txt"

GOTO :EOF

2 Comments

Thank you for your code, I'm trying it and will report back once it's done!
Your code is faster than mine (yours: 120 minutes, mine: 160 minutes)... but I want it to finish within 30 minutes... still, I really appreciate your help!

Supposing you provide the text file as the first command line argument, you could try the following:

@echo off
for /F "usebackq delims=" %%L in ("%~1") do (
    for /F "delims=" %%K in ('findstr /X /C:"%%L" "%~1" ^| find /C /V ""') do (
        if %%K GTR 1 echo %%L
    )
)

This returns all duplicate lines, but multiple times each, namely as often as each occurs in the file.

3 Comments

Thank you for your code, I'm trying it and will report back once it's done!
I am pretty sure that this method will be slower than the original. You are running three copies of cmd.exe (one for the nested for /F command and one more for each side of the pipe) plus findstr.exe (which processes the entire file) plus find.exe, for each line of the file!
@Aacini, yes, you might be right, I guess. I did not test it, but my thought was that the findstr command might be faster than for /F containing if comparisons and sub-routine calls.
