
I have a batch file that checks for duplicate lines in a TXT file (over one million lines, about 13 MB), and it runs for over 2 hours... how can I speed it up? Thank you!!

TXT file (over one million lines):

11
22
33
44
...
44

Existing Batch

setlocal
set var1=*
sort original.txt>sort.txt
for /f %%a in ('type sort.txt') do (call :run %%a)
goto :end

:run
if %1==%var1% echo %1>>duplicate.txt
set var1=%1
goto :eof

:end
  • Use PowerShell? Commented Mar 3, 2017 at 9:06
  • @RogerLipscombe Or no CLI at all Commented Mar 3, 2017 at 9:11
  • I have only tried running it as a BAT file... Could you show me the PowerShell code for that? Commented Mar 3, 2017 at 9:16
  • I'm testing with PowerShell code ( $lines = @(); Get-Content 1.txt | %{ if (($lines -eq $_).length -eq 0) {$lines = $lines + $_}}; $lines > done.txt ) and it is still running after over 45 mins... not done yet (see the sketch after these comments) Commented Mar 3, 2017 at 10:07
  • Get-Content .\example.txt | Group-Object | Where { $_.Count -ne 1 } Commented Mar 3, 2017 at 13:09
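
For context (not from the original thread): the array-based attempt in the comment above rescans the whole $lines array for every input line, which is roughly O(n²) over a million lines. Below is a minimal PowerShell sketch using a HashSet instead, assuming the input is original.txt and the duplicates should go to duplicate.txt (adjust names and paths as needed); each membership test is then close to constant time:

# Hypothetical sketch, not the OP's script: HashSet lookups avoid rescanning an ever-growing array.
$seen = [System.Collections.Generic.HashSet[string]]::new()
$dups = [System.Collections.Generic.HashSet[string]]::new()
# ReadLines streams the file line by line; use a full path if the script is not run from the file's folder.
foreach ($line in [System.IO.File]::ReadLines("$PWD\original.txt")) {
    # Add returns $false when the line was already seen, i.e. the line is a duplicate.
    if (-not $seen.Add($line)) { [void]$dups.Add($line) }
}
$dups | Set-Content duplicate.txt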

4 Answers


This should be the fastest method using a Batch file:

@echo off
setlocal EnableDelayedExpansion
set var1=*
sort original.txt>sort.txt
(for /f %%a in (sort.txt) do (
   if "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt

4 Comments

Since sort works case-insensitively, there might be some duplicates not detected: imagine three lines duplicate, Duplicate, duplicate; your script is not going to report duplicates, unless you add /I to your if query; if the OP wants a case-sensitive approach, sort will not help... (this is not a revenge comment ;-))
@aschipfl: I suppose you are right, although the original code does not have the /I switch and the example data are just numbers... Only the OP can clear this point up. And speaking of revenge, I invite you to review my new solution! ;)
Just to clear this point up: are you saying that your original method took over 2 hr, the PowerShell method took over 45 mins, and my solution took 1 min? Using the same data file? :) I'd appreciate it if you post the times in HH:MM:SS format for all the methods posted here that you have tested...
Ah! And please do an additional test changing this line: sort original.txt>sort.txt to this one: sort original.txt /O sort.txt

This method uses the findstr command as in aschipfl's answer, but in this case each line and its duplicates are removed from the file after being checked by findstr. This method could be faster if the number of duplicates in the file is high; otherwise it will be slower because of the high volume of data manipulated in each pass. Only a test can confirm this point...

@echo off
setlocal EnableDelayedExpansion

del duplicate.txt 2>NUL
copy /Y original.txt input.txt > NUL

:nextTurn
for %%a in (input.txt) do if %%~Za equ 0 goto end
< input.txt (
   set /P "line="
   findstr /X /C:"!line!"
   find /V "!line!" > output.txt
) >> duplicate.txt
move /Y output.txt input.txt > NUL
goto nextTurn

:end

3 Comments

Although I am not sure whether find /V "!line!" should be replaced by findstr /V /X /C:"!line!", I like this method because it does not loop through the text file line by line; +1...
@aschipfl: The findstr command gets the duplicates and outputs them to duplicate.txt. The find command deletes the duplicates and stores the rest of the lines in output.txt. Further details here
Since you want to handle whole lines, findstr /X is needed; find also matches in case the search string is found in the middle of a line...
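
To illustrate the whole-line point from the comments above (a made-up demo, not part of any answer): with a file holding the two lines 22 and 122, plain find "22" matches both lines because it searches for substrings, while findstr /X /C:"22" matches only the line that is exactly 22.

@echo off
rem Hypothetical demo: build a two-line file containing 22 and 122
> demo.txt  echo 22
>> demo.txt echo 122

rem Substring match: prints both 22 and 122
find "22" demo.txt

rem Whole-line match: prints only 22
findstr /X /C:"22" demo.txt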
This may be faster - I don't have your file to test against:

@echo off
setlocal enabledelayedexpansion
set var1=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" echo %%a
  set "var1=%%a"
 )
)>"u:\q42574625_2.txt"

GOTO :EOF

I used a file named q42574625.txt containing some dummy data for my testing.

It's not clear whether you want only one instance of a duplicate line or not. Your code would produce 5 "duplicate" lines if there were 6 identical lines in the source file.

Here's a version which will report each duplicated line only once:

@echo off
setlocal enabledelayedexpansion
set var1=*
set var2=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" IF "!var2!" neq "%%a" echo %%a&SET "var2=%%a"
  set "var1=%%a"
 )
)>"u:\q42574625.txt"

GOTO :EOF

2 Comments

Thank you for your code, I'm trying it and will report back once it's done!
Your code is faster than mine (yours: 120 minutes, mine: 160 minutes)... but I want it to finish within 30 minutes... still, I really appreciate your help!

Supposing you provide the text file as the first command line argument, you could try the following:

@echo off
for /F "usebackq delims=" %%L in ("%~1") do (
    for /F "delims=" %%K in ('findstr /X /C:"%%L" "%~1" ^| find /C /V ""') do (
        if %%K GTR 1 echo %%L
    )
)

This returns all duplicate lines, but multiple times each, namely as often as each occurs in the file.

3 Comments

Thank you for your code, I'm trying it and will report back once it's done!
I am pretty sure that this method will be slower than the original. You are running three copies of cmd.exe (one for the nested for /F command and one more for each side of the pipe) plus findstr.exe (which processes the entire file) plus find.exe, for each line of the file!
@Aacini, yes, you might be right, I guess. I did not test it, but my thought was that the findstr command might be faster than for /F containing if comparisons and sub-routine calls.
