Given an input file of text lines, I want duplicate lines to be identified and removed. Please show a simple snippet of C# that accomplishes this.
- There are various methods, some easier to implement than others. The approach to take can depend on the size of the text file and the expected number of matching lines. Can you describe the specific problem you're trying to solve? Thanks :) (Binary Worrier, Aug 7, 2009 at 15:47)
- . . . and the desired performance. (Binary Worrier, Aug 7, 2009 at 15:48)
5 Answers
For small files:
    // Requires: using System.IO; using System.Linq;
    string[] lines = File.ReadAllLines("filename.txt");
    File.WriteAllLines("filename.txt", lines.Distinct().ToArray());
This should do (and will cope with large files).
Note that it only removes duplicate consecutive lines, i.e.

    a
    b
    b
    c
    b
    d

will end up as

    a
    b
    c
    b
    d

If you want no duplicates anywhere, you'll need to keep a set of lines you've already seen.
    using System;
    using System.IO;

    class DeDuper
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: DeDuper <input file> <output file>");
                return;
            }
            using (TextReader reader = File.OpenText(args[0]))
            using (TextWriter writer = File.CreateText(args[1]))
            {
                string currentLine;
                string lastLine = null;

                while ((currentLine = reader.ReadLine()) != null)
                {
                    if (currentLine != lastLine)
                    {
                        writer.WriteLine(currentLine);
                        lastLine = currentLine;
                    }
                }
            }
        }
    }

Note that this assumes Encoding.UTF8, and that you want to use files. It's easy to generalize as a method though:
    static void CopyLinesRemovingConsecutiveDupes
        (TextReader reader, TextWriter writer)
    {
        string currentLine;
        string lastLine = null;

        while ((currentLine = reader.ReadLine()) != null)
        {
            if (currentLine != lastLine)
            {
                writer.WriteLine(currentLine);
                lastLine = currentLine;
            }
        }
    }

(Note that this doesn't close anything; the caller should do that.)
Here's a version that will remove all duplicates, rather than just consecutive ones:
    // Requires: using System.Collections.Generic; using System.IO;
    static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        HashSet<string> previousLines = new HashSet<string>();

        while ((currentLine = reader.ReadLine()) != null)
        {
            // Add returns true if it was actually added,
            // false if it was already there
            if (previousLines.Add(currentLine))
            {
                writer.WriteLine(currentLine);
            }
        }
    }
For a long file (and non-consecutive duplications) I'd copy the file line by line, building a hash/position lookup table as I went.
As each line is copied, check for its hashed value in the table. If there is a collision, double-check that the line at the stored position really is the same, and if so skip it and move to the next.
Only worth it for fairly large files though.
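
A minimal sketch of the hash/position idea described above, not a definitive implementation. It assumes UTF-8 input without a byte order mark, `\n` or `\r\n` line endings, and an output path different from the input path. The names `HashPositionDeDuper` and `ReadLineAt` are hypothetical, introduced only for illustration. Only a hash code and a byte offset are kept in memory per unique line; on a hash match, the earlier line is re-read from the input file and compared.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class HashPositionDeDuper
{
    // Reads one line of raw bytes from the stream's current position.
    // Returns null at end of stream; drops the '\n' and any trailing '\r'.
    static string ReadLineAt(Stream s)
    {
        var bytes = new List<byte>();
        bool sawAny = false;
        int b;
        while ((b = s.ReadByte()) != -1)
        {
            sawAny = true;
            if (b == '\n') break;
            bytes.Add((byte)b);
        }
        if (!sawAny) return null;
        if (bytes.Count > 0 && bytes[bytes.Count - 1] == '\r')
            bytes.RemoveAt(bytes.Count - 1);
        return Encoding.UTF8.GetString(bytes.ToArray()); // assumption: UTF-8
    }

    public static void CopyLinesRemovingAllDupes(string inputPath, string outputPath)
    {
        // hash code -> byte offsets of the unique lines that produced it
        var positionsByHash = new Dictionary<int, List<long>>();

        using (var input = new FileStream(inputPath, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var lookup = new FileStream(inputPath, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(false)))
        {
            while (true)
            {
                long start = input.Position;
                string line = ReadLineAt(input);
                if (line == null) break;

                int hash = line.GetHashCode();
                bool duplicate = false;
                List<long> positions;
                if (positionsByHash.TryGetValue(hash, out positions))
                {
                    // Same hash seen before: re-read each earlier line to
                    // confirm it is a real duplicate and not a collision.
                    foreach (long pos in positions)
                    {
                        lookup.Seek(pos, SeekOrigin.Begin);
                        if (ReadLineAt(lookup) == line) { duplicate = true; break; }
                    }
                }
                else
                {
                    positions = new List<long>();
                    positionsByHash[hash] = positions;
                }

                if (!duplicate)
                {
                    positions.Add(start);
                    writer.WriteLine(line);
                }
            }
        }
    }
}
```

Reading byte by byte keeps the offsets exact, which a buffered `StreamReader` would not; the cost is an extra seek and read for every repeated hash, which is why this only pays off when the set of unique lines is too large to hold in memory.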
Here's a streaming approach that should incur less overhead than reading all unique strings into memory.
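    // Requires: using System.Collections.Generic; using System.IO;
    // Caveat: only hash codes are stored, so two different lines that share
    // a hash code are treated as duplicates and the later one is dropped.
    using (var sr = new StreamReader(File.OpenRead(@"C:\Temp\in.txt")))
    // File.Create truncates an existing file; File.OpenWrite would not.
    using (var sw = new StreamWriter(File.Create(@"C:\Temp\out.txt")))
    {
        var lines = new HashSet<int>();
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            int hc = line.GetHashCode();
            // Add returns false if the hash code was already present.
            if (!lines.Add(hc))
                continue;
            sw.WriteLine(line);
        }
    }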
I am new to .NET and have written something simpler, though it may not be very efficient. Please feel free to share your thoughts.
    using System;
    using System.Collections.Generic;
    using System.IO;

    class Program
    {
        static void Main(string[] args)
        {
            string[] emp_names = File.ReadAllLines("D:\\Employee Names.txt");
            List<string> newemp1 = new List<string>();

            for (int i = 0; i < emp_names.Length; i++)
            {
                newemp1.Add(emp_names[i]); // passing data to newemp1 from emp_names
            }

            for (int i = 0; i < emp_names.Length; i++)
            {
                List<string> temp = new List<string>();
                int duplicate_count = 0;

                for (int j = newemp1.Count - 1; j >= 0; j--)
                {
                    if (emp_names[i] != newemp1[j]) // checking for duplicate records
                        temp.Add(newemp1[j]);
                    else
                    {
                        duplicate_count++;
                        if (duplicate_count == 1)
                            temp.Add(emp_names[i]);
                    }
                }
                newemp1 = temp;
            }

            string[] newemp = newemp1.ToArray(); // assigning into a string array
            Array.Sort(newemp);
            File.WriteAllLines("D:\\Employee Names.txt", newemp); // now writing the data to a text file
            Console.ReadLine();
        }
    }
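
For comparison, the same end result (unique lines, written back in sorted order) can probably be reached without the nested loops. This is only a sketch: `SortedDeDuper` and `RewriteUniqueSorted` are hypothetical names, and it assumes the file fits in memory, as the code above already does.

```csharp
using System.IO;
using System.Linq;

static class SortedDeDuper
{
    // Hypothetical helper: rewrites the file at 'path' with its unique
    // lines in sorted order.
    public static void RewriteUniqueSorted(string path)
    {
        string[] unique = File.ReadAllLines(path)
            .Distinct()            // keeps the first occurrence of each line
            .OrderBy(line => line) // default string ordering, as Array.Sort uses
            .ToArray();
        File.WriteAllLines(path, unique);
    }
}
```

`Distinct` uses a hash set internally, so this does roughly one pass over the lines plus a sort, rather than one full pass per input line.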