
I'm trying to parse a CSV of over 100,000 lines, and the performance problems don't even let me get to the end of the file before hitting "Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded".

Is something wrong, or is there any way I can improve this?

public static List<String[]> readCSV(String filePath) throws IOException {
    List<String[]> csvLine = new ArrayList<String[]>();
    CSVReader reader = new CSVReader(new FileReader(filePath), '\n');
    String[] row;
    while ((row = reader.readNext()) != null) {
        csvLine.add(removeWhiteSpace(row[0].split(",")));
    }
    reader.close();
    return csvLine;
}

private static String[] removeWhiteSpace(String[] split) {
    for (int index = 0; index < split.length; index++) {
        split[index] = split[index].trim();
    }
    return split;
}
  • You are attempting to load the entire 100,000 line data set into memory. Increase the heap size to something larger than the expected size of the dataset, or change the program so it doesn't load all the data at once. Commented Nov 13, 2017 at 23:19
  • Don't store the entire CSV file in your program. Why are you reading the CSV? What do you intend to do with the data being read? Commented Nov 13, 2017 at 23:20
  • Create the objects as you go and calculate stats as you read the file, then discard the objects, and continue doing this until you reach the end of the CSV. What stats are you computing, and how do you turn the CSV data into an object? Commented Nov 13, 2017 at 23:23
  • Well, yes, the issue is that your algorithm is wrong. You don't appear to need to load all the data into memory; rewrite the code so it processes lines one at a time. Commented Nov 13, 2017 at 23:24
  • You can probably halve the memory requirement by combining the for and while loops: dividedList.add(removeWhiteSpace(row[0].split(","))); Commented Nov 13, 2017 at 23:35
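The commenters' suggestion, process each line as you read it and keep only the running statistics, can be sketched with just the JDK. This is a minimal illustration, not the asker's actual code: the countRows name and the in-memory input string are made up, and a BufferedReader over a FileReader would replace the StringReader for a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StreamingCsv {
    // Process one line at a time instead of collecting every row in a list.
    // Only the running statistics stay in memory, so heap use stays constant
    // regardless of file size.
    public static long countRows(BufferedReader reader) throws IOException {
        long rows = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");  // naive split; a real CSV parser handles quoting
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim();
            }
            rows++;  // replace with whatever per-row work you need
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "a, b ,c\n1,2,3\n4,5,6";
        System.out.println(countRows(new BufferedReader(new StringReader(csv))));
    }
}
```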

2 Answers


First, you are running out of memory because all rows are being added to a list.

Second, you are using String.split(), which is extremely slow.

Third, never try to process CSV by writing your own parsing code, as there are many edge cases around this format (escaped delimiters, quotes, etc.).

The solution is to use a library for that, such as univocity-parsers. You should be able to read 1 million rows in less than a second.

To parse, just do this:

public static IterableResult<String[], ParsingContext> readCSV(String filePath) {
    File file = new File(filePath);

    // Configure the parser here. By default all values are trimmed.
    CsvParserSettings parserSettings = new CsvParserSettings();

    // Create the parser.
    CsvParser parser = new CsvParser(parserSettings);

    // Create an iterable over rows. This will not load everything into memory.
    IterableResult<String[], ParsingContext> rows = parser.iterate(file);
    return rows;
}

Now you can use your method like this:

public static void main(String... args) {
    IterableResult<String[], ParsingContext> rows = readCSV("c:/path/to/input.csv");
    try {
        for (String[] row : rows) {
            // Process the rows however you want.
        }
    } finally {
        // The parser closes itself, but in case of any errors while processing
        // the rows (outside the control of the iterator), close the parser.
        rows.getContext().stop();
    }
}

This is just an example of how you can use the parser, but there are many different ways to use it.

Now for writing, you can do this:

public static void main(String... args) {
    // This is your output file.
    File output = new File("c:/path/to/output.csv");

    // Configure the writer if you need to.
    CsvWriterSettings settings = new CsvWriterSettings();

    // Create the writer. Here we write to a file.
    CsvWriter writer = new CsvWriter(output, settings);

    // Get the row iterator.
    IterableResult<String[], ParsingContext> rows = readCSV("c:/temp");

    try {
        // Do whatever you need to the rows here,
        for (String[] row : rows) {
            // then write each one to the output.
            writer.writeRow(row);
        }
    } finally {
        // Cleanup.
        rows.getContext().stop();
        writer.close();
    }
}

If all you want is to read the data, modify it and write it back to another file, you can just do this:

public static void main(String... args) throws IOException {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setProcessor(new AbstractRowProcessor() {
        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            // Modify the row data here.
        }
    });

    CsvWriterSettings writerSettings = new CsvWriterSettings();

    CsvRoutines routines = new CsvRoutines(parserSettings, writerSettings);

    FileReader input = new FileReader("c:/path/to/input.csv");
    FileWriter output = new FileWriter("c:/path/to/output.csv");

    routines.parseAndWrite(input, output);
}

Hope this helps.

Disclaimer: I'm the author of this library. It's open source and free (Apache 2.0 license).


2 Comments

Hey, if you have time I'd like to ask a quick question (using the library, which is great, btw). I'm trying to parse one row of CSV1, compare it to each row of CSV2, then move to the next row in CSV1 and repeat. Would you expect the cost of this to be less than populating objects with the data and doing the same? CSV1 will have 1 million lines.
The easiest thing to do is to run your app with -Xms8G -Xmx8G and load both files as lists in memory. Then sort both lists in memory and run the comparison sequentially. If your data is too big to fit in memory, you can probably use a database to store it. Only go with the file-based approach if there's no way out of it.
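The sort-then-scan idea in that reply can be sketched as below. The inBoth name and the sample rows are invented for illustration; the point is that after sorting, one merge-style pass in O(n + m) replaces rescanning all of CSV2 for every CSV1 row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortCompare {
    // Sort both lists (O(n log n)), then walk them together in a single pass,
    // instead of the O(n * m) cost of comparing each CSV1 row to every CSV2 row.
    public static List<String> inBoth(List<String> a, List<String> b) {
        a.sort(null);  // natural ordering
        b.sort(null);
        List<String> common = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                common.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;  // only in a; advance a
            } else {
                j++;  // only in b; advance b
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> csv1 = new ArrayList<>(Arrays.asList("row3", "row1", "row2"));
        List<String> csv2 = new ArrayList<>(Arrays.asList("row2", "row4", "row3"));
        System.out.println(inBoth(csv1, csv2)); // prints [row2, row3]
    }
}
```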

It's a design error to try to put such a large file in memory. Depending on what you want to do, you should either write a new, processed file or put the lines into a database. This implements the first:

FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // Note that Scanner suppresses exceptions.
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
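The "write a new, processed file" half of that answer can be sketched as a single read-transform-write pass. This is an editor's illustration, not part of the original answer: it reads from an in-memory string and writes to a StringWriter so it is self-contained; in real code you would use a Scanner over a FileInputStream and a BufferedWriter over the output file, and the trim() stands in for whatever per-line processing you need.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Scanner;

public class ProcessAndWrite {
    // One pass: read a line, transform it, write it out, discard it.
    // Nothing accumulates in memory beyond the current line.
    public static String process(String input) {
        StringWriter out = new StringWriter();
        try (Scanner sc = new Scanner(input); PrintWriter pw = new PrintWriter(out)) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine().trim();  // example transformation
                pw.println(line);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(process("  a,b  \n c,d "));
    }
}
```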

1 Comment

How does this illustrate implementing writing a new file?
