
I'm trying to read from a CSV file, but it's slow. Here's the code, roughly explained:

private static Film[] readMoviesFromCSV() {
    // Regex to split by comma without splitting inside double quotes.
    // https://regexr.com/3s3me <- example on this data
    var pattern = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    Film[] films = null;
    try (var br = new BufferedReader(new FileReader(FILENAME))) {
        var start = System.currentTimeMillis();
        // skip first line and read into a List
        var temparr = br.lines().skip(1).collect(Collectors.toList());
        films = temparr.stream().parallel()
                .map(pattern::split)
                .filter(x -> x.length == 24 && x[7].equals("en")) // all fields present (24 total) and English-language movies
                .filter(x -> x[14].length() > 0)                  // check that it has x[14] (date)
                .map(movieData -> new Film(movieData[8], movieData[9], movieData[14],
                        movieData[22], movieData[23], movieData[7]))
                // movieData[8]  = String title
                // movieData[9]  = String overview
                // movieData[14] = String date (constructor parses it to a LocalDate object)
                // movieData[22] = String avgRating
                .toArray(Film[]::new);
        System.out.println(MessageFormat.format("Execution time: {0}", (System.currentTimeMillis() - start)));
        System.out.println(films.length);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return films;
}

The file is about 30 MB and it takes about 3-4 seconds on average. I'm using streams, but it's still really slow. Is it because of the regex split on each line?
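One way to check is to time the read and the split as separate phases. A minimal sketch, assuming it lives in the same class as readMoviesFromCSV (so FILENAME and the imports are already in place; the method name is made up for illustration):

private static void timeReadVsSplit() throws IOException {
    long t0 = System.nanoTime();
    List<String> lines;
    try (var br = new BufferedReader(new FileReader(FILENAME))) {
        // phase 1: just read all lines into memory
        lines = br.lines().skip(1).collect(Collectors.toList());
    }
    long t1 = System.nanoTime();

    // phase 2: only split, same regex as above, no filtering or object creation
    var pattern = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    long rows = lines.stream().map(pattern::split).count();
    long t2 = System.nanoTime();

    System.out.println("rows parsed: " + rows);
    System.out.println("read  ms: " + (t1 - t0) / 1_000_000);
    System.out.println("split ms: " + (t2 - t1) / 1_000_000);
}

If the second number dominates, the regex split really is the bottleneck.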

EDIT: I've managed to speed up reading and processing by about 3x with the uniVocity-parsers library. On average it takes 950 ms to finish. That's pretty impressive.

private static Film[] readMoviesWithLib() {
    Film[] films = null;

    // configure the parser to collect all rows via a RowListProcessor
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    RowListProcessor rowProcessor = new RowListProcessor();
    parserSettings.setProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);

    CsvParser parser = new CsvParser(parserSettings);
    var start = System.currentTimeMillis();
    try {
        parser.parse(new BufferedReader(new FileReader(FILENAME)));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    // filter the collected rows and map them to Film objects
    List<String[]> rows = rowProcessor.getRows();
    films = rows.stream()
            .filter(Objects::nonNull)
            .filter(x -> x.length == 24 && x[14] != null && x[7] != null)
            .filter(x -> x[7].equals("en"))
            .map(movieData -> new Film(movieData[8], movieData[9], movieData[14],
                    movieData[22], movieData[23], movieData[7]))
            .toArray(Film[]::new);

    System.out.printf(MessageFormat.format("Time: {0}", (System.currentTimeMillis() - start)));
    return films;
}
  • And if you remove .parallel()? Not sure it helps you here. Commented Jul 10, 2018 at 11:04
  • @azro About the same, maybe a little slower without parallel, but it's not timed correctly so I can't be sure. I'll try reading it with some library. Commented Jul 10, 2018 at 11:17
  • To time it more precisely, use System.nanoTime(); it's more accurate. Commented Jul 10, 2018 at 11:26
  • Please be aware that you're not measuring the time needed to read the file; you're measuring the time needed to read the file and to process it. It may very well be that reading the file is fast but processing is slow. Commented Jul 10, 2018 at 11:40
  • Regex expressions are not lightning fast; "manual" searching for the comma will be faster (and probably is). See the sketch after these comments. Commented Jul 20, 2018 at 0:28
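A minimal sketch of what that "manual" search could look like: scan each line once and split on commas only while outside double quotes. The class name is made up for illustration, and it does not handle escaped quotes ("") or fields with embedded newlines, which is one reason a parser library is still the safer choice:

import java.util.ArrayList;
import java.util.List;

final class SimpleCsvSplitter {

    // Split a single CSV line on commas, ignoring commas inside double quotes.
    static String[] splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle quoted state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // last field has no trailing comma
        return fields.toArray(new String[0]);
    }
}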

1 Answer


Author of the univocity-parsers library here. You can speed up the code you posted in your edit a little bit further by rewriting it like this:

// initialize an ArrayList with a good size to avoid reallocation
final ArrayList<Film> films = new ArrayList<Film>(20000);

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);

// don't generate strings for columns you don't want
parserSettings.selectIndexes(7, 8, 9, 14, 22, 23);

// keep generating rows with the same number of columns found in the input;
// indexes not selected will have nulls as they are not processed.
parserSettings.setColumnReorderingEnabled(false);

parserSettings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row.length == 24 && "en".equals(row[7]) && row[14] != null) {
            films.add(new Film(row[8], row[9], row[14], row[22], row[23], row[7]));
        }
    }
});

CsvParser parser = new CsvParser(parserSettings);

long start = System.currentTimeMillis();
try {
    parser.parse(new File(FILENAME), "UTF-8");
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

System.out.printf(MessageFormat.format("Time: {0}", (System.currentTimeMillis() - start)));

return films.toArray(new Film[0]);

For convenience, if you have to parse rows into different classes, you can also use annotations in your Film class.
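A rough sketch of that annotation-based approach, assuming a Film bean with a no-argument constructor; the field names below are guesses for illustration, only the column indexes come from the code above, and FILENAME is the same constant used in the question:

import java.io.File;
import java.util.List;

import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

// Bind columns to fields by index; the parser fills these in for each row.
public class Film {
    @Parsed(index = 7)  private String language;
    @Parsed(index = 8)  private String title;
    @Parsed(index = 9)  private String overview;
    @Parsed(index = 14) private String releaseDate;
    @Parsed(index = 22) private String avgRating;
    @Parsed(index = 23) private String voteCount;
}

// Usage: let the parser build Film instances directly instead of String[] rows.
CsvParserSettings settings = new CsvParserSettings();
settings.setLineSeparatorDetectionEnabled(true);
settings.setHeaderExtractionEnabled(true);
BeanListProcessor<Film> beanProcessor = new BeanListProcessor<>(Film.class);
settings.setProcessor(beanProcessor);

new CsvParser(settings).parse(new File(FILENAME), "UTF-8");
List<Film> films = beanProcessor.getBeans();

With this, the filtering on language and release date would still have to be done on the resulting list (or in a custom processor), but all the column-to-field mapping boilerplate goes away.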

Hope this helps.
