I want to parse a large CSV file as quickly and efficiently as possible.
Currently, I am using the opencsv library to parse my CSV file, but it takes approximately 10 seconds to parse a file with 10,776 records and 24 headings, and I need to parse files with millions of records.
I am using the following Maven dependency:

```xml
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>4.1</version>
</dependency>
```

I parse the file with the code snippet below.
```java
import com.opencsv.bean.CsvToBean;
import com.opencsv.bean.CsvToBeanBuilder;
import com.opencsv.bean.HeaderColumnNameMappingStrategy;
import com.opencsv.enums.CSVReaderNullFieldIndicator;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.List;

public <T> List<T> convertStreamtoObject(InputStream inputStream, Class<T> clazz) throws IOException {
    HeaderColumnNameMappingStrategy<T> ms = new HeaderColumnNameMappingStrategy<>();
    ms.setType(clazz);
    // try-with-resources closes the reader (and the underlying stream) even if parsing throws
    try (Reader reader = new InputStreamReader(inputStream)) {
        CsvToBean<T> cb = new CsvToBeanBuilder<T>(reader)
                .withType(clazz)
                .withMappingStrategy(ms)
                .withSkipLines(0)
                .withSeparator('|')
                .withFieldAsNull(CSVReaderNullFieldIndicator.EMPTY_SEPARATORS)
                .withThrowExceptions(true)
                .build();
        return cb.parse();
    }
}
```

I am looking for suggestions for another way to parse a CSV file with millions of records in less time.
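As a side note on memory rather than raw speed: `CsvToBean` implements `Iterable`, so with millions of records the beans can be consumed one at a time instead of being collected into a single list. A minimal sketch of that approach; `MyBean` and `data.csv` are hypothetical placeholders:

```java
import com.opencsv.bean.CsvToBean;
import com.opencsv.bean.CsvToBeanBuilder;

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

// MyBean and "data.csv" are hypothetical placeholders.
try (Reader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
    CsvToBean<MyBean> csvToBean = new CsvToBeanBuilder<MyBean>(reader)
            .withType(MyBean.class)
            .withSeparator('|')
            .build();
    // Iterating CsvToBean converts rows lazily, one bean at a time,
    // so the whole file never has to be held in memory at once.
    for (MyBean bean : csvToBean) {
        // process bean
    }
}
```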
--- Update ---

I also tried Apache Commons CSV:
```java
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

Reader reader = new InputStreamReader(in);
CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
        .withFirstRecordAsHeader()
        .withDelimiter('|')
        .withIgnoreHeaderCase()
        .withTrim());
List<CSVRecord> recordList = csvParser.getRecords();
for (CSVRecord csvRecord : recordList) {
    csvRecord.get("headername");
}
```
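Note that `getRecords()` loads every record into memory at once; `CSVParser` itself is `Iterable`, so very large files can be streamed record by record instead. A minimal sketch of that variant, reusing the same format settings:

```java
try (Reader reader = new InputStreamReader(in);
     CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
             .withFirstRecordAsHeader()
             .withDelimiter('|')
             .withIgnoreHeaderCase()
             .withTrim())) {
    // Iterating the parser reads one record at a time instead of
    // materializing the whole file into a List first.
    for (CSVRecord csvRecord : csvParser) {
        String value = csvRecord.get("headername");
    }
}
```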
---

Comments:

- Wrapping the input in a `BufferedInputStream`/`BufferedReader` doesn't gain anything, unless you assume that openCSV fails to enable buffering on its own. I just looked it up: `this.br = (reader instanceof BufferedReader ? (BufferedReader) reader : new BufferedReader(reader));`, so the OP doesn't need to test with any buffered stream or reader; openCSV already does that.
- … `Class` argument. Perhaps a different library performs better; there is not enough information to answer that. The only thing that can be said for sure is that additional buffering won't help.
- … an `Instant`, and 22 `UUID` columns as canonical hex strings. It takes 10 seconds merely to read the 850 MB file, and another two to parse the cell values back into objects. Parsing ten thousand records took about half a second, versus the 10 seconds you reported, roughly a 20-fold speedup.
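For timing comparisons like the one described in the last comment, a simple wall-clock measurement around the parse call is usually enough. A minimal sketch, assuming the `convertStreamtoObject` method from the question above; `sample.csv` and `MyBean` are hypothetical placeholders:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// "sample.csv" and MyBean are hypothetical placeholders.
try (InputStream in = Files.newInputStream(Paths.get("sample.csv"))) {
    long start = System.nanoTime();
    List<MyBean> rows = convertStreamtoObject(in, MyBean.class);
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Parsed " + rows.size() + " records in " + elapsedMillis + " ms");
}
```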