
I have a relatively inefficient CSVReader, see below. It takes more than 30 seconds to read 30,000+ lines. How can I speed up this reading process as much as possible?

public class DataReader {
    private String csvFile;
    private List<String> sub = new ArrayList<String>();
    private List<List> master = new ArrayList<List>();

    public void ReadFromCSV(String csvFile) {
        String line = "";
        String cvsSplitBy = ",";
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            System.out.println("Header " + br.readLine());
            while ((line = br.readLine()) != null) {
                // use comma as separator
                String[] list = line.split(cvsSplitBy);
                // System.out.println("the size is " + country[1]);
                for (int i = 0; i < list.length; i++) {
                    sub.add(list[i]);
                }
                List<String> temp = (List<String>) ((ArrayList<String>) sub).clone();
                // master.add(new ArrayList<String>(sub));
                master.add(temp);
                sub.removeAll(sub);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(master);
    }

    public List<List> getMaster() {
        return master;
    }
}

UPDATE: I have found that my code can actually finish the reading work in less than 1 second if run separately. This DataReader is a part used by my simulation model to initialize the relevant properties. The following part, which uses the imported data, is what actually takes 40 seconds to finish! Could anyone help by looking at this generic part of the code?

// add route network
Network<Object> net = (Network<Object>) context.getProjection("IntraCity Network");
IndexedIterable<Object> local_hubs = context.getObjects(LocalHub.class);
for (int i = 0; i <= CSV_reader_route.getMaster().size() - 1; i++) {
    String source = (String) CSV_reader_route.getMaster().get(i).get(0);
    String target = (String) CSV_reader_route.getMaster().get(i).get(3);
    double dist = Double.parseDouble((String) CSV_reader_route.getMaster().get(i).get(6));
    double time = Double.parseDouble((String) CSV_reader_route.getMaster().get(i).get(7));
    Object source_hub = null;
    Object target_hub = null;
    Query<Object> source_query = new PropertyEquals<Object>(context, "hub_code", source);
    for (Object o : source_query.query()) {
        if (o instanceof LocalHub) {
            source_hub = (LocalHub) o;
        }
        if (o instanceof GatewayHub) {
            source_hub = (GatewayHub) o;
        }
    }
    Query<Object> target_query = new PropertyEquals<Object>(context, "hub_code", target);
    for (Object o : target_query.query()) {
        if (o instanceof LocalHub) {
            target_hub = (LocalHub) o;
        }
        if (o instanceof GatewayHub) {
            target_hub = (GatewayHub) o;
        }
    }
    // System.out.println(target_hub.getClass() + " " + time);
    // Route this_route = (Route) net.addEdge(source_hub, target_hub);
    // context.add(this_route);
    // System.out.println(net.getEdge(source_hub, target_hub));
    if (net.getEdge(source, target) == null) {
        Route this_route = (Route) net.addEdge(source, target);
        context.add(this_route);
        // this_route.setDist(dist);
        // this_route.setTime(time);
    }
}
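Those two PropertyEquals queries per CSV row are the likely bottleneck: each query scans the context, so the loop costs O(rows × agents). Building a hub lookup table once before the loop turns each row into a constant-time map hit. A minimal sketch of the idea, using plain-Java stand-ins for the Repast types (the Hub interface and getHubCode method here are placeholders, not the real API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HubIndexSketch {

    // Placeholder stand-ins for the real agent classes
    interface Hub { String getHubCode(); }

    static class LocalHub implements Hub {
        private final String code;
        LocalHub(String code) { this.code = code; }
        public String getHubCode() { return code; }
    }

    static class GatewayHub implements Hub {
        private final String code;
        GatewayHub(String code) { this.code = code; }
        public String getHubCode() { return code; }
    }

    /** Build the hub_code -> hub index once, before the CSV loop. */
    static Map<String, Hub> indexHubs(Iterable<Hub> allHubs) {
        Map<String, Hub> byCode = new HashMap<>();
        for (Hub h : allHubs) {
            byCode.put(h.getHubCode(), h);
        }
        return byCode;
    }

    public static void main(String[] args) {
        Map<String, Hub> byCode = indexHubs(List.of(new LocalHub("SRC"), new GatewayHub("TGT")));
        // Per-row lookups are now O(1) map hits instead of full context scans:
        Hub source_hub = byCode.get("SRC");
        Hub target_hub = byCode.get("TGT");
        System.out.println(source_hub.getHubCode() + " -> " + target_hub.getHubCode());
    }
}
```

In the real model you would fill the map once from the context's agents (e.g. from the objects you already fetch with context.getObjects) and replace both query loops with byCode.get(source) and byCode.get(target).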
  • Have you tried not cloning the sub list, but just creating a new one on every iteration? Also, Arrays.asList() is probably faster than looping yourself. Commented Oct 25, 2019 at 3:26
  • 39 seconds to read 37490 lines. This is too much time for me. Commented Oct 25, 2019 at 3:29
  • You should probably also get a baseline for just reading the file so you know what you're up against. Little point in speeding up processing if IO is the bottleneck. Commented Oct 25, 2019 at 3:31
  • sub.removeAll(sub); - This seems a lot more expensive than sub.clear(); Commented Oct 25, 2019 at 3:31
  • @Jacob G. no significant difference Commented Oct 25, 2019 at 3:36

3 Answers


In your code you are doing many write operations just to add the list of values from the current row to your master list, which is not required. You can replace the existing code with the simpler one given below.

Existing code:

String[] list = line.split(cvsSplitBy);
// System.out.println("the size is " + country[1]);
for (int i = 0; i < list.length; i++) {
    sub.add(list[i]);
}
List<String> temp = (List<String>) ((ArrayList<String>) sub).clone();
// master.add(new ArrayList<String>(sub));
master.add(temp);
sub.removeAll(sub);

Suggested code:

master.add(Arrays.asList(line.split(cvsSplitBy))); 
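Put together, the whole read loop shrinks to one statement per row. A sketch of the full class under that change (note that Arrays.asList returns a fixed-size list backed by the split array, so wrap it in new ArrayList<>(...) if you need to modify rows later):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataReader {
    private final List<List<String>> master = new ArrayList<>();

    public void readFromCSV(String csvFile) {
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            System.out.println("Header " + br.readLine()); // consume the header line
            String line;
            while ((line = br.readLine()) != null) {
                // one list allocation per row: no per-element copy, no clone, no clear
                master.add(Arrays.asList(line.split(",")));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public List<List<String>> getMaster() {
        return master;
    }
}
```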



I don't have a CSV that big, but you could try the following:

public static void main(String[] args) throws IOException {
    Path csvPath = Paths.get("path/to/file.csv");
    List<List<String>> master = Files.lines(csvPath)
            .skip(1)
            .map(line -> Arrays.asList(line.split(",")))
            .collect(Collectors.toList());
}

EDIT: I tried it with a CSV sample with 50k entries and the code runs in less than one second.


It's not working; it reports: Exception in thread "main" java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1 at java.base/java.nio.file.FileChannelLinesSpliterator.readLine(FileChannelLinesSpliterator.java:173)
@Jack It's probably some problem with the encoding of your file. You can pass a Charset as the second parameter of the Files.lines method. Take a look at this or this
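For example, a sketch of the same read with an explicit charset (ISO-8859-1 here is only a guess at the file's actual encoding; unlike UTF-8 it never throws MalformedInputException, because every byte sequence is valid in it, though non-Latin text may be decoded wrongly):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CharsetCsvRead {
    static List<List<String>> readCsv(Path csvPath) throws IOException {
        // The second argument selects the decoder; Files.lines(Path) alone assumes UTF-8
        return Files.lines(csvPath, StandardCharsets.ISO_8859_1)
                .skip(1)
                .map(line -> Arrays.asList(line.split(",")))
                .collect(Collectors.toList());
    }
}
```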

Extending the answer of @Alex R, you can process it in parallel as well, like this:

public static void main(String[] args) throws IOException {
    Path csvPath = Paths.get("path/to/file.csv");
    List<List<String>> master = Files.lines(csvPath)
            .skip(1)
            .parallel()
            .map(line -> Arrays.asList(line.split(",")))
            .collect(Collectors.toList());
}


Yes, but in case you don't want to keep track of line order and just want to process the data, you can process it in parallel to get the result faster.
Reading in parallel may even slow the whole thing down... There are multiple posts about that, like this or this
Thanks @Alex R for the information. If you are using Java 9 or a later version, it works as expected. But on Java 8, avoid parallel processing.
