I have a large CSV file that has floats, ints and strings in it. Using only the csv module, I need to determine the data type of each column, then perform calculations like mode, mean etc. on the columns with numbers in them. So far I have this:

    import csv

    file = open('adult.csv')
    reader = csv.reader(file)
    filename = 'output.xml'
    text = open(filename, 'w')
    text.write('<?xml version="1.0"?>')
    text.write('<!DOCTYPE summary [')

    def getType2(value):
        try:
            float(value)
            if "." in value:
                print 'float',
                return 'float'
            else:
                print 'int',
                return 'int'
        except ValueError:
            print 'str',

    headers = reader.next()
    length= len(headers)
    print length
    i=0
    while i<length:
        print '<name>',
        print headers[i],
        print '</name>'
        print '<type>',
        value = reader.next()
        if getType2(value[i]) == 'int':
            data =[]
            total =0
            for row in reader:
                data.append(float(row[i]))
                total += int(row[i])
            print total
        print '</type>\n\n'
        print
        i= i+1
    print '<!ELEMENT summary\n\n>'

This correctly determines the data type and correctly handles the first column, but then I get an index error and it will not move on to the next one.

I'm pretty sure I am doing this in an extremely convoluted way, and there must be an easier way to deal with this problem.

1 Answer

The reason it only works for the first column is the way you iterate over the csv reader object.

You have a while loop that is meant to iterate over the columns, but inside that loop, just before the if clause, you also call reader.next() each time, which returns the next line of the file. If it weren't for what happens inside the if clause, you would therefore only be checking the elements on the diagonal of the CSV file. Inside the if clause, you then iterate over all the remaining rows of the file. The reader object is an iterator, which means that once you've obtained an element (e.g. a line), there's no way to get it again: you can't jump back to line 1 if you've already looked at line 2. Once that for loop has finished, the reader is exhausted, so there is nothing left to read when the while loop moves on to the next column.
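To make the one-pass behaviour concrete, here is a minimal sketch. It assumes Python 3, where the call is next(reader) rather than reader.next(), and it uses a made-up three-row sample in place of adult.csv; the column names are hypothetical.

```python
import csv
import io

# Hypothetical three-row sample standing in for adult.csv.
sample = io.StringIO("age,hours\n39,40\n50,13\n38,40\n")
reader = csv.reader(sample)

headers = next(reader)                  # consumes the header line
first_pass = [row for row in reader]    # consumes every remaining row
second_pass = [row for row in reader]   # nothing left: the reader is exhausted

print(headers)
print(len(first_pass))
print(second_pass)

# Asking an exhausted reader for another row raises StopIteration.
try:
    next(reader)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```

The second list comprehension finds no rows because the first one already consumed them, which is the same thing that happens to the for loop on the second column in the question's code.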

Your current approach is indeed somewhat convoluted. If you want to process the entire file and operate on all columns, read the file only once and keep the data in a structure organized by column. The simplest way to do that is to load the entire file into memory.
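One way to do this with only the csv module is sketched below, assuming Python 3. The helper names (get_type, column_type, mode, summarize) and the sample column names are hypothetical, and io.StringIO is used only to supply sample data in place of adult.csv. The idea is to read all rows once, transpose them into columns, and then compute the statistics per column.

```python
import csv
import io


def get_type(value):
    """Classify one cell as 'int', 'float' or 'str'."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"


def column_type(values):
    """Pick the narrowest type that fits every cell in the column."""
    kinds = {get_type(v) for v in values}
    if "str" in kinds:
        return "str"
    if "float" in kinds:
        return "float"
    return "int"


def mode(values):
    """Return the most frequent value; ties go to the value seen first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts, key=counts.get)


def summarize(csv_file):
    """Return {header: {'type': ..., 'mode': ..., 'mean': ...}} per column."""
    reader = csv.reader(csv_file)
    headers = next(reader)
    rows = [row for row in reader if row]   # read once, skipping blank lines
    columns = list(zip(*rows))              # transpose rows into columns

    summary = {}
    for name, values in zip(headers, columns):
        kind = column_type(values)
        info = {"type": kind}
        if kind == "str":
            data = list(values)
        else:
            data = [float(v) for v in values]
            info["mean"] = sum(data) / len(data)   # mean only for numeric columns
        info["mode"] = mode(data)
        summary[name] = info
    return summary


# Hypothetical sample data; with a real file, pass open('adult.csv', newline='').
sample = io.StringIO(
    "age,workclass,hours\n39,State-gov,40.5\n50,Private,13.0\n38,Private,40.0\n"
)
result = summarize(sample)
for name, info in result.items():
    print(name, info)
```

Because every row is stored in the rows list before any column is processed, each column can be visited as often as needed without touching the reader again. For a file too large to hold in memory, running totals per column would be an alternative for the mean, but that is a different design from the one sketched here.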

If the data file you are processing is a well-behaved CSV file, you would most likely be better off using numpy.genfromtxt, which returns a NumPy array that has methods like mean. NumPy is intended for numerical calculations. You could also look at pandas, which offers similar functionality. Using these libraries would save you from reinventing the wheel.

Once you have your dataset loaded in either a pandas DataFrame or a NumPy array, you can do the rest of the numerical processing with ease and then write the values you need to the output file. There are examples of how to use np.genfromtxt at the bottom of the link, which should help to get you started.

1 Comment

Thanks, that makes a lot of sense. Unfortunately this is for a school project, so I can't use numpy or pandas, but what you just said helped enormously, thanks!!
