
In a nutshell: my script cannot write complete unicode strings (retrieved from a database) to a CSV file; instead, only the first character of each string is written. For example:

U,1423.0,831,1,139 

Where the output should be:

University of Washington Students,1423.0,831,1,139 

Some background: I'm connecting to an MSSQL database using pyodbc. I have my ODBC config file set up for unicode, and I connect to the db as follows:

p.connect("DSN=myserver;UID=username;PWD=password;DATABASE=mydb;CHARSET=utf-8") 
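
For context, the rows are pulled with an ordinary cursor fetch along the lines of the sketch below (the query, column names, and the p alias for pyodbc are simplified placeholders, not my exact code):

import pyodbc as p

conn = p.connect("DSN=myserver;UID=username;PWD=password;DATABASE=mydb;CHARSET=utf-8")
cursor = conn.cursor()
# Placeholder query; the real table and column names differ.
cursor.execute("SELECT group_name, score, col_a, col_b, col_c FROM some_table")
report_data = [list(row) for row in cursor.fetchall()]  # roughly what ends up in report.data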

I can get data without a problem, but the issue arises when I try to save the query results to a CSV file. I've tried csv.writer, the UnicodeWriter solution in the official docs, and most recently the unicodecsv module I found on GitHub. Each method yields the same result.

The weird thing is that I can print the strings in the Python console without any problem, yet when I take the same string and write it to the CSV, the problem appears. See my test code and results below:

Code to highlight issue:

print "'Raw' string from database:" print "\tencoding:\t" + whatisthis(report.data[1][0]) print "\tprint string:\t" + report.data[1][0] print "\tstring len:\t" + str(len(report.data[1][0])) f = StringIO() w = unicodecsv.writer(f, encoding='utf-8') w.writerows(report.data) f.seek(0) r = unicodecsv.reader(f) row = r.next() row = r.next() print "Write/Read from csv file:" print "\tencoding:\t" + whatisthis(row[0]) print "\tprint string:\t" + row[0] print "\tstring len:\t" + str(len(row[0])) 

Output from test:

'Raw' string from database:
    encoding:      unicode string
    print string:  University of Washington Students
    string len:    66
Write/Read from csv file:
    encoding:      unicode string
    print string:  U
    string len:    1

What could be the reason for this issue and how might I resolve it? Thanks!

EDIT: the whatisthis function is just there to check the string type; it's taken from this post:

def whatisthis(s):
    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"
  • Why are you doing r.next() twice? Aren't you printing the 2nd row in the first case and the 3rd row in the 2nd case? Commented Jun 30, 2013 at 20:01
  • That's just because the first row is the field names, which I added afterwards. I just did the r.next() twice to highlight the problem. Commented Jun 30, 2013 at 20:26
  • I am having a similar problem while trying to load csv data to mssql using csvsql. All the textual fields are loaded as a single char. Any idea what it might be? What should be the file's encoding? (driver is freetds on Ubuntu) Commented Oct 19, 2014 at 15:31

1 Answer

import StringIO as sio
import unicodecsv as ucsv

class Report(object):
    def __init__(self, data):
        self.data = data

report = Report([
    ["University of Washington Students", 1, 2, 3],
    ["UCLA", 5, 6, 7],
])

print report.data
print report.data[0][0]
print "*" * 20

f = sio.StringIO()
writer = ucsv.writer(f, encoding='utf-8')
writer.writerows(report.data)
print f.getvalue()
print "-" * 20

f.seek(0)
reader = ucsv.reader(f)
row = reader.next()
print row
print row[0]

--output:--
[['University of Washington Students', 1, 2, 3], ['UCLA', 5, 6, 7]]
University of Washington Students
********************
University of Washington Students,1,2,3
UCLA,5,6,7

--------------------
[u'University of Washington Students', u'1', u'2', u'3']
University of Washington Students

Who knows what mischief your whatisthis() function is up to.


6 Comments

If I create the list myself in code, as above, then there's no problem. The issue is due to the unicode data I'm pulling from the DB. Please see my edit that explains the whatisthis function.
Why did you even include the whatisthis() function in your code? How is it relevant? What output do you get for: print type(report.data)?
And what do you get for: print report.data?
Here is the output when i print report.data: [(u'U\x00n\x00i\x00v\x00e\x00r\x00s\x00i\x00t\x00y\x00 \x00o\x00f\x00 \x00W\x00a\x00s\x00h\x00i\x00n\x00g\x00t\x00o\x00n\x00 \x00S\x00t\x00u\x00d\x00e\x00n\x00t\x00s\x00', u'2\x000\x001\x003\x00-\x000\x006\x00-\x002\x009\x00', 1423.0, 831, 1, 139)]
Your database data is not encoded in UTF-8. For each character in report.data there are two bytes: U\x00, n\x00, etc. The second byte is \x00 every time, a null byte, because all of your characters fit in the first byte. In other words, your database encoding uses a fixed two bytes per character, which UTF-8 does not. Looking at the MSSQL docs, the encoding appears to be UCS-2, which is a fixed two bytes per character. Python doesn't have a ucs2 codec to decode such strings, but you can try "utf_16_le".
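
A minimal sketch of the re-decoding suggested in the last comment, assuming the driver hands back unicode objects whose characters are really UTF-16-LE bytes split one per character (as in the report.data dump above); the fix_mssql_text helper and the sample row are hypothetical, and the cleaner long-term fix is usually to configure the driver/DSN so the decoding happens there:

from StringIO import StringIO
import unicodecsv

def fix_mssql_text(value):
    # Only touch unicode values carrying the tell-tale embedded NUL characters.
    if isinstance(value, unicode) and u'\x00' in value:
        # Map each character back to one raw byte, then decode those bytes as
        # UTF-16-LE (the closest codec Python 2 has to SQL Server's UCS-2).
        return value.encode('latin-1').decode('utf_16_le')
    return value

# Example with a value shaped like the report.data dump above.
row = [u'U\x00n\x00i\x00v\x00.\x00', 1423.0, 831, 1, 139]
cleaned = [fix_mssql_text(col) for col in row]

f = StringIO()
w = unicodecsv.writer(f, encoding='utf-8')
w.writerow(cleaned)
print f.getvalue()   # Univ.,1423.0,831,1,139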
