
In a nutshell: my script cannot write complete unicode strings (retrieved from a database) to a CSV file; instead, only the first character of each string is written. For example:

U,1423.0,831,1,139 

Where the output should be:

University of Washington Students,1423.0,831,1,139 

Some background: I'm connecting to an MSSQL database using pyodbc. I have my ODBC config file set up for unicode, and I connect to the db as follows:

p.connect("DSN=myserver;UID=username;PWD=password;DATABASE=mydb;CHARSET=utf-8") 
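
For context, the rows are pulled with an ordinary cursor fetch along the lines of the sketch below (the query, column names, and the p alias for pyodbc are simplified placeholders, not my exact code):

import pyodbc as p

conn = p.connect("DSN=myserver;UID=username;PWD=password;DATABASE=mydb;CHARSET=utf-8")
cursor = conn.cursor()
# Placeholder query; the real table and column names differ.
cursor.execute("SELECT group_name, score, col_a, col_b, col_c FROM some_table")
report_data = [list(row) for row in cursor.fetchall()]  # roughly what ends up in report.data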

I can get data without a problem, but the issue arises when I try to save the query results to a CSV file. I've tried csv.writer, the UnicodeWriter solution in the official docs, and most recently the unicodecsv module I found on GitHub. Each method yields the same result.

The weird thing is that I can print the strings in the Python console without any problem, yet when I take the same string and write it to the CSV, the problem appears. See my test code and results below:

Code to highlight issue:

print "'Raw' string from database:" print "\tencoding:\t" + whatisthis(report.data[1][0]) print "\tprint string:\t" + report.data[1][0] print "\tstring len:\t" + str(len(report.data[1][0])) f = StringIO() w = unicodecsv.writer(f, encoding='utf-8') w.writerows(report.data) f.seek(0) r = unicodecsv.reader(f) row = r.next() row = r.next() print "Write/Read from csv file:" print "\tencoding:\t" + whatisthis(row[0]) print "\tprint string:\t" + row[0] print "\tstring len:\t" + str(len(row[0])) 

Output from test:

'Raw' string from database:
    encoding:      unicode string
    print string:  University of Washington Students
    string len:    66
Write/Read from csv file:
    encoding:      unicode string
    print string:  U
    string len:    1

What could be the reason for this issue and how might I resolve it? Thanks!

EDIT: the whatisthis function is just there to check the string type; it's taken from this post:

def whatisthis(s):
    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"
  • Why are you doing r.next() twice? Aren't you printing the 2nd row in the first case and the 3rd row in the 2nd case? Commented Jun 30, 2013 at 20:01
  • That's just because the first row is the field names, which I added afterwards. I just did the r.next() twice to highlight the problem. Commented Jun 30, 2013 at 20:26
  • I am having a similar problem while trying to load csv data to mssql using csvsql. All the textual fields are loaded as a single char. Any idea what it might be? What should be the file's encoding? (driver is freetds on Ubuntu) Commented Oct 19, 2014 at 15:31

1 Answer

import StringIO as sio
import unicodecsv as ucsv

class Report(object):
    def __init__(self, data):
        self.data = data

report = Report([
    ["University of Washington Students", 1, 2, 3],
    ["UCLA", 5, 6, 7],
])

print report.data
print report.data[0][0]
print "*" * 20

f = sio.StringIO()
writer = ucsv.writer(f, encoding='utf-8')
writer.writerows(report.data)
print f.getvalue()
print "-" * 20

f.seek(0)
reader = ucsv.reader(f)
row = reader.next()
print row
print row[0]

--output:--
[['University of Washington Students', 1, 2, 3], ['UCLA', 5, 6, 7]]
University of Washington Students
********************
University of Washington Students,1,2,3
UCLA,5,6,7

--------------------
[u'University of Washington Students', u'1', u'2', u'3']
University of Washington Students

Who knows what mischief your whatisthis() function is up to.


6 Comments

If I create the list myself in code, as above, then there's no problem. The issue is due to the unicode data I'm pulling from the DB. Please see my edit that explains the whatisthis function.
Why did you even include the whatisthis() function in your code? How is it relevant? What output do you get for: print type(report.data)?
And what do you get for: print report.data?
Here is the output when i print report.data: [(u'U\x00n\x00i\x00v\x00e\x00r\x00s\x00i\x00t\x00y\x00 \x00o\x00f\x00 \x00W\x00a\x00s\x00h\x00i\x00n\x00g\x00t\x00o\x00n\x00 \x00S\x00t\x00u\x00d\x00e\x00n\x00t\x00s\x00', u'2\x000\x001\x003\x00-\x000\x006\x00-\x002\x009\x00', 1423.0, 831, 1, 139)]
Your database data is not encoded in UTF-8. For each character in report.data there are two bytes: U\x00, n\x00, etc. The second byte is \x00 every time, a null byte, because all of your characters fit in the first byte. In other words, your database encoding uses a fixed two bytes per character, which UTF-8 does not. Looking at the MSSQL docs, the encoding appears to be UCS-2, which is a fixed two bytes per character. Python doesn't have a ucs2 codec to decode such strings, but you can try "utf_16_le".
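
A minimal sketch of the re-decoding suggested in the last comment, assuming the driver hands back unicode objects whose characters are really UTF-16-LE bytes split one per character (as in the report.data dump above); the fix_mssql_text helper and the sample row are hypothetical, and the cleaner long-term fix is usually to configure the driver/DSN so the decoding happens there:

from StringIO import StringIO
import unicodecsv

def fix_mssql_text(value):
    # Only touch unicode values carrying the tell-tale embedded NUL characters.
    if isinstance(value, unicode) and u'\x00' in value:
        # Map each character back to one raw byte, then decode those bytes as
        # UTF-16-LE (the closest codec Python 2 has to SQL Server's UCS-2).
        return value.encode('latin-1').decode('utf_16_le')
    return value

# Example with a value shaped like the report.data dump above.
row = [u'U\x00n\x00i\x00v\x00.\x00', 1423.0, 831, 1, 139]
cleaned = [fix_mssql_text(col) for col in row]

f = StringIO()
w = unicodecsv.writer(f, encoding='utf-8')
w.writerow(cleaned)
print f.getvalue()   # Univ.,1423.0,831,1,139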
