
I'm trying to write a very simple output statement to a CSV file. It just states the deviation margin of some data, so I am using the '±' symbol, and it should read something like '5 ft/s^2 ±2.4%'.

I'm working in Python 3. I have tried three different methods of producing the '±' symbol: a character code via chr(), a Unicode escape, and just straight up copy-pasting the character into the editor. See below:

val1 = 3.2
val2 = 2.4
s1 = str(val1) + "ft/sec^2 " + chr(241) + str(val2) + "%"
s2 = str(val1) + "ft/sec^2 " + u'\u00B1' + str(val2) + "%"
s3 = str(val1) + "ft/sec^2 ±" + str(val2) + "%"

However, the output for all three of those methods is always the same for me...

3.2ft/sec^2 Â±2.4%

This 'Â' continues to show up. I am not at all experienced with encodings and things like that. I've searched and found a few situations that seemed related to mine, but I didn't understand enough to piece together a solution for my specific situation.

I'm using a pandas DataFrame to gather the data and then using the .to_csv() method to create the csv. The documentation for it states it defaults to 'utf-8' encoding.

Here's a short snippet that reproduces the same issue for me.

import pandas as pd
df = pd.DataFrame(columns=['TestCol'])
df['TestCol'] = ["Test1: " + chr(241), "Test2: " + u'\u00B1', "Test3: " + "±"]
df.to_csv('TestExample.csv', index=False, encoding='utf-8')

In my CSV I get a column that looks like:

TestCol
Test1: Ã±
Test2: Â±
Test3: Â±

Any help, explanation, and knowledge is appreciated!

5 Comments

  • The second two work fine for me; the first outputs an ñ instead of the ±. Can you edit your post to include the code you're using to write to the CSV? Commented Jul 16, 2019 at 16:27
  • The second byte of the UTF-8 encoding of Unicode code points U+00A0 through U+00BF happens to coincide with the code point itself. Your UTF-8-encoded values are being interpreted as ISO-8859. Commented Jul 16, 2019 at 16:27
  • The character with code point 241 (decimal) is indeed ñ. Commented Jul 16, 2019 at 16:44
  • @mzjn That sorta came up. When I just plugged chr(241) into a Python terminal it spit out ñ, yet the same thing did indeed give me Â± in the code that creates the CSV, which did confuse me. Commented Jul 16, 2019 at 17:01
  • @MichaelKolber I updated the post to include the method I use for creating the CSV. Commented Jul 16, 2019 at 17:01

3 Answers


Excel assumes Windows encoding when opening a .csv file. Which encoding depends on the language/country of the installation, but in English and Western European countries it is cp-1252, which is very similar to ISO-8859-1 (also known as "latin1").

This encoding uses a single byte per character, which means it allows at most 256 different characters (in fact fewer, because some of the codes are reserved for control and non-printable characters).

Python 3 uses Unicode to represent strings. Unicode does not have the "only 256 symbols" limitation, because internally it uses up to 21 bits per code point. In practice, Unicode can represent any character of any language in the world (and even some languages out of this world).

The problem is that when Unicode has to be written to a file (or transferred over a network), it has to be "encoded" as a sequence of bytes. One of the ways of doing this, and the current standard in many fields, is UTF-8.

UTF-8 encoding uses a variable number of bytes per character. It was designed to be compatible with ASCII, so any symbol in the ASCII table is represented by a single byte (which coincides with its ASCII code). But any character not in ASCII requires more than one byte. In particular, the character ± (code point U+00B1, or 177 decimal), when encoded in UTF-8, requires two bytes with hexadecimal values c2 and b1.
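You can check this in a Python 3 interpreter, using nothing beyond the built-in string methods:

>>> "±".encode("utf-8")
b'\xc2\xb1'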

When Excel reads these bytes, since it assumes the single-byte cp-1252 encoding, it decodes the sequence c2, b1 as two separate characters. The first one is decoded as Â, and the second one, coincidentally, as ±.

Incidentally, Unicode ñ (code point U+00F1, or 241 decimal) is also encoded in UTF-8 as two bytes, with values c3 and b1, which when decoded as cp-1252 are shown as Ã±. Note that the first one is now Ã instead of Â, but the second one is again (coincidentally again) ±.
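You can reproduce the whole mojibake round trip the same way, encoding as UTF-8 and decoding as cp-1252 (a minimal sketch of what Excel is effectively doing):

>>> "±".encode("utf-8").decode("cp1252")
'Â±'
>>> "ñ".encode("utf-8").decode("cp1252")
'Ã±'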

The solution is to indicate to pandas that cp-1252 encoding should be used when writing the file:

df.to_csv("file.csv", encoding="cp1252") 

Of course this has a potential problem. Since cp-1252 can represent at most 256 symbols, while Unicode can represent more than a million, some of the string data in your dataframe may use a character not representable in cp-1252. In that case you will get an encoding error.
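For example (a sketch; Ω is just an arbitrary character that happens to fall outside cp-1252):

>>> "Ω".encode("cp1252")
UnicodeEncodeError: 'charmap' codec can't encode character '\u03a9' in position 0: character maps to <undefined>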

Also, when reading this .csv back with pandas, you have to specify the encoding, since pandas assumes UTF-8.
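For example, assuming the file name used above:

import pandas as pd
df = pd.read_csv("file.csv", encoding="cp1252")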

Update about utf-8-sig

Other answers and some comments refer to the "utf-8-sig" encoding, which would be another valid (perhaps preferable) solution. I'll elaborate a bit on what it is.

UTF-8 is not the only way to convert Unicode into a sequence of bytes, although it is the recommended one in several standards. Another popular choice is (was?) UTF-16. In this encoding, Unicode characters are encoded as 16-bit values (characters that don't fit in a single 16-bit value are encoded as a pair of them).

The problem with using 16 bits per character instead of 8 is that endianness becomes relevant. Since a 16-bit value is not the basic unit in which memory, networks and disks operate, when you write or send a 16-bit value, two bytes are in fact transferred, and the order in which those bytes go is architecture-dependent. For example, assume that you need to write the 16-bit number 66ff (expressed in hexadecimal) to disk. You have to "break" it into 66 and ff and decide which one to write first. The sequence on disk can be 66, ff (this is called big-endian order), or ff, 66 (this is called little-endian order).
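Python can show both byte orders explicitly (a quick illustration, using the ± character again):

>>> "±".encode("utf-16-be")
b'\x00\xb1'
>>> "±".encode("utf-16-le")
b'\xb1\x00'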

If you are on a little-endian architecture, such as Intel, the default order of the bytes on disk will be different than on a big-endian architecture. The problem, of course, comes when you try to read the file on a machine whose architecture differs from the one on which the file was created. You may end up assembling those bytes incorrectly as ff66, which would be a different Unicode character.

So there must be some way to include in the file information about the endianness used when it was created. This is the role of the so-called BOM (Byte Order Mark). It consists of the Unicode character U+FEFF. If this character is written first in the file, then when the file is read back and the software finds FEFF as the first character, it knows that the endianness used to read the file is the same as the one used when writing it. But if it finds FFFE instead (the order is swapped), it knows there is an endianness mismatch, and it must swap each pair of bytes while reading to get the correct Unicode characters.

By the way, the Unicode standard has no character with code FFFE, precisely to avoid confusion when reading the BOM. If you find FFFE at the beginning, it means that the endianness is wrong and you must swap bytes.
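Python's plain utf-16 codec writes this BOM for you, using the machine's native byte order; on a little-endian machine you get:

>>> "±".encode("utf-16")
b'\xff\xfe\xb1\x00'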

None of this applies to UTF-8, since that encoding uses bytes (not 16-bit units) as its basic unit of information, and is thus immune to the endianness problem. Nevertheless, you can encode U+FEFF in UTF-8 (it results in a sequence of 3 bytes, with values EF, BB, and BF) and write it anyway as the first character in the file. This is what Python does when you specify the utf-8-sig encoding.
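You can see those three bytes by encoding with utf-8-sig; note how they precede the c2, b1 of the ± itself:

>>> "±".encode("utf-8-sig")
b'\xef\xbb\xbf\xc2\xb1'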

In this case its purpose is not to determine endianness, but to act as a kind of "fingerprint" that helps the software reading the file back to guess that the encoding used was UTF-8. If the software finds the "magic values" EF, BB, and BF as the first 3 bytes of the file, it can conclude that the file is stored in UTF-8. Those three bytes are discarded, and the rest is decoded as UTF-8.

In particular, Microsoft Windows uses this technique in most of its software. Apparently this also works in the case of Excel, so, to summarize:

  • You write your csv using df.to_csv("file.csv", encoding="utf-8-sig")
  • Excel reads the file and finds EF, BB, BF at the start. So it discards these bytes and assumes utf-8 for the rest of the file.
  • When the sequence c2, b1 later appears in the file, it is correctly decoded as UTF-8 to produce ± (see the sketch just below this list).
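Putting it together, a minimal sketch reusing the question's data (the file name is just an example):

import pandas as pd

df = pd.DataFrame({'TestCol': ['3.2 ft/sec^2 ±2.4%']})
df.to_csv('TestExample.csv', index=False, encoding='utf-8-sig')

# The file starts with the BOM, then plain UTF-8 text:
with open('TestExample.csv', 'rb') as f:
    print(f.read(8))   # b'\xef\xbb\xbfTestC'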

This has the advantage of working on any Windows computer, no matter which code page it is using (cp1252 is for Western Europe; other countries could be using other code pages, but Unicode and UTF-8 are universal).

The potential problem comes if you try to read this csv on a non-Windows machine. The first "magic bytes" EF, BB, BF may be meaningless for the software reading it, and then you may end up with "spurious" characters at the beginning of the file, which may cause problems. If the software reading the file assumes UTF-8, these first three bytes are decoded as the Unicode character U+FEFF, but they are not discarded. This character is invisible and has zero width, so you won't be able to "see" it in most editors, but it is there nevertheless. If the software reading the file assumes any other encoding, such as "latin1", the three bytes are incorrectly decoded as ï»¿, and they will be visible at the beginning of the file.

If you use Python to read this file back, you have to specify the utf-8-sig encoding again, to make Python discard those three initial bytes.
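The difference is easy to see with plain open(), continuing the sketch above:

with open('TestExample.csv', encoding='utf-8') as f:
    print(repr(f.read(1)))   # '\ufeff' -- the BOM survives as an invisible character
with open('TestExample.csv', encoding='utf-8-sig') as f:
    print(repr(f.read(1)))   # 'T' -- the BOM is stripped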


4 Comments

Awesome, thanks for the explanation and the solution. I need to note though that "cp-1252" threw an error for me, apparently it needed to be written "cp1252" when passed into this method. Thanks again!
@the_pied_shadow I fixed that in the answer. Thanks for noting it
@JLDiaz Excel will read UTF-8 if utf-8-sig is used as the encoding. This encodes a signature Excel recognizes as UTF-8.
@MarkTolonen Thank you for noting that. I expanded the answer to explain how this encoding works and some caveats.

You're writing UTF-8 to the file, but whatever you're using to view it is viewing it as latin-1 (or the similar Windows cp1252). You might try opening the file you're writing to with encoding='utf-8-sig' which puts a BOM at the beginning of the file so applications can recognize it as UTF-8. Or you might just tell your viewer program to interpret it as UTF-8. I strongly recommend not writing it as latin-1 or the like, as that will make the text non-portable to systems with other locales without explicitly telling people how to decode it.
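With the DataFrame from the question, that would be something like:

df.to_csv('TestExample.csv', index=False, encoding='utf-8-sig')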

2 Comments

Thanks for the reply. I'm not opening the file I'm writing to; I'm creating a file from a pandas DataFrame using the .to_csv() method. Looking at the documentation, it defaults to utf-8 for encoding, but I tried explicitly stating it and it didn't change anything. I'm using Excel to open the file for viewing after the code has completed.
@the_pied_shadow: That's why I specifically suggested utf-8-sig, not just utf-8. Writing the UTF-8 BOM tells Windows programs to interpret the bytes as UTF-8, not cp1252 (or whatever other terrible locale specific encoding it defaults to). If you're not using open, you should still be able to specify an encoding with whatever you're using to write the file; please try that before settling on cp1252, aka "the codec that will burn anyone outside Western European language speaking countries, or anyone on a non-Windows system, who uses your data".

s3 contains a UTF-8-encoded value, where the UTF-8 encoding of ± (U+00B1) is \xc2\xb1. However, your terminal is interpreting the bytes as ISO-8859-1-encoded text, not UTF-8-encoded text. In ISO-8859-1, code point C2 is (you might have guessed by now) "Â", and code point B1 is "±". In fact, for all Unicode values between U+00A0 and U+00BF inclusive, the second byte of the UTF-8 encoding coincides with the Unicode code point. Also, ISO-8859-1 coincides with Unicode for code points 00-FF.
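That coincidence is easy to check with a small loop over the range:

# For U+00A0..U+00BF, the UTF-8 encoding is two bytes: 0xC2, then the code point itself
for cp in range(0xA0, 0xC0):
    encoded = chr(cp).encode('utf-8')
    assert encoded[0] == 0xC2 and encoded[1] == cp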
