Skip to content

Support writing unicode characters in df.to_stata() #23573

@kylebarron

Description

@kylebarron

Code Sample, a copy-pastable example if possible

import pandas as pd df = pd.DataFrame({'a': ['丆']}) df.to_stata('test.dta') # UnicodeEncodeError: 'latin-1' codec can't encode character '\u4e06' in position 0: ordinal not in range(256)

I picked an arbitrary CJK character to test this with.

Problem description

It would be possible to write Unicode strings to a Stata file by implementing a writer according to version 118 of the dta format.

I'd be interested in trying to submit a PR for this. (Edit: I don't use Stata anymore)

Expected Output

Stata file written to disk.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 2.6.32-696.18.7.el6.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.23.4 pytest: 3.5.1 pip: 10.0.1 setuptools: 39.1.0 Cython: 0.28.2 numpy: 1.14.3 scipy: 1.1.0 pyarrow: 0.10.0 xarray: None IPython: 7.0.1 sphinx: 1.7.4 patsy: 0.5.0 dateutil: 2.7.3 pytz: 2018.4 blosc: None bottleneck: 1.2.1 tables: 3.4.3 numexpr: 2.6.5 feather: None matplotlib: 2.2.2 openpyxl: 2.5.3 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.4 lxml: 4.2.1 bs4: 4.6.3 html5lib: 1.0.1 sqlalchemy: 1.2.7 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: 0.1.6 pandas_gbq: None pandas_datareader: None 

Metadata

Metadata

Assignees

No one assigned

    Labels

    IO Stataread_stata, to_stataUnicodeUnicode strings

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions