How to encode UTF-8 strings with only "A-Z","a-z","0-9", and "_" in Python

Question

I need to build a python encoder so that I can reformat strings like this:

    import codecs
    codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'name_of_my_encoder')

The reason I'm asking Stack Overflow at all is that the encoded strings need to pass this validation function. This is a hard constraint with no flexibility; it's due to how the strings have to be stored.

    from string import ascii_letters
    from string import digits

    # '_' must be concatenated as a string; adding a list to a str raises TypeError
    valid_characters = set(ascii_letters + digits + '_')

    def validation_function(characters):
        for char in characters:
            if char not in valid_characters:
                raise Exception

Making an encoder seemed easy enough, but I'm not sure whether this encoder makes it harder to build a decoder. Here's the encoder I've written.

    from codecs import encode
    from string import ascii_letters
    from string import digits

    ALPHANUMERIC_SET = set(ascii_letters + digits)

    def underscore_encode(chars_in):
        chars_out = list()
        for char in chars_in:
            if char not in ALPHANUMERIC_SET:
                chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
            else:
                chars_out.append(char)
        return ''.join(chars_out)

I've only included it for example purposes; there's probably a better way to do this.

Edit 1 - Someone has wisely suggested simply using base32 on the entire string, which I can definitely use. However, it would be preferable to have something that is 'somewhat readable', so an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding would be preferred.

Edit 2 - Proposed solutions must work on Python 3.4 or newer. Working in Python 2.7 as well is nice, but not required. I've added the python-3.x tag to help clarify this a little.

chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii'))) what does this do?
– xrisk, Commented Aug 16, 2015 at 13:31
encode the whole binary string as base 32 or base 64 like in MIME
– phuclv, Commented Aug 16, 2015 at 13:41
@RishavKundu It inserts a hex representation of the character's UTF-8 bytes between underscores, which are the only characters I can reasonably use for an escape sequence. >>> '_{}_'.format(encode('π'.encode(), 'hex').decode('ascii')) prints out '_cf80_'
– Techdragon, Commented Aug 16, 2015 at 14:17
@Techdragon see my answer! Python will do all the work for you!
– xrisk, Commented Aug 16, 2015 at 14:18
@RishavKundu You definitely gave me some new ideas for how to try building this, but your code is Python 2.x only. I'm unable to use Python 2.x code; I've deprecated it in all of my projects, and any 2.x-only code now fails my test suites. Using b32encode/b32decode requires a bytes object, and bytes objects don't concatenate so nicely with strings, which is why I wrote '_{}_'.format(encode(char.encode(), 'hex').decode('ascii')) instead of something like '_{}_'.format(base64.b16encode('π'.encode('utf-8')))
– Techdragon, Commented Aug 16, 2015 at 15:07

Dunes · Accepted Answer · 2015-08-16 17:49:22Z

This seems to do the trick. Basically, alphanumeric characters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other Unicode characters are encoded using the escapes that the unicode-escape codec produces (\xXX, \uXXXX or \UXXXXXXXX, depending on the code point). However, you've said you can't use \, but you can use _, so all escape sequences are translated to start with a _.

This makes decoding extremely simple: just replace the _ with \ and then use the unicode-escape codec. Encoding is slightly more difficult, as the unicode-escape codec leaves printable ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, before finally translating all \ to _.

Code:

    from string import ascii_letters, digits

    # non-translating characters
    ALPHANUMERIC_SET = set(ascii_letters + digits)
    # mapping all bytes to themselves, except '_' maps to '\'
    ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))
    # reverse mapping -- maps `\` back to `_`
    ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))
    # encoding table for ASCII characters not in ALPHANUMERIC_SET
    # (zero-padded to two hex digits, because a \x escape needs exactly two)
    ASCII_ENCODE_TABLE = {i: u"_x{:02x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}

    def encode(s):
        s = s.translate(ASCII_ENCODE_TABLE)  # translate ascii chars not in your set
        bytes_ = s.encode("unicode-escape")
        bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
        return bytes_

    def decode(s):
        s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
        return s.decode("unicode-escape")

    s = u"Random UTF-8 String ☑⚠⚡"
    #s = '北亰'
    print(s)
    b = encode(s)
    print(b)
    new_s = decode(b)
    print(new_s)

Which outputs:

    Random UTF-8 String ☑⚠⚡
    b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
    Random UTF-8 String ☑⚠⚡

This works on both Python 3.4 and Python 2.7, which is why the ESCAPE_CHAR_{DE,EN}CODE_TABLE construction is a bit messy. On Python 2.7, bytes is an alias for str, which works differently from bytes on Python 3.4, so the tables are built using a bytearray. On Python 2.7, the encode function expects a unicode object, not a str.
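To see which escape form each kind of character ends up with, here is a minimal Python 3-only sketch of the same idea. It is not the code from this answer: the names ALNUM, VALID and ASCII_TABLE are made up for illustration, and it returns str rather than bytes. It pads ASCII escapes to two hex digits, since the unicode-escape codec only accepts \x followed by exactly two.

```python
from string import ascii_letters, digits

ALNUM = set(ascii_letters + digits)   # characters left untouched
VALID = ALNUM | {"_"}                 # what the validator accepts

# Hypothetical table: every ASCII character outside ALNUM becomes _xNN.
ASCII_TABLE = {i: "_x{:02x}".format(i) for i in range(128) if chr(i) not in ALNUM}


def encode(text):
    # After translate() no backslash is left in the text, so every backslash
    # in the unicode-escape output starts an escape sequence.
    escaped = text.translate(ASCII_TABLE).encode("unicode-escape")
    return escaped.replace(b"\\", b"_").decode("ascii")


def decode(encoded):
    # Reverse the steps: '_' back to '\', then let the codec do the work.
    return encoded.encode("ascii").replace(b"_", b"\\").decode("unicode-escape")


if __name__ == "__main__":
    samples = ["a b", "caf\u00e9", "\u2611", "\U0001f40d", "tab\there", "under_score\\"]
    for sample in samples:
        enc = encode(sample)
        assert set(enc) <= VALID        # passes the validator's character set
        assert decode(enc) == sample    # round-trips
        print(repr(sample), "->", enc)
```

Code points from U+0080 to U+00FF come out as _xNN, the rest of the BMP as _uNNNN, and anything above U+FFFF as _UNNNNNNNN, all of which stay within the allowed character set.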

xrisk · Accepted Answer · 2015-08-17 10:04:03Z

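Use base32! It uses only the uppercase letters A-Z and the digits 2-7. You can't use base64 because it uses the +, / and = characters, which won't pass your validator. (Base32 output is also padded with = whenever the input length is not a multiple of 5 bytes, so that padding has to be stripped before validation and restored before decoding.)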

    >>> import base64
    >>> print base64.b32encode('Random 🐍 UTF-8 String ☑⚠⚡"')
    KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC
    >>> print base64.b32decode('KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC')
    Random 🐍 UTF-8 String ☑⚠⚡"
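For Python 3, one possible adaptation looks like the sketch below. The function names b32_encode and b32_decode are hypothetical, not part of the base64 module. It assumes the text is encoded as UTF-8, and it strips the = padding on the way out and restores it on the way in, so that the stored string contains only A-Z and 2-7.

```python
import base64


def b32_encode(text):
    # b32encode works on bytes and returns bytes, so convert on both sides.
    raw = base64.b32encode(text.encode("utf-8"))
    # Drop the '=' padding; the validator would reject it.
    return raw.decode("ascii").rstrip("=")


def b32_decode(encoded):
    # Base32 output length is always a multiple of 8 once padded,
    # so the number of stripped '=' characters can be recomputed.
    padding = "=" * (-len(encoded) % 8)
    return base64.b32decode(encoded + padding).decode("utf-8")


if __name__ == "__main__":
    s = b32_encode("Random 🐍 UTF-8 String ☑⚠⚡")
    print(s)
    print(b32_decode(s))
```

The result is not readable at all, which is the trade-off mentioned in the question's first edit.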

edited Aug 17, 2015 at 10:04

answered Aug 16, 2015 at 14:16

5 Comments

Techdragon Over a year ago

This only behaves as expected in Python-2.x

jfs Over a year ago

@Techdragon: It should be trivial to adapt it for Python 3. If you don't know how, ask a separate question: include working Python 2 code and example input/output.

phuclv Over a year ago

the thing is his set of allowed characters has only 63 different values, not 64

phuclv Over a year ago

yeah. I also thought of using base64 at first, but I've just had a look back on this and notice the set is not enough

Chen Zhuo Over a year ago

Isn't the = symbol used in base32 too?

Techdragon · Accepted Answer · 2015-09-01 15:05:15Z

Despite several good answers, I ended up with a solution that seems cleaner and more understandable, so I'm posting the code of my eventual solution to answer my own question.

    from string import ascii_letters
    from string import digits
    from base64 import b16decode
    from base64 import b16encode

    ALPHANUMERIC_SET = set(ascii_letters + digits)

    def utf8_string_to_hex_string(s):
        return ''.join(chr(i) for i in b16encode(s.encode('utf-8')))

    def hex_string_to_utf8_string(s):
        return b16decode(bytes(list((ord(i) for i in s)))).decode('utf-8')

    def underscore_encode(chars_in):
        chars_out = list()
        for char in chars_in:
            if char not in ALPHANUMERIC_SET:
                chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
            else:
                chars_out.append(char)
        return ''.join(chars_out)

    def underscore_decode(chars_in):
        chars_out = list()
        decoding = False
        for char in chars_in:
            if char == '_':
                if not decoding:
                    hex_chars = list()
                    decoding = True
                elif decoding:
                    decoding = False
                    chars_out.append(hex_string_to_utf8_string(hex_chars))
            else:
                if not decoding:
                    chars_out.append(char)
                elif decoding:
                    hex_chars.append(char)
        return ''.join(chars_out)
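The decoder above is a small state machine that toggles on each underscore. As an alternative, here is a sketch of the same _HEX_ scheme with the decoding done by a regular expression. This is not the code from this answer: the ESCAPE pattern is an illustrative name, the function bodies are rewritten, and it assumes Python 3 and the uppercase hex that b16encode produces.

```python
import re
from base64 import b16decode, b16encode
from string import ascii_letters, digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

# One escape is an underscore, the uppercase hex of one character's
# UTF-8 bytes, and a closing underscore (illustrative pattern name).
ESCAPE = re.compile(r'_([0-9A-F]+)_')


def underscore_encode(text):
    return ''.join(
        ch if ch in ALPHANUMERIC_SET
        else '_{}_'.format(b16encode(ch.encode('utf-8')).decode('ascii'))
        for ch in text
    )


def underscore_decode(encoded):
    # Matches are found left to right without overlapping, so each
    # opening underscore is paired with the next closing one.
    return ESCAPE.sub(
        lambda m: b16decode(m.group(1).encode('ascii')).decode('utf-8'),
        encoded,
    )


if __name__ == "__main__":
    s = underscore_encode("Random 🐍 UTF-8 String ☑⚠⚡")
    print(s)
    print(underscore_decode(s))
```

Because every underscore in the encoded text belongs to an escape (a literal underscore is itself written as _5F_), the pattern cannot mistake ordinary alphanumeric text for an escape.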

jfs · Accepted Answer · 2015-09-04 19:59:14Z

You could abuse URL quoting to get a format that is readable, easy to decode in other languages, and passes your validation function:

    #!/usr/bin/env python3
    import urllib.parse

    def alnum_encode(text):
        # quote() never escapes '-', '.', '_' (nor '~' on Python 3.7+),
        # so escape them by hand before turning '%' into '_'
        return urllib.parse.quote(text, safe='')\
                           .replace('-', '%2d').replace('.', '%2e')\
                           .replace('_', '%5f').replace('~', '%7e')\
                           .replace('%', '_')

    def alnum_decode(underscore_encoded):
        return urllib.parse.unquote(underscore_encoded.replace('_','%'), errors='strict')

    s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
    print(s)
    print(alnum_decode(s))

Output

    Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
    Random 🐍 UTF-8 String ☑⚠⚡

Here's an implementation using a bytearray() (to move it to C later if necessary):

    #!/usr/bin/env python3.5
    from string import ascii_letters, digits

    def alnum_encode(text, alnum=bytearray(ascii_letters+digits, 'ascii')):
        result = bytearray()
        for byte in bytearray(text, 'utf-8'):
            if byte in alnum:
                result.append(byte)
            else:
                result += b'_%02x' % byte
        return result.decode('ascii')
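The bytearray version only shows the encoding direction. A matching decoder could look like the sketch below. The alnum_decode shown here is an assumption about how the inverse might be written, not code from this answer, and the ALNUM and ESCAPED_BYTE names are illustrative. Like the encoder, it needs Python 3.5 or newer for bytes % formatting.

```python
import re
from string import ascii_letters, digits

ALNUM = bytearray(ascii_letters + digits, 'ascii')

# An escaped byte is an underscore followed by exactly two lowercase hex digits.
ESCAPED_BYTE = re.compile(rb'_([0-9a-f]{2})')


def alnum_encode(text):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in ALNUM:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')


def alnum_decode(encoded):
    # Turn each _xx back into the single byte it stands for,
    # then decode the whole byte string as UTF-8.
    raw = ESCAPED_BYTE.sub(
        lambda m: bytes([int(m.group(1), 16)]),
        encoded.encode('ascii'),
    )
    return raw.decode('utf-8')


if __name__ == "__main__":
    s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
    print(s)
    print(alnum_decode(s))
```

Since every escape has a fixed width of three characters, no closing delimiter is needed, and a literal underscore in the input is simply stored as _5f.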

With the downside of requiring much much more space to store the encoded form.
It seems the space is not an issue: len(alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")) == len(underscore_encode("Random 🐍 UTF-8 String ☑⚠⚡")) where underscore_encode() is from the accepted answer

Community · Accepted Answer · 2017-05-23 10:26:34Z

If you want a transliteration of Unicode to ASCII (e.g. ç --> c), then check out the Unidecode package. Here are their examples:

    >>> from unidecode import unidecode
    >>> unidecode(u'ko\u017eu\u0161\u010dek')
    'kozuscek'
    >>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
    '30 km/h'
    >>> unidecode(u"\u5317\u4EB0")
    'Bei Jing '

Here's my example:

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    # print() with a single argument works on both Python 2.7 and Python 3
    print(unidecode(u'快樂星期天'))

Gives as an output*

Kuai Le Xing Qi Tian

*may be nonsense, but at least it's ASCII

To remove punctuation, see this answer.

This encoding doesn't produce output that will always pass the validator function.
