I need to build a Python encoder so that I can reformat strings like this:

    import codecs
    codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'name_of_my_encoder')

The reason I'm asking Stack Overflow at all is that the encoded strings need to pass this validation function. This is a hard constraint with no flexibility; it's due to how the strings have to be stored.

    from string import ascii_letters
    from string import digits

    # Only ASCII letters, digits and the underscore are allowed.
    valid_characters = set(ascii_letters + digits + '_')

    def validation_function(characters):
        for char in characters:
            if char not in valid_characters:
                raise Exception

Making an encoder seemed easy enough, but I'm not sure whether this encoder makes it harder to build a decoder. Here's the encoder I've written.

    from codecs import encode
    from string import ascii_letters
    from string import digits

    ALPHANUMERIC_SET = set(ascii_letters + digits)

    def underscore_encode(chars_in):
        chars_out = list()
        for char in chars_in:
            if char not in ALPHANUMERIC_SET:
                # Escape as _<hex of the UTF-8 bytes>_, e.g. 'π' -> '_cf80_'.
                chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
            else:
                chars_out.append(char)
        return ''.join(chars_out)

I've only included it for example purposes; there's probably a better way to do this.
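
For reference, here is a minimal sketch of what a matching decoder could look like, targeting Python 3 only. It relies on the fact that the encoder above never emits a literal underscore (a `_` in the input becomes `_5f_`), so every underscore in the output belongs to an escape. The names `underscore_decode`, `ESCAPE_PATTERN`, `_search` and the codec name `'underscore'` are hypothetical, chosen just for this illustration; the registration part assumes the goal is to call the pair through `codecs.encode` and `codecs.decode`.

```python
import codecs
import re
from string import ascii_letters, digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

# One escape as produced by the encoder: "_" + lowercase hex byte pairs + "_".
ESCAPE_PATTERN = re.compile(r'_((?:[0-9a-f]{2})+)_')


def underscore_encode(chars_in):
    """Same scheme as the encoder in the question."""
    chars_out = []
    for char in chars_in:
        if char in ALPHANUMERIC_SET:
            chars_out.append(char)
        else:
            hex_digits = codecs.encode(char.encode('utf-8'), 'hex').decode('ascii')
            chars_out.append('_{}_'.format(hex_digits))
    return ''.join(chars_out)


def underscore_decode(chars_in):
    """Hypothetical inverse: turn each _hex_ escape back into its character."""
    def replace_escape(match):
        return bytes.fromhex(match.group(1)).decode('utf-8')
    return ESCAPE_PATTERN.sub(replace_escape, chars_in)


def _search(name):
    """Codec search function; 'underscore' is an assumed, made-up codec name."""
    if name != 'underscore':
        return None
    return codecs.CodecInfo(
        name='underscore',
        # Codec functions return (result, number of input items consumed).
        encode=lambda text, errors='strict': (underscore_encode(text), len(text)),
        decode=lambda text, errors='strict': (underscore_decode(text), len(text)),
    )


codecs.register(_search)

encoded = codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'underscore')
decoded = codecs.decode(encoded, 'underscore')
```

Because this is a str-to-str codec, it only works through `codecs.encode` and `codecs.decode`, not through `str.encode`.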
Edit 1 - Someone has wisely suggested simply using base32 on the entire string, which I can definitely use. However, something 'somewhat readable' would be preferable, so I would favor an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding.
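
For comparison, the base32 route might look roughly like the sketch below. One caveat: standard base32 pads its output with `=`, which the validation function would reject, so this sketch strips the padding on encode and restores it on decode. The function names `base32_encode` and `base32_decode` are hypothetical.

```python
import base64
from string import ascii_letters, digits

VALID_CHARACTERS = set(ascii_letters + digits + '_')


def base32_encode(text):
    """Base32-encode the UTF-8 bytes and drop the '=' padding."""
    raw = base64.b32encode(text.encode('utf-8')).decode('ascii')
    return raw.rstrip('=')


def base32_decode(encoded):
    """Restore the padding to a multiple of 8 characters, then decode."""
    padding = '=' * (-len(encoded) % 8)
    return base64.b32decode(encoded + padding).decode('utf-8')


encoded = base32_encode("Random 🐍 UTF-8 String ☑⚠⚡")
assert all(char in VALID_CHARACTERS for char in encoded)
assert base32_decode(encoded) == "Random 🐍 UTF-8 String ☑⚠⚡"
```

The output passes validation but is not readable at all, which is the trade-off against an escaping scheme.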
Edit 2 - Proposed solutions must work on Python 3.4 or newer. Working in Python 2.7 as well is nice but not required. I've added the python-3.x tag to help clarify this.
chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii'))) what does this do?

>>> '_{}_'.format(encode('π'.encode(), 'hex').decode('ascii')) prints out '_cf80_'

'_{}_'.format(encode(char.encode(), 'hex').decode('ascii')) instead of something like '_{}_'.format(base64.b16encode('π'.encode('utf-8')))
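
To make the comparison in these comments concrete, the short sketch below runs the hex step three ways on the UTF-8 bytes of 'π'. The variable names are made up for illustration. The `'hex'` codec and `binascii.hexlify` give lowercase digits, while `base64.b16encode` gives uppercase; all three return `bytes` in Python 3, so each needs `.decode('ascii')` before being formatted into a string.

```python
import base64
import binascii
import codecs

utf8_bytes = 'π'.encode('utf-8')  # the two bytes 0xCF 0x80

via_codecs = codecs.encode(utf8_bytes, 'hex').decode('ascii')
via_binascii = binascii.hexlify(utf8_bytes).decode('ascii')
via_base16 = base64.b16encode(utf8_bytes).decode('ascii')

# The escape used by the encoder in the question.
escaped = '_{}_'.format(via_codecs)

print(via_codecs, via_binascii, via_base16, escaped)
```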