1

How to convert Non-ISO extended-ASCII English text with CRLF line terminators to UTF-8 in Python

4 Answers

1

Extending Jishiyu's answer, you can use uchardet to identify the character set. For example:

iconv -f "$(uchardet a_strange_file.txt)" -t UTF-8 -o the_output_file.txt a_strange_file.txt

Note, however, that this does not do the job in Python.
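For a Python-only approximation of the detect-then-convert idea, here is a minimal sketch. It does not use uchardet; it swaps in a much cruder heuristic that simply tries a list of candidate encodings in order. The function name `guess_and_decode` and the candidate list are my own assumptions, and third-party detectors such as chardet would likely guess better on real files.

```python
def guess_and_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: this is a trial-and-error heuristic, not real
    charset detection. latin-1 accepts every byte, so it acts as a
    last-resort fallback when it is in the list.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


if __name__ == "__main__":
    # b"caf\xe9" is not valid UTF-8, so the next candidate is tried.
    text, encoding = guess_and_decode(b"caf\xe9\r\n")
    print(encoding, repr(text))
```

The order of the candidates matters: UTF-8 is strict enough that a successful decode is a good sign, whereas the single-byte code pages accept almost anything.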


0

I think the Linux commands unix2dos, dos2unix, and iconv will be helpful.

For example:

iconv -f latin-1 -t UTF-8 latin.txt >utf8.txt
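If the source encoding is already known, roughly the same conversion can be sketched in Python with only the standard library. The function name `convert_file` and the `normalize_newlines` flag are assumptions of mine for illustration; the flag stands in for a dos2unix-style pass and is off by default so that the output matches what iconv would produce.

```python
def convert_file(src, dst, src_encoding="latin-1", normalize_newlines=False):
    """Re-encode the text file `src` as UTF-8 and write it to `dst`.

    Hypothetical helper. newline="" disables Python's newline
    translation, so CRLF terminators survive unless we replace them.
    """
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    if normalize_newlines:
        # Rough dos2unix equivalent: CRLF -> LF.
        text = text.replace("\r\n", "\n")
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


if __name__ == "__main__":
    # File names taken from the iconv example above.
    convert_file("latin.txt", "utf8.txt", src_encoding="latin-1")
```

This reads the whole file into memory, which is fine for typical text files but not for very large ones.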

1 Comment

But I need a Python package that automatically converts to the specified format.
0

If you read the input file as a raw byte stream, you can decode it from its source encoding and then re-encode it as UTF-8. See this blog post with some Python 3 examples.
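As an illustration of the byte-stream idea (this is not taken from the linked blog post), the sketch below transcodes a binary stream chunk by chunk with an incremental decoder from the standard `codecs` module. The function name `transcode_stream`, the default source encoding cp1252, and the chunk size are assumptions; you need to know or guess the real source encoding yourself.

```python
import codecs
import io


def transcode_stream(src, dst, src_encoding="cp1252", chunk_size=64 * 1024):
    """Copy binary stream `src` to binary stream `dst`, re-encoded as UTF-8.

    Hypothetical helper. The incremental decoder keeps state between
    chunks, so a character split across a chunk boundary is still
    decoded correctly. Line terminators are passed through unchanged.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush anything the decoder is still holding back.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


if __name__ == "__main__":
    source = io.BytesIO(b"na\xefve \x93quoted\x94\r\n")
    target = io.BytesIO()
    transcode_stream(source, target)
    print(target.getvalue().decode("utf-8"))
```

With real files, the two streams would be opened in binary mode, for example `open(path, "rb")` and `open(out_path, "wb")`.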


0

I have created an automated conversion script using the enca library. I use it on my NAS to convert subtitles to UTF-8, but it could be used for any automated conversion.

Feel free to use :)

EDIT:

#!/bin/bash
LANGUAGE=czech
TO=utf8
CONVERT="enca -L $LANGUAGE -x $TO"

# Find and convert
find ./ -type f -name "*.srt" | while read fn; do
    IS_TARGET=`enca "${fn}" | egrep -ow -m 1 'UTF-8|Unrecognized|KOI8-CS2|7bit ASCII|UCS-2|Macintosh Central European'`
    if [ "$IS_TARGET" != "UTF-8" ] && [ "$IS_TARGET" != "UCS-2" ] && [ "$IS_TARGET" != "Macintosh Central European" ] && [ "$IS_TARGET" != "Unrecognized" ] && [ "$IS_TARGET" != "7bit ASCII" ] && [ "$IS_TARGET" != "KOI8-CS2" ]; then
        echo "${fn} ---- Will be converted!"
        # optional backup of original srt
        # cp "${fn}" "${fn}.bak"
        $CONVERT "${fn}"
    fi
done
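Since the question asks for Python, here is a rough sketch of the same batch idea without enca. It is not a port of the script: instead of detecting the encoding, it skips files that already decode as UTF-8 and assumes everything else is in one fixed encoding. The function name `convert_tree` and the cp1250 default (a common code page for Czech text) are my assumptions, so adjust them to your data.

```python
from pathlib import Path


def convert_tree(root, pattern="*.srt", assumed_encoding="cp1250"):
    """Rewrite non-UTF-8 files under `root` as UTF-8, in place.

    Hypothetical helper: the source encoding is assumed, not detected.
    Returns the list of paths that were converted.
    """
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8 (or plain ASCII): leave it alone
        except UnicodeDecodeError:
            pass
        try:
            text = raw.decode(assumed_encoding)
        except UnicodeDecodeError:
            continue  # does not fit the assumed encoding either: skip it
        # Optional backup of the original file:
        # path.with_name(path.name + ".bak").write_bytes(raw)
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted


if __name__ == "__main__":
    for converted_path in convert_tree("."):
        print(f"{converted_path} ---- converted")
```

Because the files are overwritten in place, enabling the backup line is worth considering before running this on anything you care about.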

1 Comment

You should probably include the source code in your answer as opposed to just linking to it.
