0
new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',' ').replace(' ', ' ').replace(' ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ') 

This is the error produced by the code:

new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -

I have tried decoding ascii from unicode.

3
  • What is text? Commented Aug 29, 2017 at 10:04
  • The text was generated by converting a PDF document (using Watson Document Converter). This is a part of the text: [ no title Bajaj Allianz General Insurance Company Ltd. GE Plaza, Airport Road, Yerwada, Pune - 411006(India) CERTIFICATE CUM POLICY SCHEDULE Policy Servicing Off: Bajaj Finserv Building, 1st Floor, Behind Weikfield IT-Park, Viman Nagar, Pune-411014 Phone No :1800-209-0144 Product Private Car - Liability Only Policy Period Of Insurance From: 27-May-2017 Policy issued on 25-May-2017 - To: 26-May-2018 Midnight Cover Note No / Insured Name SANJAY SINGH ] Commented Aug 29, 2017 at 11:23
  • Do the replacements one at a time instead of all at once and figure out which one is causing the error. If on Python 2, it is probably .replace('Â', ' ') and you need to use Unicode strings everywhere (u'\u00a0', etc.). Commented Aug 29, 2017 at 14:27

2 Answers

1

You are calling .replace on a unicode object but giving str arguments to it. The arguments are converted to unicode using the default ASCII encoding, which will fail for bytes not in range(128).
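The implicit conversion can be reproduced by hand. This is a minimal sketch, assuming the source file is saved as UTF-8 so that the literal 'Â' is stored as the two bytes 0xC3 0x82; the variable names are illustrative only. Decoding those bytes with the ASCII codec, which is roughly what Python 2 does to a str argument before calling unicode.replace, fails on the first byte:

```python
# -*- coding: utf-8 -*-
# Sketch of the implicit step: a non-ASCII str argument gets decoded
# with the ASCII codec. Doing the same decode explicitly raises the
# same kind of error as in the question.

non_ascii_bytes = b'\xc3\x82'  # UTF-8 bytes of the character U+00C2

try:
    non_ascii_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    # Keep the message so it can be compared with the traceback.
    error_message = str(exc)

print(error_message)
```

The byte value and position in the message match the traceback in the question, which suggests that the .replace('Â', ' ') call is the one that fails.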

To avoid this problem do not mix str and unicode. Either pass unicode arguments to unicode methods:

new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')... 

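or do the replacements in the str object, assuming text is a str. Note that \u escapes are not interpreted in a Python 2 str literal, so use the UTF-8 byte sequences of the characters instead:

new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')... 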
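A minimal sketch of the first option, assuming text is a UTF-8 encoded byte string. The names clean_text, UNWANTED and sample are hypothetical and not from the question; the character list only covers the code points that are legible in the original chain of replaces.

```python
# -*- coding: utf-8 -*-
# Sketch: decode once, then replace using unicode literals only,
# so no implicit ASCII conversion is ever triggered.

# Hypothetical list, taken from the escapes visible in the question.
UNWANTED = [
    u'\u00a0',  # no-break space
    u'\u00ad',  # soft hyphen
    u'\u00c2',  # the character 'Â'
    u'\u20b9',  # Indian rupee sign
    u'\ufffd',  # replacement character
    u'\u037e',  # Greek question mark
    u'\u2022',  # bullet
    u'\u200b',  # zero-width space
]


def clean_text(raw_bytes):
    """Decode UTF-8 bytes and replace each unwanted character with a space."""
    decoded = raw_bytes.decode('utf-8')
    for char in UNWANTED:
        decoded = decoded.replace(char, u' ')
    return decoded


# Made-up sample: a no-break space and a bullet, both encoded as UTF-8.
sample = b'Policy\xc2\xa0No\xe2\x80\xa2 123'
cleaned = clean_text(sample)
print(repr(cleaned))
```

Looping over a list keeps every argument a unicode literal and makes it easy to test the replacements one at a time, as suggested in the comments on the question.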


0

The last piece of your chained replaces seems to be the problem.

text.replace('0xc3', ' ') 

This will try to replace the bytes 0xc3 with a space. In your code snippet it effectively reads

text.decode('utf-8').replace('0xc3', ' ') 

which means that you first decode the bytes to a (unicode) string in Python and then try to replace the wrong bytes. It should work if you replace the bytes before decoding:

text.replace('0xc3', ' ').decode('utf-8') 

4 Comments

Is there any way to convert UTF-8 encoding directly to text?
Type casting doesn't work. Also, I am working on Python 2.7.
@AryanSingh what do you mean by "utf-8 encoding" and "text"? Those are not types in Python.
The last piece is not the problem. That replaces the 4-character text 0xc3 with a space, which, while not what the OP wants, is still valid code.
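The distinction drawn in the last comment can be checked directly. This is a small sketch with made-up sample data and illustrative variable names: '0xc3' is four ordinary ASCII characters, while '\xc3' is a single byte, so replacing the former leaves any real 0xC3 byte untouched.

```python
# -*- coding: utf-8 -*-
# Sketch: the literal '0xc3' is 4-character text, not the byte 0xC3.

four_char_text = b'0xc3'  # the characters '0', 'x', 'c', '3'
single_byte = b'\xc3'     # one byte with the value 0xC3

# Made-up sample: UTF-8 encoded "caf\u00e9" followed by the text "0xc3".
data = b'caf\xc3\xa9 0xc3'

# Only the 4-character text is replaced; the byte 0xC3 stays in place.
replaced = data.replace(four_char_text, b' ')
print(len(four_char_text), len(single_byte), repr(replaced))
```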