Accented characters in a regex with Python

Question

This is my code:

    # -*- coding: utf-8 -*-
    import json
    import re

    with open("/Users/paul/Desktop/file.json") as json_file:
        file = json.load(json_file)

    print file["desc"]
    key="capacità"
    result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
    print result

This is the content of the file:

{ "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" }

My result is [], but what I want is result = "capacità".

Possible duplicate of python and regular expression with unicode
– MaxZoom, Commented Oct 5, 2015 at 22:54
@UsiUsi capacit\u00e0 and capacità are the same word! It is your editor that is not displaying the characters correctly. For example, I ran my function as print(find_context(' capacit\u00e0',0,3,s)) and it works, because the computer sees only 0s and 1s.
– LetzerWille, Commented Oct 5, 2015 at 23:36

Diogo Rocha · Accepted Answer · 2015-10-05 22:54:21Z


You need to treat your string as a Unicode string, like this:

    str = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"

If you print str.encode('utf-8'), you'll get:

    Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+

In the same way, you can make your regex string a Unicode string or a raw string with the u or r prefix, respectively.
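As a minimal sketch of that advice applied to the question's pattern: the version below assumes Python 3, where every str is already Unicode, so no u prefix is needed (in Python 2 the key would be written u"capacità"). The JSON text is inlined instead of read from the question's file, and the use of re.escape and raw-string pattern pieces is an addition for illustration, not part of the original answer.

```python
import json
import re

# Same content as the question's file.json, inlined so the sketch is self-contained.
raw = r'{ "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" }'
desc = json.loads(raw)["desc"]  # the JSON parser turns \u00e0 into the character à

key = "capacità"  # Python 3 str is Unicode; Python 2 would need u"capacità"

# Raw strings keep the regex backslashes intact. re.escape protects against
# regex metacharacters in the key (an addition, not in the question's code).
pattern = r"((?:[\S,]+\s+){0,3})" + re.escape(key) + r"\s+((?:[\S,]+\s*){0,3})"
result = re.findall(pattern, desc, re.IGNORECASE | re.UNICODE)

# Each tuple holds the words captured before and after the key.
print(result)
```

The key point is that the pattern and the searched text are the same kind of string. In the question's Python 2 code the key was a byte string while json.load returned Unicode text, so the two spellings of à never lined up.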


4 Comments

Usi Usi Over a year ago

How can I convert "capacità" into capacit\u00e0?

Usi Usi Over a year ago

OK, I understand... but if I read "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" from a file, how can I put the u in front of it to say that it is Unicode?

Usi Usi Over a year ago

var="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" followed by var=u(var) is not valid.

Usi Usi Over a year ago

I have totally rewritten my answer
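On the question raised in these comments about text that comes from a file rather than a literal: the u prefix only applies to literals written in source code. The following is a hedged Python 3 sketch, in which the temporary file is a hypothetical stand-in for the question's file.json. It illustrates that json.load already returns Unicode text with \u00e0 decoded, and that raw bytes are converted with .decode rather than with a prefix.

```python
import io
import json
import os
import tempfile

# Write a stand-in for file.json (hypothetical temp file, same content as the question).
content = r'{ "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" }'
fd, path = tempfile.mkstemp(suffix=".json")
with io.open(fd, "w", encoding="utf-8") as handle:
    handle.write(content)

# Read it back with an explicit encoding; json.load decodes the \u00e0 escape.
with io.open(path, encoding="utf-8") as json_file:
    data = json.load(json_file)
os.remove(path)

desc = data["desc"]
print("capacità" in desc)  # True

# Bytes read from elsewhere (here UTF-8 encoded) become text via decode().
raw_bytes = b"capacit\xc3\xa0"
text = raw_bytes.decode("utf-8")
print(text == "capacità")  # True
```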

nhahtdh · Accepted Answer · 2015-10-06 07:58:44Z

You can use this function; it works with either form of the word (capacità or capacit\u00e0).

The default encoding in your editor should be UTF-8. Check Python's default encoding with sys.getdefaultencoding().

    import re

    def find_context(word_, n_before, n_after, string_):
        # finds the word and n words before and after it
        b = r'\w+\W+' * n_before   # n_before words, each followed by separators
        a = r'\W+\w+' * n_after    # n_after words, each preceded by separators
        pattern = '(' + b + word_ + a + ')'
        return re.search(pattern, string_).group(1)

    s = "Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"

    # find 0 words before and 3 after the word capacità
    print(find_context('capacità', 0, 3, s))
    # capacità di 215 litri

    print(find_context(' capacit\u00e0', 0, 3, s))
    # capacità di 215 litri
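A possible hardening of the function above, offered as a sketch rather than as the answerer's code: re.escape keeps metacharacters in the search word from being read as regex syntax, and the function returns None instead of raising AttributeError when nothing matches. The name find_context_safe is hypothetical, and Python 3 is assumed.

```python
import re


def find_context_safe(word, n_before, n_after, text):
    """Return the word with n_before/n_after neighbouring words, or None."""
    before = r"\w+\W+" * n_before
    after = r"\W+\w+" * n_after
    pattern = "(" + before + re.escape(word) + after + ")"
    match = re.search(pattern, text, re.UNICODE)
    return match.group(1) if match else None


s = "Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"

# One word before and two words after the key.
print(find_context_safe("capacità", 1, 2, s))

# A word that does not occur yields None rather than an exception.
print(find_context_safe("lavatrice", 0, 3, s))
```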

It works, but my problem is the encoding... I have capacit\u00e0, not capacità.
@Usi Usi Do you mean that is how it is displayed on your computer? You have to configure your editor environment. Run print(sys.getdefaultencoding()); it should display your encoding. Very likely it is not UTF-8.
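The claim in this thread that capacit\u00e0 and capacità are two spellings of the same string can be checked directly. The following small Python 3 sketch also shows one way to obtain the escaped spelling from the accented one. Using json.dumps for that is an illustration, not something the thread proposes.

```python
import json

escaped = "capacit\u00e0"  # escape sequence inside a normal (non-raw) literal
literal = "capacità"       # the same character typed directly

print(escaped == literal)  # True: both spell the same 8-character string

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# producing the \u00e0 spelling seen in the question's file.
as_json = json.dumps(literal)

# With ensure_ascii=False the accented character is kept as-is.
as_json_plain = json.dumps(literal, ensure_ascii=False)

# ascii() gives Python's own escaped spelling, which uses \xe0 for this character.
as_ascii = ascii(literal)
```

Which spelling appears on screen is a matter of representation; the regex engine sees the same character either way once the text is a Unicode string.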
