This is my code

# -*- coding: utf-8 -*-
import json
import re
with open("/Users/paul/Desktop/file.json") as json_file:
    file = json.load(json_file)
print file["desc"]
key="capacità"
result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
print result

This is the content of the file

{ "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" } 

My result is []

but what I want is result = "capacità"

  • Which version of Python are you using? Commented Oct 5, 2015 at 22:47
  • Possible duplicate of python and regular expression with unicode Commented Oct 5, 2015 at 22:54
  • I'm using Python 2.7.1 Commented Oct 5, 2015 at 22:55
  • @UsiUsi capacit\u00e0 and capacità are the same word! It is your editor that is not displaying the characters correctly. For example, I ran my function as print(find_context(' capacit\u00e0',0,3,s)) and it works, because the computer sees only 0s and 1s. Commented Oct 5, 2015 at 23:36
  • Ok but I can't catch capacità with my regex... why? Commented Oct 5, 2015 at 23:37

2 Answers

1

You need to treat your string as a Unicode string, like this:

str = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" 

And as you can see if you print str.encode('utf-8') you'll get:

Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+ 

In the same way, you can make your regex string a Unicode string or a raw string with the u or r prefix, respectively.
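As a minimal sketch of how this could apply to the question's data: the JSON text is inlined here instead of being read from the file, and the third capture group around the key is my own addition so that the matched word shows up in the result. The json module already decodes the \u00e0 escape into a Unicode string, so only the key and the pattern need the u prefix.

```python
# -*- coding: utf-8 -*-
import json
import re

# The JSON text from the question, inlined for illustration. json.loads turns
# the \u00e0 escape into the real character, so desc is already Unicode.
raw = '{ "desc": "Frigocongelatore, capacit\\u00e0 di 215 litri, h 122 cm, classe A+" }'
desc = json.loads(raw)["desc"]

# The u prefix makes the key a Unicode string on Python 2 as well as Python 3.
key = u"capacità"

# Up to three words before the key, the key itself, up to three words after.
pattern = u"((?:\\S+\\s+){0,3})(" + re.escape(key) + u")\\s+((?:\\S+\\s*){0,3})"
result = re.findall(pattern, desc, re.IGNORECASE | re.UNICODE)
print(result)
```

If this works as intended, result holds one tuple whose middle element is the key, with the surrounding words in the first and last elements.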


4 Comments

How can I convert "capacità" to capacit\u00e0?
OK, I have understood... but if I read "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" from a file, how can I put the u in front of it to say that it is Unicode?
var="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" followed by var=u(var) is not valid.
I have totally rewritten my answer
0

You can use this function to display different encodings.

The default encoding in your editor should be UTF-8. Check your settings with sys.getdefaultencoding().

def find_context(word_, n_before, n_after, string_):
    # finds the word and n words before and after it
    import re
    b = '\w+\W+' * n_before
    a = '\W+\w+' * n_after
    pattern = '(' + b + word_ + a + ')'
    return re.search(pattern, string_).groups(1)[0]

s = "Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"
# find 0 words before and 3 after the word capacità
print(find_context('capacità',0,3,s) )
# output: capacità di 215 litri
print(find_context(' capacit\u00e0',0,3,s) )
# output: capacità di 215 litri
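A possible variant for Unicode input, sketched under a few assumptions: the name find_context_unicode is hypothetical, the text and the word are passed as Unicode strings, and re.UNICODE is set so that \w also matches accented letters on Python 2. Unlike the function above, it returns None instead of raising when the word is not found.

```python
# -*- coding: utf-8 -*-
import re

def find_context_unicode(word, n_before, n_after, text):
    """Return word plus n_before/n_after neighbouring words, or None if absent."""
    # Exactly n_before "word + separator" groups, then the literal word,
    # then exactly n_after "separator + word" groups.
    before = u"(?:\\w+\\W+){%d}" % n_before
    after = u"(?:\\W+\\w+){%d}" % n_after
    pattern = before + re.escape(word) + after
    match = re.search(pattern, text, re.UNICODE)
    return match.group(0) if match else None

s = u"Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+"
# 0 words before and 3 words after the word capacità
print(find_context_unicode(u"capacità", 0, 3, s))
```

Escaping the word with re.escape keeps any regex metacharacters in it from being interpreted as part of the pattern.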

3 Comments

It works, but my problem is the encoding... I have capacit\u00e0, not capacità.
@Usi Usi, do you mean that is how it is displayed on your computer? You have to configure your editor environment. Run print(sys.getdefaultencoding()); it should display your encoding. Very likely it is not UTF-8.
I have totally rewritten my answer
