I need to extract specific names in Arabic/Persian text (similar to proper nouns in English) using Python's re library.

Example (the word "شرکت" means "company", and we want to extract the company's name):

input: شرکت تست گستران خلیج فارس output: تست گستران خلیج فارس 

I've seen [this answer], and replacing "university" with "شرکت" in that example would work for me. However, I don't understand how to match the keyword with a regex on Arabic Unicode text, since the following does not match:

re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None 

1 Answer

Python 2 does not treat string literals as Unicode by default, whether you paste non-ASCII letters into the code or write a \u escape. You have to be explicit about it with the u prefix:

re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A") 

Otherwise, the Arabic pattern is stored as its encoded bytes, which differ from the Unicode code points, and the string on the right keeps its literal backslashes, because Python 2 does not recognize \u as an escape in a byte string.
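To make the bytes-versus-code-points difference concrete, here is a minimal sketch. It assumes UTF-8 as the encoding (the variable names are only illustrative) and runs on both Python 2 and Python 3.3+:

```python
# -*- coding: utf-8 -*-
import re

# The keyword written with explicit escapes, so the source file's
# encoding cannot affect the result.
word = u"\u0634\u0631\u06A9\u062A"

# Assumption: the text is encoded as UTF-8, where each of these
# letters takes two bytes.
utf8_bytes = word.encode("utf-8")

print(len(word))        # 4 code points
print(len(utf8_bytes))  # 8 bytes

# Matching Unicode against Unicode succeeds.
print(re.match(word, word) is not None)  # True
```

A pattern made of those 8 bytes cannot match a string of 4 code points, which is why the call in the question returns None on Python 2.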

Another option is to import from the future. In Python 3 every string literal is Unicode, which makes the u"..." prefix redundant, and Python 2 can opt in to the same behavior:

from __future__ import unicode_literals 

This makes every string literal in the module a Unicode string, with no u"" prefix needed.
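Putting this together for the extraction in the question, a rough sketch might look like the following. The helper name extract_company_name and the pattern (keyword, whitespace, then the rest of the line) are assumptions for illustration; where a real company name ends depends on your data.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re

# The keyword "company", spelled with escapes to avoid encoding surprises.
KEYWORD = "\u0634\u0631\u06A9\u062A"

# Assumed pattern: keyword, one or more whitespace characters,
# then everything up to the end of the line as the name.
COMPANY_RE = re.compile(KEYWORD + r"\s+(.+)", re.UNICODE)


def extract_company_name(text):
    """Return the text following the keyword, or None if it is absent."""
    match = COMPANY_RE.search(text)
    return match.group(1).strip() if match else None


name = "تست گستران خلیج فارس"
print(extract_company_name(KEYWORD + " " + name) == name)  # True
```

re.search is used instead of re.match so the keyword can appear anywhere in the input, not only at the start.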
