Python regex match fails with UTF-8 characters

Question

I have a Selenium/Python project which uses a regex match to find HTML elements. The element attributes sometimes include the Danish/Norwegian characters ÆØÅ. The problem is in the snippet below:

    if (re.match(regexp_expression, compare_string)):
        result = True
    else:
        result = False

Both regexp_expression and compare_string are manipulated before the regex match is executed. If I print them just before the snippet above runs, and also print the result, I get the following output:

    Regex_expression: [^log på$]
    compare string: [log på]
    result = false

I put brackets on to make sure that there were no whitespaces. They are only part of the print statement, and not part of the String variables.
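Brackets in a print statement rule out stray whitespace, but they do not reveal whether a value is text or an encoded byte string. A minimal sketch of a more telling check is below; `describe` is a hypothetical helper, not part of the original module, and the sample values are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def describe(value):
    """Return the type name and repr of a value (hypothetical debug helper)."""
    return "{0}: {1!r}".format(type(value).__name__, value)


text = u"log på"                # Unicode text, 6 code points
encoded = text.encode("utf-8")  # byte string, å becomes 2 bytes

# The two values can look identical when printed to a terminal,
# but their types, reprs and lengths differ.
print(describe(text))
print(describe(encoded))
print(len(text), len(encoded))  # 6 7
```

Printing the type and repr of both arguments right before the match call would show at a glance whether one side is text and the other is bytes.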

However, if I try to reproduce the problem in a separate script, like this:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    regexp_expression = "^log på$"
    compare_string = "log på"
    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else:
        print("result = false")
        result = False

Then the result is true.

How can this be? To make it even stranger, it worked earlier, and I am not sure what I edited, that made it go boom...

The full module containing the regex compare method is below. I did not write it myself, so I am not 100% familiar with the reasons for all the replace statements and string manipulation, but I would think it shouldn't matter, since I can check the strings just before the failing match call at the bottom...

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    def regexp_compare(regexp_expression, compare_string):
        #final int DOTALL
        #try: // include try catch for "PatternSyntaxException" while testing/including a new symbol in this method..
        #catch(PatternSyntaxException e):
        #    System.out.println("Regexp>>"+regexp_expression)
        #    e.printStackTrace()
        #*/
        if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
            print("return 1")
            return True
        if(not compare_string or not regexp_expression):
            print("return 2")
            return False
        regexp_expression = regexp_expression.lower()
        compare_string = compare_string.lower()
        if(not regexp_expression.strip()):
            regexp_expression = ""
        if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
            regexp_expression = ""
        else:
            regexp_expression = regexp_expression.replace("\\","\\\\")
            regexp_expression = regexp_expression.replace("\\.","\\\\.")
            regexp_expression = regexp_expression.replace("\\*", ".*")
            regexp_expression = regexp_expression.replace("\\(", "\\\\(")
            regexp_expression = regexp_expression.replace("\\)", "\\\\)")
        regexp_expression_arr = regexp_expression.split("|")
        regexp_expression = ""
        for i in range(0, len(regexp_expression_arr)):
            if(not(regexp_expression_arr[i].startswith("^"))):
                regexp_expression_arr[i] = "^"+regexp_expression_arr[i]
            if(not(regexp_expression_arr[i].endswith("$"))):
                regexp_expression_arr[i] = regexp_expression_arr[i]+"$"
            regexp_expression = regexp_expression_arr[i] if regexp_expression == "" else regexp_expression+"|"+regexp_expression_arr[i]
        result = None
        print("Regex_expression: [" + regexp_expression+"]")
        print("compare string: [" + compare_string+"]")
        if (re.match(regexp_expression, compare_string)):
            print("result true")
            result = True
        else:
            print("result = false")
            result = False
        print("return result")
        return result

^log på$ is not a good use of regexes. If you don't have a pattern, why not simply use ==?
– Maroun, Commented Jul 6, 2015 at 11:25
The thing is, in my case it is redundant with the ^*$, but this is a general util class used for several matches, and I'm not the author of it. I guess there are reasons for the regex syntax in other cases. Like if one needs to check if a single html class is present in an element's full class string. This time I just happen to match the buttons text instead of classes.
– jumps4fun, Commented Jul 6, 2015 at 11:27
Just realized that I got the syntax of the regex string wrong. I thought it meant "starts with OR ends with", but that OR is an AND. I would agree that that is a waste of a regex. I don't know why they chose to code it that way...
– jumps4fun, Commented Jul 7, 2015 at 12:57

Coding Monkey · Accepted Answer · 2015-07-06 11:41:37Z

It's likely that you are comparing a Unicode string to a non-Unicode string.

For example, in the following:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    regexp_expression = "^log på$"
    compare_string = u"log på"
    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else:
        print("result = false")
        result = False

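This prints "result = false". In Python 2, the plain literal is a byte string in which å is stored as two UTF-8 bytes, while the u"..." literal stores it as a single code point, so the pattern cannot match. So there is likely a point in your manipulation where something is not Unicode.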

The reverse combination gives the same false result:

    regexp_expression = u"^log på$"
    compare_string = "log på"
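One way to avoid the mismatch is to coerce both arguments to Unicode text before calling re.match. The sketch below is an illustration under assumptions, not the original module's code: `to_text` and `unicode_match` are hypothetical helpers, and UTF-8 is assumed as the encoding of any byte strings.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re


def to_text(value, encoding="utf-8"):
    """Decode byte strings to Unicode text; return text unchanged."""
    # In Python 2, bytes is an alias for str, so this covers plain literals.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def unicode_match(pattern, candidate):
    """Hypothetical helper: match only after both sides are Unicode text."""
    return re.match(to_text(pattern), to_text(candidate), re.UNICODE) is not None


pattern = u"^log på$".encode("utf-8")  # byte string, as a plain Python 2 literal would be
candidate = u"log på"                   # Unicode text

print(unicode_match(pattern, candidate))  # True
```

Decoding at the boundary, where the strings first enter the program, is usually simpler than converting at each comparison, but a helper like this can serve as a stopgap in a shared utility method.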

This is great. I just used this approach: stackoverflow.com/questions/4987327/…. It turns out my regex is Unicode, and the compare string is an ordinary string. They are already that way before they are passed to the regexp_compare method. Thanks!
