I am trying to use a regular expression over a text that contains special characters such as à, è, ù, etc.

    filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
    compiled = re.compile(filter_2, flags=re.U | re.M)
    filter_list = re.findall(compiled, information)

Evaluating this expression produces the text below.
[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]
Now, when I apply another regular expression to the above text in order to extract the names inside the double square brackets, the result is wrong. Every name that contains a special character, like à, ù or è, is cut off at that character, so the result is not the one expected.

    filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
    another_compiled = re.compile(filter_6, flags=re.U | re.M)
    another_filtered_list = re.findall(another_compiled, (str(filter_list)))

These are my results:
[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]
These are the results I would like to achieve:

    MATCH 1   1. [2-28]     Pedro Calderón de la Barca
    MATCH 2   1. [43-72]    Christian Fürchtegott Gellert
    MATCH 3   1. [86-102]   Oliver Goldsmith
    MATCH 4   1. [118-123]  Hafez
    MATCH 5   1. [129-152]  Johann Gottfried Herder
    MATCH 6   1. [165-170]  Homer
    MATCH 7   1. [176-184]  Kālidāsa
    MATCH 8   1. [190-194]  Kant
    MATCH 9   1. [200-228]  Friedrich Gottlieb Klopstock
    MATCH 10  1. [244-268]  Gotthold Ephraim Lessing
    MATCH 11  1. [282-295]  Carl Linnaeus
    MATCH 12  1. [310-326]  James Macpherson
    MATCH 13  1. [343-364]  Jean-Jacques Rousseau
    MATCH 14  1. [379-397]  Friedrich Schiller
    MATCH 15  1. [412-431]  William Shakespeare
    MATCH 16  1. [449-456]  Spinoza
    MATCH 17  1. [462-480]  Emanuel Swedenborg
    MATCH 18  1. [501-522]  Karl Robert Mandelkow
    MATCH 19  1. [659-685]  Johann Joachim Winckelmann
All the regular expressions were tested online, where they work perfectly. Is there a way to actually include these special characters?
(?=|) says equals nothing. This (?=\|) says equals literal '|'. So, the match was satisfied whenever a non-class char was found, probably \xff stuff.

str(filter_list) converts your list of Unicode strings into a byte string. The byte string contains literal values like \xf3 (length 4) instead of ó (length 1).

Give a short(!) example of the original text, what you get from your two searches, and what you actually want. See MCVE. Regular expressions are hard enough to read without an example of what you are trying to achieve.
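A minimal sketch of the two points made in the comments, written for Python 3 rather than the Python 2 of the question. It makes some assumptions: the sample text is a shortened version of the one above, the names escaped, original, fixed, truncated and names are only illustrative, and ascii() is used as a stand-in for what Python 2's str() does to a list of Unicode strings (it turns ó into the four characters \xf3). The fixed pattern escapes the pipe and is run on the text itself instead of on a stringified list.

```python
import re

# Shortened sample of the wiki text from the question.
text = ("[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], "
        "[[Kālidāsa]], [[Jean-Jacques Rousseau|Rousseau]]")

# Assumption: ascii() imitates Python 2's str(filter_list), which
# replaces non-ASCII characters with escapes such as \xf3 or \u0101.
escaped = ascii([text])

# The question's pattern: (?=|) is a lookahead with two empty
# alternatives, so it succeeds at every position, including right
# before the backslash of an escape sequence.
original = re.compile(r'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))', re.U | re.M)

# Escaped pipe, and a single capturing group so that findall
# returns plain strings instead of tuples.
fixed = re.compile(r'(?<=\[\[)([\w\s.-]+)(?=\]\]|\|)', re.U | re.M)

# Reproduces the truncated names: the match stops at each backslash.
truncated = [m[0] for m in original.findall(escaped)]

# Searching the Unicode text directly keeps the accented characters.
names = fixed.findall(text)

print(truncated)
print(names)
```

In the Python 2 code from the question, the analogous change would presumably be to loop over filter_list and call findall on each Unicode string, rather than passing str(filter_list).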