Python regex with devanAgarI [duplicate]

Question

Consider the following piece of code:

#! /usr/bin/python
# coding: utf-8
# Prerequisite: sudo easy_install regex
import regex
import sys
import collections

test = True
lines = []
if (test):
    test_lines = """
अ पु. संस्कृत वर्णमाला का प्रथम वर्ण [विशेषतया तीन दिन तक चलने वाले सोम-याग (त्रिरात्र) के प्रथम दिन आज्य इति”, पञ्च. ब्रा. 2०.14.3।
अंश पु.1 अ. भाग (देवों, पितरों एवं मनुष्यों के लिए नियत) ऋ.वे. 1०.31.3; अ.वे. 11.1.5;1 ब. पशु-भाग, बौ.श्रौ.सू. का नाम, ऋ. प्रा. 17.4; निदा.सू. 1०5.2०।
""".split("\n")
    lines = test_lines
else:
    lines = sys.stdin.readlines()

full_text = "\n".join(lines)
full_text = regex.sub(ur'^(\S+)\s+(पु[ .])', '####\g<1>####\g<1> \g<2>', full_text, flags=regex.UNICODE|regex.MULTILINE)
print(full_text)

I expect the above to produce the following output:

####अ####अ पु. संस्कृत वर्णमाला का प्रथम वर्ण [विशेषतया तीन दिन तक चलने वाले सोम-याग (त्रिरात्र) के प्रथम दिन आज्य इति”, पञ्च. ब्रा. 2०.14.3।
####अंश####अंश पु.1 अ. भाग (देवों, पितरों एवं मनुष्यों के लिए नियत) ऋ.वे. 1०.31.3; अ.वे. 11.1.5;1 ब. पशु-भाग, बौ.श्रौ.सू. का नाम, ऋ. प्रा. 17.4; निदा.सू. 1०5.2०।

But I get unaltered text.

Your input should also be Unicode, not only the pattern. See stackoverflow.com/questions/393843/…
– Wiktor Stribiżew, commented May 24, 2016 at 23:08

Mark Tolonen · Accepted Answer · 2016-05-25 00:54:05Z

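As @WiktorStribiżew pointed out, when matching Unicode text the subject string must be a Unicode string too, not just the pattern.

The ur'...' literal only exists in Python 2, where an unprefixed string literal is a byte string, so change: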

test_lines = """

to:

test_lines = u"""

Also, for stdin change:

lines = sys.stdin.readlines()

to:

lines = [line.decode(sys.stdin.encoding) for line in sys.stdin]
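
For readers on Python 3, the same idea can be sketched as follows. This is only an illustration under a few assumptions that are not part of the original answer: it uses the standard-library re module instead of the third-party regex package, the helper name mark_headwords is hypothetical, and the sample input is a shortened version of the test text. The point it shows is the one made above: decode bytes to text before matching.

```python
import re

# Same pattern as in the question: a headword at the start of a line,
# then whitespace, then the marker "पु" followed by a space or a dot.
HEADWORD = re.compile(r'^(\S+)\s+(पु[ .])', flags=re.MULTILINE)


def mark_headwords(text):
    """Prefix each matching line with ####headword#### (hypothetical helper)."""
    return HEADWORD.sub(r'####\g<1>####\g<1> \g<2>', text)


# Simulate undecoded input, i.e. what a Python 2 str literal or raw stdin holds.
raw = 'अ पु. संस्कृत वर्णमाला\nअंश पु.1 अ. भाग'.encode('utf-8')

# Decode to text first. In Python 3, applying a str pattern to bytes
# raises TypeError instead of silently failing to match.
text = raw.decode('utf-8')
result = mark_headwords(text)
print(result)
```

One caveat for the Python 2 version above: as far as I know, sys.stdin.encoding can be None when input is piped rather than typed at a terminal, so a fallback such as sys.stdin.encoding or 'utf-8' may be needed there.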
