I have a question: starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
import re
import string

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations

arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)

def normalize_arabic(text):
    # Map every alef variant to a bare alef
    text = re.sub("[إأآا]", "ا", text)
    return text

def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text

def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

def remove_repeating_char(text):
    # Collapse any run of the same character into a single one
    return re.sub(r'(.)\1+', r'\1', text)

Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now, given this case, how can I find the word "اقرا" in the original input_test?
The input text can be in English, too. I'm thinking of using a regex, but I don't know where to start. A plain search on the original text does not match, because the original word is spelled "إقرأ":
input_test.find("اقرا")
The repl argument (second one) that is passed to re.sub() can be a function.
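
One way to use a function as repl here is sketched below. It assumes words are separated by whitespace, and the names clean_word and find_original are made up for illustration. The cleaning steps are condensed from the question; the diacritics are written as the Unicode range U+064B to U+0652 plus the tatweel U+0640, which should cover the same characters as the verbose pattern. The callback cleans each original word, records its position when the cleaned form equals the target, and returns the word unchanged.

```python
import re
import string

# Cleaning steps condensed from the question (assumed equivalent).
ARABIC_PUNCTUATIONS = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
TRANSLATOR = str.maketrans("", "", ARABIC_PUNCTUATIONS + string.punctuation)
# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B..U+0652) and tatweel (U+0640)
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def clean_word(word):
    """Apply normalization, diacritic/punctuation removal and de-duplication."""
    word = re.sub("[إأآا]", "ا", word)
    word = DIACRITICS.sub("", word)
    word = word.translate(TRANSLATOR)
    return re.sub(r"(.)\1+", r"\1", word)


def find_original(text, target):
    """Return (start, end, original_word) for each word that cleans to target."""
    hits = []

    def record(match):
        # Called by re.sub() once per whitespace-delimited word
        if clean_word(match.group()) == target:
            hits.append((match.start(), match.end(), match.group()))
        return match.group()  # leave the text itself untouched

    re.sub(r"\S+", record, text)
    return hits


input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
hits = find_original(input_test, "اقرا")
for start, end, word in hits:
    print(start, end, word)
```

Each hit carries the start and end offsets in input_test, so input_test[start:end] gives back the original spelling. Since the callback only collects matches, re.finditer(r"\S+", text) would work just as well; re.sub() becomes the better fit when the matched words should also be replaced or marked up in the original text.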