Assuming that the text is whitespace-tokenized for a natural language processing task, the goal is to count the words (regardless of casing) after checking them against some conditions.

The current code works as it's supposed to, but is there a way to optimize the if-else conditions to make it cleaner or more direct?
First the function has to determine whether:

- the token is an XML tag; if so, ignore it and move to the next token
- the token is in a list of pre-defined delayed sentence starts; if so, ignore it and move to the next token
```python
# Skip XML tags.
if re.search(r"(<\S[^>]*>)", token):
    continue
# Skip if sentence start symbols.
elif token in self.DELAYED_SENT_START:
    continue
```

Then it checks whether to toggle the `is_first_word` condition, i.e. whether the token is the first word of a sentence; note that there can be many sentences in each line.
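For instance, the two skip conditions behave like this on a few sample tokens (a standalone sketch with an abbreviated stand-in for `DELAYED_SENT_START`):

```python
import re

DELAYED_SENT_START = ["(", "[", "\"", "'"]  # abbreviated list, for illustration only

tokens = ["<b>", "Hello", "(", "world"]
kept = [t for t in tokens
        if not re.search(r"(<\S[^>]*>)", t)  # drop XML tags like "<b>"
        and t not in DELAYED_SENT_START]     # drop delayed sentence starts like "("
print(kept)  # ['Hello', 'world']
```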
- if the token is in a list of pre-defined sentence endings and the `is_first_word` condition is False, then set `is_first_word` to True and move on to the next token
- if there's nothing to case, since none of the characters falls under the letter regex, then set `is_first_word` to False and move on to the next token
```python
# Resets the `is_first_word` after seeing sent end symbols.
if not is_first_word and token in self.SENT_END:
    is_first_word = True
    continue
# Skips words with nothing to case.
if not re.search(r"[{}]".format(ll_lu_lt), token):
    is_first_word = False
    continue
```

Then finally, after checking for unweight-able words, the function updates the weight.
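A standalone trace of the toggling across a line with two sentences (using a simplified ASCII stand-in for the `ll_lu_lt` letter classes):

```python
import re

SENT_END = [".", ":", "?", "!"]
ll_lu_lt = "a-zA-Z"  # simplified stand-in for the Perl Unicode letter classes

line = "Hello world . Goodbye"
is_first_word = True
first_flags = []
for token in line.split():
    # Reset `is_first_word` after sentence-ending symbols.
    if not is_first_word and token in SENT_END:
        is_first_word = True
        continue
    # Skip tokens with nothing to case.
    if not re.search(r"[{}]".format(ll_lu_lt), token):
        is_first_word = False
        continue
    first_flags.append((token, is_first_word))
    is_first_word = False
print(first_flags)  # [('Hello', True), ('world', False), ('Goodbye', True)]
```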
First the weight is set to 0, then to 1 if `is_first_word` is False.

Otherwise, if the `possibly_use_first_token` option is set, check whether the token starts with a lowercase letter; if so, use the word with full weight. Otherwise, assign a 0.1 weight to it, which is better than leaving the weight at 0.

Then finally, update the counts if the weight is non-zero, and set the `is_first_word` toggle to False.
```python
current_word_weight = 0
if not is_first_word:
    current_word_weight = 1
elif possibly_use_first_token:
    # Gated special handling of first word of sentence.
    # Check if first character of token is lowercase.
    if token[0].islower():
        current_word_weight = 1
    elif i == 1:
        current_word_weight = 0.1

if current_word_weight > 0:
    casing[token.lower()][token] += current_word_weight

is_first_word = False
```

The full code is in the `train()` function below:
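The three weight outcomes can be checked in isolation; `word_weight` below is a hypothetical helper that mirrors the logic above:

```python
def word_weight(token, i, is_first_word, possibly_use_first_token=True):
    """Isolated mirror of the weight logic, for illustration."""
    current_word_weight = 0
    if not is_first_word:
        current_word_weight = 1
    elif possibly_use_first_token:
        if token[0].islower():
            current_word_weight = 1    # lowercase first word: trust its casing
        elif i == 1:
            current_word_weight = 0.1  # capitalized first word: low weight
    return current_word_weight

print(word_weight("bank", 1, is_first_word=True))   # 1
print(word_weight("Bank", 1, is_first_word=True))   # 0.1
print(word_weight("Bank", 5, is_first_word=False))  # 1
print(word_weight("Bank", 0, is_first_word=True))   # 0
```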
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
from collections import defaultdict, Counter

from six import text_type

from sacremoses.corpus import Perluniprops
from sacremoses.corpus import NonbreakingPrefixes

perluniprops = Perluniprops()


class MosesTruecaser(object):
    """
    This is a Python port of the Moses Truecaser from
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl
    """
    # Perl Unicode Properties character sets.
    Lowercase_Letter = text_type(''.join(perluniprops.chars('Lowercase_Letter')))
    Uppercase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))
    Titlecase_Letter = text_type(''.join(perluniprops.chars('Titlecase_Letter')))

    def __init__(self):
        # Initialize the object.
        super(MosesTruecaser, self).__init__()
        # Initialize the language specific nonbreaking prefixes.
        self.SKIP_LETTERS_REGEX = r"[{}{}{}]".format(self.Lowercase_Letter,
                                                     self.Uppercase_Letter,
                                                     self.Titlecase_Letter)
        self.SENT_END = [".", ":", "?", "!"]
        self.DELAYED_SENT_START = ["(", "[", "\"", "'", "&apos;", "&quot;",
                                   "&#91;", "&#93;"]

    def train(self, filename, possibly_use_first_token=False):
        casing = defaultdict(Counter)
        with open(filename) as fin:
            for line in fin:
                # Keep track of first words in the sentence(s) of the line.
                is_first_word = True
                for i, token in enumerate(line.split()):
                    # Skip XML tags.
                    if re.search(r"(<\S[^>]*>)", token):
                        continue
                    # Skip if sentence start symbols.
                    elif token in self.DELAYED_SENT_START:
                        continue

                    # Resets the `is_first_word` after seeing sent end symbols.
                    if not is_first_word and token in self.SENT_END:
                        is_first_word = True
                        continue
                    # Skips words with nothing to case.
                    if not re.search(self.SKIP_LETTERS_REGEX, token):
                        is_first_word = False
                        continue

                    current_word_weight = 0
                    if not is_first_word:
                        current_word_weight = 1
                    elif possibly_use_first_token:
                        # Gated special handling of first word of sentence.
                        # Check if first character of token is lowercase.
                        if token[0].islower():
                            current_word_weight = 1
                        elif i == 1:
                            current_word_weight = 0.1

                    if current_word_weight > 0:
                        casing[token.lower()][token] += current_word_weight

                    is_first_word = False
        return casing
```

Sample Input: https://gist.github.com/alvations/33799dedc4bab20dd24fb64970451e49
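The `casing` structure itself is just a `defaultdict(Counter)` keyed by the lowercased form, mapping each observed surface form to its accumulated weight. A standalone illustration of the update step:

```python
from collections import defaultdict, Counter

casing = defaultdict(Counter)
# Simulated (token, weight) observations, not real training data.
for token, weight in [("Bank", 1), ("bank", 1), ("Bank", 0.1)]:
    casing[token.lower()][token] += weight
print(dict(casing["bank"]))  # {'Bank': 1.1, 'bank': 1}
```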
Expected Output of train(): https://gist.github.com/alvations/d6d2363bca9a4a9a16e8076f8e8c1e60