Assuming that the text is whitespace-tokenized for a natural language processing task, the goal is to count the words (regardless of casing) after checking them against some conditions.

The current code works as it's supposed to, but is there a way to optimize the if-else conditions to make it cleaner or more direct?
First the function has to determine whether:

- the token is an XML tag; if so, ignore it and move to the next token
- the token is in a list of pre-defined delayed sentence starts; if so, ignore it and move to the next token
```python
# Skip XML tags.
if re.search(r"(<\S[^>]*>)", token):
    continue
# Skip if sentence start symbols.
elif token in self.DELAYED_SENT_START:
    continue
```

Then it checks whether to toggle the `is_first_word` condition, i.e. whether the token is the first word of a sentence; note that there can be many sentences in each line.
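For instance, the two skip conditions behave like this on a few sample tokens (a standalone sketch with an abbreviated stand-in for `DELAYED_SENT_START`):

```python
import re

DELAYED_SENT_START = ["(", "[", "\"", "'"]  # abbreviated list, for illustration only

tokens = ["<b>", "Hello", "(", "world"]
kept = [t for t in tokens
        if not re.search(r"(<\S[^>]*>)", t)  # drop XML tags like "<b>"
        and t not in DELAYED_SENT_START]     # drop delayed sentence starts like "("
print(kept)  # ['Hello', 'world']
```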
- if the token is in a list of pre-defined sentence endings and the `is_first_word` condition is False, then set `is_first_word` to True and move on to the next token
- if there's nothing to case, since none of the characters falls under the letter regex, then set `is_first_word` to False and move on to the next token
```python
# Resets the `is_first_word` after seeing sent end symbols.
if not is_first_word and token in self.SENT_END:
    is_first_word = True
    continue
# Skips words with nothing to case.
if not re.search(r"[{}]".format(ll_lu_lt), token):
    is_first_word = False
    continue
```

Then finally, after checking for unweight-able words, the function updates the weight.
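A standalone trace of the toggling across a line with two sentences (using a simplified ASCII stand-in for the `ll_lu_lt` letter classes):

```python
import re

SENT_END = [".", ":", "?", "!"]
ll_lu_lt = "a-zA-Z"  # simplified stand-in for the Perl Unicode letter classes

line = "Hello world . Goodbye"
is_first_word = True
first_flags = []
for token in line.split():
    # Reset `is_first_word` after sentence-ending symbols.
    if not is_first_word and token in SENT_END:
        is_first_word = True
        continue
    # Skip tokens with nothing to case.
    if not re.search(r"[{}]".format(ll_lu_lt), token):
        is_first_word = False
        continue
    first_flags.append((token, is_first_word))
    is_first_word = False
print(first_flags)  # [('Hello', True), ('world', False), ('Goodbye', True)]
```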
First the weight is set to 0, then to 1 if `is_first_word` is False.

Otherwise, if the `possibly_use_first_token` option is set, check whether the token starts with a lowercase letter; if so, use the word with full weight. Otherwise, assign a 0.1 weight to it, which is better than leaving the weight at 0.

Then finally, update the counts if the weight is non-zero, and set the `is_first_word` toggle to False.
```python
current_word_weight = 0
if not is_first_word:
    current_word_weight = 1
elif possibly_use_first_token:
    # Gated special handling of first word of sentence.
    # Check if first character of token is lowercase.
    if token[0].islower():
        current_word_weight = 1
    elif i == 1:
        current_word_weight = 0.1

if current_word_weight > 0:
    casing[token.lower()][token] += current_word_weight

is_first_word = False
```

The full code is in the `train()` function below:
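The three weight outcomes can be checked in isolation; `word_weight` below is a hypothetical helper that mirrors the logic above:

```python
def word_weight(token, i, is_first_word, possibly_use_first_token=True):
    """Isolated mirror of the weight logic, for illustration."""
    current_word_weight = 0
    if not is_first_word:
        current_word_weight = 1
    elif possibly_use_first_token:
        if token[0].islower():
            current_word_weight = 1    # lowercase first word: trust its casing
        elif i == 1:
            current_word_weight = 0.1  # capitalized first word: low weight
    return current_word_weight

print(word_weight("bank", 1, is_first_word=True))   # 1
print(word_weight("Bank", 1, is_first_word=True))   # 0.1
print(word_weight("Bank", 5, is_first_word=False))  # 1
print(word_weight("Bank", 0, is_first_word=True))   # 0
```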
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
from collections import defaultdict, Counter

from six import text_type

from sacremoses.corpus import Perluniprops
from sacremoses.corpus import NonbreakingPrefixes

perluniprops = Perluniprops()


class MosesTruecaser(object):
    """
    This is a Python port of the Moses Truecaser from
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl
    """
    # Perl Unicode Properties character sets.
    Lowercase_Letter = text_type(''.join(perluniprops.chars('Lowercase_Letter')))
    Uppercase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))
    Titlecase_Letter = text_type(''.join(perluniprops.chars('Titlecase_Letter')))

    def __init__(self):
        # Initialize the object.
        super(MosesTruecaser, self).__init__()
        # Initialize the language specific nonbreaking prefixes.
        self.SKIP_LETTERS_REGEX = r"[{}{}{}]".format(self.Lowercase_Letter,
                                                     self.Uppercase_Letter,
                                                     self.Titlecase_Letter)
        self.SENT_END = [".", ":", "?", "!"]
        self.DELAYED_SENT_START = ["(", "[", "\"", "'", "&apos;", "&quot;",
                                   "&#91;", "&#93;"]

    def train(self, filename, possibly_use_first_token=False):
        casing = defaultdict(Counter)
        with open(filename) as fin:
            for line in fin:
                # Keep track of first words in the sentence(s) of the line.
                is_first_word = True
                for i, token in enumerate(line.split()):
                    # Skip XML tags.
                    if re.search(r"(<\S[^>]*>)", token):
                        continue
                    # Skip if sentence start symbols.
                    elif token in self.DELAYED_SENT_START:
                        continue

                    # Resets the `is_first_word` after seeing sent end symbols.
                    if not is_first_word and token in self.SENT_END:
                        is_first_word = True
                        continue
                    # Skips words with nothing to case.
                    if not re.search(self.SKIP_LETTERS_REGEX, token):
                        is_first_word = False
                        continue

                    current_word_weight = 0
                    if not is_first_word:
                        current_word_weight = 1
                    elif possibly_use_first_token:
                        # Gated special handling of first word of sentence.
                        # Check if first character of token is lowercase.
                        if token[0].islower():
                            current_word_weight = 1
                        elif i == 1:
                            current_word_weight = 0.1

                    if current_word_weight > 0:
                        casing[token.lower()][token] += current_word_weight

                    is_first_word = False
        return casing
```

Sample Input: https://gist.github.com/alvations/33799dedc4bab20dd24fb64970451e49
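The `casing` structure itself is just a `defaultdict(Counter)` keyed by the lowercased form, mapping each observed surface form to its accumulated weight. A standalone illustration of the update step:

```python
from collections import defaultdict, Counter

casing = defaultdict(Counter)
# Simulated (token, weight) observations, not real training data.
for token, weight in [("Bank", 1), ("bank", 1), ("Bank", 0.1)]:
    casing[token.lower()][token] += weight
print(dict(casing["bank"]))  # {'Bank': 1.1, 'bank': 1}
```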
Expected Output of train(): https://gist.github.com/alvations/d6d2363bca9a4a9a16e8076f8e8c1e60