4
\$\begingroup\$

I was looking at a Stack Overflow question and got somewhat carried away with improving the solution, well beyond the scope of that question.

In summary, we have a string such as

ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was unaffected by these events. 

We are required to replace the NOUNs using a list of noun phrases; similarly for the other uppercase words.

It seems natural to me to provide the lists using a dict, like this:

replacements = { 'NOUN': ["pool", "giraffe"], 'VERB': ["smiled", "waved"], 'ADJECTIVE': ["happy"], } 

My function to make the substitutions is then

import copy import re def replace_in_string(string, replacements): ''' Replace each key of 'replacements' by successive elements from its value list. >>> replace_in_string('foobar', {'bar': ['baz']}) 'foobaz' >>> replace_in_string('foobar/fiebar', {'fie': ['foo'], 'bar': ['baz', 'bor']}) 'foobaz/foobor' Each replacement list must contain enough elements to substutite all the matches: >>> replace_in_string('foobar/fiebar', {'bar': ['baz']}) Traceback (most recent call last): ... IndexError: pop from empty list It's fine to provide more replacements than needed: >>> replace_in_string('foobar/fiebar', {'fie': ['foo', 'fo']}) 'foobar/foobar' ''' regex = re.compile('|'.join(map(re.escape, replacements.keys()))) terms = copy.deepcopy(replacements) def replace_func(m): return terms[m.group()].pop(0) return regex.sub(replace_func, string) if __name__ == '__main__': import doctest doctest.testmod() 

Here's a small demo, using examples taken from the SO question:

replacements = { 'NOUN': ["pool", "giraffe"], 'VERB': ["smiled", "waved"], 'ADJECTIVE': ["happy"], } TextFileContent ='ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was unaffected by these events.' print(replace_in_string(TextFileContent, replacements)) 

A possible alternative is to leave excess matches unreplaced:

 def replace_func(m): items = terms[m.group()] return items.pop(0) if items else m.group() 

Would that be more useful? Is there scope to provide both behaviours?

\$\endgroup\$
2
  • 1
    \$\begingroup\$ You can shallow copy with terms = {k: iter(vs) for k, vs in replacements.items()} and next(terms[m.group()]). \$\endgroup\$ Commented 6 hours ago
  • \$\begingroup\$ Thanks @Peilonrayz - I experimented with several options, but not that one (which seems much better). If you get time, please consider writing a short answer presenting that suggestion with its pros and cons. \$\endgroup\$ Commented 3 hours ago

4 Answers 4

5
\$\begingroup\$

Tests

It's nice to see that you have included various tests to verify that replace_in_string works as expected.

Docstring

Its great that you have provided a docstring, but I think it could be a bit more descriptive (argument string is not even mentioned). For example:

""" string: a str. replacements: a dictionary whose keys are strings and whose values are lists of string replacements. For each key in replacements, search string for occurrences of that key replacing each occurrence with successive elements of the key's value. """ 

Methodology

Using re.sub seems to be the best approach to string replacement.

Will you ever have a situation where you might wish to ensure that a string being replaced is on a word boundary? In that case you could surround the key with r'\b', but then calling re.escape on the key would not work. So your keys will already need to be regular expressions that do not need escaping. Would it be useful to have an additional argument, e.g. is_regex, that defaults to False but when True means that the keys are already regular expressions that should not be escaped?

Recommendation and Alternatives

I think it's indicative of something being amiss if there are too few values for any key in replacements. Consequently, in your alternative implementation of replace_func my preference would be for you to either raise an Exception or at least issue a warning for this situation.

An alternative would be to provide an extra, optional argument to your function, e.g. strict_mode=True, which causes behavior as just described but when False, issues no warnings or exceptions.

Another alternative is to define the function so that the last replacement value for a key is never popped; there is always at least one element in the value list for any key. This remaining value will be used for all subsequent replacements. In that way, you can replace all occurrences of a string with the same value just by having a replacement list consisting of a single element. The only error situation now would be to define an empty list as the value for a key.

\$\endgroup\$
3
\$\begingroup\$

Given that the problem states enough replacement values must be provided to the function, it would seem reasonable to catch that IndexError and raise a ValueError instead about the invalid argument.

Something along the lines of the following. Docstring elided for brevity.

def replace_in_string(string, replacements): regex = re.compile('|'.join(map(re.escape, replacements.keys()))) terms = copy.deepcopy(replacements) def replace_func(m): try: return terms[m.group()].pop(0) except IndexError: raise ValueError(f"Insufficient replacements offered for {m.group()}") return regex.sub(replace_func, string) 
\$\endgroup\$
2
\$\begingroup\$

I'm not comfortable with that mutating pop, even if it is on a copy. Also, why not make the substitution state reusable? In the following style, subsequent calls on the same object will resume substitution without repeating output words.

import re import typing class TokenMap(typing.NamedTuple): pattern: re.Pattern replacements: dict[str, typing.Callable[[], str]] @classmethod def from_dict(cls, replacements: dict[str, typing.Sequence[str]]) -> typing.Self: return cls( pattern=re.compile( '|'.join( re.escape(needle) for needle in replacements.keys() ) ), replacements={ needle: iter(outputs).__next__ for needle, outputs in replacements.items() }, ) def callback(self, match_: re.Match) -> str: return self.replacements[match_.group()]() def sub(self, haystack: str) -> str: return self.pattern.sub(repl=self.callback, string=haystack) print( TokenMap.from_dict({ 'NOUN': ("pool", "giraffe"), 'VERB': ("smiled", "waved"), 'ADJECTIVE': ("happy",), }).sub('ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was unaffected by these events.') ) 

Or with currying,

import functools import re import typing def compile_map( replacements: dict[str, typing.Sequence[str]], ) -> typing.Callable[[str], str]: pattern = re.compile( '|'.join( re.escape(needle) for needle in replacements.keys() ) ) replacements = { needle: iter(outputs).__next__ for needle, outputs in replacements.items() } def callback(match_: re.Match) -> str: return replacements[match_.group()]() return functools.partial(pattern.sub, callback) mapper = compile_map({ 'NOUN': ("pool", "giraffe"), 'VERB': ("smiled", "waved"), 'ADJECTIVE': ("happy",), }) print(mapper('ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was unaffected by these events.')) print(mapper('and then VERB')) 
\$\endgroup\$
2
  • \$\begingroup\$ IMO the replacements's iter shouldn't be shared between sub calls. Multiple calls to sub seems like a valid part of the interface. \$\endgroup\$ Commented 2 hours ago
  • \$\begingroup\$ @Peilonrayz Multiple calls to sub are a valid part of this interface, right? It depends on what the end user wants - whether they want to resume the position in the substitution sequence or not. \$\endgroup\$ Commented 2 hours ago
2
\$\begingroup\$

I would prefer to take Iterable[tuple[str, Iterable[str]]] rather than dict[str, list[str]] for two reasons:

  1. Iterable allows for a wider range of inputs than binding to concrete types.

  2. Adding .items() in the function call is largly an irrelevant change.

    print(replace_in_string(TextFileContent, replacements.items())) 
terms = copy.deepcopy(replacements) 

IIRC deep copy will copy all of the values in the dictionary, even the strings. We can reduce the memory usage by only 'copying' the dictionary and setting the values to iterators (iter) and then advancing the iterator in replace_func.

terms = {k: iter(vs) for k, vs in replacements.items()} 
def replace_func(m): return next(terms[m.group()]) 
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.