Elegant way to examine a string by "word" (whitespace-delimited substring)

Question

Suppose I have the following list of strings myStrList:

myStrList = {"This is A 123a Test", "This is A 123ab Test", "This is A 123-a 456-B 7c-89 Test"};

I wish to create a function that takes a single string str and returns the string "words" that contain exactly one letter character.

I will define a "word" as a substring delimited by whitespace but containing no whitespace of its own. I can generate a list of string "words" by simply passing the string to the function StringSplit. For example:

StringSplit["This is A 123a Test"] (* {"This", "is", "A", "123a", "Test"} *)

So the words comprising the string "This is A 123a Test" are "This", "is", "A", "123a", and "Test".

Now I wish to write a function myWordFunction that returns the words containing exactly one letter character. I can do this by using StringSplit to generate the list of words, and then I select the words with exactly one letter character:

myStrList = {"This is A 123a Test", "This is A 123ab Test", "This is A 123-a 456-B 7c-89 Test"};
myWordFunction[str_String] := Module[{wordsList, substringsList},
  wordsList = StringSplit[str];
  substringsList = Select[wordsList, StringCount[#, _?LetterQ] == 1 &];
  Return[substringsList];
];
myWordFunction[#] & /@ myStrList

This works, but it seems rather inelegant to first split the string into words and then do the (substring) analysis and selection. Is there a way to specify a "word" (whitespace-delimited substring) directly in a (string) pattern?

Something like StringCases["This is A 123-a 456-B 7c-89 Test", " " ~~ Shortest[a__] ~~ " " /; (StringCount[a, LetterCharacter] == 1) :> a] should work, but I'm still experimenting: for some reason it finds only "A" and "456-B" and not "123-a" and "7c-89", even though, for instance, StringCount["123-a", LetterCharacter] == 1 yields True. It seems to be missing the lowercase letters when put inside the StringCases call. Stay tuned.
– march, Commented Oct 4, 2024 at 22:05
I see now why StringCases[#, " " ~~ (a__ /; StringCount[a, LetterCharacter] == 1) ~~ " " :> a] & doesn't work. The reason is that, by default, StringCases doesn't scan overlapping subsequences of characters in the string. Thus, once it has found " A ", it can't find " 123-a " because those two words share the whitespace character between them. I'm not sure there's a way around this using this code, and this makes me think it can't be done without doing something analogous to splitting the string at the whitespace.
– march, Commented Oct 4, 2024 at 22:20
All that said: here's an adaptation of your code that makes things simpler: myWordFunction[str_String] := Select[StringSplit[str], StringCount[#, LetterCharacter] == 1 &]. This avoids the unnecessary use of Module. (Also, by the way, Return is not used in Mathematica in the same way as in other languages. You don't need to Return substringsList. You could have just done Module[ ... substringsList = Select[wordsList, StringCount[#, _?LetterQ] == 1 &]].)
– march, Commented Oct 4, 2024 at 22:27
@march Try StringCases[#, " " ~~ (a__ /; StringCount[a, LetterCharacter] == 1) ~~ " " :> a, Overlaps -> All] &
– creidhne, Commented Oct 4, 2024 at 23:29
@creidhne There you go! I should have looked at the documentation to see if there was such an option. Thanks! Now I just need to modify it slightly to catch the edge cases where one of the words is at the beginning or end of the string.
– march, Commented Oct 4, 2024 at 23:59
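
The approach discussed in the comments above might be sketched as follows. This is only a sketch under stated assumptions: the name wordsWithOneLetter is hypothetical, and padding the string with one space on each side is just one assumed way to cover words at the start or end of the string. Restricting a to non-whitespace characters is a further assumption, added so that a match cannot span several words.

```wolfram
(* Hypothetical helper: pad with spaces so the first and last words are delimited too *)
wordsWithOneLetter[str_String] := StringCases[
  " " <> str <> " ",
  WhitespaceCharacter ~~ (a : (Except[WhitespaceCharacter] ..)) ~~ WhitespaceCharacter /;
    StringCount[a, LetterCharacter] == 1 :> a,
  Overlaps -> All  (* adjacent words share a delimiter, so matches must be allowed to overlap *)
]

wordsWithOneLetter["A 1a b2"]
```

Without the Overlaps option, the space after "A" would be consumed by the first match and "1a" would be skipped, which is the behavior described in the comments.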

user1066 · Accepted Answer · 2024-10-09 21:50:07Z

Pick[#, StringCount[#, _?LetterQ], 1] & /@ StringSplit@myStrList (* {{"A", "123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89"}} *)

With a Regular Expression

Pick[#, StringCount[#, RegularExpression["[A-Za-z]"]], 1] & /@ StringSplit@myStrList (* {{"A", "123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89"}} *)

or a regex with a negative look-ahead ((?![A-Za-z])) and a negative look-behind ((?<![A-Za-z]))

Pick[#, StringContainsQ[#, RegularExpression["(?<![A-Za-z])[A-Za-z](?![A-Za-z])"]]] & /@ StringSplit@myStrList (* {{"A", "123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89"}} *)

A bit more readable, perhaps:

(Match a single character from the range [A-Za-z] where the characters immediately before and after the match are not in that range. Note that this tests for an isolated letter rather than counting letters, so a word such as "a1b" would also pass.)

With[{str = StringTemplate["(?<!`1`)`1`(?!`1`)"]@"[A-Za-z]"}, Pick[#, StringContainsQ[#, RegularExpression[str]]]] & /@ StringSplit@myStrList
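
To illustrate how the look-around test can differ from counting letters, here is a small sketch. The symbol isolated and the sample word "a1b" are hypothetical and are not part of the original answer. The ASCII class [A-Za-z] is used here so that punctuation between Z and a is not matched.

```wolfram
(* Hypothetical name: a letter with no letter immediately before or after it *)
isolated = RegularExpression["(?<![A-Za-z])[A-Za-z](?![A-Za-z])"];

StringContainsQ["a1b", isolated]      (* True: "a" and "b" are each isolated *)
StringCount["a1b", LetterCharacter]   (* 2, so the counting test rejects this word *)
```

For the test strings in the question the two tests agree, because no word there contains two letters separated by a non-letter.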

Update

A (slight) modification of the neat answer by lericr

StringCases[RegularExpression["[\\d-]*(?<![A-Za-z])[A-Za-z](?![A-Za-z])[\\d-]*"]] @myStrList (* {{"A", "123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89"}} *)

Further checks

lst2 = {"-a123", "123a-", "-123a123-"}; StringCases[RegularExpression["[\\d-]*(?<![A-Za-z])[A-Za-z](?![A-Za-z])[\\d-]*"]]@lst2 // Flatten (* {"-a123", "123a-", "-123a123-"} *)

and

Pick[#, (DigitQ[#] || LetterQ[#]) & /@ #, False]&@(Characters[myStrList] // Flatten // Union) (* {"-", " "} *)

lericr · Accepted Answer · 2024-10-04 23:08:50Z

If I understand your question, you want to avoid splitting the string. That is, you want to get the result "all in one go". I'm not yet 100% confident that this works for every possible case, but it works for your test cases and a few that I thought up.

StringCases[RegularExpression["\\b[^[:alpha:]\\s]*[[:alpha:]][^[:alpha:]\\s]*\\b"]]

That said, I don't agree that splitting is inelegant. I think it's actually a clearer representation of your semantics, as I understand them.

Jason B. · Accepted Answer · 2024-10-07 19:58:07Z

Select[StringSplit[str], StringCount[#, _? LetterQ] == 1 &]

Is more readable than any of the other solutions. I have no clue what the OP thinks is inelegant about it.

Agreed.
– lericr, Commented Oct 7, 2024 at 20:29
Thanks - I might not have posted this cheeky response if I had read your entire answer above.
– Jason B., Commented Oct 7, 2024 at 21:15
(+1) Won't you have to Map onto myStrList, or the like? (such as Select[StringCount[#, _?LetterQ] == 1 &] /@ StringSplit[myStrList])
– user1066, Commented Oct 7, 2024 at 22:31
The StringSplit method seemed inelegant to me because it's destructive. For example, StringSplit["This is A 123a Test"] and StringSplit["This is A 123a\n Test"] return the same list because, by default, StringSplit splits the string at all whitespace characters. In the general case, in which the string has more than one whitespace character (or more than one type of whitespace) between "words," it doesn't seem straightforward to reverse the split and obtain a processed version of the original string. I didn't request that in my original post, but it seems like something one might want.
– Andrew, Commented Oct 10, 2024 at 15:16
The StringSplit method also seemed inelegant to me given how many tools, including WordBoundary, the Wolfram Language provides for working with patterns and string patterns. I assumed I was missing some simple solution using such tools. So I was honestly surprised by the relative verbosity of a solution using the pattern/string pattern approach (see Syed's answer; this is in no way a criticism of that answer!).
– Andrew, Commented Oct 10, 2024 at 15:30
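
On the point about the split being destructive, one possible workaround is to have StringSplit keep the delimiters by giving it a rule as the delimiter pattern. The sketch below is an illustration only: the names parts and marked and the bracket markup are hypothetical, and strings with leading or trailing whitespace are not considered.

```wolfram
str = "This is A 123a\n Test";

(* A rule as the delimiter keeps each whitespace run in the result *)
parts = StringSplit[str, ws : Whitespace :> ws];

(* The pieces can be joined back into the original string *)
StringJoin[parts] === str

(* Process only the one-letter words; whitespace pieces contain no letters and pass through *)
marked = StringJoin[
  If[StringCount[#, LetterCharacter] == 1, "[" <> # <> "]", #] & /@ parts]
```

Because the whitespace runs are carried along unchanged, the processed string keeps the original spacing and line breaks.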

Syed · Accepted Answer · 2024-10-05 08:37:15Z

As an exercise, I wanted to write a pattern and not use StringCount; and I did. But then to test corner cases I modified your list to include a few extra dashes:

myStrList = {"This is A --123a Test", "This is A 123ab Test", "This is A 123-a 456-B 7c-89-- Test"};
p1 = WordBoundary ~~
   k : ((Except[LetterCharacter | WhitespaceCharacter] | "-") ... ~~
      LetterCharacter ~~
      (Except[LetterCharacter | WhitespaceCharacter] | "-") ...) ~~
   WordBoundary :> k;
StringCases[#, p1] & /@ myStrList

{{"A", "123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89"}}

which swallows the extra dashes. The reason (if I am not mistaken) is that WordBoundary treats - as a non-word character, so the nearest word boundaries fall between the dashes and the adjacent digits.

So I tried a slightly modified pattern that can be used if split words are being tested.

p2 = StartOfString ~~
   k : ((Except[LetterCharacter | WhitespaceCharacter] | "-") ... ~~
      LetterCharacter ~~
      (Except[LetterCharacter | WhitespaceCharacter] | "-") ...) ~~
   EndOfString :> k;
StringCases[#, p2] & /@ (myStrList // StringSplit) // Map[Flatten[#, 1] &]

{{"A", "--123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89--"}}
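
A possible way to keep the dashes without splitting first is to anchor on whitespace rather than on word boundaries. The sketch below adapts the regex from lericr's answer by swapping each \\b for a whitespace look-around. This is an assumed variation, not something proposed in the thread, and the names testStrings and wholeWord are hypothetical.

```wolfram
testStrings = {"This is A --123a Test", "This is A 123ab Test",
   "This is A 123-a 456-B 7c-89-- Test"};

(* (?<!\S) : not preceded by a non-whitespace character (start of a "word")
   (?!\S)  : not followed by a non-whitespace character (end of a "word") *)
wholeWord = RegularExpression[
   "(?<!\\S)[^[:alpha:]\\s]*[[:alpha:]][^[:alpha:]\\s]*(?!\\S)"];

StringCases[testStrings, wholeWord]
(* {{"A", "--123a"}, {"A"}, {"A", "123-a", "456-B", "7c-89--"}} *)
```

Since the look-arounds do not consume the whitespace, adjacent words do not compete for a shared delimiter, and words at the start or end of the string are covered as well.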
