Removing non-word characters in certain parts of a string

Question

I would like to ask how I can remove non-word characters from a string, but only in certain cases.

I have read this article, so I know how to get the words out of a string. My text is, however, a bit more complicated.

For example:

trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

From this text, I would like to get as output:

{"temp","sp.a","tiral","dump","NV-A","rambo","6833","16","rgcht"}

In other words, I want to split on spaces, commas, hyphens and dots, EXCEPT when a hyphen or a dot has a letter character both before and after it (this exception does not apply to commas or other signs!)

This has been my most successful attempt so far:

StringSplit[trialtext, Except[WordCharacter, WordCharacter .. ~~ "." ~~ WordCharacter]]

{"temp sp.a tiral dump NV-A rambo.6833 16,rgcht"}

although I do not understand why, when I ask for ".", it decides to also take "," and "-".

Therefore also the related question: can someone please explain to me why this

StringSplit[trialtext, Except[WordCharacter, ","]]

gives this output:

 {"temp sp.a tiral dump NV-A rambo.6833 16", "rgcht"}

while this:

StringSplit[trialtext, Except[WordCharacter, "."]]

produces this output:

{"temp", "sp", "a", "tiral", "dump", "NV", "A", "rambo", "6833", "16", "rgcht"}

Thanks a bunch!

It seems that "." inside Except is interpreted as a regular expression, where "." matches any character except a newline.
– Kuba, Commented Oct 31, 2014 at 11:59

WReach · Accepted Answer · 2014-11-01 18:02:58Z

Regular expressions are cryptic, but they offer look-ahead and look-behind capabilities that are unavailable to regular string patterns:

split[s_] := StringSplit[s, RegularExpression["( |,|(?<![[:alpha:]])[-.]|[-.](?![[:alpha:]]))+"]]

split[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)

This formulation respects the special rule that dots and dashes act as delimiters except when they have letters on both sides:

split["1.2.3.a.b.c ---4-5-6-x-y-z---"]
(* {"1", "2", "3", "a.b.c", "4", "5", "6", "x-y-z"} *)

The key ingredient in this solution is the use of (?<![[:alpha:]])[-.] which can be interpreted as "a dot or dash that is not preceded by an alphabetic character". Similarly, [-.](?![[:alpha:]]) means "a dot or dash that is not followed by an alphabetic character". Look-ahead and look-behind patterns are particularly useful for this problem because they allow us to examine characters for matching purposes without considering them to be part of a delimiter itself.
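To make that last point concrete, here is a minimal sketch that contrasts the look-ahead with a naive pattern that consumes the neighbouring character. The names lookAhead and consuming are illustrative only and do not appear in the answer above; the expected results in the comments follow from how PCRE look-arounds work.

```wolfram
(* Illustrative names (lookAhead, consuming); they are not from the answer above. *)

(* Delimiter: a dot or dash NOT followed by a letter. The look-ahead only
   inspects the next character; it does not make it part of the match. *)
lookAhead = RegularExpression["[-.](?![[:alpha:]])"];

(* Naive alternative for comparison: the negated class [^[:alpha:]]
   consumes the following character, so it becomes part of the delimiter. *)
consuming = RegularExpression["[-.][^[:alpha:]]"];

StringSplit["x-12", lookAhead]   (* {"x", "12"}: only "-" is removed *)
StringSplit["x-12", consuming]   (* {"x", "2"}: "-1" is removed, a digit is lost *)
StringSplit["x-y", lookAhead]    (* {"x-y"}: a dash followed by a letter is kept *)
StringCases["x-12", lookAhead]   (* {"-"}: the matched delimiter itself *)
```

The same comparison applies to the look-behind (?<![[:alpha:]]), which inspects the preceding character without consuming it.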

kglr · Accepted Answer · 2014-10-31 15:02:09Z

trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

StringTrim@StringSplit[trialtext,
  {"," | "-" | ".",
   x : PatternSequence[Except[WhitespaceCharacter] .. ~~ "." | "-" ~~ LetterCharacter ..] :> x}]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)

Charlotte Hadley · Accepted Answer · 2016-06-21 11:06:00Z

As of version 10.1 there is TextWords, which gets most of the way there with no pattern at all, although it keeps "rambo.6833" and "16,rgcht" together:

TextWords[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo.6833", "16,rgcht"} *)

Note that the implementation of the function is available to you with

??TextWords

It relies on a bunch of stuff from the NaturalLanguageProcessing package that rumour has it will be opened up more in Mathematica 11.

eldo · Accepted Answer · 2024-04-21 22:44:24Z

Using TextCases (new in 10.2)

Listings of the many content types

str = ",,temp Naples sp.a tiral - dump NV-A rambo.6833. 16,rgcht, Rome,Denmark";

TextCases[str, "Word"]

{"temp", "Naples", "sp.", "a", "tiral", "dump", "NV-A", "rambo", ".6833", "16", "rgcht", "Rome", "Denmark"}

TextCases[str, "City"]

{"Naples", "Rome"}

TextCases[str, {"City", "Country"}]

<|"City" -> {"Naples", "Rome"}, "Country" -> {"Denmark"}|>

TextCases[str, "Word"] == TextWords[str]

True
