6
$\begingroup$

I would like to ask how I can remove non-word characters from a string, but only in certain cases.

I have read this article, so I know how to get the words out of a string. My text is however a bit more complicated.

For example:

trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"; 

From this text, I would like to get as output:

{"temp","sp.a","tiral","dump","NV-A","rambo","6833","16","rgcht"} 

In other words, I want to split on spaces, commas, hyphens and dots, EXCEPT when a hyphen or a dot has a letter character both before and after it (the exception applies only to hyphens and dots, not to commas or other signs!)

This has been my most successful attempt so far:

StringSplit[trialtext, Except[WordCharacter, WordCharacter .. ~~ "." ~~ WordCharacter]]

{"temp sp.a tiral dump NV-A rambo.6833 16,rgcht"}

although I do not understand why, when I ask for ".", it also takes "," and "-".

Hence the related question: can someone please explain to me why this

StringSplit[trialtext, Except[WordCharacter, ","]] 

gives this output:

 {"temp sp.a tiral dump NV-A rambo.6833 16", "rgcht"} 

while this:

StringSplit[trialtext, Except[WordCharacter, "."]] 

produces this output:

{"temp", "sp", "a", "tiral", "dump", "NV", "A", "rambo", "6833", "16", "rgcht"} 

Thanks a bunch!

$\endgroup$
1
  • $\begingroup$ It seems that "." is interpreted as a regular expression inside Except, and in a regular expression "." matches every character except a newline. $\endgroup$ Commented Oct 31, 2014 at 11:59
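A minimal sketch, consistent with the comment's reading, of what the "." case appears to do: split on any character that is not a word character. It is written here with the one-argument form Except[WordCharacter]; the trailing .. (which merges runs of delimiters) is an addition that does not appear in the question. The resulting list matches the one reported in the question for the "." case:

```mathematica
trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

(* One-argument Except: any single character that is not a letter or digit. *)
(* The trailing .. merges a run of such characters into one delimiter. *)
StringSplit[trialtext, Except[WordCharacter] ..]
(* {"temp", "sp", "a", "tiral", "dump", "NV", "A", "rambo", "6833", "16", "rgcht"} *)
```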

4 Answers

4
$\begingroup$

Regular expressions are cryptic, but they offer look-ahead and look-behind capabilities that are unavailable to regular string patterns:

split[s_] := StringSplit[s, RegularExpression["( |,|(?<![[:alpha:]])[-.]|[-.](?![[:alpha:]]))+"]]

split[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)

This formulation respects the special rule that dots and dashes act as delimiters except when they have letters on both sides:

split["1.2.3.a.b.c ---4-5-6-x-y-z---"]
(* {"1", "2", "3", "a.b.c", "4", "5", "6", "x-y-z"} *)

The key ingredient in this solution is the use of (?<![[:alpha:]])[-.] which can be interpreted as "a dot or dash that is not preceded by an alphabetic character". Similarly, [-.](?![[:alpha:]]) means "a dot or dash that is not followed by an alphabetic character". Look-ahead and look-behind patterns are particularly useful for this problem because they allow us to examine characters for matching purposes without considering them to be part of a delimiter itself.
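To see the two look-around pieces in isolation, here is a reduced sketch that handles dots only. The name dotDelimiter and the test string are illustrative assumptions, not part of the answer above. A dot is treated as a delimiter when either the character before it or the character after it is not a letter:

```mathematica
(* Hypothetical helper: a dot counts as a delimiter unless letters sit on both sides. *)
dotDelimiter = RegularExpression["(?<![[:alpha:]])[.]|[.](?![[:alpha:]])"];

(* The dot in "a.b" has letters on both sides, so it is kept. *)
StringSplit["a.b.1.c", dotDelimiter]
(* {"a.b", "1", "c"} *)

(* Only the dots at positions 4 and 6 match; the look-arounds consume no characters. *)
StringPosition["a.b.1.c", dotDelimiter]
(* {{4, 4}, {6, 6}} *)
```

Because the look-behind and look-ahead are zero-width, each match is exactly one character long, which is why the neighbouring letters and digits survive in the split pieces.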

$\endgroup$
3
$\begingroup$
trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

StringTrim@StringSplit[trialtext, {"," | "-" | ".", x : PatternSequence[Except[WhitespaceCharacter] .. ~~ "." | "-" ~~ LetterCharacter ..] :> x}]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)
$\endgroup$
1
$\begingroup$

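As of version 10.1 there is TextWords, which gets close to this with very little effort. Note that in the output below it keeps "rambo.6833" and "16,rgcht" together, so it does not follow the requested rule exactly:

TextWords[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo.6833", "16,rgcht"} *)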

Note that the implementation of the function is available to you with

??TextWords 

It relies on a number of functions from the NaturalLanguageProcessing package which, rumour has it, will be opened up more in Mathematica 11.

$\endgroup$
0
$\begingroup$

Using TextCases (new in 10.2)

The TextCases documentation lists the many content types.

str = ",,temp Naples sp.a tiral - dump NV-A rambo.6833. 16,rgcht, Rome,Denmark";

TextCases[str, "Word"]

{"temp", "Naples", "sp.", "a", "tiral", "dump", "NV-A", "rambo", ".6833", "16", "rgcht", "Rome", "Denmark"}

TextCases[str, "City"] 

{"Naples", "Rome"}

TextCases[str, {"City", "Country"}] 

<|"City" -> {"Naples", "Rome"}, "Country" -> {"Denmark"}|>

TextCases[str, "Word"] == TextWords[str] 

True

$\endgroup$
