Split text into array while maintaining the punctuation in Swift

Question

I want to split a text into an array while keeping the punctuation as separate elements from the words, so a string like:

Hello, I am Albert Einstein.

should turn into an array like this:

["Hello", ",", "I", "am", "Albert", "Einstein", "."]

I have tried string.components(separatedBy: CharacterSet.init(charactersIn: " ,;;:")), but this method deletes all the punctuation and returns an array like this:

["Hello", "I", "am", "Albert", "Einstein"]

So, how can I get an array like my first example?

Have you tried splitting it up via regex and then combining the groups? i.e. If you ran something like ([A-Za-z\']*)([,\.])*, then the optional subgroups of [0, 1] would contain your parts (e.g. 'Hello', ',') and then you could run a flatMap on all of the non-nil groups to merge them into a single array of separated strings
– Cruceo, Commented Oct 3, 2016 at 15:32
I am confused: what is the output that you don't want? Can you also add that to your question?
– mfaani, Commented Oct 3, 2016 at 19:49

Duyen-Hoa · Accepted Answer · 2016-10-03 19:58:21Z

It's not a beautiful solution, but you can try:

    var str = "Hello, I am Albert Einstein."
    var list = [String]()
    var currentSubString = ""

    // enumerate to get all characters including ".", ",", ";", " "
    str.enumerateSubstrings(in: str.startIndex..<str.endIndex, options: String.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, value) in
        if let _subString = substring {
            if (!currentSubString.isEmpty &&
                (_subString.compare(" ") == .orderedSame ||
                 _subString.compare(",") == .orderedSame ||
                 _subString.compare(".") == .orderedSame ||
                 _subString.compare(";") == .orderedSame)) {
                // create word if see any of those character and currentSubString is not empty
                list.append(currentSubString)
                currentSubString = _subString.trimmingCharacters(in: CharacterSet.whitespaces)
            } else {
                // add to current sub string if current character is not space.
                if (_subString.compare(" ") != .orderedSame) {
                    currentSubString += _subString
                }
            }
        }
    }

    // last word
    if (!currentSubString.isEmpty) {
        list.append(currentSubString)
    }

In Swift 3:

    var str = "Hello, I am Albert Einstein."
    var list = [String]()
    var currentSubString = ""

    // enumerate to get all characters including ".", ",", ";", " "
    str.enumerateSubstrings(in: str.startIndex..<str.endIndex, options: String.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, value) in
        if let _subString = substring {
            if (!currentSubString.isEmpty &&
                (_subString.compare(" ") == .orderedSame ||
                 _subString.compare(",") == .orderedSame ||
                 _subString.compare(".") == .orderedSame ||
                 _subString.compare(";") == .orderedSame)) {
                // create word if see any of those character and currentSubString is not empty
                list.append(currentSubString)
                currentSubString = _subString.trimmingCharacters(in: CharacterSet.whitespaces)
            } else {
                // add to current sub string if current character is not space.
                if (_subString.compare(" ") != .orderedSame) {
                    currentSubString += _subString
                }
            }
        }
    }

    // last word
    if (!currentSubString.isEmpty) {
        list.append(currentSubString)
    }

The idea is to loop over all the characters and build the words at the same time. A word is a group of consecutive characters that are not " ", ",", "." or ";". So, while building a word in the loop, we finish the current word if we see one of those characters and the word under construction is not empty. To break down the steps with your input:

get H (not space nor other terminal character) -> currentSubString = "H"
get e (not space nor other terminal character) -> currentSubString = "He"
get l (not space nor other terminal character) -> currentSubString = "Hel"
get l (not space nor other terminal character) -> currentSubString = "Hell"
get o (not space nor other terminal character) -> currentSubString = "Hello"
get , (is a terminal character)
- -> as currentSubString is not empty, add it to list and restart the construction for the next word, so list = ["Hello"]
- -> currentSubString = "," (the only reason I used trimming is to drop the character when it is a space; any other terminal character has to be kept for the next word)
get " " (is the space character)
- -> as currentSubString is not empty, add it to list and restart the construction -> list = ["Hello", ","]
- -> currentSubString = "" (trimmed).

... and so on.
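The walk-through above can also be sketched as a small function. This is only an illustration, not the answer's code: the name tokenize is hypothetical, it assumes Swift 4 or later (where a String can be iterated character by character), and it emits a punctuation mark as soon as it is seen instead of carrying it over in currentSubString.

```swift
// Hypothetical helper; the delimiters mirror the ones used in the answer
// (space, comma, period, semicolon).
func tokenize(_ text: String) -> [String] {
    let punctuation: Set<Character> = [",", ".", ";"]
    var tokens: [String] = []
    var current = ""

    for character in text {
        if character == " " || punctuation.contains(character) {
            // A delimiter ends the word being built, if there is one.
            if !current.isEmpty {
                tokens.append(current)
                current = ""
            }
            // Punctuation becomes its own token; spaces are dropped.
            if punctuation.contains(character) {
                tokens.append(String(character))
            }
        } else {
            current.append(character)
        }
    }

    // Flush the last word.
    if !current.isEmpty {
        tokens.append(current)
    }
    return tokens
}

print(tokenize("Hello, I am Albert Einstein."))
// ["Hello", ",", "I", "am", "Albert", "Einstein", "."]
```

Keeping the delimiters in a Set makes it easy to add more marks (for example "!" or "?") without growing the if condition.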

I don't understand some lines of code, but it works! Thanks!
Look at my comment at the end. Maybe it makes things clearer for you :)

Cruceo · Accepted Answer · 2016-10-03 15:54:36Z

To explain from my comment... Think of regular expressions as a way to nicely find patterns within Strings. In your case, the pattern is words (groups of letters) with other possible symbols (punctuation marks) in between.

Take the regex in my comment (which I've expanded a bit here), for example: ([,\.\:\"])*([A-Za-z0-9\']*)([,\.\:\"])*

In there, we have 3 groups. The first searches for any leading symbols (such as an opening quotation mark). The second searches for letters, numbers, and apostrophes (because people like to contract words, like "I'm"), and the third searches for any trailing punctuation marks.

Edit to note: groups in the above are denoted by parentheses ( and ), while the [ and ] brackets denote acceptable characters for a search. So, for example, [A-Z] says that all upper case letters from A-Z are acceptable. [A-Za-z] lets you get both upper and lower, while [A-Za-z0-9] includes all letters and numbers from 0-9. Granted, there are shorthand versions to writing this, but those you'll discover down the road.
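To see the bracket classes in action, here is a minimal sketch. The helper name matchesEntirely is made up for this example; it relies on Foundation's range(of:options:) with the .regularExpression option and adds a + after each class so that a whole run of characters is matched.

```swift
import Foundation

// Hypothetical helper: true when the first regex match covers the whole string.
func matchesEntirely(_ text: String, pattern: String) -> Bool {
    guard let range = text.range(of: pattern, options: .regularExpression) else {
        return false
    }
    return range == text.startIndex..<text.endIndex
}

print(matchesEntirely("HELLO", pattern: "[A-Z]+"))          // true
print(matchesEntirely("Hello", pattern: "[A-Z]+"))          // false: only "H" is upper case
print(matchesEntirely("Hello", pattern: "[A-Za-z]+"))       // true
print(matchesEntirely("Hello42", pattern: "[A-Za-z]+"))     // false: digits are not in the class
print(matchesEntirely("Hello42", pattern: "[A-Za-z0-9]+"))  // true
```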

So now we have a way to separate all the words and punctuation marks. Next you need to actually use it, doing something along the lines of:

    import Foundation

    func find(value: NSString) throws -> [String] {
        // Notice you have to escape the values in code
        let regex = try NSRegularExpression(pattern: "([,\\.\\:\\\"])*([A-Za-z0-9\\']*)([,\\.\\:\\\"])*")
        // Search the whole string; the range is expressed in NSString (UTF-16) units
        let results = regex.matches(in: value as String, range: NSRange(location: 0, length: value.length))
        // Every group in the pattern is optional, so drop the empty matches
        return results.map({ value.substring(with: $0.range) }).filter({ !$0.isEmpty })
    }

That should give you each non-nil group found within the String value you supply to the method.

Granted, that last filter method may not be necessary, but I'm not familiar enough with how Swift handles regex to know for sure.

But that should definitely point you in the right direction...

Cheers~

Something doesn't work as expected: ["Hello,", "playground", "I", "am", "Alessio."]
Yeah, it looks like Swift is automatically grouping them together instead of giving you the ranges of the subgroups. Give me a second to try and find a resource that can help you deep-dive into the subgroups.
Sorry, the link I posted was JS. This is a Swift example of getting the separate capturing groups
@OttavioCocci I also don't know much about regex, but from time to time I use 2 sites. See here & here. They have great tutorials and regex validator tools you can use.
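As a follow-up to the comments about reading the capture groups separately, here is a minimal sketch of one way to do it with NSRegularExpression. It is not the code behind the lost link: the function name splitKeepingPunctuation and the simplified pattern (a word group or a punctuation group, with a few extra marks) are assumptions for illustration, and range(at:) is the Swift 4+ spelling (Swift 3 used rangeAt(_:)).

```swift
import Foundation

// Hypothetical helper; the pattern is a simplified variant of the one in the answer.
func splitKeepingPunctuation(_ text: String) -> [String] {
    // Group 1: a run of letters, digits, or apostrophes.
    // Group 2: a single punctuation mark.
    let pattern = "([A-Za-z0-9']+)|([,.:;\"!?])"
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return []
    }
    let nsText = text as NSString
    let fullRange = NSRange(location: 0, length: nsText.length)

    var tokens: [String] = []
    for match in regex.matches(in: text, range: fullRange) {
        // Index 0 is the whole match; the capture groups start at index 1.
        for groupIndex in 1..<match.numberOfRanges {
            let groupRange = match.range(at: groupIndex)
            // A group that did not take part in the match reports NSNotFound.
            if groupRange.location != NSNotFound {
                tokens.append(nsText.substring(with: groupRange))
            }
        }
    }
    return tokens
}

print(splitKeepingPunctuation("Hello, I am Albert Einstein."))
// ["Hello", ",", "I", "am", "Albert", "Einstein", "."]
```

Because words and punctuation sit in separate alternatives, each match carries exactly one non-empty group, so "Hello," no longer comes back as a single element.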
