0

I was Doing A question where i had to calculate the no of words in a string as a part of the problem.

eg: "hi i am a programmer" this should return 5. now i thought of two methods to do this:

  1. Using split
String[] words=messages[i].split("\\s"); int length=words.length; 
  1. Using while loop:
int getwordCount(String message) { int result = 1; //for(int i=0;i<message.length();i++){ // char ch=message.charAt(i); for (char ch : message.toCharArray()) { if (ch == ' ') ++result; } return result; } 

in some cases the 2nd method was proving more efficient and i was getting better time result what method is better to use and why since the running TC for .split() is O(n) which will be similar to the TC of 2nd method which would be O(n) . Even if i do not discard the use of .toCharArray() which is O(n) the method still gives better result.

The only explanation i can think of was of using the regex \\s. what exactly is going on?

3
  • As you've surmised, regular expression evaluation can be expensive, depending on the expression. In your case, I would suppose the initial 'compile' time to be a visible factor given such small input. By the way, your code counts foo bar as 4 words. Commented Jun 3, 2022 at 12:38
  • You should use the regexp form, unless and until this is measured to be a bottleneck in your overall program, which it rarely is. The regexp form is shorter, clearer, more general, and more obviously correct (once you fix the multiple-spaces bug). Commented Jun 3, 2022 at 12:41
  • @user16320675 - that's the "more general" in my second comment. Certainly in any case I can recall coding recently, I've wanted that generality. Commented Jun 3, 2022 at 12:45

3 Answers 3

1

When you split the array on whitespaces using regex, the regex engine will have to walk down the string once, and make the splits. This option also requires allocating a new String array and populating it with the individual words. The array step increases the running time as well as the storage requirements.

Your second version, while certainly more verbose, also only requires a single walk down the String. However, the second loop version totally avoids the allocation of a String array and the time/space required to populate it. Therefore, I would expect the second version to outperform the first.

That being said, we don't necessarily use regex because of its performance, but rather the simplicity of code it offers. I would probably always use the first string split version in a production code base, unless super high performance were absolutely required (e.g. in an Android app).

Sign up to request clarification or add additional context in comments.

1 Comment

As written, the regexp code is also more general, since it handles any whitespace, not just U+0020.
1

I'm by no means a code optimization expert but i would asume the second method to be faster, especially in large text block since the first needs to create N number of string in memory where the second works of the char array of the first. If you where to use a regex i think the result would be faster if you use a pattern matcher for the \\s pattern and count the matches in a while loop

Pattern pattern = Pattern.compile("\\s"); Matcher matcher = pattern.matcher(yourString); int count = 0; while (matcher.find()) { count++; } 

Comments

0

First of all your two methods are both wrong. You're counting the number of space-separated "things" (needn't be words) in the first method rather than counting the number of non-space sequences:

jshell> ("some words").split(" ") $2 ==> String[3] { "some", "", "words" } jshell> (" leading and trailing spacing ").split(" ") $3 ==> String[5] { "", "leading", "and", "trailing", "spacing" } jshell> ("").split(" ") $4 ==> String[1] { "" } jshell> (" ").split(" ") $5 ==> String[0] { } 

as you can see the word count does not match the space-delimited content count. Counting the spaces suffers from the same issues, except it will also fail on strings with "only" trailing spacing.

Both the RegEx and the for-loop run in linear time O(n), having to visit each character exactly once; using a RegEx requires first compiling the RegEx, but for such a simple RegEx we can reasonably neglect this. This does not mean that they take the same time to complete though, the constant factor may very well differ. As others have pointed out already the RegEx + splitting obviously incurs a significant overhead - especially as this requires auxiliary space O(n) whereas the simple for loop counting variant requires just constant auxiliary space O(1) to keep track of the count and the current index.

Fixing & improving your code using RegEx is quite easy:

int count = 0; Matcher matcher = Pattern.compile("\\S+").matcher(message); while (matcher.find()) count++; 

You'll likely want to use a character class like \w (alphanumerics) for words rather than looking for non-space characters \S.

Or if you want to implement this manually using the for-loop as it may be slightly faster:

int count = message.charAt(0) == ' ' ? 0 : 1; // does the text start with a non-space character? for (int i = 1; i < message.length; i++) { // beginning of word: transition space -> non-space if (message.charAt(i) != ' ' && message.charAt(i-1) == ' ') count++ } 

1 Comment

I agree that a matcher over \w+ would provide a better result on edge cases, although i doubt that considering the question the input wil have those, so i don't feel that the original answer should be considered wrong. at worst incomplete

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.