2

Each group has one of the forms x/y, x.y, or x_y.z. Groups are separated by underscores and can appear in any order.

Example:

ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno 

I would like to capture the following:

ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno

I have done this using a fairly verbose string iteration and parsing method (shown below), but am wondering if a simple regex can accomplish this.

private static ArrayList<String> go(String s) {
    ArrayList<String> list = new ArrayList<String>();
    boolean inSlash = false;
    int pos = 0;
    boolean inDot = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case '/':
                inSlash = true;
                break;
            case '_':
                if (inSlash) {
                    list.add(s.substring(pos, i));
                    inSlash = false;
                    pos = i + 1;
                } else if (inDot) {
                    list.add(s.substring(pos, i));
                    inDot = false;
                    pos = i + 1;
                }
                break;
            case '.':
                inDot = true;
                break;
            default:
                break;
        }
    }
    list.add(s.substring(pos));
    System.out.println(list);
    return list;
}
3
  • So the underscore can be a delimiter as well as part of a group? Commented Dec 8, 2010 at 12:49
  • The difficulty seems to be in the last type of group (with the underscore in it). Could you elaborate a little bit on the rules for when an underscore should be part of a group, and when it should be the separator character? Perhaps you could post your current code. Commented Dec 8, 2010 at 12:50
  • yes, that's the fun part :) Maybe some way to look ahead for a dot and then determine if it is a delim or group? Commented Dec 8, 2010 at 12:51

4 Answers

2

Have a try with:

((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+)) 

I don't know Java syntax, but in Perl:

#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;

my $str = q!ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno_a_b_c.z_a_b_c_d.z_a_b_c_d_e.z!;
my $re = qr!((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))!;
while ($str =~ /$re/g) {
    say $1;
}

will produce:

ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
a_b_c.z
a_b_c_d.z
a_b_c_d_e.z
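Since the question asks for Java, here is a minimal sketch of how the Perl pattern above might be used with `java.util.regex`. The class name `GroupExtractor` and the method name `extract` are made up for illustration; only the regex itself comes from the answer. The outer capturing parentheses are dropped because `Matcher.group()` already returns the whole match, and every backslash is doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern is taken from the answer above.
final class GroupExtractor {
    private static final Pattern GROUP = Pattern.compile(
          "(?:[^_./]+/[^_./]+)"                     // x/y
        + "|(?:[^_./]+\\.[^_./]+)"                  // x.y
        + "|(?:[^_./]+(?:_[^_./]+)+\\.[^_./]+)");   // x_y.z, a_b_c.z, ...

    // Collects every group in order of appearance; separator underscores
    // are never part of a match, so find() simply steps over them.
    static List<String> extract(String s) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [ABC/DEF, abc.def, PQR/STU, ghi_jkl.mno]
        System.out.println(extract("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno"));
    }
}
```

The order of the alternatives matters: the two simple forms are tried first, and the underscore form is reached only when neither of them matches at the current position.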

2 Comments

Great, this works! Is it possible to change the last part so that it can match the forms a_b_c.z, a_b_c_d.z, a_b_c_d_e.z etc?
Java regexes are in appearance very like Perl 5.0 regexes from around 1993, but support a couple newer features like possessive matching. They don’t support any of the modern constructs. The apparent similarity’s a case of false cognates—of faux amis so to speak. They don’t work on Unicode correctly without this fix, and their \b and \B are so terribly broken that strings like "élève" won’t in Java match the pattern /\b\w+\b/ ("\\b\\w+\\b") anywhere at all. At least, not without my fix.
0

There might be a problem with the underscore since it's not always a separator.

Maybe: ((?<=_)\w+_)?\w+[./]\w+

3 Comments

Please be exceedingly cautious using \w in Java regexes: it’s almost always wrong. ☹
I was just following the javadoc to java.util.regex.Pattern. :)
That is part of the problem, unfortunately.
0

This regex would probably do (tested with .NET regular expressions):

[a-zA-Z]+[./][a-zA-Z]+|[a-zA-Z]+_[a-zA-Z]+\.[a-zA-Z]+ 

(If you know your input is well formed, there is no need to match the separator explicitly.)

4 Comments

Please do not use [a-zA-Z] as a crippled synonym for \pL. :(
@tchrist: You are right (of course) and I am lazy (which is a virtue of a programmer, I seem to recall reading that somewhere...)
there is good-laziness and there is bad-laziness. Bad-laziness avoids work now by promising to work (a lot) more in the future. Good-laziness avoids work in the future by a bit of extra work now. :)
@tchrist: Of course I agree. Rest assured that I have learned a lot from your comments. Not only with regard to character classes but more importantly with regard to the quality of my work in general. i.e. be lazy where I can and diligent where I should. Thank you for your time.
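
To make the point in the comments above concrete, here is a small Java sketch comparing the ASCII-only class with `\p{L}` (the braced spelling of `\pL`). The class name and the sample string are made up for illustration, and the accented letters are written as `\u` escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: ASCII letters versus any Unicode letter.
final class LetterClassDemo {
    // Letters restricted to A-Z and a-z, as in the answer's pattern.
    static final Pattern ASCII = Pattern.compile("[a-zA-Z]+[./][a-zA-Z]+");
    // \p{L} matches any character in the Unicode "Letter" category.
    static final Pattern UNICODE = Pattern.compile("\\p{L}+[./]\\p{L}+");

    public static void main(String[] args) {
        String group = "\u00e9l\u00e8ve/\u00e9cole"; // "élève/école"
        System.out.println(ASCII.matcher(group).matches());   // prints false
        System.out.println(UNICODE.matcher(group).matches()); // prints true
    }
}
```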
0

This one uses a positive lookahead instead of alternation:

[A-Za-z]+(_(?=[A-Za-z]+\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+ 
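
As a rough illustration, the lookahead pattern above could be applied in Java like this. The class and method names are hypothetical; the regex is the one from this answer with backslashes doubled for the string literal. An underscore is accepted inside a match only when the lookahead sees letters, a dot, and more letters right after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class wrapping the lookahead pattern from the answer.
final class LookaheadExtractor {
    private static final Pattern GROUP = Pattern.compile(
        "[A-Za-z]+(_(?=[A-Za-z]+\\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+");

    // Returns all matches in order of appearance.
    static List<String> extract(String s) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [ABC/DEF, abc.def, PQR/STU, ghi_jkl.mno]
        System.out.println(extract("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno"));
    }
}
```

One limitation worth knowing: the pattern contains two consecutive `[A-Za-z]+` runs, so in the x/y and x.y forms the first part needs at least two letters. A group such as `a.b` is not matched, while `a_b.c` is.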

1 Comment

Please do not use [A-Z] or [a-z] in a regex when \pL is what you actually mean—which it usually really is.
