Pick up a string with a specific pattern in R using gsub

Question

"CATARACT; #&#22823;&#33151;&#39592;~2010"

I need to pick up the &#22823;&#33151;&#39592; part in R using gsub. It is actually a run of Unicode character references, each of which starts with &#, is followed by a five-digit number, and ends with ;.

I know how to remove these references with the following:

gsub("&#[0-9]+;","","CATARACT; #&#22823;&#33151;&#39592;~2010")

But how can I keep only these references using gsub?

Edit 01

My desired output is &#22823;&#33151;&#39592;.

Edit 02

Thanks for the answer, but what if the pattern is not always like that? I need to pick up the Unicode references no matter where they are:

"CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

What's your desired output?

– lukeA, Commented Mar 11, 2014 at 9:37

lukeA · Accepted Answer · 2014-03-11 09:45:32Z

E.g. using gregexpr and regmatches:

ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
m <- gregexpr("&#[0-9]+;", ex)
(r <- regmatches(ex, m))
# [[1]]
# [1] "&#22823;" "&#33151;" "&#39592;" "&#22824;" "&#33152;" "&#39593;"
paste(r[[1]], collapse="")
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
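
Since the extracted pieces are numeric character references, they can also be turned into the characters they stand for. A minimal sketch, assuming every reference is decimal (the names entities, codes, and decoded are illustrative and not part of the answer above):

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Pull out every "&#NNNNN;" reference, wherever it occurs in the string.
entities <- regmatches(ex, gregexpr("&#[0-9]+;", ex))[[1]]

# Strip the "&#" and ";" wrappers, leaving the numbers as integers.
codes <- as.integer(gsub("[^0-9]", "", entities))

# Assumption: the numbers are decimal Unicode code points.
# intToUtf8() joins them into a single string of decoded characters.
decoded <- intToUtf8(codes)
```

Passing multiple = TRUE to intToUtf8 would return one character per vector element instead of a single string.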

droopy · Accepted Answer · 2014-03-11 09:38:16Z

You can try the following, where x is the input string:

 gsub("(^\\D*)((&#[0-9]+;)+)(.*$)", "\\2", x)
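
On an input with several separate runs of references, this pattern returns only the first run: ^\\D* stops before the first reference and .*$ swallows everything after that run. If the output has to come from gsub alone, one possible variant (a sketch, not part of the answer above; the name kept is illustrative) matches either a reference or any single other character and writes back only the reference:

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Each match is either a reference (captured in group 1) or one other
# character (group 1 stays empty), so "\\1" keeps references and drops the rest.
kept <- gsub("(&#[0-9]+;)|.", "\\1", ex, perl = TRUE)
kept
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
```

Like the paste(..., collapse="") approach, this joins all runs into one string, so the boundaries between runs are lost.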
