0

"CATARACT; #&#22823;&#33151;&#39592;~2010"

I need to extract the 大腿骨 part of this string in R using gsub. In the raw data it is stored as numeric character references: each one starts with &#, followed by a five-digit number, and ends with ; (here &#22823;&#33151;&#39592;).

I know how to remove these references with the following:

gsub("&#[0-9]+;","","CATARACT; #&#22823;&#33151;&#39592;~2010")

But how can I keep only these references using gsub?

Edit 01

My desired output is &#22823;&#33151;&#39592; (rendered as 大腿骨).

Edit 02

Thanks for the answer, but what if the pattern is not always like that? I need to pick up the references no matter where they appear in the string:

"CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

1
  • What's your desired output? Commented Mar 11, 2014 at 9:37

2 Answers

1

E.g. using gregexpr and regmatches:

ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
m <- gregexpr("&#[0-9]+;", ex)
(r <- regmatches(ex, m))
# [[1]]
# [1] "&#22823;" "&#33151;" "&#39592;" "&#22824;" "&#33152;" "&#39593;"
paste(r[[1]], collapse="")
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
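If the goal is the actual characters rather than the &#NNNNN; strings, the matched references could be decoded in one more step. This is only a sketch: it assumes every reference is a decimal code point, and the names entities, codes, and decoded are made up for illustration.

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Pull out every numeric character reference, wherever it occurs
entities <- regmatches(ex, gregexpr("&#[0-9]+;", ex))[[1]]

# Drop the "&#" prefix and ";" suffix, leaving the decimal code points
codes <- as.integer(gsub("[^0-9]", "", entities))

# Turn the code points into one UTF-8 string
decoded <- intToUtf8(codes)
```

Here decoded holds the six characters as a single string; passing multiple = TRUE to intToUtf8 would give one element per character instead.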

0

You can try the following, where x is the input string. Note that it keeps only the first run of references:

 gsub("(^\\D*)((&#[0-9]+;)+)(.*$)", "\\2", x) 
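For the case in Edit 02, where the references can appear in several places, a gsub-only variant might look like the sketch below. It is not part of the original answer: the idea is to match either a whole reference (captured as group 1) or any single other character, and replace each match with group 1, which is empty when the second alternative matched. The variable names x and kept are assumptions, and the input is assumed to contain no newlines.

```r
x <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Alternative 1 captures a full reference; alternative 2 eats one other character.
# Replacing with \\1 keeps the references and drops everything else.
kept <- gsub("(&#[0-9]+;)|.", "\\1", x, perl = TRUE)
```

With this input, kept should contain the six references concatenated in their original order.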
