I extracted the text between "<review_text>" and "</review_text>" using the following commands:
fname="positive.txt"
sed -n '/<review_text>/,/<\/review_text>/p' "$fname" > review.txt
However, the output file still contains the tags along with the text, as follows:
<review_text> I'm not sure why Sony, which now owns I Dream of Jeannie, decided to colorize the first season of this series. Whatever the reason, you can readily tell by looking at the prices here on Amazon.com that the original black-and-white version of the first season is worth a lot more. The reason for that is simple--I Dream of Jeannie was originally broadcast in black-and-white. And for a television fan like myself, that's the ONLY way to watch the first season. The episodes themselves are just as I remember seeing them. Since I wasn't around in 1965, I'm pretty sure I've never seen these without the cuts that have been referenced here. But to me, they're still pretty good. The theme music, in my opinion, is every bit as good as the second theme, introduced when Jeannie went to color in 1966. The one thing that truly will drive the purists nuts is the fact that Sony stripped off the old Screen Gems animation from the end of every episode. That logo was attached to so many classic shows from the 1960s and 1970s, and it is consistenly rated, along with Viacom's old blue V of Doom, as the scariest logo in the history of television. The new Sony outro doesn't pack the same punch. Still, if you liked Jeannie way back when, you'll love it now, especially since you can watch it anytime you like, without commercial interruption </review_text>
<review_text> If you don't own this dvd you need to add it to your collection. In my opinion it is the best american animated film ever released </review_text>
I want to extract only the text between these tags from the output file and save it to separate text files. How can I perform this?
You can use awk and sed to "parse" XML only if it is very carefully formatted. In general, you want to use a proper XML parser; these exist for most general-purpose languages (like Python and Ruby). From the command line, xmlstarlet is an option.
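As a rough illustration of the parser route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It rests on several assumptions that are not in the question: the sed output has no single root element, so the sketch wraps it in a dummy <root>; the text contains no unescaped "&" or "<" (the parser would raise an error otherwise); and the output names review_1.txt, review_2.txt, ... are made up for the example. The function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def extract_reviews(raw: str) -> List[str]:
    """Return the stripped text of every <review_text> element in raw."""
    # Assumption: the fragment lacks a single root element, so wrap it in a
    # dummy <root> to make it parseable as one XML document.
    root = ET.fromstring("<root>" + raw + "</root>")
    return [(el.text or "").strip() for el in root.iter("review_text")]


def write_reviews(reviews: List[str], out_dir: str = ".") -> List[Path]:
    """Write each review to its own file and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(reviews, start=1):
        # Hypothetical naming scheme: review_1.txt, review_2.txt, ...
        path = out / ("review_%d.txt" % i)
        path.write_text(text + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    src = Path("review.txt")  # the file produced by the sed command
    if src.exists():
        write_reviews(extract_reviews(src.read_text(encoding="utf-8")))
```

Run in the directory that holds review.txt, this should leave one file per review with the tags removed. If the real data contains bare "&" characters, it is not well-formed XML and would need escaping first, or a more tolerant approach.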