
I extracted the text between <review_text> and </review_text> using the following command:

fname="positive.txt"
sed -n '/<review_text>/,/<\/review_text>/p' "$fname" > review.txt

but the output file contains the tags along with the text, as follows:

<review_text>
I'm not sure why Sony, which now owns I Dream of Jeannie, decided to colorize the first season of this series. Whatever the reason, you can readily tell by looking at the prices here on Amazon.com that the original black-and-white version of the first season is worth a lot more. The reason for that is simple--I Dream of Jeannie was originally broadcast in black-and-white. And for a television fan like myself, that's the ONLY way to watch the first season. The episodes themselves are just as I remember seeing them. Since I wasn't around in 1965, I'm pretty sure I've never seen these without the cuts that have been referenced here. But to me, they're still pretty good. The theme music, in my opinion, is every bit as good as the second theme, introduced when Jeannie went to color in 1966. The one thing that truly will drive the purists nuts is the fact that Sony stripped off the old Screen Gems animation from the end of every episode. That logo was attached to so many classic shows from the 1960s and 1970s, and it is consistenly rated, along with Viacom's old blue V of Doom, as the scariest logo in the history of television. The new Sony outro doesn't pack the same punch. Still, if you liked Jeannie way back when, you'll love it now, especially since you can watch it anytime you like, without commercial interruption
</review_text>
<review_text>
If you don't own this dvd you need to add it to your collection. In my opinion it is the best american animated film ever released
</review_text>

I want to extract only the text between these tags from the output file and save it to separate text files. How can I perform this?

  • Parsing XML using regex. Not again. Commented May 20, 2014 at 14:11
  • Sorry, I didn't get you. Commented May 20, 2014 at 14:12
  • You can only use tools like awk and sed to "parse" XML if it is very carefully formatted. In general, you want to use a proper XML parser; these exist for most general-purpose languages (like Python and Ruby). From the command line, xmlstarlet is an option. Commented May 20, 2014 at 14:17

1 Answer


You can for example use this awk:

awk '/<\/review_text>/ {f=0} f {print >> (t".txt")}; /<review_text>/ {f=1; t++}' file 

That creates these files:

$ cat 1.txt
I'm not sure why Sony, which now owns I Dream of Jeannie, decided to colorize the first season of this series. Whatever the reason, you can readily tell by looking at the prices here on Amazon.com that the original black-and-white version of the first season is worth a lot more. The reason for that is simple--I Dream of Jeannie was originally broadcast in black-and-white. And for a television fan like myself, that's the ONLY way to watch the first season. The episodes themselves are just as I remember seeing them. Since I wasn't around in 1965, I'm pretty sure I've never seen these without the cuts that have been referenced here. But to me, they're still pretty good. The theme music, in my opinion, is every bit as good as the second theme, introduced when Jeannie went to color in 1966. The one thing that truly will drive the purists nuts is the fact that Sony stripped off the old Screen Gems animation from the end of every episode. That logo was attached to so many classic shows from the 1960s and 1970s, and it is consistenly rated, along with Viacom's old blue V of Doom, as the scariest logo in the history of television. The new Sony outro doesn't pack the same punch. Still, if you liked Jeannie way back when, you'll love it now, especially since you can watch it anytime you like, without commercial interruption
$ cat 2.txt
If you don't own this dvd you need to add it to your collection. In my opinion it is the best american animated film ever released

Explanation

  • /<\/review_text>/ {f=0} if </review_text> is found, deactivate the flag f. Note that / has to be escaped inside the regex, so we write \/.
  • f {print >> (t".txt")} if the flag f is active, append the current line to a file XX.txt, where XX is a number that is incremented every time a new <review_text> is found.
  • /<review_text>/ {f=1; t++} if <review_text> is found, activate the flag f and increment t, which is used as the file name.
  • Addendum: (t".txt") within parentheses is used to make it work with BSD (OSX) awk also (thanks mklement0!).
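To see the three rules working together, here is a minimal sketch that runs the same awk command on a tiny made-up input. The file name sample.txt and its contents are hypothetical, and the sketch assumes each tag sits on its own line, which is what the command relies on:

```shell
# Start clean: the awk command appends with >>, so leftovers from a
# previous run would otherwise accumulate in the numbered files.
rm -f 1.txt 2.txt

# Hypothetical sample input; every tag is alone on its line.
cat > sample.txt <<'EOF'
<review_text>
first review
</review_text>
<review_text>
second review, line one
second review, line two
</review_text>
EOF

# Same command as in the answer, applied to the sample file.
awk '/<\/review_text>/ {f=0} f {print >> (t".txt")}; /<review_text>/ {f=1; t++}' sample.txt

# Show what ended up in each numbered file.
cat 1.txt
cat 2.txt
```

The order of the rules matters: the closing-tag check runs before the print and the opening-tag check runs after it, so neither tag line is written to the output. If a tag shares a line with review text, that whole line is treated as a tag line and its text is not written, so such input would need reformatting or a real XML parser first.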

2 Comments

+1, nicely done; to also make it work with BSD (OSX) awk, use (t".txt") (parentheses).
@mklement0 thanks! I wasn't aware of that, just updated my answer with your comment :)
