I am working with Twitter text data in JSON format, which I have stored in a text file. I am not interested in retweets, so I created a parser that extracts most of the text, but somehow some retweets came along as well. I am looking for a quick way to fix this, i.e. to remove the texts that start with RT.

A text in the file looks like this:

`"RT ...... RT ....."` 

Here "..." stands for the other words in the sentence. I would like to remove only the lines that start with the word "RT" and save the result in another file. "RT" may also appear in the middle of a text that does not start with "RT"; such texts should not be removed. I tried the following command, but I am not sure it is correct:

grep -v "RT" twitterDataset.txt > clean_RT.txt 

I would really appreciate a solution to this problem; an explanation of the code would also be helpful.

3
  • 1
    Welcome to the site. If possible, please add possibly anonymized, but "full" input examples for your question. It will make it easier for contributors to help you find the problem. Commented Feb 3, 2020 at 8:54
  • 1
    That said, did you try anchoring your regular expression to the beginning of the line, as in grep -v "^RT"? Commented Feb 3, 2020 at 8:56
  • 6
    You mentioned JSON, but I see no JSON document in your question. There exist tools for working with JSON data in the terminal or in scripts. These tools make it possible to parse, extract, or modify JSON data in a safe and robust way (note that your grep would also remove any key whose name contained RT). Please include a representative sample of your data. Commented Feb 3, 2020 at 9:02

1 Answer

If the file in question is plain text you can do something like:

grep -v "^RT" twitterDataset.txt > clean_RT.txt 

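Here -v inverts the match and ^ anchors the pattern to the start of the line, so grep prints every line except those that begin with the string "RT". Lines that contain "RT" only later in the text are kept.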
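To see the difference between the unanchored pattern from the question and the anchored one, here is a minimal sketch. The sample lines and the file names sample.txt, unanchored.txt and anchored.txt are made up for illustration, since the real twitterDataset.txt is not shown, and the sketch assumes one tweet text per line.

```shell
# Hypothetical sample data, one tweet text per line (an assumption;
# the real twitterDataset.txt is not shown in the question).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
RT @someone: a retweeted message
An original tweet that mentions RT in the middle
RT another retweet
Plain original tweet
EOF

# Unanchored: drops every line that contains "RT" anywhere.
grep -v "RT" "$tmpdir/sample.txt" > "$tmpdir/unanchored.txt"

# Anchored: drops only the lines that begin with "RT".
grep -v "^RT" "$tmpdir/sample.txt" > "$tmpdir/anchored.txt"

# Show what the anchored version keeps.
cat "$tmpdir/anchored.txt"
```

With this sample, the unanchored command keeps only the last line, while the anchored one also keeps the original tweet that mentions RT in the middle. Note that "^RT" matches any line whose first two characters are RT, including a longer word that happens to start with those letters; if every retweet in your data has a space after RT (an assumption worth checking), a pattern such as "^RT " would be narrower.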
