I want to replace rare words with _RARE_ in a JSON tree using Java.

My rareWords list contains

late populate convicts 

So for JSON below

["S", ["PP", ["ADP", "In"], ["NP", ["DET", "the"], ["NP", ["ADJ", "late"], ["NOUN", "1700<s"]]]], ["S", ["NP", ["ADJ", "British"], ["NOUN", "convicts"]], ["S", ["VP", ["VERB", "were"], ["VP", ["VERB", "used"], ["S+VP", ["PRT", "to"], ["VP", ["VERB", "populate"], ["WHNP", ["DET", "which"], ["NOUN", "colony"]]]]]], [".", "?"]]]] 

I should get

["S", ["PP", ["ADP", "In"], ["NP", ["DET", "the"], ["NP", ["ADJ", "_RARE_"], ["NOUN", "1700<s"]]]], ["S", ["NP", ["ADJ", "British"], ["NOUN", "_RARE_"]], ["S", ["VP", ["VERB", "were"], ["VP", ["VERB", "used"], ["S+VP", ["PRT", "to"], ["VP", ["VERB", "_RARE_"], ["WHNP", ["DET", "which"], ["NOUN", "colony"]]]]]], [".", "?"]]]]

Notice how

["ADJ","late"] 

was replaced by

["ADJ","_RARE_"] 

My code so far is below.

I recursively iterate over the tree and, as soon as a rare word is found, I create a new JSON array and try to replace the existing tree's node with it. See the line marked // this Doesn't work in the code below; that is where I got stuck. The tree remains unchanged outside of this function.

    public static void traverseTreeAndReplaceWithRare(JsonArray tree){

        //System.out.println(tree.getAsJsonArray());

        for (int x = 0; x < tree.getAsJsonArray().size(); x++)
        {
            if(!tree.get(x).isJsonArray())
            {
                if(tree.size()==2)
                {
                    //beware it will get here twice for same word
                    String word= tree.get(1).toString();
                    word=word.replaceAll("\"", ""); // removing double quotes

                    if(rareWords.contains(word))
                    {
                        JsonParser parser = new JsonParser();
                        //This works perfectly
                        System.out.println("Orig:"+tree);
                        JsonElement jsonElement = parser.parse("["+tree.get(0)+","+"_RARE_"+"]");
                        JsonArray newRareArray = jsonElement.getAsJsonArray();
                        //This works perfectly
                        System.out.println("New:"+newRareArray);
                        tree=newRareArray; // this Doesn't work
                    }
                }
                continue;
            }
            traverseTreeAndReplaceWithRare(tree.get(x).getAsJsonArray());
        }
    }
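A likely reason the tree stays unchanged: Java passes object references by value, so assigning to the parameter `tree` only rebinds the local variable inside the method. The caller (and the parent node) still point at the old array. The sketch below illustrates the difference between rebinding and mutating in place. It is an illustration under assumptions, not the Gson code itself: plain `java.util.List` objects stand in for `JsonArray`, and the class and method names are made up for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; plain lists stand in for Gson's JsonArray.
class ReassignVsMutate {

    // Rebinds the local parameter only; the caller's list is untouched.
    static void reassign(List<Object> node) {
        node = new ArrayList<>(Arrays.asList(node.get(0), "_RARE_"));
    }

    // Changes the very object the caller also refers to.
    static void mutate(List<Object> node) {
        node.set(1, "_RARE_");
    }

    public static void main(String[] args) {
        List<Object> a = new ArrayList<>(Arrays.asList("ADJ", "late"));
        reassign(a);
        System.out.println(a); // prints [ADJ, late]

        List<Object> b = new ArrayList<>(Arrays.asList("ADJ", "late"));
        mutate(b);
        System.out.println(b); // prints [ADJ, _RARE_]
    }
}
```

With Gson the same idea would mean changing the existing array rather than assigning a new one to `tree`. Recent Gson versions appear to offer `JsonArray.set(int, JsonElement)` for this, but check that against the version you use.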

Code for calling the above (I use Google's Gson):

    JsonParser parser = new JsonParser();
    JsonElement jsonElement = parser.parse(strJSON);
    JsonArray tree = jsonElement.getAsJsonArray();
  • Why don't you just do strJSON.replaceAll("(late|populate|convicts)", "_RARE_")? Commented Apr 18, 2013 at 22:13
  • +1 Sure, I am going to try that, and it might work for most cases. But my main motivation for asking this question was to learn how to manipulate such a tree. Commented Apr 18, 2013 at 22:16
  • Sorry, replaceAll() doesn't work for me because my rareWords list is 3435 words long, and it also ends up replacing "SQ" with "RARE" in instances like ["SQ", "late"]. Commented Apr 18, 2013 at 23:38
  • The above happens because there is an "S." in my rare-word list. I just found it by going through all 3435 rare words. Commented Apr 18, 2013 at 23:51
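The "SQ" surprise in the comments above comes from `String.replaceAll` treating its first argument as a regular expression, where `.` matches any character. Here is a small sketch of that pitfall; the class and method names are hypothetical. Quoting the word with `Pattern.quote` stops the wildcard match, but a plain text replacement still cannot tell a label (first element) from a word (second element).

```java
import java.util.regex.Pattern;

// Hypothetical demo of why text-level replaceAll is fragile here.
class ReplaceAllPitfall {

    // Treats the rare word as a regex, so "S." also matches "SQ".
    static String naive(String json, String word) {
        return json.replaceAll(word, "_RARE_");
    }

    // Treats the rare word literally, but still ignores tree structure.
    static String quoted(String json, String word) {
        return json.replaceAll(Pattern.quote(word), "_RARE_");
    }

    public static void main(String[] args) {
        String json = "[\"SQ\", \"late\"]";
        System.out.println(naive(json, "S."));   // prints ["_RARE_", "late"]
        System.out.println(quoted(json, "S."));  // prints ["SQ", "late"]

        // Literal matching still replaces the label in first position.
        System.out.println(quoted("[\"S.\", \"S.\"]", "S.")); // prints ["_RARE_", "_RARE_"]
    }
}
```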

1 Answer

Here's a straightforward approach in C++:

    #include <fstream>
    #include "JSON.hpp"
    #include <boost/algorithm/string/regex.hpp>
    #include <boost/range/adaptors.hpp>
    #include <boost/phoenix.hpp>

    static std::vector<std::wstring> readRareWordList()
    {
        std::vector<std::wstring> result;

        std::wifstream ifs("testcases/rarewords.txt");
        std::wstring line;
        while (std::getline(ifs, line))
            result.push_back(std::move(line));

        return result;
    }

    struct RareWords : boost::static_visitor<>
    {
        /////////////////////////////////////
        // do nothing by default
        template <typename T> void operator()(T&&) const { /* leave all other things unchanged */ }

        /////////////////////////////////////
        // recurse arrays and objects
        void operator()(JSON::Object& obj) const
        {
            for(auto& v : obj.values)
            {
                //RareWords::operator()(v.first); /* to replace in field names (?!) */
                boost::apply_visitor(*this, v.second);
            }
        }

        void operator()(JSON::Array& arr) const
        {
            int i = 0;
            for(auto& v : arr.values)
            {
                if (i++) // skip the first element in all arrays
                    boost::apply_visitor(*this, v);
            }
        }

        /////////////////////////////////////
        // do replacements on strings
        void operator()(JSON::String& s) const
        {
            using namespace boost;

            const static std::vector<std::wstring> rareWords = readRareWordList();
            const static std::wstring replacement = L"__RARE__";

            for (auto&& word : rareWords)
                if (word == s.value)
                    s.value = replacement;
        }
    };

    int main()
    {
        auto document = JSON::readFrom(std::ifstream("testcases/test3.json"));

        boost::apply_visitor(RareWords(), document);

        std::cout << document;
    }

This assumes you wanted to do replacements on all string values except the first element of each array, and it only matches whole strings. You could easily make this case insensitive, match words inside strings etc. by changing the comparison, for example to a regex with the appropriate flags. Slightly adapted in response to the comments.
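Since the question asks for Java, here is a rough Java rendering of the same idea: recurse into nested arrays, skip the first element of each array, and replace a string only when the whole string equals a rare word. This is a sketch under assumptions rather than a port of the C++ code above. Nested `java.util.List` objects stand in for the parsed JSON, the names `RareWordReplacer`, `replaceRare` and `node` are made up, and the rare words are hard-coded instead of read from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: nested lists stand in for the parsed JSON tree.
class RareWordReplacer {
    static final String REPLACEMENT = "_RARE_";

    // Walks the tree in place; index 0 (the label) is never touched.
    static void replaceRare(List<Object> node, Set<String> rareWords) {
        for (int i = 1; i < node.size(); i++) {
            Object child = node.get(i);
            if (child instanceof List) {
                @SuppressWarnings("unchecked")
                List<Object> subtree = (List<Object>) child;
                replaceRare(subtree, rareWords);
            } else if (child instanceof String && rareWords.contains(child)) {
                node.set(i, REPLACEMENT); // exact, whole-string match only
            }
        }
    }

    // Small helper to build mutable nested lists for the demo.
    static List<Object> node(Object... items) {
        return new ArrayList<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        Set<String> rare = new HashSet<>(Arrays.asList("late", "convicts", "xyz."));
        List<Object> tree = node("S",
                node("NP", node("ADJ", "late"), node("NOUN", "convicts")),
                node("xyz.", "xyz."));
        replaceRare(tree, rare);
        System.out.println(tree);
        // prints [S, [NP, [ADJ, _RARE_], [NOUN, _RARE_]], [xyz., _RARE_]]
    }
}
```

Keeping the rare words in a `HashSet` makes each lookup constant time on average, which may matter with a list of a few thousand words. The C++ version scans the whole vector for every string.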

The full code including JSON.hpp/cpp is here: https://github.com/sehe/spirit-v2-json/tree/16093940

8 Comments

+1 for the code, thanks! Since I don't know much C++, would it be possible for you to modify your code so that I can pass rareWords through a file? I shortened my question for readability; my real rare-word list contains 3435 words, and some of them contain . or *, for example S. U.S.A. A*, which was messing up the String.replaceAll regex matching. I will accept this answer after trying the updated code.
Yeah, reading the words from a file is pretty trivial. However, I'd really want exact examples of what to match (do you always want exact matches of whole strings?).
Yes, exact matches of the whole string. For example: if "xyz." is in the rare-word list, then only "xyz." should be replaced with "RARE", not "xyz". And if some array is like ["xyz.","xyz."], it should become ["xyz.", "RARE"]. Notice that only the second string in the branch array gets replaced; we never touch the first one. Another drawback of the replaceAll method was that it could potentially replace the first string. I am going to modify the question to draw the tree for more clarity. I can share the whole input and rareWord file if you need.
@Watt updated with readRareWordList(), showing how to do exact matches only. EDIT: also skipping the first element in every array now (see comment), and removed the regex matching that I picked up from your code but that wasn't what you wanted after all.
Thank you! Going to try this now.. will be back in 10-15 min.