4

I have an array of strings that I load throughout my application, and it contains different words. I have a simple if statement to see whether a string contains letters or numbers, but not whether it is a word.

I mean I only want words like AB2CD5X, and I want to remove all other words such as "Hello 3", "3 word", and anything else that is a word in English. Is it possible to keep only alphanumeric words, excluding those that contain real English words?

I know how to check whether a string contains alphanumeric characters:

Pattern p = Pattern.compile("[\\p{Alnum},.']*"); 

I also know:

if (string.matches(".*[a-zA-Z]+.*") || string.matches(".*[0-9]+.*"))
9
  • 2
    Short answer: use Regexes Commented May 28, 2014 at 11:58
  • stackoverflow.com/questions/6343047/… Commented May 28, 2014 at 12:00
  • 3
    how will you identify the difference between a series of alphabets and a word? Commented May 28, 2014 at 12:01
  • this is my question hirak? Commented May 28, 2014 at 12:02
  • To recognise real grammar words across the complete English language you need a vast implementation. Just check the user input for alphanumeric tokens, add those to a key-value style structure, and eliminate everything else. For the alphanumeric check, use a regex. Commented May 28, 2014 at 12:03

5 Answers

5
+50

What you need is a dictionary of English words. Then you basically scan your input and check if each token exists in your dictionary. You can find text files of dictionary entries online, such as in Jazzy spellchecker. You might also check Dictionary text file.

Here is some sample code that assumes your dictionary is a simple text file in UTF-8 encoding with exactly one (lower case) word per line:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class DictionaryFilter {

    public static void main(String[] args) throws IOException {
        final Set<String> dictionary = loadDictionary();
        final String text = loadInput();
        final List<String> output = new ArrayList<>();
        // by default splits on whitespace
        final Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            final String token = scanner.next().toLowerCase();
            // keep only tokens that are not dictionary words
            if (!dictionary.contains(token))
                output.add(token);
        }
        System.out.println(output);
    }

    private static String loadInput() {
        return "This is a 5gse5qs sample f5qzd fbswx test";
    }

    private static Set<String> loadDictionary() throws IOException {
        final File dicFile = new File("path_to_your_flat_dic_file");
        final Set<String> dictionaryWords = new HashSet<>();
        String line;
        final LineNumberReader reader = new LineNumberReader(new BufferedReader(
                new InputStreamReader(new FileInputStream(dicFile), "UTF-8")));
        try {
            while ((line = reader.readLine()) != null)
                dictionaryWords.add(line);
            return dictionaryWords;
        } finally {
            reader.close();
        }
    }
}

If you need more accurate results, you need to extract the stems of your words. See Apache's Lucene and EnglishStemmer.


1

You can use Cambridge Dictionaries to check whether a token is a real word. If a token turns out to be a valid dictionary word, you can skip it.

As the documentation says, to use the library, you need to initialize a request handler and an API object:

DefaultHttpClient httpClient = new DefaultHttpClient(new ThreadSafeClientConnManager());
SkPublishAPI api = new SkPublishAPI(baseUrl + "/api/v1", accessKey, httpClient);
api.setRequestHandler(new SkPublishAPI.RequestHandler() {
    public void prepareGetRequest(HttpGet request) {
        System.out.println(request.getURI());
        request.setHeader("Accept", "application/json");
    }
});

To use the "api" object:

try {
    System.out.println("*** Dictionaries");
    JSONArray dictionaries = new JSONArray(api.getDictionaries());
    System.out.println(dictionaries);

    JSONObject dict = dictionaries.getJSONObject(0);
    System.out.println(dict);
    String dictCode = dict.getString("dictionaryCode");

    System.out.println("*** Search");
    System.out.println("*** Result list");
    JSONObject results = new JSONObject(api.search(dictCode, "ca", 1, 1));
    System.out.println(results);

    System.out.println("*** Spell checking");
    JSONObject spellResults = new JSONObject(api.didYouMean(dictCode, "dorg", 3));
    System.out.println(spellResults);

    System.out.println("*** Best matching");
    JSONObject bestMatch = new JSONObject(api.searchFirst(dictCode, "ca", "html"));
    System.out.println(bestMatch);

    System.out.println("*** Nearby Entries");
    JSONObject nearbyEntries = new JSONObject(
            api.getNearbyEntries(dictCode, bestMatch.getString("entryId"), 3));
    System.out.println(nearbyEntries);
} catch (Exception e) {
    e.printStackTrace();
}


0

ANTLR might help you. ANTLR stands for ANother Tool for Language Recognition.

Hibernate uses ANTLR to parse its query language, HQL (keywords like SELECT and FROM).


0

if (string.matches(".*[a-zA-Z]+.*") || string.matches(".*[0-9]+.*"))

I think this is a good starting point, but since you're looking for strings that contain both letters and numbers, you might want:

if (string.matches(".*[a-zA-Z]+.*") && string.matches(".*[0-9]+.*"))

You might also want to check for spaces, because a space could indicate that there are separate words or a sequence like "3 word". So in the end you could use:

if (string.matches(".*[a-zA-Z]+.*") && string.matches(".*[0-9]+.*") && !string.contains(" "))
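
As a rough sketch of the combined check, the three conditions can be folded into a single precompiled pattern. The class name MixedTokenCheck and the method isMixed are hypothetical names used only for illustration, and the sketch assumes a slightly stricter rule than the condition above: the string may contain only ASCII letters and digits, which also rules out spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class MixedTokenCheck {

    // Assumption: a wanted token consists only of ASCII letters and digits
    // and contains at least one of each. The two lookaheads require a letter
    // and a digit somewhere; the character class rejects spaces and punctuation.
    private static final Pattern MIXED =
            Pattern.compile("(?=.*[a-zA-Z])(?=.*[0-9])[a-zA-Z0-9]+");

    public static boolean isMixed(String s) {
        // matches() succeeds only if the whole string fits the pattern
        return MIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"AB2CD5X", "Hello 3", "3 word", "Hello", "123"};
        for (String s : samples) {
            System.out.println(s + " -> " + isMixed(s));
        }
    }
}
```

Only AB2CD5X passes here. Note that this check alone cannot tell a letters-only code from an English word; that still needs a dictionary.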

Hope this helps


0

You may try this:

First tokenize the string using StringTokenizer with the default delimiters. For each token, if it contains only digits or only letters, discard it; the remaining tokens are the words that contain a combination of both digits and letters. You can use regular expressions to identify the digits-only and letters-only tokens.
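
A minimal sketch of these steps might look like the following. The class name MixedTokenFilter and the method keepMixedTokens are hypothetical, and "letters" is assumed to mean ASCII letters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of any library.
class MixedTokenFilter {

    public static List<String> keepMixedTokens(String text) {
        List<String> kept = new ArrayList<>();
        // The default delimiters are space, tab, newline, carriage return and form feed
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            boolean onlyDigits = token.matches("[0-9]+");
            boolean onlyLetters = token.matches("[a-zA-Z]+"); // assumption: ASCII letters
            // Discard digits-only and letters-only tokens, keep everything else
            if (!onlyDigits && !onlyLetters) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [5gse5qs, f5qzd]
        System.out.println(keepMixedTokens("This is a 5gse5qs sample f5qzd fbswx test 42"));
    }
}
```

Two caveats follow from the rule as stated: a letters-only code such as fbswx is discarded along with real words, and a token with attached punctuation such as "word," is kept, so punctuation may need stripping first.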
