How do learn the most important nodes in a tree?

Question

I have a list of 20000 words and how often they appeared in a set of 500 newspaper articles. I am trying to build a stemmer which chops off suffuxes from each words, so walked, walking, walks are the same word.

In English, the Porter Stemmer is a rule-based system where you repeatedly chop off suffixes:

 CONNECTIONS CONNECTION CONNECT

I am concerned if I do this for my collection of Spanish words and articles, I may not have a complete list of rules or it may be prone to other forms of error. So I had proposed to learn the suffixes.

Right now, I just count the appearances of each suffixed up to 4 letters. Here is the result the most common last letter in my vocabulary list:

 u'a': 58189 u'd': 3183 u'e': 62971 u'i': 1725 u'l': 26374 u'n': 37823 u'o': 46786 u'r': 16833 u's': 57396 u'u': 2639 u'y': 2212 u'z': 1968 u'\xe1': 1813 u'\xf3': 6722

The last letters a and o are obvious things to stem since they indicate masculine and feminine. However o could also be the 1st person singular of a verb. a could be the 3rd person singular.

e and s are also obvious choices to stem. Let's look at the last 4 letters:

 u'ados': 1826, u'ales': 1633, u'ando': 1291, u'ante': 1062, u'aron': 1027, u'ci\xf3n': 5355, u'ente': 3084, u'ento': 1690, u'erto': 1061, u'idad': 1749, u'ncia': 1362, u'ntes': 1511, u'ones': 2845, u'ores': 1050, u'si\xf3n': 1127

These are very common Spanish suffixes, appearing more than 1000 times in my corpus. Should I stem them?

How do I choose a data set which handles the suffixes of different sizes and decides which ones are the most "significant" ?

Just collecting frequently occurring suffixes will probably get you nowhere. The same suffixes can occur for many different reasons, not necessarily morphological. And your corpus is indeed very small. — babou
– babou, Commented Jun 15, 2014 at 22:39

Thyamarkos · Accepted Answer · 2014-06-14 21:52:40Z

The first thing you need to consider is that stemming rules are language-specific - a stemmer for the English language will not work for Spanish.

That being said, there is already an implementation which you can use, it's called Snowball and it already has a Spanish stemmer you can use.

The only thing you need to figure out is how you want to install and use it - how you feed it the raw data and what you want to do with the output (store it somewhere, run post-processing on it etc.). There's no point in trying to reinvent the wheel. I've used Snowball and successfully written a Romanian stemmer back in the day (about seven years ago) and I must warn you it's not easy to do it from scratch even when you have all the tools (I had Snowball and the stemmer Dana Cojocaru wrote back then, but I wanted to do it on my own).

Best of luck in your endeavors!

guest · Accepted Answer · 2014-05-15 19:56:36Z

Do you WANT to use machine learning?

Because the simplest solution that comes to mind is to use a word net or dictionary and build an aggregation list based on common definitions. A lot of people worked very hard to catalogue and taxonomize the whole language, so why not use that?

If you decide to go for machine learning you might still need the entire dictionary for training, because your sample set is so small. So the only thing you could gain from ML would be the ability to parse hypothetical constructs that are invalid anyways, and probably wouldn't be used by a serious publication anyways.

so you think my data set is too small to extract a meaningful language pattern? Maybe I could use a thesaurus to tell me which words should be similar? — john mangual
– john mangual, Commented May 15, 2014 at 20:37

Stack Exchange Network

How do learn the most important nodes in a tree?

2 Answers 2

Hot Network Questions

How do learn the most important nodes in a tree?

2 Answers 2

Related

Hot Network Questions