I have a list of 20,000 words and how often each appeared in a set of 500 newspaper articles. I am trying to build a stemmer that chops suffixes off each word, so that walked, walking, and walks are treated as the same word.
In English, the Porter stemmer is a rule-based system that repeatedly chops off suffixes:

CONNECTIONS → CONNECTION → CONNECT

I am concerned that if I do this for my collection of Spanish words and articles, I may not have a complete list of rules, or the rules may be prone to other kinds of error. So I had proposed to learn the suffixes instead.
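To make the rule-based idea concrete, here is a toy sketch of suffix-stripping rules (these two rules are my own illustration, not the real Porter rules, which have many more steps and conditions):

```python
# Toy illustration of rule-based suffix stripping (not the actual Porter rules).
# Each rule is (suffix, replacement), tried in order, longest suffix first.
RULES = [("ions", "ion"), ("ion", "")]

def toy_stem(word):
    """Apply the first matching rule; one call does one chop."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(toy_stem("connections"))  # → "connection"
print(toy_stem("connection"))   # → "connect"
```

Applying it repeatedly gives the CONNECTIONS → CONNECTION → CONNECT chain; the real Porter stemmer organizes such rules into ordered phases with conditions on the remaining stem.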
Right now, I just count the appearances of each suffix up to 4 letters long. Here are the counts for the most common last letters in my vocabulary list:
a: 58189, d: 3183, e: 62971, i: 1725, l: 26374, n: 37823, o: 46786, r: 16833, s: 57396, u: 2639, y: 2212, z: 1968, á: 1813, ó: 6722

The last letters a and o are obvious candidates to stem, since they mark feminine and masculine endings. However, o could also be the 1st person singular of a verb, and a the 3rd person singular.
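The counting I do can be sketched like this (assuming my vocabulary is a dict mapping each word to its corpus frequency; I count word types, not tokens):

```python
from collections import Counter

def suffix_counts(word_counts, k):
    """Count how many distinct vocabulary words end in each k-letter suffix.

    word_counts: dict mapping word -> corpus frequency (assumed input format).
    Only proper suffixes are counted, i.e. the word must be longer than k.
    """
    counts = Counter()
    for word in word_counts:
        if len(word) > k:
            counts[word[-k:]] += 1
    return counts

# Toy usage on a tiny vocabulary:
vocab = {"caminando": 3, "caminar": 5, "conexión": 2, "conexiones": 1}
print(suffix_counts(vocab, 1))
print(suffix_counts(vocab, 4))
```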
e and s are also obvious choices to stem. Let's look at the last 4 letters:
ados: 1826, ales: 1633, ando: 1291, ante: 1062, aron: 1027, ción: 5355, ente: 3084, ento: 1690, erto: 1061, idad: 1749, ncia: 1362, ntes: 1511, ones: 2845, ores: 1050, sión: 1127

These are very common Spanish suffixes, each appearing more than 1000 times in my corpus. Should I stem them?
How do I choose a method that handles suffixes of different lengths and decides which ones are the most "significant"?
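One heuristic I have been considering (just a sketch; the min_count and ratio thresholds below are placeholders I made up, not tuned values): prefer a k-letter suffix over the (k−1)-letter suffix it extends when it accounts for most of that shorter suffix's occurrences.

```python
from collections import Counter

def significant_suffixes(word_counts, max_k=4, min_count=1000, ratio=0.8):
    """Heuristic sketch: keep a k-letter suffix when it is frequent and
    accounts for at least `ratio` of the occurrences of the (k-1)-letter
    suffix it extends. Thresholds are illustrative assumptions.
    """
    # Count suffixes of every length 1..max_k (types, not tokens).
    by_len = {k: Counter() for k in range(1, max_k + 1)}
    for word in word_counts:
        for k in range(1, max_k + 1):
            if len(word) > k:
                by_len[k][word[-k:]] += 1

    chosen = set()
    for k in range(max_k, 1, -1):
        for suf, c in by_len[k].items():
            tail = suf[1:]  # the (k-1)-letter suffix that suf extends
            # tail was counted for every word that contributed to suf,
            # so by_len[k-1][tail] >= c and the division is safe.
            if c >= min_count and c / by_len[k - 1][tail] >= ratio:
                chosen.add(suf)
    return chosen
```

With the real corpus this would, for example, keep "ción" only if it covers most occurrences of "ión"; on my counts above (ción: 5355 vs ó: 6722 as a last letter) that seems plausible, but I have not verified it at every length.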