Python - Compute the frequency of words after removing stop words and stemming

To compute the frequency of words in a text while removing stop words and applying stemming, you can use the Natural Language Toolkit (nltk) in Python. This toolkit includes a list of stop words and various stemmers.

Here's a step-by-step guide:

  • Install nltk if you haven't already:
pip install nltk 
  • Import the necessary components and download the list of stop words:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download the tokenizer model and the stop word list (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')
nltk.download('stopwords')
  • Tokenize your text, filter out stop words, and apply stemming:
# Sample text
text = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

# Set of English stop words
stop_words = set(stopwords.words('english'))

# Create a stemmer
stemmer = PorterStemmer()

# Tokenize the text
words = word_tokenize(text)

# Remove stop words and stem
filtered_words = [
    stemmer.stem(word)
    for word in words
    if word.lower() not in stop_words and word.isalnum()
]

print(filtered_words)
  • Compute the frequency distribution of the remaining words:
from nltk.probability import FreqDist

# Compute the frequency distribution
freq_dist = FreqDist(filtered_words)

# Print the frequency of each word
for word, frequency in freq_dist.items():
    print(f'{word}: {frequency}')

# Or, to print the words sorted by frequency (highest first):
for word in sorted(freq_dist, key=freq_dist.get, reverse=True):
    print(f'{word}: {freq_dist[word]}')

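This script tokenizes the text, drops stop words and any token that is not purely alphanumeric (which removes punctuation), stems the remaining words, and finally computes the frequency distribution of the resulting stems. Because PorterStemmer lowercases its input by default, "Natural" and "natural" are counted as the same stem.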
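The same pipeline can be sketched as a single reusable function. This is a minimal illustration, not part of the original script: the name word_frequencies, the toy stop word set, and the toy_stem helper are all hypothetical stand-ins chosen so the example runs without NLTK or its downloaded data. It returns a collections.Counter, which is the standard-library class that FreqDist builds on.

```python
from collections import Counter
from typing import Callable, Iterable


def word_frequencies(
    tokens: Iterable[str],
    stop_words: set[str],
    stem: Callable[[str], str],
) -> Counter:
    """Count stemmed tokens after dropping stop words and non-alphanumeric tokens."""
    kept = (
        stem(token.lower())
        for token in tokens
        if token.isalnum() and token.lower() not in stop_words
    )
    return Counter(kept)


# Toy stand-ins so the sketch runs without NLTK; both are hypothetical.
toy_stop_words = {"the", "and", "of", "a"}


def toy_stem(word: str) -> str:
    # Crude suffix stripping for illustration only; this is NOT the Porter algorithm.
    return word[:-1] if word.endswith("s") else word


# Hand-written token list standing in for the output of a tokenizer.
tokens = ["The", "cats", "and", "the", "cat", "chased", "a", "dog", ",", "dogs", "!"]
frequencies = word_frequencies(tokens, toy_stop_words, toy_stem)

# most_common() yields (word, count) pairs from most to least frequent.
for word, count in frequencies.most_common():
    print(f"{word}: {count}")
```

With NLTK installed, the same function should accept word_tokenize(text) as tokens, set(stopwords.words('english')) as stop_words, and PorterStemmer().stem as stem. Passing the stemmer in as a callable also makes it easy to swap in a different stemmer later.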

Remember to replace """Natural language processing ... languages.""" with your actual text. The str.isalnum() check keeps only tokens made up entirely of letters and digits, so it discards punctuation tokens as well as tokens that contain hyphens or apostrophes. Adjust the filter as necessary for your specific text processing needs.
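To see how the choice of filter changes what survives, the sketch below compares three token filters using only built-in string methods. The token list is hand-written for illustration and is not actual tokenizer output, and the third filter (keeping any token with at least one letter or digit) is just one possible adjustment.

```python
# Hand-written sample tokens (an assumption, not real word_tokenize output).
tokens = ["NLP", "(", "2024", "state-of-the-art", "word2vec", "languages", "."]

# Filter used in the script: letters and digits only.
alnum_tokens = [t for t in tokens if t.isalnum()]

# Stricter: letters only, which also drops numbers and mixed tokens.
alpha_tokens = [t for t in tokens if t.isalpha()]

# Looser: keep any token containing at least one letter or digit,
# so hyphenated words survive while pure punctuation is still dropped.
loose_tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(alnum_tokens)  # ['NLP', '2024', 'word2vec', 'languages']
print(alpha_tokens)  # ['NLP', 'languages']
print(loose_tokens)  # ['NLP', '2024', 'state-of-the-art', 'word2vec', 'languages']
```

Which variant fits best depends on the text: isalpha() may suit prose where numbers are noise, while the looser check may suit technical text with many hyphenated terms.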

