
I am working with very large text data with millions of lines in it. As a basic step of text analytics, I need to split the text into individual words and store the number of words in each line.

1) Is line.split() an efficient way to split text into words? (Not bothered about punctuation)

2) What is an efficient way to store the word counts? Arrays, lists, or tuples? Which one is faster?

Sorry if this seems too basic. I am just getting started.
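For reference, a minimal sketch of what I mean (the file name is hypothetical):

```python
def per_line_word_counts(path):
    """Return a list with the number of words on each line of the file."""
    counts = []
    # reading line by line keeps memory usage flat even for millions of lines
    with open(path) as f:
        for line in f:
            # split() with no argument splits on any run of whitespace
            # and ignores leading/trailing whitespace
            counts.append(len(line.split()))
    return counts
```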


3 Answers


Have a look at NLTK for Python.

It handles operations like tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently for large files, and provides useful features like dispersion plots (showing where words occur in the text) as well as word counts.

An example for the latter (taken from this NLTK cheatsheet):

>>> len(text1)                     # number of words
>>> text1.count("heaven")          # how many times does a word occur?
>>> fd = nltk.FreqDist(text1)      # information about word frequency
>>> fd["the"]                      # how many occurrences of the word 'the'
>>> fd.plot(50, cumulative=False)  # generate a chart of the 50 most frequent words

As for the second part of your question, it depends on how you want to use these numbers further. If you're just interested in the raw numbers, a list is fine:

word_count = [len(text1), len(text2), len(text3), ...]

# how many words on average?
print(sum(word_count) / len(word_count))

If you want to store which text has how many words/tokens and you want to access them by names, maybe you're better off with a dictionary:

word_count = {'first text': len(text1), 'second text': len(text2), ...}

# how many words in the first text?
print(word_count['first text'])

When you're only storing word counts as simple numbers, the choice of data structure is not really a matter of speed; either a dict or a list is fine.




This is the simplest way I can think of to get a word count.

with open('sample_file.txt') as f:
    word_count = 0
    for line in f:
        # split() with no argument skips runs of whitespace,
        # so repeated spaces don't inflate the count
        word_count += len(line.split())
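One caveat worth knowing here: `split(' ')` and `split()` with no argument behave differently on runs of whitespace, which changes the count:

```python
line = '  two   words \n'

# splitting on a literal space keeps empty strings for each extra space
print(line.split(' '))

# splitting with no argument collapses whitespace runs
print(line.split())  # -> ['two', 'words']
```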

5 Comments

This worked! Also, could you please tell me how to write a dictionary to a text file. I tried json...it doesn't preserve the order. pickle didn't work either
@ThesisGrad that sounds like a new question you should probably ask. But you should probably first do some research on the matter. I'm not quite sure what you mean by writ[ing] a dictionary to a text file.
For example, is this what you're looking for? This was found by using a simple SO query.
I wanted to save the contents of my dictionary to a text file. I tried it using json.dump and pickle.dump. However, they do not output the result in alphabetical order. In the end, I ended up modifying my code. Sorry for the confusion.
@ThesisGrad that's because dictionaries aren't ordered. You'll need to use an "ordered dictionary" for that.
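Following up on the ordering discussion in these comments: if alphabetical order in the output file is the goal, `json.dump` can sort the keys for you (a sketch with made-up counts; the file name is hypothetical):

```python
import json

word_count = {'zebra': 3, 'apple': 7, 'mango': 2}

# sort_keys=True writes the keys in alphabetical order,
# regardless of the dict's insertion order
with open('word_count.json', 'w') as f:
    json.dump(word_count, f, sort_keys=True, indent=2)
```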

A simple way to do this is with the excellent collections module from the standard library.

import collections
import re

words = re.findall(r'\w+', open('file.txt').read().lower())
print(collections.Counter(words).most_common())

This will give you a list of tuples of words and how frequently each word occurs.

With the most_common(n) method, specifying a value for n returns the n most common elements.
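For example, with a small made-up word list:

```python
from collections import Counter

words = ['the', 'cat', 'the', 'hat', 'the', 'cat']

# most_common(2) returns the two highest-count (word, count) pairs
print(Counter(words).most_common(2))  # -> [('the', 3), ('cat', 2)]
```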

