
I am working with very large text data with millions of lines in it. As a basic step of text analytics, I need to split the text into individual words and store the number of words in each line.

1) Is line.split() an efficient way to split text into words? (Not bothered about punctuation)

2) What is an efficient way to store the word counts? Arrays, lists, or tuples? Which one is faster?

Sorry if this seems too basic. I am just getting started.
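For reference, a minimal sketch of what I mean (the file name is hypothetical):

```python
def per_line_word_counts(path):
    """Return a list with the number of words on each line of the file."""
    counts = []
    # reading line by line keeps memory usage flat even for millions of lines
    with open(path) as f:
        for line in f:
            # split() with no argument splits on any run of whitespace
            # and ignores leading/trailing whitespace
            counts.append(len(line.split()))
    return counts
```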


3 Answers


Have a look at NLTK for Python.

It handles operations like tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently for large files, and provides useful features like dispersion plots (showing where words occur in the text) as well as word counts.

An example for the latter (taken from this NLTK cheatsheet):

>>> len(text1)                     # number of words
>>> text1.count("heaven")          # how many times does a word occur?
>>> fd = nltk.FreqDist(text1)      # information about word frequency
>>> fd["the"]                      # how many occurrences of the word 'the'
>>> fd.plot(50, cumulative=False)  # generate a chart of the 50 most frequent words

As for the second part of your question, it depends on how you want to use these numbers further. If you're just interested in the raw numbers, a list is fine:

word_count = [len(text1), len(text2), len(text3), ...]

# how many words on average?
print(sum(word_count) / len(word_count))

If you want to store which text has how many words/tokens and you want to access them by names, maybe you're better off with a dictionary:

word_count = {'first text': len(text1), 'second text': len(text2), ...}

# how many words in the first text?
print(word_count['first text'])

When you're only storing word counts as simple numbers, the choice of data structure is not really a matter of speed; either a dict or a list is fine.




This is the simplest way I can think of to get a word count.

with open('sample_file.txt') as f:
    word_count = 0
    for line in f:
        # split() with no argument skips runs of whitespace,
        # so repeated spaces don't inflate the count
        word_count += len(line.split())
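One caveat worth knowing here: `split(' ')` and `split()` with no argument behave differently on runs of whitespace, which changes the count:

```python
line = '  two   words \n'

# splitting on a literal space keeps empty strings for each extra space
print(line.split(' '))

# splitting with no argument collapses whitespace runs
print(line.split())  # -> ['two', 'words']
```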

5 Comments

This worked! Also, could you please tell me how to write a dictionary to a text file. I tried json...it doesn't preserve the order. pickle didn't work either
@ThesisGrad that sounds like a new question you should probably ask. But you should probably first do some research on the matter. I'm not quite sure what you mean by writ[ing] a dictionary to a text file.
For example, is this what you're looking for? This was found by using a simple SO query.
I wanted to save the contents of my dictionary to a text file. I tried it using json.dump and pickle.dump. However, they do not output the result in alphabetical order. In the end, I ended up modifying my code. Sorry for the confusion.
@ThesisGrad that's because dictionaries aren't ordered. You'll need to use an "ordered dictionary" for that.
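Following up on the ordering discussion in these comments: if alphabetical order in the output file is the goal, `json.dump` can sort the keys for you (a sketch with made-up counts; the file name is hypothetical):

```python
import json

word_count = {'zebra': 3, 'apple': 7, 'mango': 2}

# sort_keys=True writes the keys in alphabetical order,
# regardless of the dict's insertion order
with open('word_count.json', 'w') as f:
    json.dump(word_count, f, sort_keys=True, indent=2)
```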

A simple way to do this is with the excellent collections module from the standard library.

import collections
import re

words = re.findall(r'\w+', open('file.txt').read().lower())
print(collections.Counter(words).most_common())

This will give you a list of tuples of words and how frequently each word occurs.

With the most_common(n) method, specifying a value for n returns the n most common elements.
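For example, with a small made-up word list:

```python
from collections import Counter

words = ['the', 'cat', 'the', 'hat', 'the', 'cat']

# most_common(2) returns the two highest-count (word, count) pairs
print(Counter(words).most_common(2))  # -> [('the', 3), ('cat', 2)]
```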

