Have a look at NLTK for Python.
It handles operations like tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently even for large files, and offers nice extras such as dispersion plots (showing where words occur within a text) as well as word counts.
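For the tokenization and dispersion-plot side, here's a minimal sketch (the sample sentence and word list are just illustrations; word_tokenize may require downloading the punkt tokenizer models once):

>>> import nltk
>>> # nltk.download('punkt')   # tokenizer models, only needed once
>>> tokens = nltk.word_tokenize("Call me Ishmael. Some years ago, I went to sea.")
>>> text = nltk.Text(tokens)
>>> text.dispersion_plot(['Ishmael', 'sea'])  # plots word positions; needs matplotlib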
An example of the latter (taken from this NLTK cheatsheet):
>>> import nltk
>>> from nltk.book import *           # loads the sample texts text1 ... text9
>>> len(text1)                        # number of tokens (words and punctuation)
>>> text1.count("heaven")             # how many times does a word occur?
>>> fd = nltk.FreqDist(text1)         # word frequency distribution
>>> fd["the"]                         # how many occurrences of the word 'the'
>>> fd.plot(50, cumulative=False)     # chart of the 50 most frequent words
As for the second part of your question: it depends on how you want to use these numbers afterwards. If you're only interested in the raw numbers, a list is fine:
word_count = [len(text1), len(text2), len(text3)]  # ... one entry per text

# average number of words per text
print(sum(word_count) / len(word_count))
If you want to store how many words/tokens each text has and access the counts by name, you're probably better off with a dictionary:
word_count = {'first text': len(text1), 'second text': len(text2)}  # ... one entry per text

# how many words in the first text?
print(word_count['first text'])
When you're just storing word counts as plain numbers, the choice of data structure isn't really a matter of speed; either a dict or a list is fine.
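Putting both parts together, here's a sketch that tokenizes a couple of files and stores the counts in a dict (the file names are made up):

import nltk

files = ['moby_dick.txt', 'emma.txt']  # hypothetical file names

word_count = {}
for name in files:
    with open(name, encoding='utf-8') as f:
        # number of tokens (words and punctuation) per file
        word_count[name] = len(nltk.word_tokenize(f.read()))

print(word_count)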
A numpy.array would also work, but be forewarned: you can't append to a NumPy array with any efficiency, since each append allocates a new array and copies the old one.
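If you do go the NumPy route, a small sketch (again using the nltk.book sample texts; note that numpy.append returns a freshly allocated copy rather than growing the array in place):

import numpy as np
from nltk.book import *  # defines text1 ... text9

counts = np.array([len(text1), len(text2), len(text3)])
print(counts.mean())                    # average word count

counts = np.append(counts, len(text4))  # copies the whole array each time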