A TensorFlow implementation of word2vec applied to the Stanford Encyclopedia of Philosophy; the implementation supports both CBOW and skip-gram.
For more background, please have a look at these papers:
- Distributed Representations of Words and Phrases and their Compositionality
- word2vec Parameter Learning Explained
- Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
After training, the model returns some interesting results; here are a few of them:
Evaluating hume - empiricist + rationalist:
descartes malebranche spinoza hobbes herder Similar words to death:
untimely ravages grief torment Similar words to god:
divine De Providentia christ Hesiod Similar words to love:
friendship affection christ reverence Similar words to life:
career live lifetime community society Similar words to brain:
neurological senile nerve nervous Evaluating hume - empiricist + rationalist:
descartes malebranche spinoza hobbes herder Evaluating ethics - rational:
hiroshima Evaluating ethic - reason:
inegalitarian anti-naturalist austere Evaluating moral - rational:
commonsense Evaluating life - death + love:
self-positing friendship care harmony Evaluating death + choice:
regret agony misfortune impending Evaluating god + human:
divine inviolable yahweh god-like man Evaluating god + religion:
amida torah scripture buddha sokushinbutsu Evaluating politic + moral:
rights-oriented normative ethics integrity - an object to crawl data from the philosophy encyclopedia; PlatoData
- an object to build the vocabulary based on the crawled data; VocabBuilder
- the model that computes the continuous distributed representations of words; Philo2Vec
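As a rough sketch of how these three pieces chain together: the `PlatoData` calls shown below are assumptions (the training examples further down only use a bare `get_data()` helper), while the `VocabBuilder` and `Philo2Vec` calls mirror the documented examples.

```python
# Hypothetical wiring of the three components; PlatoData's interface is an assumption,
# the VocabBuilder / Philo2Vec calls follow the documented examples below.
data = PlatoData()                           # crawls the philosophy encyclopedia (assumed API)
x_train = data.get_data()                    # tokenized articles from the crawl (assumed API)

vb = VocabBuilder(x_train, min_frequency=5)  # vocabulary of words seen at least 5 times
pv = Philo2Vec(vb, model=Philo2Vec.CBOW)     # embedding model built on top of the vocabulary
pv.fit(epochs=30)                            # train the word vectors
```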
The dependencies used for this module can be easily installed with pip:
> pip install -r requirements.txt

The following params can be used to configure the vocabulary and the model (a combined example is sketched right after this list):
- min_frequency: the minimum frequency of the words to be used in the model.
- size: the size of the data; the model then uses the top `size` most frequent words.
- optimizer: an instance of a TensorFlow `Optimizer`, such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.
- model: the model to use to create the vectorized representation; possible values: `CBOW`, `SKIP_GRAM`.
- loss_fct: the loss function used to calculate the error; possible values: `SOFTMAX`, `NCE`.
- embedding_size: the dimensionality of the word embeddings.
- neg_sample_size: the number of negative samples for each positive sample.
- num_skips: the number of skips for a `SKIP_GRAM` model.
- context_window: the window size; this window is used to create the context for calculating the vector representations [ window target window ].
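For illustration, a parameter dict combining most of the options above might look like the sketch below. The hyperparameter values and the choice of optimizer are assumptions (any TensorFlow 1.x `tf.train` optimizer instance should fit the description above), not recommended defaults:

```python
import tensorflow as tf

# Illustrative values only. `min_frequency` is passed to VocabBuilder, as in the
# training examples below; the remaining options go to the Philo2Vec constructor.
x_train = get_data()
vb = VocabBuilder(x_train, min_frequency=5)

params = {
    'optimizer': tf.train.AdagradOptimizer(1.0),  # any tf.train Optimizer instance
    'model': Philo2Vec.SKIP_GRAM,
    'loss_fct': Philo2Vec.NCE,
    'embedding_size': 128,       # dimensionality of the word vectors
    'neg_sample_size': 5,        # negative samples per positive sample
    'num_skips': 2,              # context words sampled per target (skip-gram only)
    'context_window': 3,         # words on each side of the target
}
pv = Philo2Vec(vb, **params)
```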
Training a CBOW model:

```python
params = {
    'model': Philo2Vec.CBOW,
    'loss_fct': Philo2Vec.NCE,
    'context_window': 5,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

Training a SKIP_GRAM model:

```python
params = {
    'model': Philo2Vec.SKIP_GRAM,
    'loss_fct': Philo2Vec.SOFTMAX,
    'context_window': 2,
    'num_skips': 4,
    'neg_sample_size': 2,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

Since the words are stemmed as part of the preprocessing, some operations are sometimes necessary to map between stemmed and original forms:
```python
StemmingLookup.stem('religious')  # returns "religi"
StemmingLookup.original_form('religi')  # returns "religion"
```

To get the words similar to a list of words:

```python
pv.get_similar_words(['rationalist', 'empirist'])
```

To evaluate an operation on word vectors:

```python
pv.evaluate_operation('moral - rational')
```

To plot a set of words:

```python
pv.plot(['hume', 'empiricist', 'descart', 'rationalist'])
```
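For intuition about what evaluate_operation-style queries do, word2vec analogies are conventionally resolved by simple vector arithmetic followed by a cosine-similarity lookup. The snippet below is a generic numpy illustration of that idea, not the code used in this repository:

```python
import numpy as np

def closest_words(embeddings, vocab, query_vec, top_k=5):
    """Return the top_k vocabulary words whose embedding has the highest
    cosine similarity with query_vec."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = normed @ query                  # cosine similarity against every word
    best = np.argsort(-scores)[:top_k]       # indices of the best-scoring words
    return [vocab[i] for i in best]

# An operation such as "hume - empiricist + rationalist" becomes arithmetic on the
# learned vectors; a well-trained model should rank words like 'descartes' highly:
#   result = vec('hume') - vec('empiricist') + vec('rationalist')
#   closest_words(embeddings, vocab, result)
```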







