Commit fe6362e (parent 9a25345): some write up about word2vec

1 file changed (+77, -0 lines)

032-word2vec/README.md

# Word2vec
References:
1. https://medium.com/deep-math-machine-learning-ai/chapter-9-1-nlp-word-vectors-d51bff9628c1
2. https://medium.com/deep-math-machine-learning-ai/chapter-9-2-nlp-code-for-word2vec-neural-network-tensorflow-544db99f5334
#### 1. Corpus Vectorisation (preprocessing)
We can use count vectorisation, but we might get high count vectors for one document
and low count vectors for others. Instead we will favour TF-IDF (Term Frequency-Inverse Document Frequency).
TF-IDF is a weighting factor used to extract the important features from the documents (corpus).
It tells us how important a word is to a document in a corpus. The importance of a word increases proportionally with the number of times the word appears in the individual document; this count is the basis of Term Frequency (TF).

Ex: document 1:
“ Mady loves programming. He programs all day, he will be a world class programmer one day ”
If we apply tokenization, stemming and stopword removal (discussed in the last story) to this document, we get features with high counts like → program(3), day(2), love(1), etc.
***TF*** = (no. of times the word appears in the doc) / (total no. of words in the doc)
Here program is the most frequent term in the document,

so program is a good feature if we consider TF.
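The TF step can be sketched in Python. This is a minimal sketch with a toy tokeniser and a toy suffix-stripping stemmer (both invented for illustration; a real pipeline would use a proper stemmer such as NLTK's Porter stemmer and a real stopword list):

```python
import re
from collections import Counter

def crude_stem(word):
    # toy stemmer, for illustration only:
    # maps programming / programs / programmer -> program
    for suffix in ("ming", "mer", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequency(doc):
    # TF = (no. of times the word appears in the doc) / (total no. of words in the doc)
    words = [crude_stem(w) for w in re.findall(r"[a-z]+", doc.lower())]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

doc = ("Mady loves programming. He programs all day, "
       "he will be a world class programmer one day")
tf = term_frequency(doc)
# tf["program"] == 3/16 and tf["day"] == 2/16: program(3), day(2), love(1) as above
```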
However, if multiple documents contain the word “program” many times, then we might say…

it is also a frequent word in all the other documents in our corpus, so it does not carry much meaning and probably is not an important feature.
To adjust for this, we use IDF.
The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
***IDF*** = log(total no. of documents / no. of documents containing the term t).

So TF-IDF = TF * IDF.
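Putting the two formulas together, here is a minimal sketch over a toy, already-tokenised three-document corpus (corpus contents invented for illustration):

```python
import math
from collections import Counter

def tf(doc):
    # TF = count of the word / total words in the document
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

def idf(corpus):
    # IDF = log(total no. of documents / no. of documents containing the term)
    n = len(corpus)
    vocab = {w for doc in corpus for w in doc}
    return {w: math.log(n / sum(1 for doc in corpus if w in doc)) for w in vocab}

def tf_idf(corpus):
    idfs = idf(corpus)
    return [{w: f * idfs[w] for w, f in tf(doc).items()} for doc in corpus]

corpus = [
    ["mady", "love", "program", "program", "day", "program", "day"],
    ["mady", "love", "pizza"],
    ["python", "program", "tutorial"],
]
weights = tf_idf(corpus)
# "program" is frequent in doc 0 but also appears in doc 2, so its IDF
# (log(3/2)) damps it; "day" appears only in doc 0, so its IDF is log(3)
```

Note how the IDF term downweights “program” even though it has the highest TF in the first document.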
Problems:
1. TF-IDF and count vectorisations do not maintain order or semantic relationships
between the words

2. Instead we need to build a Word2Vec model -> converts this high dimensional vector (10000 sized) into a low dimensional vector (let’s say 200 sized)
#### 2. Word2vec
Word2vec takes care of 2 things:
1. It converts this high dimensional vector (10000 sized) into a low dimensional vector (let’s say 200 sized)
2. It maintains the word context (meaning)

The word context/meaning can be captured using two simple algorithms:
1. Continuous Bag-of-Words model (CBOW)
It takes the surrounding (context) words as input and tries to predict the target word.

Ex: Text = “Mady goes crazy about machine learning” and the window size is 3

-> [ [“Mady”, “crazy”], “goes” ] → “goes” is the target word
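The ([context], target) pairs above can be generated with a short sketch. One assumption is how the window is interpreted: here window size 3 means one word on each side of the target:

```python
def cbow_pairs(tokens, window=3):
    # for each position, the context words inside the window predict the centre word
    half = window // 2
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
        pairs.append((context, target))
    return pairs

tokens = "Mady goes crazy about machine learning".split()
pairs = cbow_pairs(tokens, window=3)
# pairs[1] == (["Mady", "crazy"], "goes"), matching the example above
```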
2. Skip-Gram model
It takes one word as input and tries to predict the surrounding (neighboring) words:
[“goes”, “Mady”], [“goes”, “crazy”] → “goes” is the input word and “Mady” and “crazy” are the surrounding words (output probabilities)
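The same window logic, inverted, yields the skip-gram (input, surrounding) pairs (again assuming window size 3 means one word on each side):

```python
def skipgram_pairs(tokens, window=3):
    # for each position, the centre word predicts every other word inside the window
    half = window // 2
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

tokens = "Mady goes crazy about machine learning".split()
pairs = skipgram_pairs(tokens, window=3)
# ("goes", "Mady") and ("goes", "crazy") both appear in pairs
```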
What is word2vec in short?
→ it is a neural network trained over all the words in our dictionary to get the weights (vectors)
→ it has word embeddings for every word in the dictionary
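The “weights are the vectors” point can be made concrete: a one-hot input of vocabulary size V multiplied by a V x d weight matrix just selects one row, so after training, row i of the hidden-layer weights is the embedding of word i. A toy sketch (vocabulary and sizes invented; the weights here are random and untrained):

```python
import random

vocab = ["mady", "goes", "crazy", "about", "machine", "learning"]
V, d = len(vocab), 4  # toy vocabulary size and embedding size
random.seed(0)
# hidden-layer weight matrix: one row per word in the dictionary
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]

def embedding(word):
    # multiplying one-hot(word) by W is just a row lookup
    return W[vocab.index(word)]

vec = embedding("goes")  # after training, this row would encode the word's context
```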