# Word2vec

References:
1. https://medium.com/deep-math-machine-learning-ai/chapter-9-1-nlp-word-vectors-d51bff9628c1
2. https://medium.com/deep-math-machine-learning-ai/chapter-9-2-nlp-code-for-word2vec-neural-network-tensorflow-544db99f5334

#### 1. Corpus Vectorisation (preprocessing)

We could use count vectorisation, but we might get high-count vectors for one document and low-count vectors for others. Instead we will favour TF-IDF (Term Frequency-Inverse Document Frequency).

TF-IDF is a weighting factor used to pick out the important features from the documents (the corpus).

It tells us how important a word is to a document in a corpus. The importance of a word increases proportionally to the number of times the word appears in the individual document; this part is called Term Frequency (TF).

Ex: document 1:

"Mady loves programming. He programs all day, he will be a world class programmer one day"

If we apply tokenization, stemming and stop-word removal (we discussed these in the last story) to this document, we get features with high counts like program(3), day(2), love(1), etc.

***TF*** = (no. of times the word appears in the doc) / (total no. of words in the doc)

Here "program" is the most frequent term in the document,

so "program" is a good feature if we consider TF alone.

However, if multiple documents contain the word "program" many times, then it is a frequent word across the whole corpus as well, so it carries little discriminating meaning and is probably not an important feature.

To adjust for this we use IDF.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents:

***IDF*** = log(total no. of documents / no. of documents containing the term t)

so TF-IDF = TF * IDF.
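As a minimal sketch of the two formulas above in plain Python (the three-document toy corpus here is my own invention, not from the article):

```python
import math

# Hypothetical toy corpus, already tokenized/stemmed
docs = [
    ["mady", "loves", "program", "program", "program", "day", "day"],
    ["program", "is", "fun"],
    ["cats", "sleep", "all", "day"],
]

def tf(term, doc):
    # TF = (times the term appears in the doc) / (total words in the doc)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # IDF = log(total docs / docs containing the term)
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "program" appears in two of the three docs, so its IDF is
# lower than that of "mady", which appears in only one
print(tf_idf("program", docs[0], docs))
print(tf_idf("mady", docs[0], docs))
```

With only three documents the weighting is crude, but the same code applied to a real corpus down-weights corpus-wide words exactly as described.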

Problems:
1. TF-IDF and count vectorisation do not maintain the order of the words or the semantic relationships between them.

2. Instead we can build a Word2vec model, which converts a high-dimensional vector (10,000-sized) into a low-dimensional vector (let's say 200-sized).


#### 2. Word2vec

Word2vec takes care of 2 things:

1. It converts a high-dimensional vector (10,000-sized) into a low-dimensional vector (let's say 200-sized).
2. It maintains the word context (meaning).

The word context/meaning can be captured using 2 simple algorithms:

1. Continuous Bag-of-Words model (CBOW)

It takes the surrounding (context) words as input and tries to predict the target word.

Ex: Text = "Mady goes crazy about machine learning" and window size is 3

-> [["Mady", "crazy"], "goes"] → "goes" is the target word
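A minimal sketch of how such (context, target) training pairs could be generated from the sentence above; the helper name `cbow_pairs` is my own, not part of any library:

```python
def cbow_pairs(tokens, window=3):
    """Yield ([context words], target) pairs for a CBOW model.

    With window=3, the context is the one word on each side
    of the target, as in the example above.
    """
    half = window // 2
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - half), min(len(tokens), i + half + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

tokens = "Mady goes crazy about machine learning".split()
print(cbow_pairs(tokens)[1])  # (['Mady', 'crazy'], 'goes')
```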

2. Skip-Gram model

It takes one word as input and tries to predict the surrounding (neighbouring) words.

["goes", "Mady"], ["goes", "crazy"] → "goes" is the input word and "Mady" and "crazy" are the surrounding words (output probabilities)
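The mirror image of the CBOW case: a sketch generating (input, neighbour) training pairs for Skip-Gram (the helper name `skipgram_pairs` is my own):

```python
def skipgram_pairs(tokens, window=3):
    """Yield (input word, neighbouring word) pairs for a Skip-Gram model.

    Each word predicts every other word within half the window
    on each side of it.
    """
    half = window // 2
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

tokens = "Mady goes crazy about machine learning".split()
# input "goes" predicts its neighbours "Mady" and "crazy"
print([p for p in skipgram_pairs(tokens) if p[0] == "goes"])
```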


What is word2vec in short?

→ It is a neural network trained over all the words in our dictionary to get the weights (the word vectors).

→ It gives us a word embedding for every word in the dictionary.
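To illustrate why "the weights are the vectors", here is a sketch with NumPy: the input-to-hidden weight matrix of the network has one row per vocabulary word, and feeding in a one-hot input just selects that row. The tiny vocabulary and the random (untrained) weights are my own stand-ins for a real trained model:

```python
import numpy as np

# Hypothetical tiny vocabulary; in practice this would be ~10,000 words
vocab = ["mady", "goes", "crazy", "about", "machine", "learning"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

vocab_size, embed_dim = len(vocab), 4  # the article's 10,000 -> 200, shrunk

rng = np.random.default_rng(0)
# Input-to-hidden weight matrix: after training, row i is the
# embedding (word vector) of vocabulary word i. Random here,
# standing in for trained weights.
W = rng.normal(size=(vocab_size, embed_dim))

def embedding(word):
    # Multiplying a one-hot input vector by W just selects a row of W
    one_hot = np.zeros(vocab_size)
    one_hot[word_to_idx[word]] = 1.0
    return one_hot @ W

# The one-hot product and a direct row lookup agree
assert np.allclose(embedding("goes"), W[word_to_idx["goes"]])
print(embedding("goes").shape)  # (4,)
```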