Introduction and implementation of Word Embedding and Word2Vec with PyTorch.

Amit Kumar
3 min read · Apr 7, 2021



What is Word Embedding?

In traditional methods for creating vector representations of words, we use one-hot representations: vectors of the same length as the vocabulary, with 0’s everywhere except a single position set to 1, corresponding to the word’s index.
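
As an illustration, here is a minimal one-hot sketch in PyTorch (the toy vocabulary is made up for the example, not taken from the article’s code):

    import torch

    # Toy vocabulary; in practice it is built from the corpus.
    vocab = ["the", "monster", "walked", "across", "ice"]
    word_to_index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        """Return a vector of length len(vocab) with a single 1 at the word's index."""
        vec = torch.zeros(len(vocab))
        vec[word_to_index[word]] = 1.0
        return vec

    print(one_hot("monster"))  # tensor([0., 1., 0., 0., 0.])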

Problems with one-hot encoding:

The first is the size of the vector representation: it is the same as the size of the vocabulary. In real life the vocabulary can contain millions of words, and the vectors are sparse, so storing them is a real issue.

The second problem is that it doesn’t preserve correlation between words. We can’t tell which words occur together by looking at the vector representation of a particular word, and we can’t derive any meaning from it.

Approaches to word embedding:

In word embedding methods we learn representations from the data. Although the data is unlabeled, we still train our model in a supervised fashion. This is made possible by arranging an auxiliary supervised task in which the data is implicitly labeled. By optimizing this supervised task, the model captures many statistical as well as linguistic properties of the text. The choice of auxiliary task depends on the application. Here we will look at two main methods: GloVe and Word2Vec.

Word2Vec

Word2Vec uses an auxiliary supervised task to obtain embeddings. It comes in two flavors (both involving neural networks): Skip-Gram and Continuous Bag of Words (CBOW). Both approaches use a neural network to learn the embeddings.

CBOW and Skip-gram

Skip-Gram

In Skip-Gram, the distributed representation of the input word is used to predict its context: for each center word, the network tries to predict its context words (how many words it predicts, the window size, is set as a hyperparameter).

Here we can see that for each center word the network outputs a probability distribution over the corresponding context words.
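
For intuition, a minimal Skip-Gram-style model could look like the sketch below; the class name and the sizes are illustrative assumptions, not the article’s code:

    import torch
    import torch.nn as nn

    class SkipGramModel(nn.Module):
        """Predict context words from a center word (illustrative sketch)."""
        def __init__(self, vocabulary_size, embedding_size):
            super().__init__()
            self.embedding = nn.Embedding(vocabulary_size, embedding_size)
            self.fc = nn.Linear(embedding_size, vocabulary_size)

        def forward(self, center_word_index):
            # center_word_index: (batch,) tensor of word indices
            embedded = self.embedding(center_word_index)  # (batch, embedding_size)
            return self.fc(embedded)                      # scores over the vocabulary

    # Usage: probability distribution over possible context words for one center word.
    model = SkipGramModel(vocabulary_size=5000, embedding_size=100)
    scores = model(torch.tensor([42]))    # shape: (1, 5000)
    probs = torch.softmax(scores, dim=1)  # probabilities over the vocabulary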

Continuous Bag of Words (CBOW)

In this method the network takes the context around each center word as input and tries to predict the center word corresponding to that context.

Now we implement the Word2Vec method using the CBOW approach. For this we use the Frankenstein dataset. First we preprocess the text and convert it into tokens. The next step is to enumerate the dataset as a sequence of windows so that the CBOW model can be optimized. To do this we iterate over the list of tokens in each line and group them by window size, as sketched below. You can download the processed file from here.
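
A rough sketch of this windowing step follows; the window size of 2 and the helper name make_cbow_pairs are assumptions for illustration:

    # Turn tokenized lines into CBOW (context, target) pairs.
    WINDOW_SIZE = 2

    def make_cbow_pairs(tokenized_lines, window_size=WINDOW_SIZE):
        pairs = []
        for tokens in tokenized_lines:
            for i, target in enumerate(tokens):
                # Collect up to `window_size` tokens on each side of the target word.
                context = tokens[max(0, i - window_size):i] + tokens[i + 1:i + 1 + window_size]
                if context:
                    pairs.append((context, target))
        return pairs

    lines = [["the", "monster", "walked", "across", "the", "ice"]]
    pairs = make_cbow_pairs(lines)
    print(pairs[:2])
    # [(['monster', 'walked'], 'the'), (['the', 'walked', 'across'], 'monster')]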

Then we implement a Vocabulary class as well as a vectorizer, as we generally do in NLP, and then we implement a Dataset so it can be consumed by a PyTorch DataLoader.
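
A simplified sketch of these pieces, continuing from the pairs built above, might look as follows; the exact class interfaces in the full code differ, and the <unk> token and padding scheme are assumptions:

    import torch
    from torch.utils.data import Dataset, DataLoader

    class Vocabulary:
        """Maps tokens to integer indices (simplified sketch)."""
        def __init__(self):
            self.token_to_idx = {"<unk>": 0}
            self.idx_to_token = {0: "<unk>"}

        def add_token(self, token):
            if token not in self.token_to_idx:
                idx = len(self.token_to_idx)
                self.token_to_idx[token] = idx
                self.idx_to_token[idx] = token
            return self.token_to_idx[token]

        def lookup(self, token):
            return self.token_to_idx.get(token, 0)

        def __len__(self):
            return len(self.token_to_idx)

    class CBOWDataset(Dataset):
        """Wraps (context, target) pairs so a DataLoader can batch them."""
        def __init__(self, pairs, vocab, window_size=2):
            self.pairs = pairs
            self.vocab = vocab
            self.window_size = window_size

        def __len__(self):
            return len(self.pairs)

        def __getitem__(self, index):
            context, target = self.pairs[index]
            # Vectorize: pad/truncate the context to a fixed length of 2 * window_size.
            indices = [self.vocab.lookup(t) for t in context][: 2 * self.window_size]
            indices += [0] * (2 * self.window_size - len(indices))
            return torch.tensor(indices), torch.tensor(self.vocab.lookup(target))

    # Build the vocabulary from the pairs and wrap everything in a DataLoader.
    vocab = Vocabulary()
    for context, target in pairs:
        for token in context + [target]:
            vocab.add_token(token)
    dataset = CBOWDataset(pairs, vocab)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)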

For the classifier we use an embedding layer as the first layer: it creates a vector for each word in the context, and these vectors are combined in some way so that the overall context is captured. After that we use a fully connected layer. Here vocabulary_size is the total number of tokens in the vocabulary and embedding_size is the size of the vector we want for each word representation.
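
A minimal version of such a classifier might look like this sketch, where summing the context embeddings is one common way of combining them (the actual model in the full code may differ):

    import torch
    import torch.nn as nn

    class CBOWClassifier(nn.Module):
        def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
            super().__init__()
            # Embedding layer: one embedding_size-dimensional vector per token.
            self.embedding = nn.Embedding(vocabulary_size, embedding_size,
                                          padding_idx=padding_idx)
            # Fully connected layer mapping the combined context to vocabulary scores.
            self.fc = nn.Linear(embedding_size, vocabulary_size)

        def forward(self, context_indices):
            # context_indices: (batch, context_length) tensor of token indices.
            embedded = self.embedding(context_indices)  # (batch, context_length, embedding_size)
            combined = embedded.sum(dim=1)              # combine context vectors by summing
            return self.fc(combined)                    # (batch, vocabulary_size) scores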

After training this model, the embedding layer’s weight matrix contains the representations of the words in the vocabulary.
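
Continuing from the previous sketches (CBOWClassifier, vocab, and loader), a rough training loop could look like the following; the loss, optimizer, and hyperparameters are assumptions:

    import torch.nn as nn
    import torch.optim as optim

    model = CBOWClassifier(vocabulary_size=len(vocab), embedding_size=50)
    criterion = nn.CrossEntropyLoss()  # predicting the center word is a classification task
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(10):
        for context_batch, target_batch in loader:
            optimizer.zero_grad()
            scores = model(context_batch)           # (batch, vocabulary_size)
            loss = criterion(scores, target_batch)
            loss.backward()
            optimizer.step()

    # After training, each row of this matrix is the learned representation of one word.
    embedding_matrix = model.embedding.weight.detach()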

We can find the closest words to a given word: we first look up the word’s embedding, then compute its distance to the vectors of the other words in the vocabulary, and output the top 5 closest words.
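
As a sketch, cosine similarity over the embedding matrix from the previous step can be used for this lookup; the distance metric and the helper name are assumptions:

    import torch
    import torch.nn.functional as F

    def closest_words(word, vocab, embedding_matrix, top_k=5):
        """Return the top_k words whose embeddings are most similar to `word`."""
        query = embedding_matrix[vocab.lookup(word)]
        # Cosine similarity between the query vector and every embedding in the vocabulary.
        sims = F.cosine_similarity(query.unsqueeze(0), embedding_matrix, dim=1)
        sims[vocab.lookup(word)] = float("-inf")  # exclude the query word itself
        top = torch.topk(sims, top_k).indices.tolist()
        return [vocab.idx_to_token[i] for i in top]

    print(closest_words("monster", vocab, embedding_matrix))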

You can get the full code here. This is a very basic implementation of Word2Vec; there are other techniques as well, such as GloVe.

Thanks for reading!

References

  1. CS224N Stanford NLP course, Lecture 2: http://web.stanford.edu/class/cs224n/slides/cs224n-2021-lecture02-wordvecs2.pdf
  2. Natural Language Processing with PyTorch (book): https://www.oreilly.com/library/view/natural-language-processing/9781491978221/?ar
  3. Word2Vec original paper: https://arxiv.org/abs/1301.3781

Written by Amit Kumar

Lead machine learning engineer | NLP | Computer Vision
