
Word Embeddings

What are Word Embeddings?

Word embeddings are a method for translating a string of text into an N-dimensional vector of real numbers. Many computational methods cannot accept raw text as input; embeddings are one way of transforming text into a numeric space that those methods can work with.

An Example

Word "Royalty" Feature "Masculinity" Feature
Queen 0.958123 0.03591294
King 0.94981289 0.92959219
Man 0.051231 0.9592109321
Woman 0.0912987 0.04912983189

Let's say we were to create embeddings for the four words above, with two features representing Royalty and Masculinity. The resulting embedding for the word "Queen" is <0.958123, 0.03591294> - a value close to 1 for Royalty, and a value close to 0 for Masculinity. In this example, we could identify the concept of Masculinity by the vector <0,1> and Royalty by <1,0>.

Word    Feature 1    Feature 2    Feature 3    Feature 4
Queen   0.854712     0.32145      0.123512     0.023051
King    0.9521       0.23151      0.721385     0.950213
Man     0.12351      0.067231     0.85921      0.821231
Woman   0.23131      0.123151     0.011249     0.23151

However, real-world examples are rarely this simple. In general, we do not know exactly what each of the N features means. It is very common to test the quality of embeddings using analogies. For example, even though we may not be able to identify a feature for Royalty, we expect that the differences "King - Man" and "Queen - Woman" should give us roughly the same vector, representing Royalty. Similarly, for Masculinity we can test the similarity of the vectors "Man - Woman" and "King - Queen".
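
As a concrete illustration, the analogy test can be run directly on the toy two-feature vectors from the first table. The sketch below uses those made-up numbers and plain NumPy; real embeddings typically have hundreds of opaque dimensions.

```python
import numpy as np

# Toy 2-feature embeddings (Royalty, Masculinity) from the first table above
vectors = {
    "queen": np.array([0.958123, 0.03591294]),
    "king":  np.array([0.94981289, 0.92959219]),
    "man":   np.array([0.051231, 0.9592109321]),
    "woman": np.array([0.0912987, 0.04912983189]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King - Man" and "Queen - Woman" should both point roughly along the Royalty axis
royalty_1 = vectors["king"] - vectors["man"]
royalty_2 = vectors["queen"] - vectors["woman"]
print(cosine(royalty_1, royalty_2))   # close to 1.0

# "Man - Woman" and "King - Queen" should both point roughly along the Masculinity axis
masc_1 = vectors["man"] - vectors["woman"]
masc_2 = vectors["king"] - vectors["queen"]
print(cosine(masc_1, masc_2))         # close to 1.0
```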

Common Embedding Models

Word2Vec

Developed by Google, it uses the skip-gram and continuous bag-of-words (CBOW) architectures to create word embeddings. You can read more on Word2Vec's Wikipedia page. The Gensim package in Python provides an implementation of these algorithms.
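
As a rough sketch of how this looks in practice, the snippet below trains a tiny skip-gram model with Gensim (4.x API); the toy corpus and parameter values are placeholders, not recommendations.

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; a real corpus would be much larger.
sentences = [
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "man", "walks", "home"],
    ["the", "woman", "walks", "home"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

print(model.wv["queen"])                # the learned 50-dimensional vector
print(model.wv.most_similar("king"))    # nearest neighbours by cosine similarity
```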

GloVe (Global Vectors for Word Representation)

Developed by Stanford, it creates embeddings by aggregating global word-word co-occurrence statistics from a corpus. Pre-trained vectors and more information are available from the Stanford NLP group.
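
Pre-trained GloVe vectors are distributed as plain-text files, one word per line followed by its vector components. The loader below is a minimal sketch; the file name assumes you have downloaded the 100-dimensional "glove.6B" set.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

# Example file name from the pre-trained 100-dimensional set
glove = load_glove("glove.6B.100d.txt")
print(glove["king"].shape)   # (100,)
```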

FastText

Developed by Facebook, it extends Word2Vec by representing words as bags of character n-grams, which helps in handling out-of-vocabulary words (words which did not exist in the training data).
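
Gensim also ships a FastText implementation. The sketch below (toy corpus, placeholder parameters) shows the main practical benefit: the model can still produce a vector for a word that never appeared in training, by composing its character n-grams.

```python
from gensim.models import FastText

sentences = [
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["fasttext", "uses", "character", "ngrams"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# "embedding" (singular) never appears in the corpus, but FastText can still
# build a vector for it from the character n-grams it shares with "embeddings".
print(model.wv["embedding"])
```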

BERT (Bidirectional Encoder Representations from Transformers)

Developed by Google, it uses transformers to create context-aware embeddings by considering the entire sentence.
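
A common way to obtain these contextual embeddings is to take the final hidden states of a pre-trained BERT model through the Hugging Face transformers library, as in the sketch below; note that each token receives its own vector, so the same word can embed differently in different sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The queen addressed the nation."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token (including [CLS] and [SEP])
token_vectors = outputs.last_hidden_state[0]
print(token_vectors.shape)   # e.g. torch.Size([8, 768])
```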

ELMo (Embeddings from Language Models)

Developed by the Allen Institute for AI, it generates context-sensitive embeddings using deep bidirectional LSTM models.

GPT (Generative Pre-trained Transformer)

Developed by OpenAI, it generates embeddings using transformer-based architectures, focusing on language generation tasks.

Universal Sentence Encoder

Developed by Google, it creates embeddings for sentences and paragraphs, useful for tasks like semantic similarity and clustering.
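
A minimal sketch of using the Universal Sentence Encoder from TensorFlow Hub is shown below; the module URL refers to version 4 and may change.

```python
import tensorflow_hub as hub

# Loads the pre-trained model from TensorFlow Hub (downloads on first use)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the weather today?",
]
embeddings = embed(sentences)   # one 512-dimensional vector per sentence
print(embeddings.shape)         # (3, 512)
```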

APIs

Many companies offer APIs to generate word embeddings. Google's Cloud Natural Language API, OpenAI, and Microsoft's Azure Cognitive Services all offer text analysis services that can generate embeddings. If you are looking for more open-source options, model repositories like Hugging Face host a wide range of models.
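
As one example, the sketch below requests embeddings through OpenAI's Python SDK; it assumes an API key is set in the OPENAI_API_KEY environment variable, and the model name is just one of the available embedding models.

```python
from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set;
# the model name below is one example of an embedding model.
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The quick brown fox jumps over the lazy dog."],
)

vector = response.data[0].embedding
print(len(vector))   # embedding dimensionality, e.g. 1536
```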

Common Embedding Uses

Regression and Classification

Word embeddings provide an easy way to translate text into machine learning model inputs (a sketch follows the list below). Examples include:

  • Sentiment analysis of customer reviews
  • Topic classification of news articles
  • Stock volatility prediction from SEC filings
  • Email spam detection
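
The sketch below illustrates the general pattern for classification: average word vectors into a document vector, then fit a standard classifier. The word vectors and labelled examples here are random, made-up stand-ins so the code runs on its own; in practice you would use pre-trained vectors (such as the GloVe loader above) and a real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# In practice word_vectors would come from a pre-trained model; random vectors
# are used here only so the sketch runs standalone.
rng = np.random.default_rng(0)
vocab = "great product loved it terrible waste of money works perfectly broke after one day".split()
word_vectors = {w: rng.normal(size=50) for w in vocab}

def document_vector(text, dim=50):
    """Average the vectors of the known words; zero vector if none are known."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

texts = ["great product loved it", "terrible waste of money",
         "works perfectly", "broke after one day"]
labels = [1, 0, 1, 0]   # 1 = positive sentiment, 0 = negative

X = np.vstack([document_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([document_vector("loved it works perfectly")]))
```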

Document Clustering

Word embeddings can be aggregated to the document level (a sketch follows the list below) for:

  • Finding similar research papers
  • Grouping customer support tickets
  • Detecting duplicate content
  • Organizing large document collections
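
A minimal clustering sketch: embed each document as a single vector, then run k-means over the document vectors. Random data stands in for the document embeddings so the example is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans

# doc_embeddings would normally be averaged word vectors or sentence embeddings;
# random data stands in here so the sketch runs standalone.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 50))   # 100 documents, 50-dimensional

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(doc_embeddings)

print(cluster_ids[:10])   # cluster assignment for the first 10 documents
```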

Semantic Search

Unlike keyword matching, embeddings enable searching by meaning (a sketch follows the list below):

  • Finding relevant documents even with different terminology
  • Question-answering systems
  • Semantic code search
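
A minimal semantic-search sketch: embed the query and every document, then rank documents by cosine similarity. Random vectors stand in for real embeddings so the example runs on its own.

```python
import numpy as np

# Stand-in embeddings: in practice these would come from a sentence-embedding
# model; random vectors are used only so the sketch runs standalone.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 128))   # 1000 indexed documents
query_embedding = rng.normal(size=128)          # embedded search query

# Normalise so that a dot product equals cosine similarity
doc_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)

scores = doc_norm @ query_norm           # cosine similarity to every document
top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 best matches
print(top_k, scores[top_k])
```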

Information Retrieval

Embeddings improve information retrieval through:

  • Semantic similarity scoring
  • Query expansion
  • Cross-lingual information retrieval

Transfer Learning

Pre-trained embeddings enable:

  • Knowledge transfer between related tasks
  • Few-shot learning
  • Domain adaptation

Recommendation Systems

Embeddings power modern recommenders through:

  • Content-based filtering
  • Item similarity calculation
  • User preference modeling

Implementation Tips

  • Consider using pre-trained embedding models like Word2Vec, GloVe, or BERT.
  • Evaluate the appropriate embedding dimensionality based on your vocabulary size and the task at hand.
  • Consider fine-tuning pre-trained models to better suit your specific task.