This article marks the third entry in a series titled Deep Learning Research Review, which aims to summarize and clarify research papers across various deep learning subfields. The current focus is on Natural Language Processing (NLP), a domain concerned with building systems capable of understanding and processing human language to perform tasks such as Question Answering (like Siri or Alexa), Sentiment Analysis, Image to Text Mappings, Machine Translation, Speech Recognition, Part of Speech Tagging, and Named Entity Recognition. Previous installments covered Reinforcement Learning and Generative Adversarial Networks, laying a foundation for this exploration of how deep learning techniques enhance NLP.
Traditionally, NLP relied heavily on linguistic domain knowledge, encompassing concepts such as phonemes and morphemes. For example, breaking down a word like "uninterested" into prefix, root, and suffix helps decipher its sentiment and meaning by leveraging linguistic rules: "un" indicates negation, "interest" is the root, and "ed" is a participial suffix. However, manually accounting for every English prefix and suffix would require extensive linguistic expertise and would still likely miss many nuances, making traditional methods labor-intensive and hard to scale.
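To see how quickly hand-written rules of this kind break down, here is a minimal sketch of naive affix stripping. The prefix and suffix lists and the analyze helper are hypothetical, chosen only to illustrate the approach; a realistic rule-based system would need far more rules and exceptions.

```python
# A toy illustration of rule-based affix stripping. The affix lists below are
# deliberately tiny and far from exhaustive.
NEGATING_PREFIXES = ["un", "in", "dis", "non"]
SUFFIXES = ["ed", "ing", "ly", "ness"]

def analyze(word):
    """Split a word into (prefix, root, suffix) using the hand-written lists above."""
    prefix, suffix, root = None, None, word
    for p in NEGATING_PREFIXES:
        if root.startswith(p):
            prefix, root = p, root[len(p):]
            break
    for s in SUFFIXES:
        if root.endswith(s):
            suffix, root = s, root[:-len(s)]
            break
    return {"prefix": prefix, "root": root, "suffix": suffix, "negated": prefix is not None}

print(analyze("uninterested"))
# {'prefix': 'un', 'root': 'interest', 'suffix': 'ed', 'negated': True}
```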
Deep learning offers a transformative approach by focusing on representation learning. Much like how convolutional neural networks (CNNs) learn image features through filters, deep learning models in NLP aim to learn word representations from large datasets. This shift moves the emphasis from handcrafted features to data-driven embeddings, enabling models to capture word meanings and contexts more flexibly and comprehensively.
One foundational concept in deep learning-based NLP is representing words as vectors in a multi-dimensional space. For instance, a word could be represented as a six-dimensional vector, with each dimension encoding some aspect of the word's meaning or context. A basic way to initialize these vectors is to construct a co-occurrence matrix, which counts how often each word appears near every other word in the training corpus. Each row of this matrix then serves as an initial word vector, and similar words tend to have similar vector patterns. For example, words like "love" and "like" show similar co-occurrence counts with nouns such as "NLP" and "dogs," and pronouns like "I," indicating shared semantic properties.
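As a concrete sketch, the snippet below builds a small co-occurrence matrix with NumPy. The toy corpus, the one-word window on each side, and the vocabulary ordering are all choices made here for illustration rather than part of any standard recipe.

```python
# A minimal co-occurrence matrix over a toy corpus, assuming a window of one
# word on each side of every token.
import numpy as np

corpus = "I love NLP and I like dogs".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

window = 1
counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for pos, word in enumerate(corpus):
    for offset in range(-window, window + 1):
        ctx = pos + offset
        if offset != 0 and 0 <= ctx < len(corpus):
            counts[index[word], index[corpus[ctx]]] += 1

# Each row of `counts` is an initial vector for the corresponding word. The rows
# for "love" and "like" look similar: each co-occurs with the pronoun "I" and
# with a noun ("NLP" or "dogs").
print(vocab)
print(counts)
```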
However, the co-occurrence matrix approach scales poorly. With a large vocabulary, such as a million words, the matrix becomes prohibitively large and sparse, resulting in inefficiency in both storage and computation. To overcome this, more sophisticated methods like Word2Vec have been developed. Word2Vec generates compact word embeddings by training a model to predict surrounding words within a specified window size for each center word. For example, given the sentence "I love NLP and I like dogs," with a window size of three, the model tries to maximize the probability of context words surrounding the center word "love."
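The sketch below shows how the (center word, context word) training pairs can be generated for that sentence. Interpreting the window size as the number of words on each side of the center word is an assumption made here for illustration.

```python
# Generate skip-gram training pairs: every word within `window` positions of a
# center word becomes one of its context targets.
sentence = "I love NLP and I like dogs".split()
window = 3

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For the center word "love", the pairs include ("love", "I"), ("love", "NLP"),
# ("love", "and"), and ("love", "I") again: exactly the context words whose
# probability the model tries to maximize.
print(pairs[:8])
```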
Training maximizes an objective that sums, over the corpus, the log probability of each context word given its center word, using stochastic gradient descent to update the vectors. Each word actually has two distinct vector representations: one used when it is the center word and one used when it appears as a context word. Despite this added machinery, Word2Vec significantly advanced word vector representations by capturing semantic and syntactic relationships in compact, dense vectors.
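To make the objective concrete, here is a toy, from-scratch sketch of one stochastic gradient step on the negative log probability of a context word given its center word, using a plain softmax over the vocabulary. Real Word2Vec implementations rely on approximations such as negative sampling or the hierarchical softmax for efficiency; the matrices V and U below correspond to the two vector representations per word mentioned above, and the vocabulary, dimensions, and learning rate are illustrative.

```python
# A toy skip-gram update with a full softmax, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "love", "NLP", "and", "like", "dogs"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 6, 0.05

V = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors used as the center word
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors used as a context word

def sgd_step(V, U, center, context):
    """One stochastic-gradient update on -log p(context | center)."""
    c, o = idx[center], idx[context]
    scores = U @ V[c]                      # dot product of v_center with every u_word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # softmax: p(word | center)
    grad = probs.copy()
    grad[o] -= 1.0                         # gradient of the loss w.r.t. the scores
    grad_V = U.T @ grad                    # gradient w.r.t. the center vector
    grad_U = np.outer(grad, V[c])          # gradient w.r.t. every context vector
    V[c] -= lr * grad_V                    # in-place SGD updates
    U -= lr * grad_U
    return -np.log(probs[o])               # the loss being minimized

for center, context in [("love", "I"), ("love", "NLP"), ("like", "I"), ("like", "dogs")]:
    print(f"-log p({context} | {center}) = {sgd_step(V, U, center, context):.3f}")
```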
A remarkable outcome of Word2Vec training is the emergence of linear relationships between word vectors. These relationships encode grammatical and semantic analogies, such as vector arithmetic reflecting "king" – "man" + "woman" ≈ "queen." This property highlights the power of simple neural architectures combined with appropriate training objectives in capturing complex language concepts. Thus, Word2Vec not only offers efficient embeddings but also deep insights into language structure learned from data rather than linguistic rules.
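In practice, such analogies are checked by doing the vector arithmetic and then searching for the nearest word by cosine similarity, as in the sketch below. The embeddings argument is assumed to be a dictionary of already-trained Word2Vec vectors; no specific model or values are implied.

```python
# Evaluate an analogy of the form a - b + c by nearest-neighbor search over
# cosine similarity. `embeddings` maps words to trained NumPy vectors.
import numpy as np

def analogy(embeddings, a, b, c, exclude_inputs=True):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if exclude_inputs and word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With well-trained embeddings, analogy(embeddings, "king", "man", "woman")
# is expected to return "queen".
```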