Implicit Matrix Factorization in NLP - Coding

Implicit matrix factorization is a technique in natural language processing (NLP) used to identify latent structures in word co-occurrence data.

In this article, we will then delve into Pointwise Mutual Information (PMI), Positive Pointwise Mutual Information (PPMI), and Shifted PMI, and implement these techniques in Python for hands-on experience.

What is Implicit Matrix Factorization in NLP?

Implicit matrix factorization is a technique used in various fields, including natural language processing (NLP), collaborative filtering, and recommendation systems, to uncover latent structures or factors in data that are not directly observed but inferred from indirect signals. The main idea is to decompose a large, sparse matrix of interactions or co-occurrences into lower-dimensional matrices that capture the underlying patterns or relationships between entities, such as words in a text or users and items in a recommendation system.

In NLP, implicit matrix factorization can be used to identify latent semantic relationships between words. For example, by factorizing a word co-occurrence matrix, we can discover underlying topics in a collection of documents. This helps in generating word embeddings that capture semantic similarities, which can be used in various downstream tasks like sentiment analysis and machine translation.

Key Concepts of Implicit Matrix Factorization

Latent Factors: These are hidden variables inferred from the observed data. In the context of NLP, latent factors can represent abstract concepts or topics underlying the word co-occurrences. In recommendation systems, they can represent user preferences and item attributes.
Co-occurrence Matrix: A matrix that records the frequency with which items (words, users, items) co-occur. For example, in NLP, a co-occurrence matrix might capture how often pairs of words appear together in a given context.
Matrix Factorization: The process of decomposing a large matrix into the product of two smaller matrices. This factorization reduces the dimensionality of the data while preserving its essential structure.

PMI, PPMI, and Shifted PMI Techniques in Implicit Matrix Factorization

1. Pointwise Mutual Information (PMI)

PMI is a measure used to calculate the association between two events (such as the occurrence of two words together in a text) based on their joint probability compared to their individual probabilities.

Formula:

[Tex]\text{PMI}(x, y) = \log \left( \frac{P(x, y)}{P(x) \cdot P(y)} \right)[/Tex]

Where:

P(x, y) is the joint probability of words x and y occurring together.
P(x) and P(y) are the individual probabilities of x and y occurring independently.

Interpretation:

A high PMI value indicates a strong association between the words, meaning they appear together more often than expected by chance.
A PMI value of zero means the words occur together as often as expected by chance.
Negative PMI values indicate that the words co-occur less frequently than expected by chance.

Python

import numpy as np
from collections import Counter
from sklearn.preprocessing import normalize

def calculate_pmi(co_occurrence_matrix, word_counts, total_count):
    rows, cols = co_occurrence_matrix.shape
    pmi_matrix = np.zeros((rows, cols))

    for i in range(rows):
        for j in range(cols):
            p_ij = co_occurrence_matrix[i, j] / total_count
            p_i = word_counts[i] / total_count
            p_j = word_counts[j] / total_count
            if p_ij > 0:
                pmi_matrix[i, j] = np.log2(p_ij / (p_i * p_j))

    return pmi_matrix

co_occurrence_matrix = np.array([[10, 2, 0], [2, 5, 3], [0, 3, 8]])
word_counts = np.array([12, 10, 11])
total_count = np.sum(word_counts)

pmi_matrix = calculate_pmi(co_occurrence_matrix, word_counts, total_count)
print("PMI Matrix:\n", pmi_matrix)

Output:

PMI Matrix: [[ 1.19639721 -0.86249648 0. ] [-0.86249648 0.72246602 -0.15200309] [ 0. -0.15200309 1.12553088]]

2. Positive Pointwise Mutual Information (PPMI)

PPMI is a variant of PMI that addresses the issue of negative values in PMI by setting all negative PMI values to zero. This ensures that only positive associations between words are considered.

Formula:

[Tex]\text{PPMI}(x, y) = \max(\text{PMI}(x, y), 0)[/Tex]

Interpretation:

PPMI retains only the positive PMI values, which helps in focusing on meaningful and strong word associations.
By eliminating negative values, PPMI is more suitable for tasks where positive relationships are more informative, such as word embeddings and topic modeling.

Python

def calculate_ppmi(co_occurrence_matrix, word_counts, total_count):
    pmi_matrix = calculate_pmi(co_occurrence_matrix, word_counts, total_count)
    ppmi_matrix = np.maximum(pmi_matrix, 0)
    return ppmi_matrix

ppmi_matrix = calculate_ppmi(co_occurrence_matrix, word_counts, total_count)
print("PPMI Matrix:\n", ppmi_matrix)

Output:

PPMI Matrix: [[1.19639721 0. 0. ] [0. 0.72246602 0. ] [0. 0. 1.12553088]]

3. Shifted PMI

Shifted PMI introduces a shift parameter to normalize PMI values, addressing the issue of rare word pairs by reducing the impact of extremely high PMI values for infrequent co-occurrences.

Formula:

[Tex]\text{Shifted PMI}(x, y) = \text{PMI}(x, y) – \log(k)[/Tex]

Where k is a shift value.

Interpretation:

The shift value k is typically a constant that is subtracted from the PMI values to balance the representation of word associations.
This technique helps in mitigating the effect of rare word pairs, ensuring that the PMI values are more balanced and meaningful.

Python

def calculate_shifted_pmi(co_occurrence_matrix, word_counts, total_count, shift=1):
    pmi_matrix = calculate_pmi(co_occurrence_matrix, word_counts, total_count)
    shifted_pmi_matrix = pmi_matrix - np.log2(shift)
    return shifted_pmi_matrix

shifted_pmi_matrix = calculate_shifted_pmi(co_occurrence_matrix, word_counts, total_count, shift=5)
print("Shifted PMI Matrix:\n", shifted_pmi_matrix)

Output:

Shifted PMI Matrix: [[-1.12553088 -3.18442457 -2.32192809] [-3.18442457 -1.59946207 -2.47393119] [-2.32192809 -2.47393119 -1.19639721]]

Conclusion

In conclusion, implicit matrix factorization techniques like PMI, PPMI, and Shifted PMI are useful for uncovering latent semantic structures within large texts. These techniques help us understand the relationships between words by creating dense word representations. Each of these techniques has its own benefits. First, we learned how to create a PMI matrix. Then, we understood what positive PMI is and used NumPy to convert negative values to zero, creating a positive PMI matrix. Next, we learned how Shifted PMI can normalize our matrix values, reducing the impact of rare word pairs. So, the next time you are performing NLP tasks like document clustering, information retrieval, or measuring document similarity, consider using these techniques.

Reffered: https://www.geeksforgeeks.org

AI ML DS

Related
NumPy's polyfit Function : A Comprehensive Guide
Why Pandas is Used in Python
How to Change Label Font Sizes in Seaborn
Modifying Colors in a Seaborn Lineplot
Large scale Machine Learning

Type:	Geek
Category:	Coding
Sub Category:	Tutorial
Uploaded by:	Admin
Views:	25