Implicit matrix factorization is a technique in natural language processing (NLP) used to identify latent structures in word co-occurrence data. In this article, we will delve into Pointwise Mutual Information (PMI), Positive Pointwise Mutual Information (PPMI), and Shifted PMI, and implement these techniques in Python for hands-on experience.

What is Implicit Matrix Factorization in NLP?

Implicit matrix factorization is a technique used in various fields, including natural language processing (NLP), collaborative filtering, and recommendation systems, to uncover latent structures or factors in data that are not directly observed but are inferred from indirect signals. The main idea is to decompose a large, sparse matrix of interactions or co-occurrences into lower-dimensional matrices that capture the underlying patterns or relationships between entities, such as words in a text or users and items in a recommendation system.

In NLP, implicit matrix factorization can be used to identify latent semantic relationships between words. For example, by factorizing a word co-occurrence matrix, we can discover underlying topics in a collection of documents. This helps in generating word embeddings that capture semantic similarities, which can be used in downstream tasks like sentiment analysis and machine translation.

Key Concepts of Implicit Matrix Factorization

- Co-occurrence matrix: a large, sparse matrix recording how often pairs of entities (for example, words) appear together.
- Latent factors: hidden dimensions, inferred from the data, that explain the observed co-occurrence patterns.
- Low-rank decomposition: approximating the co-occurrence matrix as a product of smaller matrices whose rows act as dense embeddings.
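To ground the techniques below, here is a minimal sketch of building a word co-occurrence matrix with NumPy. The toy corpus and the window size of 2 are illustrative assumptions, not data from the article.

```python
import numpy as np

# Toy corpus (an illustrative assumption; any tokenized text works).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Build the vocabulary and a word-to-index mapping.
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric context window of size 2.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                C[index[w], index[sent[j]]] += 1

print(vocab)
print(C)
```

Each cell C[i, j] now holds how often word i appeared near word j; this count matrix is the input that the PMI variants below transform.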
PMI, PPMI, and Shifted PMI Techniques in Implicit Matrix Factorization

1. Pointwise Mutual Information (PMI)

PMI is a measure of the association between two events (such as the occurrence of two words together in a text), comparing their joint probability to the product of their individual probabilities.

Formula:

[Tex]\text{PMI}(x, y) = \log \left( \frac{P(x, y)}{P(x) \cdot P(y)} \right)[/Tex]

Where:

- P(x, y) is the joint probability of x and y occurring together.
- P(x) and P(y) are the marginal probabilities of x and y occurring individually.
Interpretation:

- PMI > 0: the words co-occur more often than expected by chance, indicating a positive association.
- PMI = 0: the words are statistically independent.
- PMI < 0: the words co-occur less often than expected by chance.
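Here is a minimal sketch of computing a PMI matrix with NumPy, using a base-2 logarithm (consistent with the outputs shown in this article). The 3×3 count matrix is an illustrative assumption; the exact values you get depend on the counts you start from.

```python
import numpy as np

def pmi_matrix(counts):
    """Compute the PMI matrix (base-2 log) from a co-occurrence count matrix."""
    total = counts.sum()
    joint = counts / total                            # P(x, y)
    px = counts.sum(axis=1, keepdims=True) / total    # P(x): row marginals
    py = counts.sum(axis=0, keepdims=True) / total    # P(y): column marginals
    with np.errstate(divide="ignore"):                # zero counts give -inf
        return np.log2(joint / (px * py))

# Illustrative co-occurrence counts (an assumption, not the article's data).
counts = np.array([[4.0, 1.0, 2.0],
                   [1.0, 3.0, 2.0],
                   [2.0, 2.0, 5.0]])

print("PMI Matrix:\n", pmi_matrix(counts))
```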
Output:

PMI Matrix:
[[ 1.19639721 -0.86249648  0.        ]
 [-0.86249648  0.72246602 -0.15200309]
 [ 0.         -0.15200309  1.12553088]]

2. Positive Pointwise Mutual Information (PPMI)

PPMI is a variant of PMI that addresses the issue of negative values in PMI by setting all negative PMI values to zero. This ensures that only positive associations between words are retained.

Formula:

[Tex]\text{PPMI}(x, y) = \max(\text{PMI}(x, y), 0)[/Tex]

Interpretation:

- PPMI > 0: the words have a positive association, co-occurring more often than expected by chance.
- PPMI = 0: the words are independent or negatively associated; such pairs are discarded.
Output:

PPMI Matrix:
[[1.19639721 0.         0.        ]
 [0.         0.72246602 0.        ]
 [0.         0.         1.12553088]]

3. Shifted PMI

Shifted PMI introduces a shift parameter k that lowers every PMI value by a constant, addressing the issue of rare word pairs by reducing the impact of extremely high PMI values for infrequent co-occurrences.

Formula:

[Tex]\text{Shifted PMI}(x, y) = \text{PMI}(x, y) - \log(k)[/Tex]

Where k is the shift value.

Interpretation:

- Subtracting log(k) raises the bar for association: only pairs whose PMI exceeds log(k) remain positive.
- Larger values of k damp the inflated PMI scores that rare word pairs tend to receive.
- This is the quantity that word2vec's skip-gram with negative sampling (with k negative samples) implicitly factorizes.
Output:

Shifted PMI Matrix:
[[-1.12553088 -3.18442457 -2.32192809]
 [-3.18442457 -1.59946207 -2.47393119]
 [-2.32192809 -2.47393119 -1.19639721]]

Conclusion

In conclusion, implicit matrix factorization techniques like PMI, PPMI, and Shifted PMI are useful for uncovering latent semantic structures in large text collections. These techniques help us understand the relationships between words and underpin dense word representations, and each has its own benefits. First, we learned how to create a PMI matrix. Then, we saw what Positive PMI is and used NumPy to clip negative values to zero, producing a PPMI matrix. Finally, we learned how Shifted PMI rescales the matrix values, reducing the impact of rare word pairs. So, the next time you are performing NLP tasks like document clustering, information retrieval, or measuring document similarity, consider using these techniques.