Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, the process of splitting text into smaller meaningful units known as tokens. A token is a group of characters that is treated as a single unit of meaning; words, phrases, numbers, and punctuation marks can all be tokens. Many NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition, depend on tokenization.

Dictionary-based tokenization segments text into tokens using a predefined dictionary of multi-word expressions. It is useful when standard word tokenization is not sufficient, for example in sentiment analysis or named entity recognition, where a multi-word expression needs to be treated as a single token.

In general, a dictionary is a list of words, phrases, and other linguistic constructions, possibly along with their definitions, usage patterns, and other pertinent data. For tokenization, what matters is the list of expressions: the words in the text are compared against the dictionary entries, and each run of words that matches an entry is emitted as a single token. By creating a custom dictionary, we can tokenize names and phrases as single units.

Tokenization more generally can be carried out with several methods, including rule-based tokenization, machine-learning-based tokenization, and hybrid tokenization.
Rule-based tokenization divides the text into tokens according to surface features of the text, such as punctuation, capitalization, and spacing. Machine-learning-based tokenization trains a model on a set of training data to separate text into tokens. Hybrid tokenization blends rule-based and machine-learning-based methods to increase accuracy and efficiency.

Steps needed for implementing dictionary-based tokenization:
For example, consider the following text: Jammu Kashmir is an integral part of India. My name is Pawan Kumar Gunjan. He is from Himachal Pradesh.

The steps involved in the dictionary-based tokenization of this text are as follows:

Step 1: Import the necessary libraries
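A minimal sketch of the imports, assuming NLTK is installed (the later steps name its MWETokenizer and word_tokenize). Note that word_tokenize also needs NLTK's Punkt sentence models to be downloaded before it is called; that setup is an assumption about the local environment and is not shown here.

```python
# MWETokenizer merges multi-word expressions; word_tokenize is the
# general-purpose word tokenizer used for comparison in the later steps.
from nltk.tokenize import MWETokenizer, word_tokenize

print(MWETokenizer.__name__, word_tokenize.__name__)
```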
Step 2: Create a custom dictionary using the names or phrases

Collect a dictionary of multi-word expressions such as phrases or names. Let the dictionary contain the following names and phrases: Jammu Kashmir, Pawan Kumar Gunjan, and Himachal Pradesh.
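A sketch of such a dictionary, assuming each entry is stored as a tuple of its individual words, which is the form NLTK's MWETokenizer accepts. The variable name `dictionary` is illustrative; the three entries are the names and phrases from the example text.

```python
# Each multi-word expression is written as a tuple of its individual words.
# The name `dictionary` is illustrative, not prescribed by any library.
dictionary = [
    ("Jammu", "Kashmir"),
    ("Pawan", "Kumar", "Gunjan"),
    ("Himachal", "Pradesh"),
]

print(dictionary)
```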
Step 3: Create an instance of MWETokenizer with the dictionary
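A minimal sketch of this step, assuming NLTK is installed. The dictionary is repeated so that the block runs on its own. MWETokenizer joins matched words with an underscore by default, so `separator=" "` is passed here to keep names readable, as in 'Jammu Kashmir'. `Dictionary_tokenizer` is the name used in Step 5; `dictionary` is illustrative.

```python
from nltk.tokenize import MWETokenizer

# Repeated from Step 2 so that this block is self-contained.
dictionary = [
    ("Jammu", "Kashmir"),
    ("Pawan", "Kumar", "Gunjan"),
    ("Himachal", "Pradesh"),
]

# separator=" " joins the words of a matched expression with a space
# (the default separator is "_").
Dictionary_tokenizer = MWETokenizer(dictionary, separator=" ")

# Quick check on a pre-split token list.
print(Dictionary_tokenizer.tokenize(["Himachal", "Pradesh"]))
# ['Himachal Pradesh']
```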
Step 4: Create a text dataset and tokenize it with word_tokenize
Output:

['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']

Step 5: Apply dictionary-based tokenization with Dictionary_tokenizer
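A sketch of this step, assuming NLTK is installed. To keep the block runnable on its own, the tokenizer from Step 3 is rebuilt and the token list from Step 4 is written out literally; the names `dictionary`, `tokens`, and `dictionary_tokens` are illustrative.

```python
from nltk.tokenize import MWETokenizer

# Rebuilt from Steps 2 and 3 so that this block is self-contained.
dictionary = [
    ("Jammu", "Kashmir"),
    ("Pawan", "Kumar", "Gunjan"),
    ("Himachal", "Pradesh"),
]
Dictionary_tokenizer = MWETokenizer(dictionary, separator=" ")

# The word-level tokens produced in Step 4.
tokens = [
    "Jammu", "Kashmir", "is", "an", "integral", "part", "of", "India", ".",
    "My", "name", "is", "Pawan", "Kumar", "Gunjan", ".",
    "He", "is", "from", "Himachal", "Pradesh", ".",
]

# tokenize() takes an already-tokenized list and merges every run of
# tokens that matches a dictionary entry into a single token.
dictionary_tokens = Dictionary_tokenizer.tokenize(tokens)
print(dictionary_tokens)
```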
Output:

['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']

We can easily observe the difference between general word tokenization and dictionary-based tokenization: 'Jammu Kashmir', 'Pawan Kumar Gunjan', and 'Himachal Pradesh' are now single tokens instead of being split into separate words. This is useful when we know the phrases or joint words present in the text document and want to treat them as single tokens.

Full code implementation
Output:

General Word Tokenization
['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']
Dictionary based tokenization
['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']
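To make the matching idea concrete, here is a small pure-Python sketch of how dictionary entries can be merged in a token list by trying the longest phrase first at each position. It is an illustration only, not NLTK's actual implementation, and the function name `merge_phrases` and its parameters are hypothetical.

```python
def merge_phrases(tokens, phrases, separator=" "):
    """Merge the longest dictionary phrase that starts at each position."""
    phrase_set = {tuple(phrase) for phrase in phrases}
    max_len = max((len(phrase) for phrase in phrase_set), default=1)

    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(tokens[i:i + size])
            if window in phrase_set:
                merged.append(separator.join(window))
                i += size
                break
        else:
            # No phrase starts here, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["My", "name", "is", "Pawan", "Kumar", "Gunjan", "."]
phrases = [("Pawan", "Kumar"), ("Pawan", "Kumar", "Gunjan")]
print(merge_phrases(tokens, phrases))
# ['My', 'name', 'is', 'Pawan Kumar Gunjan', '.']
```

Trying the longest window first means that when one entry is a prefix of another, as with the two entries above, the longer expression wins.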
Referred: https://www.geeksforgeeks.org