Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization: the process of splitting text into smaller meaningful units, known as tokens, usually words or sentences, separated according to spaces, punctuation, or other specific rules. Dictionary-based tokenization segments text into tokens based on a pre-defined dictionary; rule-based tokenization instead applies a defined set of rules to determine how text is split.

Rule-Based Tokenization

Rule-based tokenization is a technique where a set of rules is applied to the input text to split it into tokens. These rules can be based on different criteria, such as whitespace, punctuation, regular expressions, or language-specific conventions. Here are some common approaches to rule-based tokenization:

Whitespace tokenization

This approach splits the input text on whitespace characters such as space, tab, or newline. For example, the sentence "This is a sample text." would be split into the tokens "This", "is", "a", "sample", and "text." The following Python code demonstrates whitespace rule-based tokenization:
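A minimal sketch of whitespace tokenization using Python's built-in str.split(), which splits on any run of spaces, tabs, or newlines; the input sentence is taken from the output shown below:

```python
# Whitespace tokenization: str.split() with no arguments splits
# on any run of whitespace (spaces, tabs, newlines).
text = "The quick brown fox jumps over the lazy dog."
tokens = text.split()
print(tokens)
```

Note that the trailing period stays attached to "dog.", since whitespace tokenization does not treat punctuation as a delimiter.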
Output:

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

Regular expression tokenization

This approach uses regular expressions to split the input text based on a pattern. It is mainly used to find specific kinds of patterns in text, such as email addresses, phone numbers, order IDs, or currency amounts. For example, in the sentence "Hello, I am working at Geeks-for-Geeks and my email is [email protected].", the regular expression [\w]+-[\w]+-[\w]+ will match "Geeks-for-Geeks" and [\w\.-]+@[\w]+\.[\w]+ will match the email address. The following Python code demonstrates regular expression tokenization:
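A minimal sketch of the extraction described above, using Python's re module. The email address in the sample sentence is a placeholder, since the address in the original sentence is redacted, and the dot before the domain suffix is escaped in the email pattern:

```python
import re

# The email address below is a placeholder; the address in the
# original article is redacted.
text = "Hello, I am working at Geeks-for-Geeks and my email is user@example.com."

# Hyphen-joined word pattern, e.g. "Geeks-for-Geeks"
company = re.search(r"[\w]+-[\w]+-[\w]+", text)

# Simple email pattern; \. matches the literal dot before the suffix
email = re.search(r"[\w\.-]+@[\w]+\.[\w]+", text)

print("Company Name:", company.group())
print("Email address:", email.group())
```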
Output:

Company Name: Geeks-for-Geeks
Email address: [email protected]

Punctuation tokenization

This approach splits the input text on punctuation characters such as the period, comma, or semicolon, discarding the punctuation itself. For example, the sentence "Hello Geeks! How can I help you?" would be split into the tokens 'Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you'. The following Python code demonstrates punctuation rule-based tokenization:
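One minimal way to reproduce this behaviour with the re module is to extract runs of word characters, so that punctuation and whitespace both act as delimiters:

```python
import re

text = "Hello Geeks! How can I help you?"

# Keep runs of word characters; punctuation marks and spaces
# between them act as token boundaries and are dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```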
Output:

['Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you']

Language-specific tokenization

This approach uses language-specific rules to split the input text into tokens. For example, in some languages words can be written together without spaces, as with German compounds, so language-specific rules are needed to split the input text into meaningful tokens. The example below applies a multilingual subword tokenizer to Sanskrit text.
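A sketch using a SentencePiece-style subword tokenizer from the Hugging Face transformers library, where the ▁ character marks the start of a word. The xlm-roberta-base checkpoint is an assumption, since the article does not name the model used, so the exact subword splits may differ from the output shown below:

```python
# Assumed checkpoint: xlm-roberta-base, a multilingual model whose
# SentencePiece tokenizer handles Devanagari script. The original
# article does not show which model was used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Gayatri Mantra in Sanskrit, reconstructed from the printed tokens.
text = "'ॐ भूर्भव: स्व: तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो न: प्रचोदयात्।'"
print(tokenizer.tokenize(text))
```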
Output: ["▁'", 'ॐ', '▁भू', 'र्', 'भव', ':', '▁स्व', ':', '▁तत्', 'स', 'वि', 'तु', 'र्', 'वरेण्य', 'ं', '▁भ', 'र्ग', 'ो', '▁देवस्य', '▁धीम', 'हि', '▁', 'धि', 'यो', '▁यो', '▁न', ':', '▁प्र', 'च', 'ोदय', 'ात्', "।'"] |
Reference: https://www.geeksforgeeks.org