Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization: the process of splitting text into smaller meaningful units, known as tokens, usually words or sentences, separated according to spaces, punctuation, or other specific rules. Dictionary-based tokenization segments text into tokens based on a pre-defined dictionary; rule-based tokenization instead applies a defined set of rules to determine how text is split.

Rule-Based Tokenization

Rule-based tokenization is a technique where a set of rules is applied to the input text to split it into tokens. These rules can be based on different criteria, such as whitespace, punctuation, regular expressions, or language-specific conventions. Here are some common approaches to rule-based tokenization:

Whitespace tokenization

This approach splits the input text on whitespace characters such as space, tab, or newline. For example, the sentence "This is a sample text." would be split into the tokens "This", "is", "a", "sample", and "text." The following Python code demonstrates whitespace rule-based tokenization:
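A minimal sketch of whitespace tokenization using Python's built-in str.split(), which splits on any run of spaces, tabs, or newlines; the input sentence is taken from the output shown below:

```python
# Whitespace tokenization: str.split() with no arguments splits
# on any run of whitespace (spaces, tabs, newlines).
text = "The quick brown fox jumps over the lazy dog."
tokens = text.split()
print(tokens)
```

Note that the trailing period stays attached to "dog.", since whitespace tokenization does not treat punctuation as a delimiter.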
Output:

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

Regular expression tokenization

This approach uses regular expressions to split the input text based on a pattern. It is mainly used to find specific kinds of patterns in text, such as email addresses, phone numbers, order IDs, or currency amounts. For example, in the sentence "Hello, I am working at Geeks-for-Geeks and my email is [email protected].", the regular expression [\w]+-[\w]+-[\w]+ will match "Geeks-for-Geeks" and [\w\.-]+@[\w]+\.[\w]+ will match the email address. The following Python code demonstrates regular expression tokenization:
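A minimal sketch of the extraction described above, using Python's re module. The email address in the sample sentence is a placeholder, since the address in the original sentence is redacted, and the dot before the domain suffix is escaped in the email pattern:

```python
import re

# The email address below is a placeholder; the address in the
# original article is redacted.
text = "Hello, I am working at Geeks-for-Geeks and my email is user@example.com."

# Hyphen-joined word pattern, e.g. "Geeks-for-Geeks"
company = re.search(r"[\w]+-[\w]+-[\w]+", text)

# Simple email pattern; \. matches the literal dot before the suffix
email = re.search(r"[\w\.-]+@[\w]+\.[\w]+", text)

print("Company Name:", company.group())
print("Email address:", email.group())
```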
Output:

Company Name: Geeks-for-Geeks
Email address: [email protected]

Punctuation tokenization

This approach splits the input text on punctuation characters such as the period, comma, or semicolon, discarding the punctuation itself. For example, the sentence "Hello Geeks! How can I help you?" would be split into the tokens 'Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you'. The following Python code demonstrates punctuation rule-based tokenization:
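One minimal way to reproduce this behaviour with the re module is to extract runs of word characters, so that punctuation and whitespace both act as delimiters:

```python
import re

text = "Hello Geeks! How can I help you?"

# Keep runs of word characters; punctuation marks and spaces
# between them act as token boundaries and are dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```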
Output:

['Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you']

Language-specific tokenization

This approach uses language-specific rules to split the input text into tokens. For example, in some languages words can be written together without spaces, as with German compounds, so language-specific rules are needed to split the input text into meaningful tokens. The example below applies a multilingual subword tokenizer to Sanskrit text.
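A sketch using a SentencePiece-style subword tokenizer from the Hugging Face transformers library, where the ▁ character marks the start of a word. The xlm-roberta-base checkpoint is an assumption, since the article does not name the model used, so the exact subword splits may differ from the output shown below:

```python
# Assumed checkpoint: xlm-roberta-base, a multilingual model whose
# SentencePiece tokenizer handles Devanagari script. The original
# article does not show which model was used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Gayatri Mantra in Sanskrit, reconstructed from the printed tokens.
text = "'ॐ भूर्भव: स्व: तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो न: प्रचोदयात्।'"
print(tokenizer.tokenize(text))
```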
Output: ["▁'", 'ॐ', '▁भू', 'र्', 'भव', ':', '▁स्व', ':', '▁तत्', 'स', 'वि', 'तु', 'र्', 'वरेण्य', 'ं', '▁भ', 'र्ग', 'ो', '▁देवस्य', '▁धीम', 'हि', '▁', 'धि', 'यो', '▁यो', '▁न', ':', '▁प्र', 'च', 'ोदय', 'ात्', "।'"] |
Reference: https://www.geeksforgeeks.org