Working with Unicode in Python

Unicode serves as the global standard for character encoding, ensuring uniform text representation across diverse computing environments. Python, a widely used programming language, adopts the Unicode Standard for its strings, facilitating internationalization in software development.

This tutorial aims to provide a foundational understanding of working with Unicode in Python, covering key aspects such as encoding, normalization, and handling Unicode errors.

How To Work With Unicode In Python?

Below are some of the ways by which we can work with Unicode in Python:

  • Converting Unicode Code Points
  • Normalize Unicode
  • Unicode with NFD & NFC
  • Regular Expressions
  • Solving Unicode Errors

Converting Unicode Code Points in Python

Encoding is the process of representing text as bytes, and it is essential for handling internationalized data. In Python 3, a str is a sequence of Unicode code points, and UTF-8 is the default encoding for source files and for str.encode(). Let’s create the copyright symbol (©) from its Unicode code point. The code below creates a string s from the escape sequence \u00A9, and printing s displays the corresponding character ‘©’.

Python3

s = '\u00A9'
print(s)

Output

©
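
As a small sketch of how characters, code points, and bytes relate, the built-in functions ord() and chr() convert between a character and its integer code point, and str.encode() turns the string into bytes. The variable names below are illustrative and are not part of the original example.

```python
# Illustrative sketch: variable names are assumptions, not from the article.
s = '\u00A9'

code_point = ord(s)              # character -> integer code point
same_char = chr(0xA9)            # integer code point -> character
by_name = '\N{COPYRIGHT SIGN}'   # character looked up by its Unicode name

utf8_bytes = s.encode('utf-8')   # code point -> UTF-8 bytes

print(hex(code_point))               # 0xa9
print(same_char == s, by_name == s)  # True True
print(utf8_bytes)                    # b'\xc2\xa9'
print(len(s), len(utf8_bytes))       # 1 2
```

The last line shows that one code point can occupy more than one byte once it is encoded.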



Normalizing Unicode in Python

Normalization is crucial for deciding whether two strings that represent the same text, but are built from different code points, should be treated as equal. For instance, ‘R’ (U+0052) and ‘ℜ’ (U+211C, BLACK-LETTER CAPITAL R) may be read as the same letter by a human, but they are distinct code points, so Python treats them as different strings. Normalization helps address such issues.

The code below prints False because Python compares strings code point by code point, and the two characters have different code points. Normalization matters even more for combining characters, where the same visible text can be stored as different code point sequences, as the strings s1 and s2 in the next section show.

Python3

styled_R = 'ℜ'
normal_R = 'R'
print(styled_R == normal_R)

Output

False
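
To see why the comparison fails, it can help to inspect each character with unicodedata.name(). The sketch below also applies the NFKC form, which maps the styled letter to the plain one through its compatibility mapping. Whether that folding is appropriate depends on the application, since it discards the stylistic distinction; the name folded is an assumption made for this example.

```python
import unicodedata

styled_R = '\u211c'   # 'ℜ'
normal_R = 'R'

# Show the code point and official name of each character.
for ch in (styled_R, normal_R):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# NFKC applies compatibility mappings, so the styled letter becomes plain 'R'.
folded = unicodedata.normalize('NFKC', styled_R)
print(folded == normal_R)   # True
```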



Normalize Unicode with NFD & NFC

Python’s unicodedata module provides the normalize() function for normalizing Unicode strings. The normalization forms include NFD, NFC, NFKD, and NFKC.

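The code below demonstrates the effect of normalization on string length. NFD decomposes characters into base characters and combining marks, while NFC composes them into precomposed characters where possible. NFKD and NFKC do the same but additionally apply compatibility mappings, which replace characters such as ligatures or styled letters with their plain equivalents.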

Python3

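from unicodedata import normalize

s1 = 'h\u00f4tel'    # 'hôtel' with the precomposed character ô (U+00F4)
s2 = 'ho\u0302tel'   # 'o' followed by COMBINING CIRCUMFLEX ACCENT (U+0302)

# NFD splits ô into 'o' + combining accent, so the length grows by one
s1_nfd = normalize('NFD', s1)
print(len(s1), len(s1_nfd))

# NFC merges 'o' + combining accent into ô, so the length shrinks by one
s2_nfc = normalize('NFC', s2)
print(len(s2), len(s2_nfc))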

Output

5 6
6 5
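
In practice, normalization is mostly used to compare strings reliably. The following sketch normalizes both sides to NFC before comparing; nfc_equal is a hypothetical helper introduced only for this illustration. It also contrasts NFC with NFKC on the ligature ‘ﬆ’ (U+FB06).

```python
from unicodedata import normalize

s1 = 'h\u00f4tel'    # precomposed ô
s2 = 'ho\u0302tel'   # 'o' + combining circumflex


def nfc_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return normalize('NFC', a) == normalize('NFC', b)


print(s1 == s2)           # False: different code point sequences
print(nfc_equal(s1, s2))  # True: identical once both are in NFC

ligature = '\ufb06'       # LATIN SMALL LIGATURE ST
print(normalize('NFC', ligature) == ligature)   # True: NFC leaves it alone
print(normalize('NFKC', ligature))              # st
```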



Solving Unicode Errors in Python

Two common Unicode errors are UnicodeEncodeError and UnicodeDecodeError. Handling these errors is crucial for robust Unicode support.

Solving a UnicodeEncodeError: The code snippet below shows different error-handling approaches when encoding a string that contains a character outside the ASCII character set.

Python3

ascii_unsupported = '\ufb06'
 
# Using 'ignore', 'replace', and 'xmlcharrefreplace' to handle errors
print(ascii_unsupported.encode('ascii', errors='ignore'))
print(ascii_unsupported.encode('ascii', errors='replace'))
print(ascii_unsupported.encode('ascii', errors='xmlcharrefreplace'))

Output

b''
b'?'
b'&#64262;'
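
Without an errors argument, encode() uses the strict handler and raises UnicodeEncodeError. The sketch below catches that exception, shows the backslashreplace handler as one more option, and encodes to UTF-8, which can represent every code point. The variable names caught, escaped, and utf8 are assumptions made for this example.

```python
ascii_unsupported = '\ufb06'

caught = False
try:
    # The default handler is 'strict', which raises on unencodable characters.
    ascii_unsupported.encode('ascii')
except UnicodeEncodeError as exc:
    caught = True
    print(exc.encoding, exc.start, exc.end)   # ascii 0 1

# 'backslashreplace' keeps the information as a Python escape sequence.
escaped = ascii_unsupported.encode('ascii', errors='backslashreplace')
print(escaped)   # b'\\ufb06'

# Choosing an encoding that covers the character avoids the error entirely.
utf8 = ascii_unsupported.encode('utf-8')
print(utf8)      # b'\xef\xac\x86'
```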



Solving a UnicodeDecodeError: The code snippet below demonstrates error handling when decoding a byte string with an encoding other than the one used to produce it.

Python3

iso_supported = '§A'
b = iso_supported.encode('iso8859_1')
 
# Using 'replace' and 'ignore' to handle errors
print(b.decode('utf-8', errors='replace'))
print(b.decode('utf-8', errors='ignore'))

Output

�A
A
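
The replace and ignore handlers both lose information. When the original encoding is known, decoding with that encoding recovers the text exactly. The sketch below first shows the exception raised by the default strict handler and then decodes with ISO 8859-1; the names reason and text are assumptions made for this example.

```python
b = '\u00a7A'.encode('iso8859_1')   # '§A' -> b'\xa7A'

reason = None
try:
    # Byte 0xA7 cannot start a UTF-8 sequence, so strict decoding fails.
    b.decode('utf-8')
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(exc.start, reason)   # 0 invalid start byte

# Decoding with the encoding that produced the bytes restores the text.
text = b.decode('iso8859_1')
print(text)   # §A
```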






Referred: https://www.geeksforgeeks.org

