Match Chinese Characters
When searching for "JavaScript regex for matching Chinese characters" on Google, most results suggest using /[\u4e00-\u9fa5]/. But is this regular expression really correct? Let's dive in and find out.
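For reference, here is a minimal sketch of how that pattern is typically used. The helper name `hasCharInCommonRange` is hypothetical and exists only for illustration. The example shows what the range accepts; it says nothing yet about whether the range is complete.

```javascript
// The pattern commonly suggested in search results.
const commonPattern = /[\u4e00-\u9fa5]/;

// Hypothetical helper: reports whether the text contains at least one
// character in the range U+4E00..U+9FA5.
function hasCharInCommonRange(text) {
  return commonPattern.test(text);
}

console.log(hasCharInCommonRange("中文")); // true
console.log(hasCharInCommonRange("regex")); // false
```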
Han Script and Han Characters
Let's start by understanding two fundamental concepts:
- Han Script is a writing system that originated in China and was later adopted for Japanese, Korean, and other languages.
- Han Characters (CJK Ideographs) are the basic units of Han Script.
Many countries and regions in the Han cultural sphere have developed their own character encoding standards. Unicode unifies these standards and aims for lossless conversion between each original standard and Unicode.
Character Sets and Character Encodings
What's the difference between Unicode, GBK, and UTF-8? They are not all the same kind of thing: some are character sets, and others are character encodings.
Character Sets
Character sets such as Unicode and ASCII assign a number to each character. That number is known as the character's code point.
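The mapping between characters and code points can be inspected directly in JavaScript. The sketch below uses the built-in `codePointAt` and `String.fromCodePoint`; the helper name `codePointHex` is an assumption made for this example.

```javascript
// Hypothetical helper: formats the code point of the first character
// in the conventional "U+XXXX" notation.
function codePointHex(char) {
  const codePoint = char.codePointAt(0);
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(codePointHex("A")); // U+0041
console.log(codePointHex("中")); // U+4E2D

// The reverse direction: from a code point back to the character.
console.log(String.fromCodePoint(0x4e2d)); // 中
```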
ASCII defines 128 characters, covering English letters, digits, punctuation, and control codes. Each code fits in 7 bits and is stored in a single byte.
Languages such as Chinese have far more characters than a single byte can distinguish, so more bytes are needed per character. For example, GB2312 (for Simplified Chinese) uses two bytes for each Chinese character. Two bytes can in theory distinguish up to 65,536 values (256 x 256), although GB2312 itself defines only several thousand characters.
The existence of multiple encoding systems meant that the same byte sequence could be interpreted as different symbols. Reading text with the wrong encoding results in garbled characters. This is the problem Unicode was created to solve.
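The garbling effect is easy to reproduce. The sketch below encodes one character as UTF-8 and then decodes the same bytes twice, once correctly and once with the wrong encoding. Using `latin1` as the "wrong" encoding is an assumption made for this example because it is widely supported by `TextDecoder`; any mismatched encoding would illustrate the same point.

```javascript
// Encode "中" as UTF-8. The result is three bytes: e4 b8 ad.
const bytes = new TextEncoder().encode("中");

// Decode the same bytes with the right and the wrong encoding.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);
const asLatin1 = new TextDecoder("latin1").decode(bytes);

console.log(asUtf8); // 中
console.log(asLatin1 === asUtf8); // false
console.log(asLatin1.length); // 3, because each byte was read as its own character
```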
Unicode is a unified character set that aims to assign a unique code point to every character in the world's writing systems. Because each code point identifies exactly one character, the same number can no longer stand for different symbols.
Character Encodings
Unicode is just a character set: it assigns a code point to each character but doesn't specify how those code points are stored as bytes. To save characters on a computer, their code points must first be encoded.
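To make the distinction concrete, the sketch below compares a character's code point with the bytes that one encoding, UTF-8, produces for it. The helper name `utf8Hex` is hypothetical; `TextEncoder` is a standard API that always encodes to UTF-8.

```javascript
// Hypothetical helper: returns the UTF-8 bytes of a string as hex pairs.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// The code point is an abstract number assigned by the character set.
console.log("中".codePointAt(0).toString(16)); // 4e2d

// The encoding decides which bytes actually get stored.
console.log(utf8Hex("中")); // e4 b8 ad
console.log(utf8Hex("A")); // 41
```

Note that the stored bytes for "中" differ from its code point, while for "A" they happen to coincide.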