Match Chinese Characters

When searching for "JavaScript regex for matching Chinese characters" on Google, most results suggest using /[\u4e00-\u9fa5]/. But is this regular expression really correct? Let's dive in and find out.
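As a quick check of what that pattern actually does, the sketch below runs it against a few sample strings. The sample characters and the variable names are chosen here for illustration and are assumptions, not part of the search results.

```javascript
// The commonly suggested pattern: one UTF-16 code unit in U+4E00..U+9FA5.
const commonPattern = /[\u4e00-\u9fa5]/;

// Illustrative samples (assumptions, not from the original article).
const samples = {
  basicHan: '中',                            // U+4E2D, inside the range
  latin: 'A',                                // U+0041, outside the range
  extensionB: String.fromCodePoint(0x20000)  // a CJK ideograph beyond U+FFFF
};

const results = {};
for (const [name, text] of Object.entries(samples)) {
  results[name] = commonPattern.test(text);
}

console.log(results);
// { basicHan: true, latin: false, extensionB: false }
```

The last sample is a Han character too, yet the pattern rejects it, which is a first hint that the range deserves a closer look.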

Han Script and Han Characters

Let's start by understanding two fundamental concepts:

Han Script is a writing system that originated in China and was later adopted for writing Japanese, Korean, and other languages.
Han Characters (CJK Ideographs) are the basic units of Han Script.

Many countries and regions in the Han cultural sphere have developed their own character encoding standards. Unicode unifies these standards, aiming to achieve lossless conversion between original standards and Unicode encoding.

Character Sets and Character Encodings

What's the difference between Unicode, GBK, and UTF-8? They are actually concepts from different domains.

Character Sets

A character set, such as Unicode or ASCII, assigns each character a number. That number is called the character's code point.
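To make the idea of a code point concrete, here is a minimal sketch using JavaScript's built-in `codePointAt` and `String.fromCodePoint`. The helper `formatCodePoint` is a hypothetical name introduced only for this example.

```javascript
// Look up the code point assigned to a character.
const hanCodePoint = '中'.codePointAt(0);   // 0x4E2D in Unicode
const latinCodePoint = 'A'.codePointAt(0);  // 0x41 in both ASCII and Unicode

// Hypothetical helper: format a code point in the usual U+XXXX notation.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(formatCodePoint(hanCodePoint));   // U+4E2D
console.log(formatCodePoint(latinCodePoint)); // U+0041

// Going the other way: from a code point back to the character.
console.log(String.fromCodePoint(0x4e2d));    // 中
```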

ASCII uses one byte per character and defines 128 code points, covering English letters, digits, punctuation, and control characters.

Languages with much larger character repertoires, such as Chinese, need more than one byte per character. For example, GB2312 (for Simplified Chinese) uses two bytes per Chinese character, which allows at most 65,536 (256 x 256) distinct values, although GB2312 itself defines far fewer characters than that.

The existence of multiple encoding systems meant that the same binary number could be interpreted as different symbols. Reading text with the wrong encoding results in garbled characters. This is why Unicode was created.
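The garbling effect is easy to reproduce. The sketch below encodes one Han character as UTF-8 and then reads the bytes back two ways: correctly as UTF-8, and incorrectly as if each byte were one Latin-1 character. Latin-1 is used here only as a simple stand-in for "the wrong encoding", and the variable names are illustrative.

```javascript
// Encode a Han character as UTF-8 bytes.
const originalText = '中';
const utf8Bytes = new TextEncoder().encode(originalText); // 3 bytes: e4 b8 ad

// Right encoding: decode the bytes as UTF-8.
const decodedCorrectly = new TextDecoder('utf-8').decode(utf8Bytes);

// Wrong encoding: treat every byte as one Latin-1 character.
const decodedWrongly = String.fromCharCode(...utf8Bytes);

console.log(decodedCorrectly === originalText); // true
console.log(decodedWrongly === originalText);   // false: the text is garbled
console.log(decodedWrongly.length);             // 3: one character per byte
```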

Unicode is a unified character set that aims to assign a unique code point to every character in the world's writing systems. Because each code point stands for exactly one character, the ambiguity described above disappears.

Character Encodings

Unicode by itself is just a character set: it assigns a code point to each symbol but does not say how those code points are stored as bytes. To save text on a computer, the code points must first be encoded, for example as UTF-8.
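The difference between a code point and its stored bytes can be sketched as follows. The example encodes the same character as UTF-8 (via the built-in `TextEncoder`) and as big-endian UTF-16; `toUtf16BeBytes` and `toHex` are hypothetical helpers written for this illustration.

```javascript
const character = '中'; // code point U+4E2D

// UTF-8: TextEncoder always produces UTF-8 bytes.
const utf8 = Array.from(new TextEncoder().encode(character));

// UTF-16 (big-endian here): split each 16-bit code unit into two bytes.
function toUtf16BeBytes(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff);
  }
  return bytes;
}
const utf16 = toUtf16BeBytes(character);

// Render a byte array as space-separated hex pairs.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');

console.log(toHex(utf8));  // e4 b8 ad
console.log(toHex(utf16)); // 4e 2d
```

The code point is the same in both cases; only the byte sequence that represents it changes with the encoding.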
