Calculating String Bytes Count

Recently, I ran into a tricky problem at work. I was responsible for developing the file upload feature, which stores multiple versions of a file based on the filename. When a file is uploaded, the total number of bytes in its filename is counted and spliced into the name as an identifier.

At first I used string.length as the filename byte count, which produced wrong identifiers, because length counts UTF-16 code units rather than bytes. After checking the wiki documentation, I came away with a new understanding of character encoding.
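To make the mismatch concrete, here is a minimal sketch. It assumes the byte count that matters is the UTF-8 one, and the filename is a made-up example, not one from the original project.

```javascript
// Hypothetical filename, used only for illustration.
const filename = "文件.txt";

// length counts UTF-16 code units, not bytes.
const codeUnits = filename.length;

// TextEncoder always encodes to UTF-8, so the length of the
// resulting Uint8Array is the UTF-8 byte count.
const utf8Bytes = new TextEncoder().encode(filename).length;

console.log(codeUnits, utf8Bytes); // 6 10
```

Each of the two Chinese characters takes one code unit but three UTF-8 bytes, which is where the two numbers diverge.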

Unicode Code Points

The charCodeAt() method of String values returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The codePointAt() method of String values returns a non-negative integer that is the Unicode code point value of the character starting at the given index. Note that the index is still based on UTF-16 code units, not Unicode code points.

Unicode code points range from 0 to 1114111 (0x10FFFF). charCodeAt() always returns a value that is less than 65536, because the higher code points are represented by a pair of 16-bit surrogate pseudo-characters. Therefore, in order to get a full character with value greater than 65535, it is necessary to retrieve not only charCodeAt(i), but also charCodeAt(i + 1), or to use codePointAt(i) instead.
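The difference is easiest to see with a character outside the 16-bit range. The sketch below uses U+1F600 as an arbitrary example and also rebuilds the code point from the two surrogates by hand, which is roughly what codePointAt() does for us.

```javascript
const face = "\u{1F600}"; // one character, but two UTF-16 code units

const high = face.charCodeAt(0); // 0xD83D, the high surrogate
const low = face.charCodeAt(1); // 0xDE00, the low surrogate
const codePoint = face.codePointAt(0); // 0x1F600, the full code point

// Combining the surrogate pair manually gives the same value.
const combined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;

// The index is still in code units: index 1 lands on the low surrogate.
const secondHalf = face.codePointAt(1); // 0xDE00

console.log(face.length, codePoint === combined); // 2 true
```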

So we use the codePointAt(i) method to get the Unicode code point value at the given index.

const charCode = str.codePointAt(i);
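When walking a whole string this way, the index has to skip the second half of each surrogate pair. A minimal sketch follows; the helper name codePointsOf is hypothetical and not part of any library.

```javascript
// Collect the Unicode code points of a string using codePointAt().
// codePointsOf is an illustrative helper name, not a built-in.
function codePointsOf(str) {
  const points = [];
  for (let i = 0; i < str.length; i++) {
    const codePoint = str.codePointAt(i);
    points.push(codePoint);
    // Code points above 0xFFFF occupy two code units, so skip the low surrogate.
    if (codePoint > 0xffff) {
      i++;
    }
  }
  return points;
}

console.log(codePointsOf("a\u{1F600}b")); // [ 97, 128512, 98 ]
```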


First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+010000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
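The table above is the UTF-8 byte layout: the range a code point falls into decides how many bytes it needs. Assuming UTF-8 is the encoding whose byte count we want, the ranges translate into a counter like the sketch below. The function name utf8ByteLength is hypothetical, and this is one possible way to write it rather than the only one.

```javascript
// Count the UTF-8 bytes of a string by mapping each code point
// to the byte width given by the table's ranges.
// utf8ByteLength is an illustrative name, not a built-in.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const codePoint = str.codePointAt(i);
    if (codePoint <= 0x7f) {
      bytes += 1; // U+0000 to U+007F
    } else if (codePoint <= 0x7ff) {
      bytes += 2; // U+0080 to U+07FF
    } else if (codePoint <= 0xffff) {
      bytes += 3; // U+0800 to U+FFFF
    } else {
      bytes += 4; // U+010000 to U+10FFFF
      i++; // the code point used two UTF-16 code units
    }
  }
  return bytes;
}

console.log(utf8ByteLength("abc")); // 3
console.log(utf8ByteLength("文件.txt")); // 10
console.log(utf8ByteLength("\u{1F600}")); // 4
```

In environments that provide TextEncoder, new TextEncoder().encode(str).length should give the same result and can serve as a cross-check.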
