Learn how to capitalize the first character of a string in JavaScript correctly, handling Unicode characters.
To convert the first letter of a string to upper case, we need to use toLocaleUpperCase
and take surrogate pairs into account.
How are strings represented in JavaScript?
In JavaScript, strings are represented as sequences of UTF-16 code units. UTF-16 is a character encoding capable of encoding all 1,112,064 valid code points in Unicode using one or two 16-bit code units. The first 65,536 code points, known as the Basic Multilingual Plane (BMP), can be represented by a single 16-bit code unit. Characters outside the BMP, known as astral characters, must be represented using two 16-bit code units, a high-surrogate code unit and a low-surrogate code unit. This pair of code units is known as a surrogate pair.
- High-surrogate code units have values between 0xD800 and 0xDBFF.
- Low-surrogate code units have values between 0xDC00 and 0xDFFF.
When dealing with strings that contain astral characters, it’s crucial to handle surrogate pairs correctly.
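To make the representation concrete, here is a small sketch that inspects one astral character. The variable names and the two range-check helpers (isHighSurrogate, isLowSurrogate) are hypothetical and exist only for illustration.

```javascript
// The emoji U+1F600 lies outside the BMP, so it is stored as a surrogate pair.
const emoji = "\uD83D\uDE00";

// length counts UTF-16 code units, not characters.
const lengthInCodeUnits = emoji.length; // 2

// The string iterator (used by Array.from) walks code points instead.
const lengthInCodePoints = Array.from(emoji).length; // 1

// charCodeAt returns individual code units; codePointAt decodes the pair.
const highSurrogate = emoji.charCodeAt(0); // 0xD83D
const lowSurrogate = emoji.charCodeAt(1); // 0xDE00
const codePoint = emoji.codePointAt(0); // 0x1F600

// Hypothetical helpers that test the two surrogate ranges listed above.
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xd800 && codeUnit <= 0xdbff;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xdc00 && codeUnit <= 0xdfff;
}
```

Note that emoji.charAt(0) would return only the high surrogate, which is not a valid character on its own. This is why index-based methods need extra care with astral characters.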
Getting and capitalizing the first character of a string in JavaScript
The function below capitalizes the first character of a given text string while handling special Unicode characters correctly.
Let’s break down the code:
- The function takes a text parameter, which represents the input string to be processed.
- The first if statement checks whether the length of the string "\uD83D\uDE00" (a single astral character written as a surrogate pair) is equal to 1. This is used to determine if the environment supports UTF-32, that is, whether it counts an astral character as one unit. Engines that follow the ECMAScript specification report a length of 2 here, so they skip this branch. If the condition is met, the function capitalizes the first character using text.charAt(0).toLocaleUpperCase() and concatenates it with the rest of the string starting from index 1 using text.substring(1).
- If the above condition is not met, the function determines whether the code point of the first character is greater than 0xFFFF (which indicates a surrogate pair) and adjusts the index for the substring method accordingly.
- Finally, it capitalizes the first character using String.fromCodePoint(firstChar).toLocaleUpperCase() and concatenates it with the rest of the string starting from index 1 or 2, depending on whether the first character is a surrogate pair.
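The steps above can be sketched as follows. This is a reconstruction from the description, not necessarily the exact original code: the function name capitalizeFirstLetter and the variable name index are assumptions, and the empty-string guard is an addition for this sketch (String.fromCodePoint throws a RangeError when codePointAt returns undefined).

```javascript
// Sketch reconstructed from the steps above; names are assumptions.
function capitalizeFirstLetter(text) {
  // Guard added for this sketch: an empty string has no first character.
  if (text.length === 0) {
    return text;
  }

  // If an astral character has length 1, characters are single units,
  // so plain index-based slicing is safe.
  if ("\uD83D\uDE00".length === 1) {
    return text.charAt(0).toLocaleUpperCase() + text.substring(1);
  }

  // Otherwise read the full code point of the first character.
  const firstChar = text.codePointAt(0);

  // A code point above 0xFFFF occupies two code units (a surrogate pair).
  const index = firstChar > 0xffff ? 2 : 1;

  return String.fromCodePoint(firstChar).toLocaleUpperCase() + text.substring(index);
}
```

For example, capitalizeFirstLetter("hello") returns "Hello", and a string that starts with an emoji such as "\uD83D\uDE00abc" is returned unchanged, with the surrogate pair kept intact.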
The code could be optimized further, at some cost to readability, by wrapping it in an immediately invoked function expression (IIFE) so that the test for UTF-32 support runs only once. The optimized version of the code for getting and capitalizing the first letter is included in the CodePen below.
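As a rough illustration of the IIFE idea (a sketch only, not the CodePen code), the support check can run once and select which implementation to return. The name capitalizeFirstLetterOnce is hypothetical, and the empty-string guard is an addition for this sketch.

```javascript
// Sketch: the IIFE runs immediately and returns the function to use.
const capitalizeFirstLetterOnce = (function () {
  // This check is evaluated a single time, when the IIFE executes.
  if ("\uD83D\uDE00".length === 1) {
    return function (text) {
      return text.charAt(0).toLocaleUpperCase() + text.substring(1);
    };
  }

  // Surrogate-aware version for environments that use UTF-16 code units.
  return function (text) {
    // Guard added for this sketch: nothing to capitalize.
    if (text.length === 0) {
      return text;
    }
    const firstChar = text.codePointAt(0);
    const index = firstChar > 0xffff ? 2 : 1;
    return String.fromCodePoint(firstChar).toLocaleUpperCase() + text.substring(index);
  };
})();
```

Each later call, such as capitalizeFirstLetterOnce("hello"), then goes straight to the selected implementation without repeating the check.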
You can also experiment with the Surrogate Pair Calculator.