Unicode Support for Surrogate Pairs and Combining Character Sequences

Article
11/03/2006

The Unicode Standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two code units. The first code unit of the pair is the high surrogate, a 16-bit value in the range U+D800 through U+DBFF. The second code unit is the low surrogate, a 16-bit value in the range U+DC00 through U+DFFF.
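The arithmetic behind these two ranges can be sketched as follows. This is a minimal illustration in Python rather than a .NET API; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and exist only for this example.

```python
# Surrogate ranges as given in the text above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = HIGH_START + (offset >> 10)   # top 10 bits select the high surrogate
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)
print(f"U+{high:04X} U+{low:04X}")  # prints "U+D834 U+DD1E"
```

Each surrogate carries 10 bits of the 20-bit offset, which is why both ranges contain exactly 1,024 values.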

The Unicode Standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at www.unicode.org.

The key point to remember is that a surrogate pair is two 16-bit code units (32 bits in total) that together represent a single character, so you cannot assume that one 16-bit Unicode encoding value maps to exactly one character. By using surrogate pairs, a 16-bit Unicode encoded system can address 1,048,576 additional code points (U+10000 through U+10FFFF), to which the Unicode Standard can assign characters.
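The mismatch between code units and characters can be made concrete with a short sketch. It is written in Python under the assumption that encoding a string as UTF-16 stands in for how a 16-bit system stores it; `utf16_code_units` is a hypothetical helper, not a library function.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


sample = "a\U0001D11E"  # LATIN SMALL LETTER A followed by U+1D11E

print(len(sample))               # code points: 2
print(utf16_code_units(sample))  # 16-bit code units: 3

# Every high surrogate can pair with every low surrogate.
SUPPLEMENTARY_CODE_POINTS = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)
print(SUPPLEMENTARY_CODE_POINTS)  # 1048576
```

Code that indexes or truncates a string by 16-bit position can therefore land between the two halves of a pair and split a character.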

The .NET Framework supports text elements. A text element, also called a grapheme, is a unit of text that is displayed as a single character. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow you to split a string into its text elements and iterate through them. For example, the StringInfo.GetNextTextElement method returns a surrogate pair as one text element rather than as two separate 16-bit values. For an example of using the StringInfo class, see String Indexing.
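The idea of a text element can be sketched outside .NET as well. The Python example below is a conceptual illustration only, not the StringInfo implementation: it assumes that any character whose Unicode general category is Mn, Mc, or Me is a combining character, and the `text_elements` function is hypothetical. Python strings are indexed by code point, so a character that UTF-16 stores as a surrogate pair already appears here as a single item.

```python
import unicodedata

# General categories treated as combining marks (an assumption of this sketch).
COMBINING_CATEGORIES = {"Mn", "Mc", "Me"}


def text_elements(text):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch) in COMBINING_CATEGORIES:
            elements[-1] += ch   # attach the mark to the preceding element
        else:
            elements.append(ch)  # start a new text element
    return elements


# "e" + COMBINING ACUTE ACCENT, then U+1D11E, then "a".
sample = "e\u0301\U0001D11Ea"
print(len(sample))                 # 4 code points
print(len(text_elements(sample)))  # 3 text elements
```

The sketch ignores the fuller grapheme cluster rules of the Unicode Standard, so it should be read as a simplification of what a text element iterator does.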

