Normalization

Article
11/03/2006

Some Unicode characters have multiple equivalent binary representations consisting of sets of combining and/or composite Unicode characters. The existence of multiple representations for a single character complicates searching, sorting, matching, and other operations.
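As a minimal sketch of the problem, the following C# fragment builds the letter "é" in two equivalent ways: as one precomposed character, and as a base letter followed by a combining accent. The class name RepresentationsSketch is illustrative only and is not part of the .NET Framework. The two strings render identically, but they differ in length and do not match in an ordinal comparison.

```csharp
using System;

class RepresentationsSketch
{
    static void Main()
    {
        // One precomposed character: LATIN SMALL LETTER E WITH ACUTE.
        string composed = "\u00E9";
        // Two characters: LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
        string decomposed = "\u0065\u0301";

        Console.WriteLine(composed.Length);    // 1
        Console.WriteLine(decomposed.Length);  // 2

        // An ordinal comparison looks only at the UTF-16 code units, so it reports a mismatch.
        Console.WriteLine(String.Equals(composed, decomposed, StringComparison.Ordinal)); // False
    }
}
```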

The Unicode standard defines a process called normalization that returns one binary representation when given any of the equivalent binary representations of a character. Normalization can be performed with several algorithms, called normalization forms, that obey different rules. The .NET Framework currently supports Unicode normalization forms C, D, KC, and KD.
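The difference between the forms can be seen with a single compatibility character. The sketch below, which assumes only the String.Normalize(NormalizationForm) overload described later in this topic, normalizes U+FB01 (LATIN SMALL LIGATURE FI) to each form and prints the resulting length. Forms C and D leave the ligature intact, while the compatibility forms KC and KD replace it with the two letters "f" and "i". The class name FormsSketch is illustrative.

```csharp
using System;
using System.Text;

class FormsSketch
{
    static void Main()
    {
        string ligature = "\uFB01"; // LATIN SMALL LIGATURE FI

        NormalizationForm[] forms = new NormalizationForm[]
        {
            NormalizationForm.FormC,
            NormalizationForm.FormD,
            NormalizationForm.FormKC,
            NormalizationForm.FormKD
        };

        foreach (NormalizationForm form in forms)
        {
            string normalized = ligature.Normalize(form);
            // Length counts UTF-16 code units: 1 for FormC and FormD, 2 for FormKC and FormKD.
            Console.WriteLine("{0}: {1}", form, normalized.Length);
        }
    }
}
```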

Two strings normalized to the same normalization form can be compared using an ordinal comparison (that is, a character-by-character binary comparison).
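The following sketch illustrates this with two equivalent representations of "Å". It assumes normalization form C; any of the four forms would work as long as both strings use the same one. The class name OrdinalComparisonSketch is illustrative.

```csharp
using System;
using System.Text;

class OrdinalComparisonSketch
{
    static void Main()
    {
        string precomposed = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE
        string combining = "A\u030A";  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

        // Before normalization the code units differ, so the ordinal comparison fails.
        Console.WriteLine(String.Equals(precomposed, combining, StringComparison.Ordinal)); // False

        // Normalize both strings to the same form, then compare ordinally.
        string left = precomposed.Normalize(NormalizationForm.FormC);
        string right = combining.Normalize(NormalizationForm.FormC);
        Console.WriteLine(String.Equals(left, right, StringComparison.Ordinal)); // True
    }
}
```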

For more information about the normalization forms supported by the .NET Framework, see System.Text.NormalizationForm. For more information about normalization, character decompositions, and equivalence, see Unicode Standard Annex #15, "Unicode Normalization Forms," at the Unicode home site.

Normalizing a String

Call the System.String.Normalize method on a String object to return a new string normalized to form C, the default. To choose the form explicitly, call the System.String.Normalize(System.Text.NormalizationForm) overload with a NormalizationForm value; it returns a new string normalized to form C, D, KC, or KD.

Testing Whether a String is Normalized

Call the System.String.IsNormalized method on a String object to determine whether its value is normalized to form C, the default. To test against a specific form, call the System.String.IsNormalized(System.Text.NormalizationForm) overload with a NormalizationForm value for form C, D, KC, or KD.

Example

The following code example demonstrates the IsNormalized and Normalize methods. It tests whether an original string is in any of the four normalization forms, creates a version of the original string in each form, tests whether each normalized string is in the intended form, and then displays the hexadecimal value of each character in each normalized string. The example is shown in Visual Basic, C#, C++, and J#, in that order.

' This example demonstrates the String.Normalize method
'                       and the String.IsNormalized method
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Class Sample
   Public Shared Sub Main()
      ' Character c; combining characters acute and cedilla; character 3/4
      Dim s1 As New String(New Char() {ChrW(&H0063), ChrW(&H0301), ChrW(&H0327), ChrW(&H00BE)})
      Dim s2 As String = Nothing
      Dim divider As New String("-"c, 80)
      divider = [String].Concat(Environment.NewLine, divider, Environment.NewLine)
      
      Try
         Show("s1", s1)
         Console.WriteLine()
         Console.WriteLine("U+0063 = LATIN SMALL LETTER C")
         Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT")
         Console.WriteLine("U+0327 = COMBINING CEDILLA")
         Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS")

         Console.WriteLine(divider)
         
         Console.WriteLine("A1) Is s1 normalized to the default form (Form C)?: {0}", s1.IsNormalized())
         Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}", s1.IsNormalized(NormalizationForm.FormC))
         Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}", s1.IsNormalized(NormalizationForm.FormD))
         Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}", s1.IsNormalized(NormalizationForm.FormKC))
         Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}", s1.IsNormalized(NormalizationForm.FormKD))
         
         Console.WriteLine(divider)
         
         Console.WriteLine("Set string s2 to each normalized form of string s1.")
         Console.WriteLine()
         Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE")
         Console.WriteLine("U+0033 = DIGIT THREE")
         Console.WriteLine("U+2044 = FRACTION SLASH")
         Console.WriteLine("U+0034 = DIGIT FOUR")
         Console.WriteLine(divider)
         
         s2 = s1.Normalize()
         Console.Write("B1) Is s2 normalized to the default form (Form C)?: ")
         Console.WriteLine(s2.IsNormalized())
         Show("s2", s2)
         Console.WriteLine()
         
         s2 = s1.Normalize(NormalizationForm.FormC)
         Console.Write("B2) Is s2 normalized to Form C?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC))
         Show("s2", s2)
         Console.WriteLine()
         
         s2 = s1.Normalize(NormalizationForm.FormD)
         Console.Write("B3) Is s2 normalized to Form D?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD))
         Show("s2", s2)
         Console.WriteLine()
         
         s2 = s1.Normalize(NormalizationForm.FormKC)
         Console.Write("B4) Is s2 normalized to Form KC?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC))
         Show("s2", s2)
         Console.WriteLine()
         
         s2 = s1.Normalize(NormalizationForm.FormKD)
         Console.Write("B5) Is s2 normalized to Form KD?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD))
         Show("s2", s2)
         Console.WriteLine()
      
      Catch e As Exception
         Console.WriteLine(e.Message)
      End Try
   End Sub 'Main
   
   Private Shared Sub Show(title As String, s As String)
      Console.Write("Characters in string {0} = ", title)
      Dim x As Char
      For Each x In  s.ToCharArray()
         Console.Write("{0:X4} ", AscW(x))
      Next x
      Console.WriteLine()
   End Sub 'Show
End Class 'Sample
'
'This example produces the following results:
'
'Characters in string s1 = 0063 0301 0327 00BE
'
'U+0063 = LATIN SMALL LETTER C
'U+0301 = COMBINING ACUTE ACCENT
'U+0327 = COMBINING CEDILLA
'U+00BE = VULGAR FRACTION THREE QUARTERS
'
'--------------------------------------------------------------------------------
'
'A1) Is s1 normalized to the default form (Form C)?: False
'A2) Is s1 normalized to Form C?:  False
'A3) Is s1 normalized to Form D?:  False
'A4) Is s1 normalized to Form KC?: False
'A5) Is s1 normalized to Form KD?: False
'
'--------------------------------------------------------------------------------
'
'Set string s2 to each normalized form of string s1.
'
'U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
'U+0033 = DIGIT THREE
'U+2044 = FRACTION SLASH
'U+0034 = DIGIT FOUR
'
'--------------------------------------------------------------------------------
'
'B1) Is s2 normalized to the default form (Form C)?: True
'Characters in string s2 = 1E09 00BE
'
'B2) Is s2 normalized to Form C?: True
'Characters in string s2 = 1E09 00BE
'
'B3) Is s2 normalized to Form D?: True
'Characters in string s2 = 0063 0327 0301 00BE
'
'B4) Is s2 normalized to Form KC?: True
'Characters in string s2 = 1E09 0033 2044 0034
'
'B5) Is s2 normalized to Form KD?: True
'Characters in string s2 = 0063 0327 0301 0033 2044 0034
'

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method

using System;
using System.Text;

class Sample 
{
    public static void Main()
    {
        // Character c; combining characters acute and cedilla; character 3/4
        string s1 = new String(new char[] {'\u0063', '\u0301', '\u0327', '\u00BE'});
        string s2 = null;
        string divider = new String('-', 80);
        divider = String.Concat(Environment.NewLine, divider, Environment.NewLine);

        try
        {
            Show("s1", s1);
            Console.WriteLine();
            Console.WriteLine("U+0063 = LATIN SMALL LETTER C");
            Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT");
            Console.WriteLine("U+0327 = COMBINING CEDILLA");
            Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS");
            Console.WriteLine(divider);

            Console.WriteLine("A1) Is s1 normalized to the default form (Form C)?: {0}",
                              s1.IsNormalized());
            Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}",
                              s1.IsNormalized(NormalizationForm.FormC));
            Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}",
                              s1.IsNormalized(NormalizationForm.FormD));
            Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}",
                              s1.IsNormalized(NormalizationForm.FormKC));
            Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}",
                              s1.IsNormalized(NormalizationForm.FormKD));

            Console.WriteLine(divider);

            Console.WriteLine("Set string s2 to each normalized form of string s1.");
            Console.WriteLine();
            Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE");
            Console.WriteLine("U+0033 = DIGIT THREE");
            Console.WriteLine("U+2044 = FRACTION SLASH");
            Console.WriteLine("U+0034 = DIGIT FOUR");
            Console.WriteLine(divider);

            s2 = s1.Normalize();
            Console.Write("B1) Is s2 normalized to the default form (Form C)?: ");
            Console.WriteLine(s2.IsNormalized());
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormC);
            Console.Write("B2) Is s2 normalized to Form C?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormD);
            Console.Write("B3) Is s2 normalized to Form D?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKC);
            Console.Write("B4) Is s2 normalized to Form KC?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKD);
            Console.Write("B5) Is s2 normalized to Form KD?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD));
            Show("s2", s2);
            Console.WriteLine();
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }

    private static void Show(string title, string s)
    {
        Console.Write("Characters in string {0} = ", title);
        // Widen each char to int so that values above 0x7FFF also format as four hex digits.
        foreach (char x in s)
        {
            Console.Write("{0:X4} ", (int)x);
        }
        Console.WriteLine();
    }
}
/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method
using namespace System;
using namespace System::Text;
void Show( String^ title, String^ s )
{
   Console::Write( "Characters in string {0} = ", title );
   // Print each UTF-16 code unit of the string as a four-digit hexadecimal value.
   for each ( Char c in s )
   {
      Console::Write( "{0:X4} ", static_cast<int>(c) );
   }

   Console::WriteLine();
}

int main()
{
   
   // Character c; combining characters acute and cedilla; character 3/4
   array<Char>^temp0 = {L'c',L'\u0301',L'\u0327',L'\u00BE'};
   String^ s1 = gcnew String( temp0 );
   String^ s2 = nullptr;
   String^ divider = gcnew String( '-',80 );
   divider = String::Concat( Environment::NewLine, divider, Environment::NewLine );
   try
   {
      Show( "s1", s1 );
      Console::WriteLine();
      Console::WriteLine( "U+0063 = LATIN SMALL LETTER C" );
      Console::WriteLine( "U+0301 = COMBINING ACUTE ACCENT" );
      Console::WriteLine( "U+0327 = COMBINING CEDILLA" );
      Console::WriteLine( "U+00BE = VULGAR FRACTION THREE QUARTERS" );
      Console::WriteLine( divider );
      Console::WriteLine( "A1) Is s1 normalized to the default form (Form C)?: {0}", s1->IsNormalized() );
      Console::WriteLine( "A2) Is s1 normalized to Form C?:  {0}", s1->IsNormalized( NormalizationForm::FormC ) );
      Console::WriteLine( "A3) Is s1 normalized to Form D?:  {0}", s1->IsNormalized( NormalizationForm::FormD ) );
      Console::WriteLine( "A4) Is s1 normalized to Form KC?: {0}", s1->IsNormalized( NormalizationForm::FormKC ) );
      Console::WriteLine( "A5) Is s1 normalized to Form KD?: {0}", s1->IsNormalized( NormalizationForm::FormKD ) );
      Console::WriteLine( divider );
      Console::WriteLine( "Set string s2 to each normalized form of string s1." );
      Console::WriteLine();
      Console::WriteLine( "U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE" );
      Console::WriteLine( "U+0033 = DIGIT THREE" );
      Console::WriteLine( "U+2044 = FRACTION SLASH" );
      Console::WriteLine( "U+0034 = DIGIT FOUR" );
      Console::WriteLine( divider );
      s2 = s1->Normalize();
      Console::Write( "B1) Is s2 normalized to the default form (Form C)?: " );
      Console::WriteLine( s2->IsNormalized() );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormC );
      Console::Write( "B2) Is s2 normalized to Form C?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormC ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormD );
      Console::Write( "B3) Is s2 normalized to Form D?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormD ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormKC );
      Console::Write( "B4) Is s2 normalized to Form KC?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormKC ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormKD );
      Console::Write( "B5) Is s2 normalized to Form KD?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormKD ) );
      Show( "s2", s2 );
      Console::WriteLine();
   }
   catch ( Exception^ e ) 
   {
      Console::WriteLine( e->Message );
   }

}

/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method
import System.*;
import System.Text.*;

class Sample
{
    public static void main(String[] args)
    {
        // Character c; combining characters acute and cedilla; character 3/4
        String s1 = new String(new char[] { '\u0063', '\u0301', '\u0327', 
            '\u00BE' });
        String s2 = null;
        String divider = new String('-', 80);
        divider = String.Concat(Environment.get_NewLine(), divider, 
            Environment.get_NewLine());

        try {
            Show("s1", s1);
            Console.WriteLine();
            Console.WriteLine("U+0063 = LATIN SMALL LETTER C");
            Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT");
            Console.WriteLine("U+0327 = COMBINING CEDILLA");
            Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS");
            Console.WriteLine(divider);

            Console.WriteLine("A1) Is s1 normalized to the default form " 
                + "(Form C)?: {0}", System.Convert.ToString(s1.IsNormalized()));
            Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormC)));
            Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormD)));
            Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormKC)));
            Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormKD)));

            Console.WriteLine(divider);

            Console.WriteLine("Set string s2 to each normalized form of " 
                + "string s1.");
            Console.WriteLine();
            Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA " 
                + "AND ACUTE");
            Console.WriteLine("U+0033 = DIGIT THREE");
            Console.WriteLine("U+2044 = FRACTION SLASH");
            Console.WriteLine("U+0034 = DIGIT FOUR");
            Console.WriteLine(divider);

            s2 = s1.Normalize();
            Console.Write("B1) Is s2 normalized to the default form " 
                + "(Form C)?: ");
            Console.WriteLine(s2.IsNormalized());
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormC);
            Console.Write("B2) Is s2 normalized to Form C?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormD);
            Console.Write("B3) Is s2 normalized to Form D?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKC);
            Console.Write("B4) Is s2 normalized to Form KC?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKD);
            Console.Write("B5) Is s2 normalized to Form KD?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD));
            Show("s2", s2);
            Console.WriteLine();
        }
        catch (System.Exception e) {
            Console.WriteLine(e.get_Message());
        }
    } //main

    private static void Show(String title, String s)
    {
        Console.Write("Characters in string {0} = ", title);
        char myCharArray[] = s.ToCharArray();
        for (int iCtr = 0; iCtr < myCharArray.length; iCtr++) {
            char c = myCharArray[iCtr];
            Console.Write(((System.Int32)c).ToString("X4") + " ");
        }
        Console.WriteLine();
    } //Show
} //Sample
/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/
