Transliterator Class

Definition

Transliterator is an abstract class that transliterates text from one format to another.

[Android.Runtime.Register("android/icu/text/Transliterator", ApiSince=29, DoNotGenerateAcw=true)]
public abstract class Transliterator : Java.Lang.Object
[<Android.Runtime.Register("android/icu/text/Transliterator", ApiSince=29, DoNotGenerateAcw=true)>]
type Transliterator = class
    inherit Object
Inheritance
Transliterator
Attributes

Remarks

Transliterator is an abstract class that transliterates text from one format to another. The most common kind of transliterator is a script, or alphabet, transliterator. For example, a Russian to Latin transliterator changes Russian text written in Cyrillic characters to phonetically equivalent Latin characters. It does not <em>translate</em> Russian to English! Transliteration, unlike translation, operates on characters, without reference to the meanings of words and sentences.

Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks. In fact, Transliterator defines a very general API which specifies only that a segment of the input text is replaced by new text. The particulars of this conversion are determined entirely by subclasses of Transliterator.

<b>Transliterators are stateless</b>

Transliterator objects are <em>stateless</em>; they retain no information between calls to transliterate(). As a result, threads may share transliterators without synchronizing them. This might seem to limit the complexity of the transliteration operation. In practice, subclasses perform complex transliterations by delaying the replacement of text until it is known that no other replacements are possible. In other words, although the Transliterator objects are stateless, the source text itself embodies all the needed information, and delayed operation allows arbitrary complexity.

<b>Batch transliteration</b>

The simplest way to perform transliteration is all at once, on a string of existing text. This is referred to as <em>batch</em> transliteration. For example, given a string input and a transliterator t, the call

<blockquote>String result = t.transliterate(input); </blockquote>

will transliterate it and return the result. Other methods allow the client to specify a substring to be transliterated and to use Replaceable objects instead of strings, in order to preserve out-of-band information (such as text styles).
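The batch call can be pictured with a toy stand-in that does not use ICU at all. The class name `ToyBatchTransliterator` and its three-letter table are assumptions made for illustration; a real transliterator is obtained through getInstance() or createFromRules().

```java
import java.util.Map;

// Toy stand-in for a script transliterator; this is not the ICU implementation.
final class ToyBatchTransliterator {
    // Hypothetical three-letter Cyrillic-to-Latin table, for illustration only.
    private static final Map<Character, String> TABLE = Map.of(
            '\u0434', "d",   // Cyrillic small letter de
            '\u0430', "a",   // Cyrillic small letter a
            '\u043d', "n");  // Cyrillic small letter en

    // Transliterates the whole string in one call (batch mode).
    static String transliterate(String input) {
        return transliterate(input, 0, input.length());
    }

    // Transliterates only [start, limit); text outside that range is copied unchanged.
    static String transliterate(String input, int start, int limit) {
        StringBuilder out = new StringBuilder(input.substring(0, start));
        for (int i = start; i < limit; i++) {
            char c = input.charAt(i);
            out.append(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return out.append(input.substring(limit)).toString();
    }
}
```

The two overloads mirror the whole-string and substring variants described above; the mapping works character by character and never consults word meaning.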

<b>Keyboard transliteration</b>

Somewhat more involved is <em>keyboard</em>, or incremental transliteration. This is the transliteration of text that is arriving from some source (typically the user's keyboard) one character at a time, or in some other piecemeal fashion.

In keyboard transliteration, a Replaceable buffer stores the text. As text is inserted, as much as possible is transliterated on the fly. This means a GUI that displays the contents of the buffer may show text being modified as each new character arrives.

Consider the simple rule-based Transliterator:

<blockquote> th&gt;{theta}<br> t&gt;{tau} </blockquote>

When the user types 't', nothing will happen, since the transliterator is waiting to see if the next character is 'h'. To remedy this, we introduce the notion of a cursor, marked by a '|' in the output string:

<blockquote> t&gt;|{tau}<br> {tau}h&gt;{theta} </blockquote>

Now when the user types 't', tau appears, and if the next character is 'h', the tau changes to a theta. This is accomplished by maintaining a cursor position (independent of the insertion point, and invisible in the GUI) across calls to transliterate(). Typically, the cursor will be coincident with the insertion point, but in a case like the one above, it will precede the insertion point.
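A minimal sketch of the second rule set, assuming a plain StringBuilder in place of a Replaceable buffer. `ToyCursorKeyboard` is a hypothetical name, and the sketch re-examines only the tail of the buffer rather than tracking a real cursor index.

```java
// Toy model of the rules "t > |tau" and "tau h > theta"; this is not ICU code.
final class ToyCursorKeyboard {
    private static final char TAU = '\u03c4';
    private static final char THETA = '\u03b8';
    private final StringBuilder buffer = new StringBuilder();

    // Inserts one typed character and applies whichever rule matches at the end of the buffer.
    String type(char typed) {
        buffer.append(typed);
        int n = buffer.length();
        if (n >= 2 && buffer.charAt(n - 2) == TAU && buffer.charAt(n - 1) == 'h') {
            buffer.replace(n - 2, n, String.valueOf(THETA)); // tau h > theta
        } else if (buffer.charAt(n - 1) == 't') {
            buffer.setCharAt(n - 1, TAU);                    // t > |tau
        }
        return buffer.toString();
    }
}
```

Typing 't' shows tau immediately, and a following 'h' turns that tau into theta, which is the visible behaviour the cursor placement is meant to produce.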

Keyboard transliteration methods maintain a set of three indices that are updated with each call to transliterate(): the start, the cursor, and the limit. These indices are changed by the method, and they are passed in and out via a Position object. The start index marks the beginning of the substring that the transliterator will look at. It is advanced as text becomes committed (but it is not itself the committed index; that is the cursor). The cursor index, described above, marks the point at which the transliterator last stopped, either because it reached the end, or because it required more characters to disambiguate between possible inputs. The cursor can also be explicitly set by rules. Any characters before the cursor index are frozen; future keyboard transliteration calls within this input sequence will not change them. New text is inserted at the limit index, which marks the end of the substring that the transliterator looks at.

Because keyboard transliteration assumes that more characters are to arrive, it is conservative in its operation. It only transliterates when it can do so unambiguously. Otherwise it waits for more characters to arrive. When the client code knows that no more characters are forthcoming, perhaps because the user has performed some input termination operation, then it should call finishTransliteration() to complete any pending transliterations.
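The conservative behaviour can be sketched with the first rule set (th to theta, t to tau). `ToyPendingKeyboard`, its `start` field, and its `finish()` method are hypothetical stand-ins for a Position object and finishTransliteration(); this is a simplified model, not the ICU algorithm.

```java
// Toy model of the rules "th > theta" and "t > tau" with pending input; this is not ICU code.
final class ToyPendingKeyboard {
    private static final String TAU = "\u03c4";
    private static final String THETA = "\u03b8";
    private final StringBuilder buffer = new StringBuilder();
    private int start = 0; // everything before this index has been committed

    // Inserts one character, then converts as much as is unambiguous.
    String type(char typed) {
        buffer.append(typed);
        process(false);
        return buffer.toString();
    }

    // Signals that no more input is coming, so a trailing 't' can be resolved.
    String finish() {
        process(true);
        return buffer.toString();
    }

    private void process(boolean inputComplete) {
        while (start < buffer.length()) {
            if (buffer.charAt(start) == 't') {
                boolean hasNext = start + 1 < buffer.length();
                if (!hasNext && !inputComplete) {
                    return; // ambiguous: the next key might be 'h', so wait
                }
                if (hasNext && buffer.charAt(start + 1) == 'h') {
                    buffer.replace(start, start + 2, THETA);
                } else {
                    buffer.replace(start, start + 1, TAU);
                }
            }
            start++;
        }
    }
}
```

A lone 't' stays in the buffer untouched until either another character arrives or `finish()` is called, at which point it becomes tau.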

<b>Inverses</b>

Pairs of transliterators may be inverses of one another. For example, if transliterator <b>A</b> transliterates characters by incrementing their Unicode value (so "abc" -&gt; "def"), and transliterator <b>B</b> decrements character values, then <b>A</b> is an inverse of <b>B</b> and vice versa. If we compose <b>A</b> with <b>B</b> in a compound transliterator, the result is the identity transliterator, that is, a transliterator that does not change its input text.
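As an illustration, the A and B pair can be modelled with plain functions. The shift of 3 is an assumption chosen to reproduce the "abc" to "def" example, and `ShiftInverseSketch` is a hypothetical name; the sketch works on UTF-16 code units and ignores surrogate pairs.

```java
import java.util.function.UnaryOperator;

// Toy model of two mutually inverse transliterators; this is not ICU code.
final class ShiftInverseSketch {
    // Returns a toy transliterator that adds delta to every UTF-16 code unit.
    static UnaryOperator<String> shift(int delta) {
        return text -> {
            StringBuilder out = new StringBuilder(text.length());
            for (int i = 0; i < text.length(); i++) {
                out.append((char) (text.charAt(i) + delta));
            }
            return out.toString();
        };
    }

    static final UnaryOperator<String> A = shift(3);   // "abc" -> "def"
    static final UnaryOperator<String> B = shift(-3);  // "def" -> "abc"

    // Composition of A followed by B, which leaves the text unchanged.
    static String roundTrip(String text) {
        return B.apply(A.apply(text));
    }
}
```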

The Transliterator method getInverse() returns a transliterator's inverse, if one exists, or null otherwise. However, the result of getInverse() usually will <em>not</em> be a true mathematical inverse. This is because true inverse transliterators are difficult to formulate. For example, consider two transliterators: <b>AB</b>, which transliterates the character 'A' to 'B', and <b>BA</b>, which transliterates 'B' to 'A'. It might seem that these are exact inverses, since

<blockquote>"A" x <b>AB</b> -&gt; "B"<br> "B" x <b>BA</b> -&gt; "A"</blockquote>

where 'x' represents transliteration. However,

<blockquote>"ABCD" x <b>AB</b> -&gt; "BBCD"<br> "BBCD" x <b>BA</b> -&gt; "AACD"</blockquote>

so <b>AB</b> composed with <b>BA</b> is not the identity. Nonetheless, <b>BA</b> may be usefully considered to be <b>AB</b>'s inverse, and it is on this basis that <b>AB</b>.getInverse() could legitimately return <b>BA</b>.
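The AB and BA example can be reproduced with ordinary string replacement. `NotQuiteInverseSketch` is a hypothetical name, and the two methods only imitate the single-character rules described above.

```java
// Toy versions of the AB and BA transliterators; this is not ICU code.
final class NotQuiteInverseSketch {
    // AB: every 'A' becomes 'B'.
    static String ab(String text) {
        return text.replace('A', 'B');
    }

    // BA: every 'B' becomes 'A'.
    static String ba(String text) {
        return text.replace('B', 'A');
    }
}
```

Applying `ab` and then `ba` to "ABCD" yields "AACD", because the original 'B' can no longer be told apart from the 'B' that AB produced.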

<b>Filtering</b>

Each transliterator has a filter, which restricts changes to those characters selected by the filter. The filter affects only which characters are changed; characters outside the filter are still part of the context used for matching. In the following example, even though 'x' is filtered out and is not converted to 'y', it still affects the conversion of 'a'.

import android.icu.text.Transliterator;
import android.icu.text.UnicodeSet;

String rules = "x > y; x{a} > b; ";
Transliterator tempTrans = Transliterator.createFromRules("temp", rules, Transliterator.FORWARD);
// Only 'a' may be changed; 'x' is still visible as context.
tempTrans.setFilter(new UnicodeSet("[a]"));
String tempResult = tempTrans.transliterate("xa");
// tempResult is "xb"
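The interaction between the filter and the context can also be imitated without ICU. The following toy hard-codes the two rules and takes the filter as a set of characters. `ToyFilterSketch` is a hypothetical name, and reading the preceding context from the already produced output is a simplifying assumption of this sketch.

```java
import java.util.Set;

// Toy model of the rules "x > y" and "x{a} > b"; this is not ICU code.
final class ToyFilterSketch {
    // Applies the two rules left to right. Only characters in 'filter' may be changed,
    // but characters outside it still count as preceding context.
    static String apply(String text, Set<Character> filter) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean afterX = out.length() > 0 && out.charAt(out.length() - 1) == 'x';
            if (!filter.contains(c)) {
                out.append(c);      // filtered out: copied unchanged
            } else if (c == 'x') {
                out.append('y');    // x > y
            } else if (c == 'a' && afterX) {
                out.append('b');    // x{a} > b
            } else {
                out.append(c);      // no rule applies
            }
        }
        return out.toString();
    }
}
```

With the filter restricted to 'a', the input "xa" becomes "xb": the 'x' is left alone but still serves as the context that lets 'a' match.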

<b>IDs and display names</b>

A transliterator is designated by a short identifier string or <em>ID</em>. IDs follow the format <em>source-destination</em>, where <em>source</em> describes the entity being replaced, and <em>destination</em> describes the entity replacing <em>source</em>. The entities may be the names of scripts, particular sequences of characters, or whatever else it is that the transliterator converts to or from. For example, a transliterator from Russian to Latin might be named "Russian-Latin". A transliterator from keyboard escape sequences to Latin-1 characters might be named "KeyboardEscape-Latin1". By convention, system entity names are in English, with the initial letters of words capitalized; user entity names may follow any format so long as they do not contain dashes.
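A minimal sketch of splitting an ID of the form described above into its two entities. `TransliteratorIdSketch` is a hypothetical helper, not part of the Transliterator API, and it handles only the plain <em>source-destination</em> shape.

```java
// Hypothetical helper for the "source-destination" ID format; not part of the ICU API.
final class TransliteratorIdSketch {
    final String source;
    final String destination;

    private TransliteratorIdSketch(String source, String destination) {
        this.source = source;
        this.destination = destination;
    }

    // Splits an ID at its single dash; entity names themselves may not contain dashes.
    static TransliteratorIdSketch parse(String id) {
        int dash = id.indexOf('-');
        boolean malformed = dash <= 0
                || dash == id.length() - 1
                || id.indexOf('-', dash + 1) >= 0;
        if (malformed) {
            throw new IllegalArgumentException("Expected source-destination: " + id);
        }
        return new TransliteratorIdSketch(id.substring(0, dash), id.substring(dash + 1));
    }
}
```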

In addition to programmatic IDs, transliterator objects have display names for presentation in user interfaces, returned by getDisplayName().

<b>Composed transliterators</b>

In addition to built-in system transliterators like "Latin-Greek", there are also built-in <em>composed</em> transliterators. These are implemented by composing two or more component transliterators. For example, if we have scripts "A", "B", "C", and "D", and we want to transliterate between all pairs of them, then we need to write 12 transliterators: "A-B", "A-C", "A-D", "B-A",..., "D-A", "D-B", "D-C". If it is possible to convert all scripts to an intermediate script "M", then instead of writing 12 rule sets, we only need to write 8: "A~M", "B~M", "C~M", "D~M", "M~A", "M~B", "M~C", "M~D". (This might not seem like a big win, but it's really 2<em>n</em> vs. <em>n</em> <sup>2</sup> - <em>n</em>, so as <em>n</em> gets larger the gain becomes significant. With 9 scripts, it's 18 vs. 72 rule sets, a big difference.) Note the use of "~" rather than "-" for the script separator here; this indicates that the given transliterator is intended to be composed with others, rather than be used as is.
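The counts quoted above follow directly from the two formulas. A small sketch makes the comparison easy to check for other values of <em>n</em>; `RuleSetCount` is a hypothetical name.

```java
// Arithmetic behind the rule-set counts; a hypothetical helper, not part of the ICU API.
final class RuleSetCount {
    // One rule set for every ordered pair of distinct scripts: n^2 - n.
    static int direct(int n) {
        return n * n - n;
    }

    // One rule set into the intermediate script and one out of it, per script: 2n.
    static int viaIntermediate(int n) {
        return 2 * n;
    }
}
```

For 4 scripts this gives 12 versus 8, and for 9 scripts 72 versus 18, matching the figures in the text.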

Composed transliterators can be instantiated as usual. For example, the system transliterator "Devanagari-Gujarati" is a composed transliterator built internally as "Devanagari~InterIndic;InterIndic~Gujarati". When this transliterator is instantiated, it appears externally to be a standard transliterator (e.g., getID() returns "Devanagari-Gujarati").
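To illustrate how an internal chain corresponds to its external ID, the following sketch derives "Devanagari-Gujarati" from the chain string quoted above. `CompoundIdSketch` is a hypothetical helper and only models the naming convention, not how ICU builds compound transliterators.

```java
// Hypothetical helper modelling the "X~M;M~Y" naming convention; not part of the ICU API.
final class CompoundIdSketch {
    // Derives the external "source-destination" ID from a ';'-separated chain of "X~Y" steps.
    static String externalId(String compound) {
        String first = null;
        String previousTarget = null;
        for (String step : compound.split(";")) {
            String[] parts = step.trim().split("~");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected X~Y: " + step);
            }
            // Each step must start from the script the previous step produced.
            if (previousTarget != null && !previousTarget.equals(parts[0])) {
                throw new IllegalArgumentException("Broken chain at: " + step);
            }
            if (first == null) {
                first = parts[0];
            }
            previousTarget = parts[1];
        }
        return first + "-" + previousTarget;
    }
}
```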

<b>Rule syntax</b>

A set of rules determines how to perform transliterations. Rules within a rule set are separated by semicolons (';'). To include a literal semicolon, prefix it with a backslash ('\'). Unicode Pattern_White_Space is ignored. If the first non-blank character on a line is '#', the entire line is ignored as a comment.
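A simplified sketch of how a rule set could be split into individual rules under the conventions just listed. `RuleSplitterSketch` is a hypothetical helper; it handles only comment lines, semicolon separators, and backslash escapes, and ignores every other aspect of the real rule parser.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified rule splitter for illustration; this is not the ICU rule parser.
final class RuleSplitterSketch {
    static List<String> split(String ruleSet) {
        // Drop lines whose first non-blank character is '#'.
        StringBuilder text = new StringBuilder();
        for (String line : ruleSet.split("\n")) {
            if (!line.trim().startsWith("#")) {
                text.append(line).append('\n');
            }
        }
        List<String> rules = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                // Keep the escaped pair together so "\;" does not end the rule.
                current.append(c).append(text.charAt(++i));
            } else if (c == ';') {
                addIfNotBlank(rules, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotBlank(rules, current);
        return rules;
    }

    private static void addIfNotBlank(List<String> rules, StringBuilder rule) {
        String trimmed = rule.toString().trim();
        if (!trimmed.isEmpty()) {
            rules.add(trimmed);
        }
    }
}
```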

Each set of rules consists of two groups, one forward, and one reverse. This is a convention that is not enforced; rules for one direction may be omitted, with the result that transliteration in that direction will not modify the source text. In addition, bidirectional forward-reverse rules may be specified for symmetrical transformations.

Note: Another description of the Transliterator rule syntax is available in section Transform Rules Syntax of UTS #35: Unicode LDML. The rules are shown there using arrow symbols ← and → and ↔. ICU supports both those and the equivalent ASCII symbols &lt; and &gt; and &lt;&gt;.

Rule statements take one of the following forms:

<dl> <dt>$alefmadda=\\u0622;</dt> <dd><strong>Variable definition.</strong> The name on the left is assigned the text on the right. In this example, after this statement, instances of the left hand name, &quot;$alefmadda&quot;, will be replaced by the Unicode character U+0622. Variable names must begin with a letter and consist only of letters, digits, and underscores. Case is significant. Duplicate names cause an exception to be thrown, that is, variables cannot be redefined. The right hand side may contain well-formed text of any length, including no text at all (&quot;$empty=;&quot;). The right hand side may contain embedded UnicodeSet patterns, for example, &quot;$softvowel=[eiyEIY]&quot;.</dd> <dt>ai&gt;$alefmadda;</dt> <dd><strong>Forward translation rule.</strong> This rule states that the string on the left will be changed to the string on the right when performing forward transliteration.</dd> <dt>ai&lt;$alefmadda;</dt> <dd><strong>Reverse translation rule.</strong> This rule states that the string on the right will be changed to the string on the left when performing reverse transliteration.</dd> </dl>

<dl> <dt>ai&lt;&gt;$alefmadda;</dt> <dd><strong>Bidirectional translation rule.</strong> This rule states that the string on the left will be changed to the string on the right when performing forward transliteration, and vice versa when performing reverse transliteration.</dd> </dl>
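The four statement forms can be told apart by their operator. The sketch below is a rough classifier under that assumption; `RuleKindSketch` is a hypothetical helper that looks only at the ASCII operators and does not handle quoting or escapes.

```java
// Rough classifier for the four statement forms; this is not the ICU rule parser.
final class RuleKindSketch {
    enum Kind { VARIABLE_DEFINITION, FORWARD, REVERSE, BIDIRECTIONAL }

    static Kind classify(String rule) {
        // Check "<>" first, because it contains both single-direction operators.
        if (rule.contains("<>")) {
            return Kind.BIDIRECTIONAL;
        }
        if (rule.contains(">")) {
            return Kind.FORWARD;
        }
        if (rule.contains("<")) {
            return Kind.REVERSE;
        }
        if (rule.startsWith("$") && rule.contains("=")) {
            return Kind.VARIABLE_DEFINITION;
        }
        throw new IllegalArgumentException("Unrecognized rule: " + rule);
    }
}
```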

Translation rules consist of a <em>match pattern</em> and an <em>output string</em>. The match pattern consists of literal characters, optionally preceded by context, and optionally followed by context. Context characters, like literal pattern characters, must be matched in the text being transliterated. However, unlike literal pattern characters, they are not replaced by the output text. For example, the pattern &quot;abc{def}&quot; indicates the characters &quot;def&quot; must be preceded by &quot;abc&quot; for a successful match. If there is a successful match, &quot;def&quot; will be replaced, but not &quot;abc&quot;. The final '}' is optional, so &quot;abc{def&quot; is equivalent to &quot;abc{def}&quot;. Another example is &quot;{123}456&quot; (or &quot;123}456&quot;) in which the literal pattern &quot;123&quot; must be followed by &quot;456&quot;.
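A minimal sketch of separating a match pattern into its preceding context, literal pattern, and following context, using the brace conventions just described. `MatchPatternSketch` and its field names are hypothetical, and the sketch does not handle escaped or quoted braces.

```java
// Hypothetical helper that splits a match pattern at its braces; not part of the ICU API.
final class MatchPatternSketch {
    final String before; // context that must precede the match
    final String key;    // literal pattern characters that get replaced
    final String after;  // context that must follow the match

    private MatchPatternSketch(String before, String key, String after) {
        this.before = before;
        this.key = key;
        this.after = after;
    }

    static MatchPatternSketch parse(String pattern) {
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}');
        // A missing '{' means no preceding context; a missing '}' means no following context.
        String before = open >= 0 ? pattern.substring(0, open) : "";
        int keyStart = open >= 0 ? open + 1 : 0;
        int keyEnd = close >= 0 ? close : pattern.length();
        String after = close >= 0 ? pattern.substring(close + 1) : "";
        return new MatchPatternSketch(before, pattern.substring(keyStart, keyEnd), after);
    }
}
```

Both "abc{def}" and "abc{def" parse to the same three parts, as do "{123}456" and "123}456".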

The output string of a forward or reverse rule consists of characters to replace the literal pattern characters. If the output string contains the character '|', this is taken to indicate the location of the <em>cursor</em> after replacement. The cursor is the point in the text at which the next replacement, if any, will be applied. The cursor is usually placed within the replacement text; however, it can also be placed into the preceding or following context by using a special character.

Java documentation for android.icu.text.Transliterator.

Portions of this page are modifications based on work created and shared by the Android Open Source Project and used according to terms described in the Creative Commons 2.5 Attribution License.

Constructors

Transliterator(IntPtr, JniHandleOwnership)

Fields

Forward
Obsolete.

Direction constant indicating the forward direction in a transliterator.

Reverse
Obsolete.

Direction constant indicating the reverse direction in a transliterator.

Properties

AvailableIDs

Returns an enumeration over the programmatic names of registered Transliterator objects.

AvailableSources

Returns an enumeration over the source names of registered transliterators.

Class

Returns the runtime class of this Object.

(Inherited from Object)
Filter

Returns the filter used by this transliterator, or null if this transliterator uses no filter. -or- Changes the filter used by this transliterator.

Handle

The handle to the underlying Android instance.

(Inherited from Object)
ID

Returns a programmatic identifier for this transliterator.

Inverse

Returns this transliterator's inverse.

JniIdentityHashCode (Inherited from Object)
JniPeerMembers
MaximumContextLength

Returns the length of the longest context required by this transliterator.

PeerReference (Inherited from Object)
SourceSet

Returns the set of all characters that may be modified in the input text by this Transliterator.

TargetSet

Returns the set of all characters that may be generated as replacement text by this transliterator.

ThresholdClass
ThresholdType

Methods

Clone()

Creates and returns a copy of this object.

(Inherited from Object)
CreateFromRules(String, String, DirectionOptions)

Returns a Transliterator object constructed from the given rule string.

Dispose() (Inherited from Object)
Dispose(Boolean) (Inherited from Object)
Equals(Object)

Indicates whether some other object is "equal to" this one.

(Inherited from Object)
FilteredTransliterate(IReplaceable, Transliterator+Position, Boolean)
FinishTransliteration(IReplaceable, Transliterator+Position)
GetAvailableTargets(String)

Returns an enumeration over the target names of registered transliterators having a given source name.

GetAvailableVariants(String, String)

Returns an enumeration over the variant names of registered transliterators having a given source name and target name.

GetDisplayName(String, Locale)

Returns a name for this transliterator that is appropriate for display to the user in the given locale.

GetDisplayName(String, ULocale)

Returns a name for this transliterator that is appropriate for display to the user in the given locale.

GetDisplayName(String)

Returns a name for this transliterator that is appropriate for display to the user in the default DISPLAY locale.

GetElements()

Returns the elements that make up this transliterator.

GetHashCode()

Returns a hash code value for the object.

(Inherited from Object)
GetInstance(String, Int32)

Returns a Transliterator object given its ID.

GetInstance(String)

Returns a Transliterator object given its ID.

JavaFinalize()

Called by the garbage collector on an object when garbage collection determines that there are no more references to the object.

(Inherited from Object)
Notify()

Wakes up a single thread that is waiting on this object's monitor.

(Inherited from Object)
NotifyAll()

Wakes up all threads that are waiting on this object's monitor.

(Inherited from Object)
SetHandle(IntPtr, JniHandleOwnership)

Sets the Handle property.

(Inherited from Object)
ToArray<T>() (Inherited from Object)
ToRules(Boolean)

Returns a rule string for this transliterator.

ToString()

Returns a string representation of the object.

(Inherited from Object)
Transliterate(IReplaceable, Int32, Int32)

Transliterates a segment of a string, with optional filtering.

Transliterate(IReplaceable, Transliterator+Position, Int32)
Transliterate(IReplaceable, Transliterator+Position, String)
Transliterate(IReplaceable, Transliterator+Position)
Transliterate(IReplaceable)

Transliterates an entire string in place.

Transliterate(String)

Transliterates an entire string and returns the result.

UnregisterFromRuntime() (Inherited from Object)
Wait()

Causes the current thread to wait until it is awakened, typically by being <em>notified</em> or <em>interrupted</em>.

(Inherited from Object)
Wait(Int64, Int32)

Causes the current thread to wait until it is awakened, typically by being <em>notified</em> or <em>interrupted</em>, or until a certain amount of real time has elapsed.

(Inherited from Object)
Wait(Int64)

Causes the current thread to wait until it is awakened, typically by being <em>notified</em> or <em>interrupted</em>, or until a certain amount of real time has elapsed.

(Inherited from Object)

Explicit Interface Implementations

IJavaPeerable.Disposed() (Inherited from Object)
IJavaPeerable.DisposeUnlessReferenced() (Inherited from Object)
IJavaPeerable.Finalized() (Inherited from Object)
IJavaPeerable.JniManagedPeerState (Inherited from Object)
IJavaPeerable.SetJniIdentityHashCode(Int32) (Inherited from Object)
IJavaPeerable.SetJniManagedPeerState(JniManagedPeerStates) (Inherited from Object)
IJavaPeerable.SetPeerReference(JniObjectReference) (Inherited from Object)

Extension Methods

JavaCast<TResult>(IJavaObject)

Performs an Android runtime-checked type conversion.

JavaCast<TResult>(IJavaObject)
GetJniTypeName(IJavaPeerable)

Gets the JNI name of the type of the instance self.

JavaAs<TResult>(IJavaPeerable)

Try to coerce self to type TResult, checking that the coercion is valid on the Java side.

TryJavaCast<TResult>(IJavaPeerable, TResult)

Try to coerce self to type TResult, checking that the coercion is valid on the Java side.
