Indexes - Analyze

Reference

Service: Search Service

API Version: 2024-07-01

Shows how an analyzer breaks text into tokens.

POST {endpoint}/indexes('{indexName}')/search.analyze?api-version=2024-07-01

URI Parameters

Name	In	Required	Type	Description
endpoint	path	True	string	The endpoint URL of the search service.
indexName	path	True	string	The name of the index for which to test an analyzer.
api-version	query	True	string	Client API version.
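
The URI template above can be filled in programmatically. The following is a minimal sketch in Python, not an official client: the helper name build_analyze_url is hypothetical, and percent-encoding the index name is an assumption made for safety rather than something this reference requires.

```python
from urllib.parse import quote, urlencode

API_VERSION = "2024-07-01"


def build_analyze_url(endpoint: str, index_name: str,
                      api_version: str = API_VERSION) -> str:
    """Return the Analyze URL for a service endpoint and index name."""
    # Drop a trailing slash so the path is not doubled up.
    base = endpoint.rstrip("/")
    # Assumption: percent-encode the index name before placing it in the path.
    name = quote(index_name, safe="")
    # api-version is the only query parameter listed in the table above.
    query = urlencode({"api-version": api_version})
    return f"{base}/indexes('{name}')/search.analyze?{query}"


print(build_analyze_url("https://myservice.search.windows.net", "hotels"))
```

With the endpoint and index name used in the sample request, this produces the same URL as the sample.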

Request Header

Name	Required	Type	Description
x-ms-client-request-id		string (uuid)	The tracking ID sent with the request to help with debugging.
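
As an illustration, the optional tracking header can be generated on the client and attached to the request. The sketch below only builds a urllib.request.Request object and does not send it. The helper name build_analyze_request is hypothetical, and the api-key and Content-Type headers are assumptions about how the service is called; they are not listed in the table above.

```python
import json
import urllib.request
import uuid


def build_analyze_request(url: str, body: dict, api_key: str,
                          request_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a tracking ID."""
    headers = {
        "Content-Type": "application/json",    # assumption: JSON request body
        "api-key": api_key,                    # assumption: key-based auth
        "x-ms-client-request-id": request_id,  # header from the table above
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# A fresh UUID per request makes it easier to correlate client and service logs.
request_id = str(uuid.uuid4())
req = build_analyze_request(
    "https://myservice.search.windows.net/indexes('hotels')/search.analyze"
    "?api-version=2024-07-01",
    {"text": "Text to analyze", "analyzer": "standard.lucene"},
    api_key="<your-api-key>",  # placeholder, not a real key
    request_id=request_id,
)
print(req.get_method(), req.full_url)
print("tracking ID:", request_id)
# To send the request: urllib.request.urlopen(req)  (not executed here)
```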

Request Body

Name	Required	Type	Description
text	True	string	The text to break into tokens.
analyzer		LexicalAnalyzerName	The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive.
charFilters		CharFilterName[]	An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenFilters		TokenFilterName[]	An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenizer		LexicalTokenizerName	The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive.
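
The constraints in this table (exactly one of analyzer or tokenizer, and filters only together with tokenizer) can be checked on the client before a request is sent. The following is a minimal sketch; build_analyze_body is a hypothetical helper, and the service remains the authority on what it accepts.

```python
import json
from typing import Optional, Sequence


def build_analyze_body(
    text: str,
    analyzer: Optional[str] = None,
    tokenizer: Optional[str] = None,
    token_filters: Sequence[str] = (),
    char_filters: Sequence[str] = (),
) -> dict:
    """Build a request body, enforcing the rules described in the table."""
    # analyzer and tokenizer are mutually exclusive, and one is required.
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("Specify exactly one of 'analyzer' or 'tokenizer'.")
    # Filters can only be set when the tokenizer parameter is used.
    if analyzer is not None and (token_filters or char_filters):
        raise ValueError("'tokenFilters' and 'charFilters' require 'tokenizer'.")

    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
        if token_filters:
            body["tokenFilters"] = list(token_filters)
        if char_filters:
            body["charFilters"] = list(char_filters)
    return body


# Analyzer-based request, as in the sample request on this page.
print(json.dumps(build_analyze_body("Text to analyze", analyzer="standard.lucene")))

# Tokenizer-based request that combines a character filter and a token filter.
print(json.dumps(build_analyze_body(
    "<b>Hello</b> World",
    tokenizer="whitespace",
    token_filters=["lowercase"],
    char_filters=["html_strip"],
)))
```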

Responses

Name	Type	Description
200 OK	AnalyzeResult
Other Status Codes	ErrorResponse	Error response.

Examples

SearchServiceIndexAnalyze

Sample request

HTTP

POST https://myservice.search.windows.net/indexes('hotels')/search.analyze?api-version=2024-07-01

{
  "text": "Text to analyze",
  "analyzer": "standard.lucene"
}

Sample response

Status code: 200

{
  "tokens": [
    {
      "token": "text",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    },
    {
      "token": "to",
      "startOffset": 5,
      "endOffset": 7,
      "position": 1
    },
    {
      "token": "analyze",
      "startOffset": 8,
      "endOffset": 15,
      "position": 2
    }
  ]
}
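
To see how the offsets relate to the input, the response can be mapped back onto the original text. This sketch assumes the offsets can be used directly as Python string indices, which holds for this ASCII sample; the helper name surface_forms is hypothetical.

```python
import json

text = "Text to analyze"

# The sample response from this page, abbreviated to one token per line.
response_json = """
{
  "tokens": [
    {"token": "text", "startOffset": 0, "endOffset": 4, "position": 0},
    {"token": "to", "startOffset": 5, "endOffset": 7, "position": 1},
    {"token": "analyze", "startOffset": 8, "endOffset": 15, "position": 2}
  ]
}
"""


def surface_forms(source: str, result: dict) -> list:
    """Pair each emitted token with the slice of input it was derived from."""
    return [
        (tok["token"], source[tok["startOffset"]:tok["endOffset"]])
        for tok in result["tokens"]
    ]


for token, surface in surface_forms(text, json.loads(response_json)):
    print(f"{token!r} <- {surface!r}")
```

The first pair is 'text' and 'Text': the standard.lucene analyzer lowercased the token, while the offsets still point at the original, capitalized input.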

Definitions

Name	Description
AnalyzedTokenInfo	Information about a token returned by an analyzer.
AnalyzeRequest	Specifies some text and analysis components used to break that text into tokens.
AnalyzeResult	The result of testing an analyzer on text.
CharFilterName	Defines the names of all character filters supported by the search engine.
ErrorAdditionalInfo	The resource management error additional info.
ErrorDetail	The error detail.
ErrorResponse	Error response.
LexicalAnalyzerName	Defines the names of all text analyzers supported by the search engine.
LexicalTokenizerName	Defines the names of all tokenizers supported by the search engine.
TokenFilterName	Defines the names of all token filters supported by the search engine.

AnalyzedTokenInfo

Information about a token returned by an analyzer.

Name	Type	Description
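endOffset	integer	The end offset of the token in the input text. In the sample response, this value is one past the token's last character (for example, 4 for "text", which starts at offset 0).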
position	integer	The position of the token in the input text relative to other tokens. The first token in the input text has position 0, the next has position 1, and so on. Depending on the analyzer used, some tokens might have the same position, for example if they are synonyms of each other.
startOffset	integer	The index of the first character of the token in the input text.
token	string	The token returned by the analyzer.
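
Because several tokens can share a position, it can help to group the returned tokens by position when inspecting analyzer output. The sketch below uses made-up token data (the quick/fast pair is a hypothetical synonym expansion, not output from a real analyzer), and group_by_position is a hypothetical helper.

```python
from collections import defaultdict


def group_by_position(tokens: list) -> list:
    """Return the token strings grouped by position, in position order."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok["position"]].append(tok["token"])
    return [groups[pos] for pos in sorted(groups)]


# Hypothetical AnalyzedTokenInfo entries: two tokens share position 0.
tokens = [
    {"token": "quick", "startOffset": 0, "endOffset": 5, "position": 0},
    {"token": "fast", "startOffset": 0, "endOffset": 5, "position": 0},
    {"token": "car", "startOffset": 6, "endOffset": 9, "position": 1},
]

print(group_by_position(tokens))
```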

AnalyzeRequest

Specifies some text and analysis components used to break that text into tokens.

Name	Type	Description
analyzer	LexicalAnalyzerName	The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive.
charFilters	CharFilterName[]	An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
text	string	The text to break into tokens.
tokenFilters	TokenFilterName[]	An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenizer	LexicalTokenizerName	The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive.

AnalyzeResult

The result of testing an analyzer on text.

Name	Type	Description
tokens	AnalyzedTokenInfo[]	The list of tokens returned by the analyzer specified in the request.

CharFilterName

Defines the names of all character filters supported by the search engine.

Name	Type	Description
html_strip	string	A character filter that attempts to strip out HTML constructs. See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html

ErrorAdditionalInfo

The resource management error additional info.

Name	Type	Description
info	object	The additional info.
type	string	The additional info type.

ErrorDetail

The error detail.

Name	Type	Description
additionalInfo	ErrorAdditionalInfo[]	The error additional info.
code	string	The error code.
details	ErrorDetail[]	The error details.
message	string	The error message.
target	string	The error target.

ErrorResponse

Error response.

Name	Type	Description
error	ErrorDetail	The error object.
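
Since ErrorDetail can nest further ErrorDetail entries under details, a client may want to flatten an error response into readable lines. The following is a minimal sketch; the payload is invented for illustration (its codes and messages are hypothetical, not values the service is known to return), and flatten_error is a hypothetical helper.

```python
def flatten_error(detail: dict, depth: int = 0) -> list:
    """Walk an ErrorDetail and its nested details, one line per entry."""
    line = "  " * depth + f"{detail.get('code', '')}: {detail.get('message', '')}"
    if detail.get("target"):
        line += f" (target: {detail['target']})"
    lines = [line]
    # 'details' holds nested ErrorDetail objects; it may be absent or empty.
    for child in detail.get("details") or []:
        lines.extend(flatten_error(child, depth + 1))
    return lines


# Hypothetical ErrorResponse payload, shaped like the definitions above.
error_response = {
    "error": {
        "code": "InvalidRequest",
        "message": "The request is invalid.",
        "details": [
            {
                "code": "MissingField",
                "message": "'text' is required.",
                "target": "text",
            }
        ],
    }
}

for line in flatten_error(error_response["error"]):
    print(line)
```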

LexicalAnalyzerName

Defines the names of all text analyzers supported by the search engine.

Name	Type	Description
ar.lucene	string	Lucene analyzer for Arabic.
ar.microsoft	string	Microsoft analyzer for Arabic.
bg.lucene	string	Lucene analyzer for Bulgarian.
bg.microsoft	string	Microsoft analyzer for Bulgarian.
bn.microsoft	string	Microsoft analyzer for Bangla.
ca.lucene	string	Lucene analyzer for Catalan.
ca.microsoft	string	Microsoft analyzer for Catalan.
cs.lucene	string	Lucene analyzer for Czech.
cs.microsoft	string	Microsoft analyzer for Czech.
da.lucene	string	Lucene analyzer for Danish.
da.microsoft	string	Microsoft analyzer for Danish.
de.lucene	string	Lucene analyzer for German.
de.microsoft	string	Microsoft analyzer for German.
el.lucene	string	Lucene analyzer for Greek.
el.microsoft	string	Microsoft analyzer for Greek.
en.lucene	string	Lucene analyzer for English.
en.microsoft	string	Microsoft analyzer for English.
es.lucene	string	Lucene analyzer for Spanish.
es.microsoft	string	Microsoft analyzer for Spanish.
et.microsoft	string	Microsoft analyzer for Estonian.
eu.lucene	string	Lucene analyzer for Basque.
fa.lucene	string	Lucene analyzer for Persian.
fi.lucene	string	Lucene analyzer for Finnish.
fi.microsoft	string	Microsoft analyzer for Finnish.
fr.lucene	string	Lucene analyzer for French.
fr.microsoft	string	Microsoft analyzer for French.
ga.lucene	string	Lucene analyzer for Irish.
gl.lucene	string	Lucene analyzer for Galician.
gu.microsoft	string	Microsoft analyzer for Gujarati.
he.microsoft	string	Microsoft analyzer for Hebrew.
hi.lucene	string	Lucene analyzer for Hindi.
hi.microsoft	string	Microsoft analyzer for Hindi.
hr.microsoft	string	Microsoft analyzer for Croatian.
hu.lucene	string	Lucene analyzer for Hungarian.
hu.microsoft	string	Microsoft analyzer for Hungarian.
hy.lucene	string	Lucene analyzer for Armenian.
id.lucene	string	Lucene analyzer for Indonesian.
id.microsoft	string	Microsoft analyzer for Indonesian (Bahasa).
is.microsoft	string	Microsoft analyzer for Icelandic.
it.lucene	string	Lucene analyzer for Italian.
it.microsoft	string	Microsoft analyzer for Italian.
ja.lucene	string	Lucene analyzer for Japanese.
ja.microsoft	string	Microsoft analyzer for Japanese.
keyword	string	Treats the entire content of a field as a single token. This is useful for data like zip codes, IDs, and some product names. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html
kn.microsoft	string	Microsoft analyzer for Kannada.
ko.lucene	string	Lucene analyzer for Korean.
ko.microsoft	string	Microsoft analyzer for Korean.
lt.microsoft	string	Microsoft analyzer for Lithuanian.
lv.lucene	string	Lucene analyzer for Latvian.
lv.microsoft	string	Microsoft analyzer for Latvian.
ml.microsoft	string	Microsoft analyzer for Malayalam.
mr.microsoft	string	Microsoft analyzer for Marathi.
ms.microsoft	string	Microsoft analyzer for Malay (Latin).
nb.microsoft	string	Microsoft analyzer for Norwegian (Bokmål).
nl.lucene	string	Lucene analyzer for Dutch.
nl.microsoft	string	Microsoft analyzer for Dutch.
no.lucene	string	Lucene analyzer for Norwegian.
pa.microsoft	string	Microsoft analyzer for Punjabi.
pattern	string	Flexibly separates text into terms via a regular expression pattern. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html
pl.lucene	string	Lucene analyzer for Polish.
pl.microsoft	string	Microsoft analyzer for Polish.
pt-BR.lucene	string	Lucene analyzer for Portuguese (Brazil).
pt-BR.microsoft	string	Microsoft analyzer for Portuguese (Brazil).
pt-PT.lucene	string	Lucene analyzer for Portuguese (Portugal).
pt-PT.microsoft	string	Microsoft analyzer for Portuguese (Portugal).
ro.lucene	string	Lucene analyzer for Romanian.
ro.microsoft	string	Microsoft analyzer for Romanian.
ru.lucene	string	Lucene analyzer for Russian.
ru.microsoft	string	Microsoft analyzer for Russian.
simple	string	Divides text at non-letters and converts the resulting tokens to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/SimpleAnalyzer.html
sk.microsoft	string	Microsoft analyzer for Slovak.
sl.microsoft	string	Microsoft analyzer for Slovenian.
sr-cyrillic.microsoft	string	Microsoft analyzer for Serbian (Cyrillic).
sr-latin.microsoft	string	Microsoft analyzer for Serbian (Latin).
standard.lucene	string	Standard Lucene analyzer.
standardasciifolding.lucene	string	Standard ASCII Folding Lucene analyzer. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#Analyzers
stop	string	Divides text at non-letters and applies the lowercase and stopword token filters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopAnalyzer.html
sv.lucene	string	Lucene analyzer for Swedish.
sv.microsoft	string	Microsoft analyzer for Swedish.
ta.microsoft	string	Microsoft analyzer for Tamil.
te.microsoft	string	Microsoft analyzer for Telugu.
th.lucene	string	Lucene analyzer for Thai.
th.microsoft	string	Microsoft analyzer for Thai.
tr.lucene	string	Lucene analyzer for Turkish.
tr.microsoft	string	Microsoft analyzer for Turkish.
uk.microsoft	string	Microsoft analyzer for Ukrainian.
ur.microsoft	string	Microsoft analyzer for Urdu.
vi.microsoft	string	Microsoft analyzer for Vietnamese.
whitespace	string	An analyzer that uses the whitespace tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html
zh-Hans.lucene	string	Lucene analyzer for Chinese (Simplified).
zh-Hans.microsoft	string	Microsoft analyzer for Chinese (Simplified).
zh-Hant.lucene	string	Lucene analyzer for Chinese (Traditional).
zh-Hant.microsoft	string	Microsoft analyzer for Chinese (Traditional).

LexicalTokenizerName

Defines the names of all tokenizers supported by the search engine.

Name	Type	Description
classic	string	Grammar-based tokenizer that is suitable for processing most European-language documents. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
edgeNGram	string	Tokenizes the input from an edge into n-grams of the given size(s). See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html
keyword_v2	string	Emits the entire input as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html
letter	string	Divides text at non-letters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizer.html
lowercase	string	Divides text at non-letters and converts the resulting tokens to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LowerCaseTokenizer.html
microsoft_language_stemming_tokenizer	string	Divides text using language-specific rules and reduces words to their base forms.
microsoft_language_tokenizer	string	Divides text using language-specific rules.
nGram	string	Tokenizes the input into n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html
path_hierarchy_v2	string	Tokenizer for path-like hierarchies. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html
pattern	string	Tokenizer that uses regex pattern matching to construct distinct tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html
standard_v2	string	Standard Lucene tokenizer. Breaks text following the Unicode Text Segmentation rules. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
uax_url_email	string	Tokenizes URLs and email addresses as one token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html
whitespace	string	Divides text at whitespace. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

TokenFilterName

Defines the names of all token filters supported by the search engine.

Name	Type	Description
apostrophe	string	Strips all characters after an apostrophe (including the apostrophe itself). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html
arabic_normalization	string	A token filter that applies the Arabic normalizer to normalize the orthography. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilter.html
asciifolding	string	Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html
cjk_bigram	string	Forms bigrams of CJK terms that are generated from the standard tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html
cjk_width	string	Normalizes CJK width differences. Folds fullwidth ASCII variants into the equivalent basic Latin, and half-width Katakana variants into the equivalent Kana. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html
classic	string	Removes English possessives, and dots from acronyms. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html
common_grams	string	Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html
edgeNGram_v2	string	Generates n-grams of the given size(s) starting from the front or the back of an input token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
elision	string	Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html
german_normalization	string	Normalizes German characters according to the heuristics of the German2 snowball algorithm. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
hindi_normalization	string	Normalizes text in Hindi to remove some differences in spelling variations. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html
indic_normalization	string	Normalizes the Unicode representation of text in Indian languages. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilter.html
keyword_repeat	string	Emits each incoming token twice, once as keyword and once as non-keyword. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html
kstem	string	A high-performance kstem filter for English. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html
length	string	Removes words that are too long or too short. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html
limit	string	Limits the number of tokens while indexing. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html
lowercase	string	Normalizes token text to lower case. See https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html
nGram_v2	string	Generates n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
persian_normalization	string	Applies normalization for Persian. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html
phonetic	string	Creates tokens for phonetic matches. See https://lucene.apache.org/core/4_10_3/analyzers-phonetic/org/apache/lucene/analysis/phonetic/package-tree.html
porter_stem	string	Uses the Porter stemming algorithm to transform the token stream. See http://tartarus.org/~martin/PorterStemmer
reverse	string	Reverses the token string. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html
scandinavian_folding	string	Folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html
scandinavian_normalization	string	Normalizes use of the interchangeable Scandinavian characters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html
shingle	string	Creates combinations of tokens as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
snowball	string	A filter that stems words using a Snowball-generated stemmer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html
sorani_normalization	string	Normalizes the Unicode representation of Sorani text. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html
stemmer	string	Language specific stemming filter. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#TokenFilters
stopwords	string	Removes stop words from a token stream. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html
trim	string	Trims leading and trailing whitespace from tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html
truncate	string	Truncates the terms to a specific length. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html
unique	string	Filters out tokens with same text as the previous token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html
uppercase	string	Normalizes token text to upper case. See https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html
word_delimiter	string	Splits words into subwords and performs optional transformations on subword groups.
