Dela via


U-SQL Built-in Extractors

Summary

U-SQL provides a built-in extractor class called Extractors that provides the following three built-in extractors to generate a rowset from the input file or files:

  • Extractors.Text() : Provides extraction from delimited text files of different encodings.
  • Extractors.Csv() : Provides extraction from comma-separated value (CSV) files of different encodings.
  • Extractors.Tsv() : Provides extraction from tab-separated value (TSV) files of different encodings.

Technically speaking these are factory methods that generate an instance of the IExtractor class and they can be used in the USING clause of the EXTRACT expression. Since they create the extractor object, one does not need to call them with new.

The Csv() and Tsv() extractors are special versions of the generic Text() extractor where the delimiter has been fixed to comma and tab respectively.

If the EXTRACT expression specifies a file set pattern, then the extractor parameters will be applied to all the selected files equally. If different files require different parameter values, then different EXTRACT expressions need to be used.

Built-in Extractor Processing Model

The built-in extractors transforms a byte stream in parallel into a rowset that can be further processed with U-SQL statements. The following figure provides a logical view of the processing model (that in turn is based on the general UDO Extractor processing model).

Processing model for the built-in extractor processing model

If the maximal input row length or the maximal output row length are being exceeded, errors are raised.

Built-in Type Conversions

The extractors will convert the string values val in the stream to an instance of the specified type in the extractor schema after the processing of encoding and escaped values have occurred.

Per default, an empty field is mapped to a zero-length string if target type is string and null otherwise. Most types follow the standard C# <Type>.Parse(val) behaviour without a specific culture behaviour and being defaulted to the cluster machine’s locale. The table below provides more details for each type.

Tip

Since the built-in extractors are implemented natively, the conversions may differ in small details from the U-SQL/C# conversion semantics. Some differences cannot be avoided, primarily around floating point values where minute differences may be present.

The following data types (and their nullable variant) are supported by the built-in extractors. Any data type that is not listed and is not supported by the extractor (such as SQL.MAP and SQL.ARRAY) either needs to be converted in a subsequent SELECT statement or a user-defined extractor has to be written.

Type Conversion
bool bool.Parse(val).
val can be either "true" or "false" or any casing thereof and maps to true or false respectively.
sbyte sbyte.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: '-'
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise appropriate runtime errors will be raised.
byte byte.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: N/A
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Where sign is only '+'. Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
short short.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: '-'
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
ushort ushort.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: N/A
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Where sign is only '+'. Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
int int.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: '-'
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
uint uint.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: N/A
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Where sign is only '+'. Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
long long.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: '-'
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
ulong ulong.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: N/A
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [whitespace]
Where sign is only '+'. Note there is no whitespace allowed between the sign and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. Otherwise apropriate runtime errors will be raised.
decimal decimal.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us", which includes
   ○ NumberFormatInfo.PositiveSign: '+'
   ○ NumberFormatInfo.NegativeSign: '-'
 ● NumberStyle.Integer that includes:
   ○ NumberStyles.AllowLeadingWhite
   ○ NumberStyles.AllowTrailingWhite
   ○ NumberStyles.AllowLeadingSign
   ○ NumberStyles.AllowDecimalPoint
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign (note: decimal.Parse() supports this per default!)
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands (note: decimal.Parse() supports this per default!)
 ● NumberStyles.AllowExponent
 ● NumberStyles.AllowHexSpecifier
resulting in the following grammar for the supported numbers lexical representations:
[whitespace] [sign] [digits] [[decimal-point] digits] [whitespace]
Note there is no whitespace allowed between the sign, decimal-point (if present) and the digits. If val satisfy the grammar and fits into the range of the type, it will successfully converted. If val represents more than the decimal type’s 29 digits of precision, has a fractional part and is within the range of MaxValue and MinValue, the number is rounded, not truncated, to 29 digits using rounding to nearest. Otherwise apropriate runtime errors will be raised.
float float.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us"
 ● NumberFormatInfo.PositiveInfinitySymbol: Infinity
 ● NumberFormatInfo.NegativeInfinitySymbol: -Infinity
 ● NumberFormatInfo.NaNSymbol: NaN
 ● NumberStyles.AllowLeadingWhite
 ● NumberStyles.AllowTrailingWhite
 ● NumberStyles.AllowLeadingSign
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowExponent
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands (note: double.Parse() supports this)
resulting in the following grammar for the supported numbers lexical representations:
[sign] "Infinity" | "NaN" | [whitespace] [sign] [digits] [[decimal-point] digits] [ ("d" | "D" | "e" | "E") [sign] digits] [whitespace]
If no digits appear before the decimal point character, at least one must appear after the decimal point character. If neither an exponent part nor a decimal point character appears, a decimal point character is assumed to follow the last digit in the string. Values that are too large or small will raise a runtime error, values who are smaller than the supported precision will get rounded to 0 (but not -0) as in C#. Currently -0 is mapped to 0.
double double.Parse(val) with the following NumberFormatInfo and NumberStyles:
 ● Cultural locale is "en-us"
 ● NumberFormatInfo.PositiveInfinitySymbol: Infinity
 ● NumberFormatInfo.NegativeInfinitySymbol: -Infinity
 ● NumberFormatInfo.NaNSymbol: NaN
 ● NumberStyles.AllowLeadingWhite
 ● NumberStyles.AllowTrailingWhite
 ● NumberStyles.AllowLeadingSign
 ● NumberStyles.AllowDecimalPoint
 ● NumberStyles.AllowExponent
In particular, the following NumberStyles are not supported:
 ● NumberStyles.AllowCurrencySymbol
 ● NumberStyles.AllowTrailingSign
 ● NumberStyles.AllowParentheses
 ● NumberStyles.AllowThousands (note: double.Parse() supports this)
resulting in the following grammar for the supported numbers lexical representations:
[sign] "Infinity" | "NaN" | [whitespace] [sign] [digits] [[decimal-point] digits] [ ("d" | "D" | "e" | "E") [sign] digits] [whitespace]
If no digits appear before the decimal point character, at least one must appear after the decimal point character. If neither an exponent part nor a decimal point character appears, a decimal point character is assumed to follow the last digit in the string. Values that are too large or small will raise a runtime error, values who are smaller than the supported precision will get rounded to 0 (but not -0) as in C#. Currently -0 is mapped to 0.
string string values are read as UTF-16 encoded strings.
char a character is read as a Unicode 16-bit codepoint (this means including single surrogate pair codepoints that are not valid Unicode characters). It supports the following lexical representations on input:

 1.  Base-10 Integer numbers. A number will map to the Unicode codepoint of that number. E.g., 32 will map to space.
 2. Unicode codepoints of the range 0x0000 to 0xFFFF.
If the character is outside the valid range, an error will be raised.
byte[] It will read hexadecimal numbers into a byte array. The lexical representation in the file has to correspond the following grammar:
byte_array = {byte}.
byte = hexcode hexcode.
hexcode = '0…9' | 'A…F' | 'a…f'.
If the number of hex codes is odd, then an error is raised.
DateTime DateTime.Parse(val) with the following:

This means a variety of formats are supported, including (note optionality and short forms are not called out):
 ● UTC format: YYYY-MM-DDThh:mm:ss.nnnn[TZ]
 ● Day, DD MMM YYYY hh:mm:ss [AM/PM] [verbal TZ]
 ● MM/DD/YYYY [hh:mm:ss [AM/PM]]
 ● MM.DD.YYYY [hh:mm:ss [AM/PM]]
The hour information can be given using a 12 or 24 hour clock. If a 12 hour clock is used, PM has to be specified to indicate time points in the second half of the day.
If parts of the date or time information is missing in the value, it will be defaulted as follows:
 ● If the month or day value is missing, it is defaulted to 1.
 ● If the year is missing, it is defaulted to the current year. In the non-UTC format, if the year is given with 2 digits, then the current century is assumed (e.g., 14 is transformed to 2014).
 ● If hours, minutes, seconds or subseconds are missing, they are defaulted to 0 (note that 0 hours is often represented as 12AM).
U-SQL’s built-in extractors normalize all date time values to UTC -07:00 and then drop timezone information if present.
Guid The following lexical representations can be converted into a GUID:
'{' byte4 '-' byte2 '-' byte2 '-' byte2 '-' byte4 '}' |
  byte4 '-' byte2 '-' byte2 '-' byte2 '-' byte4 |
  byte4 byte2 byte2 byte2 byte4.
byte4 = byte2 byte2.
byte2 = byte byte.
byte = hexcode hexcode.
hexcode = '0…9' | 'A…F' | 'a…f'.
This means the following three guids are all valid lexical representations:
F9168C5E-CEB2-4faa-B6BF-329BF39FA1E4
{F9168C5E-CEB2-4faa-B6BF-329BF39FA1E4}
F9168C5ECEB24faaB6BF329BF39FA1E4

But the following are not:
{F9168C5ECEB24faaB6BF329BF39FA1E4}
F9168C5E-CEB24faaB6BF329BF39FA1E4

See Also