Parsing Non-Standard Text File Formats with the Script Component

Applies to: SQL Server SSIS Integration Runtime in Azure Data Factory

When your source data is arranged in a non-standard format, you may find it more convenient to consolidate all your parsing logic in a single script than to chain together multiple Integration Services transformations to achieve the same result.

Example 1: Parsing Row-Delimited Records

Example 2: Splitting Parent and Child Records

Note

If you want to create a component that you can more easily reuse across multiple Data Flow tasks and multiple packages, consider using the code in this Script component sample as the starting point for a custom data flow component. For more information, see Developing a Custom Data Flow Component.

Example 1: Parsing Row-Delimited Records

This example shows how to take a text file in which each column of data appears on a separate line and parse it into a destination table by using the Script component.

For more information about how to configure the Script component for use as a transformation in the data flow, see Creating a Synchronous Transformation with the Script Component and Creating an Asynchronous Transformation with the Script Component.

To configure this Script Component example

  1. Create and save a text file named rowdelimiteddata.txt that contains the following source data:

    FirstName: Nancy  
    LastName: Davolio  
    Title: Sales Representative  
    City: Seattle  
    StateProvince: WA  
    
    FirstName: Andrew  
    LastName: Fuller  
    Title: Vice President, Sales  
    City: Tacoma  
    StateProvince: WA  
    
    FirstName: Steven  
    LastName: Buchanan  
    Title: Sales Manager  
    City: London  
    StateProvince:  
    
    
  2. Open Management Studio and connect to an instance of SQL Server.

  3. Select a destination database, and open a new query window. In the query window, execute the following script to create the destination table:

    create table RowDelimitedData  
    (  
    FirstName varchar(32),  
    LastName varchar(32),  
    Title varchar(32),  
    City varchar(32),  
    StateProvince varchar(32)  
    )  
    
    
  4. Open SQL Server Data Tools and create a new Integration Services package named ParseRowDelim.dtsx.

  5. Add a Flat File connection manager to the package, name it RowDelimitedData, and configure it to connect to the rowdelimiteddata.txt file that you created in a previous step.

  6. Add an OLE DB connection manager to the package and configure it to connect to the instance of SQL Server and the database in which you created the destination table.

  7. Add a Data Flow task to the package and click the Data Flow tab of SSIS Designer.

  8. Add a Flat File Source to the data flow and configure it to use the RowDelimitedData connection manager. On the Columns page of the Flat File Source Editor, select the single available external column.

  9. Add a Script Component to the data flow and configure it as a transformation. Connect the output of the Flat File Source to the Script Component.

  10. Double-click the Script component to display the Script Transformation Editor.

  11. On the Input Columns page of the Script Transformation Editor, select the single available input column.

  12. On the Inputs and Outputs page of the Script Transformation Editor, select Output 0 and set its SynchronousInputID to None. Create 5 output columns, all of type string [DT_STR] with a length of 32:

    • FirstName

    • LastName

    • Title

    • City

    • StateProvince

  13. On the Script page of the Script Transformation Editor, click Edit Script and enter the code shown in the ScriptMain class of the example. Close the script development environment and the Script Transformation Editor.

  14. Add a SQL Server Destination to the data flow. Configure it to use the OLE DB connection manager and the RowDelimitedData table. Connect the output of the Script Component to this destination.

  15. Run the package. After the package has finished, examine the records in the SQL Server destination table.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)  
  
    Dim columnName As String  
    Dim columnValue As String  
  
    ' Check for an empty row.  
    If Row.Column0.Trim.Length > 0 Then  
        columnName = Row.Column0.Substring(0, Row.Column0.IndexOf(":"))  
        ' Check for an empty value after the colon.  
        If Row.Column0.Substring(Row.Column0.IndexOf(":")).TrimEnd.Length > 1 Then  
            ' Extract the column value from after the colon and space.  
            columnValue = Row.Column0.Substring(Row.Column0.IndexOf(":") + 2)  
            Select Case columnName  
                Case "FirstName"  
                    ' The FirstName value indicates a new record.  
                    Me.Output0Buffer.AddRow()  
                    Me.Output0Buffer.FirstName = columnValue  
                Case "LastName"  
                    Me.Output0Buffer.LastName = columnValue  
                Case "Title"  
                    Me.Output0Buffer.Title = columnValue  
                Case "City"  
                    Me.Output0Buffer.City = columnValue  
                Case "StateProvince"  
                    Me.Output0Buffer.StateProvince = columnValue  
            End Select  
        End If  
    End If  
  
End Sub  
public override void Input0_ProcessInputRow(Input0Buffer Row)  
    {  
  
        string columnName;  
        string columnValue;  
  
        // Check for an empty row.  
        if (Row.Column0.Trim().Length > 0)  
        {  
            columnName = Row.Column0.Substring(0, Row.Column0.IndexOf(":"));  
            // Check for an empty value after the colon.  
            if (Row.Column0.Substring(Row.Column0.IndexOf(":")).TrimEnd().Length > 1)  
            // Extract the column value from after the colon and space.  
            {  
                columnValue = Row.Column0.Substring(Row.Column0.IndexOf(":") + 2);  
                switch (columnName)  
                {  
                    case "FirstName":  
                        // The FirstName value indicates a new record.  
                        this.Output0Buffer.AddRow();  
                        this.Output0Buffer.FirstName = columnValue;  
                        break;  
                    case "LastName":  
                        this.Output0Buffer.LastName = columnValue;  
                        break;  
                    case "Title":  
                        this.Output0Buffer.Title = columnValue;  
                        break;  
                    case "City":  
                        this.Output0Buffer.City = columnValue;  
                        break;  
                    case "StateProvince":  
                        this.Output0Buffer.StateProvince = columnValue;  
                        break;  
                }  
            }  
        }  
  
    }  

Example 2: Splitting Parent and Child Records

This example shows how to take a text file, in which a separator row precedes a parent record row that is followed by an indefinite number of child record rows, and parse it into properly normalized parent and child destination tables by using the Script component. This simple example could easily be adapted for source files that use more than one row or column for each parent and child record, as long as there is some way to identify the beginning and end of each record.

Caution

This sample is intended for demonstration purposes only. If you run the sample more than once, it inserts duplicate key values into the destination table.

For more information about how to configure the Script component for use as a transformation in the data flow, see Creating a Synchronous Transformation with the Script Component and Creating an Asynchronous Transformation with the Script Component.

To configure this Script Component example

  1. Create and save a text file named parentchilddata.txt that contains the following source data:

    **********  
    PARENT 1 DATA  
    child 1 data  
    child 2 data  
    child 3 data  
    child 4 data  
    **********  
    PARENT 2 DATA  
    child 5 data  
    child 6 data  
    child 7 data  
    child 8 data  
    **********  
    
    
  2. Open SQL Server Management Studio and connect to an instance of SQL Server.

  3. Select a destination database, and open a new query window. In the query window, execute the following script to create the destination tables:

    CREATE TABLE [dbo].[Parents]([ParentID] [int] NOT NULL,  
    [ParentRecord] [varchar](32) NOT NULL,  
     CONSTRAINT [PK_Parents] PRIMARY KEY CLUSTERED   
    ([ParentID] ASC))  
    GO  
    CREATE TABLE [dbo].[Children]([ChildID] [int] NOT NULL,  
    [ParentID] [int] NOT NULL,  
    [ChildRecord] [varchar](32) NOT NULL,  
     CONSTRAINT [PK_Children] PRIMARY KEY CLUSTERED   
    ([ChildID] ASC))  
    GO  
    ALTER TABLE [dbo].[Children] ADD CONSTRAINT [FK_Children_Parents] FOREIGN KEY([ParentID])  
    REFERENCES [dbo].[Parents] ([ParentID])  
    
    
  4. Open SQL Server Data Tools (SSDT) and create a new Integration Services package named SplitParentChild.dtsx.

  5. Add a Flat File connection manager to the package, name it ParentChildData, and configure it to connect to the parentchilddata.txt file that you created in a previous step.

  6. Add an OLE DB connection manager to the package and configure it to connect to the instance of SQL Server and the database in which you created the destination tables.

  7. Add a Data Flow task to the package and click the Data Flow tab of SSIS Designer.

  8. Add a Flat File Source to the data flow and configure it to use the ParentChildData connection manager. On the Columns page of the Flat File Source Editor, select the single available external column.

  9. Add a Script Component to the data flow and configure it as a transformation. Connect the output of the Flat File Source to the Script Component.

  10. Double-click the Script component to display the Script Transformation Editor.

  11. On the Input Columns page of the Script Transformation Editor, select the single available input column.

  12. On the Inputs and Outputs page of the Script Transformation Editor, select Output 0, rename it to ParentRecords, and set its SynchronousInputID to None. Create 2 output columns:

    • ParentID (the primary key), of type four-byte signed integer [DT_I4]

    • ParentRecord, of type string [DT_STR] with a length of 32.

  13. Create a second output and name it ChildRecords. The SynchronousInputID of the new output is already set to None. Create 3 output columns:

    • ChildID (the primary key), of type four-byte signed integer [DT_I4]

    • ParentID (the foreign key), also of type four-byte signed integer [DT_I4]

    • ChildRecord, of type string [DT_STR] with a length of 50

  14. On the Script page of the Script Transformation Editor, click Edit Script. In the ScriptMain class, enter the code shown in the example. Close the script development environment and the Script Transformation Editor.

  15. Add a SQL Server Destination to the data flow. Connect the ParentRecords output of the Script Component to this destination.Configure it to use the OLE DB connection manager and the Parents table.

  16. Add another SQL Server Destination to the data flow. Connect the ChildRecords output of the Script Component to this destination. Configure it to use the OLE DB connection manager and the Children table.

  17. Run the package. After the package has finished, examine the parent and child records in the two SQL Server destination tables.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)  
  
    Static nextRowIsParent As Boolean = False  
    Static parentCounter As Integer = 0  
    Static childCounter As Integer = 0  
  
    ' If current row starts with separator characters,  
    '  then following row contains new parent record.  
    If Row.Column0.StartsWith("***") Then  
        nextRowIsParent = True  
    Else  
        If nextRowIsParent Then  
            ' Current row contains parent record.  
            parentCounter += 1  
            Me.ParentRecordsBuffer.AddRow()  
            Me.ParentRecordsBuffer.ParentID = parentCounter  
            Me.ParentRecordsBuffer.ParentRecord = Row.Column0  
            nextRowIsParent = False  
        Else  
            ' Current row contains child record.  
            childCounter += 1  
            Me.ChildRecordsBuffer.AddRow()  
            Me.ChildRecordsBuffer.ChildID = childCounter  
            Me.ChildRecordsBuffer.ParentID = parentCounter  
            Me.ChildRecordsBuffer.ChildRecord = Row.Column0  
        End If  
    End If  
  
End Sub  
public override void Input0_ProcessInputRow(Input0Buffer Row)  
    {  
  
    int static_Input0_ProcessInputRow_childCounter = 0;  
    int static_Input0_ProcessInputRow_parentCounter = 0;  
    bool static_Input0_ProcessInputRow_nextRowIsParent = false;  
  
        // If current row starts with separator characters,   
        // then following row contains new parent record.   
        if (Row.Column0.StartsWith("***"))  
        {  
            static_Input0_ProcessInputRow_nextRowIsParent = true;  
        }  
        else  
        {  
            if (static_Input0_ProcessInputRow_nextRowIsParent)  
            {  
                // Current row contains parent record.   
                static_Input0_ProcessInputRow_parentCounter += 1;  
                this.ParentRecordsBuffer.AddRow();  
                this.ParentRecordsBuffer.ParentID = static_Input0_ProcessInputRow_parentCounter;  
                this.ParentRecordsBuffer.ParentRecord = Row.Column0;  
                static_Input0_ProcessInputRow_nextRowIsParent = false;  
            }  
            else  
            {  
                // Current row contains child record.   
                static_Input0_ProcessInputRow_childCounter += 1;  
                this.ChildRecordsBuffer.AddRow();  
                this.ChildRecordsBuffer.ChildID = static_Input0_ProcessInputRow_childCounter;  
                this.ChildRecordsBuffer.ParentID = static_Input0_ProcessInputRow_parentCounter;  
                this.ChildRecordsBuffer.ChildRecord = Row.Column0;  
            }  
        }  
  
    }  

See Also

Creating a Synchronous Transformation with the Script Component
Creating an Asynchronous Transformation with the Script Component