Quickstart: Data structures, data types, and objects using R with SQL machine learning
Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance
In this quickstart, you'll learn how to use data structures and data types when using R in SQL Server Machine Learning Services or on Big Data Clusters. You'll learn about moving data between R and SQL Server, and the common issues that might occur.
In this quickstart, you'll learn how to use data structures and data types when using R in SQL Server Machine Learning Services. You'll learn about moving data between R and SQL Server, and the common issues that might occur.
In this quickstart, you'll learn how to use data structures and data types when using R in SQL Server R Services. You'll learn about moving data between R and SQL Server, and the common issues that might occur.
In this quickstart, you'll learn how to use data structures and data types when using R in Azure SQL Managed Instance Machine Learning Services. You'll learn about moving data between R and SQL Managed Instance, and the common issues that might occur.
Common issues to know up front include:
- Data types sometimes don't match
- Implicit conversions might take place
- Cast and convert operations are sometimes required
- R and SQL use different data objects
Prerequisites
You need the following prerequisites to run this quickstart.
- SQL Server Machine Learning Services. To install Machine Learning Services, see the Windows installation guide or the Linux installation guide. You can also enable Machine Learning Services on SQL Server Big Data Clusters.
- SQL Server Machine Learning Services. To install Machine Learning Services, see the Windows installation guide.
- SQL Server 2016 R Services. To install R Services, see the Windows installation guide.
- Azure SQL Managed Instance Machine Learning Services. For information, see the Azure SQL Managed Instance Machine Learning Services overview.
- A tool for running SQL queries that contain R scripts. This quickstart uses Azure Data Studio.
Always return a data frame
When your script returns results from R to SQL Server, it must return the data as a data.frame. Any other type of object that you generate in your script - whether that be a list, factor, vector, or binary data - must be converted to a data frame if you want to output it as part of the stored procedure results. Fortunately, there are multiple R functions to support changing other objects to a data frame. You can even serialize a binary model and return it in a data frame, which you'll do later in this quickstart.
First, let's experiment with some R basic R objects - vectors, matrices, and lists - and see how conversion to a data frame changes the output passed to SQL Server.
Compare these two "Hello World" scripts in R. The scripts look almost identical, but the first returns a single column of three values, whereas the second returns three columns with a single value each.
Example 1
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N' mytextvariable <- c("hello", " ", "world");
OutputDataSet <- as.data.frame(mytextvariable);'
, @input_data_1 = N' ';
Example 2
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N' OutputDataSet<- data.frame(c("hello"), " ", c("world"));'
, @input_data_1 = N' ';
Identify schema and data types
Why are the results so different?
The answer can usually be found by using the R str()
command. Add the function str(object_name)
anywhere in your R script to have the data schema of the specified R object returned as an informational message.
To figure out why Example 1 and Example 2 have such different results, insert the line str(OutputDataSet)
at the end of the @script
variable definition in each statement, like this:
Example 1 with str function added
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N' mytextvariable <- c("hello", " ", "world");
OutputDataSet <- as.data.frame(mytextvariable);
str(OutputDataSet);'
, @input_data_1 = N' '
;
Example 2 with str function added
EXECUTE sp_execute_external_script
@language = N'R',
@script = N' OutputDataSet <- data.frame(c("hello"), " ", c("world"));
str(OutputDataSet);' ,
@input_data_1 = N' ';
Now, review the text in Messages to see why the output is different.
Results - Example 1
STDOUT message(s) from external script:
'data.frame': 3 obs. of 1 variable:
$ mytextvariable: Factor w/ 3 levels " ","hello","world": 2 1 3
Results - Example 2
STDOUT message(s) from external script:
'data.frame': 1 obs. of 3 variables:
$ c..hello..: Factor w/ 1 level "hello": 1
$ X... : Factor w/ 1 level " ": 1
$ c..world..: Factor w/ 1 level "world": 1
As you can see, a slight change in R syntax had a big effect on the schema of the results. We won't go into why, but the differences in R data types are explained in details in the Data Structures section in "Advanced R" by Hadley Wickham.
For now, just be aware that you need to check the expected results when coercing R objects into data frames.
Tip
You can also use R identity functions, such as is.matrix
, is.vector
, to return information about the internal data structure.
Implicit conversion of data objects
Each R data object has its own rules for how values are handled when combined with other data objects if the two data objects have the same number of dimensions, or if any data object contains heterogeneous data types.
First, create a small table of test data.
CREATE TABLE RTestData (col1 INT NOT NULL)
INSERT INTO RTestData
VALUES (1);
INSERT INTO RTestData
VALUES (10);
INSERT INTO RTestData
VALUES (100);
GO
For example, assume you run the following statement to perform matrix multiplication using R. You multiply a single-column matrix with the three values by an array with four values, and expect a 4x3 matrix as a result.
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'
x <- as.matrix(InputDataSet);
y <- array(12:15);
OutputDataSet <- as.data.frame(x %*% y);'
, @input_data_1 = N' SELECT [Col1] from RTestData;'
WITH RESULT SETS (([Col1] int, [Col2] int, [Col3] int, Col4 int));
Under the covers, the column of three values is converted to a single-column matrix. Because a matrix is just a special case of an array in R, the array y
is implicitly coerced to a single-column matrix to make the two arguments conform.
Results
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
12 | 13 | 14 | 15 |
120 | 130 | 140 | 150 |
1200 | 1300 | 1400 | 1500 |
However, note what happens when you change the size of the array y
.
execute sp_execute_external_script
@language = N'R'
, @script = N'
x <- as.matrix(InputDataSet);
y <- array(12:14);
OutputDataSet <- as.data.frame(y %*% x);'
, @input_data_1 = N' SELECT [Col1] from RTestData;'
WITH RESULT SETS (([Col1] int ));
Now R returns a single value as the result.
Results
Col1 |
---|
1542 |
Why? In this case, because the two arguments can be handled as vectors of the same length, R returns the inner product as a matrix. This is the expected behavior according to the rules of linear algebra; however, it could cause problems if your downstream application expects the output schema to never change!
Tip
Getting errors? Make sure that you're running the stored procedure in the context of the database that contains the table, and not in master or another database.
Also, we suggest that you avoid using temporary tables for these examples. Some R clients will terminate a connection between batches, deleting temporary tables.
Merge or multiply columns of different length
R provides great flexibility for working with vectors of different sizes, and for combining these column-like structures into data frames. Lists of vectors can look like a table, but they don't follow all the rules that govern database tables.
For example, the following script defines a numeric array of length 6 and stores it in the R variable df1
. The numeric array is then combined with the integers of the RTestData table, which contains three (3) values, to make a new data frame, df2
.
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'
df1 <- as.data.frame( array(1:6) );
df2 <- as.data.frame( c( InputDataSet , df1 ));
OutputDataSet <- df2'
, @input_data_1 = N' SELECT [Col1] from RTestData;'
WITH RESULT SETS (( [Col2] int not null, [Col3] int not null ));
To fill out the data frame, R repeats the elements retrieved from RTestData as many times as needed to match the number of elements in the array df1
.
Results
Col2 | Col3 |
---|---|
1 | 1 |
10 | 2 |
100 | 3 |
1 | 4 |
10 | 5 |
100 | 6 |
Remember that a data frame only looks like a table, and is actually a list of vectors.
Cast or convert data
R and SQL Server don't use the same data types, so when you run a query in SQL Server to get data and then pass that to the R runtime, some type of implicit conversion usually takes place. Another set of conversions takes place when you return data from R to SQL Server.
- SQL Server pushes the data from the query to the R process managed by the Launchpad service and converts it to an internal representation for greater efficiency.
- The R runtime loads the data into a data.frame variable and performs its own operations on the data.
- The database engine returns the data to SQL Server using a secured internal connection and presents the data in terms of SQL Server data types.
- You get the data by connecting to SQL Server using a client or network library that can issue SQL queries and handle tabular data sets. This client application can potentially affect the data in other ways.
To see how this works, run a query such as this one on the AdventureWorksDW data warehouse. This view returns sales data used in creating forecasts.
USE AdventureWorksDW
GO
SELECT ReportingDate
, CAST(ModelRegion as varchar(50)) as ProductSeries
, Amount
FROM [AdventureWorksDW].[dbo].[vTimeSeries]
WHERE [ModelRegion] = 'M200 Europe'
ORDER BY ReportingDate ASC
Note
You can use any version of AdventureWorks, or create a different query using a database of your own. The point is to try to handle some data that contains text, datetime and numeric values.
Now, try pasting this query as the input to the stored procedure.
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N' str(InputDataSet);
OutputDataSet <- InputDataSet;'
, @input_data_1 = N'
SELECT ReportingDate
, CAST(ModelRegion as varchar(50)) as ProductSeries
, Amount
FROM [AdventureWorksDW].[dbo].[vTimeSeries]
WHERE [ModelRegion] = ''M200 Europe''
ORDER BY ReportingDate ASC ;'
WITH RESULT SETS undefined;
If you get an error, you'll probably need to make some edits to the query text. For example, the string predicate in the WHERE clause must be enclosed by two sets of single quotation marks.
After you get the query working, review the results of the str
function to see how R treats the input data.
Results
STDOUT message(s) from external script: 'data.frame': 37 obs. of 3 variables:
STDOUT message(s) from external script: $ ReportingDate: POSIXct, format: "2010-12-24 23:00:00" "2010-12-24 23:00:00"
STDOUT message(s) from external script: $ ProductSeries: Factor w/ 1 levels "M200 Europe",..: 1 1 1 1 1 1 1 1 1 1
STDOUT message(s) from external script: $ Amount : num 3400 16925 20350 16950 16950
- The datetime column has been processed using the R data type, POSIXct.
- The text column "ProductSeries" has been identified as a factor, meaning a categorical variable. String values are handled as factors by default. If you pass a string to R, it is converted to an integer for internal use, and then mapped back to the string on output.
Summary
From even these short examples, you can see the need to check the effects of data conversion when passing SQL queries as input. Because some SQL Server data types are not supported by R, consider these ways to avoid errors:
- Test your data in advance and verify columns or values in your schema that could be a problem when passed to R code.
- Specify columns in your input data source individually, rather than using
SELECT *
, and know how each column will be handled. - Perform explicit casts as necessary when preparing your input data, to avoid surprises.
- Avoid passing columns of data (such as GUIDs or rowguids) that cause errors and aren't useful for modeling.
For more information on supported and unsupported data types, see R libraries and data types.
Next steps
To learn about writing advanced R functions with SQL machine learning, follow this quickstart: