Condividi tramite


To (a) or not to (a), that is the question?

While looking for content for the next edition of my book (newsflash! I'm currently working on the next edition of my book!) I went hunting around for that trick using decision trees to predict the states of a single column independently rather than all together. Turns out - I never wrote it! So, in case you don' t want to wait for it (or it doesn't make it into the book!), here it is now.

The classifier-type algorithms in SQL Server Data Mining (notably decision trees, naive bayes, neural networks) can all predict multinomial outputs - that is, output attributes with multiple states - 3,4, 10, 20, whatever. However, in reality, classifiers in general prefer the ying and yang of things, the black and white, the yes-ness and the no-ness. In short, they really are better at seperating between the states of a binomial attribute rather than those of a multinomial attribute - and so are you, actually. If I gave you ten things to look at and said what is the factor that most cleanly divides all ten of these things, you'd have a hard time, but if I gave you two things instead, you might not have a problem. You would be more accurate, and your model may be more accurate as well, if you only had binomials.

Additionally, with multinomials, your model - particularly with decision trees - is harder to interpret. Say you have the marital status states of "Married", "Single", "Seperated", "Divorced" and "Widowed". When I look at the dependency net and it shows that "Number of Children" is predictive of "Marital Status" - what does it mean? Which aspect of marital status is it talking about - all of them? One of them? Impossible to tell.

 So what do you do when you have a multinomial output? The first choice is obvious - see if you can reduce it to two states - e.g. can they be changed into "Married" and "Not Married". If that's not an option, i.e. if the states are important, another option is to turn them into a series of binomials - e.g. "Marriend" and "Not Married", "Single" and "Not Single", etc.

Now we run into an additional problem - not only is transposing our data like that a royal pain in the butt, it increases the number of columns and we may exceed the maximum row length of SQL Server.

Luckily with SQL Server Data Mining, we have a way out. By using a trick with nested tables, we can create a model that treats each state as a binomial attribute without changing any of our data. Assume our original model looked like this:

CREATE MINING MODEL MyMultinomialModel
(
CustomerID LONG KEY,
Age LONG CONTINUOUS,
Gender TEXT DISCRETE,
NumChildren LONG CONTINUOUS,
MaritalStatus TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees

we can transform the MaritalStatus field into a nested table like this:

CREATE MINING MODEL MyBinomialModel
(
CustomerID LONG KEY,
Age LONG CONTINUOUS,
Gender TEXT DISCRETE,
NumChildren LONG CONTINUOUS,
MaritalStatus TABLE PREDICT ONLY
(
Status TEXT KEY
)
) USING Microsoft_Decision_Trees

OK, so now that we've created the model, how do we train it - we only have a single source table right? Well, you're right, but that doesn't really matter. Using the tools in BI Dev Studio, you can select a table as both a Case and Nested table, and you can do the same thing using DMX, like this

INSERT INTO MyBinomialModel(CustomerID,
Age, Gender, NumChildren,
MaritalStatus(SKIP, Status))
SHAPE{OPENQUERY(MyDataSource,
'SELECT CustID, Age, Gender, NumChildren
FROM Customers ORDER BY CustID') }
APPEND
{OPENQUERY(MyDataSource,'SELECT CustID, MaritalStatus
FROM Customers ORDER BY CustID')
RELATE CustID to CustID} AS MaritalStatus

So why does this work? Since we're using the same table as the case and nested tables, we're guaranteed that each case will have one and only one "Status" in the "MaritalStatus" table. The MaritalStatus table is PREDICT ONLY, so there's no cross-confusion between the states. A decision tree model in this case will build five trees - one for each state of MaritalStatus. Each attribute is a binomial variable with two possibilities "This state exists" or "This state doesn't exist".

This also helps in prediction if you need to predict a particular state. You will get a much better prediction of whether or not a customer is or is not a particular state than if you did this the traditional way. You can find out how likely by using subselects with your prediction statements, e.g.

SELECT FLATTENED
(SELECT * FROM
PREDICT(MaritalStatus, INCLUDE_STATISTICS)
WHERE Status='Single')
FROM MyBinomialModel PREDICTION JOIN ...

The above query will give you the probability and support of being single for each customer in the input set.

Additionally, in the dependency network, each state of MaritalStatus will appear as its own node, so you will finally be able to see whether Number Of Children is more predictive of "Married" or of "Divorced" J

And for those wanting more information on the book, I’m working once again with Zhaohui Tang on the book and this time we’ve added Bogdan Crivat as a new author. It will probably have some spiffy name like “Data Mining with SQL Server 2008” and will be released sometime around when the product ships – no comment on that!

  (note: the author is happily married with four children, regardless of what your dependency net says)

Comments