Re: Training Decision Tree
- From: "Dejan Sarka" <dejan_please_reply_to_newsgroups.sarka@xxxxxxxxxx>
- Date: Tue, 20 Mar 2007 12:16:59 +0100
I have a data set in which the last column holds the name of the class that
each data row belongs to. I want to keep some of the records for validation
and use the rest for training the decision tree. How can this be done?
You can use the Percentage Sampling and Row Sampling transformations in SQL
Server Integration Services. If you use SQL Server 2005, you can also use the
new TABLESAMPLE clause of the SELECT statement. However, this clause samples
whole pages, not individual rows, so the percentage you get is only
approximate; you should use it for large tables only.
Example:
-- Use 80% of data to train
SELECT *
INTO vTM_Train
FROM vTargetMail
TABLESAMPLE (80 PERCENT)
GO
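-- Select the remaining rows (about 20%) to test.
-- EXCEPT needs the same columns on both sides, so the flag is added outside.
SELECT t.*, 1 AS TrainTest
INTO vTM_Test
FROM (SELECT * FROM vTargetMail
      EXCEPT
      SELECT * FROM vTM_Train) AS t
GO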
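Because TABLESAMPLE picks whole pages, the training set will rarely hold
exactly 80% of the rows. A quick count, sketched here with the table names
from the example above, shows the split you actually got. Note that EXCEPT
also drops fully duplicated rows, so the two sets may not add up to the total
if the source contains duplicates.

```sql
-- Compare the sizes of the two sets with the source.
SELECT
    (SELECT COUNT(*) FROM vTargetMail) AS TotalRows,
    (SELECT COUNT(*) FROM vTM_Train)   AS TrainRows,
    (SELECT COUNT(*) FROM vTM_Test)    AS TestRows
GO
```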
You can also write a manual sampling query, which works perfectly even on
small sets (kudos to Steve Kass):
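-- Manual Bernoulli sampling
-- Note: each NEWID() call is evaluated separately, so the first output
-- column is a fresh random value, not the one tested in the WHERE clause.
SELECT RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerKey), *
FROM vTargetMail
WHERE RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerKey) < 0.1
-- 0.1 is the desired probability of choosing a row; change as needed.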
GO
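The same expression can be used to split the data into training and
validation sets in a single pass. The following is only a sketch: the table
name vTM_Split and the flag values 1 and 2 are made up for illustration, and
0.8 simply mirrors the 80/20 split used earlier.

```sql
-- Row-level train/validation split (sketch).
-- vTM_Split is a hypothetical table name; 1 = train, 2 = validation.
SELECT *,
       CASE WHEN RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerKey) < 0.8
            THEN 1 ELSE 2 END AS TrainTest
INTO vTM_Split
FROM vTargetMail
GO
-- Training rows
SELECT * FROM vTM_Split WHERE TrainTest = 1
-- Validation rows
SELECT * FROM vTM_Split WHERE TrainTest = 2
GO
```

Since the flag is stored once per row, every row lands in exactly one of the
two sets, and no EXCEPT is needed to find the remainder.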
--
Dejan Sarka
http://www.solidqualitylearning.com/blogs/