Re: DT question

The model creation wizard detects that Col1 is modeled as Input and
Continuous, so it marks it as REGRESSOR.
That explains the presence of Col1 in the regression equations associated
with each split.


--
This posting is provided "AS IS" with no warranties, and confers no rights.
Please do not send email directly to this alias. It is for newsgroup
purposes only.

thanks,
bogdan

"Paul" <PAUL_R_JACOBS@xxxxxxxxx> wrote in message
news:1150241320.344176.234380@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I reread some of this. I took your mention of regressor flags to mean the
force regressor setting in the params. You are instead talking about the
modeling flag Regressor, or so I'm guessing. I don't show a modeling
flag set to Regressor, so I'm puzzled about why it appears in the XML I
just posted.


Dana Cristofor [MS] wrote:
Hi Paul,

Can you attach the mining model file from your project? You will find it in
your project location, and it has the extension .dmm.

Thanks,
--
Dana Cristofor [MSFT]
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

"Paul" <PAUL_R_JACOBS@xxxxxxxxx> wrote in message
news:1150223327.847445.301770@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi Dana,

The model is exactly the same as yours. Col2 is PredictOnly, Col1 is Input.
I am using the BI studio, not DMX.

Microsoft Analysis Server 9.00.1399.06

I get three splits (copied directly from the Legend):

Col1 < 179.200
Existing Cases: 80
Missing Cases: 0
Col2 = 1,395.035+9.999*(Col1-139.500)

Col1 >= 218.800
Existing Cases: 80
Missing Cases: 0
Col2 = 769.871-19.998*(Col1-258.500)

Col1 >= 179.200 and < 218.800
Existing Cases: 39
Missing Cases: 0
Col2 = 1,828.515-6.152*(Col1-199.000)


So you can see my question: between 179 and 218 I get a 'poor'
prediction. Outside of that range it works great. The implication is
that it is somehow regressing in chunks and when it approaches the
inflection point, the chunk gets data from both sides...
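
The middle equation can be checked against the posted data. The sketch below is not the product's algorithm; it only assumes that each node reports an ordinary least-squares line written as base + slope*(Col1 - mean of Col1). The helper names `col2` and `fit_centered_line` are made up for illustration, and `col2` simply reproduces the table posted at the start of the thread.

```python
# Rebuild the posted table: Col1 runs 100..298; Col2 rises by 10 per step
# up to Col1 = 198, then falls by 20 per step from Col1 = 199 onward.
def col2(x):
    return 10 * x if x <= 198 else 1960 - 20 * (x - 199)


def fit_centered_line(xs, ys):
    """Ordinary least squares, returned in the Legend's form:
    y = base + slope * (x - x_mean)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return y_mean, sxy / sxx, x_mean


# Cases that fall in the middle node: Col1 >= 179.200 and < 218.800
middle_x = [x for x in range(100, 299) if 179.2 <= x < 218.8]
middle_y = [col2(x) for x in middle_x]

base, slope, x_mean = fit_centered_line(middle_x, middle_y)
print(len(middle_x), x_mean, round(base, 3), round(slope, 3))
```

Under that assumption the 39 middle cases give a slope of about -6.15 around Col1 = 199, which agrees closely with the Legend. That is what a single straight line produces when it is fitted to cases on both sides of the peak.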





Dana Cristofor [MS] wrote:
Hi Paul,

I am not seeing the behavior you describe. I am using a model with [Key] as
Key, [Col1] as Input, and [Col2] as PredictOnly; neither [Col1] nor [Col2]
has Modeling_Flags set to Regressor, and the algorithm parameters are at
their defaults.

Can you give me more details about how you built the model: what version of
AS server are you using, how is your model different from the model I
mentioned above, what are the values of the algorithm parameters, and did
you build your model using SQL Server Business Intelligence Development
Studio or DMX statements? Also, how many nodes does the tree have, what are
the splitting conditions in the nodes, and what regression formulas do you
obtain? (You can get them from the Mining Legend when you click on each
node.)

Thanks,
--
Dana Cristofor [MSFT]
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.


"Paul" <PAUL_R_JACOBS@xxxxxxxxx> wrote in message
news:1150153893.067415.138550@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I admit to being confused so here comes a stupid question....

In the example above, if you run it without setting the regressor you get
regression equations, and except for some values in the middle, the
coefficients of the equations are correct. It doesn't simply report a mean,
or at least not as far as I can see...



Dana Cristofor [MS] wrote:
Hi Paul,

The decision tree algorithm will partition the dataset into regions with
meaningful patterns even if we do not specify Regressor for the input
columns. The difference is that by specifying Regressor for some input
columns C1, C2, ..., the decision tree algorithm will try to find regression
equations of the form a*C1 + b*C2 + ... to fit the patterns in the nodes of
the tree. Without Regressors, the predicted value for the continuous output
will be the mean of the values in a given node of the tree. When we mark
input columns with Regressor, we are indicating to the algorithm that the
values of the column to predict might depend on these input columns; the
algorithm will discover which combinations are suitable and produce the
appropriate regression equations.
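
To make the difference concrete, here is a small sketch that compares the two kinds of node prediction on the cases with Col1 < 179.200 from the posted table. It only illustrates "node mean" versus "regression inside the node" and assumes a plain least-squares line; it does not claim to reproduce the server's computation. The helper `col2` is hypothetical and just regenerates the posted data.

```python
# Regenerate the posted table (Col1 = 100..298).
def col2(x):
    return 10 * x if x <= 198 else 1960 - 20 * (x - 199)


# Cases in one node of the tree: Col1 < 179.200
xs = [x for x in range(100, 299) if x < 179.2]
ys = [col2(x) for x in xs]
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# No regressor: every case in the node is predicted as the node mean.
mean_rmse = (sum((y - y_mean) ** 2 for y in ys) / n) ** 0.5

# Col1 as regressor: fit y = y_mean + slope * (x - x_mean) inside the node.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
line_rmse = (sum((y - (y_mean + slope * (x - x_mean))) ** 2
                 for x, y in zip(xs, ys)) / n) ** 0.5

print(n, x_mean, y_mean, slope)
print("RMSE with node mean:", round(mean_rmse, 2))
print("RMSE with regression:", round(line_rmse, 6))
```

On this node the fitted line recovers the slope of 10 with essentially no error, while the node mean alone is off by a couple of hundred units on average.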

Thanks,
--
Dana Cristofor [MSFT]
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.


"Paul" <PAUL_R_JACOBS@xxxxxxxxx> wrote in message
news:1150139663.145905.158260@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In some ways having to specify the regressor defeats the purpose of mining.

Is the machine doing some sort of smoothing (or partitioning sic?) if the
regressor is not specified?

Dana Cristofor [MS] wrote:
Hi Paul,

You could model [Key] as Key of the model, model [Col1] as Input, Continuous
and mark it with Modeling Flag = Regressor to be used in the regression
equations, and model [Col2] as Continuous and PredictOnly. For your data the
regression tree should have a split into two nodes, the first one for values
of Col1 less than 200 and the second one for values greater than 200, which
correspond to the two distinct regions in the dataset.

Thanks,
--
Dana Cristofor [MSFT]
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.


"Paul" <PAUL_R_JACOBS@xxxxxxxxx> wrote in message
news:1149805256.303302.56400@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Data follows:

I'm trying to predict Col2. There is a clear inflection point right in the
middle. The DT created finds the proper slope for the first 80 or so and the
last 80 or so entries, but has trouble with the 40 in the middle.

I am presuming that this is because it is doing some sort of sampling in
order to keep from having to regress all combinations.

Is there a place that explains precisely how the algorithm attacks this
situation? Is there a way to set the parameters so that the 'fuzziness' in
the middle is eliminated or minimized?


Key Col1 Col2
1 100 1000
2 101 1010
3 102 1020
4 103 1030
5 104 1040
6 105 1050
7 106 1060
8 107 1070
9 108 1080
10 109 1090
11 110 1100
12 111 1110
13 112 1120
14 113 1130
15 114 1140
16 115 1150
17 116 1160
18 117 1170
19 118 1180
20 119 1190
21 120 1200
22 121 1210
23 122 1220
24 123 1230
25 124 1240
26 125 1250
27 126 1260
28 127 1270
29 128 1280
30 129 1290
31 130 1300
32 131 1310
33 132 1320
34 133 1330
35 134 1340
36 135 1350
37 136 1360
38 137 1370
39 138 1380
40 139 1390
41 140 1400
42 141 1410
43 142 1420
44 143 1430
45 144 1440
46 145 1450
47 146 1460
48 147 1470
49 148 1480
50 149 1490
51 150 1500
52 151 1510
53 152 1520
54 153 1530
55 154 1540
56 155 1550
57 156 1560
58 157 1570
59 158 1580
60 159 1590
61 160 1600
62 161 1610
63 162 1620
64 163 1630
65 164 1640
66 165 1650
67 166 1660
68 167 1670
69 168 1680
70 169 1690
71 170 1700
72 171 1710
73 172 1720
74 173 1730
75 174 1740
76 175 1750
77 176 1760
78 177 1770
79 178 1780
80 179 1790
81 180 1800
82 181 1810
83 182 1820
84 183 1830
85 184 1840
86 185 1850
87 186 1860
88 187 1870
89 188 1880
90 189 1890
91 190 1900
92 191 1910
93 192 1920
94 193 1930
95 194 1940
96 195 1950
97 196 1960
98 197 1970
99 198 1980
100 199 1960
101 200 1940
102 201 1920
103 202 1900
104 203 1880
105 204 1860
106 205 1840
107 206 1820
108 207 1800
109 208 1780
110 209 1760
111 210 1740
112 211 1720
113 212 1700
114 213 1680
115 214 1660
116 215 1640
117 216 1620
118 217 1600
119 218 1580
120 219 1560
121 220 1540
122 221 1520
123 222 1500
124 223 1480
125 224 1460
126 225 1440
127 226 1420
128 227 1400
129 228 1380
130 229 1360
131 230 1340
132 231 1320
133 232 1300
134 233 1280
135 234 1260
136 235 1240
137 236 1220
138 237 1200
139 238 1180
140 239 1160
141 240 1140
142 241 1120
143 242 1100
144 243 1080
145 244 1060
146 245 1040
147 246 1020
148 247 1000
149 248 980
150 249 960
151 250 940
152 251 920
153 252 900
154 253 880
155 254 860
156 255 840
157 256 820
158 257 800
159 258 780
160 259 760
161 260 740
162 261 720
163 262 700
164 263 680
165 264 660
166 265 640
167 266 620
168 267 600
169 268 580
170 269 560
171 270 540
172 271 520
173 272 500
174 273 480
175 274 460
176 275 440
177 276 420
178 277 400
179 278 380
180 279 360
181 280 340
182 281 320
183 282 300
184 283 280
185 284 260
186 285 240
187 286 220
188 287 200
189 288 180
190 289 160
191 290 140
192 291 120
193 292 100
194 293 80
195 294 60
196 295 40
197 296 20
198 297 0
199 298 -20






