Re: Clustering wrong results in SQL 2005



Dejan Sarka wrote:
Hi!

Could you check whether you get the same result with hard clustering, i.e. with the K-means algorithm? The default Expectation-Maximization algorithm, also known as soft clustering, assigns each case to every cluster with some probability, while K-means assigns each case to exactly one cluster with probability 1. Maybe you can check the probabilities for each cluster with the ClusterProbability DMX function?
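
To make the soft vs. hard distinction concrete, here is a minimal Python sketch. It is not the Microsoft Clustering implementation: the function names are made up for illustration, and the soft version assumes equal cluster priors and unit-variance spherical Gaussians.

```python
import math

def squared_distance(point, center):
    # Squared Euclidean distance between a case and a cluster center.
    return sum((p - c) ** 2 for p, c in zip(point, center))

def soft_assignment(point, centers):
    # EM-style membership: every cluster gets some probability.
    # Assumption: equal priors and unit-variance spherical Gaussians.
    weights = [math.exp(-0.5 * squared_distance(point, c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

def hard_assignment(point, centers):
    # K-means-style membership: the nearest cluster gets probability 1.
    # On an exact tie, list.index() picks the first (lowest-index) cluster.
    dists = [squared_distance(point, c) for c in centers]
    nearest = dists.index(min(dists))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]

centers = [(0.0, 0.0), (2.0, 0.0)]
print(soft_assignment((1.0, 0.0), centers))  # case halfway between both centers
print(hard_assignment((1.0, 0.0), centers))
```

A case that lies exactly between two centers gets an even split in the soft version, while the hard version still has to pick one cluster.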


Hi,

I've built 3 additional models to cover all clustering methods.
The results are the same for each pair: EM algorithms and k-means algorithms.

In each model I still get the 50% probability of P2 in cluster 1 - in NODE_DISTRIBUTION, also for k-means. The other Ps are assigned to 1 cluster with probability 1, just as you write ;)

Maybe the SUPPORT in NODE_DISTRIBUTION could be helpful: I noticed it was higher than in NODE_SUPPORT, but I don't know how it was calculated. The support for the whole node is equal to the number of cases assigned to the cluster, but what about the SUPPORT in NODE_DISTRIBUTION? E.g. I received SUPPORT=6 for P1=Existing while the number of cases in this cluster was only 3. Also, the support for P3=Missing was 6 while the total number of cases containing P3 was only 5.

The ClusterProbability results are clear in k-means if my query covers existing cases. It has difficulties with cases built as extensions of historical ones; that is specific to this algorithm, isn't it?
In the case of EM, if ClusterProbability is equal for several clusters, Cluster() returns the one with the lowest node id (the number in the node name).
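
The tie-breaking I observed can be written down as a small Python sketch. This only mirrors the behaviour I saw, it is not how Cluster() is actually implemented, and pick_cluster and the sample probabilities are made up for illustration.

```python
def pick_cluster(probabilities):
    # probabilities: dict mapping node id (int) to the cluster probability.
    # Assumption: among clusters sharing the highest probability,
    # the one with the lowest node id is returned.
    best = max(probabilities.values())
    return min(node_id for node_id, p in probabilities.items() if p == best)

# Hypothetical case with equal probability for clusters 1 and 2.
print(pick_cluster({2: 0.5, 1: 0.5, 3: 0.0}))
```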

What would you suggest? The clustering function is promising, but I can't load real data until I understand "the P2" effect ;)

Ver.