How useful is Sequence Clustering algorithm?
- From: "Bostjan Kozuh" <bostjan.kozuh@xxxxxxxx>
- Date: Wed, 04 Apr 2007 14:00:14 GMT
Hello!
I'm having serious problems with the performance of the Sequence Clustering algorithm, and I would like to know whether I am doing something wrong or whether these are limitations of the algorithm.
Here is the situation:
I need to examine clickstream data so that I can predict users' actions for my research project. According to the book "Data Mining with SQL Server 2005" and all the posts I have read, this is the typical application for the Sequence Clustering algorithm. So I have constructed a case table (Customers) and a nested table (Clickpath) according to the instructions. I have identified 36,296 users who have visited around 192,000 pages in total (cleaned records). The maximum number of visited pages per user is 64, and the number of distinct URLCategories is 122 (pages are already converted to categories).
I use BI Development Studio for DM modelling. In the "Create DM Structure Wizard" I select the case and nested tables and then define the columns as follows:
CASE
CustomerGuid - KEY (Key, Text)
NESTED
URLCategory - INPUT, PREDICTABLE (Key Sequence, Long)
SequenceID - KEY, INPUT (Discrete, Text)
Then I process the model (MAXIMUM_STATES=1000, other settings at their defaults), and the viewers show fairly reasonable results. The algorithm finds 16 different clusters, and some of them are quite distinct from the others, which is fine. The problems start in the prediction phase. I use the following singleton DMX query (Test1 is the model name) in the prediction tab:
SELECT Cluster(), (SELECT $Sequence,
URLCategory, PredictProbability(URLCategory) AS Prob
FROM PredictSequence(ClickPath, 5)) AS Sequences
FROM Test1
NATURAL PREDICTION JOIN
(SELECT (SELECT 1 As SequenceID, 'CategoryA' AS URLCategory
UNION
SELECT 2 As SequenceID, 'CategoryB' As URLCategory
UNION
SELECT 3 As SequenceID, 'CategoryC' As URLCategory) As ClickPath) As t
The results are not what I expect. Very often I get recommendations with a probability of 1E-14, and sometimes I get "Internal error: An unexpected exception occurred." I tried using URL sequences that are very distinctive for a particular cluster, but the results do not differ much: the predicted cluster is often correct (i.e. as expected), but the predicted pages often do not occur in that cluster at all.
I would appreciate your help in determining how useful the Sequence Clustering algorithm is for personalization of websites.
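For a rough sense of scale, the figures quoted in this post can be turned into a back-of-the-envelope check. The Python sketch below is only an illustration, not a diagnosis: it assumes that each cluster carries one first-order transition matrix over all 122 categories (a simplification of how a Markov-chain-based sequence model is usually described), and it treats "around 192,000" as exact. The variable names are made up for this example.

```python
# Figures quoted in the post ("around 192,000" is treated as exact here).
users = 36_296
page_views = 192_000
categories = 122
clusters = 16  # number of clusters reported after processing the model

# Average number of page views per user.
avg_path_length = page_views / users

# A user with n page views contributes n - 1 transitions between pages.
observed_transitions = page_views - users

# Assumption: one full category-to-category transition matrix per cluster.
cells_per_cluster = categories * categories
total_cells = clusters * cells_per_cluster

print(f"average path length:  {avg_path_length:.2f}")
print(f"observed transitions: {observed_transitions}")
print(f"transition cells:     {total_cells}")
print(f"transitions per cell: {observed_transitions / total_cells:.2f}")
```

Under these assumptions there is less than one observed transition per matrix cell on average, so many category-to-category transitions may never appear in the training data. Whether that is related to the very small probabilities described above is not something this sketch can establish.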
Thanks,
Bostjan
- Follow-Ups:
- Re: How useful is Sequence Clustering algorithm?
- From: Dejan Sarka
- Re: How useful is Sequence Clustering algorithm?