Thursday, July 5, 2012

Collaborative Filtering with Personalized Skylines


Collaborative filtering (CF) systems exploit previous ratings and similarity in user behavior to recommend the top-k objects/records which are potentially most interesting to the user assuming a single score per object. However, in various applications, a record (e.g., hotel) maybe rated on several attributes (value, service, etc.), in which case simply returning the ones with the highest overall scores fails to capture the individual attribute characteristics and to accommodate different selection criteria. In order to enhance the flexibility of CF, we propose Collaborative Filtering Skyline (CFS), a general framework that combines the advantages of CF with those of the skyline operator. CFS generates a personalized skyline for each user based on scores of other users with similar behavior. The personalized skyline includes objects that are good on certain aspects, and eliminates the ones that are not interesting on any attribute combination. Although the integration of skylines and CF has several attractive properties, it also involves rather expensive computations. We face this challenge through a comprehensive set of algorithms and optimizations that reduce the cost of generating personalized skylines. In addition to exact skyline processing, we develop an approximate method that provides error guarantees. Finally, we propose the top-k personalized skyline, where the user specifies the required output cardinality.
Skyline Processing

Assume records with attributes, each taking values from a totally ordered domain. Accordingly, a record can be represented as a point in the d-dimensional space (in the sequel, we use the terms record, point, and object
Interchangeably). The skyline contains the best points according to any function that is monotonic on each attribute. Conversely, for each skyline record r, there is such a function that would assign it the highest score. These attractive properties of skylines have led to their application in various domains including multi objective optimization maximum vectors and the contour problem. They introduced the skyline operator to the database literature and proposed two disk-based algorithms for large data sets. The first, called D&C (for divide and conquer) divides the data set into partitions that fit in memory, computes the partial skyline in every partition, and generates the final skyline by merging the partial ones. The second algorithm, called BNL, applies the concept of block-nested loops. It improves BNL by sorting the data. Other variants of all these methods do not use any indexing and, usually, they have to scan the entire data set before reporting any skyline point. Another set of algorithms utilizes conventional or multidimensional indexes to speed up query processing and progressively report skyline points. Such methods include Bitmap, In addition to conventional databases; skyline processing has been studied in other scenarios. For instance, Morse et al. uses spatial access methods to maintain the skyline in streams with explicit deletions. Efficient skyline maintenance has also been the focus of in distributed environments; several methods query independent subsystems, each in charge of a specific attribute, and compute the skylines using the partial results. In the data mining context, identify the combinations