
Mining Large Streams of User Data for Personalized Recommendations

Xavier Amatriain
Netflix
xamatriain@netflix.com

ABSTRACT
The Netflix Prize put the spotlight on the use of data mining and machine learning methods for predicting user preferences. Many lessons came out of the competition. But since then, Recommender Systems have evolved. This evolution has been driven by the greater availability of different kinds of user data in industry and the interest that the area has drawn among the research community. The goal of this paper is to give an up-to-date overview of the use of data mining approaches for personalization and recommendation.

Using Netflix personalization as a motivating use case, I will describe the use of different kinds of data and machine learning techniques.

After introducing the traditional approaches to recommendation, I highlight some of the main lessons learned from the Netflix Prize. I then describe the use of recommendation and personalization techniques at Netflix. Finally, I pinpoint the most promising current research avenues and unsolved problems that deserve attention in this domain.

1. INTRODUCTION
Recommender Systems (RS) are a prime example of the mainstream applicability of large scale data mining. Applications such as e-commerce, search, Internet music and video, gaming or even online dating make use of similar techniques to mine large volumes of data to better match their users' needs in a personalized fashion.

There is more to a good recommender system than the data mining technique. Issues such as the user interaction design, outside the scope of this paper, may have a deep impact on the effectiveness of an approach. But given an existing application, an improvement in the algorithm can have a value of millions of dollars, and can even be the factor that determines the success or failure of a business. On the other hand, given an existing method or algorithm, adding more features coming from different data sources can also result in a significant improvement. I will describe the use of data, models, and other personalization techniques at Netflix in Section 3. I will also discuss whether we should focus on more data or better models in Section 4.

Another important issue is how to measure the success of a given personalization technique. Root mean squared error (RMSE) was the offline evaluation metric of choice in the Netflix Prize (see Section 2). But there are many other relevant metrics that, if optimized, would lead to different solutions - think, for example, of ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) or other information retrieval ones such as recall or area under the curve (AUC). Beyond the optimization of a given offline metric, what we are really pursuing is the impact of a method on the business. Is there a way to relate the goodness of an algorithm to more customer-facing metrics such as click-through rate (CTR) or retention? I will describe our approach to innovation called "Consumer Data Science" in Section 3.1.

But before we understand the reasons for all these effects, and before we are ready to embrace the open research questions in the area of personalization described in Section 5, we need to understand some of the basic techniques that enable the different approaches. I will briefly describe them in the following paragraphs.

1.1 Approaches to the Recommendation problem
The most common approach to build a Recommender System is to use one of the many available Collaborative Filtering (CF) algorithms [1]. The underlying assumption of these methods is captured by the principle of like-mindedness: users who are measurably similar in their historical preferences are likely to also share similar tastes in the future. In other words, CF algorithms assume that, in order to recommend content of any kind to users, information can be drawn from what they and other similar users have liked in the past. Historically, the k-Nearest Neighbor (kNN) algorithm was the most favored approach to CF, since it transparently captured this assumption of like-mindedness: it operates by finding, for each user (or item), a number of similar users (items) whose profiles can then be used to directly compute recommendations [55].
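To make the like-mindedness assumption concrete, the following is a minimal sketch of user-based kNN rating prediction on a small dense ratings matrix. It is not the implementation referenced in [55]; the toy matrix, the number of neighbors k, and the choice of cosine similarity over co-rated items are illustrative assumptions.

```python
import numpy as np

def predict_rating(ratings, user, item, k=2):
    """User-based kNN sketch: predict ratings[user, item] from the k most
    cosine-similar users who have rated the item (0 = missing rating)."""
    target = ratings[user]
    sims = []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        mask = (target > 0) & (ratings[other] > 0)      # co-rated items only
        if not mask.any():
            continue
        a, b = target[mask], ratings[other][mask]
        sims.append((np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)),
                     ratings[other, item]))
    neighbors = sorted(sims, reverse=True)[:k]
    if not neighbors:
        return float(target[target > 0].mean())         # fall back to the user mean
    weights = np.array([s for s, _ in neighbors])
    values = np.array([r for _, r in neighbors])
    return float(np.dot(weights, values) / weights.sum())

# Toy 4 users x 5 items rating matrix (1-5 stars, 0 = unknown)
R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 3, 1, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 2, 4, 5]], dtype=float)
print(predict_rating(R, user=0, item=2))  # prediction for user 0, item 2
```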

The main alternative to CF is the so-called content-based approach [46], which identifies similarities between items based on the features inherent in the items themselves. These recommender systems require a way to extract content descriptions and a similarity measure between items. Automatic content description is still only available for some kinds of content, and under some constraints. That is why some systems need to rely on experts to manually input and categorize the content [56]. On the other hand, content-based approaches can deal with some of the shortcomings of CF such as item cold-start - i.e. the initialization of new items that the system has no previous user preference data for.

CF and content-based methods can be combined in different ways using hybrid approaches [15]. Hybrid RS can combine several different methods in a way that one method provides support whenever the other methods are lacking. In practice, most of the advanced recommendation systems used in the industry are based on some sort of hybridization, and are rarely purely CF or content-based.

1.2 Data Mining methods in Recommender Systems
No matter which of the previous approaches is used, a recommender system's engine can be seen as a particular instantiation of a traditional data mining task [4]. A data mining task typically consists of 3 steps, carried out in succession: Data Preprocessing, Data Modeling, and Result Analysis. Traditional machine learning techniques such as dimensionality reduction, classification, or clustering, can be applied to the recommendation problem. In the following paragraphs, I will describe some of the models that, beyond the classical kNN, can be used to build a recommender system.

Although current trends seem to indicate that other matrix factorization techniques are preferred (see Section 2.1), earlier works used Principal Component Analysis (PCA) [24]. Decision Trees may be used in a content-based approach for a RS. One possibility is to use content features to build a decision tree that models the variables involved in the user preferences [13]. Bayesian classifiers have been used to derive a model for content-based RS [23]. Artificial Neural Networks (ANN) can be used in a similar way as Bayesian Networks to construct content-based RS's [47]. ANN can also be used to combine (or hybridize) the input from several recommendation modules or data sources [20]. Support Vector Machines (SVM) have also shown promising recent results [30].

Clustering approaches such as k-means can be used as a pre-processing step to help in neighborhood formation [65].

Finally, association rules [3] can also be applied to the recommendation problem [38].

2. THE NETFLIX PRIZE
In 2006, Netflix announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. We offered $1 million to whoever improved the accuracy of our existing system called Cinematch by 10%. We conducted this competition to find new ways to improve the recommendations we provide to our members, which is a key part of our business. However, we had to come up with a proxy question that was easier to evaluate and quantify: the root mean squared error (RMSE) of the predicted rating.

The Netflix Prize put the spotlight on Recommender Systems and the value of user data to generate personalized recommendations. It did so by providing a crisp problem definition that enabled thousands of teams to focus on improving a metric. While this was a simplification of the recommendation problem, there were many lessons learned.

2.1 Lessons from the Prize
A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. And they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (MF) [35] and Restricted Boltzmann Machines (RBM) [54]. (The application of Matrix Factorization to the task of rating prediction closely resembles the technique known as Singular Value Decomposition used, for example, to identify latent factors in Information Retrieval; it is therefore common to see people referring to this MF solution as SVD.) Matrix Factorization by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.

The standard matrix factorization decomposition provides user factor vectors $U_u \in \mathbb{R}^f$ and item factor vectors $V_v \in \mathbb{R}^f$. In order to predict a rating, we first estimate a baseline $b_{uv} = \mu + b_u + b_v$ as the user and item deviation from the average. The prediction can then be obtained by adding the product of user and item factors to the baseline as $r'_{uv} = b_{uv} + U_u^T V_v$.

One of the most interesting findings during the Netflix Prize came out of a blog post. Simon Funk introduced an incremental, iterative, and approximate way to compute the SVD using gradient descent [22]. This provided a practical way to scale matrix factorization methods to large datasets.
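As an illustration of this incremental approach, here is a minimal stochastic gradient descent sketch of regularized matrix factorization on (user, item, rating) triples, with the baseline terms from the formula above folded in as biases. It is not Funk's code; the learning rate, regularization, factor dimension, and toy data are assumptions.

```python
import numpy as np

def funk_sgd_mf(triples, n_users, n_items, f=10, lr=0.01, reg=0.05, epochs=30):
    """Minimal SGD sketch for r'_{uv} = mu + b_u + b_v + U_u . V_v."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, f))
    V = 0.1 * rng.standard_normal((n_items, f))
    b_u = np.zeros(n_users)
    b_v = np.zeros(n_items)
    mu = np.mean([r for _, _, r in triples])
    for _ in range(epochs):
        for u, v, r in triples:
            pred = mu + b_u[u] + b_v[v] + U[u].dot(V[v])
            err = r - pred
            b_u[u] += lr * (err - reg * b_u[u])
            b_v[v] += lr * (err - reg * b_v[v])
            # simultaneous update of the two factor vectors
            U[u], V[v] = (U[u] + lr * (err * V[v] - reg * U[u]),
                          V[v] + lr * (err * U[u] - reg * V[v]))
    return mu, b_u, b_v, U, V

# Toy (user, item, rating) triples
data = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 1.0), (2, 2, 5.0)]
mu, b_u, b_v, U, V = funk_sgd_mf(data, n_users=3, n_items=3, f=5)
print(mu + b_u[0] + b_v[2] + U[0].dot(V[2]))  # predicted rating for user 0, item 2
```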

Another enhancement to matrix factorization methods was Koren et al.'s SVD++ [33]. This asymmetric variation enables adding both implicit and explicit feedback, and removes the need for parameterizing the users.

The second model that proved successful in the Netflix Prize was the Restricted Boltzmann Machine (RBM). RBMs can be understood as the fourth generation of Artificial Neural Networks - the first being the Perceptron popularized in the 60s; the second being the backpropagation algorithm in the 80s; and the third being Belief Networks (BNs) from the 90s. RBMs are BNs that restrict the connectivity to make learning easier. RBMs can be stacked to form Deep Belief Nets (DBN). For the Netflix Prize, Salakhutdinov et al. proposed an RBM structure with binary hidden units and softmax visible units with 5 biases only for the movies the user rated [54].

Many other learnings came out of the Prize. For example, the matrix factorization methods mentioned above were combined with the traditional neighborhood approaches [33].

Also, early in the prize, it became clear that it was important to take into account temporal dynamics in the user feedback [34].

Another finding of the Netflix Prize was the realization that user explicit ratings are noisy. This was already known in the literature. Herlocker et al. [27] coined the term "magic barrier" to refer to the limit in accuracy in a recommender system due to the natural variability in the ratings. This limit was in fact relatively close to the actual Prize threshold [6], and might have played a role in why it took so much effort to squeeze the last fractions of RMSE.

The final Grand Prize ensemble that won the $1M two years later was a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line [11]. The way that the final solution was accomplished by combining many independent models also highlighted the power of using ensembles.

Figure 1: Following an iterative and data-driven offline-online process for innovating in personalization

At Netflix, we evaluated some of the new methods included in the final solution. The additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had by then shifted from pure rating prediction to the next level. In the next section, I will explain the different methods and components that make up a complete personalization approach such as the one used by Netflix.

3. NETFLIX PERSONALIZATION: BEYOND RATING PREDICTION
Netflix has discovered through the years that there is tremendous value in incorporating recommendations to personalize as much of the experience as possible. This realization pushed us to propose the Netflix Prize described in the previous section. In this section, we will go over the main components of Netflix personalization. But first let us take a look at how we manage innovation in this space.

3.1 Consumer Data Science
The abundance of source data, measurements and associated experiments allows Netflix not only to improve our personalization algorithms but also to operate as a data-driven organization. We have embedded this approach into our culture since the company was founded, and we have come to call it Consumer (Data) Science. Broadly speaking, the main goal of our Consumer Science approach is to innovate for members effectively. We strive for an innovation that allows us to evaluate ideas rapidly, inexpensively, and objectively.

And once we test something, we want to understand why it failed or succeeded. This lets us focus on the central goal of improving our service for our members. So, how does this work in practice? It is a slight variation on the traditional scientific process called A/B testing (or bucket testing):

1. Start with a hypothesis: Algorithm/feature/design X will increase member engagement with our service and ultimately member retention.

2. Design a test: Develop a solution or prototype. Think about issues such as dependent & independent variables, control, and significance.

3. Execute the test: Assign users to the different buckets and let them respond to the different experiences.

4. Let data speak for itself: Analyze significant changes on primary metrics and try to explain them through variations in the secondary metrics.

When we execute A/B tests, we track many different metrics. But we ultimately trust member engagement (e.g. viewing hours) and retention. Tests usually have thousands of members and anywhere from 2 to 20 cells exploring variations of a base idea. We typically have scores of A/B tests running in parallel. A/B tests let us try radical ideas or test many approaches at the same time, but the key advantage is that they allow our decisions to be data-driven.
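As a concrete illustration of the "significance" concern in step 2 above, here is a minimal two-proportion z-test sketch for comparing retention between a control cell and a test cell. The cell sizes and retention counts are made-up numbers, and this is only a textbook-style sketch, not Netflix's actual analysis pipeline.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(retained_a, n_a, retained_b, n_b):
    """Two-sided z-test for the difference between two retention rates."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    p_pool = (retained_a + retained_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Hypothetical cells: control retains 8,200 of 10,000 members, test retains 8,390 of 10,000
print(two_proportion_ztest(8200, 10000, 8390, 10000))
```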

An interesting follow-up question that we have faced is how to integrate our machine learning approaches into this data-driven A/B test culture at Netflix. We have done this with an offline-online testing process that tries to combine the best of both worlds (see Figure 1). The offline testing cycle is a step where we test and optimize our algorithms prior to performing online A/B testing. To measure model performance offline we track multiple metrics: from ranking measures such as normalized discounted cumulative gain, to classification metrics such as precision and recall. We also use the famous RMSE from the Netflix Prize or other more exotic metrics to track different aspects like diversity. We keep track of how well those metrics correlate to measurable online gains in our A/B tests. However, since the mapping is not perfect, offline performance is used only as an indication to make informed decisions on follow up tests.

Figure 2: Example of a Netflix Top 10 row. We promote personalization awareness and reflect on the diversity of a household. Note though that personal labels are only the author's guess since the system is uncertain about the true household composition.

Once offline testing has validated a hypothesis, we are ready to design and launch the A/B test that will prove the new feature valid from a member perspective. If it does, we will be ready to roll out in our continuous pursuit of a better product for our members. That is in fact how we came to have the personalization experience I will describe in the next section.

3.2 Everything is a Recommendation
Personalization starts on our homepage on any device. This page consists of groups of videos arranged in horizontal rows. Each row has a title that conveys the intended meaningful connection between the videos in that group. Most of our personalization is based on the way we select rows, how we determine what items to include in them, and in what order to place those items.

Take as a first example the Top 10 row (see Figure 2). This row is our best guess at the ten titles you are most likely to enjoy. Of course, when we say "you", we really mean everyone in your household. It is important to keep in mind that Netflix personalization is intended to handle a household that is likely to have different people with different tastes. That is why when you see your Top 10, you are likely to discover items for dad, mom, the kids, or the whole family. Even for a single person household we want to appeal to your range of interests and moods. To achieve this, in many parts of our system we are not only optimizing for accuracy, but also for diversity.

Another important element in Netflix personalization is awareness. We want members to be aware of how we are adapting to their tastes. This not only promotes trust in the system, but encourages members to give feedback that will result in better recommendations. A different way of promoting trust with the personalization component is to provide explanations as to why we decide to recommend a given movie or show (see Figure 3). We are not recommending it because it suits our business needs, but because it matches the information we have from you: your explicit taste preferences and ratings, your viewing history, or even your friends' recommendations.

On the topic of friends, we recently released our Facebook connect feature. Knowing about your friends not only gives us another signal to use in our personalization algorithms, but it also allows for different rows that rely mostly on your social circle to generate recommendations.

Figure 3: Adding explanation and support for recommendations contributes to user satisfaction and requires specific algorithms. Support in Netflix can include your predicted rating, related shows you have watched, or even friends who have interacted with the title.

Some of the most recognizable personalization in our service is the collection of "genre" rows. These range from familiar high-level categories like "Comedies" and "Dramas" to highly tailored slices such as "Imaginative Time Travel Movies from the 1980s". Each row represents 3 layers of personalization: the choice of genre itself, the subset of titles selected within that genre, and the ranking of those titles. Rows are generated using a member's implicit genre preferences (recent plays, ratings, and other interactions) or explicit feedback provided through our taste preferences survey (see Figure 4). As with other personalization elements, freshness and diversity are taken into account when deciding what genres to show from the thousands possible.

Similarity is also an important source of personalization. We think of similarity in a very broad sense; it can be between movies or between members, and can be in multiple dimensions such as metadata, ratings, or viewing data. Furthermore, these similarities can be blended and used as features in other models. Similarity is used in multiple contexts, for example to generate rows of "ad hoc genres" based on similarity to titles that a member has interacted with recently.

In most of the previous contexts, the goal of the recommender systems is still to present a number of attractive items for a person to choose from. This is usually accomplished by selecting some items and sorting them in the order of expected enjoyment (or utility). Since the most common way of presenting recommended items is in some form of list, we need an appropriate ranking model that can use a wide variety of information to come up with an optimal sorting of the items. In the next section, we will go into some of the details of how to design such a ranking model.

Figure 4: Netflix Genre rows can be generated from implicit, explicit, or hybrid feedback

3.3 Ranking
The goal of a ranking system is to find the best possible ordering of a set of items for a user, within a specific context, in real-time. We optimize ranking algorithms to give the highest scores to titles that a member is most likely to play and enjoy.

If you are looking for a ranking function that optimizes consumption, an obvious baseline is item popularity. The reason is clear: on average, a member is most likely to watch what most others are watching. However, popularity is the opposite of personalization: it will produce the same ordering of items for every member. Thus, the goal becomes to find a personalized ranking function that is better than item popularity, so we can better satisfy members with varying tastes.

Recall that our goal is to recommend the titles that each member is most likely to play and enjoy. One obvious way to approach this is to use the member's predicted rating of each item as an adjunct to item popularity. Using predicted ratings on their own as a ranking function can lead to items that are too niche or unfamiliar, and can exclude items that the member would want to watch even though they may not rate them highly. To compensate for this, rather than using either popularity or predicted rating on their own, we would like to produce rankings that balance both of these aspects.

At this point, we are ready to build a ranking prediction model using these two features.

Let us start with a very simple scoring approach by choosing our ranking function to be a linear combination of popularity and predicted rating. This gives an equation of the form $score(u, v) = w_1 p(v) + w_2 r(u, v) + b$, where $u$ = user, $v$ = video item, $p$ = popularity, and $r$ = predicted rating. This equation defines a two-dimensional space (see Figure 5).

Once we have such a function, we can pass a set of videos through our function and sort them in descending order according to the score. First, though, we need to determine the weights $w_1$ and $w_2$ in our model (the bias $b$ is constant and thus ends up not affecting the final ordering). We can formulate this as a machine learning problem: select positive and negative examples from your historical data and let a machine learning algorithm learn the weights that optimize our goal. This family of machine learning problems is known as "Learning to Rank" and is central to application scenarios such as search engines or ad targeting. A crucial difference in the case of ranked recommendations is the importance of personalization: we do not expect a global notion of relevance, but rather look for ways of optimizing a personalized model.
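A minimal sketch of that idea, under assumptions that are not taken from the paper: treat plays as positive examples and non-plays as negatives, fit a logistic regression on the two features, and then use the learned weights $w_1$, $w_2$ to score and sort a candidate set of titles. The toy data and the scikit-learn usage are illustrative only, not the production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each example: [popularity p(v), predicted rating r(u, v)], label 1 = played, 0 = not played
X = np.array([[0.9, 3.2], [0.8, 4.5], [0.2, 4.8], [0.7, 2.1],
              [0.1, 2.5], [0.6, 4.0], [0.3, 3.9], [0.05, 1.8]])
y = np.array([1, 1, 1, 0, 0, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)
w1, w2 = model.coef_[0]          # learned weights for popularity and predicted rating
b = model.intercept_[0]          # bias: constant, does not change the ordering

# Rank a candidate set for one member by score(u, v) = w1*p(v) + w2*r(u, v) + b
candidates = {"title_a": (0.85, 3.0), "title_b": (0.30, 4.7), "title_c": (0.10, 4.9)}
scores = {t: w1 * p + w2 * r + b for t, (p, r) in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))
```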

As you might guess, the previous two-dimensional model is a very basic baseline. Apart from popularity and rating prediction, we have tried many other features at Netflix. Some have shown no positive effect while others have improved our ranking accuracy tremendously. Figure 6 shows the ranking improvement we have obtained by adding different features and optimizing the machine learning algorithm.

Figure 5: Constructing a basic personalized two-dimensional ranking function based on popularity and predicted rating

Figure 6: Performance of Netflix ranking system when adding features

Many supervised classification methods can be used for ranking. In Section 5.2, we will explore some of the latest approaches to the learning to rank problem.

3.4 Data
The previous discussion on the ranking algorithms highlights the importance of both data and models in creating an optimal personalized experience. The availability of high volumes of high quality user data allows for some approaches that would have been unthinkable just a few years back. As an example, here are some of the data sources we can use at Netflix to optimize our recommendations:

- We have several billion item ratings from members. And we receive millions of new ratings every day.

- We already mentioned the use of global item popularity for ranking. There are many ways to compute popularity such as over various time ranges or grouping members by region or other similarity metrics.

- Our members add millions of items to their queues each day. And they directly enter millions of search terms each day.

- Each item in our catalog has rich metadata such as actors, director, genre, parental rating, or reviews.

- Using presentation and impression data, we know what items we have recommended and where we have shown them, and can look at how that decision has affected the user's actions. We can also observe the member's interactions with the recommendations: scrolls, mouse-overs, clicks, or the time spent on a given page.

- Social data has become our latest source of personalization features. Social data may include the social network connections themselves as well as interactions, or activities of connected nodes.

- We can also tap into external data such as box office performance or critic reviews to improve our features.

- And that is not all: there are many other features such as demographics, location, language, or temporal data that can be used in our predictive models.

3.5 Models
So, what about the models? As we described in Section 1.2, many different modeling approaches have been used for building personalization engines. One thing we have found at Netflix is that with the great availability of data, both in quantity and types, a thoughtful approach is required to model selection, training, and testing. We use all sorts of machine learning approaches: from unsupervised methods such as clustering algorithms to a number of supervised classifiers that have shown optimal results in various contexts. This is an incomplete list of methods you should probably know about if you are working in machine learning for personalization: Linear regression, Logistic regression, Elastic nets, Singular Value Decomposition, Restricted Boltzmann Machines, Markov Chains, Latent Dirichlet Allocation, Association Rules, Matrix factorization, Gradient Boosted Decision Trees, Random Forests, and Clustering techniques from the simple k-means to graphical approaches such as Affinity Propagation.

There is no easy answer to how to choose which model will perform best in a given problem. The simpler your feature space is, the simpler your model can be. But it is easy to get trapped in a situation where a new feature does not show value because the model cannot learn it. Or, the other way around, to conclude that a more powerful model is not useful simply because you don't have the feature space that exploits its benefits.

4. DISCUSSION: MORE DATA OR BETTER MODELS?

The previous discussion on models vs. data has recently become a favorite - and controversial - topic. The improvements enabled thanks to the availability of large volumes of data together with a certain Big Data "hype" have driven many people to conclude that it is "all about the data".

But in most cases, data by itself does not help in making our predictive models better.

Figure 7: In some cases, accuracy does not depend so much on the model used, but on the amount of data used to train the model. (From "Scaling to Very Very Large Corpora for Natural Language Disambiguation" [Banko and Brill, 2001])

Probably one of the most famous quotes defending the power of data is that of Peter Norvig claiming that "We don't have better algorithms. We just have more data.". He is even misquoted as saying that "All models are wrong, and you don't need them anyway" (you can read Norvig's rebuttal and clarifications on his own webpage, http://norvig.com/fact-check.html). Norvig did co-author a paper entitled "The Unreasonable Effectiveness of Data" [26] in which the authors discuss the power of data over models. The effect that the authors were referring to had already been described years before in a now famous paper by Banko and Brill [9], where the authors included the plot in Figure 7. Both Norvig, and Banko and Brill are of course right... in a context. The problem is that they are now and again misquoted in contexts that are completely different from the original ones.

In order to understand why, we need to clarify the difference between models with high variance or high bias. The basic idea is that there are two possible (and almost opposite) reasons why a model might not perform well. In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as high variance, leads to model overfitting. We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and by increasing the number of data points. Both Banko & Brill and Norvig were dealing with high variance models since they were working on language models in which roughly every word in the vocabulary makes a feature. These are models with many features as compared to the training examples. Therefore, they are likely to overfit. And yes, in this case adding more examples will help. In the opposite case, we might have a model that is too simple to explain the data we have. In that case, known as high bias, adding more data will not help. See Figure 8, which illustrates the performance of a real production system at Netflix as we add more training examples.

Figure 8: Adding more data to high bias models will not help. This plot illustrates a real case from a production system at Netflix.
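A minimal sketch of the diagnostic described above, assuming synthetic data and scikit-learn (nothing here reproduces the Netflix system in Figure 8): train on increasing subsets and compare training and test error. A large, persistent gap suggests high variance (more data helps), while two curves that converge to a similarly high error suggest high bias (more data will not help).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
y = X[:, :5].sum(axis=1) ** 2 + 0.1 * rng.standard_normal(2000)  # nonlinear target

X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

model = Ridge(alpha=1.0)  # deliberately simple (linear) model: expect high bias here
for n in [50, 200, 800, 1500]:
    model.fit(X_train[:n], y_train[:n])
    train_err = mean_squared_error(y_train[:n], model.predict(X_train[:n]))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"n={n:5d}  train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
# Training and test error converge to a similarly high value: adding data does not help.
```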

So, no, more data does not always help. As we have just seen, there can be many cases in which adding more examples to our training set will not improve the model performance.

On the other hand, high bias models will not benefit from more training examples, but they might very well benefit from more features. Or will they? Well, again, it depends.

Let's take the Netflix Prize, for example. Pretty early on, there was a blog post commenting on the use of extra features to solve the problem (http://anand.typepad.com/datawocky/2008/03/more-data-usual.html). The post explains how a team of students got an improvement on the prediction accuracy by adding content features from IMDB. In retrospect, it is easy to criticize the post for making a gross over-generalization from a single data point. Many teams showed later that adding content features from IMDB or the like to an optimized algorithm had little to no improvement. Some of the members of the Gravity team, one of the top contenders for the Prize, published a detailed paper in which they showed how those content-based features would add no improvement to the highly optimized collaborative filtering matrix factorization approach [48]. Again: more data, even in the form of different features, does not always help.

So, is the Big Data revolution only hype? No way. Having more data, both in terms of more examples or more features, is a blessing. The availability of data enables more and better insights and applications. More data indeed enables better approaches. More than that: it requires better approaches! In other words, data is important. But data without a sound approach becomes noise.

5. RESEARCH DIRECTIONS
As I mentioned before, even though there were a lot of research advances in the context of the Netflix Prize, the prize was a simplification. In Section 3, I illustrated the broader scope of the recommendation problem by presenting Netflix's comprehensive approach. In this section, I will describe some of the latest advances in Recommender Systems by highlighting some of the most promising research directions. Most of these directions are enabled thanks to the availability of larger amounts of different data such as implicit user feedback, contextual information, or social network interaction data.

5.1 Beyond explicit ratings
Explicit ratings are not the only or even the best kind of feedback we can get from our users. As already mentioned, explicit feedback is noisy. Another issue with ratings is that they do not represent a linear but rather an ordinal scale. Most traditional methods wrongly interpret ratings as being linear, for example by computing averages. This issue, however, has been addressed by recent methods such as OrdRec [70].

In any case, in most real-world situations, implicit and binary feedback is much more readily available and requires no extra effort on the user side. For instance, on a web page, users visiting a URL or clicking on an ad can be taken as positive feedback. In a music service, a user will decide to listen to a song. Or in a movie service, like Netflix, you will have users deciding to watch a title as an indication that the user liked the movie. That is why, besides trying to address some of the issues with explicit ratings, there have been many recent approaches that focus on the use of the more reliable and readily available implicit feedback. Bayesian Personalized Ranking (BPR) [51], for example, uses implicit feedback to compute a personalized ranking.
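To make this concrete, here is a minimal sketch of the BPR [51] idea with matrix factorization scores: sample (user, positive item, negative item) triples from implicit feedback and take SGD steps on the pairwise logistic loss. The sampling scheme, hyperparameters, and toy data are assumptions, not the setup from the paper.

```python
import numpy as np

def bpr_train(pos, n_users, n_items, f=8, lr=0.05, reg=0.01, steps=20000, seed=0):
    """Sketch of BPR-MF: learn U, V so that observed items score higher than unobserved ones."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, f))
    V = 0.1 * rng.standard_normal((n_items, f))
    users = list(pos)
    for _ in range(steps):
        u = users[rng.integers(len(users))]
        i = list(pos[u])[rng.integers(len(pos[u]))]     # observed (positive) item
        j = rng.integers(n_items)
        while j in pos[u]:                              # sample an unobserved item
            j = rng.integers(n_items)
        x_uij = U[u].dot(V[i] - V[j])
        g = 1.0 / (1.0 + np.exp(x_uij))                 # gradient factor sigma(-x_uij)
        u_f = U[u].copy()
        U[u] += lr * (g * (V[i] - V[j]) - reg * U[u])
        V[i] += lr * (g * u_f - reg * V[i])
        V[j] += lr * (-g * u_f - reg * V[j])
    return U, V

# Toy implicit feedback: user -> set of items they interacted with
positives = {0: {0, 1}, 1: {1, 2}, 2: {3, 4}}
U, V = bpr_train(positives, n_users=3, n_items=5)
print(np.argsort(-U[0] @ V.T))  # personalized ranking of all items for user 0
```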

Implicit and explicit feedback can be combined in different ways [44]. Even the SVD++ approach explained in Section 2.1 and used during the prize can combine explicit and implicit feedback. Another way is to use logistic ordinal regression [45]. Matchbox, a Bayesian approach [62], also offers a framework to integrate different kinds of feedback such as ordinal ratings or implicit like/don't like preferences.

5.2 Personalized Learning to Rank
The traditional pointwise approach to learning to rank described in Section 3.3 treats ranking as a simple binary classification problem where the only input are positive and negative examples. Typical models used in this context include Logistic Regression, Support Vector Machines, or Gradient Boosted Decision Trees.

There is a growing research effort in finding better approaches to ranking. The pairwise approach to ranking, for instance, optimizes a loss function defined on pairwise preferences from the user. The goal is to minimize the number of inversions in the resulting ranking. Once we have reformulated the problem this way, we can transform it back into the previous binary classification problem. Examples of such an approach are RankSVM [17], RankBoost [21], or RankNet [14].

We can also try to directly optimize the ranking of the whole list by using a listwise approach. RankCosine [66], for example, uses similarity between the ranking list and the ground truth as a loss function. ListNet [16] uses KL-divergence as loss function by defining a probability distribution. RankALS [63] is a recent approach that defines an objective function that directly includes the ranking optimization and then uses Alternating Least Squares (ALS) for optimizing it.

Whatever ranking approach we use, we need to use rank-specific information retrieval metrics to measure the performance of the model. Some of those metrics include Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), or Fraction of Concordant Pairs (FCP). What we would ideally like to do is to directly optimize those same metrics. However, it is hard to optimize machine-learned models directly on these measures since they are not differentiable and standard methods such as gradient descent or ALS cannot be directly applied.

In order to optimize those metrics, some methods find a smoothed version of the objective function to run Gradient Descent. CLiMF optimizes MRR [60], and TFMAP [59] optimizes MAP in a similar way. AdaRank [68] uses boosting to optimize NDCG. Another method to optimize NDCG is NDCG-Boost [64], which optimizes the expectation of NDCG over all possible permutations. SVM-MAP [69] relaxes the MAP metric by adding it to the SVM constraints. It is even possible to directly optimize the non-differentiable IR metrics by using techniques such as Genetic Programming, Simulated Annealing [32], or even Particle Swarming [18].
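For reference, here is a minimal sketch of one of the metrics mentioned above, NDCG@k, computed from graded relevance labels. The toy relevance values are assumptions and the exponential gain with a log2 discount is one common convention, not the only one.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # positions 1..k -> log2(2..k+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by the DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of items in the order the model ranked them (higher = better)
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```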

5.3 Context-aware recommendations
Most of the work on recommender systems has traditionally focused on the two-dimensional user/item problem. But we know that in practice many other dimensions might be affecting the user's preference. All of those other dimensions (e.g. location, or time) are referred to as context. Using contextual variables represents having to deal with more data, and a higher dimensionality problem. However, this might prove effective for businesses [25].

Adomavicius and Tuzhilin [2] do a thorough review of approaches to contextual recommendations and categorize context-aware recommender systems (CARS) into three types: contextual pre-filtering, where context drives data selection; contextual post-filtering, where context is used to filter recommendations once they have been computed using a traditional approach; and contextual modeling, where context is integrated directly into the model. An example of contextual pre-filtering is the so-called user micro-profile, in which a single user is represented by a hierarchy of possibly overlapping contextual profiles [8]. Post-filtering methods can use traditional approaches and then apply filtering or weighting. In their experimental evaluation, Panniello et al. [43] found that the choice of a pre-filtering or post-filtering strategy depends on the particular method and sometimes a simple post-filter can outperform an elaborate pre-filtering approach.

Although some standard approaches to recommendation could theoretically accept more dimensions, the only models to report results in this category are Oku et al.'s Context-aware Support Vector Machines (SVM) [42]. Xiong et al. present a Bayesian Probabilistic Tensor Factorization model to capture the temporal evolution of online shopping preferences [67]. The authors show in their experiments that using this third dimension in the form of a tensor does improve accuracy when compared to the non-temporal case.

Multiverse is another multidimensional tensor factorization approach to contextual recommendations that has proved effective in different situations [31].

Factorization Machines [50] are a novel general-purpose regression model that models interactions between pairs of variables and the target by using factors. Factorization Machines have proved to be useful in different tasks and domains [52]. In particular, they can be efficiently used to model the interaction with contextual variables [53]. Another novel approach to contextual recommendations worth mentioning is the one based on the use of the Sparse Linear Method (SLIM) [39].
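A minimal sketch of the second-order Factorization Machine model equation from [50], y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, using the standard reformulation of the pairwise term that runs in O(k*n). The random parameters and the assumed feature layout (a user one-hot block, an item one-hot block, plus a contextual variable) are purely illustrative.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + 0.5 * sum_f ((Vx)_f^2 - (V^2 x^2)_f)."""
    linear = w0 + w.dot(x)
    interactions = 0.5 * np.sum((V.T.dot(x)) ** 2 - (V.T ** 2).dot(x ** 2))
    return linear + interactions

rng = np.random.default_rng(0)
n_features, k = 10, 4                            # e.g. 4 users + 4 items + 2 context slots
w0 = 0.1
w = 0.1 * rng.standard_normal(n_features)
V = 0.1 * rng.standard_normal((n_features, k))   # one k-dim factor vector per feature

# Feature vector: one-hot user 1, one-hot item 2, context feature "weekend" = 1
x = np.zeros(n_features)
x[1] = 1.0      # user block
x[4 + 2] = 1.0  # item block
x[8 + 1] = 1.0  # context block
print(fm_predict(x, w0, w, V))
```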

5.4 Unbalanced class problems and presentation effects
In the traditional formulation of the "Recommender Problem", we have pairs of items and users and user feedback values for very few of those dyads. The problem is formulated as the finding of a utility function or model to estimate the missing values. But in cases where we have implicit feedback, the recommendation problem becomes the prediction of the probability a user will interact with a given item.

There is a big shortcoming in using the standard recommendation formulation in such a setting: we don't have negative feedback. All the data we have is either positive or missing. And the missing data includes both items that the user explicitly chose to ignore because they were not appealing and items that would have been perfect recommendations but were never presented to the user [61].

A way that this unbalanced class problem has been addressed is to convert unlabeled examples into both a positive and a negative example, each with a different weight related to the probability that a random exemplar is positive or negative [19]. Another solution is to binarize the implicit feedback values: any feedback value greater than zero means positive preference, while any value equal to zero is converted to no preference [28]. A greater value in the implicit feedback is used to measure the "confidence" in the fact that the user liked the item.
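A minimal sketch of the binarization and confidence weighting described in [28]: raw implicit counts become a binary preference matrix P and a confidence matrix C = 1 + alpha * R. The value of alpha and the toy counts are assumptions, and the model (typically a weighted ALS factorization) that would consume P and C is omitted here.

```python
import numpy as np

# Raw implicit feedback: e.g. number of times each member played each title
R = np.array([[0, 3, 0, 1],
              [5, 0, 0, 0],
              [0, 1, 7, 0]], dtype=float)

alpha = 40.0                 # confidence scaling, a common but arbitrary choice
P = (R > 0).astype(float)    # binarized preference: 1 = some interaction, 0 = none
C = 1.0 + alpha * R          # confidence: unobserved entries keep a small weight of 1

print(P)
print(C)
```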

In many practical situations, though, we have more information than the simple binary implicit feedback from the user. In particular, we might be able to know whether items not selected by the user were actually shown. This adds very valuable information, but slightly complicates the formulation of our recommendation problem. We now have three different kinds of values for items: positive, presented but not chosen, and not presented. This issue has been recently addressed by the so-called Collaborative Competitive Filtering (CCF) approach [71]. The goal of CCF is to model not only the collaboration between similar users and items, but also the competition of items for user attention.

Another important issue related to how items are presented is the so-called position bias: an item that is presented in the first position of a list is much more likely to be chosen than one that is further down [49].

5.5 Social Recommendations
One of the factors that has contributed the most to the recent availability of large streams of data is the explosion of social networks. Recommender systems have also jumped onto this new source of data [37]. Most of the initial approaches to social recommendation (note that this term was originally used to describe collaborative filtering approaches [10; 58]) relied on the so-called trust-based model in which the trust (or influence) of others is transmitted through the social network connections [41; 7]. It is still unclear whether users prefer recommendations from friends to those coming from other users. In a recent study [12], the authors found that the selection of users where the recommendation came from did not make much difference, except if the recipients of the recommendation were made aware of it. In any case, it seems clear that at the very least social trust can be used as a way to generate explanations and support that have a positive impact on the user.

There are other uses of social information. For instance, social network data can be an efficient way to deal with user or item cold-start. Social information can, for instance, be used to select the most informative and relevant users or items [36]. And speaking of selecting users, some recent methods propose using social information to select experts [57], in a similar way as they can also be selected in collaborative filtering settings [5].

Social-based recommendations can also be combined with the more traditional content-based or collaborative filtering approaches [29]. As a matter of fact, social network information can be efficiently included in a pure collaborative filtering setting by, for example, including it in the matrix factorization objective function [40; 72].

6. CONCLUSION
The Netflix Prize abstracted the recommendation problem to a proxy and simplified question of predicting ratings. But it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system. We also need to take into account factors such as context, popularity, interest, evidence, novelty, diversity, or freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms and different kinds of data.

Recommender systems need to optimize the probability a member chooses an item and enjoys it enough to come back to the service. In order to do so, we should employ all the data that is available: from user ratings and interactions, to content metadata. More data availability enables better results. But in order to get those results, we need to have optimized approaches, appropriate metrics and rapid experimentation.

This availability of more and different sources of data has opened up new research avenues. As personalization algorithms keep improving, so does the experience of the users of these systems. But the recommendation problem is far from solved. There are still many unexplored opportunities and lessons to be learned. Our team of researchers and engineers at Netflix does so every day. Make sure to visit our jobs page (http://jobs.netflix.com) if you are interested in joining us on this pursuit.

7. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowledge and Data Engineering, 17(6):734-749, 2005.
[2] G. Adomavicius and A. Tuzhilin. Recommender Systems Handbook, chapter Context-aware recommender systems, pages 217-253. Springer, 2011.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. of 20th VLDB Conf., 1994.
[4] X. Amatriain, A. Jaimes, N. Oliver, and J. M. Pujol. Data Mining Methods for Recommender Systems. In Recommender Systems Handbook, pages 39-71. Springer US, 2011.
[5] X. Amatriain, N. Lathia, J. M. Pujol, H. Kwak, and N. Oliver. The wisdom of the few: a collaborative filtering approach based on expert opinions from the web. In Proc. of 32nd ACM SIGIR, 2009.
[6] X. Amatriain, J. M. Pujol, and N. Oliver. I Like It... I Like It Not: Evaluating User Ratings Noise in Recommender Systems. In User Modeling, Adaptation, and Personalization, volume 5535, chapter 24, pages 247-258. Springer Berlin, 2009.
[7] R. Andersen, C. Borgs, J. Chayes, U. Feige, A. Flaxman, A. Kalai, V. Mirrokni, and M. Tennenholtz. Trust-based recommendation systems: an axiomatic approach. In Proc. of the 17th WWW, 2008.
[8] L. Baltrunas and X. Amatriain. Towards time-dependant recommendation based on implicit feedback. In Workshop on Context-Aware Recommender Systems (CARS 2009), 2009.
[9] M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In Proc. of ACL '01, pages 26-33, 2001.
[10] C. Basu, H. Hirsh, and W. Cohen. Recommendation as classification: using social and content-based information in recommendation. In Proc. of AAAI '98, 1998.
[11] R. M. Bell and Y. Koren. Lessons from the Netflix Prize Challenge. SIGKDD Explor. Newsl., 9(2):75-79, December 2007.
[12] S. Bourke, K. McCarthy, and B. Smyth. Power to the people: exploring neighbourhood formations in social recommender system. In Proc. of RecSys '11, pages 337-340, 2011.
[13] A. Bouza, G. Reif, A. Bernstein, and H. Gall. Semtree: ontology-based decision tree algorithm for recommender systems. In International Semantic Web Conference, 2008.
[14] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd ICML, pages 89-96, 2005.
[15] R. Burke. The Adaptive Web, chapter Hybrid Web Recommender Systems, pages 377-408. 2007.
[16] Z. Cao and T. Liu. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th ICML, 2007.
[17] O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs. Information Retrieval, 13:201-215, June 2010.
[18] E. Diaz-Aviles, M. Georgescu, and W. Nejdl. Swarming to rank for recommender systems. In Proc. of RecSys '12, 2012.
[19] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proc. of the 14th ACM SIGKDD, 2008.
[20] S. Hsu et al. AIMED - A Personalized TV Recommendation System. In Interactive TV: a Shared Experience, 2007.
[21] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933-969, December 2003.
[22] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[23] R. Ghani and A. Fano. Building recommender systems using a knowledge base of product semantics. In 2nd International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, 2002.
[24] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Journal Information Retrieval, 4(2):133-151, July 2001.
[25] M. Gorgoglione, U. Panniello, and A. Tuzhilin. The effect of context-aware recommendations on customer purchasing behavior and trust. In Proc. of RecSys '11, pages 85-92, 2011.
[26] A. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2):8-12, March 2009.
[27] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5-53, 2004.
[28] Y. Hu, Y. Koren, and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. In Proc. of the 2008 Eighth ICDM, pages 263-272, 2008.
[29] M. Jamali and M. Ester. Trustwalker: a random walk model for combining trust-based and item-based recommendation. In Proc. of KDD '09, 2009.
[30] H. Kang and S. Yoo. SVM and Collaborative Filtering-Based Prediction of User Preference for Digital Fashion Recommendation Systems. IEICE Transactions on Inf & Syst, 2007.
[31] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proc. of the fourth ACM RecSys, 2010.
[32] M. Karimzadehgan, W. Li, R. Zhang, and J. Mao. A stochastic learning-to-rank algorithm and its application to contextual advertising. In Proceedings of the 20th WWW, 2011.
[33] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD, 2008.
[34] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD, 2009.
[35] Y. Koren, R. Bell, and C. Volinsky. Matrix Factorization Techniques for Recommender Systems. Computer, 42(8):30-37, August 2009.
[36] N. N. Liu, X. Meng, C. Liu, and Q. Yang. Wisdom of the better few: cold start recommendation via representative based rating elicitation. In Proc. of RecSys '11, 2011.
[37] H. Ma, H. Yang, M. R. Lyu, and I. King. Sorec: social recommendation using probabilistic matrix factorization. In Proc. of the 17th Conf. on Information and Knowledge Management, 2008.
[38] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Effective personalization based on association rule discovery from web usage data. In WIDM '01.
[39] X. Ning and G. Karypis. Sparse linear methods with side information for top-n recommendations. In Proc. of the 21st WWW, 2012.
[40] J. Noel, S. Sanner, K. Tran, P. Christen, L. Xie, E. V. Bonilla, E. Abbasnejad, and N. Della Penna. New objective functions for social collaborative filtering. In Proc. of WWW '12, pages 859-868, 2012.
[41] J. O'Donovan and B. Smyth. Trust in recommender systems. In Proc. of IUI '05, 2005.
[42] K. Oku, S. Nakajima, J. Miyazaki, and S. Uemura. Context-aware SVM for context-dependent information recommendation. In Proc. of the 7th Conference on Mobile Data Management, 2006.
[43] U. Panniello, A. Tuzhilin, M. Gorgoglione, C. Palmisano, and A. Pedone. Experimental comparison of pre- vs. post-filtering approaches in context-aware recommender systems. In Proceedings of RecSys '09, 2009.
[44] D. Parra and X. Amatriain. Walk the Talk: Analyzing the relation between implicit and explicit feedback for preference elicitation. In User Modeling, Adaption and Personalization, volume 6787, chapter 22, pages 255-268. Springer, 2011.
[45] D. Parra, A. Karatzoglou, X. Amatriain, and I. Yavuz. Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping. In Proc. of the 2011 CARS Workshop.
[46] M. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, volume 4321. 2007.
[47] M. J. Pazzani and D. Billsus. Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27(3):313-331, 1997.
[48] I. Pilászy and D. Tikk. Recommending new movies: even a few ratings are more valuable than metadata. In Proc. of RecSys '09, 2009.
[49] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proc. of the 17th CIKM, pages 43-52, 2008.
[50] S. Rendle. Factorization Machines. In Proc. of 2010 IEEE ICDM, 2010.
[51] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th UAI, 2009.
[52] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proc. of the 19th WWW, New York, NY, USA, 2010.
[53] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proc. of the 34th ACM SIGIR, 2011.
[54] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of ICML '07, 2007.
[55] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International WWW Conference, 2000.
[56] Science. Rockin' to the Music Genome. Science, 311(5765):1223d, 2006.
[57] X. Sha, D. Quercia, P. Michiardi, and M. Dell'Amico. Spotting trends: the wisdom of the few. In Proc. of RecSys '12, 2012.
[58] U. Shardanand and P. Maes. Social information filtering: algorithms for automating word of mouth. In Proc. of SIGCHI '95, 1995.
[59] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, A. Hanjalic, and N. Oliver. TFMAP: optimizing MAP for top-n context-aware recommendation. In Proc. of the 35th SIGIR, 2012.
[60] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, and A. Hanjalic. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proc. of the sixth RecSys, 2012.
[61] H. Steck. Training and testing of recommender systems on data missing not at random. In Proc. of the 16th ACM SIGKDD, 2010.
[62] D. H. Stern, R. Herbrich, and T. Graepel. Matchbox: large scale online Bayesian recommendations. In Proc. of the 18th WWW, 2009.
[63] G. Takács and D. Tikk. Alternating least squares for personalized ranking. In Proc. of RecSys '12, 2012.
[64] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to Rank by Optimizing NDCG Measure. In Proc. of SIGIR '00, 2000.
[65] Gui-Rong X., Chenxi L., Qiang Y., WenSi X., Hua-Jun Z., Yong Y., and Zheng C. Scalable collaborative filtering using cluster-based smoothing. In Proc. of the 2005 SIGIR, 2005.
[66] F. Xia, T. Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: theory and algorithm. In Proc. of the 25th ICML, New York, NY, USA, 2008.
[67] L. Xiong, X. Chen, T. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of SIAM Data Mining, 2010.
[68] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In Proc. of SIGIR '07, 2007.
[69] J. Xu, T. Y. Liu, M. Lu, H. Li, and W. Y. Ma. Directly optimizing evaluation measures in learning to rank. In Proc. of SIGIR '08, 2008.
[70] Y. Koren and J. Sill. OrdRec: an ordinal model for predicting personalized item rating distributions. In RecSys '11, pages 117-124, 2011.
[71] S. H. Yang, B. Long, A. J. Smola, H. Zha, and Z. Zheng. Collaborative competitive filtering: learning recommender using context of user choice. In Proc. of the 34th ACM SIGIR, 2011.
[72] X. Yang, H. Steck, Y. Guo, and Y. Liu. On top-k recommendation using social networks. In Proc. of RecSys '12, 2012.