Recommender Systems Notation: Proposed Common Notation for Teaching and Research

As the field of recommender systems has developed, authors have used a myriad of notations for describing the mathematical workings of recommendation algorithms. These notations appear in research papers, books, lecture notes, blog posts, and software documentation. The disciplinary diversity of the field has not contributed to consistency in notation; scholars whose home base is in information retrieval have different habits and expectations than those in machine learning or human-computer interaction. In the course of years of teaching and research on recommender systems, we have seen the value in adopting a consistent notation across our work. This has been particularly highlighted in our development of the Recommender Systems MOOC on Coursera (Konstan et al. 2015), as we need to explain a wide variety of algorithms and our learners are not well-served by changing notation between algorithms. In this paper, we describe the notation we have adopted in our work, along with its justification and some discussion of considered alternatives. We present this in hope that it will be useful to others writing and teaching about recommender systems. This notation has served us well for some time now, in research, online education, and traditional classroom instruction. We feel it is ready for broad use.


INTRODUCTION
As the field of recommender systems has developed, authors have used a myriad of notations for describing the mathematical workings of recommendation algorithms. These notations appear in research papers, books, lecture notes, blog posts, and software documentation. The disciplinary diversity of the field has not contributed to consistency in notation; scholars whose home base is in information retrieval have different habits and expectations than those in machine learning or human-computer interaction.
In the course of years of teaching and research on recommender systems, we have seen the value in adopting a consistent notation across our work. This has been particularly highlighted in our development of the Recommender Systems MOOC on Coursera (Konstan et al. 2015), as we need to explain a wide variety of algorithms and our learners are not well-served by changing notation between algorithms.
In this paper, we describe the notation we have adopted in our work, along with its justification and some discussion of considered alternatives. We present this in hope that it will be useful to others writing and teaching about recommender systems. This notation has served us well for some time now, in research, online education, and traditional classroom instruction. We feel it is ready for broad use.

DESIGN GOALS
We have several design goals in order to support a wide range of recommender exposition. Some of these are in conflict, and we navigate tensions between competing desiderata in some of our specific decisions.

Flexibility
We desire our notation to be flexible to a wide range of algorithms; it should apply equally well to neighborhood models and advanced learning-to-rank applications.

Clarity
We want to minimize the guesswork required in order to interpret notation. When feasible, common symbols should be connected to their referents. Notation should not be ambiguous.

Concision
At the same time, notation should not be overly verbose. While we favor clarity over terseness, clarity is not always well-served by cumbersome explicit notation.

Commonality
When practical, we also seek to use common notation from other fields. For example, if it is feasible to reuse common linear algebra notation, we seek to do so.

Usability by Hand
To support educational use, as well as whiteboard collaboration in office and lab environments, we want our notation to be legible when hand-written. In particular, we avoid distinctions in symbols that are difficult to replicate in handwriting.

RECOMMENDATION INPUTS
With these in mind, we begin with notation for the underlying objects in a recommender system.
We denote items $i, j \in I$ and users $u, v \in U$. We consistently maintain a distinction between the variables used to denote users and items to avoid ambiguity. If we need more than two users or items, we use numeric subscripts, such as $u_1, u_2, \ldots, u_k$.
User-item preference data is denoted in the form of a $|U| \times |I|$ ratings matrix $R$. $R$ is sparsely observed; in a mild abuse of notation, we also consider $R$ as a set, and write $r_{ui} \in R$ to denote that we have a rating or other observed or inferred preference from user $u$ for item $i$.
We can then denote subsets of these sets.
$R_u$ is the set of ratings by user $u$ and $R_i$ is the set of ratings of item $i$. Since we use distinct variables for users and items, the direction of the subset operation is clear from context. We can also denote user and item rating vectors by $\vec{r}_u$ and $\vec{r}_i$, which again are sparsely observed. In typed material we will sometimes use boldface $\mathbf{r}_u$ and $\mathbf{r}_i$.
We find it useful to also denote subsets of users and items. $I_u = \{i \in I : r_{ui} \in R\}$ is the set of items that have been rated by $u$, and likewise $U_i = \{u \in U : r_{ui} \in R\}$ is the set of users who have rated $i$.
Overload Note: Our use of $I$ for the set of items conflicts with the common use of $I$ as the identity matrix. We find item sets to be a significantly more common notational need than identity matrices. Therefore, when an identity matrix is required, a less common but precedented notation such as $E$ or $\mathrm{Id}$ may be used.
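As a concrete (if minimal) illustration, the sparse rating data and the derived sets $I_u$ and $U_i$ can be sketched in Python. The dictionary representation, the sample data, and the helper names below are our own invention for the example, not part of the notation:

```python
# Sketch: the sparse ratings matrix R represented as a dict keyed by
# (user, item) pairs; U, I, I_u, and U_i are derived from it.
# All data and helper names here are illustrative.

R = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 3.0,
    ("u2", "i1"): 5.0, ("u2", "i3"): 2.0,
}

U = {u for (u, i) in R}   # set of users
I = {i for (u, i) in R}   # set of items

def items_rated_by(u):
    """I_u = {i in I : r_ui in R} -- items that user u has rated."""
    return {i for (v, i) in R if v == u}

def users_who_rated(i):
    """U_i = {u in U : r_ui in R} -- users who have rated item i."""
    return {u for (u, j) in R if j == i}
```

Because $R$ is a set of observed pairs, membership tests like `("u1", "i3") in R` directly express the $r_{ui} \in R$ notation.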

Summary Statistics
This notation lends itself well to a number of summary statistics:

$|I_u|$
The number of items rated or consumed by user $u$.

$|U_i|$
The number of users who have rated or consumed item $i$.

$|R_u|$
The number of ratings provided by $u$ ($|R_u| = |I_u|$ unless there are repeated ratings for the same user-item pair).

$|R_i|$
The number of ratings provided for $i$ (likewise, $|R_i| = |U_i|$ in the absence of repeated ratings).

$\bar{r}_u$, $\bar{r}_i$
The average of user $u$'s or item $i$'s ratings: $\bar{r}_u = \frac{\sum_{i \in I_u} r_{ui}}{|I_u|}$
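These summary statistics are straightforward to compute from a sparse rating set; a minimal sketch, with illustrative sample data:

```python
# Sketch: |I_u|, |U_i|, and the user mean rating
#   r̄_u = (sum over i in I_u of r_ui) / |I_u|
# Sample data is illustrative only.

R = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 2.0,
    ("u2", "i1"): 5.0,
}

def I_u(u):
    return {i for (v, i) in R if v == u}

def U_i(i):
    return {u for (u, j) in R if j == i}

def user_mean(u):
    items = I_u(u)
    return sum(R[(u, i)] for i in items) / len(items)

n_items_u1 = len(I_u("u1"))   # |I_u| for user u1
n_users_i1 = len(U_i("i1"))   # |U_i| for item i1
```

With at most one rating per user-item pair, as here, $|R_u| = |I_u|$ and $|R_i| = |U_i|$.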

Unobserved Underlying Data
It is sometimes useful to refer to underlying "true preference" scores of which ratings are a noisy observation. We denote these scores $\pi_{ui} \in \Pi$. If we want to model a rating as being preference plus Gaussian noise (psychologically unrealistic but precedented in the recommender systems literature), we can say:

$r_{ui} = \pi_{ui} + \epsilon, \quad \epsilon \sim \mathrm{N}(0, \sigma^2)$
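The preference-plus-Gaussian-noise model is easy to simulate; the preference value and noise scale below are arbitrary illustrative numbers:

```python
import random

# Sketch: an observed rating as true preference plus Gaussian noise,
#   r_ui = pi_ui + eps,  eps ~ N(0, sigma^2).
# pi_ui and sigma are made-up illustrative values.

random.seed(42)          # for reproducibility of the sketch
sigma = 0.5
pi_ui = 3.5                               # "true" preference
r_ui = pi_ui + random.gauss(0.0, sigma)   # noisy observed rating
```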

RECOMMENDATION OUTPUTS
Many recommendation algorithms compute an ordering of items for a user. The recommendation request may also be associated with a context and/or an explicit representation of a user information need, such as a search query. We can denote this with an ordering function $o: U \times H \times Z \to I^n$, where $Z$ is a set of queries or task descriptions, and $H$ a set of contexts or context cues.
Recommendation is often performed by a top-$N$ ranking using some per-item score; such a score is also the basis of rating prediction or preference estimation. We can similarly denote this with a function $s(i|u, h, z): I \times U \times H \times Z \to \mathbb{R}$.

For both of these functions, any given implementation may only depend on a subset of the input variables. This enables our notation to encompass a wide range of specific applications in the recommendation and search space; for example, a traditional personalized rating predictor depends only on the user and is written $s(i|u)$, while a non-personalized search ranker is written $s(i|z)$.

Note: we have considered several different variables to denote the set of queries or task descriptions. Earlier versions of this notation used $q$, but that overloads with the common use of $Q$ as the right-hand side of a matrix decomposition. We also considered $T$ for task, but that would result in denoting individual task descriptions by $t$, which causes unfortunate overload with the near-universal use of $t$ for time when conducting temporal analysis, evaluations, or algorithm implementations. We choose $z$ as being a relatively neutral selection that doesn't conflict with other common use cases.
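The idea that an implementation may ignore some of the scoring arguments can be sketched with optional parameters; the popularity table and every name below are invented for the example:

```python
# Sketch: a scoring function s(i | u, h, z) and an ordering function
# built from it. This toy scorer ignores the user u, context h, and
# query z, using only item popularity; all values are illustrative.

def score(i, u=None, h=None, z=None):
    """Toy s(i|u, h, z): non-personalized popularity lookup."""
    popularity = {"i1": 0.9, "i2": 0.4}
    return popularity.get(i, 0.0)

def recommend(items, u=None, h=None, z=None, n=2):
    """Toy ordering function o: rank items by score, keep the top n."""
    return sorted(items, key=lambda i: score(i, u, h, z), reverse=True)[:n]
```

A personalized or context-aware scorer would use the same signature but read `u`, `h`, or `z`, so callers need not change when the algorithm does.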

NOTATING ALGORITHM FAMILIES
With this general notation in place, we can now apply it to describing various standard recommendation algorithms.

Bias Model
Many algorithms build on a user-item bias model, or personalized mean; it is a useful fallback for predicting ratings when a more sophisticated algorithm cannot produce a score, and it is useful in normalization steps prior to running other algorithms (M. D. Ekstrand, Riedl, and Konstan 2010; Funk 2006). We notate this as:

$s(i|u) = b_{ui} = \mu + b_i + b_u$

$b_i = \frac{\sum_{u \in U_i} (r_{ui} - \mu)}{|U_i| + \beta_i} \qquad b_u = \frac{\sum_{i \in I_u} (r_{ui} - \mu - b_i)}{|I_u| + \beta_u}$

where $\mu$ is the global mean rating. The regularizing constants $\beta_i$ and $\beta_u$ determine the bias model's skepticism towards extreme biases on low-information users and items. The bias model can also be determined by optimization instead of the direct formulas above, either on its own or as a part of the optimization of some larger scoring model.
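The damped bias model can be computed directly in a couple of passes over the ratings; in this sketch the sample data and the damping values $\beta_i = \beta_u = 5$ are arbitrary illustrative choices:

```python
# Sketch: damped user-item bias model
#   b_ui = mu + b_i + b_u
# with damping terms beta_i, beta_u shrinking biases of
# low-information items/users toward zero. Data is illustrative.

R = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 2.0,
    ("u2", "i1"): 5.0, ("u2", "i2"): 3.0,
}
beta_u = beta_i = 5.0

mu = sum(R.values()) / len(R)          # global mean rating

items = {i for (_, i) in R}
users = {u for (u, _) in R}

b_i = {}                               # item biases
for i in items:
    resid = [r - mu for (u, j), r in R.items() if j == i]
    b_i[i] = sum(resid) / (len(resid) + beta_i)

b_u = {}                               # user biases (computed after b_i)
for u in users:
    resid = [r - mu - b_i[i] for (v, i), r in R.items() if v == u]
    b_u[u] = sum(resid) / (len(resid) + beta_u)

def predict(u, i):
    """b_ui = mu + b_i + b_u; falls back to mu for unseen users/items."""
    return mu + b_i.get(i, 0.0) + b_u.get(u, 0.0)
```

Note the fallback behavior: for a user or item with no data the corresponding bias is simply omitted, which is exactly why this model is a useful baseline.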

Probabilistic Models
We can similarly denote probabilistic models using the probability that an item is in a rating set, as in this non-personalized popularity model (probability taken over users $u \in U$):

$s(i) = \Pr[i \in I_u]$
We can likewise write association rule metrics such as lift (in this formula, the context $h$ is the item for which lift is being used to compute related items):

$s(i|h) = \frac{\Pr[i \in I_u \mid h \in I_u]}{\Pr[i \in I_u]}$

When clear in the surrounding context, we sometimes simplify to write $\Pr[i|h]$.
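Estimating these probabilities as frequencies over users gives a short, concrete sketch; the per-user item sets below are illustrative sample data:

```python
# Sketch: popularity s(i) = Pr[i in I_u] and lift
#   s(i|h) = Pr[i in I_u | h in I_u] / Pr[i in I_u],
# with probabilities estimated as frequencies over users.
# The item sets are illustrative sample data.

I_u = {
    "u1": {"a", "b"},
    "u2": {"a", "b"},
    "u3": {"a"},
    "u4": {"c"},
}
n = len(I_u)

def p(i):
    """Pr[i in I_u], taken over users u."""
    return sum(1 for s in I_u.values() if i in s) / n

def p_both(i, h):
    """Pr[i in I_u and h in I_u]."""
    return sum(1 for s in I_u.values() if i in s and h in s) / n

def lift(i, h):
    """Pr[i|h] / Pr[i], computed as Pr[i and h] / (Pr[i] Pr[h])."""
    return p_both(i, h) / (p(i) * p(h))
```

A lift above 1 means consuming the context item $h$ makes item $i$ more likely than its base popularity suggests.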

Neighborhood Approaches
User-based nearest-neighbor scoring (Herlocker, Konstan, and Riedl 2002) can be notated as:

$s(i|u) = \bar{r}_u + \frac{\sum_{v \in N(u|i)} w_{uv} (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u|i)} |w_{uv}|}$

This introduces two more pieces of notation. $w_{uv}$ is the interpolation weight between users $u$ and $v$; we prefer this notation over a similarity notation $\mathrm{sim}(u, v)$ or $s(u, v)$ so that we can use $s$ to denote a score and to facilitate the use of other interpolation weighting schemes without changing the overall notation. A good choice is the cosine of normalized rating vectors (M. D. Ekstrand et al. 2011), described for item weights below; many implementations use the Pearson correlation (Herlocker, Konstan, and Riedl 2002).
$N(u|i)$ is the neighborhood for $u$ with respect to $i$. This will typically be the users most similar to $u$ that have rated $i$. We can also consider the decontextualized neighborhood $N(u)$, if we simply want to find similar users but do not need them to have rated any particular items. In cases where it is useful to explicitly notate the neighborhood size, it can be done with a subscript, as in $N_k(u|i)$.
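A minimal sketch of the user-based scoring formula, with precomputed interpolation weights standing in for whatever weighting scheme is in use; the ratings and weights are illustrative:

```python
# Sketch: user-based neighborhood scoring
#   s(i|u) = r̄_u + (sum over v in N(u|i) of w_uv (r_vi - r̄_v))
#                 / (sum over v in N(u|i) of |w_uv|).
# Ratings and the weights w_uv are illustrative sample values.

R = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 2.0,
    ("u2", "i1"): 5.0, ("u2", "i2"): 3.0, ("u2", "i3"): 4.0,
    ("u3", "i1"): 2.0, ("u3", "i3"): 3.0,
}
w = {("u1", "u2"): 0.9, ("u1", "u3"): 0.3}   # w_uv (e.g. vector cosine)

def mean(u):
    rs = [r for (v, _), r in R.items() if v == u]
    return sum(rs) / len(rs)

def neighborhood(u, i):
    """N(u|i): weighted neighbors of u who have rated i."""
    return [v for (x, v) in w if x == u and (v, i) in R]

def score(i, u):
    N = neighborhood(u, i)
    num = sum(w[(u, v)] * (R[(v, i)] - mean(v)) for v in N)
    den = sum(abs(w[(u, v)]) for v in N)
    return mean(u) + num / den
```

Normalizing by $\sum |w_{uv}|$ keeps the prediction on the rating scale even when weights vary in magnitude or sign.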
Item-based nearest-neighbor (Sarwar et al. 2001) can be described as:

$s(i|u) = \frac{\sum_{j \in N(i|u)} w_{ij} r_{uj}}{\sum_{j \in N(i|u)} |w_{ij}|} \qquad w_{ij} = \frac{\vec{\tilde{r}}_i \cdot \vec{\tilde{r}}_j}{\|\vec{\tilde{r}}_i\|_2 \|\vec{\tilde{r}}_j\|_2}$

$\vec{\tilde{r}}_i$ here denotes a normalized version of rating vector $\vec{r}_i$; often $\tilde{r}_{ui} = r_{ui} - \bar{r}_u$. The weights and neighborhoods here are analogous to those in the user-based case.
$w_{ij}$ is the weight between items $i$ and $j$; while it is often computed as the cosine above, SLIM (Ning and Karypis 2011) provides an alternative using elastic net regression to learn each item's neighbor weights.
$N(i|u)$ is the neighborhood for $i$ with respect to $u$; this will usually be a subset of a larger pool of neighbors $N(i)$. Both sets are typically the items most similar to $i$ (that have been rated by $u$, in the $N(i|u)$ case), but other methods such as SLIM's lasso regularization are precedented.
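The cosine weight over mean-centered rating vectors can be sketched as follows; the sample ratings are illustrative:

```python
import math

# Sketch: item-item cosine weights over mean-centered rating vectors,
#   w_ij = (r̃_i . r̃_j) / (||r̃_i|| ||r̃_j||),  with r̃_ui = r_ui - r̄_u.
# Sample ratings are illustrative.

R = {
    ("u1", "i1"): 4.0, ("u1", "i2"): 2.0,
    ("u2", "i1"): 5.0, ("u2", "i2"): 3.0,
}
users = {u for (u, _) in R}

def user_mean(u):
    rs = [r for (v, _), r in R.items() if v == u]
    return sum(rs) / len(rs)

def centered_item_vec(i):
    """Sparse r̃_i: maps user u to (r_ui - r̄_u) for users who rated i."""
    return {u: R[(u, i)] - user_mean(u) for u in users if (u, i) in R}

def cosine_w(i, j):
    vi, vj = centered_item_vec(i), centered_item_vec(j)
    common = set(vi) & set(vj)          # users who rated both items
    dot = sum(vi[u] * vj[u] for u in common)
    ni = math.sqrt(sum(x * x for x in vi.values()))
    nj = math.sqrt(sum(x * x for x in vj.values()))
    return dot / (ni * nj)
```

Centering by the user mean before taking the cosine is what makes this weight behave like a correlation: items rated above-average by the same users score near 1.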

Matrix Factorization
Matrix factorization (Koren, Bell, and Volinsky 2009) typically breaks down the rating matrix into composite user-feature and item-feature matrices:

$R \approx P Q^{\mathrm{T}}$

In this decomposition, $P$ is the $|U| \times k$ user-feature matrix and $Q$ is the $|I| \times k$ item-feature matrix. We use the $Q^{\mathrm{T}}$ framing so that users and items are both represented by row vectors, for consistency. Within these matrices, we use standard matrix entry notation to denote rows and entries:

$\vec{p}_u$ or $\mathbf{p}_u$
User $u$'s latent feature vector.

$\vec{q}_i$ or $\mathbf{q}_i$
Item $i$'s latent feature vector.

$p_{uk}$
User $u$'s value for feature $k$.

$q_{ik}$
Item $i$'s value for feature $k$.
We can then denote a score:

$s(i|u) = \vec{p}_u \cdot \vec{q}_i$

We prefer dot product notation over the matrix multiplication equivalent ($\vec{p}_u \vec{q}_i^{\mathrm{T}}$, since latent feature vectors are row vectors) because it is easier to interpret for students without deep intuitive fluency in linear algebra; it is syntactic sugar for a simple sum, rather than a special case of a more advanced operation. This notation works well for a variety of matrix factorization applications, including those for both explicit and implicit feedback, and for matrix factorization internals for more sophisticated algorithms such as BPR-MF (Rendle et al. 2009).
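The dot-product score is exactly the "simple sum" described above; in this sketch $P$ and $Q$ hold per-user and per-item latent row vectors with arbitrary illustrative values:

```python
# Sketch: factorization score s(i|u) = p_u . q_i, with P and Q stored
# as per-user / per-item latent row vectors (k = 2 features here).
# All feature values are arbitrary illustrative numbers.

P = {"u1": [0.5, 1.0], "u2": [1.0, 0.0]}   # user-feature rows p_u
Q = {"i1": [2.0, 1.0], "i2": [0.0, 1.0]}   # item-feature rows q_i

def score(i, u):
    """s(i|u) = p_u . q_i = sum over k of p_uk * q_ik."""
    return sum(p_uk * q_ik for p_uk, q_ik in zip(P[u], Q[i]))
```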

NEXT STEPS
We hope that the recommender systems research and education community finds this useful. We do not intend to propose this as a formal standard for the community, but we have found it helpful to standardize notation across our own work and think there is merit in greater commonality in notation across the field at large.