Monday, April 20, 2009

Tensor Decomposition and Applications


This is the first systematic paper I have read on tensors. A tensor in computer science is a multi-dimensional array (I am still wondering whether it has anything to do with the term tensor in differential geometry).

So before we talk about tensors, here are several notations we'd like to use in the text:
  • The order of a tensor is the number of dimensions: vectors are order-1 tensors, matrices are order-2 tensors.
  • An order-1 tensor is denoted with \mathbf{a}, an order-2 tensor with \mathbf{A}, and a higher-order tensor with \mathcal{A}.
  • A fibre is the tensor analog of a matrix row/column. A tensor \mathcal{X} of order n has mode-j fibres of the form \mathbf{x}_{i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n}, i.e. the vector obtained by fixing every index i_k except i_j.
  • A slice of a tensor is obtained by fixing all indices except two, which can be regarded as a plane.
  • A tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_n} is said to have rank 1 if \mathcal{X} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(n)}, where the \circ operator means x_{i_1, \ldots, i_n} = a_{i_1}^{(1)}\cdots a_{i_n}^{(n)}.
  • Matricization (a.k.a. unfolding, flattening) is the process of reordering the tensor data into a matrix. The mode-k unfolding of the order-n tensor \mathcal{X} is denoted \mathbf{X}_{(k)} \in \mathbb{R}^{I_k\times (\prod_{s\neq k} I_s)}.
  • The k-mode product of a tensor \mathcal{X} and a matrix \mathbf{U} \in \mathbb{R}^{J \times I_k} is
    (\mathcal{X} \times_k \mathbf{U})_{i_1, \ldots, j, \ldots, i_n} = \sum_{i_k = 1}^{I_k} x_{i_1, \ldots, i_n} u_{j, i_k}.
    (A small numpy sketch of these operations follows this list.)
  • The Kronecker product of two matrices \mathbf{A}\in \mathbb{R}^{I \times J}, \mathbf{B} \in \mathbb{R}^{K \times L} is
    \mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{1, 1} \mathbf{B} & \cdots & a_{1, J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I, 1} \mathbf{B} & \cdots & a_{I, J} \mathbf{B} \end{pmatrix}.
  • The Khatri-Rao product of the two matrices \mathbf{A} \in \mathbb{R}^{I \times K}, \mathbf{B} \in \mathbb{R}^{J \times K} is
    \mathbf{A} \odot \mathbf{B} = \begin{pmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{pmatrix}.
  • The Hadamard product is the elementwise product of two matrices \mathbf{A}, \mathbf{B} \in \mathbb{R}^{I \times J}, i.e.
    \mathbf{A} * \mathbf{B} = \begin{pmatrix} a_{1,1} b_{1, 1} & \cdots & a_{1, J} b_{1, J} \\ \vdots & \ddots & \vdots \\ a_{I,1} b_{I, 1} & \cdots & a_{I, J} b_{I, J}\end{pmatrix}.
  • The so-called CP decomposition (canonical decomposition, parallel factors) of a tensor is a sum of rank 1 components
    \mathcal{X} \approx \sum_{r = 1}^R \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)}.
    This is usually denoted with
    \mathcal{X} \approx [\![ \mathbf{\lambda}; \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(n)} ]\!] \stackrel{\triangle}{=} \sum_{r = 1}^R \lambda_r \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)},
    where the columns of each \mathbf{A}^{(i)} are normalized to unit length and the scales are absorbed into \mathbf{\lambda}.
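
To make these definitions concrete, here is a small numpy sketch of the mode-k unfolding, the k-mode product, and the Khatri-Rao product. This is my own illustration rather than anything from the paper; the names unfold, fold, mode_k_product and khatri_rao are made up, and the unfolding follows the column ordering used in the paper (earlier modes vary fastest).

import numpy as np

def unfold(X, k):
    # Mode-k unfolding X_(k): the mode-k fibres become the columns, with earlier
    # modes varying fastest (the ordering used in the paper).
    return np.reshape(np.moveaxis(X, k, 0), (X.shape[k], -1), order='F')

def fold(M, k, shape):
    # Inverse of unfold: rebuild a tensor of the given shape from its mode-k unfolding.
    rest = [s for i, s in enumerate(shape) if i != k]
    return np.moveaxis(np.reshape(M, [shape[k]] + rest, order='F'), 0, k)

def mode_k_product(X, U, k):
    # k-mode product X x_k U: multiply every mode-k fibre of X by the matrix U.
    shape = list(X.shape)
    shape[k] = U.shape[0]
    return fold(U @ unfold(X, k), k, shape)

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (I x K) and B (J x K), an (I*J) x K matrix.
    I, K = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

# Quick sanity checks on small random inputs.
X = np.random.rand(3, 4, 5)
U = np.random.rand(6, 4)
assert mode_k_product(X, U, 1).shape == (3, 6, 5)
A, B = np.random.rand(3, 2), np.random.rand(4, 2)
AB = khatri_rao(A, B)
assert np.allclose(AB[:, 0], np.kron(A[:, 0], B[:, 0]))   # columns are Kronecker products
assert np.allclose(AB.T @ AB, (A.T @ A) * (B.T @ B))      # identity used in ALS below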

However, low-rank approximation is not as well behaved for tensors as it is for matrices. For one thing, there may not exist a rank-2 tensor that is nearest to a given rank-3 tensor, so the best low-rank approximation need not even exist. Moreover, although various bounds on tensor rank are known, there is still no practical way to determine the rank of a given tensor. Both issues make it difficult to find a good low-rank approximation.
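
A standard example (due to De Silva and Lim, also discussed in the paper; I am writing it down from memory) illustrates the first point. Take linearly independent pairs \mathbf{a}_1, \mathbf{a}_2, \mathbf{b}_1, \mathbf{b}_2 and \mathbf{c}_1, \mathbf{c}_2 and let
\mathcal{X} = \mathbf{a}_1 \circ \mathbf{b}_1 \circ \mathbf{c}_2 + \mathbf{a}_1 \circ \mathbf{b}_2 \circ \mathbf{c}_1 + \mathbf{a}_2 \circ \mathbf{b}_1 \circ \mathbf{c}_1,
which has rank 3. The rank-2 tensors
\mathcal{Y}_\alpha = \alpha (\mathbf{a}_1 + \frac{1}{\alpha}\mathbf{a}_2) \circ (\mathbf{b}_1 + \frac{1}{\alpha}\mathbf{b}_2) \circ (\mathbf{c}_1 + \frac{1}{\alpha}\mathbf{c}_2) - \alpha\, \mathbf{a}_1 \circ \mathbf{b}_1 \circ \mathbf{c}_1
satisfy \mathcal{Y}_\alpha \to \mathcal{X} as \alpha \to \infty, so there is no best rank-2 approximation of \mathcal{X}.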

The most popular algorithm for CP decomposition is ALS (alternating least squares). The idea is simple: fix all the factor matrices \mathbf{A}^{(i)} except \mathbf{A}^{(k)}; using the mode-k unfolding, solving for \mathbf{A}^{(k)} alone becomes the least-squares problem
\min_{\hat{\mathbf{A}}} \| \mathbf{X}_{(k)} - \hat{\mathbf{A}} (\mathbf{A}^{(n)} \odot \cdots \odot \mathbf{A}^{(k+1)} \odot \mathbf{A}^{(k-1)} \odot \cdots \odot \mathbf{A}^{(1)})^\top\|,
whose least-squares solution is
\hat{\mathbf{A}} = \mathbf{X}_{(k)} [ (\mathbf{A}^{(n)} \odot \cdots \odot \mathbf{A}^{(k+1)} \odot \mathbf{A}^{(k-1)} \odot \cdots \odot \mathbf{A}^{(1)})^\top ]^\dagger.
Please notice that (\mathbf{A} \odot \mathbf{B})^\top (\mathbf{A} \odot \mathbf{B}) = (\mathbf{A}^\top \mathbf{A}) * (\mathbf{B}^\top \mathbf{B}), therefore
\hat{\mathbf{A}} = \mathbf{X}_{(k)} (\mathbf{A}^{(n)} \odot \cdots \odot \mathbf{A}^{(k+1)} \odot \mathbf{A}^{(k-1)} \odot \cdots \odot \mathbf{A}^{(1)}) [\mathbf{A}^{(1)\top} \mathbf{A}^{(1)} * \cdots * \mathbf{A}^{(k-1)\top} \mathbf{A}^{(k-1)} * \mathbf{A}^{(k+1)\top} \mathbf{A}^{(k+1)} * \cdots * \mathbf{A}^{(n)\top} \mathbf{A}^{(n)}]^\dagger,
which only requires the pseudo-inverse of an R \times R matrix. The values of \mathbf{\lambda} are updated with the norms of the columns of \hat{\mathbf{A}}, and \mathbf{A}^{(k)} is set to the column-normalized \hat{\mathbf{A}}.
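
Below is a minimal sketch of CP-ALS along these lines. Again this is my own illustration, not the paper's pseudocode: it reuses unfold and khatri_rao from the sketch above, starts from a random initialization, and omits any convergence check.

def cp_als(X, R, n_iter=100):
    # Rank-R CP decomposition of X by alternating least squares.
    n = X.ndim
    A = [np.random.rand(X.shape[k], R) for k in range(n)]   # factor matrices A^(1), ..., A^(n)
    lam = np.ones(R)
    for _ in range(n_iter):
        for k in range(n):
            # Khatri-Rao product of all factors except A^(k), last mode first,
            # to match the column ordering of the mode-k unfolding above.
            KR = None
            for s in reversed(range(n)):
                if s != k:
                    KR = A[s] if KR is None else khatri_rao(KR, A[s])
            # Hadamard product of the Gram matrices of the other factors.
            V = np.ones((R, R))
            for s in range(n):
                if s != k:
                    V = V * (A[s].T @ A[s])
            # Least-squares update A_hat = X_(k) (KR) V^+, then normalize the
            # columns and absorb their norms into lambda.
            A_hat = unfold(X, k) @ KR @ np.linalg.pinv(V)
            lam = np.linalg.norm(A_hat, axis=0)
            A[k] = A_hat / lam
    return lam, A

Note that each inner update only needs the pseudo-inverse of the small R x R matrix V, which is exactly what the Khatri-Rao/Hadamard identity above buys us.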

Another important tensor problem is the Tucker decomposition (closely related to the HOSVD, the higher-order SVD): we seek
\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \times_3 \cdots \times_n \mathbf{A}^{(n)}.
Usually we want a truncated version, which corresponds to a rank-(R_1, \ldots, R_n) approximation; this problem can also be solved with ALS (the higher-order orthogonal iteration, HOOI).
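
Here is a correspondingly minimal sketch of the truncated HOSVD (the one-pass, non-iterative version; the ALS refinement would iterate similar updates). It reuses unfold and mode_k_product from the earlier sketch, and is again my own illustration under those assumptions.

def hosvd(X, ranks):
    # Truncated higher-order SVD: the factor matrix A^(k) holds the leading
    # left singular vectors of the mode-k unfolding X_(k).
    factors = []
    for k, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, k), full_matrices=False)
        factors.append(U[:, :r])
    # Core tensor: project X onto the chosen subspace in every mode.
    G = X
    for k, A_k in enumerate(factors):
        G = mode_k_product(G, A_k.T, k)
    return G, factors

# Reconstruction: X is approximated by G x_1 A^(1) x_2 ... x_n A^(n).
G, factors = hosvd(np.random.rand(4, 5, 6), (2, 2, 2))
X_hat = G
for k, A_k in enumerate(factors):
    X_hat = mode_k_product(X_hat, A_k, k)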

There are many other decompositions related to tensors:
  • INDSCAL (individual differences in scaling), for order-3 tensors in which the first two dimensions are symmetric.
  • PARAFAC2 (parallel factorization) is, strictly speaking, not a tensor decomposition. It seeks to approximate several matrices \mathbf{X}_k, each by \mathbf{U}_k \mathbf{S}_k \mathbf{V}^\top, where \mathbf{U}_k and \mathbf{V} are orthogonal matrices and the \mathbf{S}_k are diagonal matrices.
  • CANDELINC (canonical decomposition with linear constraints), in which the factors of each dimension are known to lie in a given subspace; by projecting onto these subspaces, the problem can be turned into a standard (smaller) CP decomposition.
  • DEDICOM (decomposition into directional components) seeks an approximation of an asymmetric matrix, \mathbf{X} \approx \mathbf{A} \mathbf{R} \mathbf{A}^\top.
  • PARATUCK2 seeks, for each frontal slice \mathbf{X}_k of \mathcal{X}, an approximation
    \mathbf{X}_k \approx \mathbf{A} \mathbf{D}_k^A \mathbf{R} \mathbf{D}_k^B \mathbf{B}^\top,
    where \mathbf{A}, \mathbf{B} are matrices and \mathbf{D}_k^A, \mathbf{D}_k^B are diagonal matrices.
  • NTF (nonnegative tensor factorization), in which the factor matrices of each dimension are constrained to be nonnegative, in the same spirit as NMF versus PCA.
