MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas.
Related Data Types:
Local Vector - Create a dense vector (1.0, 0.0, 3.0) and the same vector in sparse form (size 3, indices (0, 2), values (1.0, 3.0))
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
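As a quick sanity check (a sketch; assumes a Spark shell or an application with spark-mllib on the classpath), expanding the sparse vector reproduces the dense one:

```scala
import org.apache.spark.mllib.linalg.Vectors

// sv1 stores only the non-zeros: size 3, indices (0, 2), values (1.0, 3.0).
val sv1 = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
println(sv1.toArray.mkString(", "))  // 1.0, 0.0, 3.0
```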
Local Matrix (column-major) - Create a dense 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
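Because the storage is column-major, the value array lists column 0 (1.0, 3.0, 5.0) first, then column 1 (2.0, 4.0, 6.0). Indexing confirms the layout (a minimal sketch):

```scala
import org.apache.spark.mllib.linalg.Matrices

// Column-major layout: the array is column 0 followed by column 1.
val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
println(dm(1, 0))  // 3.0 -- row 1, column 0
println(dm(1, 1))  // 4.0 -- row 1, column 1
```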
Distributed matrix - Row Matrix
A row matrix is a row-oriented matrix without meaningful row indices. e.g. a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
A RowMatrix can be created from an RDD[Vector] instance.
rowMatrix.rows returns the underlying RDD[Vector]; call .collect() on it to bring the rows back to the driver.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rdd: RDD[Vector] = sc.parallelize(Seq(dv, sv1))
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rdd)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Matrix multiply: a RowMatrix times a local matrix.
val result = mat.multiply(dm)
result.rows.collect()
Matrix Operations:
It seems that the JIRA for the distributed block matrix remains unresolved.
Several ways to do distributed matrix multiplication:
1. sparse mat * dense mat : first create a distributed matrix with RowMatrix, then make a local dense matrix and multiply them.
2. block matrix (should have better performance)
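On releases where the block-matrix work has landed (BlockMatrix appeared in Spark 1.3), option 2 can be sketched as below; the 2x2 block size and the example entries are only for illustration, and real workloads would use much larger blocks:

```scala
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build a small matrix from (row, col, value) entries, then split it into blocks.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0), MatrixEntry(2, 0, 3.0)))
val coordMat = new CoordinateMatrix(entries)

// 2x2 blocks here for illustration; pick larger blocks (e.g. 1024) in practice.
val blockMat: BlockMatrix = coordMat.toBlockMatrix(2, 2).cache()

// Fully distributed multiply: A^T * A.
val product: BlockMatrix = blockMat.transpose.multiply(blockMat)
```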
tips:
- To get better performance in Spark, I'd recommend representing your matrix as an RDD of blocks (say 128x128 double arrays) instead of (int, int, double) pairs. (by Matei)
- Netlib-Java is a wrapper for low-level BLAS, LAPACK and ARPACK that performs as fast as the C / Fortran interfaces, with a pure JVM fallback.
The Spark API currently doesn't provide a transpose function for these local Array-based matrices, so you have to define one yourself, like below. (Scala's standard collections do define a transpose method.)
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  // Column c of m becomes row c of the result.
  (for (c <- m(0).indices) yield m.map(_(c))).toArray
}
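A quick usage sketch; note that Scala's standard library offers the same operation as a built-in on nested arrays:

```scala
val m = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))

// Rows of the 3x2 input become columns of the 2x3 output.
val t = transpose(m)   // ((1.0, 3.0, 5.0), (2.0, 4.0, 6.0))

// Built-in equivalent from Scala's collections:
val t2 = m.transpose
```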