Academic Notes

Launch EC2 Spark Cluster

12/5/2015

Spark maintains a document about this, but I found it a bit difficult to follow as a newcomer to Spark. Here is a simpler guide (at least for me).

BIDMat - HDFSIO

11/19/2015

To fully understand Hadoop/Spark IO, you should first understand "Sequence File" and "Serializable".
I was confused about both for a while, so I decided to write this post.
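
To make the two ideas concrete, here is a minimal sketch of a flat file of serialized key/value pairs. It is only an illustration in plain Python: the helper names (`write_records`, `read_records`) are made up for this example, `pickle` stands in for Java's `Serializable` / Hadoop's `Writable`, and the real Sequence File format also has a header, sync markers, and optional compression, none of which are modeled here.

```python
import io
import pickle
import struct


def write_records(stream, pairs):
    """Append each (key, value) pair as two length-prefixed byte blobs."""
    for key, value in pairs:
        for obj in (key, value):
            blob = pickle.dumps(obj)                    # "serializable": object -> bytes
            stream.write(struct.pack(">I", len(blob)))  # 4-byte big-endian length
            stream.write(blob)


def _read_blob(stream, header):
    """Decode one blob whose 4-byte length header has already been read."""
    (length,) = struct.unpack(">I", header)
    return pickle.loads(stream.read(length))


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        key = _read_blob(stream, header)
        value = _read_blob(stream, stream.read(4))
        yield key, value


# Round trip through an in-memory "file" (hypothetical sample data).
buffer = io.BytesIO()
write_records(buffer, [("row-0", [1.0, 2.0]), ("row-1", [3.0, 4.0])])
buffer.seek(0)
records = list(read_records(buffer))
```

The point of the sketch is that the file itself knows nothing about the objects: anything that can be turned into bytes and back can be stored as a key or a value.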

Java OOP Review: Inheritance, Hierarchy for package java.lang

Debugging: Using (scala) CUDA in Spark(2) - (ubuntu14.04)

11/10/2015

It was a pain to get JCuda and Scala working on Spark. In case I (or someone else) need to install them again later, I will try my best to recall the most troublesome errors. (Many small bugs are easy to solve just by googling them, so I won't mention those.)

Using (scala) CUDA in Spark(1) -- based on JCuda

11/9/2015

Many people want to leverage CUDA for Scala (machine learning) code, but CUDA doesn't support Scala TAT.
Hope never ends, though! We can always try the following approaches:

Is CPU the new bottleneck? - Tungsten (Spark DataFrame)

11/3/2015

Performance optimization is a never-ending process. Project Tungsten will be the largest change to Spark's execution engine since the project's inception. It aims to substantially improve the memory and CPU efficiency of Spark applications, pushing performance closer to the limits of modern hardware.
Why the CPU, rather than IO, is now the main bottleneck: 1. Hardware has improved. 2. Spark's IO has been optimized. 3. Data formats have improved. 4. Serialization and hashing are CPU-bound bottlenecks.

Three initiatives:

Matrix Operations in Spark MLlib

11/1/2015

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas.
Related Data Types:
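
As a rough illustration of the local matrix model described above (not a list of the actual MLlib types), here is a minimal sketch of a dense matrix stored as one flat array in column-major order, which is how MLlib's local dense matrices lay out their values. The class name `LocalDenseMatrix` and its methods are hypothetical stand-ins written in plain Python, not the MLlib API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalDenseMatrix:
    """Hypothetical stand-in for a local dense matrix in column-major order."""
    num_rows: int
    num_cols: int
    values: List[float]  # column 0 first, then column 1, and so on

    def get(self, i: int, j: int) -> float:
        """Entry (i, j) lives at offset i + j * num_rows in the flat array."""
        return self.values[i + j * self.num_rows]

    def multiply(self, vector: List[float]) -> List[float]:
        """Matrix-vector product, accumulated column by column."""
        if len(vector) != self.num_cols:
            raise ValueError("dimension mismatch")
        result = [0.0] * self.num_rows
        for j, x in enumerate(vector):
            for i in range(self.num_rows):
                result[i] += self.get(i, j) * x
        return result


# The 3x2 matrix [[1, 2], [3, 4], [5, 6]] laid out column by column.
m = LocalDenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
product = m.multiply([1.0, 1.0])
```

In real MLlib code the arithmetic is handed to the underlying linear algebra library; the sketch only shows the storage convention.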

Spark Cluster (and some basic knowledge of Spark)- notes

10/26/2015

A computer cluster (concept) consists of a set of connected computers that work together and can be viewed as a single system. Each node in a cluster is set to perform the same task, controlled and scheduled by software.
(Design & Configuration) In a Beowulf system, the application programs never see the computational nodes (slave computers), but only interact with the "master", which is a specific computer handling the scheduling and management of the slaves.
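
The master/slave arrangement above can be sketched on a single machine, with a thread pool standing in for the slave nodes. This is only a toy model under that assumption: the `Master` class, `run_job`, and `partial_sum` are hypothetical names for illustration, and nothing here reflects how Spark or a Beowulf cluster is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


class Master:
    """Hypothetical master: the only object the application talks to."""

    def __init__(self, num_workers: int):
        # The pool plays the role of the slave nodes; callers never touch it.
        self._workers = ThreadPoolExecutor(max_workers=num_workers)

    def run_job(self, task, partitions):
        """Schedule the same task on every partition and gather results in order."""
        futures = [self._workers.submit(task, part) for part in partitions]
        return [f.result() for f in futures]

    def shutdown(self):
        self._workers.shutdown()


def partial_sum(numbers):
    """The task every worker performs on its own slice of the data."""
    return sum(numbers)


master = Master(num_workers=3)
partials = master.run_job(partial_sum, [[1, 2, 3], [4, 5], [6]])
total = sum(partials)
master.shutdown()
```

The application hands the master a task and the data partitions, and gets results back; which worker ran which partition stays hidden behind the master.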
