ZHIYUAN (ZOE) LIN

Launch EC2 Spark Cluster

12/5/2015

1 Comment

 
Spark maintains official documentation on this, but I find it a bit difficult to follow if you are new to Spark. Here is a simpler guide (at least, simpler for me).


BIDMat - HDFSIO

11/19/2015

0 Comments

 
To fully understand Hadoop/Spark IO, you should first understand "Sequence File" and "Serializable".
I was confused by these for a while, so I decided to write this post.
Java OOP Review: Inheritance, Hierarchy for package java.lang


Debugging: Using (scala) CUDA in Spark(2) - (ubuntu14.04)

11/10/2015

0 Comments

 
It was a pain to get JCuda and Scala working on Spark. In case I (or someone else) need to install them again later, I will try my best to recall the most troublesome errors. (Many small bugs are easy to solve just by googling them, so I won't mention those.)


Using (scala) CUDA in Spark(1) -- based on JCuda

11/9/2015

1 Comment

 
Many people want to leverage CUDA for Scala (machine learning) code, but CUDA doesn't support Scala TAT.
Hope never ends, though! We can always try the following approaches:


Is CPU the new bottleneck? - Tungsten (Spark DataFrame)

11/3/2015

0 Comments

 
Performance optimization is a never-ending process. Project Tungsten will be the largest change to Spark's execution engine since the project's inception. It aims to substantially improve the memory and CPU efficiency of Spark applications, pushing performance closer to the limits of modern hardware.
Why the CPU, rather than IO, is now the main bottleneck: 1. Hardware has improved. 2. Spark's IO has been optimized. 3. Data formats have improved. 4. Serialization and hashing are CPU-bound.
Three initiatives:


Matrix Operations in Spark MLlib

11/1/2015

0 Comments

 
MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. 
Related Data Types:
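
To make the difference between these local data models concrete, here is a minimal sketch in plain Python. It is not the MLlib API: the class names DenseVector, SparseVector and DenseMatrix below are hypothetical stand-ins written for illustration, and the sketch only assumes that a sparse vector keeps parallel index/value lists and that a local dense matrix keeps its entries in one flat list, column by column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DenseVector:
    """Hypothetical stand-in: every entry is stored, zeros included."""
    values: List[float]

    def to_list(self) -> List[float]:
        return list(self.values)


@dataclass
class SparseVector:
    """Hypothetical stand-in: only non-zero entries are stored."""
    size: int
    indices: List[int]   # positions of the non-zero entries
    values: List[float]  # the non-zero entries, in the same order

    def to_list(self) -> List[float]:
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return out


@dataclass
class DenseMatrix:
    """Hypothetical stand-in: one flat list, stored column by column."""
    num_rows: int
    num_cols: int
    values: List[float]

    def get(self, i: int, j: int) -> float:
        # Column j starts at offset j * num_rows in the flat list.
        return self.values[j * self.num_rows + i]


# The same vector (1.0, 0.0, 3.0) in both representations.
dense = DenseVector([1.0, 0.0, 3.0])
sparse = SparseVector(3, [0, 2], [1.0, 3.0])

# The 3x2 matrix [[1, 2], [3, 4], [5, 6]] laid out in column-major order.
matrix = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
```

In MLlib itself, the corresponding local values are created with Vectors.dense, Vectors.sparse and Matrices.dense; the sketch above only mirrors how the data is laid out.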


Spark Cluster (and some basic knowledge of Spark)- notes

10/26/2015

1 Comment

 
A computer cluster (concept) consists of a set of connected computers that work together and can be viewed as a single system. Each node in a cluster is set to perform the same task, controlled and scheduled by software.
(Design & Configuration) In a Beowulf system, the application programs never see the computational nodes (slave computers); they interact only with the "master", a specific computer that handles the scheduling and management of the slaves.


MapReduce VS Dryad

9/7/2015

0 Comments

 
Paper Review:
  • MapReduce: Simplified Data Processing on Large Clusters
  • Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

© 2015 Zhiyuan Lin Reserved