Spark Cluster (and some basic knowledge of Spark) - notes

10/26/2015

A computer cluster consists of a set of connected computers that work together and can be viewed as a single system. Each node in the cluster is set to perform the same task, controlled and scheduled by software.
(Design & Configuration) In a Beowulf system, the application programs never see the computational nodes (slave computers); they only interact with the "master", a specific computer that handles the scheduling and management of the slaves.
But when I tried to figure out the mechanism of "communication between cluster nodes", I found these links helpful:
Spark Documentation (many useful links at its bottom)
Spark Configuration
Cluster Overview
(important but confusing) Glossary, which helps in understanding Spark's INFO output (a small example follows these definitions):
Executor - A process launched for an application that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task - A unit of work that will be sent to one executor.
Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce).
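As a concrete illustration (a minimal sketch, assuming the spark-shell, where sc is already defined): one action spawns one job, the shuffle introduced by reduceByKey splits that job into two stages, and each stage runs one task per partition.

val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)   // 2 partitions => 2 tasks per stage
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // the shuffle splits the job into two stages
counts.collect()                                         // the action that actually spawns the job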
Monitoring - Web UI at http://<driver-node>:4040, which shows:
- scheduler stages & tasks
- RDD sizes & memory usage
- environment info
- running executors info
SparkContext is the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
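For example, in a standalone application a SparkContext is usually built from a SparkConf (a minimal sketch; the app name and the local[2] master URL below are placeholders, and in the spark-shell an sc is created for you automatically):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("notes-app").setMaster("local[2]")  // placeholder name/master
val sc = new SparkContext(conf)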

Create RDDs: (a resilient distributed dataset is a fault-tolerant collection of data elements that can be operated on in parallel)
- Parallelize an existing collection in the driver program
  Parallelized collections are created by calling SparkContext's parallelize on an existing collection (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
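Once created, the distributed dataset can be operated on in parallel, e.g. summing its elements:

distData.reduce((a, b) => a + b)   // => 15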
- Reference a dataset in an external storage system (such as your local file system, HDFS, HBase, Amazon S3, or any data source offering a Hadoop InputFormat )
  The textFile method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.
val distFile = sc.textFile("data.txt") 
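Once created, distFile can be acted on by dataset operations, e.g. adding up the lengths of all the lines:

distFile.map(s => s.length).reduce((a, b) => a + b)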
Matrix Operations:
Tip: to get better performance in Spark, I'd recommend representing your matrix as an RDD of blocks (say 128x128 double arrays) instead of (int, int, double) pairs.
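A minimal sketch of what that could look like (the 128 block size, the (i, j, value) input RDD and the key layout below are my own assumptions for illustration, not a fixed Spark API):

val blockSize = 128                                     // assumed square block size
// assumed input: coordinate entries (rowIndex, colIndex, value)
val entries = sc.parallelize(Seq((0, 0, 1.0), (130, 5, 2.0), (0, 200, 3.0)))
val blocks = entries
  .map { case (i, j, v) =>
    ((i / blockSize, j / blockSize), (i % blockSize, j % blockSize, v))   // (block coords, in-block coords)
  }
  .groupByKey()
  .mapValues { cells =>
    val block = new Array[Double](blockSize * blockSize)                  // dense row-major block
    cells.foreach { case (bi, bj, v) => block(bi * blockSize + bj) = v }
    block
  }
// blocks: RDD[((Int, Int), Array[Double])] -- one record per non-empty block

Block-wise operations then shuffle whole 128x128 arrays instead of individual (int, int, double) entries, which is where the performance gain comes from.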