(Design & Configuration) In a Beowulf system, application programs never see the computational nodes (slave computers); they interact only with the "master", a specific computer that handles the scheduling and management of the slaves.
Spark Documentation (many useful links at its bottom)
Executor - A process launched for an application that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task - A unit of work that will be sent to one executor.
Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); see the sketch below.
(Spark web UI, default port 4040 -- scheduler stages & tasks -- RDD sizes & memory usage -- environment info -- running executors info)
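To make these terms concrete, here is a minimal sketch (assuming a live SparkContext sc and a local data.txt): the single action spawns one job, and the shuffle introduced by reduceByKey splits that job into two stages, each made of per-partition tasks.
val counts = sc.textFile("data.txt")   // stage 1: tasks that read and map each partition
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                  // shuffle boundary: starts stage 2
counts.collect()                       // action: spawns the job (visible in the web UI)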
- Parallelize an existing collection in the driver program
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
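Once created, the distributed dataset can be operated on in parallel; for example, summing its elements:
val sum = distData.reduce((a, b) => a + b)   // runs as parallel tasks on the executors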
- Reference a dataset in an external storage system (e.g. a shared filesystem, HDFS, or S3)
The textFile method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.
val distFile = sc.textFile("data.txt")
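As with parallelized collections, distFile can then be acted on by dataset operations; for example, adding up the sizes of all the lines:
val totalLength = distFile.map(s => s.length).reduce((a, b) => a + b)   // map and reduce run in parallel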
Tip: to get better performance in Spark, I'd recommend representing your matrix as an RDD of blocks (say, 128x128 double arrays) instead of as (int, int, double) pairs.
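A minimal sketch of that block layout (the names, dimensions, and zero-filled blocks are illustrative assumptions, not a fixed API; again assumes a live SparkContext sc):
import org.apache.spark.rdd.RDD
val n  = 1024   // matrix dimension (illustrative)
val bs = 128    // block size, as suggested above
// Key = (blockRow, blockCol); value = one dense bs x bs block stored row-major.
// Real code would load or compute each block rather than fill it with zeros.
val blocks: RDD[((Int, Int), Array[Double])] = sc.parallelize(
  for (bi <- 0 until n / bs; bj <- 0 until n / bs)
    yield ((bi, bj), Array.fill(bs * bs)(0.0)))
The point of the layout is that far fewer, larger records mean less per-element object and shuffle overhead, so more of the runtime goes to the actual arithmetic.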