- Make sure the input.txt file with example data is present at the given path
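- The exact file contents do not matter much; any text where the first word of each line acts as its type will work. A purely hypothetical example (log-style lines):
INFO starting application
WARN low disk space
INFO request handled
ERROR request failed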
- Start the Spark REPL by executing the commands below. Validate in the Spark UI that it started successfully.
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home
$ spark-shell
- Apply some basic transformations - count lines by their type (the first word of each line)
val input = sc.textFile("input.txt")
val tokenized = input.map(line => line.split(" ")).filter(words => words.size > 1)
val counts = tokenized.map(words => (words(0), 1)).reduceByKey { (a,b) => a + b }
- So far nothing has been executed; we have only defined the DAG (each RDD keeps a reference to its parent). These RDDs simply store the metadata needed to compute them later. To trigger the computation, let's call collect() to deliver the results to the driver.
counts.collect()
- In the Spark UI (port 4040 by default), look at the following:
- Jobs tab - completed jobs, number of tasks
- Job details tab - DAG visualization, completed stages (how many tasks), shuffle read/write
- Stage details tab - event timeline
<aside>
💡 At this stage, you can see that each Spark action corresponds to the concept of a job. Each job consists of one or more stages, each of which groups tasks that can be executed on a single node without the need for shuffling.
Under the hood, Spark creates a physical execution plan, which can be understood as a “recipe” of the exact steps needed to transform the data into its final form.
</aside>
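- As a quick sketch of this action-to-job mapping, you can run another action on the same RDD and watch a new job appear in the Jobs tab (labelling the job via sc.setJobDescription is optional and only makes it easier to spot):
sc.setJobDescription("count lines by type")
counts.count()  // each action call shows up as a separate job in the Jobs tab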
- Sometimes it might be difficult to access the Spark UI for debugging purposes. Instead, we can inspect the lineage of each RDD by using the toDebugString method.
input.toDebugString
// ...
counts.toDebugString
- We can inspect the number of partitions (and therefore tasks) of each RDD by executing
input.getNumPartitions
input.glom().collect()
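- The number of tasks in a stage follows the number of partitions. As a rough sketch (the partition count of 4 is arbitrary), you can change it with the standard repartition method and check again:
val repartitioned = input.repartition(4)  // triggers a shuffle into 4 partitions
repartitioned.getNumPartitions            // now 4
repartitioned.glom().collect()            // one array of lines per partition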
- The problem with this is that when we call counts.collect() one more time, the whole graph will be re-executed again. (Not in this particular case, though, because the output of shuffle operations such as reduceByKey is automatically cached; see the sketch below.)
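- A quick way to observe this (a sketch, reusing the counts RDD defined above): run the same action again and inspect the job details.
counts.collect()  // run the same action one more time
// in the job's DAG visualization the shuffle-map stage should appear as skipped,
// because the shuffle files already produced by reduceByKey are reused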
- To avoid expensive recomputation, we can explicitly store intermediate results in the workers' memory. Various configurations are available, such as whether to use disk or memory, and whether to replicate or serialize the data.
Let's store the whole counts RDD in memory.
counts.cache()
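- Note that cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). A sketch of choosing a different storage level (an RDD's storage level can only be set once, so this is an alternative to the cache() call above, not a follow-up):
import org.apache.spark.storage.StorageLevel
counts.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spills to disk when memory runs out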
- In the Spark UI Storage tab there is still no information about the cached RDD, because caching is lazy. We need to call an action to actually materialize it.
counts.collect()
- Now, after revisiting the Storage tab, you can find the following information:
- how many partitions are cached,
- size in memory and on disk (useful for debugging when data spills to disk),
- breakdown by node
In summary:
- when an action is triggered, the defined DAG is turned into tasks that are created and dispatched to an internal scheduler,
- stages in the DAG can depend on each other, so they will be executed in a specific order,
- a physical stage launches tasks that each perform the same steps, but on specific partitions of the data.