Tuning Spark Jobs

I recently got into a discussion about how to tune Spark jobs.

This led me to some learnings about dynamic allocation and related resource settings.
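As a sketch of what dynamic allocation involves: Spark can grow and shrink the number of executors at runtime when it is enabled at submit time. Roughly like this (the executor counts and jar name below are illustrative placeholders, not recommendations; the external shuffle service must be running on the workers for executors to be removed safely):

```shell
# Submit a job with dynamic allocation enabled (values are illustrative).
# spark.shuffle.service.enabled is required so executors can be removed
# without losing their shuffle data.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  my-app.jar
```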

Some interesting links:


kafka-spark integration

I am trying to get to a point where I will have Kafka + Spark Streaming running locally on my machine.

There are several things to figure out along the way:

  • Kafka
  • Scala
  • Spark
  • Spark Streaming
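A rough sketch of what the local setup will probably involve (the Kafka scripts are run from the Kafka install directory; the connector version and topic name are guesses at this point):

```shell
# Start a local Kafka broker (from the Kafka install directory):
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &

# Create a topic to stream from (topic name is a placeholder):
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic test

# Run a Spark Streaming app against it, pulling in the Kafka connector
# (version matches the Spark 2.0-era setup assumed here):
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 \
  my_streaming_app.py
```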



EdX course ‘Introduction to Apache Spark’ resources

I am spending some time learning Spark. As I make progress, I think it would be a good idea to keep track of some resources I have found useful.


Code Repo: 

Spark App Development (Python)

In a previous post, I wrote about the Spark app development process for Scala.

In this post, I have provided examples of how to develop a Spark app using the pyspark library.

For Python 3.5

  • For interactive use, I have to run ‘export PYSPARK_PYTHON=python3’ before starting pyspark
  • For standalone programs in local mode, I first have to add the following in the script:
    • import os, then os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
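The two setups above amount to the following (a sketch; it assumes ‘python3’ is on your PATH and resolves to the Python 3.5 interpreter):

```shell
# Interactive use: point pyspark at Python 3 before launching it.
export PYSPARK_PYTHON=python3
# ...then start the shell:
#   pyspark

# For standalone scripts, the same variable is set from inside Python
# before the SparkContext is created:
#   import os
#   os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
```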



Getting Started with Spark on Windows 10 (Part 2)

After the initial start detailed in Part 1 of Getting Started with Spark on Windows 10, I started running into some issues.

To remove permission issues from the equation, I unzipped the Spark package into the ‘D:\’ drive this time. This allowed me to analyze some issues thoroughly. Here are some observations:


  • Make sure ‘winutils’ is properly set up. Otherwise an error gets thrown when starting pyspark / spark-shell
    • the error says it ‘could not locate winutils’ in the Hadoop binaries.
  • I noticed a few files/folders getting auto-generated:
    1. A file named ‘derby’
    2. A ‘metastore_db’ folder
    3. A ‘tmp’ folder
  • The ‘derby’ file and the ‘metastore_db’ folder seem to be created in whatever directory the Spark app runs from.
  • The ‘tmp’ folder has to be given full permissions.
    • Note: I noticed the ‘tmp’ folder getting created in my ‘D:\’ drive. Earlier I had this ‘tmp’ folder in my ‘C:\’ drive as well. I need to follow up more on this.
      • Do cross-check whether there are multiple ‘tmp’ folders and ensure the permissions are set up properly.
    • If the folder doesn’t have 777 permissions, you hit the following error when running either ‘pyspark’ or ‘spark-shell’:
      • java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
  • After setting the permissions properly on the ‘tmp’ folder, I hit another issue:
    • java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:D://spark-warehouse
    • This issue is Windows-specific. It’s discussed in the following two threads:


So that was it. I finally had a setup working properly. Phew!
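For reference, the ‘tmp\hive’ permissions themselves can be set with winutils (the ‘D:\hadoop’ path below is an assumption; use wherever your winutils.exe actually lives):

```shell
:: Windows cmd sketch: grant full permissions on the Hive scratch dir.
:: D:\hadoop is an assumed HADOOP_HOME; adjust to your own layout.
D:\hadoop\bin\winutils.exe chmod -R 777 D:\tmp\hive
```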

Workaround (passing ‘spark.sql.warehouse.dir’ explicitly):



  • D:\>pyspark --conf spark.sql.warehouse.dir=file:///D:/tmp




  • D:\>spark-shell --conf spark.sql.warehouse.dir=file:///D:/tmp




Spark App Development (Scala)

The Spark app development process is pretty interesting.

I am jotting down some notes as I make progress.


  • The easiest way is to develop your app in IntelliJ IDEA and run it either (a) from IntelliJ IDEA itself or (b) from the sbt console using ‘run’
    • for this, we must have the ‘master’ URL set to ‘local’
    • else you will get an error: “org.apache.spark.SparkException: A master URL must be set in your configuration”
  • sbt package
    • the official Spark quick start guide actually has an example of this, whereby you don’t have to set the ‘master’ URL in the app itself.
    • instead, you specify the master when doing spark-submit
    • Note:
      • if you have the master set in code, then --master in spark-submit doesn’t take effect.
    • Example:
      • sbt assembly
      • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
  • sbt assembly
    • this is similar to the workflow for ‘sbt package’
    • in the build.sbt file, there is a keyword “Provided” which has ramifications when one uses ‘sbt assembly’.
      • //libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % Provided
        //libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % Provided
    • I need to follow up more on this ‘Provided’ keyword… (my current understanding: ‘Provided’ dependencies are available at compile time but are excluded from the assembly jar, since the cluster supplies Spark at runtime)
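Putting the ‘sbt package’ / ‘sbt assembly’ notes together (the jar names below are what sbt produced in my project; yours will differ):

```shell
# Thin jar: dependencies are not bundled. Fine for local mode, where the
# master is given on the command line rather than hard-coded in the app.
sbt package
spark-submit --master "local[*]" \
  target/scala-2.11/helloredmondapp_2.11-1.0.jar

# Fat jar via sbt-assembly: bundles dependencies except those marked
# Provided in build.sbt; submit it against a real master.
sbt assembly
spark-submit --master spark://spark-host:7077 \
  target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
```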


This exercise was very useful for getting additional insights:

  • Step 1: Run Spark Standalone.
    • From tools/spark, run ./sbin/start-master.sh
    • Run ./sbin/start-slave.sh spark://spark-host:7077
    • At this point you should have Spark Standalone up and running
    • Open Standalone’s Web UI, available at http://localhost:8080. Confirm you have a node connected in the Workers section at the top
  • Step 2: Start spark-shell (which is itself a Scala app) and attach it to the master.
    • spark-shell --master spark://spark-host:7077
  • Step 3: Actually submit the application to the same cluster
    • sbt assembly
    • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
    • Note how the app is now in WAITING state. It’s waiting because the cores have already been allocated to the spark-shell from Step 2.
  • Step 4: Kill the spark-shell that was started in Step 2. You will notice that the WAITING app now starts RUNNING.
    • This is because it now has the resources to run.
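The four steps above as a single transcript (run from the Spark home directory; the master URL and jar path are the ones from my setup):

```shell
# Step 1: bring up a standalone master and one worker.
./sbin/start-master.sh
./sbin/start-slave.sh spark://spark-host:7077
# Web UI: http://localhost:8080 (check the Workers section).

# Step 2: attach a spark-shell; it grabs the available cores.
spark-shell --master spark://spark-host:7077

# Step 3 (separate terminal): submit the app; it sits in WAITING
# because the shell from Step 2 holds the cores.
spark-submit --master spark://spark-host:7077 \
  target/scala-2.11/HelloRedmondApp-assembly-1.0.jar

# Step 4: exit the spark-shell; the WAITING app moves to RUNNING.
```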