Getting Started With Scala

The other day I was discussing how to get a gentle introduction to the Scala language. Here is what I prescribe.

Sufficient Scala:
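As a taste of "sufficient Scala", here is a tiny self-contained sketch covering immutable values, functions, case classes, and pattern matching (the names here are mine, purely for illustration):

```scala
// A few "sufficient Scala" basics: immutable values, pure functions,
// case classes, and pattern matching.
val greeting: String = "Hello, Scala"   // immutable value

def square(x: Int): Int = x * x         // a pure function

case class Point(x: Int, y: Int)        // value type; equality comes for free

// pattern matching with a guard
def quadrant(p: Point): String = p match {
  case Point(0, 0)                   => "origin"
  case Point(x, y) if x > 0 && y > 0 => "upper right"
  case _                             => "elsewhere"
}
```

Pasting this into the Scala REPL (or a worksheet in IntelliJ) is a quick way to get a feel for the syntax.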

Additional Resources:


Java, Maven, Scala, SBT Concepts

I am dipping my toes into Maven, trying to make sense of how it fits in with the IDE, command-line Maven, POM files, and so on.


  • IntelliJ has built-in support for both Maven and SBT, so as long as we are not using mvn and sbt from the command line we should be good.

java fundamental concepts:


base scala in intellij:

scala with SBT:
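For reference, a minimal build.sbt for a plain Scala project looks something like the sketch below (the project name and version numbers are illustrative, not prescriptive):

```scala
// build.sbt -- minimal sbt build definition (a sketch; versions are illustrative)
name := "hello-scala"
version := "1.0"
scalaVersion := "2.11.8"

// a test-only dependency: pulled in for `sbt test` but not packaged with the app
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"
```

With this file at the project root, IntelliJ can import the project directly as an SBT project, and `sbt run` / `sbt test` work from the sbt console.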

Spark App Development (Scala)

The Spark app development process is pretty interesting.

I am jotting down some notes as I make progress.


  • The easiest way is to develop your app in IntelliJ IDEA and run it either (a) from IntelliJ IDEA itself or (b) from the sbt console using ‘run’
    • for this, we must have the ‘master’ URL set to ‘local’ (or ‘local[*]’) in the app itself
    • otherwise you will get an error: “org.apache.spark.SparkException: A master URL must be set in your configuration”
  • sbt package
    • the official Spark quick start guide actually has an example of this, where you don’t have to set the ‘master’ URL in the app itself.
    • instead, you specify the master when doing spark-submit
    • Note:
      • if you have the master set in code, then --master in spark-submit doesn’t take effect.
    • Example:
      • sbt assembly
      • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
  • sbt assembly
    • this is similar to the workflow for ‘sbt package’
    • in the build.sbt file, there is a keyword “Provided” which has ramifications when one uses ‘sbt assembly’.
      • //libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % Provided
        //libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % Provided
    • marking a dependency as Provided tells sbt-assembly to leave it out of the fat jar; the Spark cluster already supplies those classes at runtime, so this keeps the assembly small and avoids version clashes with the cluster’s Spark.
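The workflows above can be sketched as a minimal Spark app. This is only a sketch: the object name, the `--local` flag, and the sample computation are mine, and it assumes spark-sql is on the classpath (Provided, for assembly builds). The key point is that the master is set in code only for local runs, so that `--master` passed to spark-submit can take effect on a cluster:

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark app sketch. The object name, the "--local" flag, and the
// sample computation are illustrative, not from the original post.
object HelloRedmondApp {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("HelloRedmondApp")

    // For running from IntelliJ or `sbt run`, set a local master in code.
    // For spark-submit, leave the master unset here and pass --master instead;
    // a master set in code would override the spark-submit flag.
    val spark =
      if (args.contains("--local")) builder.master("local[*]").getOrCreate()
      else builder.getOrCreate()

    val count = spark.range(100).count()
    println(s"count = $count")
    spark.stop()
  }
}
```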


To get additional insights, this exercise was very useful:

  • Step 1: Run Spark Standalone.
    • From tools/spark, run ./sbin/
    • Run ./sbin/ spark://spark-host:7077
    • At this point you should have Spark Standalone up and running
    • Open Standalone’s Web UI, available at http://localhost:8080. Confirm you have a node connected in the Workers section at the top.
  • Step 2: Start spark-shell (which is itself a Scala app) and attach it to the master.
    • spark-shell --master spark://spark-host:7077
  • Step 3: Actually submit the application to the same cluster
    • sbt assembly
    • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
    • Note how the app is now in WAITING state. It’s waiting because the cores have already been allocated to the spark-shell from Step 2.
  • Step 4: Kill the spark-shell that was started in Step 2. You will notice that the WAITING app now starts RUNNING.
    • this is because it now has the resources to run.






Running Tests in Scala IntelliJ IDEA

So I set up my first test suite in IntelliJ IDEA to test out some Scala code.

To run unit tests, I used the FunSuite style from ScalaTest.

import org.scalatest.FunSuite



[1]  I modified the build.sbt to include the line:

libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"

I reloaded the project. However, for the changes to take effect I had to terminate the running sbt prompt and open a new one.


[2]  I did not use JUnit.
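For reference, a minimal FunSuite suite looks roughly like the sketch below. The class name and the individual tests are mine, just for illustration; it assumes ScalaTest 3.0.x is on the test classpath, as set up in [1]:

```scala
import org.scalatest.FunSuite

// A minimal FunSuite example (class name and tests are illustrative).
// Run via `sbt test` or IntelliJ's ScalaTest runner.
class ArithmeticSuite extends FunSuite {

  test("addition works") {
    assert(1 + 1 === 2)
  }

  test("string interpolation") {
    val name = "Scala"
    assert(s"Hello, $name" === "Hello, Scala")
  }
}
```

Each `test("...")` block is an independent test case; `===` gives a more informative failure message than plain `==`.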





Getting Started with Spark on Windows 10 (Part 1)

The references below helped me get started with Spark on Windows. I am listing a few additional tips based on my experience:


  • Added the following as system environment variables for SBT, Spark, Scala, Hadoop and Java:
    • (screenshot: sbt_spark_scala)
    • (screenshot: hadoop_java)
  • Added the following to the system PATH environment variable:
    • (screenshot: systempath)
    • Note that the Java path is automatically picked up. I believe that’s because there is a Java path entry already present inside PATH [C:\ProgramData\Oracle\Java\javapath]
  • spark-shell on Cygwin gives an error on launch itself.
    • The error looks something like: [: too many arguments
    • I think Spark on Cygwin has not been fully tried out yet.
  • After getting spark-shell to launch in the Cmd window, there is another weird stack trace that I hit.
  • There is a weird stack dump when exiting the spark-shell (i.e. on hitting :q within the spark-shell).
    • It seems to be non-deterministic. I am not aware of the root cause of this issue.