Spark App Development (Python)

In a previous post, I wrote about the spark app development process for Scala.

In this post, I provide an example of how to develop a spark app using the pyspark library.

For Python 3.5

  • For interactive use, I have to do 'export PYSPARK_PYTHON=python3' before starting pyspark
  • For standalone programs in local mode, I first have to add the following near the top of the script:
    • import os
    • os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"


Getting Started with Spark on Windows 10 (Part 2)

After the initial start detailed in Part 1  of Getting Started with Spark on Windows 10, I started running into some issues.

To remove permission issues from the equation, I unzipped the spark package into the 'D:\' drive this time. This allowed me to analyze some issues more thoroughly.

Observations:

  • Make sure 'winutils' is properly set up; otherwise an error gets thrown when starting pyspark / spark-shell
    • the error says it 'could not locate winutils' in the hadoop binaries.
  • I noticed a few files/folders getting auto generated :
    1. File named ‘derby’
    2. ‘metastore_db’ folder
    3. ‘tmp’ folder
  • The 'derby' file and the 'metastore_db' folder seem to get created in whatever directory the spark app is launched from.
  • The ‘tmp’ folder has to be given full permissions.
    • Note:  I noticed the ‘tmp’ folder getting created in my ‘D:\’ drive. Earlier I had this ‘tmp’ folder in my ‘C:\’ drive as well.  I need to follow up more on this.
      • do cross check if there are multiple ‘tmp’ folders and ensure the permissions are set up properly
    • If the folder doesn't have 777 permissions, you will hit the following error when running either 'pyspark' or 'spark-shell':
      • java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
  • After setting the permissions properly on the 'tmp' folder, I hit another issue:
    • java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:D://spark-warehouse
    • This issue is Windows specific. It has been discussed in a couple of threads on the topic.

 

So that was it. I was finally able to get a proper setup going. Phew!

Environment Variable:

(screenshot: environment variable settings)

Pyspark:

  • D:\>pyspark --conf spark.sql.warehouse.dir=file:///D:/tmp

(screenshot: pyspark starting up)

 

Spark-Shell:

  • D:\>spark-shell --conf spark.sql.warehouse.dir=file:///D:/tmp

(screenshot: spark-shell starting up)
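
The warehouse location can also be set from inside the application rather than on the command line. Below is a minimal sketch in Scala against the Spark 2.0 API (not code from this setup); it assumes 'D:\tmp' exists and has the right permissions, and the object/app name is just a placeholder:

    import org.apache.spark.sql.SparkSession

    object WarehouseDirExample {
      def main(args: Array[String]): Unit = {
        // Programmatic equivalent of passing
        //   --conf spark.sql.warehouse.dir=file:///D:/tmp
        // to pyspark / spark-shell. The setting has to go on the builder,
        // before the session is created.
        val spark = SparkSession.builder()
          .appName("WarehouseDirExample")
          .master("local[*]")
          .config("spark.sql.warehouse.dir", "file:///D:/tmp")
          .getOrCreate()

        println(spark.conf.get("spark.sql.warehouse.dir"))

        spark.stop()
      }
    }

Either way works; passing --conf on the command line just saves you from rebuilding the app when the path changes.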


Spark App Development (Scala)

The spark app development process is pretty interesting.

I am jotting down some notes as I make progress in the process.

Notes:

  • The easiest way is to develop your app in IntelliJ IDEA and run it either (a) from IntelliJ IDEA itself or (b) from the sbt console using 'run'
    • for this, the 'master' URL must be set to 'local' in the code (see the sketch after this list)
    • else you will get an error: "org.apache.spark.SparkException: A master URL must be set in your configuration"
  • sbt package
    • the official spark quick start guide actually has an example of this, where you don't have to set the 'master' URL in the app itself.
    • instead, you specify the master when doing spark submit
    • Note:
      • if you have the master set in code, then --master passed to spark-submit doesn't take effect.
    • Example:
      • sbt assembly
      • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
  • sbt assembly
    • this is similar to the workflow for ‘sbt package’
    • in the build.sbt file, there is a keyword "Provided" which has ramifications when one uses 'sbt assembly'.
      • //libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % Provided
        //libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % Provided
    • My understanding is that Provided keeps those jars out of the fat jar built by 'sbt assembly', since spark-submit supplies the Spark jars at runtime anyway. I need to follow up more on this Provided keyword… (see the build.sbt sketch after this list)
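
Here is a rough build.sbt sketch for an app like this. The name and version match the assembly jar mentioned above; the Scala patch version is a guess, and it assumes the sbt-assembly plugin is enabled in project/plugins.sbt:

    // build.sbt (sketch)
    name := "HelloRedmondApp"
    version := "1.0"
    scalaVersion := "2.11.8"

    // Marking the Spark dependencies as Provided keeps them out of the fat jar
    // produced by 'sbt assembly'; spark-submit puts the Spark jars on the
    // classpath at runtime anyway. Drop 'Provided' to have them bundled instead.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % Provided
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.0" % Provided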

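And a minimal sketch of a Spark app with the master set in code, suitable for the IntelliJ IDEA / sbt 'run' workflow; the object name and body are placeholders, not the actual HelloRedmondApp (for the spark-submit workflow you would leave out .master(...) and pass --master instead):

    import org.apache.spark.sql.SparkSession

    object HelloApp {
      def main(args: Array[String]): Unit = {
        // Hard-coding the master lets the app run directly from IntelliJ IDEA or
        // from the sbt console with 'run'. Note that once it is set here,
        // --master passed to spark-submit does not take effect.
        val spark = SparkSession.builder()
          .appName("HelloApp")
          .master("local[*]")
          .getOrCreate()

        // Trivial sanity check.
        println(spark.sparkContext.parallelize(1 to 100).count())

        spark.stop()
      }
    }
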
Exercise:

This exercise was very useful for getting additional insights.

  • Step 1: Run Spark Standalone.
    • From tools/spark, run ./sbin/start-master.sh
    • Run ./sbin/start-slave.sh spark://spark-host:7077
    • At this point you should have Spark Standalone up and running
    • Open the Standalone Web UI at http://localhost:8080. Confirm you have a node connected in the Workers section at the top
  • Step 2: Start spark-shell (which is itself a Scala app) and attach it to the master.
    • spark-shell --master spark://spark-host:7077
  • Step 3: Actually submit the application to the same cluster
    • sbt assembly
    • spark-submit --master spark://spark-host:7077 target/scala-2.11/HelloRedmondApp-assembly-1.0.jar
    • Note how the app is now in WAITING state. It's waiting because the cores have already been allocated to the spark-shell from Step 2.
  • Step 4: Kill the spark-shell that was started in Step 2. You will notice that the WAITING app now starts RUNNING.
    • this is because it now has the resources to run.

 

(screenshot: Spark Standalone web UI)

 

Code:

scalatrialapps

Getting Started with Spark on Windows 10 (Part 1)

The references below helped me get started with spark on Windows. I am listing a few additional tips based on my experience:

Tips:

  • Added the following as system environment variables for Sbt, Spark, Scala, Hadoop and Java.
    • (screenshot: environment variables for sbt, Spark and Scala)
    • (screenshot: environment variables for Hadoop and Java)
  • Added the following to the system PATH environment variable:
    • (screenshot: system PATH entries)
    • Note the Java path is automatically picked up. I believe that's because a Java path entry is already present inside PATH [C:\ProgramData\Oracle\Java\javapath]
  • spark-shell on Cygwin gives an error on launch itself.
    • The error looks something like '[: too many arguments'
    • I think Spark on Cygwin has not been fully tried out yet.
  • After getting spark-shell to launch in the Cmd window, I hit another weird issue: there is a stack dump that happens when exiting the spark-shell (i.e. on hitting :q within the spark shell)
    • It seems to be non-deterministic. I am not aware of the root cause of this issue.

References: