Tuning Spark Jobs

I recently got into a discussion about how to tune Spark jobs.

That led me to read up on dynamic allocation and related settings.

Some interesting links:


ML Algos in R

I have been trying to understand how the ML algorithms in R fit together and compare with each other.

#  R                                                   RevoScaleR  MML                   Comment
1  lm                                                  rxLinMod    -                     Linear models
2  glm                                                 rxGlm       -                     Linear models
3  glm w/ binomial family and the logit link function  rxLogit     rxLogisticRegression  Logistic regression
4  rpart                                               rxDTree     -                     Decision Trees implementations
5  gbm                                                 rxBTrees    rxFastTrees           Boosted Decision Tree implementations
6  -                                                   rxDForest   rxFastForest
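The first three rows can be tried out with base R alone. The sketch below only covers the "R" column of the table (the rx* functions need RevoScaleR or MicrosoftML installed), and it uses the built-in mtcars data set and arbitrary formulas purely as stand-ins; none of that comes from the original notes.

```r
# Base-R column of the table only; mtcars ships with R and is an
# illustrative stand-in for real data (an assumption, not from the notes).
data(mtcars)

# Row 1: ordinary linear model
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Row 2: generalized linear model; with the Gaussian family and identity
# link it fits the same model as lm()
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())

# Row 3: glm with the binomial family and the logit link function,
# i.e. logistic regression on the 0/1 column `am`
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

print(coef(fit_lm))
print(coef(fit_glm))
print(coef(fit_logit))
```

Comparing the first two sets of coefficients is a quick way to see why rows 1 and 2 both carry the "Linear models" comment.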



# Title Links
1 Fitting Logistic Regression Models https://msdn.microsoft.com/en-us/microsoft-r/scaler-user-guide-logistic-regression
2 Generalized Linear Models https://msdn.microsoft.com/en-us/microsoft-r/scaler-user-guide-generalized-linear-mode
3 rxDTree(): a new type of tree algorithm for big data http://blog.revolutionanalytics.com/2013/07/rxdtree-a-new-type-of-tree-algorithm.html
4 A first look at rxBTrees http://blog.revolutionanalytics.com/2015/03/a-first-look-at-rxbtrees.html
5 A First Look at rxDForest() http://blog.revolutionanalytics.com/2014/01/a-first-look-at-rxdforest.html


Details on running R scripts

  • trace(functionName, edit=TRUE), then write browser() where you want execution to break
  • source('~/scripts/trial3_criteo_ensembles_ag.R', echo=TRUE)
  • Rscript -e ".libPaths()"
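As a rough illustration of the source() call above, here is a minimal sketch that writes a throwaway script to a temporary file and sources it with echo=TRUE. The temporary script and its contents are hypothetical; the trace()/browser() steps are left as comments because they only make sense in an interactive session.

```r
# Write a tiny throwaway script so the example is self-contained.
# The file and its two lines are hypothetical stand-ins for a real script.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- 1:10",
  "total <- sum(x)"
), script_path)

# echo = TRUE prints each expression before it is evaluated;
# by default the script runs in the global environment
source(script_path, echo = TRUE)
print(total)

# In an interactive session one could then instrument a function:
#   trace(functionName, edit = TRUE)  # opens an editor; add browser() there
#   untrace(functionName)             # removes the instrumentation again

unlink(script_path)
```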

Other functions in R I didn’t know:

  • class(score)
  • names(score)
  • head(criteoTest)
  • class(criteoTest)
  • rxGetVarInfo(criteoTest)
  • warnings()
  • sapply(score, class)
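A short sketch of what these calls return, using the built-in iris data frame as a stand-in for criteoTest (an assumption for illustration only). rxGetVarInfo() is left out because it needs RevoScaleR.

```r
# iris ships with R; here it stands in for criteoTest (an assumption).
df <- iris

print(class(df))    # class of the object itself
print(names(df))    # column names
print(head(df, 3))  # first few rows

# class of every column, returned as a named character vector
col_classes <- sapply(df, class)
print(col_classes)

# warnings() is mainly useful interactively: it shows the warnings
# collected from the last top-level call, if any.
```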

The initialization sequence in R

Here are some notes from a first toe-dip into R.


  • Use .libPaths() to get the path where the libraries are located:
    • > .libPaths()  -> this is in RGui
      [1] "C:/Users/agoswami/Documents/R/win-library/3.3"
      [2] "C:/Program Files/R/R-3.3.0/library"


  • One can look for the Rprofile file like so:
    • $ find . -name '*Rprofile' -type f 2>/dev/null
      ./Microsoft/MRO/R-3.2.5/library/base/R/Rprofile -> this is MRO (Microsoft R Open)
      ./Microsoft/MRO-for-RRE/8.0/R-3.2.2/library/base/R/Rprofile -> this is MRS (Microsoft R Server)
      ./Microsoft SQL Server/130/R_SERVER/library/base/R/Rprofile -> this is MRS (Microsoft R Server) through SQL install
      ./R/R-3.3.0/library/base/R/Rprofile  -> this is a regular R install
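The same two lookups can also be done from inside R itself. This is only a sketch: it assumes the running installation keeps its base Rprofile under the base package directory, as in the find output above, and the variable names are made up for illustration.

```r
# Library search path of the running R session
lib_paths <- .libPaths()
print(lib_paths)

# Path to the Rprofile shipped with the base package;
# system.file() returns "" if the file is not found
profile_path <- system.file("R", "Rprofile", package = "base")
print(profile_path)

# Site-wide startup file location, relative to this installation's R home
site_profile <- file.path(R.home("etc"), "Rprofile.site")
print(site_profile)
```

Running this under each installation (MRO, MRS, regular R) should make it clear which Rprofile that particular R picks up.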