Tuning Spark Jobs

I recently got into a discussion about how to tune Spark jobs.

That led me to read up on dynamic allocation and related settings.

Some interesting links:



ML Algos in R

I have been trying to understand how ML algos in R fit together / compare with each other.

#  R                                       RevoScaleR  MML                   Comment
1  lm                                      rxLinMod    -                     Linear models
2  glm                                     rxGlm       -                     Generalized linear models
3  glm w/ binomial family and logit link  rxLogit     rxLogisticRegression  Logistic regression
4  rpart                                   rxDTree     -                     Decision tree implementations
5  gbm                                     rxBTrees    rxFastTrees           Boosted decision tree implementations
6  -                                       rxDForest   rxFastForest



# Title Links
1 Fitting Logistic Regression Models https://msdn.microsoft.com/en-us/microsoft-r/scaler-user-guide-logistic-regression
2 Generalized Linear Models https://msdn.microsoft.com/en-us/microsoft-r/scaler-user-guide-generalized-linear-mode
3 rxDTree(): a new type of tree algorithm for big data http://blog.revolutionanalytics.com/2013/07/rxdtree-a-new-type-of-tree-algorithm.html
4 A first look at rxBTrees http://blog.revolutionanalytics.com/2015/03/a-first-look-at-rxbtrees.html
5 A First Look at rxDForest() http://blog.revolutionanalytics.com/2014/01/a-first-look-at-rxdforest.html


Using SSH Keys on Cloud Platforms


  • openssl.exe req -x509 -nodes -days 365 -newkey rsa:2048 -keyout myPrivateKey.key -out myCert.pem
    • We will mostly use the .key file
    • The .pem file is only needed for Classic deployments. Typically we won't use this.


  • Look up the use of req: https://linux.die.net/man/1/req
    • The req command primarily creates and processes certificate requests. With -x509 it writes a self-signed certificate instead of a request, which is why the output of req here is a certificate (myCert.pem).
    • But we are interested in the private key (myPrivateKey.key), hence the -keyout flag.
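To see for myself which file holds what, here is a minimal sketch of the same command run non-interactively. The -subj value and the throwaway directory are my own assumptions for illustration; they are not part of the original command.

```shell
# Work in a throwaway directory so no real key or certificate is overwritten.
workdir="$(mktemp -d)"

# Same flags as the command above. -subj (an assumed, illustrative subject)
# skips the interactive prompts for country, organization, and so on.
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout "$workdir/myPrivateKey.key" -out "$workdir/myCert.pem" \
  -subj "/CN=example.invalid" 2>/dev/null

# The -out file is the self-signed certificate: x509 can read its subject.
cert_subject="$(openssl x509 -in "$workdir/myCert.pem" -noout -subject)"
echo "$cert_subject"

# The -keyout file is the unencrypted private key (-nodes means no passphrase).
key_check="$(openssl rsa -in "$workdir/myPrivateKey.key" -check -noout 2>&1)"
echo "$key_check"
```

Without -subj, openssl asks for the certificate fields one by one, which is fine when running it by hand.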




  • In AWS, the private key is saved in a .pem file. You just use the .pem file to connect to the instances.
    • I had assumed the .pem extension was for certificates, not for keys. In fact PEM is just a text encoding that can hold either.
    • This was one of my confusions, because AWS saves the key in the .pem file.
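One way to clear up this kind of confusion is to look at the first line of the file, since a PEM file starts with a BEGIN header naming what it contains. The sketch below is only an illustration: the helper name pem_kind and the sample files are hypothetical, and it only distinguishes the two cases discussed here.

```shell
# pem_kind FILE: print a rough label based on the PEM header line.
pem_kind() {
  case "$(head -n 1 "$1")" in
    *"PRIVATE KEY-----")       echo "private key" ;;
    *"BEGIN CERTIFICATE-----") echo "certificate" ;;
    *)                         echo "unknown" ;;
  esac
}

# Throwaway samples: a key and a certificate, both saved with a .pem extension.
samples="$(mktemp -d)"
openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
  -keyout "$samples/key.pem" -out "$samples/cert.pem" \
  -subj "/CN=example.invalid" 2>/dev/null

pem_kind "$samples/key.pem"    # prints: private key
pem_kind "$samples/cert.pem"   # prints: certificate
```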



  • Use ssh-agent to store private keys. Makes life much simpler!
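A minimal sketch of the ssh-agent workflow, assuming the OpenSSH client tools are installed. The demo key and its path are made up for the example; with a real key you would pass its own path to ssh-add.

```shell
# Throwaway key with an empty passphrase, purely for the demo.
demo_dir="$(mktemp -d)"
ssh-keygen -q -t ed25519 -N "" -f "$demo_dir/demo_key"

# Start an agent for this shell; the eval exports SSH_AUTH_SOCK and SSH_AGENT_PID.
eval "$(ssh-agent -s)" >/dev/null

# Hand the private key to the agent, then list what it currently holds.
ssh-add "$demo_dir/demo_key" 2>/dev/null
loaded_keys="$(ssh-add -l)"
echo "$loaded_keys"

# Stop the demo agent when done.
ssh-agent -k >/dev/null
```

With a passphrase-protected key, ssh-add asks for the passphrase once, and later ssh connections in the same session reuse the key from the agent.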


Visualization Using D3 (and dependent libraries)

This link gives a nice summary of data visualization libraries using D3:

Interestingly, it mentions mermaid and rickshaw, two cool libraries I recently came across.

Real time:


Details on Running R Scripts

  • trace(functionName, edit=TRUE). Then write browser() where you want it to break
  • source('~/scripts/trial3_criteo_ensembles_ag.R', echo=TRUE)
  • Rscript -e ".libPaths()"

Other functions in R I didn’t know:

  • class(score)
  • names(score)
  • head(criteoTest)
  • class(criteoTest)
  • rxGetVarInfo(criteoTest)
  • warnings()
  • sapply(score, class)