Thursday, 26 May 2016

How to run Spark Job Server and Spark jobs

Spark Job Server provides a RESTful interface for submitting and managing Spark jobs, jars, and job contexts. It facilitates sharing of jobs and RDD data in a single context, and it can run standalone jobs as well. Job history and configuration are persisted.

Features:
A few of the features are listed here:
    · Simple REST interface
    · Separate JVM per SparkContext for isolation
    · Separate jar uploading step for faster job execution
    · Supports low-latency jobs via long-running job contexts
    · Asynchronous and synchronous job APIs
    · Kill running jobs by stopping the context or deleting the job
    · Named Objects (RDDs/DataFrames) can be cached and retrieved by name, improving object sharing and reuse among jobs
    · Preliminary support for Java

Setup Spark Job Server:

Each Spark Job Server release is built against a specific Spark version, so pick the job server version that matches your Spark installation:

Job Server version      Spark version
0.3.1                   0.9.1
0.4.0                   1.0.2
0.4.1                   1.1.0
0.5.0                   1.2.0
0.5.1                   1.3.0
0.5.2                   1.3.1
0.6.0                   1.4.1
0.6.1                   1.5.2
0.6.2                   1.6.1
master                  1.6.1

Pre-requisites:
To set up the server, the pre-requisites are:

    · 64-bit operating system
    · Java 8
    · sbt
    · curl
    · git
    · Spark

Please make sure the Spark Job Server version is compatible with your Spark version; the table above lists the compatible versions.

You can install Java 8 from here.

For sbt, you can refer to the sbt official site.

For CentOS users:
sudo yum install curl
sudo yum install git
sudo yum install sbt

For Ubuntu users:
sudo apt-get install curl
sudo apt-get install git
sudo apt-get install sbt
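
After installing, you can quickly confirm that the tools are available on the PATH. A minimal sanity check might look like this (the first run of sbt downloads its dependencies, so the last command can take a while):

java -version
git --version
curl --version
sbt sbtVersion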

Download the Spark package and set it up. Windows users can refer to the link How to setup Spark on Windows. Once the Spark setup is done, start the Spark master and worker daemons:
[xuser@machine123 spark-1.6.1-bin-hadoop2.6]$ sbin/start-all.sh

Now clone the spark job server repo on your local.

[xuser@machine123 ~]$ git clone https://github.com/spark-jobserver/spark-jobserver.git

Run the sbt command in the cloned repo. It will build the project and give you the sbt shell. If you are running the sbt command for the first time, it will take quite a while. Then type re-start on the sbt shell to start the server:

[xuser@machine123 spark-jobserver]$ sbt
[info] Loading project definition from /home/xuser/softwares/spark-jobserver/project
Missing bintray credentials /home/xuser/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to root (in build file:/home/xuser/spark-jobserver/)
> re-start

If you want to use a specific configuration file to start the server, pass it to re-start. You can also specify JVM parameters after "---". Including all the options, it looks like this:
> re-start config/application.conf --- -Xmx512m
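
The configuration file uses the Typesafe Config (HOCON) format. Below is a minimal sketch of what such a file might contain; the values (master URL, port, context defaults) are illustrative assumptions rather than the shipped defaults, so check the templates under the config/ directory of the repo for the full set of options.

# Illustrative sketch only - adjust for your own cluster
spark {
  # Spark master the job server should submit to (assumption: local standalone master)
  master = "spark://machine123:7077"

  jobserver {
    # Port the REST interface listens on
    port = 8090
  }

  # Defaults applied to contexts created through the REST API
  context-settings {
    num-cpu-cores = 2
    memory-per-node = 512m
  }
}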

It will start the Spark Job Server at the URL http://localhost:8090. You can see all the running daemons using jps.
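
For example, running jps should now show the standalone Master and Worker processes as well as the job server JVM (the exact process names depend on how each daemon was launched), and a quick curl against the port confirms the REST interface is up:

[xuser@machine123 ~]$ jps
[xuser@machine123 ~]$ curl -s localhost:8090 > /dev/null && echo "job server is up"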


Sample Spark Jobs Walkthrough:

Spark Job Server ships with some sample Spark jobs written in Scala. To package the test jar, run this command:
[xuser@machine123 spark-jobserver]$ sbt job-server-tests/package

It will give you a jar in the job-server-tests/target/scala-2.10 directory. Now upload the jar to the server:

[xuser@machine123 spark-jobserver]$ curl --data-binary @job-server-tests/target/scala-2.10/job-server-tests_2.10-0.7.0-SNAPSHOT.jar localhost:8090/jars/test
OK

This jar is uploaded as the app test. You can view the same information on the web UI.
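You can also confirm the upload from the command line; the GET /jars endpoint returns the uploaded app names along with their upload time:

[xuser@machine123 ~]$ curl localhost:8090/jars
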
We can run jobs in two modes: Transient Context mode and Persistent Context mode.

Unrelated Jobs - with Transient Context:

In this mode, each job creates its own Spark context. Let's submit the WordCount job to the server:
[xuser@machine123 ~]$ curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}
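
Because this call is asynchronous, the response only tells you the job has started. You can poll the job with the returned jobId to get its status and, once it finishes, its result (the ID below is simply the one returned above):

[xuser@machine123 ~]$ curl localhost:8090/jobs/5453779a-f004-45fc-a11d-a39dae0f9bf4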

Persistent Context mode - Related Jobs:

In this mode, jobs can share an existing Spark context. Create a Spark context named ‘test-context’:
[xuser@machine123 ~]$ curl -d "" 'localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m'
OK

To see the existing contexts:
[xuser@machine123 ~]$ curl localhost:8090/contexts
["test-context"]

To run the job in existing context:
[xuser@machine123 ~]$ curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true'
{
  "result": {
    "a": 2,
    "b": 2,
    "c": 1,
    "see": 1
  }
}

You can run a job without any input arguments by passing -d "":

[xuser@machine123 ~]$ curl -d "" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.LongPiJob&context=test-context&sync=true'
{
  "result": 3.1403460207612457
}

You can check the status of a job by passing its job ID in the following command:

[xuser@machine123 ~]$ curl localhost:8090/jobs/<jobID>
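
A few related endpoints are also handy: GET /jobs lists recent jobs with their status, GET /jobs/<jobID>/config returns the configuration a job was started with, and DELETE /contexts/<name> stops a persistent context (and any jobs running in it) once you are done with it:

[xuser@machine123 ~]$ curl localhost:8090/jobs
[xuser@machine123 ~]$ curl localhost:8090/jobs/<jobID>/config
[xuser@machine123 ~]$ curl -X DELETE localhost:8090/contexts/test-context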

You can see all the running, completed, and failed jobs on the Job Server UI. Now you are ready to write your own jobs to run on Spark Job Server!
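
To write your own job, you implement the job server's SparkJob API instead of creating your own SparkContext. The sketch below is modeled on the bundled WordCountExample and assumes the classic spark.jobserver.SparkJob trait from the 0.6.x-era job-server-api; the object name and word-count logic are illustrative, so check the bundled samples for the exact API of the version you cloned.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// Illustrative job: counts word occurrences in the "input.string" parameter.
object MyWordCountJob extends SparkJob {

  // Called before runJob; lets the server reject a request early if required input is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("No input.string config param")
  }

  // The actual job; whatever is returned here becomes the "result" field of the REST response.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val words = config.getString("input.string").split(" ").toSeq
    sc.parallelize(words).countByValue()
  }
}

Package it into a jar (for example with sbt package), upload it with curl --data-binary as shown earlier, and submit it with classPath=<your.package>.MyWordCountJob.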