Spark on YARN - limit attempts at runtime

If a Spark app fails, YARN will reattempt it a certain number of times. I am happy with retries after network/node failures and other random events. I can even set the number of attempts for my Spark app by passing the --conf spark.yarn.maxAppAttempts=3 option to spark-submit.
However, this can only be done at submission; I see no way of changing this parameter at runtime. But there are cases where I want to abort an already running app without a reattempt. How can I either:
set maxAppAttempts from within the app
or exit in a way that tells YARN not to rerun the app?


How to run 2 EMR Spark Steps Concurrently?

I am trying to have 2 steps run concurrently in EMR. However, the first step always runs while the second stays pending.
Part of my YARN configuration is as follows:
{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"
  }
}
When I run on my local Mac, I am able to run the 2 applications on YARN with a similar configuration; the only changes are the spark-submit resource requests, adjusted to match the cluster capacity and the performance required.
In other words, my YARN is set up to run multiple applications.
Hence, before I dig deep into it, I wonder whether it is actually possible to have the steps run concurrently, or only serially?
Otherwise, are there any tips or anything specific needed to run two jobs concurrently?
My cluster has far more capacity than what each job requests. Hence I do not understand why they can't run concurrently.
Is it possible to have the steps run concurrently or only serially?
AWS support confirmed that multiple steps cannot run in parallel (concurrently); steps are serial, so what you are seeing (i.e. the second job in a pending state) is expected.
Are there any tips or anything specific needed to run two jobs concurrently?
You can simply put both spark-submit commands in a bash script and run the bash script, but you might lose some direct debugging info on the AWS web console (which IMO is slow already); you can still see this debugging info on the Spark history server.
On your local Mac, you are able to run multiple YARN applications in parallel because you are submitting the applications to YARN directly, whereas in EMR the YARN/Spark applications are submitted through AWS's internal `command-runner.jar`, which does a bunch of other logging/bootstrapping etc. so that the `emr step` info can be seen on the web console.
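As a rough illustration of the "both submissions in one script" idea, here is a minimal Python sketch of the same thing: start every command first, then wait for all of them. The helper name `run_concurrently` and the placeholder commands are assumptions for illustration only; on a cluster each entry would be a full spark-submit argument list.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, then wait for all of them to finish.

    Returns the exit codes in the same order as `commands`.
    """
    # Popen returns immediately, so all processes are running before we wait.
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


if __name__ == "__main__":
    # Placeholder commands so the sketch runs anywhere; on a cluster each
    # entry would be a spark-submit argument list instead.
    demo = [
        [sys.executable, "-c", "print('job A submitted')"],
        [sys.executable, "-c", "print('job B submitted')"],
    ]
    print(run_concurrently(demo))
```

Because both processes are started before either is waited on, the two submissions reach YARN independently of each other, which is what the local setup described above relies on.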
There are 2 deploy modes for running an application on YARN in AWS EMR: client and cluster.
If you use client mode, then only one step will be in the running state at a given time.
However, there is an option with which you can run more than 1 step concurrently.
Try submitting your step in the mode below:
spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 2 --driver-memory 1g --executor-cores 2 --conf spark.yarn.submit.waitAppCompletion=false --class WordCount.word.App /home/hadoop/word.jar
Instead of letting AWS EMR define the memory allocation, try defining your own allocation. Refer to link:
spark.yarn.submit.waitAppCompletion=false : In YARN cluster mode, controls whether the client waits to exit until the application completes. If set to true, the client process will stay alive reporting the application's status. Otherwise, the client process will exit after submission.
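To make the shape of such a step concrete, here is a hedged Python sketch that assembles the same kind of spark-submit call and wraps it in a step definition that runs through `command-runner.jar`. The helper name `build_spark_step` is hypothetical, and the assumption is that the resulting dictionary would be passed to something like boto3's `add_job_flow_steps`; nothing is sent to AWS here.

```python
def build_spark_step(name, app_jar, main_class, extra_conf=None):
    """Return an EMR step definition that runs spark-submit in cluster mode."""
    # waitAppCompletion=false lets the spark-submit client exit right after
    # submission, so the step does not stay in the running state.
    conf = {"spark.yarn.submit.waitAppCompletion": "false"}
    conf.update(extra_conf or {})

    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args += ["--class", main_class, app_jar]

    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


if __name__ == "__main__":
    step = build_spark_step("word-count", "/home/hadoop/word.jar", "WordCount.word.App")
    print(" ".join(step["HadoopJarStep"]["Args"]))
```

Resource flags such as --executor-memory or --num-executors from the command above could be appended to `args` in the same way.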
Hope this is of help to you.

Cannot kill job gracefully in Spark standalone cluster

There is a problem with killing streaming jobs gracefully in Spark 2.1.0 with spark.streaming.stopGracefullyOnShutdown enabled.
I've tested killing Spark jobs in many ways and reached some conclusions.
1. With the command spark-submit --master spark:// --kill driver-id
It kills the job almost immediately, not gracefully.
2. With the API: curl -X POST http://localhost:6066/v1/submissions/kill/driverId
The same as in 1. (I looked at the spark-submit code and it seems that this tool just calls the REST endpoint.)
3. With Unix kill driver-process
It doesn't kill the job at all (the driver is immediately restarted).
Then I noticed that I had used the --supervise param, so I repeated all these tests without this flag. It turned out that methods 1 and 2 worked the same way as before, but method 3 worked as I had assumed. That is, after calling kill driver-process, Spark digests all the messages left in Kafka and then shuts the job down gracefully. It is of course a solution of sorts, but quite inconvenient, since I must track the machine running the driver instead of using the simple Spark REST endpoint. The second downside is that I cannot use the --supervise flag, so whenever the node with the Spark driver fails, the job stops.
Is anybody able to explain to me why there are so many issues regarding this case, and why methods 1 and 2 work differently from method 3?
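For reference, the REST call from the curl example can be sketched in Python as below. This only builds the request object so the sketch can run without a cluster; the helper name `build_kill_request` and the example driver ID are made up for illustration, and the host and port are simply the ones used in the curl command.

```python
import urllib.request


def build_kill_request(driver_id, master_rest_url="http://localhost:6066"):
    """Build (but do not send) the POST request that asks the master to kill a driver."""
    url = f"{master_rest_url.rstrip('/')}/v1/submissions/kill/{driver_id}"
    # An empty body with an explicit POST method mirrors `curl -X POST <url>`.
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    request = build_kill_request("driver-20170101000000-0000")  # hypothetical ID
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

As observed in the tests above, this route stops the driver almost immediately, so it is not a substitute for signalling the driver process when a graceful shutdown is needed.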

Setting Driver manually in Spark Submit over Yarn Cluster

I noticed that when I start a job with spark-submit using YARN, the driver and executor nodes get set randomly. Is it possible to set this manually, so that when I collect the data and write it to a file, it is written on the same node every single time?
As of right now, the parameters I have tried playing around with are: <driver-ip-address>
spark.driver.hostname <driver-ip-address>
If you submit to Yarn with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
Also you can configure node labels for executors using property: spark.yarn.executor.nodeLabelExpression
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
Docs - Running Spark on YARN - Latest Documentation
A Spark cluster can run in either yarn-cluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
So as you can see, depending on the mode, Spark picks the Application Master. Nothing is chosen randomly up to this stage. However, the worker nodes on which the application master asks the resource manager to run tasks are picked randomly, based on the availability of the worker nodes.
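Putting the two suggestions from the answers together, a submission that pins the driver to the submitting node and restricts executors to labelled nodes might be assembled as in this sketch. The helper name `build_submit_args`, the label `writer` and the file `my_app.py` are hypothetical; only the flags and the property name come from the answers above.

```python
def build_submit_args(app, deploy_mode="client", executor_label=None):
    """Assemble a spark-submit argument list for YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")

    # In client mode the driver runs on the machine that executes this command.
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if executor_label is not None:
        # Restricts the nodes on which executors may be scheduled.
        args += ["--conf", f"spark.yarn.executor.nodeLabelExpression={executor_label}"]
    args.append(app)
    return args


if __name__ == "__main__":
    print(" ".join(build_submit_args("my_app.py", executor_label="writer")))
```

The label only has an effect if node labels are actually configured on the YARN cluster.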

How to exit spark-submit after the submission

When submitting a Spark Streaming program using spark-submit (YARN mode),
it keeps polling the status and never exits.
Is there any option in spark-submit to exit after the submission?
===why this troubles me===
The streaming program will run forever and I don't need the status updates.
I can Ctrl+C to stop it if I start it manually,
but I have lots of streaming contexts to start and I need to start them using a script.
I can put the spark-submit program in the background,
but after lots of background Java processes have been created, the corresponding user is not able to run any other Java process because the JVM cannot create a GC thread.
Interesting. I never thought about this issue. I am not sure there is a clean way to do this, but I simply kill the submit process on the machine, and the YARN job continues to run until you stop it explicitly. So you can create a script that executes spark-submit and then kills it. When you actually want to stop the job, use yarn application -kill <application ID>. Dirty, but it works.
I know this is an old question but there's a way to do this now by setting --conf spark.yarn.submit.waitAppCompletion=false when you're using spark-submit. With this the client will exit after successfully submitting the application.
In YARN cluster mode, controls whether the client waits to exit until
the application completes. If set to true, the client process will
stay alive reporting the application's status. Otherwise, the client
process will exit after submission.
Also, you may need to set --deploy-mode to cluster
In cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go
away after initiating the application.
More at
The command timeout TIME CMD will terminate CMD after TIME.
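The same "start it, then cut the client off after a while" approach can be sketched in Python with the `timeout` argument of `subprocess.run`. The helper name `run_with_timeout` is an assumption, and the placeholder command stands in for a real spark-submit call.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed after `seconds`."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising, so the
        # client is gone by the time we get here.
        return None


if __name__ == "__main__":
    # Placeholder for a long-running spark-submit client.
    slow_client = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow_client, 1))
```

One difference worth keeping in mind: the `timeout` utility sends SIGTERM by default, whereas `subprocess.run` kills the child outright, so this sketch is closer to the "kill the submit process" approach described in the first answer.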

How to configure automatic restart of the application driver on Yarn

From the Spark Programming Guide
To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - Yarn supports a similar mechanism for automatically restarting an application. Please refer to YARN documentation for more details.
So, the question is how to support auto-restart for Spark Streaming on YARN.
Thanks and best regards,
What you are looking for is the set of instructions to launch your application in YARN "cluster mode":
This means that your driver application runs on the cluster on YARN (not on your local machine). As such it can be restarted by YARN if it fails.
as documented here:
spark.yarn.maxAppAttempts -
"The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration."
to set "global number of max attempts in the YARN configuration": -
"The maximum number of application attempts. It's a global setting for all application masters. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound. If it is, the resourcemanager will override it. The default number is set to 2, to allow at least one retry for AM"