OptaPlanner: Reproducible solution

I am trying to solve a problem similar to employee rostering. The problem I am facing is that every time I run the solver, it generates a different assignment. This makes it harder to debug why a particular case was picked over another. Why is this the case?
P.S. My assignment has many hard constraints and not all of them can be satisfied (in most cases I still see a negative hard score). So my termination strategy is based on unimprovedSecondsSpentLimit. Could this be the reason?

Yes, it's likely the termination. OptaPlanner's default environmentMode guarantees the exact same solution at the exact same step (*). But CPU cycles differ a lot from run to run, which means you get more or fewer steps per run. Use DEBUG logging to see that.
Use a stepCountLimit or unimprovedStepCountLimit termination instead.
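A minimal sketch of that switch, assuming the programmatic SolverConfig API (the same change can be made in solverConfig.xml; package names vary slightly between OptaPlanner versions, and the wrapper class and the limit values here are only placeholders):

```java
import org.optaplanner.core.config.solver.SolverConfig;
import org.optaplanner.core.config.solver.termination.TerminationConfig;

public class ReproducibleTerminationExample {

    // Swap the time-based termination for a step-based one, so that every run
    // performs the same number of steps regardless of how fast the CPU is.
    public static SolverConfig withStepBasedTermination(SolverConfig solverConfig) {
        TerminationConfig terminationConfig = new TerminationConfig();
        // terminationConfig.setUnimprovedSecondsSpentLimit(30L); // run-dependent: faster runs squeeze in more steps
        terminationConfig.setUnimprovedStepCountLimit(100);       // run-independent: same step count every run
        solverConfig.setTerminationConfig(terminationConfig);
        return solverConfig;
    }
}
```

With a step-based termination, two runs on the same dataset with the same random seed go through the same number of steps and therefore end with the same solution, which makes individual assignments much easier to debug.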
(*) Unless specified otherwise in the docs. Simulated Annealing, for example, will differ even at the exact same step if used with time-bound terminations.

Related

OptaPlanner: Gaps in Chained Through Time Pattern

I just started learning OptaPlanner recently. Please pardon me if there is any technically inaccurate description below.
Basically, I have a problem of assigning several tasks to a bunch of machines. Tasks have precedence restrictions such that some tasks cannot be started before the end of other tasks. In addition, each task can only be run on certain machines. The target is to minimize the makespan of all these tasks.
I modeled this problem with the Chained Through Time pattern, in which each machine is the anchor. But the problem is that tasks on a certain machine might not be executed back to back because of the precedence restriction. For example, Task B can only be started after Task A completes, while Tasks A and B are executed on machines I and II respectively. This means that during the execution of Task A on machine I, if there is no other task that can be run on machine II, then machine II can only stay idle until Task A completes, at which point Task B can be started on it. This kind of gap is not deterministic, as it depends on the duration of Task A in this example.

According to the OptaPlanner tutorial, it seems that an additional planning variable for the gap should be introduced for this kind of problem, but I have difficulty modeling this gap variable. In general, how do I integrate the gap variable into the model when using the Chained Through Time pattern? Some detailed explanation or even a simple example would be highly appreciated.
Moreover, I'm actually not sure whether the Chained Through Time pattern is suitable for modeling this kind of task-assignment problem at all, or whether I've simply used an entirely inappropriate method. Could someone please shed some light on this? Thanks in advance.
I'm using the Chained Through Time pattern to solve the same kind of problem as yours. To handle the precedence restriction, you can write Drools rules.
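For the gap the question asks about, one option (in the spirit of the Chained Through Time pattern in the OptaPlanner docs) is not to model the gap as an extra genuine planning variable at all, but to compute each task's start time as a shadow variable that a custom VariableListener keeps up to date. The sketch below is only illustrative: Task, TaskOrMachine and all of their members are hypothetical domain types invented here, and the VariableListener/ScoreDirector imports follow the OptaPlanner 7.x layout, which differs in newer versions.

```java
// NOTE: these imports follow the OptaPlanner 7.x layout; newer versions moved/re-generified them.
import org.optaplanner.core.impl.domain.variable.listener.VariableListener;
import org.optaplanner.core.impl.score.director.ScoreDirector;

/** Hypothetical chain element: either the Machine anchor or a Task. */
interface TaskOrMachine {
    Long getEndTime(); // a Machine anchor returns 0L, meaning "free from the start"
}

/** Hypothetical planning entity; in a real model this carries the chained planning variable. */
interface Task extends TaskOrMachine {
    TaskOrMachine getPreviousTaskOrMachine(); // chained planning variable
    Task getPredecessor();                    // precedence restriction (problem fact), may be null
    Task getNextTask();                       // inverse-relation shadow variable, null at the chain's tail
    void setStartTime(Long startTime);        // the shadow variable this listener maintains
}

public class StartTimeUpdatingVariableListener implements VariableListener<Task> {

    @Override
    public void afterEntityAdded(ScoreDirector scoreDirector, Task task) {
        updateStartTime(scoreDirector, task);
    }

    @Override
    public void afterVariableChanged(ScoreDirector scoreDirector, Task task) {
        updateStartTime(scoreDirector, task);
    }

    // The remaining callbacks need no work for this shadow variable.
    @Override public void beforeEntityAdded(ScoreDirector scoreDirector, Task task) {}
    @Override public void beforeVariableChanged(ScoreDirector scoreDirector, Task task) {}
    @Override public void beforeEntityRemoved(ScoreDirector scoreDirector, Task task) {}
    @Override public void afterEntityRemoved(ScoreDirector scoreDirector, Task task) {}

    private void updateStartTime(ScoreDirector scoreDirector, Task sourceTask) {
        // Walk down the chain from the changed task and recompute every start time.
        // The "gap" is implicit: a task starts at the later of (a) the moment its
        // machine becomes free and (b) the moment its precedence predecessor ends.
        Task task = sourceTask;
        while (task != null) {
            Long machineFree = task.getPreviousTaskOrMachine().getEndTime();
            Task predecessor = task.getPredecessor();
            Long predecessorEnd = (predecessor == null) ? Long.valueOf(0L) : predecessor.getEndTime();
            Long newStartTime = (machineFree == null || predecessorEnd == null)
                    ? null
                    : Math.max(machineFree, predecessorEnd);
            scoreDirector.beforeVariableChanged(task, "startTime");
            task.setStartTime(newStartTime);
            scoreDirector.afterVariableChanged(task, "startTime");
            task = task.getNextTask();
        }
    }
}
```

In a real model, the startTime field would typically be declared as a custom shadow variable wired to this listener, and the score rules (for example the Drools rules mentioned above) would then only need to read the computed start and end times, e.g. to minimize the makespan.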

OptaPlanner immediately produces better solutions after terminating and restarting the solver

I created a solution based on the task assigning example of OptaPlanner and observe one specific behavior in both the original example and my own solution:
Solving the 100tasks-5employees problem hardly produces any new best scores after half a minute or so, but terminating the solver and restarting it again immediately brings up better solutions.
Why does this happen? In my understanding, the repeated construction heuristic does not change any planning entity, as all of them are already initialized. Then local search is started again. Why does it immediately find new better solutions, while just continuing the first execution without interruption does not, or at least does so much more slowly?
By terminating and restarting the solver, you're effectively causing Late Acceptance to do a reheating. OptaPlanner will do automatic reheating once this jira is prioritized and implemented.
This occurs on a minority of the use cases. But if it occurs on a use case, it tends to occur on all datasets.
In some cases I've worked around it by configuring multiple <localSearch> phases with <unimprovedSecondsSpentLimit> terminations, but I don't like that. Fixing that jira is the only real solution.
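A hedged sketch of that workaround, using the programmatic config API (the number of phases and the 30-second limits below are arbitrary placeholders; the same setup is normally written as repeated <localSearch> elements in solverConfig.xml, and package names may differ between OptaPlanner versions):

```java
import java.util.ArrayList;
import java.util.List;

import org.optaplanner.core.config.constructionheuristic.ConstructionHeuristicPhaseConfig;
import org.optaplanner.core.config.localsearch.LocalSearchPhaseConfig;
import org.optaplanner.core.config.phase.PhaseConfig;
import org.optaplanner.core.config.solver.SolverConfig;
import org.optaplanner.core.config.solver.termination.TerminationConfig;

public class ReheatingWorkaroundExample {

    // Several local search phases in a row: each stops after 30 unimproved seconds,
    // and starting the next phase has an effect similar to terminating and
    // restarting the solver (a manual "reheat" of Late Acceptance).
    public static SolverConfig withMultipleLocalSearchPhases(SolverConfig solverConfig) {
        List<PhaseConfig> phaseConfigList = new ArrayList<>();
        phaseConfigList.add(new ConstructionHeuristicPhaseConfig());
        phaseConfigList.add(localSearchPhaseStoppingAfterUnimprovedSeconds(30L));
        phaseConfigList.add(localSearchPhaseStoppingAfterUnimprovedSeconds(30L));
        phaseConfigList.add(localSearchPhaseStoppingAfterUnimprovedSeconds(30L));
        solverConfig.setPhaseConfigList(phaseConfigList);
        return solverConfig;
    }

    private static LocalSearchPhaseConfig localSearchPhaseStoppingAfterUnimprovedSeconds(long seconds) {
        LocalSearchPhaseConfig phaseConfig = new LocalSearchPhaseConfig();
        TerminationConfig terminationConfig = new TerminationConfig();
        terminationConfig.setUnimprovedSecondsSpentLimit(seconds);
        phaseConfig.setTerminationConfig(terminationConfig);
        return phaseConfig;
    }
}
```

Each phase-level termination only decides when to move on to the next phase; a solver-level termination (if any) can still end the whole run earlier.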

Prevent DynamicSupervisor from shutting down if a child reaches max_restarts

I have a DynamicSupervisor that starts children with restart: :transient. By default, if a child exits abnormally, it will be restarted by the supervisor.
However, by design, if the child keeps failing after 3 restarts, the supervisor itself will exit. From the docs:
https://hexdocs.pm/elixir/Supervisor.html#module-exit-reasons-and-restarts
Notice that supervisor that reached maximum restart intensity will exit
with :shutdown reason. In this case the supervisor will only be restarted
if its child specification was defined with the :restart option set to :permanent
(the default).
Since killing the supervisor will also kill other children (background jobs that are in progress) I would like to avoid this scenario.
The question is: after reaching max_restarts, how can I kill the failing child process, preserving the supervisor and its other children?
Using Elixir 1.6 / OTP 20.
Update: I found this answer on StackOverflow that essentially suggests that the top-level DynamicSupervisor launches a DynamicSupervisor for each child; the top-level will start the child supervisors with restart: :permanent or :temporary. That's a good workaround, but I'd be interested if there is another solution.
DynamicSupervisor adheres to the same restart policy as the regular Supervisor and it works the way it does for a good reason. Instead of trying to work around this behaviour we need to understand why it is the way it is.
Understanding supervisor’s purpose
A supervisor monitors its children and in case an unexpected failure brings any of them down, it will restart it with a known initial state. The key to understanding the rationale behind restart limits lies in the definition of unexpected failures.
Unexpected here does not mean something you hadn’t thought about before pushing untested code to production. It’s something that only happens in rare circumstances which are difficult to simulate during normal testing, something that’s difficult to reproduce and that does not happen very often.
Catching such failures is difficult even with the default limit of 3 restarts within 5 seconds. In fact, this limit is way too conservative for live systems. I think it’s mostly useful for catching bugs early in development. When a bug is causing a process to shut down immediately or soon after being started, it won’t take long before it reaches 3 restarts and causes its supervisor to die. At that point you should look for the bug and fix it.
A different way to fail
Assuming you do test your code and are still observing processes die regularly, you’re probably experiencing a different kind of failure – an expected one. I highly suggest reading Fred Hebert's article It's About the Guarantees which covers in great detail the way supervisors should be used and the guarantees they’re supposed to provide. A very brief and abridged version of it:
Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you're ready to say it will always be available no matter what happens.
If you do require a connection to the database to be established in a process's init() callback, failing to connect then really does mean the process cannot function and should die. When it's restarted by the supervisor yet keeps failing, that does indeed mean the whole supervision tree cannot function correctly and should die. This continues recursively until the root supervisor is reached and the whole system goes down.
Now, Elixir provides a lot of solutions to various problems like this out of the box. In a way this is really nice, but it also often makes those problems invisible, leaving newcomers unaware of their existence. For example, Ecto depends on db_connection under the hood to provide a default exponential backoff when a connection to the database cannot be established. This behaviour is described in db_connection’s docs.
So what should you do?
Going back to your problem, at this point it should be clear that another approach has to be employed for a process that can fail often when it's not a bug causing the failures. You need to acknowledge that its failure is expected and handle it explicitly in your code.
Perhaps, your process depends on an external service that may occasionally be unavailable. In that case, you need to use a circuit breaker. There’s one written in Erlang called fuse which is nicely described by its author in this comment on Hacker News.
Netflix has a blog post showcasing the use of circuit breakers in their API which receives a pounding of billions of requests on a daily basis. That’s a mind-boggling scale and it’s even bigger now since that post is from 2011!
If that’s still not the kind of failure you’re experiencing, then, perhaps, you run untrusted code that cannot be relied on? Wrap it in a try-rescue block and return errors as values instead of relying on the supervisor to magically handle them for you.
I hope this helps.

Debugging flows seems really painful

I'm running into serious productivity issues when debugging flows. I can only assume at this point that this is due to a lack of knowledge on my part, particularly of effective techniques for debugging flows. The problems arise when I have one flow which needs to "wait" for the consumption of a specific state. What seems to happen is that the waiting flow starts and waits for the consumption of the specified state, but despite being implemented as a listening future with an associated callback (at this point I'm simply using getOrThrow on the future returned from 'WhenConsumed'), the flows just hang and I see hundreds of Artemis send/write messages in the console window.

If I stop the debug session, delete the node build directory, redeploy the nodes and start again, the flows restart and I can return to the point of failure. However, if I simply stop and detach the debugger from the node and attempt to run the calling test (calling the flow via RPC), nothing seems to happen. It's almost as if the flow code (probably incorrect at this point) results in the StateMachine/messaging layer becoming stuck in some kind of stateful loop which is only resolved by wiping the node build directories and redeploying. Simply restarting the node results in the flow no longer executing at all.

This is a real productivity killer, so I'm writing this question in the hope, and on the assumption, that I've missed an obvious trick for testing and debugging flows in a way that avoids repeatedly redeploying the nodes.
It would be great if someone could explain how to effectively debug flows, especially flows which depend on vault updates and thus wait on a vault update event. I have considered using a subflow, but I believe this would ultimately not provide quite the functionality required, namely to have a flow triggered when an identified state is consumed by a node. Or maybe it would? Perhaps this issue is due to not using a subFlow? I look forward to your thoughts!
I'm not sure about your specific use case, but in general:
I would do as much unit testing as possible before physically running the nodes, to see whether the flow works.
Corda provides three levels of testing: the transaction/ledger DSL, the mock network, and the driver DSL. If done right, most if not all bugs in the flows should be resolved by the time you get to runnodes. Running the actual nodes mostly just reveals configuration issues.

When does it make sense to handle a SIGSEGV signal?

Searching here and there I found some supposedly valid cases, but none of them gave a good (or any) explanation as to why this was the best (or only) choice.
Here are the cases:
1. Try to perform some kind of clean-up before crashing.
2. Provide a friendly error report to the user and maybe send an error report back to you.
3. Use it for debugging.
4. Try to perform a full recovery of your program.
Here are my thoughts on the cases:
1. SIGSEGV signals should not be there in the first place, but then again there is Murphy's law, and there are some resources that the OS won't release implicitly after a program crashes (I am thinking of semaphores or shared memory).
2. Again, there is Murphy's law. Displaying a dialog when things go wrong and asking permission from the user to send an automated report seems very good for both the user and the developer. (I don't remember if any of the error reports of the programs that I use ever mentioned a segmentation fault, though. I guess I will start noticing now.)
3. I have never even thought of this option. A debugger and a core dump look like a much more effective approach.
4. For all I know, this is either impossible or illogical, since the program state is corrupted, making the execution unpredictable (this is another good argument against (1), (2) and (3)). I don't know if there is a very specific case where that might actually make sense, though. This reminds me of an argument in favor of turning assertions off in production software: that sometimes erroneous execution is better than no execution, the usual examples being aviation software and the like.
So:
Are there any good reasons to handle a SIGSEGV signal?
Are any of the above cases indeed valid? Why or why not?
Why are we allowed to handle SIGSEGV in the first place?
You have a program that's supposed to run 24/7 processing stuff (e.g. handling messages) all the time and it uses a third party component to do some of the processing. This third party component sometimes fails with a SIGSEGV.
Clearly, the best long-term solution is to get the vendor to fix the component or use another component, but the short-term solution is to log the error and keep going.
