Solution Partitioning failing - optaplanner

I have implemented a solution partitioner for my planning problem, but when I now run the solver, it throws the following error:
2019-03-21 21:47:41,705 [main] DEBUG PS step (1), time spent (1493), score (-249530hard/0soft), best score (-249530hard/0soft), picked move (part-0 {3886 variables changed}).
Exception in thread "main" java.lang.IllegalStateException: The partition child thread with partIndex (1) has thrown an exception. Relayed here in the parent thread.
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:157)
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:121)
at org.optaplanner.core.impl.heuristic.selector.common.iterator.UpcomingSelectionIterator.hasNext(UpcomingSelectionIterator.java:42)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.solve(DefaultPartitionedSearchPhase.java:131)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.solver.DefaultSolver.solve(DefaultSolver.java:191)
at com.paconsulting.powerpeers.PowerPeersDemo.main(PowerPeersDemo.java:137)
Caused by: java.lang.IllegalStateException: When lookUpEnabled (false) is disabled in the constructor, this method should not be called.
at org.optaplanner.core.impl.score.director.AbstractScoreDirector.lookUpWorkingObject(AbstractScoreDirector.java:506)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:83)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:33)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.forageResult(MultiThreadedLocalSearchDecider.java:196)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.decideNextStep(MultiThreadedLocalSearchDecider.java:157)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.solve(DefaultLocalSearchPhase.java:70)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.partitionedsearch.PartitionSolver.solve(PartitionSolver.java:121)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.lambda$solve$1(DefaultPartitionedSearchPhase.java:119)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have implemented @PlanningId on all the relevant objects.
Running version 7.18 of OptaPlanner.
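For reference, the annotation usage looks roughly like this (a hypothetical entity; class and field names are placeholders, not my real domain):
import org.optaplanner.core.api.domain.entity.PlanningEntity;
import org.optaplanner.core.api.domain.lookup.PlanningId;

// Hypothetical sketch: every entity and planning value class carries a unique,
// never-null @PlanningId so OptaPlanner can look objects up across threads.
@PlanningEntity
public class Participant {

    @PlanningId
    private Long id;

    // planning variables and getters/setters omitted
}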

This is a bug; thank you for reporting it.
Partitioned Search is incompatible with Multithreaded Incremental Solving in version 7.19 or lower. This was a gap in our test coverage.
Partitioned Search was implemented before Multithreaded Incremental Solving, and the latter didn't take it into account here in AbstractScoreDirector:
public InnerScoreDirector<Solution_> createChildThreadScoreDirector(ChildThreadType childThreadType) {
    if (childThreadType == ChildThreadType.PART_THREAD) {
        AbstractScoreDirector<Solution_, Factory_> childThreadScoreDirector = (AbstractScoreDirector<Solution_, Factory_>)
                scoreDirectorFactory.buildScoreDirector(false, constraintMatchEnabledPreference); // That false is lookUpEnabled
        ...
That false kills the ability to nest multithreaded solving under Partitioned Search.
I've created a Jira issue and fixed it for 7.20 in this pull request.
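Until you can move to 7.20, a workaround consistent with the diagnosis above is to not nest multithreaded solving inside Partitioned Search, i.e. leave moveThreadCount unset (or set it to NONE) while a partitioned search phase is configured. A minimal sketch against the 7.x programmatic config API; treat it as an assumption, not the official fix:
import org.optaplanner.core.config.solver.SolverConfig;

public class PartitionedSearchWorkaround {

    // Hedged workaround sketch for 7.19 or lower: with Partitioned Search configured,
    // keep move threads off so ChangeMove.rebase() is never called on a score
    // director that was built with lookUpEnabled = false.
    public static SolverConfig disableMoveThreads(SolverConfig solverConfig) {
        solverConfig.setMoveThreadCount("NONE"); // or simply omit moveThreadCount
        return solverConfig;
    }
}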

Related

How to find exception root causes in an Apache Flink application?

I have an application that uses the Apache Flink streaming framework and works with Kafka sources and sinks.
During data processing I will randomly get exceptions like this:
09:59:16.087 ERROR o.a.k.clients.producer.KafkaProducer - Interrupted while joining ioThread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_51]
at java.lang.Thread.join(Thread.java:1253) [na:1.8.0_51]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1031) [kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1010) [kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:989) [kafka-clients-0.11.0.2.jar:na]
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.close(FlinkKafkaProducer.java:168) [flink-connector-kafka-0.11_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.close(FlinkKafkaProducer011.java:663) [flink-connector-kafka-0.11_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) [flink-core-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117) [flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477) [flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378) [flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) [flink-runtime_2.11-1.6.3.jar:1.6.3]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
09:59:16.087 ERROR o.a.f.s.runtime.tasks.StreamTask - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: Failed to close kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1062) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1010) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:989) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.close(FlinkKafkaProducer.java:168) ~[flink-connector-kafka-0.11_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.close(FlinkKafkaProducer011.java:663) ~[flink-connector-kafka-0.11_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) ~[flink-core-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117) ~[flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477) [flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378) [flink-streaming-java_2.11-1.6.3.jar:1.6.3]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) [flink-runtime_2.11-1.6.3.jar:1.6.3]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
However, I doubt the actual issue has anything to do with the Kafka producer. I had the same exception a while ago (see: this post), and back then I was able to pinpoint the issue by stripping my (now rather complex) application back into smaller blocks until it finally threw an exception from within my code. This time I failed to find the error that way, so I'm lost and don't know how to investigate it.
So the question is: how can I find the source of those exceptions? Is there a recommended way of debugging a Flink application to find these kinds of errors?
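One hedged debugging idea (not from the thread): the Kafka close/interrupt traces above are typically secondary noise from task disposal after the first real failure, so wrapping each suspect user function to log the failing record before rethrowing can surface the root cause in the task manager log. LoggingMap is a made-up helper name:
import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical wrapper: logs the first exception thrown by user code, together
// with the offending record, then rethrows so Flink still fails over as usual.
public class LoggingMap<T> implements MapFunction<T, T> {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingMap.class);

    private final MapFunction<T, T> delegate;

    public LoggingMap(MapFunction<T, T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T map(T value) throws Exception {
        try {
            return delegate.map(value);
        } catch (Exception e) {
            LOG.error("User code failed on record: {}", value, e);
            throw e;
        }
    }
}
Wrapping an operator as stream.map(new LoggingMap<>(originalMapper)) leaves the pipeline behavior unchanged while pinning the failure to a concrete record.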

Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between the driver and executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate on a secured channel and were throwing the following errors.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of Spark's implementation, so be careful when using the workaround described below.
AFAIK, Spark does not support auth/encryption for k8s in version 2.4.0.
There is a ticket, which is already fixed and will likely be released in the next Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is sent only over this connection. However, an executor creates that connection with the default config plus any system properties starting with "spark.".
For reference, here is the place where the executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
In theory, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no spark parameters are set in extraJavaOptions.
There is, however, a slightly hacky workaround: you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM uses this environment variable to add these flags to its system properties.
Also, instead of using JAVA_TOOL_OPTIONS to pass the secret, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
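Putting that together, a hedged sketch of the resulting driver-side configuration (expressed via SparkConf for concreteness; the same keys can be passed as --conf flags to spark-submit, and the secret "good" is just the question's example value):
import org.apache.spark.SparkConf;

public class SecureRpcWorkaround {

    // Hedged sketch: executor-side flags travel via JAVA_TOOL_OPTIONS (the JVM
    // folds that env var into its options), while the secret is delivered
    // through the _SPARK_AUTH_SECRET environment variable.
    public static SparkConf build() {
        return new SparkConf()
                .set("spark.authenticate", "true")
                .set("spark.authenticate.secret", "good") // example value from the question
                .set("spark.network.crypto.enabled", "true")
                .set("spark.executorEnv.JAVA_TOOL_OPTIONS",
                        "-Dspark.authenticate=true -Dspark.network.crypto.enabled=true")
                .set("spark.executorEnv._SPARK_AUTH_SECRET", "good");
    }
}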

Drools 6.5 ConcurrentModificationException with LinkedHashMap Fact

In our Java application using the Drools 6.5 final release, we use Disruptor to run the same rules on different user threads; each thread has its own dedicated session object, while all the sessions are created from a common KieBase. Dev/QA did not see the following error, but in production we do. The object being inserted is a LinkedHashMap instance, and this object is definitely processed by one user thread only (based on the hashCode of an immutable object coming with the message), so it is strange that this LinkedHashMap would be modified by a thread other than its user thread. Any thoughts on what could be the cause?
07:04:15.719 ERROR [RuleHandler6] erf.SupportsProfilingHandlerBase - Exception -
java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:711)
at java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:744)
at java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:742)
at java.util.AbstractMap.hashCode(AbstractMap.java:507)
at org.drools.core.common.EqualityAssertMapComparator.hashCodeOf(EqualityAssertMapComparator.java:46)
at org.drools.core.util.ObjectHashMap.get(ObjectHashMap.java:90)
at org.drools.core.common.ClassAwareObjectStore.getHandleForObject(ClassAwareObjectStore.java:150)
at org.drools.core.common.NamedEntryPoint.getFactHandle(NamedEntryPoint.java:680)
at consolidator.services.DroolsKieContainer$SessionWrapper.internalFire(DroolsKieContainer.java:198)
at consolidator.services.DroolsKieContainer$SessionWrapper.fire(DroolsKieContainer.java:175)
at consolidator.services.DroolsKieService.fire(DroolsKieService.java:153)
at consolidator.disruptor.RuleHandler.handleFIX(RuleHandler.java:88)
at consolidator.disruptor.RuleHandler.onEventCore(RuleHandler.java:68)
at consolidator.disruptor.RuleHandler.onEventCore(RuleHandler.java:15)
at consolidator.perf.SupportsProfilingHandlerBase.onEvent(SupportsProfilingHandlerBase.java:43)
at consolidator.perf.SupportsProfilingHandlerBase.onEvent(SupportsProfilingHandlerBase.java:9)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
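No answer is shown here, but the trace itself narrows things down: the ConcurrentModificationException fires while Drools recomputes AbstractMap.hashCode for its equality-based assert map, so any structural change to the inserted LinkedHashMap between insertion and that lookup, even by the owning thread, can trip it. A hedged mitigation sketch is to insert an immutable snapshot instead of the live map (the helper name is made up):
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public final class FactSnapshots {

    private FactSnapshots() {
    }

    // Hedged mitigation sketch: take the snapshot on the owning thread, before
    // handing the fact to Drools, so nothing can structurally modify it while
    // the session iterates it to compute hashCode.
    public static <K, V> Map<K, V> snapshot(Map<K, V> live) {
        return Collections.unmodifiableMap(new LinkedHashMap<>(live));
    }
}
Calling kieSession.insert(FactSnapshots.snapshot(payload)) instead of inserting the live map would at least rule mutation out as the cause.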

Spring Batch UNKNOWN state

I have a batch job which does some heavy operations; it runs for approximately 11-12 hours.
After that it moves to the UNKNOWN state.
My question is: when would a batch job move to the UNKNOWN state?
Following is the stack trace:
org.springframework.transaction.TransactionSystemException: Could not roll back JDBC transaction; nested exception is java.sql.SQLException: Protocol violation
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doRollback(DataSourceTransactionManager.java:285)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.processRollback(AbstractPlatformTransactionManager.java:845)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.rollback(AbstractPlatformTransactionManager.java:822)
at org.springframework.transaction.support.TransactionTemplate.rollbackOnException(TransactionTemplate.java:161)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:134)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:264)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:76)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:367)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:214)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:143)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:284)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:135)
at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:61)
at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:60)
at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:144)
at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:124)
at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:135)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:282)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:121)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:909)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.sql.SQLException: Protocol violation
at oracle.jdbc.driver.T4CTTIfun.receive
Thanks
Aditya
A batch job will move into the UNKNOWN state only when a rollback is unsuccessful, leaving the job in an uncertain state, which is what it looks like happened here. The real question is: why was the rollback unsuccessful?
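For completeness, a hedged sketch of detecting that state afterwards (assumes a configured JobRepository and the job's original parameters; "heavyJob" is a placeholder name):
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.repository.JobRepository;

public class UnknownStateCheck {

    // Hedged sketch: an execution left in UNKNOWN can be trusted as neither
    // FAILED nor COMPLETED, so it must be inspected manually before a restart.
    public static boolean needsManualIntervention(JobRepository jobRepository,
                                                  JobParameters params) {
        JobExecution last = jobRepository.getLastJobExecution("heavyJob", params);
        return last != null && last.getStatus() == BatchStatus.UNKNOWN;
    }
}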

Taking over log handling in JSF

I'm trying to clean up the server logs in my JSF 2 application so they're less cluttered. That means, for example, not logging exceptions that I expect and am not at all interested in. For this, I've built my own error handler using the technique described at http://jugojava.blogspot.com/2010/09/jsf-2-exception-handling.html.
However, all exceptions thrown in my managed beans are logged several times before reaching my error handler, first by javax.enterprise.resource.webcontainer.jsf.application, and after that by javax.enterprise.resource.webcontainer.jsf.lifecycle. Is it possible to prevent these classes from logging every single exception twice? I'd rather not mute them in log4j, since they might log other useful information.
Example: an NPE in a managed bean. The last row is the only one logged on purpose.
(ALLVARLIG means "SEVERE" and VARNING means "WARNING".)
2012-02-01 07:32:43,477 ALLVARLIG [javax.enterprise.resource.webcontainer.jsf.application] (http-0.0.0.0-80-33) java.lang.NullPointerException: javax.faces.el.EvaluationException: java.lang.NullPointerException
[full stack trace]
2012-02-01 07:32:43,477 VARNING [javax.enterprise.resource.webcontainer.jsf.lifecycle] (http-0.0.0.0-80-33) #{myController.doStuff}: java.lang.NullPointerException: javax.faces.FacesException: #{myController.doStuff}: java.lang.NullPointerException
[full stack trace]
2012-02-01 07:32:43,477 ERROR [com.mycompany.myapplication.controller.error.ErrorHandler] (http-0.0.0.0-80-33) Unknown error: javax.faces.FacesException: #{myController.doStuff}: java.lang.NullPointerException
[full stack trace]
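For reference, the handler technique from the linked post boils down to a JSF 2 ExceptionHandlerWrapper along these lines (a minimal sketch; the filtering of expected exceptions is the part to customize). It runs only after the application and lifecycle phases shown above, which is why those two framework log entries appear first:
import java.util.Iterator;
import javax.faces.FacesException;
import javax.faces.context.ExceptionHandler;
import javax.faces.context.ExceptionHandlerWrapper;
import javax.faces.event.ExceptionQueuedEvent;

public class ErrorHandler extends ExceptionHandlerWrapper {

    private final ExceptionHandler wrapped;

    public ErrorHandler(ExceptionHandler wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public ExceptionHandler getWrapped() {
        return wrapped;
    }

    @Override
    public void handle() throws FacesException {
        Iterator<ExceptionQueuedEvent> events = getUnhandledExceptionQueuedEvents().iterator();
        while (events.hasNext()) {
            Throwable t = events.next().getContext().getException();
            try {
                // Decide here whether t is an expected exception to ignore,
                // or one to log once and redirect to an error page.
            } finally {
                events.remove(); // mark the event as handled
            }
        }
        getWrapped().handle();
    }
}
The wrapper is registered through a matching ExceptionHandlerFactory in faces-config.xml; that boilerplate is omitted here.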
