How to solve a score trap when we want a fair workload split by type - OptaPlanner

I am stuck in a score trap.
My business problem is grouping workers and then scheduling the groups. Please refer to (Why i can't get feasible solution if i add a new soft constraint) for details, though this question should stand on its own.
Following the documentation's way of solving a score trap, I added the sum of the squares of the number of people in each group as a soft constraint. But my grouping strategy is based on worker skills: it is each group's headcount under a certain skill that should be fair.
As you may have noticed, the result when I sum the squares of each group's total headcount is no different from when I sum the squares of each group's headcount under each skill.
In other words, I don't want the sum of the squares of the group headcounts to be as small as possible. What I need is that, for each skill, the difference between the largest and smallest group headcount is at most 5 (max - min <= 5). Under this strategy, the sum of squares of the per-skill group headcounts may even be higher.
I am not seeking the overall optimum.
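To make the difference concrete, here is a minimal Python sketch (the group sizes are made-up example data, and the function names are mine): the sum-of-squares penalty prefers globally balanced sizes, while the max - min <= 5 rule only bounds the per-skill spread.

```python
def sum_of_squares_penalty(sizes):
    # Soft constraint from the score-trap pattern: lower is more balanced.
    return sum(n * n for n in sizes)

def fairness_violation(sizes, tolerance=5):
    # The requirement stated above: per-skill headcounts may differ by at most 5.
    # Returns 0 when satisfied, otherwise the overshoot amount to penalize.
    return max(0, max(sizes) - min(sizes) - tolerance)

# Made-up per-skill headcounts for four groups:
balanced = [10, 10, 10, 10]   # sum of squares 400, spread 0
skewed = [13, 12, 8, 7]       # sum of squares 426, spread 6 -> violation 1
```

A solver minimizing `fairness_violation` per skill may accept a solution with a higher sum of squares, which is exactly the stated requirement.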


Genetic Algorithm Sudoku - optimizing mutation

I am in the process of writing a genetic algorithm to solve Sudoku puzzles and was hoping for some feedback. The algorithm solves puzzles occasionally (about 1 out of 10 times on the same puzzle with a maximum of 1,000,000 iterations), and I am trying to get some input on mutation rates, repopulation, and splicing. Any advice is greatly appreciated, as this is brand new to me and I feel like I am not doing things 100% correctly.
A quick overview of the algorithm
Fitness Function
Counts the number of unique values from 1 through 9 in each column, row, and 3*3 sub-box. Each subset's unique-value count is divided by 9, yielding a floating-point value between 0 and 1. The sum of these 27 values is divided by 27, giving a total fitness between 0 and 1, where 1 indicates a solved puzzle.
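The fitness function described above can be sketched in Python like this (assuming the board is a 9x9 list of lists holding digits 1-9):

```python
def fitness(board):
    # board: 9x9 list of lists with values 1..9.
    # Returns a float in [0, 1]; 1.0 means every row, column and 3x3 box
    # contains all nine distinct digits.
    units = []
    units.extend(board)                                                # 9 rows
    units.extend([[board[r][c] for r in range(9)] for c in range(9)])  # 9 columns
    for br in range(0, 9, 3):                                          # 9 boxes
        for bc in range(0, 9, 3):
            units.append([board[br + r][bc + c]
                          for r in range(3) for c in range(3)])
    # Each of the 27 units contributes (#unique digits)/9; average them.
    return sum(len(set(u)) / 9 for u in units) / 27
```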
Population Size:
Selection: Roulette method. Each individual is randomly selected, where individuals with higher fitness values have a slightly better chance of selection.
Crossover: Two randomly selected chromosomes/boards swap a randomly selected subset (a row, column, or 3*3 box); which subset is swapped is also random. The resulting boards are introduced into the population.
Reproduction Rate: 12% of population per cycle
There are six reproductions per iteration resulting in 12 new chromosomes per cycle of the algorithm.
Mutation: mutation occurs at a rate of 2 percent of the population after 10 iterations with no improvement in the highest fitness.
Listed below are the three mutation methods which have varying weights of selection probability.
1: Swap randomly selected numbers. The method selects two random numbers and swaps them throughout the board. This method seems to have the greatest impact early in the algorithm's growth pattern. 25% chance of selection.
2: Introduce random changes: randomly select two cells and change their values. This method seems to help keep the algorithm from converging prematurely. 65% chance of selection.
3: Count the occurrences of each value on the board. A solved board contains exactly nine of each number from 1 through 9. This method takes a number that occurs fewer than nine times and randomly swaps it with a number that occurs more than nine times. This seems to have a positive impact, but only when used sparingly. 10% chance of selection.
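For illustration, the first two mutation operators might look like this in Python (function names are mine; the `rng` parameter just makes them testable):

```python
import random

def swap_values_mutation(board, rng=random):
    # Operator 1: pick two digits and swap every occurrence across the board.
    a, b = rng.sample(range(1, 10), 2)
    return [[b if v == a else (a if v == b else v) for v in row]
            for row in board]

def random_change_mutation(board, rng=random):
    # Operator 2: overwrite two randomly chosen cells with random digits.
    new = [row[:] for row in board]
    for _ in range(2):
        r, c = rng.randrange(9), rng.randrange(9)
        new[r][c] = rng.randint(1, 9)
    return new
```

Note that operator 1 preserves the count of each digit on the board, while operator 2 does not; that difference is likely why they behave differently at different stages of the run.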
My main question is at what rate I should apply the mutation methods. It seems that as I increase mutation I get faster initial results. However, as the population approaches a correct result, I think the higher rate of change introduces too many bad chromosomes and genes. With a lower rate of change, on the other hand, the algorithm seems to converge too early.
One last question is whether there is a better approach to mutation.
You can anneal the mutation rate over time to get the sort of convergence behavior you're describing. But I actually think there are probably bigger gains to be had by modifying other parts of your algorithm.
Roulette wheel selection applies a very high degree of selection pressure in general. It tends to cause a pretty rapid loss of diversity fairly early in the process. Binary tournament selection is usually a better place to start experimenting. It's a more gradual form of pressure, and just as importantly, it's much better controlled.
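Binary tournament selection is only a few lines; a sketch in Python (the `rng` parameter is mine, to make the draw injectable and testable):

```python
import random

def binary_tournament(population, fitness, rng=random):
    # Draw two individuals uniformly at random; the fitter one becomes a parent.
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b
```

The selection pressure here is fixed and mild: the best individual only wins the tournaments it is actually drawn into, rather than dominating the whole mating pool the way roulette-wheel proportions can.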
With a less aggressive selection mechanism, you can afford to produce more offspring, since you don't have to worry about producing so many near-copies of the best one or two individuals. Rather than 12% of the population producing offspring (possibly fewer because of repetition of parents in the mating pool), I'd go with 100%. You don't necessarily need to literally make sure every parent participates; just generate the same number of offspring as you have parents.
Some form of mild elitism will probably then be helpful so that you don't lose good parents. Maybe keep the best 2-5 individuals from the parent population if they're better than the worst 2-5 offspring.
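A sketch of that replacement step in Python (assuming higher fitness is better; the function name is mine):

```python
def next_generation(parents, offspring, fitness, n_elite=2):
    # Mild elitism: an elite parent replaces the current worst offspring,
    # but only if it is actually better than that worst offspring.
    pool = sorted(offspring, key=fitness)          # pool[0] is the worst
    for elite in sorted(parents, key=fitness, reverse=True)[:n_elite]:
        if fitness(elite) > fitness(pool[0]):
            pool[0] = elite
            pool.sort(key=fitness)
    return pool
```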
With elitism, you can use a bit higher mutation rate. All three of your operators seem useful. (Note that #3 is actually a form of local search embedded in your genetic algorithm. That's often a huge win in terms of performance. You could in fact extend #3 into a much more sophisticated method that looped until it couldn't figure out how to make any further improvements.)
I don't see an obvious better/worse set of weights for your three mutation operators. I think at that point, you're firmly within the realm of experimental parameter tuning. Another idea is to inject a bit of knowledge into the process and, for example, say that early on in the process, you choose between them randomly. Later, as the algorithm is converging, favor the mutation operators that you think are more likely to help finish "almost-solved" boards.
I once made a fairly competent Sudoku solver, using GA. Blogged about the details (including different representations and mutation) here:

Getting the optimal number of employees for a month (rostering)

Is it possible to get the optimal number of employees in a month for a given number of shifts?
I'll explain myself a little further taking the nurse rostering as an example.
Imagine that we don't know the number of nurses to plan for a given month with a fixed number of shifts. Also, imagine that each new nurse inserted into the plan decreases the score, and that each nurse has a limited number of normal hours and a limited number of extra hours. Extra hours decrease the score more than normal ones.
So the problem consists of finding the optimal number of nurses needed, along with their schedule. I've come up with two possible solutions:
Fix the number of nurses clearly above the number needed and treat the problem as an overconstrained one, so that some nurses end up assigned to no shifts at all.
Launch multiple instances of the same problem in parallel, with an incrementally larger number of nurses for each instance. The drawback of this solution is that you have to estimate, beforehand, an approximate range of nurses below and above the number actually needed.
Both solutions are a little inefficient. Is there a better approach to tackle this problem?
I call option 2 doing simulations. Typically in simulations, they don't just play with the number of employees, but also the #ConstraintWeights etc. It's useful for strategic "what if" decisions (What if we ... hire more people? ... focus more on service quality? ... focus more on financial gain? ...)
If you really just need to minimize the number of employees, and you can clearly weight that against all the other hard and soft constraints (probably as a weight in between both, similar to overconstrained planning), then option 1 is good enough - and less CPU costly.

How would I clip a continuous action in an actor-critic agent?

Let's say we have a bot that has some money and some shares. The input is a list of prices for the last 30 days. It doesn't use an RNN; the prices are all fed in at the same time. The output is a continuous action, where a positive number means buying and a negative number means selling that amount of stock. How can I restrict the action space so that it is clipped between how many shares the bot holds (the lower bound) and how much money it has (the upper bound)?
Should I have it clipped or just penalize an illegal action? Which option would create the best results?
You can penalise illegal actions, but in my experience that hasn't shown a good effect on the AI (one more thing for it to worry about). Just clip the output: if it tries to spend more money than it has available, it spends all its money; if it tries to sell more of a stock than it holds, it sells all of its stock. The network will learn quite quickly what happens when it tries to use more resources than it has, so this won't cause any degradation in performance.
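A minimal sketch of that clipping in Python (variable names are mine; the action is denominated in shares, positive to buy, negative to sell):

```python
def clip_action(raw_action, cash, shares_held, price):
    # Upper bound: the most shares the available cash can buy right now.
    # Lower bound: selling everything the agent currently holds.
    max_buy = cash / price
    max_sell = -shares_held
    return max(max_sell, min(raw_action, max_buy))
```

Only the clipped value is executed in the environment; the raw network output can be left untouched, so the policy still observes the consequence of asking for more than it can afford.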
The AI cannot sell shares it doesn't hold or buy shares worth more than the money it has, so you should not allow this kind of transaction at all. However, if your AI looks at the trends and prefers shares that are expected to be more valuable in the near future, then there is a good chance that the total value of its holdings will be higher the next day. Say share1 had a starting value of s1 and an ending value of e1, and share2 had a starting value of s2 and an ending value of e2; then in the case when
e1 / s1 > e2 / s2
it is better to give share1 a higher priority. If any ei / si is smaller than 1, then the AI should not invest in it.
Also, you should value stability: if a share's value has increased continuously over the last few days, then it has an increasing trend. If a share's initial value is smaller than its ending value, but its value decreased over the last few days, then it might be on a decreasing trend and the share should not be preferred. Such rules need to be implemented, and when they conflict, the AI must be able to intelligently choose its priorities.
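That prioritisation rule can be sketched in Python (the function name and dict format are mine):

```python
def pick_share(shares):
    # shares: name -> (start_value, end_value).
    # Prefer the highest end/start growth ratio; skip anything whose
    # ratio is below 1, since that share lost value over the window.
    ratios = {name: end / start
              for name, (start, end) in shares.items() if end / start > 1}
    return max(ratios, key=ratios.get) if ratios else None
```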

Optimising table assignment to guests for an event based on a criteria

There are 66 guests at an event and 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., an even number of men and women at each table, people getting to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of the arrangement (i.e., cost of difference of men women, cost of non-preferred theme, etc.) and I am basically randomly perturbing the arrangement by swapping tables and keeping the "best so far" arrangement. This seems to work, but cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems infeasible: with 66 people there are factorial(66) possible orders, which is a ridiculously large number (about 10^92 according to Python). Since swapping two people at the same table yields the same arrangement, there are actually fewer, which can be calculated by dividing out the repeats, i.e. fact(66)/(fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements - way too many to consider.
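The multinomial count can be checked directly in Python. Assuming the 66 guests split into six tables of 8 and two tables of 9 (one plausible split; any table sizes summing to 66 work the same way):

```python
from math import factorial, log10

table_sizes = [8] * 6 + [9] * 2   # assumed split: 6 tables of 8, 2 of 9

arrangements = factorial(66)
for size in table_sizes:
    arrangements //= factorial(size)   # divide out within-table orderings

print(int(log10(arrangements)))   # 53, i.e. on the order of 10^53 arrangements
```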
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping it only if it gives a better value. The random number of people is selected from an exponential distribution to make it more probable to swap 1 person than 6, for example, to make small steps on average but to keep the possibility of "jumping" a bit further in the search.
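A condensed sketch of that search loop in Python (function names are mine; the exponential draw makes single swaps the most likely move, as described):

```python
import random

def perturb(assignment, rng=random):
    # Swap 1 + Exp(1)-distributed extra pairs: mostly small steps,
    # with an occasional bigger jump through the search space.
    new = assignment[:]
    n_swaps = 1 + int(rng.expovariate(1.0))
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    return new

def random_search(cost, initial, steps=10_000, rng=random):
    best, best_cost = initial, cost(initial)
    for _ in range(steps):
        candidate = perturb(best, rng)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:   # greedy: keep only improvements
            best, best_cost = candidate, candidate_cost
    return best, best_cost
```

Replacing the strict `<` acceptance with a temperature-dependent probability of accepting worse candidates turns this into the simulated-annealing variant mentioned in the update.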
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach where I have a probability of keeping swaps based on the measured improvement factor, that anneals over time. So far the greedy search surprisingly strongly outperforms the probabilistic approach. Adding the annealing schedule seems to help.
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of Levenshtein edit distance, but I can't figure out how to "map" it to some gradient-friendly continuous space. Possibly, if I relax the exact number of people per table and make it continuous, while strongly penalizing deviations from the number I want at each table, the association matrix would become more "flexible" and possibly map better to a gradient space? Not sure. A seating assignment could then be a probability spread over more than one table.

combinatorial optimization: multiple upgrade paths with inventory constraints

I'm playing a video game, and I want to make a program that calculates the globally optimal build/upgrade path towards a fixed 6-item goal.
Time, cost, inventory constraints, and effectiveness (short/mid/long-term) ratings are to be considered. Identifying local spikes in effectiveness would also be welcome, but is optional. I don't know how to classify this problem, but I'm guessing it's a type of graph search. The fact that multiple criteria are being optimized is making things confusing for me.
Problem details:
There are 6 free slots in your bag to hold items.
There are 2 classes of items: basic items, and composite items.
Composite items are built/merged from basic items, and other composite items.
If you have enough gold, you can buy a composite item, and its missing sub components, all at once, using only 1 inventory slot.
The build path for various composite items are fixed, and many basic components are featured in more than one recipe.
Gold is earned at a fixed rate over time, as well as in small non-deterministic bursts.
Time is bounded: it increments in fixed ticks (seconds) and has a max value: 2400.
There exist no more than 50 items, maybe fewer.
So, thinking about the problem...
Tackling the gold/time issue first
We can either ignore the non-deterministic aspect, or use some statistical averages. Let's make life easy, and ignore it for now. Since gold, and time, are now directly related in our simplified version, they can be logically merged.
Combinatorial expansion of feasible paths
A graph could be built, top down, from each of the 6 goal items, indicating their individual upgrade hierarchies. Components that are shared between the various hierarchies can be connected, giving branch decisions. The edges between components can be weighted by their cost. At this point, it sounds like a shortest path problem, except with multiple parallel and overlapping goals.
Now the question is: how do inventory constraints play into this?
The inventory/cost constraints add a context that both disables (no free slots; not enough gold) and enables (two items merged, freeing a slot) various branch decisions, based on previous choices and elapsed time. Also, saving up gold and doing nothing for a non-fixed period can be optimal in certain situations.
How does one expand all the feasible possibilities? Does it have to be done at every given step? How many total combinations are there? Does this fall under topological combinatorics?
Q: How does one expand all the feasible possibilities?
The item build path is a dependency graph. A correct evaluation order of the dependencies is given by the topological ordering of the graph. A graph may have more than one valid evaluation order.
Update 2:
Q: How many total combinations are there?
It seems that it has to be counted; there is no closed-form formula.
Algorithm 3.2, page 150, "On Computing the Number of Topological Orderings of a Directed Acyclic Graph" by Wing-Ning Li, Zhichun Xiao, Gordon Beavers
f(g) = 1, if vertex_count(g) == 1
f(g) = ∑ f(g \ {v}) for all v ∈ deg0set(g)
deg0set(g) = {x ∈ vertices(g) : vertex_in_degree(g, x) == 0}
f[g_ /; Length[VertexList[g]] == 1] := 1
f[g_] := With[
  {deg0set = Select[VertexList[g], VertexInDegree[g, #] == 0 &]},
  Sum[f[VertexDelete[g, v]], {v, deg0set}]
]
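The same recursion is straightforward in Python; a sketch using a dict-of-successor-sets representation of the DAG:

```python
def count_topological_orderings(graph):
    # graph: dict mapping each vertex to the set of its successors (a DAG).
    # Base case: a graph with at most one vertex has exactly one ordering.
    if len(graph) <= 1:
        return 1
    indegree = {v: 0 for v in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    # Sum over every vertex with in-degree 0, as in f(g) above.
    total = 0
    for v in [v for v, d in indegree.items() if d == 0]:
        rest = {u: succs - {v} for u, succs in graph.items() if u != v}
        total += count_topological_orderings(rest)
    return total
```

Applied to the item-dependency graph from the question, this counts valid build orders, ignoring gold/time/slot feasibility; note the recursion is exponential in general, so memoizing on the remaining vertex set helps for larger graphs.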
Rating Effectiveness
If the above expansion produces fewer than a few billion possibilities, we can just search exhaustively using OpenCL/CUDA. I'm not sure what other options are available, since most graph-search material seems to solve for just one criterion.
You need to walk down the graph to establish maximum values at each end stage and element (i.e., add your metric - DPS or whatever - up as you walk down), pruning any combinations that require more slots than you have. Then walk backwards from the best sets of possible end options (which should be fairly sparse, so you should be able to limit them to only the best composite options), always moving in a way that takes you toward the maximum, and assign each option a value equal to its damage.
Given that the DPS (or whatever metric) of basic and intermediate composite items is probably lower than that of the final composite items, this will naturally be weighted towards end-stage effectiveness, so you may want to adjust for that, or set a minimum.
Once you have that, you can figure out the gold costs of your options and decide how to weight getting to things faster.
Then you can go through and establish gold costs over time. You need to decide whether you want to wait and earn. If not, you want to measure the rate of DPS increase as a function of time. If so, I expect it's cheaper overall to buy the items after waiting.
You are going to have to use your data and make some value judgements.