## Getting the optimal number of employees for a month (rostering) - optaplanner

### Genetic Algorithm Sudoku - optimizing mutation

```I am in the process of writing a genetic algorithm to solve Sudoku puzzles and was hoping for some input. The algorithm solves puzzles occasionally (about 1 out of 10 times on the same puzzle with max 1,000,000 iterations) and I am trying to get a little input about mutation rates, repopulation, and splicing. Any input is greatly appreciated as this is brand new to me and I feel like I am not doing things 100% correctly.
A quick overview of the algorithm
Fitness Function
Counts the number of unique values of numbers 1 through 9 in each column, row, and 3*3 sub box. Each of these unique values in the subsets are summed and divided by 9 resulting in a floating value between 0 and 1. The sum of these values is divided by 27 providing a total fitness value ranging between 0 and 1. 1 indicates a solved puzzle.
Population Size:
100
Selection:
Roulette Method. Each node is randomly selected where nodes containing higher fitness values have a slightly better chance of selection
Reproduction:
Two randomly selected chromosomes/boards swap a randomly selected subset (a row, column, or 3*3 box). Which row, column, or box is swapped is chosen at random. The resulting boards are introduced into the population.
Reproduction Rate: 12% of population per cycle
There are six reproductions per iteration resulting in 12 new chromosomes per cycle of the algorithm.
Mutation: mutation occurs at a rate of 2 percent of population after 10 iterations of no improvement of highest fitness.
Listed below are the three mutation methods which have varying weights of selection probability.
1: Swap randomly selected numbers. The method selects two random numbers and swaps them throughout the board. This method seems to have the greatest impact on growth early in the algorithm's growth pattern. 25% chance of selection
2: Introduce random changes: Randomly select two cells and change their values. This method seems to help keep the algorithm from converging. 65% chance of selection
3: Count the number of each value in the board. A solved board contains a count of 9 of each number between 1 and 9. This method takes any number that occurs fewer than 9 times and randomly swaps it with a number that occurs more than 9 times. This seems to have a positive impact on the algorithm, but only when used sparingly. 10% chance of selection
My main question is at what rate should I apply the mutation method. It seems that as I increase mutation I have faster initial results. However as the result approaches a correct result, I think the higher rate of change is introducing too many bad chromosomes and genes into the population. However, with the lower rate of change the algorithm seems to converge too early.
One last question is whether there is a better approach to mutation.
```
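The fitness function described in the question can be sketched as follows. This is a minimal illustration, not the poster's code: it assumes the board is a 9x9 list of lists of integers, and the names `fitness` and `DIGITS` are hypothetical.

```python
DIGITS = set(range(1, 10))  # the values a solved unit must contain


def fitness(board):
    """Score a 9x9 board in [0, 1]; 1.0 means every unit holds 1..9 once."""
    units = [list(row) for row in board]                          # 9 rows
    units += [[board[r][c] for r in range(9)] for c in range(9)]  # 9 columns
    for br in range(0, 9, 3):                                     # 9 boxes
        for bc in range(0, 9, 3):
            units.append([board[r][c]
                          for r in range(br, br + 3)
                          for c in range(bc, bc + 3)])
    # Each unit contributes (unique valid digits) / 9; average over 27 units.
    return sum(len(set(unit) & DIGITS) / 9 for unit in units) / 27


# A known valid pattern, used only to exercise the function.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]
all_ones = [[1] * 9 for _ in range(9)]
```

Here `fitness(solved)` evaluates to 1.0, while `fitness(all_ones)` evaluates to roughly 1/9 because every unit contains only one distinct digit.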
```You can anneal the mutation rate over time to get the sort of convergence behavior you're describing. But I actually think there are probably bigger gains to be had by modifying other parts of your algorithm.
Roulette wheel selection applies a very high degree of selection pressure in general. It tends to cause a pretty rapid loss of diversity fairly early in the process. Binary tournament selection is usually a better place to start experimenting. It's a more gradual form of pressure, and just as importantly, it's much better controlled.
With a less aggressive selection mechanism, you can afford to produce more offspring, since you don't have to worry about producing so many near-copies of the best one or two individuals. Rather than 12% of the population producing offspring (possibly less because of repetition of parents in the mating pool), I'd go with 100%. You don't necessarily need to literally make sure every parent participates, but just generate the same number of offspring as you have parents.
Some form of mild elitism will probably then be helpful so that you don't lose good parents. Maybe keep the best 2-5 individuals from the parent population if they're better than the worst 2-5 offspring.
With elitism, you can use a bit higher mutation rate. All three of your operators seem useful. (Note that #3 is actually a form of local search embedded in your genetic algorithm. That's often a huge win in terms of performance. You could in fact extend #3 into a much more sophisticated method that looped until it couldn't figure out how to make any further improvements.)
I don't see an obvious better/worse set of weights for your three mutation operators. I think at that point, you're firmly within the realm of experimental parameter tuning. Another idea is to inject a bit of knowledge into the process and, for example, say that early on in the process, you choose between them randomly. Later, as the algorithm is converging, favor the mutation operators that you think are more likely to help finish "almost-solved" boards.
```
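The suggestions in this answer (binary tournament selection, as many offspring as parents, mild elitism, and an annealed mutation rate) might be combined roughly as below. This is a sketch under assumptions: the function names, the linear annealing schedule, and the default rates are all hypothetical, and `crossover` and `mutate` stand in for the poster's own operators.

```python
import random


def tournament_select(population, fitness_of, rng):
    """Binary tournament: draw two distinct individuals, keep the fitter."""
    a, b = rng.sample(population, 2)
    return a if fitness_of(a) >= fitness_of(b) else b


def annealed_rate(generation, max_generations, start=0.10, end=0.01):
    """Linearly decay the mutation rate from `start` to `end` (assumed schedule)."""
    frac = min(generation / max_generations, 1.0)
    return start + (end - start) * frac


def next_generation(population, fitness_of, crossover, mutate, rate, rng,
                    n_elite=2):
    """Produce as many offspring as parents, then apply mild elitism."""
    offspring = []
    while len(offspring) < len(population):
        p1 = tournament_select(population, fitness_of, rng)
        p2 = tournament_select(population, fitness_of, rng)
        child = crossover(p1, p2, rng)
        if rng.random() < rate:
            child = mutate(child, rng)
        offspring.append(child)
    # Elitism: the best parents replace the worst offspring, but only
    # when the parent is actually better.
    elites = sorted(population, key=fitness_of, reverse=True)[:n_elite]
    offspring.sort(key=fitness_of)  # worst offspring first
    for i, elite in enumerate(elites):
        if fitness_of(elite) > fitness_of(offspring[i]):
            offspring[i] = elite
    return offspring
```

With elitism in place the best fitness in the population cannot decrease from one generation to the next, which is what makes a somewhat higher mutation rate affordable.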
```I once made a fairly competent Sudoku solver, using GA. Blogged about the details (including different representations and mutation) here:
http://fakeguido.blogspot.com/2010/05/solving-sudoku-with-genetic-algorithms.html```

### Does evidence based scheduling work right with heterogenous estimations?

```Observing one year of estimations during a project I found out some strange things that make me wonder if evidence based scheduling would work right here?
individual programmers seem to have favorite numbers (e.g. 2,4,8,16,30 hours)
the big tasks seem to be underestimated by a fixed factor (about 2), but the standard deviation is low here
the small tasks (1 or 2 hours) are very widely distributed. On average they have the same underestimation factor of 2, but the standard deviation is high:
some 5 minute spelling issues are estimated with 1 hour
other bugfixes are estimated with 1 hour too, but take a day
So, is it really a good idea to let the programmers break down the 30 hours task down to 4 or 2 hours steps during estimations? Won't this raise the standard deviation? (Ok, let them break it down - but perhaps after the estimations?!)
```
```Yes, your observations are exactly the sort of problems EBS is designed to solve.
Yes, it's important to break bigger tasks down. Shoot for 1-2 day tasks, more or less.
If you have things estimated at under 2 hrs, see if it makes sense to group them. (It might not -- that's ok!)
If you have tasks that are estimated at 3+ days, see if there might be a way to break them up into pieces. There should be. If the estimator says there is not, make them defend that assertion. If it turns out that the task really just takes 3 days, fine, but the more of these you have, the more you should be looking hard in the mirror and seeing if folks aren't gaming the system.
Count 4 & 5 day estimates as 2x and 4x as bad as 3 day ones. Anyone who says something is going to take longer than 5 days and it can't be broken down, tell them you want them to spend 4 hrs thinking about the problem, and how it can be broken down. Remember, that's a task, btw.
As you and your team practice this, you will get better at estimating.
...You will also start to recognize patterns of failure, and solutions will present themselves.
The point of Evidence based scheduling is to use Evidence as the basis for your schedule, not a collection of wild-assed guesses. It's A Good Thing...!
```
```I think it is a good idea. When people break tasks down, they figure out the specifics of the task. You may get small deviations here and there, in one direction or the other, and they may or may not cancel out, but you get a feeling for what is happening.
If you have a huge task estimated at 30 hours, it can end up taking 100. That is the worst that could happen.
Manage the risk: split tasks down. You have already identified these small deviations, so you know what to do with them.
So make sure the developers also know what they are doing and saying :)
```
```"So, is it really a good idea to let the programmers break down the 30 hours task down to 4 or 2 hours steps during estimations? Won't this raise the standard deviation? (Ok, let them break it down - but perhaps after the estimations?!)"
I certainly don't get this question at all.
What it sounds like you're saying (you may not be saying this, but it sure sounds like it)
The programmers can't estimate at all -- the numbers are always rounded to "magic" values and off by 2x.
I can't trust them to both define the work and estimate the time it takes to do the work.
Only I know the correct estimate for the time required to do the task. It's not a round 1/2 day multiple. It's an exact number of minutes.
Here's my follow-up questions:
What are you saying? What can't you do? What problem are you having? Why do you think the programmers estimate badly? Why can't they be trusted to estimate?
From your statements, nothing is broken. You're able to plan and execute to that plan. I'd say you were totally successful and doing a great job at it.
```
```Ok, I have the answer. Yes it is right AND the observations I made (see question) are absolutely understandable. To be sure I made a small Excel simulation to ensure myself of what I was guessing.
If you add multiple small tasks with a high standard deviation together into bigger tasks, the bigger tasks will have a lower deviation relative to their size, because the small tasks partially compensate for each other's uncertainty.
So the answer is: Yes, it will work, if you break down your tasks, so that they are about the same length. It's because the simulation will do the compensation for bigger tasks automatically. I do not need to worry about a higher standard deviation in the smaller tasks.
But I am sure you must not mix up low estimated tasks with high estimated tasks - because they simply do not have the same variance.
Hence, it's always better to break them down. :)
create 50 rows with these columns:
first - a fixed value 2 (the very homogeneous estimation)
20 columns with some random function (e.g. "=rand()*rand()*20")
make sums for each column
add "=VARIANCE(..)" for each random column
and add a variance calculation for the sums
The variance for each column in my simulation was about 2-3 and the variance of the sums below 1.```
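The spreadsheet experiment above can be approximated in a few lines of Python. This is a sketch rather than a reproduction of the poster's sheet: it keeps the `rand()*rand()*20` task model, but it measures relative spread (standard deviation divided by mean), since that is the quantity that shrinks when independent tasks are added together. The function name and defaults are assumptions.

```python
import random
import statistics


def simulate(n_rows=50, n_tasks=20, seed=0):
    """Compare the relative spread of one small task with that of a sum of tasks."""
    rng = random.Random(seed)
    # Each row is one simulated project made of n_tasks small tasks.
    rows = [[rng.random() * rng.random() * 20 for _ in range(n_tasks)]
            for _ in range(n_rows)]
    single = [row[0] for row in rows]    # one small task per project
    totals = [sum(row) for row in rows]  # the whole project
    cv_single = statistics.stdev(single) / statistics.mean(single)
    cv_total = statistics.stdev(totals) / statistics.mean(totals)
    return cv_single, cv_total


cv_single, cv_total = simulate()
print(f"relative spread of one task:  {cv_single:.2f}")
print(f"relative spread of the total: {cv_total:.2f}")
```

For independent tasks with the same distribution, the relative spread of the total is expected to be about 1/sqrt(n_tasks) of the relative spread of a single task.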

### How to deal with multiple instance regression with a natural ordering and a different number of instances per bag?

```Sorry for the somewhat ambiguous title but I was not sure how to describe the problem in one line. The issue I am having is the following:
In a supervised learning setting, I have instances that have features associated to them. However, for some instances I have several observations.
As a concrete example, I might want to predict future employee performance in a company according to previous performance (e.g, a bunch of measurements like productivity etc). So I would have an employee with only one year of data (say 2003) and another employee with 3 (2001,2002,2003). The features measured for each year are the same and let's assume that all the employees worked at the same company, such that comparison is easier.
Now, the question becomes: how to end up with one observation row per employee. I have a few ideas:
1) Simply use the last year of data available for each employee and discard those before, so that I have exactly one line per employee. I would also use a numerical variable indicating the number of years the employee has been working at the company for as an additional feature. The idea is that the most recent year would be the most informative anyways. However, it seems to me that I might be throwing away potentially useful information.
2) Taking the mean (or kernel mean embedding, any kind of summary really..) across all years. However, this looks wrong to me as people who worked for a different amount of time at a company get unfairly compared. Admittedly, they would be less productive in their first year and gradually improve. Meaning that it would be better to be more productive in your first year than someone in their third or fifth year for instance.
3) I would use some sort of measure that computes the rate of improvement for each feature from year 1 to the most recent year and also add the number of years worked as an extra feature (as in point 1). However, I would have to come up with some fake values for those who worked only one year. I was thinking of a very unrealistic value. I think this might work in a tree-like algorithm that does not multiply a feature by a parameter, but would give seriously wrong results when using neural nets or linear regression, to name a few. What are your thoughts on the effect this would have on various learning algorithms?
Any ideas would be greatly appreciated, thanks for reading through!```
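As an illustration of ideas 1) and 3) combined, the sketch below collapses a variable-length yearly history into one flat row per employee. Everything here is an assumption made for the example: the `summarize` name, the dictionary layout, and the `placeholder` argument, which plays the role of the "fake value" for employees with a single year of data.

```python
def summarize(history, placeholder=0.0):
    """Collapse {year: {feature: value}} into one flat feature row.

    Keeps the most recent year's features, the number of years observed,
    and a per-feature rate of improvement from the first to the last year.
    """
    years = sorted(history)
    first, last = history[years[0]], history[years[-1]]
    span = years[-1] - years[0]
    row = {"years_worked": len(years)}
    for name, value in last.items():
        row[f"last_{name}"] = value
        if span > 0:
            row[f"rate_{name}"] = (value - first[name]) / span
        else:
            # Only one year of data: no rate can be computed, so fall
            # back to the caller-supplied placeholder.
            row[f"rate_{name}"] = placeholder
    return row


veteran = summarize({2001: {"productivity": 10.0},
                     2002: {"productivity": 12.0},
                     2003: {"productivity": 16.0}})
newcomer = summarize({2003: {"productivity": 11.0}})
```

Here `veteran` gets a `rate_productivity` of 3.0 per year, while `newcomer` receives the placeholder. How a given learner reacts to that placeholder is exactly the open question raised in idea 3).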

### Optimising table assignment to guests for an event based on a criteria

```66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., even number of men/women at the table, people get to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of the arrangement (i.e., cost of difference of men women, cost of non-preferred theme, etc.) and I am basically randomly perturbing the arrangement by swapping tables and keeping the "best so far" arrangement. This seems to work, but cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems to be difficult; if there are 66 people, there are factorial(66) possible orders, which is a ridiculously large number (10^92 according to Python). Since swapping two people at the same table is the same, there are actually fewer, which I think can be calculated by dividing out the repeats, e.g. fact(66)/(fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, way too many to consider.
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping it only if it gives a better value. The random number of people is selected from an exponential distribution to make it more probable to swap 1 person than 6, for example, to make small steps on average but to keep the possibility of "jumping" a bit further in the search.
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach where I have a probability of keeping swaps based on the measured improvement factor, that anneals over time. So far the greedy search surprisingly strongly outperforms the probabilistic approach. Adding the annealing schedule seems to help.
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of Levenshtein edit distance, but I can't think how I could "map" it to some gradient-friendly continuous space. Possibly if I remove the exact number of people-per-table and make this continuous, but strongly penalize it to incline towards the number that I want at each table -- this would make the association matrix more "flexible" and possibly map better to a gradient space? Not sure. Seating assignment could be a probability spread over more than one table.```
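The swap-based search with an annealed acceptance probability described in the update might look roughly like the sketch below. It is not the poster's code: the geometric cooling schedule, the parameter defaults, and the toy gender-balance cost are assumptions chosen to keep the example self-contained, and it swaps exactly two guests per step rather than an exponentially distributed number.

```python
import math
import random


def anneal(assignment, cost, n_steps=20000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over seatings; assignment[g] is guest g's table."""
    rng = random.Random(seed)
    current = list(assignment)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(n_steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / n_steps)
        i, j = rng.sample(range(len(current)), 2)
        if current[i] == current[j]:
            continue  # same table: the swap would change nothing
        current[i], current[j] = current[j], current[i]
        new_cost = cost(current)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_cost


# Toy example: 8 guests, 2 tables, cost = gender imbalance per table.
genders = ["m", "m", "m", "m", "f", "f", "f", "f"]
start = [0, 0, 0, 0, 1, 1, 1, 1]


def imbalance(seating):
    total = 0
    for table in set(seating):
        men = sum(1 for g, t in enumerate(seating) if t == table and genders[g] == "m")
        women = sum(1 for g, t in enumerate(seating) if t == table and genders[g] == "f")
        total += abs(men - women)
    return total


best, best_cost = anneal(start, imbalance)
```

Because every move is a swap between two tables, the number of guests at each table never changes, so the table-size constraint does not need a penalty term in the cost function.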

### Approximation to Large Linear Program

```I have a simple LP with linear constraints. There are many decision variables, roughly 24 million. I have been using lpSolve in R to play with small samples, but this solver isn't scaling well. Are there ways to get an approximate solution to the LP?
Edit:
The problem is a scheduling problem. There are 1 million people who need to be scheduled into one of 24 hours, hence 24 million decision variables. There is a reward $R_{ij}$ for scheduling person $i$ into hour $j$. The constraint is that each person needs to be scheduled into some hour, but each hour only has a finite number of appointment slots $c$.
```
```One good way to approach LPs/IPs with a massive number of variables and constraints is to look for ways to group the decision variables in some logical way. Since you have only given a sketch of your problem, here's a solution idea.
Approach 1 : Group people into smaller batches
Instead of 1M people, think of them as 100 units of 10K people each. So now you only have 2400 (24 x 100) variables. This will get you part of the way there, and note that this won't be the optimal solution, but a good approximation. You can of course make 1000 batches of 1000 people and get a more fine-grained solution. You get the idea.
Approach 2: Grouping into cohorts, based on the Costs
Take a look at your R_ij's. Presumably you don't have a million different costs. There will typically be only a few unique cost values. The idea is to group many people with the same cost structure into one 'cohort'. Now you solve a much smaller problem - which cohorts go into which hour.
Again, once you get the idea you can make it very tractable.
Update Based on OP's comment:
By its very nature, making these groups is an approximation technique. There is no guarantee that the optimal solution will be obtained. However, the whole idea of careful grouping (by looking at cohorts with identical or very similar cost structures) is to get solutions as close to the optimal as possible, with far less computational effort.
I should have also added that when scaling (grouping is just one way to scale-down the problem size), the other constants should also be scaled. That is, c_j should also be in the same units (10K).
If persons A,B,C cannot be fit into time slot j, then the model will squeeze in as many of those as possible in the lowest cost time slot, and move the others to other slots where the cost is slightly higher, but they can be accommodated.
Hope that helps you get going in the right direction.
```
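The cohort idea from Approach 2 starts with an aggregation step, which could be sketched as below. The function name, the list-of-tuples input, and the optional rounding used to merge "very similar" profiles are assumptions for illustration; solving the resulting smaller LP is left to whichever solver is in use.

```python
from collections import Counter


def build_cohorts(rewards, ndigits=None):
    """Group people with identical (or rounded-identical) reward profiles.

    rewards: one sequence per person, holding that person's value for
    each hour. Returns a list of (profile, head_count) pairs.
    """
    counts = Counter()
    for person in rewards:
        if ndigits is None:
            profile = tuple(person)
        else:
            # Rounding merges near-identical profiles; this is where the
            # approximation error of the grouping comes from.
            profile = tuple(round(r, ndigits) for r in person)
        counts[profile] += 1
    return list(counts.items())


people = [(1.0, 2.0), (1.0, 2.0), (3.0, 0.0)]
cohorts = build_cohorts(people)
n_hours = len(people[0])
print(len(people) * n_hours, "variables before grouping")
print(len(cohorts) * n_hours, "variables after grouping")
```

Keeping the head count next to each profile is what allows the capacity constraints to stay in units of people after the grouping.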
```Assuming you have a lot of duplicate people, you are now using way too many variables.
Suppose you only have 1000 different kinds of people and that some of these occur 2000 times whilst others occur 500 times.
Then you just have to optimize the fraction of each kind of people that you allocate to each hour. (Note that you do have to adjust the objective function and constraints a bit by including the counts, 2000 or 500, as constant weights.)
The good news is that this should give you the optimal solution with just a 'few' variables, but depending on your problem you will probably need to round the results to get whole people as an outcome.```
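Written out, the reduced problem described in this answer could take the following form. The notation is an assumption layered on the question's $R_{ij}$: $K$ is the number of distinct kinds of people, $n_k$ is how many people of kind $k$ exist (2000 or 500 in the example), $y_{kj}$ is the fraction of kind $k$ assigned to hour $j$, and $c_j$ is the capacity of hour $j$ (the question writes a single $c$).

$$
\begin{aligned}
\max_{y}\quad & \sum_{k=1}^{K}\sum_{j=1}^{24} n_k\, R_{kj}\, y_{kj} \\
\text{s.t.}\quad & \sum_{j=1}^{24} y_{kj} = 1 \quad \text{for each kind } k, \\
& \sum_{k=1}^{K} n_k\, y_{kj} \le c_j \quad \text{for each hour } j, \\
& y_{kj} \ge 0 .
\end{aligned}
$$

This has $24K$ variables instead of 24 million. The number of people of kind $k$ placed in hour $j$ is $n_k\, y_{kj}$, which is the quantity that may need rounding to whole people.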