OptaPlanner: limit valueRangeProvider based on another entity

My planning problem is similar to employee rostering.
My planning entity looks like this
public class Menu {
    @PlanningVariable(valueRangeProviderRefs = "productRange")
    private String productId = null;
    private String packId;
    private String date;
}
Now, I have a condition that if two packIds are "similar", then the productIds for those packs on the same date must also be "similar", where similarity is defined by some business logic.
I added a hard constraint for this, but the number of products is ~3000 and it takes forever to run through all the combinations. Is there a way to restrict the value range provider to achieve this (so that it only iterates over the similar products)?

As per the OptaPlanner manual: the value range of an entity must be independent of the planning variables' state. Any such dependence must be handled through (hard) constraints.
That being said, there are often more efficient models to deal with the complexity you're describing. I've never seen anything below 10k instances that doesn't have an efficient way to be solved in 5 minutes or so. Typical scaling tricks include precalculation (valid combos, hashing, ...), rule/scoring efficiency, nearby selection, ... and multithreaded solving to top it off. It depends on the case and requires an in-depth review. The precalculation idea is sketched below.
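For instance, a language-neutral sketch of the precalculation trick (shown here in C#; allProductIds and IsSimilar are hypothetical names standing for your inputs and business logic): run the expensive "similar" check once, before solving, so a constraint only needs a constant-time set lookup per pair.

// Precompute the "similar products" relation up front; with ~3000 products
// that is ~9 million IsSimilar calls once, instead of repeating them
// throughout the whole solver run.
var similarTo = new Dictionary<string, HashSet<string>>();
foreach (string a in allProductIds)
{
    similarTo[a] = new HashSet<string>();
    foreach (string b in allProductIds)
        if (IsSimilar(a, b)) // expensive business logic, evaluated once per pair
            similarTo[a].Add(b);
}
// A hard constraint then only checks: similarTo[menu1.ProductId].Contains(menu2.ProductId)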

Related

How to model the price (clean or dirty) of a financial instrument?

I need help modeling the following situation:
A financial instrument always has a price. However, some financial instruments (of certain types, rather) also have what's called "clean" price which is an attribute that depends (among other things) on the price, in which case price is also called "dirty" price. There is a calculator service that computes both the price (or dirty price) and the clean price. How best to conceptually model that situation?
I have considered two alternatives:
FinancialInstrument has a Price
FinancialInstrument
+ price: Price
where Price is a supertype with two derived classes: DirtyPrice and CleanPrice. CleanPrice depends on DirtyPrice
CleanPrice
+ dirty: DirtyPrice
The Calculator service would then compute the price of a FinancialInstrument:
CalculatorService
+ compute_price(FinancialInstrument, ...): Price
FinancialInstrument is a supertype with two derivations: PlainFinancialInstrument (which only has the one price attribute) and CleanPriceFinancialInstrument (which has both clean and dirty prices).
FinancialInstrument
+ price: double
PlainFinancialInstrument
CleanPriceFinancialInstrument
+ clean_price: double
The Calculator service would then have two methods: one to compute the price of a PlainFinancialInstrument, and one to compute the clean and dirty prices of a CleanPriceFinancialInstrument:
CalculatorService
+ compute_price(PlainFinancialInstrument, ...): double
+ compute_price(CleanPriceFinancialInstrument, ...): pair<double, double>
What are the trade-offs of both alternatives? Are there other alternatives?
Thanks.
It is not clear to me whether you are asking how to model the abstract problem specified by your example, or whether you are trying to model the business concept of financial instrument pricing in a real-world context. I think it is the latter, because you are quite specific, so I'll comment on that. In this case I doubt, though, that either of your two approaches is sophisticated enough to meet the needs of your task. I've been working for several years in that area.
I'm not sure which business area you are working in. In the area I used to work in (banking), the difference between clean and dirty price is a simple business concept. E.g., for bonds valued at amortized cost, the clean price is the value of the discounted cash flow not taking accruals and deferrals into account; the dirty price is the sum of the clean price and the accruals/deferrals. In all cases known to me, the clean price is the difference between the dirty price and some (most of the time simple) function of key figures of the financial instrument (FI for short), and both clean and dirty price are just key figures which are relevant for some (but not all) kinds of financial instruments.
On the other hand, depending on the GAAP and business area, whether you need to supply the clean price, the dirty price, or both may additionally depend on which book the financial instrument is assigned to, e.g. banking book vs. trading book. For the trading book you usually want to retrieve only the dirty price; the clean price is relevant in the banking book.
To make things worse, an FI may, e.g., be reassigned, leading to a different set of key figures becoming relevant. You should make sure your design takes the consequences of such changes into account if this is relevant in your context.
Personally, I'd start with an approach outlined as follows (a code sketch follows the list):
create an abstract class/interface for financial instrument
for each type of FI, define a subclass
create a list of all key figures which may become relevant for any possible FI in your scope -- in your example: clean price and dirty price, and probably one for the key figure representing the difference. Create a dummy price key figure entry in addition.
for each of these key figures, create a key figure interface with the methods relevant for that key figure, e.g. calculate and update -- this depends on your overall model. Again, for your example: a clean price interface, a dirty price interface, a delta interface, and a price interface. It may become necessary to define an order in which they have to be updated. The set of methods of the price interface has to be a subset of the clean and dirty price interfaces.
for each type of FI, create a specific implementation (class) of each key figure interface relevant for that FI type, taking reuse into account, of course. Strictly avoid if/else or switch statements on the key figure or FI types in these implementations; if this turns out to be necessary, you need additional class definitions. Now, when you instantiate a class representing an FI, use a factory pattern to create the instances of the key figure interfaces. That is, you decide at FI instance creation which method to use for calculation; the FI instance then knows how to calculate its key figures. The nice feature of the factory pattern is that you may additionally take into account the book you are calculating for, as well as other parameters, even at run time if necessary. The factory will let the price key figure interface simply point to whichever instance is relevant in the context.
what you called the calculator service will then, to calculate a price, call a method of the price key figure interface, but the instance that interface points to is provided by the FI instance, because the factory has simply mapped the price interface to the clean price interface or the dirty price interface, depending on what is correct for that specific FI in that specific context.
If you use, as suggested, a list of relevant key figures and key figure calculation interface implementations in the FI instance, you can even update/exchange these at runtime if the FI is reassigned, without having to delete and recreate the FI instance.
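A minimal sketch of that factory idea (all type and member names here are illustrative, not from any real pricing library):

interface IPriceKeyFigure
{
    // One method per key figure operation; a real model would also have Update() etc.
    double Calculate(FinancialInstrument fi);
}

class DirtyPrice : IPriceKeyFigure
{
    public double Calculate(FinancialInstrument fi) { return 0.0; /* discounted cash flow incl. accruals */ }
}

class CleanPrice : IPriceKeyFigure
{
    public double Calculate(FinancialInstrument fi) { return 0.0; /* dirty price minus accruals/deferrals */ }
}

class FinancialInstrument
{
    // The generic "price" key figure is just a reference chosen by the factory:
    // e.g. dirty price for the trading book, clean price for the banking book.
    public IPriceKeyFigure Price { get; private set; }

    public static FinancialInstrument Create(string book)
    {
        var fi = new FinancialInstrument();
        fi.Price = (book == "trading") ? (IPriceKeyFigure)new DirtyPrice() : new CleanPrice();
        return fi;
    }
}

// The calculator service then simply calls fi.Price.Calculate(fi);
// no if/else on FI type or book is needed at the call site.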
Hope I did not make your question more complex than it actually is.
Regards,
Thomas
Do you need to have a separate Calculator Service? If not, how about:
class FinancialInstrument {
    private Double price;

    public Double getPrice() {
        // calculate the price
        // presumably sets the private price? Dunno
        this.price = 0.0; // etc. ...
        return this.price;
    }
}

class CleanFinancialInstrument extends FinancialInstrument {
    private Double cleanPrice;

    @Override
    public Double getPrice() {
        // override FinancialInstrument.getPrice() as required
        return super.getPrice();
    }

    public Double getDirtyPrice() {
        // do you need this? Dunno
        return this.getPrice();
    }

    public Double getCleanPrice() {
        this.cleanPrice = 0.0; // ... etc.
        return this.cleanPrice;
    }
}
You may not even need the local private variables if you're not caching the price.
Callers can simply call getPrice() on any instance (FinancialInstrument or CleanFinancialInstrument) without having to worry about which type it is.
hth.

Additional PlanningEntity in CloudBalancing - bounded-space situation

I successfully amended the nice CloudBalancing example to include the fact that I may only have a limited number of computers open at any given time (thanx optaplanner team - easy to do). I believe this is referred to as a bounded-space problem. It works dandy.
The processes come in groups, say 20 processes in a given order per group. I would like to amend the example so that OptaPlanner also changes the order of these groups (not the processes within one group). I have therefore added a class ProcessGroup to the domain with a member List<Process>, the instances of ProcessGroup being stored in a List<ProcessGroup>. The desired optimisation would shuffle the members of this list, causing the instances of ProcessGroup to be placed at different indices of the List<ProcessGroup>. The index of a ProcessGroup should be ProcessGroup.index.
The documentation states that "if in doubt, the planning entity is the many side of the many-to-one relationship." This would mean that ProcessGroup is the planning entity, the member index being a planning variable, getting assigned to (hopefully) different integers. After every new assignment of indices, I would have to re-sort the List<ProcessGroup> in ascending order of ProcessGroup.index. This seems very odd and cumbersome. Any better ideas?
Thank you in advance!
Philip.
The current design has a few disadvantages:
It requires 2 (genuine) entity classes (each with 1 planning variable): this probably increases the search space (= longer to solve, more difficult to find a good or even feasible solution) and it increases configuration complexity. Don't use multiple genuine entity classes if you can reasonably avoid it.
The Integer variables of ProcessGroup need to be all different and somehow sequential. That smells like a chained planning variable (see the docs about chained variables and the Vehicle Routing example), in which case the entire problem could be represented as a simple VRP with just 1 variable. But does that really apply here?
Train of thought: there's something off in this model:
ProcessGroup has an Integer variable: what does that Integer represent? Shouldn't that Integer variable be on Process instead? Are you ordering Processes or ProcessGroups? If it should be on Process instead, then both of Process's variables can be replaced by a chained variable (like VRP), which will be far more efficient.
ProcessGroup has a list of Processes, but that's a problem property, which means it doesn't change during planning. I suspect that's correct for your use case, but do assert it.
If none of the reasoning above applies (which would surprise me), then the original model might be valid nonetheless :)

Alternative to GUID with Scalability in mind and Friendly URL

I've decided to use GUIDs as primary keys for many of my project's DB tables. I think it is a good practice, especially with scalability, backup, and restore in mind. The problem is that I don't want to use the regular GUID, and I'm searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know it isn't guaranteed to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);
Result example:
1350433430523.66
This prints the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer the non-BIGINT auto-increment solution because it causes a lot less headache when scaling the DB using 3rd-party tools, as well as less problematic backup/restore functionality, because I can transfer data between servers and such if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (which is unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond count, which will give me a unique numerical string. Because a single user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could even remove the last 5 digits and still get a unique ID, because I assume that a user won't insert data more than about once per second, but I probably won't do that (what do you think about this idea?).
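A rough sketch of that idea (the numeric encoding of the username below is purely illustrative and naive):

// Hypothetical sketch: combine a numeric encoding of the (immutable) username
// with milliseconds since epoch. The folding below is naive: it can collide
// between different usernames and can overflow for long names.
static string MakeId(string username)
{
    long userPart = 0;
    foreach (char c in username.ToLowerInvariant())
        userPart = userPart * 36 + (c % 36);
    long ms = (long)(DateTime.UtcNow - new DateTime(1970, 1, 1)).TotalMilliseconds;
    return userPart.ToString() + ms.ToString();
}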
So I ask for your help. My data is expected to grow very big: 2 TB a year, with tens of thousands of new rows each second. I want the URLs to look as "friendly" as possible, and I prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like IDs you can see this answer. They are basically keeping a database table of all the random video IDs they generate. When they request a new one, they check the table for a collision. If they find one, they try to generate a new ID.
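In code, that is just a generate-check-retry loop; a sketch, with an in-memory set standing in for the database table:

static readonly HashSet<string> issuedIds = new HashSet<string>(); // stands in for the DB table
static readonly Random rng = new Random();

static string NewVideoId()
{
    const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    while (true)
    {
        var chars = new char[11]; // YouTube-style 11-character ID
        for (int i = 0; i < chars.Length; i++)
            chars[i] = alphabet[rng.Next(alphabet.Length)];
        string id = new string(chars);
        if (issuedIds.Add(id)) // Add returns false on a collision
            return id;
    }
}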
Long Primary Keys
I could be wrong, but Pinterest looks like it is using a long (e.g. 275001120966638272) as its database primary key. If you are using GUIDs then that doesn't help. Twitter also seems to have something similar.
Base64-Encoded GUIDs
You can use ShortGuid, which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case-sensitive, which may not be good for URLs if you are lower-casing them.
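ShortGuid is essentially this; a minimal sketch of the encoding, without the library:

// Encode a GUID's 16 bytes as URL-safe base64: 22 characters instead of 32 hex digits.
static string ToShortGuid(Guid guid)
{
    return Convert.ToBase64String(guid.ToByteArray())
        .Replace('+', '-') // make it URL-safe
        .Replace('/', '_')
        .Substring(0, 22); // drop the trailing "==" padding
}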
Base32-Encoded GUIDs
There is also base32 encoding of GUIDs, which you can see this answer for. These are slightly longer than the ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4), but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video ID but make it specific to a user; this could lead to some ridiculously short URLs.
The first, simplest, and most practical scheme for unique keys is the increasing numbering sequence of the write order. This represents the record number inside one database, providing unique numbering on a local scale: this is the (often met) application-level requirement.
Next, a numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions arriving together have unique IDs before writing.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This ensures one more level of integrity for the database when the evoked scenarios occur: backup, restore, scale, migrate, and perhaps proving some authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into databases and informatics.
UUIDs are 128 bits long. They introduce an amount of entropy high enough to ensure practical uniqueness of labels. They can be represented by 32 hexadecimal characters: enough to write about 3.4 × 10^38 distinct values.
Here are a few more questions that occur when considering the overall principle and the analysis:
should the primary keys of the database and the Unique Resource Location be kept as two different entities?
should this numbering destroy the sequentiality that could occur in the system?
does providing a machine host number (h), followed by a user number (u) and a time (t) along with a write index (i), guarantee that the PK "huti" stays unique?
Now considering the DB system:
primary keys should be kept numerical (be it hexadecimal): the database system relies on this, and it implies performance considerations.
their size should be fixed: the system must answer rapidly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique of YouTube is Hashids.
It's a good choice:
the hashes are short and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it is designed to encode positive numbers.
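With the .NET port (the Hashids.net package), usage looks roughly like this:

// Requires the Hashids.net NuGet package.
var hashids = new HashidsNet.Hashids("my site salt", minHashLength: 6);

string encoded = hashids.Encode(347871); // short, URL-friendly string
int[] decoded = hashids.Decode(encoded); // reversible: back to { 347871 }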
However, if the numbers being encoded are themselves generated randomly, collisions can still happen. They can be detected: the unique constraint is violated before the row is stored, and in such a case the generation should simply be run again.
Consider the comment on this answer to figure out how much entropy it is possible to get from a shortened sha1+b64 recipe.
Anticipating the collision scenario calls for an estimation of the future dimension of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand in a nicely synthetic style:
It may not be necessary for you to encode every time since 1970 however. If you are only interested in keeping recent records close to each other, you only need enough values to ensure that you don't have more values with the same prefix than your database can cache at once
What you could do is convert a GUID into digits only, by converting every letter in the GUID into its numeric character code. Here is an example of what that would look like. It's a bit long, but if that is not a problem, this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead; although it can be a bit of a pain, it's probably a safer way to do it than the function below.
string generateKey()
{
    Guid guid = Guid.NewGuid();
    var newKey = new StringBuilder(); // requires using System.Text;
    // Walk the 32 hex characters of the GUID ("N" format = no dashes):
    // letters become their numeric character code, digits are kept as-is.
    foreach (char c in guid.ToString("N"))
    {
        if (char.IsLetter(c))
        {
            newKey.Append((int)c);
        }
        else
        {
            newKey.Append(c);
        }
    }
    return newKey.ToString();
}
Edit:
I did some testing: taking only the first 20 digits, 4999978 out of 5000000 generated keys were unique. When using the first 25 digits, it is 5000000 out of 5000000. I would recommend doing some more testing if you go with this method.

Random six-digit number IDs for content in a web application

I am trying to implement the accepted answer from this question for ID generation, using XML files for storage of my content and for the content IDs table.
The idea is that each content item would be stored (serialized) as my-content-item-slug-374871.xml, where the number is the random ID that the content item is given from the IDs table (from the ones that are not yet taken). My requirement is that the ID is a six-digit number (a display requirement) between 100000 and 999999, so effectively we will only ever be able to create 900000 content items, but that should be enough. If you wonder why such a requirement: I don't want IDs starting from zero, and I don't want IDs such as GUIDs (which would be way easier to create and maintain, I know), because the ID will be used in MVC routes (much like SO's URLs).
So for starters I decided to create a Dictionary, where the key is the ID and the value indicates whether it is used or not (true if used, false if available). I then serialize this object into an XML file using DataContractSerializer.
The file is 72 MB and here, I think, the problems start to appear. First of all, I tried to open this file in VS2010, Notepad, WordPad and IE, and they all crashed, with memory consumption going through the roof. The application itself, however, seems to have no problems with it. Still, I think this will be a huge memory and CPU hog, and performance will suffer.
Am I right in my assumptions, and if so, what are my other options?
for starters I decided to create a Dictionary,
You will find that a BitArray takes up far less space.
But the basic question is: why 'random'?
If you need unique IDs, just use a counter. Start it at 100000 and increment it each time you use one.
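For illustration, a minimal sketch of the BitArray idea: one bit per possible ID, indexed by id - 100000.

using System.Collections;

// 900000 possible IDs (100000..999999) -> 900000 bits, roughly 110 KB in memory
// instead of a 72 MB serialized dictionary.
BitArray used = new BitArray(900000);

bool IsTaken(int id) { return used[id - 100000]; }
void MarkTaken(int id) { used[id - 100000] = true; }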
I would suggest the same as Henk (just use sequential, seeded IDs); however, you can accomplish what you're looking for:
Rather than creating a dictionary with all possible values, a generic List containing only the values that have been used would be less intensive:
static class Static
{
    public static List<int> UsedIds = new List<int>();
}
Then loop until you find one that hasn't been used yet. (Randoms are probably not the best choice unless you seed them independently of the clock).
static readonly Random rand = new Random(); // create once: re-seeding from the clock on every call risks repeated sequences

int GetNewId()
{
    while (true)
    {
        int newId = rand.Next(100000, 1000000); // upper bound is exclusive, so this covers 100000..999999
        if (!Static.UsedIds.Contains(newId))
        {
            Static.UsedIds.Add(newId);
            return newId;
        }
    }
}
This should be more efficient in the short-term but for long-term performance and scalability, I would strongly suggest the use of seeded identities or GUIDs - which are fairly usable when Base-64 encoded (similar to YouTube URLs).
Instead of maintaining a list of used numbers, just build the new file name and do a File.Exists(fileName) call; if the file doesn't exist, the ID isn't used.
Edit: Sorry, presumed the language was C#, but the idea should be similar to other languages.

System.Collections best choice for my scenario

I want a collection for storing two types: string and DateTime.
The string should be the key of my collection, and the DateTime is the time of its insertion into the collection. I want to remove items from the collection in a FIFO manner.
The collection should reject duplicate keys and be queryable by DateTime, so that if I want to know the number of items older than a given date, it can answer.
There is no single built-in C# collection that does all those things with maximal efficiency, mostly because, as you indicated, there are two things you'd have to look up by.
That being said, a Dictionary<string, DateTime> will be the simplest solution that gives you all the features you need, basically out of the box. However, that collection will give O(n) complexity for the DateTime lookups, and worse-than-O(1) removal time. That is probably not a big deal, but you didn't describe your performance requirements, the expected size of your dataset, or which access types happen most frequently.
To improve the "older-than-DateTime" lookup performance and the FIFO removal, you could also keep a second index, such as a SortedList: more memory usage and somewhat slower overall insertion, but the DateTime and removal queries will be faster. For "older-than-DateTime" you can use a binary search over the SortedList's keys.
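A sketch of that two-index idea (the class and member names are illustrative; it assumes insertion timestamps are unique, otherwise a tie-breaker is needed):

using System;
using System.Collections.Generic;

class TimestampedSet
{
    private readonly Dictionary<string, DateTime> byKey = new Dictionary<string, DateTime>();
    private readonly SortedList<DateTime, string> byDate = new SortedList<DateTime, string>();

    public bool Add(string key)
    {
        if (byKey.ContainsKey(key)) return false; // reject duplicate keys
        DateTime now = DateTime.UtcNow;
        byKey.Add(key, now);
        byDate.Add(now, key);
        return true;
    }

    public string RemoveOldest() // FIFO removal: oldest insertion goes first
    {
        if (byDate.Count == 0) return null;
        string key = byDate.Values[0];
        byDate.RemoveAt(0);
        byKey.Remove(key);
        return key;
    }

    public int CountOlderThan(DateTime cutoff) // binary search over the sorted keys
    {
        int lo = 0, hi = byDate.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (byDate.Keys[mid] < cutoff) lo = mid + 1; else hi = mid;
        }
        return lo; // number of items strictly older than cutoff
    }
}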
It sounds like System.Collections.Generic.Dictionary<string, DateTime> should do the trick. It has methods to process the collection as you need.
