How to order stream merges? - xstream-js

I would like to merge two maps of the same source guaranteeing the order of the result. Here is a unit test that I would like to pass:
const source = xs.of(1,2,3)
const a = source.map(v=>v*10)
const b = source.map(v=>v*100)
const hist:number[] = []
xs.merge(a,b).addListener({
  next: v => hist.push(v),
})
expect(hist).toEqual([10,100,20,200,30,300])
Currently the result I'm getting is this:
Expected value to equal:
[10, 100, 20, 200, 30, 300]
Received:
[10, 20, 30, 100, 200, 300]

I am no expert with xstream, so I can't propose a solution. However, I think I can explain why you are getting the output you are getting, as this is a common occurrence in other streaming libraries.
You have a merge of two sources. The of operator guarantees that it will emit the array values in order, the map operator guarantees that it will emit the transformed values in the same order as the received values, etc. But merge(a,b) does not guarantee that it will interleave the values of a and b. It does guarantee that it will pass on the values of a in order, and those of b in order, i.e. it guarantees only a partial order on the resulting output.
The question of, given some values to emit, which ones to emit at what time and in which order, relates to scheduling. I am not aware of xstream exposing a scheduler interface at this point in time, from which you could customize the scheduling of the emission of values. Hence you are bound to the default scheduling.
Now back to why you observe those values in that order:
merge(a,b) connects first to a
a emits immediately and synchronously all the values from the array
merge(a,b) then connects to b
b emits immediately and synchronously all the values from the array
If you want the values 1, 2, 3 not to be emitted synchronously, you need to use an operator other than of, or construct your timed sequence explicitly, i.e. emit 1 at t0, 2 at t0+1, 3 at t0+2.
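For illustration, here is a minimal sketch of such an explicitly timed source built with xstream's create and a hand-written producer; the one-millisecond offsets are arbitrary, all that matters is that each value arrives asynchronously:
import xs, { Listener } from 'xstream'

// Emits 1, 2, 3 on separate timer ticks, then completes, so that merge(a, b)
// receives the values one at a time and can interleave the two mapped streams.
const source = xs.create<number>({
  start(listener: Listener<number>) {
    let delay = 1
    for (const v of [1, 2, 3]) {
      setTimeout(() => listener.next(v), delay)
      delay += 1
    }
    setTimeout(() => listener.complete(), delay)
  },
  stop() {},
})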

Taking a cue from user3743222's answer, using periodic to sequence the list.
Changing:
//const source = xs.of(1,2,3)
const source = xs.periodic(1).drop(1).take(3)
...produces in the console:
10 100 20 200 30 300
ESNextbin demo.
drop(1) is one way to deal with periodic starting with 0. One can alternatively use ++v*10 in the map function.
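Put together, a sketch of the adjusted test (the complete callback is only there to show when hist is ready to be asserted):
import xs from 'xstream'

// periodic(1) emits 0, 1, 2, ... once per millisecond; drop(1).take(3) keeps
// 1, 2, 3, each arriving on its own turn of the event loop, so merge can
// interleave a and b.
const source = xs.periodic(1).drop(1).take(3)
const a = source.map(v => v * 10)
const b = source.map(v => v * 100)
const hist: number[] = []
xs.merge(a, b).addListener({
  next: v => hist.push(v),
  complete: () => console.log(hist), // [10, 100, 20, 200, 30, 300]
})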

Related

ArangoDB - Aggregate sum of descendant attributes in DAG

I have a bill of materials represented in ArangoDB as a directed acyclic graph. The quantity of each part in the bill of materials is represented on the edges while the part names are represented by the keys of the nodes. I'd like to write a query which traverses down the DAG from an ancestor node and sums the quantities of each part by its part name. For example, consider the following graph:
Widget --(Qty: 2)--> Gadget --(Qty: 1)--> Stuff
Gadget --(Qty: 4)--> Thing
Widget --(Qty: 1)--> Thing
Widget contains two Gadgets, each of which contains one Stuff and four Things. Widget also contains one Thing. Thus I'd like to write an AQL query which traverses the graph starting at Widget and returns:
{
"Gadget": 2,
"Stuff": 2,
"Thing": 9
}
I believe collect aggregate may be my friend here, but I haven't quite found the right incantation yet. Part of the challenge is that all descendant quantities of a part need to be multiplied by their parent quantities. What might such a query look like that efficiently performs this summation on DAGs of depths around 10 layers?
Three possible options come to mind:
1.- Return the values from the path and then summarize the data in the app server (a sketch of the client-side summation follows the query output below):
FOR v,e,p IN 1..2 OUTBOUND 'test/4719491'
testRel
RETURN {v:v.name, p:p.edges[*].qty}
This returns Gadget [2], Stuff [2,1], Thing [2,4], Thing [1].
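For what it's worth, a sketch of the app-server side of option 1 in TypeScript, assuming the rows above have already been fetched into an array of {v, p} objects (the driver call itself is omitted):
// Each row is one traversal path: v is the part name, p the edge quantities
// along that path. Multiplying the quantities on each path and summing per
// name gives the totals the query should produce.
interface PathRow { v: string; p: number[] }

function totals(rows: PathRow[]): Record<string, number> {
  const out: Record<string, number> = {}
  for (const { v, p } of rows) {
    const qty = p.reduce((acc, q) => acc * q, 1)
    out[v] = (out[v] ?? 0) + qty
  }
  return out
}

// totals([{ v: 'Gadget', p: [2] }, { v: 'Stuff', p: [2, 1] },
//         { v: 'Thing', p: [2, 4] }, { v: 'Thing', p: [1] }])
// => { Gadget: 2, Stuff: 2, Thing: 9 }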
2.- Enumerate the edges on the path, to get the results directly:
FOR v,e,p IN 1..2 OUTBOUND 'test/4719491'
testRel
LET e0 = p.edges[0].qty
LET e1 = NOT_NULL(p.edges[1].qty, 1)
COLLECT itemName = v.name AGGREGATE items = SUM(e0 * e1)
RETURN {itemName: itemName, items: items}
This correctly returns Gadget 2, Stuff 2, Thing 9.
This obviously requires that you know the number of levels beforehand.
3.- Write a custom function "multiply", similar to the existing "SUM" function, so that you can multiply the values of an array. The query would be similar to this:
LET vals = (FOR v, e, p IN 1..2 OUTBOUND 'test/4719491'
  testRel
  RETURN {itemName: v.name, items: SUM(p.edges[*].qty)})
FOR val IN vals
COLLECT itemName = val.itemName AGGREGATE items = SUM(val.items)
RETURN {itemName: itemName, items: items}
So your function would replace the SUM in the inner sub-select. Here is the documentation on custom functions
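As a rough sketch of option 3, assuming the function is registered from arangosh (the MYFUNCS namespace, the function name, and the null handling are made up for illustration):
// Registers a user function that multiplies the values of an array, analogous
// to the built-in SUM; missing edge quantities are treated as 1.
const aqlfunctions = require('@arangodb/aql/functions')
aqlfunctions.register('MYFUNCS::MULTIPLY', (values) => {
  return values.reduce((acc, v) => acc * (v == null ? 1 : v), 1)
}, true)
The inner sub-select would then use MYFUNCS::MULTIPLY(p.edges[*].qty) in place of SUM(p.edges[*].qty).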

Apache Flink 1.6.0 - StateTtlConfig and ListState

I am in the process of implementing a proof-of-concept stream processing system using Apache Flink 1.6.0 and am storing a list of received events, partitioned by key, in a ListState. (Don't worry about why I am doing this, just work with me here.) I have a StateTtlConfig set on the corresponding ListStateDescriptor. Per the documentation:
"All state collection types support per-entry TTLs. This means that list elements and map entries expire independently."
"Currently, expired values are only removed when they are read out explicitly, e.g. by calling ValueState.value()."
Question 1
Which of the following constitutes a read of the ListState:
Requesting the iterator but not using it - myListState.get();.
Actually using the iterator - for (MyItem i : myListState.get()) { ... }
Question 2
What does "per-entry TTL" actually mean? Specifically, what I'm asking about is the following:
Assume I have a specific instance of ListState<Character>. The descriptor has a TTL of 10 seconds. I insert 'a'. Two seconds later, I insert 'b'. Nine seconds later I insert 'c'. If I iterate over this ListState, which items will be returned?
In other words:
ListState<Character> ls = getRuntimeContext().getListState(myDescriptor);
ls.add('a');
// ...two seconds later...
ls.add('b');
// ...nine seconds later...
ls.add('c');
// Does this iterate over 'a', 'b', 'c'
// or just 'b' and 'c'?
for (Character myChar : ls.get()) { ... }
Answer 1
The answer is 1: for ListState, the pruning is done when myListState.get(); is called.
Answer 2
"per-entry TTL" means the timeout is applied to a single entry rather than whole collection. For your example assuming at the point of reading 10 seconds passed since inserting the a it will iterate over b and c. a is going to be pruned in this case.

Return a list after using the pandas filter function

I am new to Pandas and am running into some really difficult problems with it.
What I would like to do is group samples by a value in a particular column and then run an API call based on that column value.
That part is done. The part that is proving challenging is returning the created objects and storing them in local variables.
Here is my data set that comes in a .CSV file.
Sample  Sample Type  Tumor   Age  Location
1       Blood        Benign  43   LUNG
2       FFPE         Benign  23   LUNG
3       Blood        Benign  12   LUNG
I am filtering on the Sample Type of either Blood or FFPE and then applying a function to create the samples:
def create_samples(x):
    sample_objects = Sample.create({
        'count': x.shape[0],
        'type': x.iloc[0]['Sample Type']
    })
    return sample_objects

if __name__ == '__main__':
    df = pd.read_csv(path)
    blood_samples, ffpe_samples = df.groupby('Sample Type').filter(lambda x: create_samples(x))
It calls the function twice because there are two Sample Types; I believe it creates the Blood samples first and then the FFPE samples second.
Both times an object is created, I want to return it to a variable, blood_samples and ffpe_samples respectively. Is it possible to do so?
The only hack I can think of is to assign some global variables, which I am hoping to avoid.
Thoughts?
You're using groupby.filter wrong. In a groupby context, filter takes a function that returns a boolean value. The result is a combined dataframe that consists only of the groups for which the function returned True.
What you want is this:
blood_samples, ffpe_samples = (create_samples(d) for _, d in df.groupby('Sample Type'))
And this only works when there are exactly two unique values in df['Sample Type'].
It might be better to leave it as a dictionary:
sample_dict = {n: create_samples(d) for n, d in df.groupby('Sample Type')}

Map-reduce on more than one key

Problem
We have records, say r_i where i = 0, ..., n. n can be large (in the tens of billions).
Every record has a number of keys, k_ij where j = 0, ..., m. m is small (say 20).
We say r_p = r_q if k_p0 = k_q0, or k_p1 = k_q1, ..., or k_pm = k_qm.
That is, records are equal if at least one of their keys is equal. We need to find such sets of records and generate unique ids for those sets.
Approach
Run m map-reduce jobs, where each job reduces on one key.
So, for job i, the mapper emits (r_p, k_i) and the reducer gets ({r_1, ..., r_p}, k_i).
At the end of all the m jobs, we will have sets of records that share at least one equal key:
S_k = {r_l}
We expect k to be less than n but it could still be in the hundreds of millions, and l to be a small number (say between 2 and 5000).
To get our final results, we will need to merge the above sets which have at least one member in common.
I have the following questions:
How to efficiently merge these sets?
Alternatively, is there any other way to solve this problem?
I realized that this is a connected-components problem, which has well-known solutions.
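For instance, once the per-key sets S_k are small enough to process in one place, a union-find (disjoint-set) pass merges them; here is a sketch in TypeScript, with record ids standing in for the r_l. At the scale described, the same idea is normally run as a distributed connected-components job rather than in memory.
// Disjoint-set union with path compression: records that share any key end up
// in the same component, and the component root can serve as the unique id.
const parent = new Map<string, string>()

function find(x: string): string {
  if (!parent.has(x)) parent.set(x, x)
  const p = parent.get(x)!
  if (p === x) return x
  const root = find(p)
  parent.set(x, root) // path compression
  return root
}

function union(a: string, b: string): void {
  const ra = find(a)
  const rb = find(b)
  if (ra !== rb) parent.set(ra, rb)
}

// Fold in each set S_k produced by the per-key reduce jobs by unioning every
// record with the first record of the set.
function mergeSets(sets: string[][]): void {
  for (const s of sets) {
    for (let i = 1; i < s.length; i++) union(s[0], s[i])
  }
}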

SSAS How to scope a calculated member measure to a number of specific members?

I'm trying to create a calculated member measure for a subset of a group of locations. All other members should be null. I can limit the scope, but then in the client (Excel in this case) the measure does not present the grand total ([Group].[Group].[All]).
CREATE MEMBER CURRENTCUBE.[Measures].[Calculated Measure]
AS (
Null
),
FORMAT_STRING = "$#,##0.00;-$#,##0.00",
NON_EMPTY_BEHAVIOR = { [Measures].[Places] }
,VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Locations';
-----------------------------------------------------------------------------------------
SCOPE ({([Group].[Group].&[location 1]),
([Group].[Group].&[location 2]),
([Group].[Group].&[location 3]),
([Group].[Group].&[location 4]),
([Group].[Group].&[location 5])
}, [Measures].[Calculated Measure]);
// Location Calculations
THIS = (
[Measures].[Adjusted Dollars] - [Measures].[Adjusted Dollars by Component] + [Measures].[Adjusted OS Dollars]
);
END SCOPE;
It's as though the [Group].[Group].[All] member is outside of the scope so it won't aggregate the rest of the members. Any help would be appreciated.
Your calculation is applied after the aggregations have already happened. You can get around this by adding Root([Time]) to the scope, assuming your time dimension is named [Time]. And if you want to aggregate across more dimensions, you would have to add them all to the SCOPE.
In most cases, when you have a calculation that you want to do before aggregation, it is easier to define the calculation in the DSV, e.g. with an expression like
CASE WHEN group_location IN (1, 2, 3, 4) THEN
  Adjusted_dollars - adjusted_dollars_by_comp + adjusted_os_dollars
ELSE NULL
END
and just make a standard aggregatable measure from it.
I've searched high and low for this answer. After reading the above suggestion, I came up with this:
1. Calculate the measure in a column in the source view table: (isnull(a,0) - isnull(b,0)) + isnull(c,0) = x
2. Add the unfiltered calculated column (x) to the DSV
3. Create a named calculation in the DSV that uses a case statement to filter the original calculated measure: CASE WHEN location IN (1, 2, 3) THEN x ELSE NULL END
4. Add the named calculation as a measure
I chose to do it this way to capture the measure unfiltered first; then, if another filter needs to be added or one needs to be taken off, I can do so without messing with the views again. I just add the new member to filter by to my named calculation's case statement. I also tried inserting the calculation directly into a named calculation in the DSV that filtered it as well, but that calculation produced incorrect results.
