wiki:PartitionerInterface

Partititioner Interfacce

Why Partitioning & Concept Of Slices

StreamMine3G is a highly scalable and elastic Event Stream Processing engine. One important concept to achieve scalability as well as elasticity is the partitioning of data, hence, instead of having an operator deployed only on one single machine to do all the processing, the same operator (code) will be deployed on multiple machines where each instance will only process a subset of the data coming from an upstream operator or the "external world" through the adapter interface.

Whenever a new operator is created in StreamMine3G, the SLICES parameter determines how many partitions (which we call slices in StreamMine3G) for that specific operator should be used in future. The chosen number also determines the maximum scale out factor. For instance, if an operator consists of only five slices, the operator can only be spread across five nodes in a StreamMine3G cluster where each slice will then process 1/5th of the total incoming data rate.

KeyRange Partitioner (Default Partitioner)

The partitioner is used to decide to which of the slices an event should go to. By default if no custom written partitioner is available, StreamMine3G will use the internally available KeyRangePartitioner and the routing key of each event to decide, in which key range the event falls into and needs to be forwarded to the respective slice later on.

Let's assume the key range is set to 1000 (via the ROUTINGKEYRANGESIZE parameter) and two slices have been choosen for some operator X. StreamMine3G will then internally build up a tree that consists of leaves with [0-499] and [500-999] where the first key range is associate with a so called sliceID=0, and the second with sliceID=1. If an upstream operator emits now an event with e.g. routingKey = 456, this event will clearly go to all slices identified by sliceId=0 whereas events that have a routingKey >= 500 and < 1000 would automatically be sent to slices identified by sliceId=1.
StreamMine3G supports active replication, hence each slice can be replicated. To distinguish replicated slices, i.e. slices that process events with the same key range, slices are assigned a unique identifier, a so called sliceUId.

Partitioner Interface (Custom Partitioner)

Some application require to group/partition events based on some custom field in the payload of the event rather than using the routing key. As an example: Let's assume an application has one upstream operator and two downstream operators where the first one of these performs some aggregation based on some userId whereas the second on the companyId. Hence, the two downstream operators require different event partitioning schemes even though they are consuming the same incoming events. StreamMine3G offers the partitioner interface for those cases where the event is simply passed to a getPartition() method first in order to determine to which sliceId the event belongs to and should be sent downstream later on. Hence, the partitioner method (even semantically belonging to the downstream operator) is executed upstream as it has impact on the internal event routing.

int SamplePartitioner::getPartition(int key, void* event, int size)
{
    return slicesCount - key - 1; //inverse partitioner
}

The getPartition method must return the sliceId an event belongs to, hence only a value between 0 and SLICCES-1 should be returned. However, StreamMine3G can also be used to build application such as publish subscribe systems: Those systems require mechanisms like broadcasting of events, hence, events should be sent to all downstream partitions. This can be achieved by returning -1 in the getPartition method.
In addition to a broadcast, events can also be discarded, hence, in case a certain operator should not receive a certain event a -2 can be returned and the event will be discarded internally.

Last modified 6 years ago Last modified on Feb 5, 2013, 10:32:38 AM

Attachments (1)

Download all attachments as: .zip