Kafka: aggregate single log event lines into a combined log event


I'm using Kafka to process log events. I have basic knowledge of Kafka Connect and Kafka Streams (simple connectors, stream transformations).

Now I have a log file with the following structure:

timestamp event_id event 

A log event consists of multiple log lines that are connected by an event_id (for example, a mail log).

Example:

1234 1 start
1235 1 info1
1236 1 info2
1237 1 end

And in general there are multiple interleaved events:

Example:

1234 1 start
1234 2 start
1235 1 info1
1236 1 info2
1236 2 info3
1237 1 end
1237 2 end

The time window (between start and end) is 5 minutes.

As a result I want a topic like:

event_id combined_log 

Example:

1 start,info1,info2,end
2 start,info3,end

What are the right tools to achieve this? I tried to solve it with Kafka Streams but can't figure out how.

In your use case you are essentially reconstructing sessions or transactions based on message payloads. At the moment there is no built-in, ready-to-use support for such functionality. However, you can use the Processor API of Kafka's Streams API to implement this functionality yourself: you can write custom processors that use a state store to track when, for a given key, a session/transaction was started, added to, and ended.
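A minimal sketch of the session-tracking logic such a custom processor would run, in plain Java. In a real Kafka Streams Processor, the HashMap below would be a KeyValueStore obtained from the ProcessorContext, and completed sessions would be forwarded downstream rather than collected in a map; the class and method names here are illustrative, not part of the Kafka API:

```java
import java.util.*;

// Stand-in for a custom Processor: tracks open sessions per event_id
// and emits the combined log when the "end" marker for that key arrives.
class SessionAggregator {
    // event_id -> buffered events of the open session (stand-in for a state store)
    private final Map<String, List<String>> store = new HashMap<>();
    // event_id -> combined_log (stand-in for records forwarded downstream)
    private final Map<String, String> results = new LinkedHashMap<>();

    // Called once per record of the form "timestamp event_id event"
    void process(String line) {
        String[] parts = line.split(" ");
        String eventId = parts[1];
        String event = parts[2];
        store.computeIfAbsent(eventId, k -> new ArrayList<>()).add(event);
        if (event.equals("end")) {
            // Session complete: combine the buffered lines and clear the state
            results.put(eventId, String.join(",", store.remove(eventId)));
        }
    }

    Map<String, String> results() { return results; }

    public static void main(String[] args) {
        SessionAggregator agg = new SessionAggregator();
        String[] input = {
            "1234 1 start", "1234 2 start", "1235 1 info1",
            "1236 1 info2", "1236 2 info3", "1237 1 end", "1237 2 end"
        };
        for (String line : input) agg.process(line);
        System.out.println(agg.results()); // {1=start,info1,info2,end, 2=start,info3,end}
    }
}
```

Note that this naive version emits as soon as "end" arrives, which only works when records arrive in order.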

Some users in the mailing lists have been doing this, IIRC, though I am not aware of an existing code example I could point you to.

What you need to watch out for is correctly handling out-of-order data. In your example above you listed all the input data in proper order:

1234 1 start
1234 2 start
1235 1 info1
1236 1 info2
1236 2 info3
1237 1 end
1237 2 end

In practice, though, messages/records may arrive out-of-order, like this (I only show messages with key 1 to simplify the example):

1234 1 start
1237 1 end
1236 1 info2
1235 1 info1

Even if that happens, I understand that in your use case you still want to interpret this data as start -> info1 -> info2 -> end, rather than start -> end (ignoring/dropping info1 and info2 = data loss) or start -> end -> info2 -> info1 (incorrect order, probably violating your semantic constraints).
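One way to sketch that out-of-order handling: buffer every (timestamp, event) pair per event_id and only materialize the combined log once the session window is considered closed (in a real processor, e.g. from a punctuation callback after the 5-minute window has elapsed), sorting by timestamp first. Again plain Java with illustrative names; a TreeMap stands in for the state store:

```java
import java.util.*;

// Buffers out-of-order records per event_id and reassembles them in timestamp order.
class OutOfOrderBuffer {
    // event_id -> (timestamp -> event); TreeMap keeps entries sorted by timestamp
    private final Map<String, NavigableMap<Long, String>> buffer = new HashMap<>();

    // Called once per record "timestamp event_id event", in arrival order
    void process(String line) {
        String[] parts = line.split(" ");
        buffer.computeIfAbsent(parts[1], k -> new TreeMap<>())
              .put(Long.parseLong(parts[0]), parts[2]);
    }

    // Called once the session window has closed: events come out in
    // timestamp order regardless of the order in which they arrived.
    String flush(String eventId) {
        return String.join(",", buffer.remove(eventId).values());
    }

    public static void main(String[] args) {
        OutOfOrderBuffer buf = new OutOfOrderBuffer();
        String[] input = { "1234 1 start", "1237 1 end", "1236 1 info2", "1235 1 info1" };
        for (String line : input) buf.process(line);
        System.out.println(buf.flush("1")); // start,info1,info2,end
    }
}
```

The trade-off is latency for correctness: instead of emitting the moment "end" arrives, you wait until the window (here, 5 minutes) has passed, so late-arriving intermediate lines still land in the right position.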

