I will try to share some of what I have learned by studying, working with, and being part of the Apache Kafka® community.
Kafka is a distributed streaming platform.
And what does that mean in practice?
With it you can publish, store, process, and consume large streams of data. In this way, we decouple the systems that produce data from the systems that consume it.
Okay! It may seem a little confusing and very generic, yes, I know, but let’s move on.
Let’s imagine a very basic example.
Every day the postman leaves the post office with a pile of letters, posted by many different senders. Each of these letters has a recipient.
When a postman collects the letters, he receives them already organized by zip code, so he knows where he needs to deliver each set of letters. For each zip code there are several streets, and within them several addresses, people, companies, and so on.
Then our postman, zip code in hand, goes out to deliver. On each street we have several numbers, and for each number we have several letters. With all this information, the postmen are able to make their daily deliveries, so that if one postman is missing, another can take over and deliver in his place. Eventually, one letter or another may not be delivered.
Once the letter is delivered, it does not matter to the postman what you will do with it: whether you will read it, pay the bill, or go shopping with the supermarket flyer. Just as it does not matter to me who delivered the letter; nothing would change in the amount of the bill, for example.
Let’s say this whole process is a data stream, a stream of data in the form of letters. Or better: a mail stream.
In Kafka
In addition to the postman...
Let’s get to the technical side of the matter. With Kafka in hand we can create applications that stream information in real time between various systems, across various platforms and formats, and we can also process this information in real time: enrichment, arithmetic operations, grouping, and other possibilities.
Kafka is a big ecosystem.
This ecosystem is usually composed of a source system, which provides the input data; a producer, responsible for publishing that data; Kafka's core processes, which manage the whole thing; and a consumer, which subscribes to the published data and delivers it to a target system.
To carry out all these processes, Kafka provides us with the following APIs:
Producer API - Allows an application to publish a stream of records to one or more Kafka topics.
Consumer API - Allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API - Allows an application to act as a stream processor, consuming data from one or more topics and producing the processed results to another topic (a small sketch follows this list).
Connector API - Lets you create and run reusable producers or consumers that connect Kafka topics to existing systems. For example, a connector for a relational database that captures every change in a table.
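Of these, the Streams API is probably the hardest to picture, so here is a minimal sketch, assuming a broker at localhost:9092 and two hypothetical topics, letters-in and letters-out: it consumes one topic, transforms each value, and publishes the result to the other.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "letters-uppercase");  // identifies this stream processor
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // any broker of the cluster (assumed address)
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from one (hypothetical) topic, transform each record, publish to another.
        KStream<String, String> letters = builder.stream("letters-in");
        letters.mapValues(value -> value.toUpperCase()).to("letters-out");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

The transformation here is trivial on purpose; the point is only the shape of a stream processor: read from a topic, process, write to another topic.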
Okay, fine, but how do I communicate with Kafka?
Communication between clients and servers is carried out via a simple TCP protocol.
Let’s go to Topics!!!
Topics are data sets. Drawing a parallel, they are like a table in a database; the difference is that in a topic the data is immutable.
So all the information generated by the producers and sent to Kafka is stored in topics. Topics are the "centralizers" of the information. Every topic has an assigned name, for example:
stackoverflow.topic
Each topic can have one or more consumers accessing (subscribing to) the data recorded in it. Since the data is immutable, there is no way for a consumer to change the state of a piece of information that was stored in the topic.
Topics can be divided into one or more partitions. By definition, a partition is an ordered, immutable sequence of records that is continuously appended to a structured commit log. Each partition receives a subset of the information within that topic.
Inside each partition the information is stored in sequence, and to locate it we have the offset.
In my view, the offset is the smallest unit of measurement within Kafka. Each partition is composed of 0 to X offsets, meaning the offset starts at position 0 and grows on demand in each partition (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12... X). Roughly speaking, it works like an array, but we do not need to worry about this growth; Kafka handles it for us naturally, much like a Java ArrayList. Each stored record is identified by its offset number. The drawing below, published by the Apache folks, demonstrates the anatomy of a topic:
And a drawing of my own ;)
Going back to the data: every record stored in a partition belongs solely and exclusively to that partition. Offset 3 of partition 1 has no relation to offset 3 of partition 0.
Although the data is immutable, its retention time within Kafka can be configured according to the context of the application; by default the data is kept for seven days. Kafka's performance is effectively constant with respect to the size of the data, so storing data for a long period is not a problem (depending on the available storage, of course).
It is important to say that the data is delivered to the topic, not to a specific partition; Kafka manages which partition it is saved to. However, we can pass a key with the message, and with this key we guarantee that that type of information will always be stored in the same partition. But again, who decides the partition is Kafka, not us mere mortals.
Writing with a key
This guy called Cluster
The great thing about all this is that Kafka is "clustered": it runs on one or more servers, and these servers can be distributed. These servers are the famous brokers.
The brokers deliver and receive data. Kafka does not manage all of this alone; it needs ZooKeeper to organize the whole mess. For example, when you create or change a topic, it is actually ZooKeeper that performs the action.
Our brokers are identified by an integer ID, and it is inside the brokers that our topics live. For us users, this bunch of brokers is transparent, because when we connect to one broker, we are connected to the whole chain of brokers.
This is where things get interesting, because it is when we have more than one partition per topic that the concepts of brokers and clustering become clearer. When we create a new topic, we determine how many partitions it will have, and Kafka distributes the partitions of that topic among the brokers.
We can also define the replication factor for the topic. When we use this configuration, the data is synchronized between brokers, ensuring the availability of the information.
In this example, the data of Topic 01 on Broker 01 is synchronized to Broker 02, as is the data of Topic 02 that lives on Broker 03.
And if Broker 03 goes down?
We will still have the data available, because it was synchronized to Broker 02. For us, nothing changes: once I am connected to one broker, I am connected to the whole chain of brokers.
And if Broker 01 goes down too?
We are still covered, because that data is also synchronized on Broker 02.
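Tying partitions and replication together, here is a minimal sketch of creating a topic programmatically with Kafka's AdminClient; the topic name (letters), the counts, and the broker address are just assumptions for the example.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any broker of the cluster is enough; once connected, we are connected to the whole chain.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions distributed among the brokers, each one replicated on 2 brokers.
            NewTopic topic = new NewTopic("letters", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

The same thing can be done with the kafka-topics script that ships with Kafka; the code above only makes the two numbers we just talked about, partitions and replication factor, explicit.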
Concept of Leader
Within each partition of a topic, on some broker, we have a leader. Only the leader receives and sends data; the other replicas only synchronize the information, i.e., they follow the leader.
ZooKeeper elects the leader; Kafka is only informed.
If the leader fails, one of the followers automatically becomes the new leader, as in the examples above.
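Just out of curiosity, we can ask the cluster who the leader of each partition is. A minimal sketch, again assuming the hypothetical letters topic and a broker at localhost:9092:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description =
                    admin.describeTopics(Collections.singleton("letters")).all().get().get("letters");

            // For each partition, print the broker acting as leader and the replicas that follow it.
            description.partitions().forEach(p ->
                    System.out.printf("partition %d -> leader: broker %d, replicas: %s%n",
                            p.partition(), p.leader().id(), p.replicas()));
        }
    }
}
```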
Segmentation
We already know that topics are made of partitions, but what are partitions made of?
Partitions are made of segments! Segments are operating-system files, and each one covers a range of offsets. In my example below, a segment is a set of four offsets:
With this, Kafka works with two indexes: one to know at which offset position it should read, and another to know at which timestamp it should read (per segment).
Producers
A producer is responsible for producing and sending data to Kafka, simple as that.
This data can be an e-mail, a tweet, a JSON document, an SMS, a string: anything you want that makes sense in your context.
The only thing we need to do in a producer is tell it which topic and which broker we want to send the information to; from there, Kafka automatically takes charge of distributing and storing the information.
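A minimal producer sketch; the topic name (letters), the key (a zip code, keeping our analogy), and the broker address are just assumptions for the example.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LetterProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Which broker to talk to (any broker of the cluster) and how to serialize key and value.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (a zip code in our analogy) makes every record with that same key
            // land in the same partition; which partition that is, Kafka decides.
            producer.send(new ProducerRecord<>("letters", "03690-000", "water bill for house 42"));
        }
    }
}
```

Note that nowhere do we say which partition to use: we only name the topic, the broker to bootstrap from, and optionally a key.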
Consumers
Consumers read data stored on a topic.
Consumers also need to connect to a broker to read the information. One important note: you should not have more consumers in a group than partitions, since the extra ones would sit idle!
We can have more than one consumer for the same topic, either in a group or individually; this is possible because Kafka balances the reading of the data.
A consumer group lets you read data faster and more dynamically: if I have a topic with three partitions and three consumers, I can read the three partitions simultaneously and in parallel, and the same goes for five partitions with five consumers, and so on.
Just as it does for writes, Kafka also stores the offset at which each consumer is reading; if the process dies, the consumer knows where it stopped and resumes consumption from that point, once again saving us from doing that bookkeeping ourselves.
We can also read the data from a certain offset onward, or from the very beginning!
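A minimal consumer sketch; the group id (postmen), the topic name (letters), and the broker address are assumptions, and auto.offset.reset=earliest is what makes a brand-new group start reading from the beginning of each partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LetterConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "postmen");            // the consumer group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // no stored offset yet? start from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("letters"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The partition and offset tell us exactly where each record came from.
                    System.out.printf("partition %d, offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Run two or three copies of this program with the same group id and Kafka splits the partitions among them; run them with different group ids and each one reads the whole topic on its own.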
Kafka also stores the consumers' offsets in an internal topic:
__consumer_offsets
Even if one consumer subscribes to all the information contained in the partitions of a particular topic, Kafka ensures that other consumers can still subscribe to that same information again.
Below is a drawing of how producers and consumers work:
Below are some references that helped me, and still help me, a lot in this process.
Quickstart
https://kafka.apache.org/quickstart
Source
https://kafka.apache.org/intro