Why all the ado about Kinesis?


Damilola Oke

Data Professional Yacht

21 June 2019


Let me start by saying that if you have experience using the Kafka producer and consumer APIs to build streaming data pipeline applications, you are halfway to learning and using Kinesis for the same purpose.

Kinesis is an AWS managed service that serves as an alternative to Apache Kafka. At the moment, AWS provides four Kinesis services: Kinesis Streams, Kinesis Firehose, Kinesis Analytics, and Kinesis Video Streams. In this post, I will only be reviewing Kinesis Streams.

Kinesis Streams: build low-latency, custom streaming applications that ingest data at scale.

Kinesis Firehose: continuously collects, transforms, and loads streaming data into destinations such as S3, Redshift, Elasticsearch, and Splunk.

Kinesis Analytics: performs real-time analytics on streaming data from Kinesis Streams and Kinesis Firehose using SQL.

Kinesis Video Streams: build applications that stream live video from devices to the AWS Cloud.

An Overview of Kinesis Stream

Kinesis “streams” are roughly equivalent to Kafka “topics”. So why all the hype? In reality, both are messaging systems, and a messaging system is a hugely important piece of infrastructure for moving data between systems. To see why, let’s look at a data pipeline without a messaging system.

Unfortunately, in the real world it is not that easy: data exists on many systems in parallel, all of which need to interact with each other. The situation quickly becomes complex, ending with a system where multiple data stores talk to one another over many channels. Each of these channels requires its own custom protocol and communication method, and moving data between these systems becomes a full-time job for a team of developers. In a nutshell, what you find in the real world are complex data pipelines.
Kinesis Streams is one of AWS's solutions for managing the complexity of data pipelines. It is a great service for real-time processing of application logs, metrics, IoT data, clickstreams, etc. It can also be used with stream-processing frameworks such as Spark and NiFi. Data in Kinesis is automatically and synchronously replicated across three Availability Zones, which provides high availability and failover support: each record is stored redundantly in three facilities within the region.


A Kinesis data stream is a set of shards. Each shard holds a sequence of data records, and each data record carries a sequence number assigned by Kinesis Data Streams. A stream can be created with as many shards as you require, and the number of shards can evolve over time as requirements change, through a process known as resharding (splitting and merging shards).
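Resharding works shard by shard: to split a shard, you give Kinesis a new starting hash key, typically the midpoint of the parent shard's hash-key range. A minimal sketch of computing that split point (the helper name is mine, not an AWS API):

```python
def split_point(starting_hash_key: str, ending_hash_key: str) -> str:
    """Midpoint of a shard's hash-key range, a typical value for
    NewStartingHashKey when splitting the shard in two."""
    start, end = int(starting_hash_key), int(ending_hash_key)
    return str((start + end) // 2)

# A stream's shards partition the 128-bit hash-key space,
# 0 .. 2**128 - 1. Splitting a single-shard stream down the middle:
mid = split_point("0", str(2**128 - 1))
```

The resulting value would then be passed to the SplitShard operation (for example, boto3's `split_shard(StreamName=..., ShardToSplit=..., NewStartingHashKey=mid)`), while MergeShards does the inverse by joining two adjacent shards.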

To illustrate how the Kinesis stream architecture works, suppose you create a Kinesis stream named “truck sensors”. Records are ingested into the truck sensor stream in parallel, i.e., records are produced into different shards concurrently, where they wait to be consumed by a consumer application, which might feed a database or some kind of frontend application. What is important to know is that the records of a stream are spread across its shards, and Kinesis assigns a sequence number to each record put into a shard; records within a shard are ordered, but records are not ordered globally across multiple shards. Records in a shard can be read in batches or one per call, and AWS bills per provisioned shard.

Data Record

A data record is the unit of data stored in a Kinesis data stream. Each record is composed of a sequence number, a partition key, and a data blob, which is an immutable sequence of bytes. Kinesis Data Streams does not inspect, interpret, or change the data in the blob in any way.

A data blob can represent anything: text, photos, sensor readings, etc. Each data blob is serialized and can be up to 1 MB. A partition key (sometimes called a record key) is sent alongside each record to group records into shards, and AWS guarantees that records with the same key always go to the same shard. It is important to note that using a highly distributed partition key helps prevent a “hot partition” problem, where one shard receives a disproportionate share of the traffic.
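The routing itself is simple: Kinesis hashes the partition key with MD5 into a 128-bit integer and maps it onto the shards' hash-key ranges. A rough simulation, assuming the shards split the hash space evenly (the function is illustrative, not part of any SDK):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Approximates Kinesis routing: MD5 of the partition key,
    read as a 128-bit integer, mapped onto evenly sized shard
    hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2**128

# The same key always lands on the same shard:
a = shard_for_key("truck-42", 4)
b = shard_for_key("truck-42", 4)
```

Because the mapping is a hash of the key, keys with many distinct values (e.g. a device ID) spread load evenly, while a low-cardinality key funnels everything into a few shards.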

Retention Period

The retention period is the length of time that data records are accessible after they are added to the Kinesis stream. A stream’s retention period is set to a default of 24 hours after creation. You can increase the retention period up to 168 hours (7 days) using the IncreaseStreamRetentionPeriod operation, and decrease the retention period down to a minimum of 24 hours using the DecreaseStreamRetentionPeriod operation. Additional charges apply for streams with a retention period set to more than 24 hours.
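In practice this means choosing one of two API operations depending on the direction of the change. A small illustrative helper (the function name is mine) that validates the target and picks the operation:

```python
def retention_operation(current_hours, target_hours):
    """Choose the Kinesis operation needed to move a stream's
    retention period; valid range here is 24 to 168 hours."""
    if not 24 <= target_hours <= 168:
        raise ValueError("retention must be between 24 and 168 hours")
    if target_hours > current_hours:
        return "IncreaseStreamRetentionPeriod"
    if target_hours < current_hours:
        return "DecreaseStreamRetentionPeriod"
    return None  # already at the target
```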


Producers

Producers put records into Amazon Kinesis Data Streams; for example, a web server sending log data to a stream is a producer. The write capacity of a Kinesis stream is 1 MB / second or 1,000 records / second per shard. This means:

{number of shards * 1,000 records = max records / second}

{number of shards * 1 MB = max MB / second}

Writing above this limit results in a ProvisionedThroughputExceededException.
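Those two limits make shard sizing a simple calculation: take whichever limit, record count or bytes, demands more shards. A sketch (the helper name is mine):

```python
import math

def shards_needed(records_per_sec, mb_per_sec):
    """Minimum shard count for a write workload, given the
    per-shard write limits of 1,000 records/s and 1 MB/s."""
    by_records = math.ceil(records_per_sec / 1000)
    by_bytes = math.ceil(mb_per_sec / 1.0)
    return max(by_records, by_bytes, 1)

# e.g. 4,500 records/s at ~0.5 KB each (~2.2 MB/s) needs 5 shards,
# because the record-count limit dominates here.
```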


Consumers

Consumers get records from Amazon Kinesis Data Streams and process them; these consumers are known as Amazon Kinesis Data Streams applications. A consumer polls shards to retrieve records. Each shard supports a total read throughput of 2 MB / second, shared across all consumers reading that shard, and those consumers can together make at most five GetRecords API calls per second per shard.
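Because those read limits are shared, adding polling consumers divides the per-shard budget between them. A quick illustration (the helper is mine, not an SDK call):

```python
def per_consumer_budget(num_consumers, read_mb_s=2.0, calls_s=5):
    """Split a shard's shared read limits (2 MB/s and 5
    GetRecords calls/s) evenly across polling consumers."""
    return read_mb_s / num_consumers, calls_s / num_consumers

# Two applications polling the same shard each get at most
# 1 MB/s and 2.5 GetRecords calls/s on that shard.
```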


Amazon Kinesis Data Streams is an AWS managed service that works just like Kafka. It is a massively scalable, highly durable data ingestion and processing service optimized for streaming data. You can configure hundreds of thousands of data producers to continuously put data into a Kinesis data stream and the data will be available within milliseconds to your Amazon Kinesis applications. Those applications will receive data records in the order they were generated. Amazon Kinesis Data Streams is integrated with a number of AWS services, including Amazon Kinesis Data Firehose for near real-time transformation and delivery of streaming data into an AWS data lake like Amazon S3, Kinesis Data Analytics for managed stream processing, and AWS Lambda for event or record processing.
