Optimum utilization of resources has always been a key business objective, and it becomes even more important for corporations where the cost of infrastructure accounts for almost 50% of the overall IT budget. Cloud computing is a key concept that is changing the face of corporate IT: it not only helps in achieving coherence and economies of scale, but also provides enterprise-class functionalities such as upgrades, maintenance, failover, load balancing, and so on. Of course, all these features come at a cost, but you only pay for what you use.
It did not take enterprises long to realize the benefits of cloud computing, and they started adapting, implementing, and hosting their services and products by leveraging either IaaS (https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_.28IaaS.29) or PaaS (https://en.wikipedia.org/wiki/Platform_as_a_service). The benefits of IaaS/PaaS were so noticeable that they were soon extended to cloud-based streaming analytics. Cloud-based streaming analytics is a service in the cloud that helps in developing and deploying low-cost analytics solutions to uncover real-time insights from the data feeds received from varied data sources, such as devices, sensors, infrastructure, and applications. The objective is to scale to any volume of data with high or customizable throughput, low latency, and guaranteed resiliency, with hassle-free installation and setup within a few minutes.
In late 2013, Amazon launched Kinesis as a fully managed, cloud-based service (PaaS) for performing cloud-based streaming analytics on the real-time data received from distributed data streams. It also allowed developers to write custom applications for performing near real-time analytics on the data feeds received in real time.
Let's move forward and find out more about Kinesis.
This chapter will cover the following points that will help us to understand the overall architecture and various components of Kinesis:
In this section, we will discuss the overall architecture and various components of Amazon Kinesis, which will help us understand its terminology.
Amazon Kinesis is a service provided by Amazon in the cloud that allows developers to consume data from multiple data sources, such as streaming news feeds, financial data, social media applications, logs, or sensor data, and subsequently write applications that respond to these real-time data feeds. The data received from Kinesis can be further consumed, transformed, and finally persisted in various data stores such as Amazon S3, Amazon DynamoDB, Amazon Redshift, or any other NoSQL database either in its raw form or filtered according to predefined business rules.
Kinesis can be integrated with real-time dashboards and business intelligence software, which thereby enables the scripting of alerts and decision making protocols that respond to the trajectories of incoming real-time data.
Amazon Kinesis is a highly available service that replicates its data across different AWS availability zones within an AWS region. It is offered as a "managed service" (https://en.wikipedia.org/wiki/Managed_services) for dealing with incoming streams of real-time data, and includes load balancing, failover, autoscaling, and orchestration.
Amazon Kinesis is undoubtedly categorized as a disruptive innovation/technology (https://en.wikipedia.org/wiki/Disruptive_innovation) in the area of real-time or near real-time data processing, where it provides the following benefits:
Apart from the benefits listed here, Kinesis can be leveraged for a variety of business and infrastructure use cases where we need to consume and process incoming data feeds within seconds or milliseconds. Here are a few of the prominent verticals and their associated use cases for Kinesis:
The key challenge with CDRs is their volume and velocity, which vary from day to day or even hour to hour, while at the same time the CDRs are used to solve multiple business problems. Kinesis can undoubtedly be a best-fit solution, as it addresses these key challenges elegantly.
The preceding use cases are only a few examples of where Kinesis can be leveraged for storing/processing real-time or streaming data feeds. There could be many more use cases based on the industry and its needs.
Let's move forward and discuss the architecture and various components of Kinesis.
A fully fledged deployment of Kinesis-based, real-time/streaming data processing applications will involve many components that closely interact with each other. The following diagram illustrates the high-level architecture of such deployments. It also defines the role played by each and every component:
This diagram shows the architecture of Kinesis-based, real-time/streaming data processing applications. At a high level, the producers fetch or consume the data from the data sources and continually push it to the Amazon Kinesis streams. Consumers at the other end continually consume the data from the Kinesis streams and process it further; they can store their results using an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3, or any other NoSQL database, or push them into another Kinesis stream.
Let's move on to the next section where we will discuss the role and purpose of each and every component of the Kinesis architecture.
In this section, we will discuss the various components of Kinesis-based applications. Amazon Kinesis provides the capability to store the streaming data, but there are various other components that play a pivotal role in developing full-blown, Kinesis-based, real-time data processing applications. Let's understand the various components of Kinesis.
Data sources
Data sources are those applications that produce data in real time and provide the means to access or consume that data in real or near real time; for example, sensors emitting readings or web servers emitting logs.
Producers
Producers are custom applications that consume the data from the data sources and then submit it to Kinesis streams for storage. The role of producers in the overall architecture is to consume the data and implement various strategies to submit it to Kinesis streams efficiently; one recommended strategy is to create batches of data records and submit each batch as a whole. Producers may optionally perform minimal transformations, but we should ensure that any such processing does not degrade producer performance: the primary role of producers is to fetch data and submit it to Kinesis streams optimally. It is recommended to deploy producers on EC2 instances so that there is minimal latency while submitting records to Kinesis. Producers can be implemented using either the APIs directly exposed by the AWS SDK or the Kinesis Producer Library.
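To illustrate the batching strategy, here is a minimal Python sketch using boto3 (the AWS SDK for Python). The 500-record cap comes from the PutRecords API limit; the stream name and partition-key function are placeholders the caller supplies, and production code would retry the failed records reported in each response rather than raise:

```python
import json

# PutRecords accepts at most 500 records per request, so the producer
# groups records into batches of that size before submission.
MAX_BATCH_SIZE = 500

def batch_records(records, batch_size=MAX_BATCH_SIZE):
    """Split an iterable of records into PutRecords-sized batches."""
    records = list(records)
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

def submit_batches(records, stream_name, partition_key_fn):
    """Submit records to a Kinesis stream in batches (needs AWS credentials)."""
    import boto3  # imported lazily so the batching logic is testable offline
    kinesis = boto3.client("kinesis")
    for batch in batch_records(records):
        response = kinesis.put_records(
            StreamName=stream_name,
            Records=[{"Data": json.dumps(r).encode("utf-8"),
                      "PartitionKey": partition_key_fn(r)} for r in batch],
        )
        # PutRecords can partially fail; FailedRecordCount flags how many
        # records in the batch must be resubmitted.
        if response["FailedRecordCount"] > 0:
            raise RuntimeError("some records were not ingested")
```

Batching like this reduces the number of HTTP round trips per record, which is exactly why it is the recommended producer strategy.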
Consumers
Consumers are custom applications that connect to and consume the data from Kinesis streams. Consumer applications further transform or enrich the data and can store it in persistent storage services such as DynamoDB (https://aws.amazon.com/dynamodb/), Redshift (https://aws.amazon.com/redshift/), or S3 (https://aws.amazon.com/s3/), or may submit it to Elastic MapReduce (EMR) (https://aws.amazon.com/elasticmapreduce/) for further processing.
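As a rough sketch of a bare-bones consumer built directly on boto3 (without KCL), the following walks a single shard from its oldest available record. The decoding helper assumes the producers wrote JSON payloads; the stream and shard names are supplied by the caller:

```python
import json
import time

def decode_records(raw_records):
    """Decode raw Kinesis records, assuming producers wrote JSON payloads."""
    return [json.loads(record["Data"].decode("utf-8"))
            for record in raw_records]

def consume_shard(stream_name, shard_id):
    """Poll a single shard from its oldest record (needs AWS credentials)."""
    import boto3  # imported lazily so decode_records stays testable offline
    kinesis = boto3.client("kinesis")
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest record
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in decode_records(response["Records"]):
            print(record)  # replace with transformation/persistence logic
        iterator = response.get("NextShardIterator")
        time.sleep(1)  # stay under the 5 reads/sec per-shard limit
```

A real application would track checkpoints and handle resharding, which is precisely the bookkeeping that KCL (discussed later) takes off the developer's hands.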
AWS SDK
Amazon provides a software development kit (SDK) that contains easy-to-use, high-level APIs so that users can seamlessly integrate AWS services into their existing applications. The SDK contains APIs for many AWS services, including Amazon S3, Amazon EC2, DynamoDB, Kinesis, and others, along with code samples and documentation. Producers and consumers of Kinesis streams can be developed using the Amazon Kinesis APIs exposed by the AWS SDK, but these APIs provide only basic functionality for connecting to Kinesis streams and submitting or consuming data. Other advanced functionality, such as batching or aggregation, either needs to be developed by the developer or can be obtained from high-level libraries such as the Kinesis Producer Library (KPL) or the Kinesis Client Library (KCL).
Refer to https://aws.amazon.com/tools/?nc1=f_dr for more information about the available SDKs and http://docs.aws.amazon.com/kinesis/latest/APIReference/Welcome.html for more information about Amazon Kinesis API.
KPL
KPL is a high-level API built on top of the AWS SDK that developers can use to write code for ingesting data into Kinesis streams. The Amazon KPL simplifies producer application development, allowing developers to achieve high write throughput to an Amazon Kinesis stream, which can be monitored using Amazon CloudWatch (https://aws.amazon.com/cloudwatch/). Apart from providing a simplified mechanism for connecting and submitting data to Kinesis streams, KPL also provides the following features:
It is also important to understand that KPL should not be confused with the Amazon Kinesis API that is available with the AWS SDK. The Amazon Kinesis API provides various functionalities, such as creating streams, resharding, and putting/getting records, but KPL just provides a layer of abstraction specifically for optimized data ingestion.
Refer to http://docs.aws.amazon.com/kinesis/latest/dev/developing-producers-with-kpl.html for more information on KPL.
KCL
KCL is a high-level API developed by extending the Amazon Kinesis API (included in the AWS SDK) for consuming data from Kinesis streams. KCL provides various design patterns for accessing and processing data efficiently. It also incorporates features such as load balancing across multiple instances, handling instance failures, checkpointing processed records, and reacting to resharding, which lets developers focus on the real business logic for processing the records received from the Kinesis streams. KCL itself is written in Java, but it also supports other languages via its MultiLangDaemon interface.
Refer to http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html for more information on KCL, its role, and its support for multiple languages.
Kinesis streams
One of the most important components of the complete architecture is the Kinesis streams. Understanding the design of Kinesis streams is important for architecting and designing a scalable and performance-efficient real-time message processing system. Kinesis streams define four main components, which are used in structuring and storing the data received from producers: shards, data records, partition keys, and sequence numbers. Let's move forward and look at each of these components of Kinesis streams.
Shards
A shard stores a uniquely identified group of data records in an Amazon Kinesis stream. A Kinesis stream is composed of multiple shards, and each shard provides a fixed unit of capacity. Every record submitted to a Kinesis stream is stored in one of the shards of that stream. Each shard has its own capacity to handle read and write requests, and the total capacity of an Amazon Kinesis stream is the sum of the capacities of its shards. We need to be careful when defining the total number of shards for a stream because we are charged on a per-shard basis. Here's the capacity for reads and writes for one shard:
Let's take an example where you define 20 shards for your Amazon Kinesis stream. The maximum read capacity will be 100 (20 shards * 5) transactions per second, where the total size of all reads should not exceed 40 MB (20 * 2 MB) per second, which means that each transaction, if the reads are equally divided, should not be larger than about 400 KB. For writes, there will be a maximum of 20,000 (20 * 1,000) records per second, with a total write throughput of no more than 20 MB (20 * 1 MB) per second.
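Aggregate capacity like this can be computed mechanically. The following sketch assumes the per-shard limits quoted in this chapter (5 read transactions/sec and 2 MB/sec of reads; 1,000 records/sec and 1 MB/sec of writes); the function name is illustrative:

```python
# Per-shard limits: 5 read transactions/sec, 2 MB/sec of reads,
# 1,000 records/sec of writes, 1 MB/sec of write throughput.
def stream_capacity(shards):
    """Aggregate the per-shard limits into whole-stream capacity."""
    return {
        "read_transactions_per_sec": shards * 5,
        "read_mb_per_sec": shards * 2,
        "write_records_per_sec": shards * 1000,
        "write_mb_per_sec": shards * 1,
    }
```

For 20 shards this yields 100 read transactions/sec, 40 MB/sec of reads, 20,000 writes/sec, and 20 MB/sec of write throughput, matching the example above.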
Shards need to be defined at the time of initializing the Kinesis stream itself. To decide the total number of required shards, we need the following inputs, either from the user or developers who will be reading or writing records to the stream:
Considering the preceding numbers, we will require at least two shards. Here's the formula for calculating the total number of shards:
number_of_shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
This will eventually result in the following calculation:
2 = max((1000*2 KB)/1000, (1000*2 KB)/2000) = max(2, 1)
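The sizing formula above can also be expressed directly in code. This is a minimal sketch; math.ceil is used so that fractional results round up to a whole shard:

```python
import math

def number_of_shards(write_bandwidth_kb, read_bandwidth_kb):
    """Chapter's sizing formula: one shard absorbs 1,000 KB/sec of
    writes or 2,000 KB/sec of reads, whichever bound is tighter."""
    return max(math.ceil(write_bandwidth_kb / 1000),
               math.ceil(read_bandwidth_kb / 2000))
```

Plugging in 1,000 records per second of 2 KB each for both reads and writes gives max(2, 1) = 2 shards, as in the worked example.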
Once the shards are defined and the streams are created, we can spin up multiple clients for reading from and writing to the same streams.
Let's move forward and discuss the format of the data stored within the streams/shards. The data within the shards is stored in the form of data records, which are composed of partition keys, sequence numbers, and the actual data. Let's study each of these components of the data records.
Partition keys
Partition keys are user-defined Unicode strings, each with a maximum length of 256 characters, that map data records to particular shards. Each shard stores the data for a specific range of partition keys only. The mapping is derived by providing the partition key as input to a hash function internal to Kinesis: Amazon Kinesis leverages the MD5 hash function to convert partition keys (strings) into 128-bit integer values, which are then used to associate data records with shards. Every time producers submit records to Kinesis streams, this hashing mechanism is applied to the partition keys, and each data record is stored on the specific shard responsible for its hash value.
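The mapping can be illustrated offline. The following sketch mimics (but does not claim to reproduce byte-for-byte) Kinesis's behavior: it MD5-hashes a partition key into a 128-bit integer and assumes the hash-key space is split into equal, contiguous per-shard ranges, which is the default when a stream is created. The key "sensor-42" and the helper names are illustrative:

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 produces a 128-bit digest

def hash_key(partition_key):
    """MD5-hash a partition key into the 128-bit hash-key space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

def shard_for_key(partition_key, num_shards):
    """Map a key to a shard index, assuming the hash-key space is split
    into num_shards equal, contiguous ranges (the default at creation)."""
    return hash_key(partition_key) * num_shards // HASH_KEY_SPACE
```

Because the mapping is deterministic, all records sharing a partition key always land on the same shard, which is what preserves per-key ordering.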
Sequence numbers
Sequence numbers are unique numbers assigned by Kinesis to each data record once it is submitted by a producer. Sequence numbers for the same partition key generally increase over time: the longer the time period between write requests, the larger the sequence numbers will be.
In this section, we have discussed the various components and their role in the overall architecture of Kinesis streams. Let's move forward to the next section where we will see the usage of these components with the help of appropriate examples.