Chapter 5. Getting Acquainted with Kinesis

Optimal utilization of resources has always been a key business objective, and it matters even more for corporations where infrastructure accounts for almost 50% of the overall IT budget. Cloud computing is changing the face of corporate IT: it not only helps in achieving coherence and economies of scale but also provides enterprise-class functionality such as upgrades, maintenance, failover, and load balancing. Of course, all these features come at a cost, but you only pay for what you use.

It did not take much time for enterprises to realize the benefits of cloud computing, and they soon started adapting, implementing, and hosting their services and products by leveraging either IaaS (https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_.28IaaS.29) or PaaS (https://en.wikipedia.org/wiki/Platform_as_a_service). The benefits of IaaS/PaaS were so noticeable that the model was soon extended to cloud-based streaming analytics. Cloud-based streaming analytics is a service in the cloud that helps in developing and deploying low-cost analytics solutions to uncover real-time insights from the data feeds received from varied data sources, such as devices, sensors, infrastructure, and applications. The objective is to scale to any volume of data with high or customizable throughput, low latency, and guaranteed resiliency, with hassle-free installation and setup within a few minutes.

In late 2013, Amazon launched Kinesis as a fully managed, cloud-based service (PaaS) for performing cloud-based streaming analytics on the real-time data received from distributed data streams. It also allowed developers to write custom applications for performing near real-time analytics on the data feeds received in real time.

Let's move forward and find out more about Kinesis.

This chapter will cover the following points that will help us to understand the overall architecture and various components of Kinesis:

  • Architectural overview of Kinesis
  • Creating a Kinesis streaming service

Architectural overview of Kinesis

In this section, we will talk about the overall architecture and various components of Kinesis. This section will help us to understand the terminology and various components of Amazon Kinesis.

Benefits and use cases of Amazon Kinesis

Amazon Kinesis is a service provided by Amazon in the cloud that allows developers to consume data from multiple data sources, such as streaming news feeds, financial data, social media applications, logs, or sensor data, and subsequently write applications that respond to these real-time data feeds. The data received from Kinesis can be further consumed, transformed, and finally persisted in various data stores such as Amazon S3, Amazon DynamoDB, Amazon Redshift, or any other NoSQL database either in its raw form or filtered according to predefined business rules.

Kinesis can be integrated with real-time dashboards and business intelligence software, enabling scripted alerts and decision-making logic that respond to trends in the incoming real-time data.

Amazon Kinesis is a highly available service that replicates its data across different AWS Availability Zones within an AWS region. Amazon Kinesis is offered as a "managed service" (https://en.wikipedia.org/wiki/Managed_services) for dealing with incoming streams of real-time data, and it includes load balancing, failover, autoscaling, and orchestration.

Amazon Kinesis is undoubtedly categorized as a disruptive innovation/technology (https://en.wikipedia.org/wiki/Disruptive_innovation) in the area of real-time or near real-time data processing, where it provides the following benefits:

  • Ease of use: Hosted as a managed service, Kinesis streams can be created with just a few clicks, and applications for consuming and processing the data can be developed in no time with the help of the AWS SDK.
  • Parallel processing: It is common to process the same data feed for different purposes. In a network monitoring use case, for example, processed network logs are shown on real-time dashboards to identify security breaches or anomalies, while at the same time the data is stored in a database for deeper analytics. A Kinesis stream can be consumed by multiple applications concurrently, each with a different purpose or problem to solve.
  • Scalable: Kinesis streams are elastic in nature and can scale from MBs to GBs per second, and from thousands to millions of messages per second. Based on the volume of data, throughput can be adjusted dynamically at any point in time without disturbing the existing applications/streams.
  • Cost effective: There is no setup or upfront cost involved in setting up Kinesis streams. We pay only for what we use and that can be as little as $0.015 per hour with a 1 MB/second ingestion rate and a 2 MB/second consumption rate.
  • Fault tolerant and highly available: Amazon Kinesis preserves data for 24 hours and synchronously replicates the streaming data across three facilities in an AWS region, preventing data loss in the case of application or infrastructure failure.

Apart from the benefits listed here, Kinesis can be leveraged for a variety of business and infrastructure use cases where we need to consume and process incoming data feeds within seconds or milliseconds. Here are a few of the prominent verticals and their associated use cases for Kinesis:

  • Telecommunication: Analyzing call data records (CDRs) is one of the most prominent and most discussed business use cases. Telecom companies analyze CDRs for actionable insights that not only optimize overall costs but also reveal new business lines and trends, resulting in better customer satisfaction.

    The key challenge with CDRs is their volume and velocity, which vary from day to day or even hour to hour, while the same CDRs are used to solve multiple business problems. Kinesis is a best-fit solution here, as it addresses these key challenges elegantly.

  • Healthcare: One of the most prominent and best-suited healthcare use cases for Kinesis is analyzing privacy-protected streams of medical device data to detect early signs of disease while, at the same time, identifying correlations among multiple patients and measuring the efficacy of treatments.
  • Automotive: Automotive is a growing industry that has seen a lot of innovation over the past few years. Many sensors are embedded in our vehicles, constantly collecting various kinds of information such as distance, speed, geospatial/GPS location, driving patterns, parking style, and so on. All this information can be pushed into Kinesis streams in real time, and consumer applications can process it to provide actionable insights, such as assessing a driver's risk of accidents or generating alerts for insurance companies, which can in turn produce personalized insurance pricing.

The preceding use cases are only a few examples of where Kinesis can be leveraged for storing/processing real-time or streaming data feeds. There could be many more use cases based on the industry and its needs.

Let's move forward and discuss the architecture and various components of Kinesis.

High-level architecture

A fully fledged deployment of a Kinesis-based, real-time/streaming data processing application involves many components that interact closely with each other. The following diagram illustrates the high-level architecture of such a deployment and the role played by each component:

High-level architecture

This diagram shows the architecture of Kinesis-based, real-time/streaming data processing applications. At a high level, producers fetch data from the data sources and continually push it to Amazon Kinesis streams. Consumers at the other end continually consume data from the Kinesis streams, process it further, and can store their results in an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3, in any other NoSQL database, or push them into another Kinesis stream.

Let's move on to the next section where we will discuss the role and purpose of each and every component of the Kinesis architecture.

Components of Kinesis

In this section, we will discuss the various components of Kinesis-based applications. Amazon Kinesis itself provides the capability to store the streaming data, but various other components play a pivotal role in developing a full-blown, Kinesis-based, real-time data processing application.

Data sources

Data sources are the applications that produce data in real time and provide the means to access or consume it in real or near real time; for example, sensors emitting readings or web servers emitting logs.

Producers

Producers are custom applications that consume data from the data sources and submit it to Kinesis streams for storage. Their role in the overall architecture is to implement strategies for submitting data efficiently; one recommended strategy is to batch records before submitting them to the stream. Producers may optionally perform minimal transformations, but any such processing should not degrade producer performance, since their primary job is to fetch data and submit it to Kinesis streams optimally. Producers are recommended to be deployed on EC2 instances so that there is minimal latency while submitting records to Kinesis. They can be implemented using the APIs exposed directly by the AWS SDK or using the Kinesis Producer Library.
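The batching strategy described above can be sketched in Python with the `boto3` SDK (the stream name and record shape here are hypothetical; `PutRecords` accepts at most 500 records per call):

```python
def chunk(records, max_batch=500):
    """PutRecords accepts at most 500 records per call, so split into batches."""
    for i in range(0, len(records), max_batch):
        yield records[i:i + max_batch]

def submit(records, stream_name="sensor-stream"):
    """Submit pre-batched records to a Kinesis stream (hypothetical names)."""
    import boto3  # imported lazily so the batching helper works without AWS set up
    kinesis = boto3.client("kinesis")
    for batch in chunk(records):
        kinesis.put_records(
            StreamName=stream_name,
            Records=[{"Data": r["data"], "PartitionKey": r["key"]} for r in batch],
        )
```

Batching like this amortizes the per-request overhead, which is exactly the optimization the KPL performs for you automatically.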

Consumers

Consumer applications are custom applications that connect to Kinesis streams and consume data from them. Consumer applications further transform or enrich the data and can store it in persistent storage services such as DynamoDB (https://aws.amazon.com/dynamodb/), Redshift (https://aws.amazon.com/redshift/), or S3 (https://aws.amazon.com/s3/), or submit it to EMR, which is short for Elastic MapReduce (https://aws.amazon.com/elasticmapreduce/), for further processing.
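A minimal consumer sketch using the low-level API might look like the following (a hedged illustration: the client passed in is typically `boto3.client("kinesis")`, and in real applications the KCL handles iteration, checkpointing, and failover for you):

```python
def read_shard(kinesis, stream_name, shard_id, limit=100):
    """One read pass over a single shard using the low-level Kinesis API."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
    )["ShardIterator"]
    response = kinesis.get_records(ShardIterator=iterator, Limit=limit)
    # A long-running consumer would loop using NextShardIterator
    return response["Records"], response["NextShardIterator"]
```

Passing the client in as a parameter keeps the function easy to test against a stub and independent of credential configuration.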

AWS SDK

Amazon provides a software development kit (SDK) that contains easy-to-use, high-level APIs so that users can seamlessly integrate AWS services into their existing applications. The SDK contains APIs for many AWS services, including Amazon S3, Amazon EC2, DynamoDB, Kinesis, and others, along with code samples and documentation. Producers and consumers of Kinesis streams can be developed using the Amazon Kinesis APIs exposed by the AWS SDK, but these APIs provide only basic functionality for connecting to streams and submitting or consuming data. More advanced functionality, such as batching or aggregation, either needs to be implemented by the developer or can be obtained from higher-level libraries such as the Kinesis Producer Library (KPL) and the Kinesis Client Library (KCL).

Note

Refer to https://aws.amazon.com/tools/?nc1=f_dr for more information about the available SDKs and http://docs.aws.amazon.com/kinesis/latest/APIReference/Welcome.html for more information about Amazon Kinesis API.

KPL

KPL is a high-level API built on top of the AWS SDK that developers can use to write code for ingesting data into Kinesis streams. KPL simplifies producer application development, allowing developers to achieve high write throughput to an Amazon Kinesis stream, which can then be monitored using Amazon CloudWatch (https://aws.amazon.com/cloudwatch/). Apart from providing a simplified mechanism for connecting and submitting data to Kinesis streams, KPL also provides the following features:

  • Retry mechanism: Provides an automatic and configurable retry mechanism for scenarios where data packets are dropped or rejected by Kinesis streams due to insufficient available bandwidth.
  • Batching of records: Provides APIs for writing multiple records to multiple shards in a single request.
  • Aggregation: Aggregates records into larger payloads to achieve optimum utilization of the available throughput.
  • Deaggregation: Integrates seamlessly with the Amazon KCL, which deaggregates the batched records on the consumer side.
  • Monitoring: Submits various metrics to Amazon CloudWatch to monitor the performance and overall throughput of the producer application.

It is also important to understand that KPL should not be confused with the Amazon Kinesis API that is available with the AWS SDK. The Amazon Kinesis API provides various functionalities, such as creating streams, resharding, and putting/getting records, but KPL just provides a layer of abstraction specifically for optimized data ingestion.

KCL

KCL is a high-level library developed by extending the Amazon Kinesis API (included in the AWS SDK) for consuming data from Kinesis streams. KCL provides various design patterns for accessing and processing data efficiently. It also incorporates features such as load balancing across multiple instances, handling instance failures, checkpointing processed records, and reacting to resharding, which lets developers focus on the real business logic for processing the records received from Kinesis streams. KCL is written entirely in Java, but it also supports other languages via its MultiLangDaemon interface.

Note

Refer to http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html for more information on KCL, its role, and its support for multiple languages.

Kinesis streams

One of the most important components of the complete architecture is the Kinesis stream itself. Understanding the design of Kinesis streams is important for architecting a scalable and performance-efficient real-time message processing system. Kinesis streams define four main components used in structuring and storing the data received from producers: shards, data records, partition keys, and sequence numbers. Let's move forward and look at each of these components.

Shards

A shard stores a uniquely identified group of data records in an Amazon Kinesis stream. A Kinesis stream is composed of multiple shards, and each shard provides a fixed unit of capacity. Every record submitted to a Kinesis stream is stored in one of that stream's shards. Each shard has its own capacity to handle read and write requests, and the total number of shards defined for a stream determines the total capacity of the Amazon Kinesis stream. We need to be careful when defining the number of shards for a stream because we are charged on a per-shard basis. Here's the read and write capacity of a single shard:

  • For reads: Every shard has a maximum capacity to support up to 5 transactions per second with a maximum limit of 2 MB per second.
  • For writes: Every shard has a maximum capacity to accept 1,000 writes per second, with a limit of 1 MB per second (including partition keys).

Let's take an example where you define 20 shards for your Amazon Kinesis stream. The maximum read capacity will be 100 (20 shards * 5) transactions per second, and the total size of all reads should not exceed 40 MB (20 * 2) per second, which means that each transaction, if reads are divided equally, should not be larger than roughly 400 KB. For writes, there will be a maximum of 20,000 (20 * 1,000) writes per second, with a total write throughput of no more than 20 MB (20 * 1) per second.

Shards need to be defined at the time of initializing the Kinesis stream itself. To decide the total number of required shards, we need the following inputs, either from the user or developers who will be reading or writing records to the stream:

  • Total size of each record: Let's assume that each record is 2 KB
  • Total required reads and writes per second: Assume 1000 writes and reads per second

Considering the preceding numbers, we will require at least two shards. Here's the formula for calculating the total number of shards:

number_of_shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)

This will eventually result in the following calculation:

2 = max((1000 * 2 KB)/1000, (1000 * 2 KB)/2000) = max(2, 1)

Once we define the shards and streams are created, we can spin off multiple clients for reading and writing to the same streams.
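The sizing arithmetic above can be expressed as a small helper (a sketch in Python; the inputs are the record size and read/write rates listed earlier, and the result is rounded up with a floor of one shard):

```python
import math

def shards_needed(record_kb, writes_per_sec, reads_per_sec):
    """number_of_shards = max(incoming_KB_per_sec/1000, outgoing_KB_per_sec/2000)."""
    incoming = record_kb * writes_per_sec   # KB/s written to the stream
    outgoing = record_kb * reads_per_sec    # KB/s read from the stream
    return max(math.ceil(incoming / 1000), math.ceil(outgoing / 2000), 1)

# The result is what you would pass as ShardCount when creating the stream,
# for example: boto3.client("kinesis").create_stream(StreamName=..., ShardCount=...)
```

With 2 KB records and 1,000 reads and writes per second, this returns 2, matching the calculation above.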

Let's now discuss the format of the data stored within the streams/shards. The data within the shards is stored in the form of data records, which comprise partition keys, sequence numbers, and the actual data. Let's study each of these components.

Partition keys

Partition keys are user-defined Unicode strings, each with a maximum length of 256 characters, that map data records to particular shards. Each shard stores the data for a specific range of partition-key hash values only. The mapping is derived by providing the partition key as input to a hash function (internal to Kinesis) that associates the data with a specific shard: Amazon Kinesis applies the MD5 hash function to convert the partition key (a string) to a 128-bit integer value, and that value determines which shard receives the record. Every time producers submit records to Kinesis streams, this hashing mechanism is applied to the partition keys and the data records are stored on the shards responsible for those hash values.
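The mapping can be illustrated with a short Python sketch (an approximation: it assumes the 128-bit hash space is split evenly across shards, whereas the real ranges come from each shard's hash-key range in the stream's metadata):

```python
import hashlib

def shard_for_key(partition_key, shard_count):
    """Illustrative routing: MD5 the key to a 128-bit integer, then pick the
    shard whose (evenly split) hash-key range contains that integer."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2 ** 128 // shard_count
    return min(h // range_size, shard_count - 1)
```

Because MD5 is deterministic, all records sharing a partition key always land on the same shard, which is why key choice directly affects how evenly load spreads across shards.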

Sequence numbers

Sequence numbers are unique numbers assigned by Kinesis to each data record once it is submitted by the producers. The sequence numbers for the same partition key generally increase over a period of time, which means the longer the time period between write requests, the larger will be the sequence numbers.

In this section, we have discussed the various components and their role in the overall architecture of Kinesis streams. Let's move forward to the next section where we will see the usage of these components with the help of appropriate examples.
