Online streaming has become an integral part of how information is consumed today. However, building live, real-time systems remains a specialized skill that involves cross-platform integration, subscriptions, instant notifications, and more. The core of a real-time system is the continuous streaming of data from one application to another. Various tools provide this ability, including RabbitMQ, Apache Kafka, and Amazon Kinesis. Each tool has its own advantages and disadvantages. Today we are going to focus on Amazon Kinesis.
2. Amazon Kinesis Data Streams
In this setup, it is used to capture item-level modifications in a DynamoDB table. Our applications can read the Kinesis stream and view changes in near real time. The Kinesis data stream can continuously capture and store terabytes of data per hour, and its longer retention period gives us additional audit and security transparency. Kinesis Data Streams can also be used with Kinesis Data Firehose (a delivery stream platform) and Amazon QuickSight, where we can create real-time dashboards, generate alerts, and so on.
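To make "item-level modifications" concrete, here is a minimal sketch of how a consumer might summarize one change record read from the stream. The field names (`eventName`, `tableName`, `dynamodb`, `Keys`, `OldImage`, `NewImage`) reflect my understanding of the record layout and should be checked against real records from your stream. The `Orders` table and its attributes are hypothetical.

```python
import json


def summarize_change(raw: bytes) -> dict:
    """Summarize one DynamoDB change record read from the Kinesis stream.

    Assumes the record is a JSON document with the fields used below;
    missing fields are tolerated rather than treated as errors.
    """
    record = json.loads(raw)
    change = record.get("dynamodb", {})
    return {
        "table": record.get("tableName"),
        "event": record.get("eventName"),  # expected: INSERT, MODIFY or REMOVE
        "keys": change.get("Keys", {}),
        "has_old_image": "OldImage" in change,
        "has_new_image": "NewImage" in change,
    }


# Hypothetical sample payload, shaped like a MODIFY event on an "Orders" table.
sample = json.dumps({
    "eventName": "MODIFY",
    "tableName": "Orders",
    "dynamodb": {
        "Keys": {"order_id": {"S": "o-1001"}},
        "OldImage": {"order_id": {"S": "o-1001"}, "status": {"S": "PENDING"}},
        "NewImage": {"order_id": {"S": "o-1001"}, "status": {"S": "SHIPPED"}},
    },
}).encode("utf-8")

summary = summarize_change(sample)
print(summary["event"], summary["keys"])
```

A summary like this is usually enough to route the change, for example to an audit log or an alert, without interpreting every attribute in the item.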
3. Amazon Kinesis Data Firehose
It is a fully managed ETL service used to reliably load streaming data into data stores, data lakes, and analytics services. It can capture, transform, and deliver streaming data to S3 and other destinations such as Redshift, OpenSearch, and Datadog. Kinesis Data Firehose scales automatically to match the throughput of the data, and it can batch, compress, transform, and encrypt the data streams, which minimizes the storage used and increases security.
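The transform step is typically a Lambda function that receives a batch of base64-encoded records and returns them with a per-record result. The sketch below is one possible handler, not a definitive implementation. It assumes the incoming records are JSON and rewrites each one as a compact single line, so the objects written to S3 are newline-delimited. The sample event at the bottom is made up for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Re-emit each Firehose record as one compact JSON line.

    Records that cannot be decoded or parsed are marked ProcessingFailed
    and returned unchanged.
    """
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            line = json.dumps(payload, separators=(",", ":")) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("ascii"),
            })
        except ValueError:
            # Covers invalid base64, invalid UTF-8 and invalid JSON.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


def _encode(text: str) -> str:
    """Helper for building the hypothetical test event below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical event: one valid JSON record and one that is not JSON.
sample_event = {"records": [
    {"recordId": "1", "data": _encode('{"eventName": "INSERT"}')},
    {"recordId": "2", "data": _encode("not json")},
]}
result = lambda_handler(sample_event, None)
print([r["result"] for r in result["records"]])
```

Every record must be returned with the `recordId` it arrived with. A record that should be skipped on purpose can be returned with the result `Dropped` instead of `ProcessingFailed`.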
4. High-Level Architecture Diagram
5. Step-by-Step Data Lake implementation guide for DynamoDB tables using Kinesis Streams
I will use Amazon Kinesis Data Streams together with Kinesis Data Firehose to store DynamoDB table data in S3 (as a data lake).
- Create a Kinesis data stream and provision its capacity by selecting either on-demand or provisioned capacity mode
- Create a delivery stream, which is used to send the streamed data to the S3 bucket
- Choose Kinesis Data Streams as the source and an S3 bucket as the destination
- Under Source settings, select the data stream created in the earlier step
- We can handle the data in one of three ways: transform it with a Lambda function (for example, if the stream data is not JSON), use Glue to convert the JSON records to Apache Parquet or Apache ORC format based on a table schema that we define (these formats make querying more efficient), or send the raw data directly to S3
- Under Destination settings, select the S3 bucket where the streamed data is to be stored. Set a custom S3 bucket prefix under which the data is stored, and an error output prefix under which any errors that occur are logged
- Dynamic partitioning is a feature that can be enabled under Destination settings to partition the streaming data into multiple folders in the S3 bucket as per our requirement. This feature can be enabled only when creating a delivery stream and cannot be enabled on an existing one.
- We can set the S3 buffer limits with a buffer size and a buffer interval. Compression and encryption (for data records, plus server-side encryption) can also be enabled to reduce storage size and provide additional security.
- After selecting all the required settings, create the delivery stream; its status becomes Active once creation completes
- Now, go to the DynamoDB console and enable Kinesis Data Streams for the required tables
- Item modifications made to the DynamoDB table from this point on are captured and stored in S3
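The console steps above map roughly onto three API calls. The sketch below builds the request parameters as plain dictionaries so they can be reviewed before anything is created. The stream name, table name, ARNs, prefixes, and buffer values are all placeholders, and a single IAM role is assumed for both reading the stream and writing to S3. `apply_requests` shows how the dictionaries might be passed to `boto3`, but it is never called here.

```python
def build_requests(stream_name, stream_arn, table_name, bucket_arn, role_arn):
    """Return keyword arguments for the three calls, in the order of the steps."""
    create_stream = {
        "StreamName": stream_name,
        # On-demand capacity mode; provisioned mode would also need a ShardCount.
        "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
    }
    create_delivery_stream = {
        "DeliveryStreamName": f"{stream_name}-to-s3",
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            # Placeholder prefixes: data is grouped by arrival date, errors by type.
            "Prefix": "data/!{timestamp:yyyy/MM/dd}/",
            "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
            "CompressionFormat": "GZIP",
        },
    }
    enable_table_stream = {"TableName": table_name, "StreamArn": stream_arn}
    return create_stream, create_delivery_stream, enable_table_stream


def apply_requests(requests):
    """Send the requests to AWS. Not called in this sketch; needs credentials."""
    import boto3  # third-party dependency, imported lazily on purpose

    create_stream, create_delivery_stream, enable_table_stream = requests
    kinesis = boto3.client("kinesis")
    kinesis.create_stream(**create_stream)
    # Wait until the data stream exists before wiring anything to it.
    kinesis.get_waiter("stream_exists").wait(StreamName=create_stream["StreamName"])
    boto3.client("firehose").create_delivery_stream(**create_delivery_stream)
    boto3.client("dynamodb").enable_kinesis_streaming_destination(**enable_table_stream)


# Hypothetical names and ARNs, for illustration only.
requests = build_requests(
    stream_name="orders-changes",
    stream_arn="arn:aws:kinesis:us-east-1:123456789012:stream/orders-changes",
    table_name="Orders",
    bucket_arn="arn:aws:s3:::example-data-lake-bucket",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
)
for request in requests:
    print(sorted(request))
```

Keeping the parameters separate from the calls makes it easy to check that the delivery stream and the DynamoDB table point at the same data stream ARN, which is the link that ties the three resources together.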
Amazon Kinesis Data Streams and Kinesis Data Firehose can be combined to efficiently build a centralized data lake, which can be used for performing advanced analytics or for sending the data to Redshift for optimized querying. In addition, the data can be queried with Athena and visualized in QuickSight dashboards.
Because Kinesis is a managed service, AWS handles most of the administration, so developers can focus on their code rather than on managing the system. I hope this step-by-step guide has been useful to you.
7. About CloudThat
We here at CloudThat are an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Feel free to drop a comment with any queries you have regarding AWS Kinesis, AWS Firehose, managed services, or consulting, and we will get back to you quickly. To get started, go through our Expert Advisory page and Managed Services Package, both of which are CloudThat offerings.