
Data Lake Solution for Live Streaming Text & Audio Data with AWS Deep-Learning/ML Services
October 9, 2022
AWS Solution Overview:
The architecture diagram below illustrates the high-level infrastructure components of the customer’s production environment.
Data Engineering Pipeline for Text and Audio Data
This solution built a data engineering pipeline for the JSON text data and the audio data coming from the source.
The pipeline performs five steps on the source data:
- Source text data from the customer application is first stored in a central data lake on Amazon S3.
- An ETL operation cleans the raw data.
- A querying mechanism extracts the important data from the cleaned data.
- The audio data is transcribed into SRT format and translated into the desired language.
- The transcribed and translated output is stored back in the central data store so that further features can be built on it.
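The transcription and translation steps above can be sketched roughly as follows. This is a minimal illustration, not the customer's actual implementation: the helper names (build_transcribe_request, translate_srt, run_with_aws), the language codes, and the choice to translate one subtitle block at a time are assumptions. The boto3 calls used (start_transcription_job with its Subtitles parameter, and translate_text) are existing APIs, and they are only invoked inside run_with_aws, which needs AWS credentials.

```python
import re
from typing import Callable


def build_transcribe_request(job_name: str, media_uri: str,
                             output_bucket: str,
                             language_code: str = "en-US") -> dict:
    """Build keyword arguments for Transcribe's start_transcription_job.

    The Subtitles setting asks Transcribe to also emit an .srt file.
    """
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,          # assumed source language
        "Media": {"MediaFileUri": media_uri},   # e.g. an s3:// URI
        "OutputBucketName": output_bucket,
        "Subtitles": {"Formats": ["srt"], "OutputStartIndex": 1},
    }


def translate_srt(srt_text: str, translate: Callable[[str], str]) -> str:
    """Translate only the caption text of an SRT file.

    Cue numbers and timestamps are kept untouched so the subtitles stay
    in sync with the audio. `translate` is any str -> str function.
    """
    normalized = srt_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", normalized)
    translated_blocks = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) < 3:
            # Not a well-formed cue (index, timing, text); pass it through.
            translated_blocks.append(block)
            continue
        index, timing, text_lines = lines[0], lines[1], lines[2:]
        translated = translate(" ".join(text_lines))
        translated_blocks.append("\n".join([index, timing, translated]))
    return "\n\n".join(translated_blocks) + "\n"


def run_with_aws(job_name: str, media_uri: str, output_bucket: str,
                 srt_text: str, target_language: str = "es") -> str:
    """Hypothetical wiring to the real services; requires AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        **build_transcribe_request(job_name, media_uri, output_bucket))

    translate_client = boto3.client("translate")

    def aws_translate(text: str) -> str:
        response = translate_client.translate_text(
            Text=text,
            SourceLanguageCode="en",
            TargetLanguageCode=target_language,
        )
        return response["TranslatedText"]

    # In practice srt_text would be read from S3 once the job has finished.
    return translate_srt(srt_text, aws_translate)
```

Keeping translate_srt independent of boto3 means the subtitle handling can be checked locally with any stand-in function, for example translate_srt(srt_text, str.upper), before real AWS calls are wired in.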
AWS Services Leveraged:
- Amazon API Gateway
- Amazon DynamoDB
- Amazon S3
- AWS Lambda
- Amazon Transcribe
- Amazon Translate
- Amazon CloudFront
- AWS Glue (crawler, Data Catalog, and ETL jobs)
- Amazon Athena
- Amazon Kinesis Data Streams
- Amazon Kinesis Data Firehose
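The write-up does not describe how these services are wired together. As one plausible reading, assuming a Lambda function behind API Gateway forwards each raw JSON event to Kinesis Data Streams and a later ETL job cleans the stored records, the ingestion and cleaning logic might look like the sketch below. The function names, the stream name, the user_id partition field, and the cleaning rules are all hypothetical; put_record is the existing Kinesis API call, and it is used only inside send_event.

```python
import json


def build_put_record_request(event: dict, stream_name: str,
                             partition_field: str = "user_id") -> dict:
    """Build keyword arguments for the Kinesis put_record call.

    Each event is serialized as one line of JSON so that downstream
    readers can treat the delivered objects as newline-delimited JSON.
    """
    payload = json.dumps(event, sort_keys=True) + "\n"
    return {
        "StreamName": stream_name,
        "Data": payload.encode("utf-8"),
        # Events with the same key land on the same shard, in order.
        "PartitionKey": str(event.get(partition_field, "unknown")),
    }


def clean_record(raw: dict) -> dict:
    """Stand-in for the ETL cleaning step (assumed rules, not the author's).

    Trims whitespace from string values and drops empty or null fields.
    """
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue
        cleaned[key] = value
    return cleaned


def send_event(event: dict, stream_name: str) -> dict:
    """Hypothetical Lambda-side helper; requires AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    kinesis = boto3.client("kinesis")
    # The raw event is sent unchanged; cleaning happens later in ETL.
    return kinesis.put_record(**build_put_record_request(event, stream_name))
```

In the real solution the cleaning would presumably run inside an AWS Glue ETL job instead of plain Python; the function here only illustrates the kind of rule such a job might apply.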
Solution Outcome:
- The data-driven architecture supported new feature releases for the customer's application, which saw more customer registrations and a better customer experience.
- The Amazon Transcribe and Amazon Translate integration helped attract more customers from different language backgrounds.