Data Lake Solution for Live Streaming Text & Audio Data with AWS Deep-Learning/ML Services

September 12, 2022

AWS Solution Overview:

The architecture diagram below illustrates the high-level infrastructure components of the customer’s production environment.

[Figure: architecture diagram]

Data Engineering Pipeline for Text and Audio Data

This solution helped build a data engineering pipeline for the JSON text data and audio data arriving from the source.

The pipeline performs five steps on the source data:

  • Source text data from the customer application is first stored in a central data lake on Amazon S3.
  • An ETL operation cleans the raw data.
  • A querying mechanism is set up to extract important data from the cleaned data.
  • The audio data is transcribed into SRT format and translated into the desired language.
  • This formatted audio data is stored back in a central data store for further feature development.
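
As a rough illustration of the cleaning, transcription, and translation steps, the following Python sketch shows one way they might look. It is not the customer's implementation: the field names (user_id, message), bucket names, and job name are hypothetical, and clean_records is only a local stand-in for whatever the AWS Glue ETL job actually does. The two builder functions only assemble request parameters; in a real deployment these dictionaries would presumably be passed to boto3's Transcribe start_transcription_job and Translate translate_text calls.

```python
import json
from typing import Any, Dict, Iterable, List


def clean_records(raw_lines: Iterable[str], required: List[str]) -> List[Dict[str, Any]]:
    """Parse JSON lines, dropping malformed rows and rows missing required fields.

    A simplified stand-in for the ETL cleaning step (assumption, not the real job).
    """
    cleaned = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON
        if not isinstance(record, dict):
            continue  # only JSON objects are treated as records
        if any(record.get(field) in (None, "") for field in required):
            continue  # skip rows with missing or empty required fields
        # Trim surrounding whitespace on string values
        cleaned.append(
            {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        )
    return cleaned


def build_transcribe_request(
    job_name: str, media_uri: str, output_bucket: str, language_code: str = "en-US"
) -> Dict[str, Any]:
    """Build parameters for a Transcribe job that also emits SRT subtitles."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
        "Subtitles": {"Formats": ["srt"]},
    }


def build_translate_request(text: str, source_lang: str, target_lang: str) -> Dict[str, str]:
    """Build parameters for translating one piece of transcribed text."""
    return {
        "Text": text,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCode": target_lang,
    }


if __name__ == "__main__":
    # Hypothetical sample input: one valid row, one malformed row, one incomplete row
    sample = [
        '{"user_id": " u1 ", "message": "hello"}',
        "not json",
        '{"user_id": "", "message": "dropped"}',
    ]
    print(clean_records(sample, ["user_id", "message"]))
    print(build_transcribe_request(
        "example-job", "s3://example-audio-bucket/clip.mp3", "example-output-bucket"
    ))
    print(build_translate_request("hello", "en", "es"))
```

Keeping the request construction separate from the AWS calls makes each step easy to check locally before wiring it to Lambda or another trigger.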

AWS Services Leveraged:

  • Amazon API Gateway
  • Amazon DynamoDB
  • Amazon S3
  • AWS Lambda
  • Amazon Transcribe
  • Amazon Translate
  • Amazon CloudFront
  • AWS Glue crawlers, Data Catalog, and ETL jobs
  • Amazon Athena
  • Amazon Kinesis Data Streams
  • Amazon Kinesis Data Firehose

Solution Outcome:

  • The data-driven architecture supported new feature releases for the customer’s application, along with more customer registrations and a better customer experience.
  • The Amazon Transcribe and Amazon Translate integration helped attract more customers from different language backgrounds.
