Data Pre-Processing Using SageMaker Data Wrangler – Part 1

November 17, 2022

TABLE OF CONTENTS

1. Introduction
2. What is Amazon SageMaker Data Wrangler?
3. Amazon SageMaker Data Wrangler Core Functionalities
4. Data Transformation using SageMaker Data Wrangler
5. Conclusion
6. About CloudThat
7. FAQs

Introduction

As pipelines ingest a growing volume and variety of data from multiple sources, the preprocessing steps needed to manage that data become harder to maintain. To handle these steps, Amazon SageMaker provides a data preparation feature called SageMaker Data Wrangler. With Data Wrangler, we can process large amounts of data within the pipeline itself; we only need to set up the flow of preprocessing steps inside the Data Wrangler service.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. We can integrate a Data Wrangler data preparation flow into our machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering with little to no coding. We can also add our own Python scripts and transformations to customize workflows.

Amazon SageMaker Data Wrangler Core Functionalities

  • Import – We can connect to and import data from multiple sources like Amazon Simple Storage Service (Amazon S3), Amazon Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
  • Data Flow – We can create a data flow to define a series of ML data prep steps. We can use a flow to combine datasets from different data sources, identify the number and types of transformations we want to apply to the datasets, and define a data prep workflow that can be integrated into an ML pipeline.
  • Transform – We can clean and transform our dataset using standard transforms like string, vector, and numeric data formatting tools. We can also featurize our data using transforms like text and date/time embedding and categorical encoding.
  • Generate Data Insights – We can automatically verify data quality and detect anomalies in our data with the Data Wrangler Data Quality and Insights Report.
  • Analyze – Using Data Wrangler, we can analyze features in our dataset at any point in our flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation.
  • Export – We can export our data preparation workflow to a different location. The following are example locations:
    • Amazon Simple Storage Service (Amazon S3) bucket
    • Amazon SageMaker Model Building Pipelines – Use SageMaker Pipelines to automate model deployment. We can export the data that we have transformed directly to the pipelines.
    • Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
    • Python script – Store the data and their transformations in a Python script for our custom workflows.
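
To make the Data Flow idea above more concrete, here is a minimal sketch in plain Python of a flow as an ordered series of data prep steps, where the output of each step feeds the next. This is only an illustration of the concept, not Data Wrangler's API: the names run_flow, drop_missing, and scale_min_max, as well as the sample records, are hypothetical and use only the Python standard library.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_missing(column: str) -> Step:
    """Return a step that drops rows where `column` is missing (None)."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(column) is not None]
    return step


def scale_min_max(column: str) -> Step:
    """Return a step that rescales a numeric column to the range [0, 1]."""
    def step(rows: List[Row]) -> List[Row]:
        if not rows:
            return rows
        values = [r[column] for r in rows]
        low, high = min(values), max(values)
        span = (high - low) or 1  # avoid division by zero for constant columns
        return [{**r, column: (r[column] - low) / span} for r in rows]
    return step


def run_flow(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical sample data standing in for an imported dataset.
raw = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
    {"id": 4, "age": 30},
]

# The "flow" is just the ordered list of steps.
flow = [drop_missing("age"), scale_min_max("age")]
prepared = run_flow(raw, flow)
print(prepared)
```

Because the flow is only an ordered list of steps, the same definition can be reapplied to new data, which is the property that makes a data prep workflow easy to plug into an ML pipeline.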

Data Transformation using SageMaker Data Wrangler

The SageMaker Data Wrangler Transform feature provides multiple functions to transform data. Here are some examples:

  • Join Datasets – We can join multiple datasets using the join operation.
  • Balance Data – We can handle imbalanced datasets using different sampling techniques.
  • Custom Transforms – With custom transforms, we can use Python (User-Defined Function), PySpark, Pandas, or PySpark (SQL) to define custom transformations.
  • Custom Formula – We can use a custom formula to define a new column using a Spark SQL expression that queries data in the current data frame.
  • Encode Categorical – We can encode categorical features in the flow.
  • Featurize Text – We can use the Featurize Text transform group to inspect string-typed columns and use text embedding to featurize these columns.
  • Transform Time Series – We can transform time series data in the pipeline.
  • Handle Outliers – We can handle outliers in the pipeline using Data Wrangler.
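
To show what a few of the transforms listed above compute, here is a rough sketch in plain Python of a join, outlier clipping, one-hot categorical encoding, and random oversampling for imbalanced data. It is not Data Wrangler code and does not reflect how the service implements these transforms. All function names and the sample customers/usage records are hypothetical, and the specific techniques (an inner join, clipping to 1.5 times the interquartile range, and random oversampling) are assumptions chosen only for illustration.

```python
import random
from collections import Counter
from statistics import quantiles
from typing import Dict, List

Row = Dict[str, object]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join two lists of rows on a shared key column (inner join)."""
    index: Dict[object, List[Row]] = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def clip_outliers(rows: List[Row], column: str, k: float = 1.5) -> List[Row]:
    """Clip numeric values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([r[column] for r in rows], n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, column: min(max(r[column], low), high)} for r in rows]


def one_hot_encode(rows: List[Row], column: str) -> List[Row]:
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = int(r[column] == c)
        encoded.append(new_row)
    return encoded


def random_oversample(rows: List[Row], label: str, seed: int = 0) -> List[Row]:
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(r[label] for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for cls, n in counts.items():
        pool = [r for r in rows if r[label] == cls]
        balanced.extend(dict(rng.choice(pool)) for _ in range(target - n))
    return balanced


# Hypothetical sample data: two datasets sharing an "id" column.
customers = [
    {"id": 1, "plan": "basic"}, {"id": 2, "plan": "pro"},
    {"id": 3, "plan": "basic"}, {"id": 4, "plan": "pro"},
    {"id": 5, "plan": "basic"},
]
usage = [
    {"id": 1, "minutes": 10, "churned": 0}, {"id": 2, "minutes": 12, "churned": 0},
    {"id": 3, "minutes": 11, "churned": 0}, {"id": 4, "minutes": 13, "churned": 1},
    {"id": 5, "minutes": 500, "churned": 0},  # 500 is an obvious outlier
]

joined = inner_join(customers, usage, key="id")
clipped = clip_outliers(joined, "minutes")
encoded = one_hot_encode(clipped, "plan")
balanced = random_oversample(encoded, "churned")
print(Counter(r["churned"] for r in balanced))
```

In Data Wrangler itself these steps are configured through the Transform feature rather than written by hand; the sketch is only meant to clarify what each kind of transform does to the data.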

Conclusion

Amazon SageMaker Data Wrangler helps preprocess data within the pipeline. It combines, in a single service, data integrity during preprocessing with transformations and feature engineering steps such as handling missing values, dealing with imbalanced data, and handling outliers, all in the pipeline itself. Data Wrangler is available in SageMaker Studio, and we can use these features in real-time MLOps projects for the preprocessing stage and for loading the data into a data warehouse.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

Drop a query if you have any questions regarding SageMaker, and I will get back to you quickly.

To get started, go through our Consultancy page and Managed Services Package, which are CloudThat's offerings.

FAQs

1. Can SageMaker Data Wrangler handle real-time data?

A. Yes, Data Wrangler can be used to transform data in flight.

2. Can we integrate SageMaker Data Wrangler into SageMaker Pipelines?

A. Yes, we can integrate it with SageMaker Pipelines to maintain the whole flow for an MLOps project.

