Data Analytics Ecosystem – AWS & Azure (Part 1)

April 11, 2017

In the past decade, technologies around data analytics and business intelligence have seen tremendous growth in number and reach. The emergence of the cloud has undoubtedly been a catalyst for this growth. Amazon Web Services and Microsoft Azure have both been building services for collecting, uploading, storing and processing data. Below is an attempt to bring both ecosystems into a comparative study of their services for the various stages of data analytics.

In a broad sense, the lifecycle of data to be analyzed goes through the following stages:

  1. Data Ingestion
  2. Preservation of Original Data Source
  3. LifeCycle Management and Cold Storage
  4. Metadata Capture
  5. Managing Governance, Security and Privacy
  6. Self-Service Discovery, Search and Access
  7. Managing Data Quality
  8. Preparing for Analytics
  9. Orchestration and Job Scheduling
  10. Capturing Data Change

In this Part 1 of the blog, I will explore the first five stages of data and how both AWS and Azure serve each of them.

Data Ingestion

Both AWS and Azure provide REST support, so users only need to perform HTTP(S) calls to upload data to their cloud. Azure offers a few connectors to migrate data to databases, but currently doesn't offer any specialized services to perform data ingestion onto Azure resources. AWS, on the other hand, has supported the ingestion stage for quite a while.
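As a sketch of what "just an HTTP(S) call" means in practice, the snippet below assembles the pieces of a plain REST PUT against S3's object endpoint. The bucket name and key are hypothetical placeholders, and request signing is deliberately omitted for brevity.

```python
# Hedged sketch: both clouds expose their object stores over plain REST,
# so an upload is just an HTTP PUT. Bucket/key below are placeholders.

def build_s3_put_request(bucket: str, key: str, body: bytes) -> dict:
    """Assemble the parts of an S3 REST PUT (Signature v4 signing omitted)."""
    return {
        "method": "PUT",
        "url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "headers": {"Content-Length": str(len(body))},
        "body": body,
    }

req = build_s3_put_request("example-bucket", "raw/events.json", b'{"id": 1}')
# An HTTP client (e.g. requests.put) would then send this request with
# an AWS Signature v4 Authorization header attached; Azure Blob Storage
# accepts an analogous PUT against its own endpoint.
```

The same shape of request works for Azure Blob Storage, with the endpoint and authentication scheme swapped out.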

AWS began with Data Pipeline, a service that can be used to perform and schedule data transformation and loading into multiple AWS data storage solutions. Data Pipeline can move data from an original data source such as S3 or RDS to an analysis environment such as Redshift or EMR.
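To make the S3-to-Redshift movement concrete, here is a heavily hedged sketch of the shape of a Data Pipeline definition. All ids, paths and table names are illustrative placeholders, and a real definition also needs schedule, IAM-role and compute-resource objects that are omitted here.

```python
import json

# Hedged sketch of a Data Pipeline definition that copies data from an
# S3 location into a Redshift table. Every value below is a placeholder;
# schedule, roles and resources are omitted for brevity.
pipeline_definition = {
    "objects": [
        {"id": "S3Input", "name": "S3Input", "type": "S3DataNode",
         "directoryPath": "s3://example-bucket/raw/"},
        {"id": "RedshiftOutput", "name": "RedshiftOutput",
         "type": "RedshiftDataNode", "tableName": "events"},
        {"id": "CopyToRedshift", "name": "CopyToRedshift",
         "type": "RedshiftCopyActivity",
         "input": {"ref": "S3Input"},      # read from the S3 node
         "output": {"ref": "RedshiftOutput"}},  # write to the Redshift node
    ]
}
print(json.dumps(pipeline_definition, indent=2))
```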

Later, AWS introduced Kinesis, which has become very popular for its real-time data streaming capabilities. Many organizations collecting and analyzing sensor or IoT data use Kinesis to perform data ingestion into storage solutions.
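A producer writing a sensor reading into a stream boils down to a single PutRecord call. The sketch below builds the arguments for that call; the stream name and device fields are hypothetical, and the actual network call is left as a comment.

```python
import json

# Hedged sketch: the arguments an IoT producer would pass to Kinesis'
# PutRecord API. Stream and device names are placeholders.
sensor_reading = {"device_id": "sensor-42", "temp_c": 21.5}

record_args = {
    "StreamName": "sensor-stream",                       # placeholder stream
    "Data": json.dumps(sensor_reading).encode("utf-8"),  # payload as bytes
    "PartitionKey": sensor_reading["device_id"],         # routes to a shard
}
# With boto3, the call would then be:
# boto3.client("kinesis").put_record(**record_args)
```

The partition key matters: records sharing a key land on the same shard, so keying by device id keeps each device's readings ordered.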

Preserving Original Data Source

Both Azure and AWS have been very prudent in providing secure and durable solutions to store and preserve data. In many cases, neither AWS nor Azure charges customers to upload data onto their resources.

S3 and Blob Storage, the object store solutions of AWS and Azure respectively, offer very high durability. Collectively, they store billions of objects.

Along with object stores, both offer database solutions, SQL as well as NoSQL. While Amazon RDS is the SQL data storage solution on AWS, Azure offers virtual machines running SQL Server as well as managed databases with elastic pools to store SQL data. In the NoSQL space, both AWS and Azure offer document databases, namely DynamoDB and DocumentDB respectively.
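To show what writing a document into DynamoDB looks like, the sketch below converts a plain Python dict into DynamoDB's low-level attribute-value format used by PutItem. The table and field names are illustrative, and the boto3 call itself is left as a comment.

```python
# Hedged sketch of mapping a JSON-like document to DynamoDB's low-level
# attribute-value format. Field names are placeholders.
def to_dynamodb_item(doc: dict) -> dict:
    item = {}
    for key, value in doc.items():
        if isinstance(value, bool):          # check bool before int:
            item[key] = {"BOOL": value}      # bool is a subclass of int
        elif isinstance(value, (int, float)):
            item[key] = {"N": str(value)}    # numbers travel as strings
        else:
            item[key] = {"S": str(value)}
    return item

item = to_dynamodb_item({"user_id": "u1", "score": 42, "active": True})
# boto3.client("dynamodb").put_item(TableName="users", Item=item)
```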

LifeCycle Management and Cold Storage

From the moment data enters the analytics ecosystem of either provider until the processed information is delivered to the stakeholder, it goes through multiple extractions and transformations. Finally, the source data may need to be stored for compliance reasons or for re-use, and cold storage plays a vital role in archiving that data.

In the cold storage area, AWS provides Glacier, a highly durable solution that is now very widely used. In 2016, Azure introduced Cool Blob Storage, which is likewise a cost-effective and highly durable archival solution.
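On AWS, archiving to Glacier is usually automated with an S3 lifecycle rule rather than done by hand. The sketch below builds such a rule; the bucket, prefix and retention periods are placeholders chosen for illustration.

```python
# Hedged sketch of an S3 lifecycle rule that moves aging source data to
# Glacier and eventually expires it. Prefix and day counts are placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-raw-data",
            "Filter": {"Prefix": "raw/"},    # only apply to the raw/ prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"}  # archive after 90 days
            ],
            "Expiration": {"Days": 2555},    # delete after ~7 years
        }
    ]
}
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",
#     LifecycleConfiguration=lifecycle_configuration)
```

On Azure, the equivalent move is setting a blob's access tier to Cool (per blob or as the storage account default) rather than a declarative lifecycle rule.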

Metadata Capture

Information about the data, and the transformations it goes through over its lifecycle, also needs to be captured, either as descriptive metadata or as a record of the API calls made against the data. While both providers give options to attach metadata to stored objects as simple key/value pairs, neither provides an extensive mechanism to update and store custom metadata. Tracking transformations at the data level still needs work from both providers, but API tracking is available on both, namely CloudTrail on AWS and API Management on Azure.
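The key/value metadata both stores support is attached at upload time, as sketched below for S3. The bucket, key and metadata names are placeholders; the same idea applies to Blob Storage, which surfaces metadata as `x-ms-meta-*` headers.

```python
# Hedged sketch: attaching simple key/value metadata to an object at
# upload time. S3 surfaces these as x-amz-meta-* headers; Blob Storage
# uses x-ms-meta-* headers. All names below are placeholders.
put_args = {
    "Bucket": "example-bucket",
    "Key": "raw/events.json",
    "Body": b'{"id": 1}',
    "Metadata": {                       # user-defined key/value pairs
        "source-system": "clickstream",
        "ingested-by": "nightly-load",
    },
}
# boto3.client("s3").put_object(**put_args)
```

The limitation mentioned above shows up here: changing this metadata later means rewriting the object (a copy onto itself), since neither store offers in-place custom-metadata updates.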

Managing Governance, Security and Privacy

In this area, AWS currently has the upper hand, and Azure is yet to catch up. With the help of JSON-based security policies, AWS offers a lot of flexibility in controlling access to, and security of, data under transformation. IAM policies and resource-based permissions, combined with CloudTrail, enable administrators and auditors to control and track data security and privacy. Using Key Management Service and CloudHSM, AWS also gives users the flexibility to control access to encryption keys. With AWS Certificate Manager, one can obtain SSL certificates for ELBs and CloudFront distributions without any prior knowledge of SSL. Data at rest and data in transit within AWS are both well integrated to ensure end-to-end data security during the various transformations.
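The JSON-based flexibility mentioned above looks like the sketch below: a minimal read-only policy scoped to one bucket prefix. The bucket name and prefix are hypothetical placeholders.

```python
import json

# Hedged sketch of a JSON IAM policy granting read-only access to a
# single bucket prefix. ARNs and the prefix are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],  # read-only actions
            "Resource": [
                "arn:aws:s3:::example-bucket",        # for ListBucket
                "arn:aws:s3:::example-bucket/raw/*",  # for GetObject
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Because policies are just documents like this, they can be combined, condition-scoped and attached to users, roles or resources, which is where the flexibility over Azure's fixed RBAC roles comes from.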

Azure, on the other hand, offers Role-Based Access Control to authorize personnel and APIs to resources on Azure. Although RBAC roles are easy to implement, they lack the flexibility one might expect. Azure does offer flexibility in terms of storage-account-level security, data-plane security, shared access signatures and server-side encryption for data at rest. There is also SSL support for data in transit, although the certificates cannot be generated on Azure itself. Azure Key Vault is an HSM-backed service to store and manage encryption keys, and the API Management module enables auditors to keep track of API activity on the data.

In Part 2 of this blog, I will discuss the remaining five stages of analytics from both the AWS and Azure perspectives. Please feel free to leave your comments, and watch this space for Part 2.

 


2 Responses to “Data Analytics Ecosystem – AWS & Azure (Part 1)”

  1. Kapil

    The most exciting one is the metadata information digging. It is one thing which needs full attention to get the accurate results.

    • Sankeerth Reddy

      Absolutely Kapil. I will be pondering on Metadata analysis under Search and Access and Managing Data Quality.

