Data Analytics Ecosystem – AWS & Azure (Part 1)

Posted on April 11, 2017 by Sankeerth Reddy | Comments(2)

In the past decade, technologies around data analytics and business intelligence have seen tremendous growth in number and reach. The emergence of the Cloud has undoubtedly been a catalyst for this growth. Amazon Web Services and Microsoft Azure have both been working towards offering services for collecting, uploading, storing and processing data. Below is an attempt to bring both ecosystems into a comparative study of their services for the various stages of Data Analytics.

In a broad sense, the lifecycle of data to be analyzed goes through the below stages:

1. Data Ingestion
2. Preservation of Original Data Source
3. LifeCycle Management and Cold Storage
4. Metadata Capture
5. Managing Governance, Security and Privacy
6. Self-Service Discovery, Search and Access
7. Managing Data Quality
8. Preparing for Analytics
9. Orchestration and Job Scheduling
10. Capturing Data Change

In this Part 1 of the blog, I will explore the first 5 stages and how both AWS and Azure serve each of them.

Data Ingestion

Both AWS and Azure provide REST support, so users only need to make HTTP(S) calls to upload data onto their Cloud. Azure offers a few connectors to migrate data to databases, but currently doesn’t offer any specialized services to perform

Continue reading…

MongoDB Monitoring Service – Installation and Set Up

Posted on May 21, 2014 by Sankeerth Reddy | Comments(0)

MongoDB Monitoring Service, or MMS, is a free monitoring application developed by the MongoDB team to manage and troubleshoot MongoDB deployments. Once set up correctly, it provides a range of metrics that can be very useful when troubleshooting production issues. MMS is also used by the MongoDB team to provide suggestions and optimization techniques. In this post, I will outline the installation steps and share a few tips and tricks for setting up MMS for one’s MongoDB cluster without spending a lot of time.

On the whole, there are two steps to set up MMS for a sharded cluster:

1: Install and start the monitoring agent on one of the nodes, preferably on a mongos machine, as it has access to all the nodes (Shards and Config Servers).
2: Add the nodes to the MMS Console for monitoring.

Task 1:

1. First, create an account on MMS and log in.
2. Get the monitoring agent installation instructions on the settings page.
3. Select the platform on which the MMS agent needs to be installed to get the corresponding instructions.
4. The API keys need to be updated in the configuration file after installing the agent. The

Continue reading…

5 Reasons why DynamoDB is better than MongoDB

Posted on April 29, 2014 by Bhavesh Goswami | Comments(5)

If you are considering MongoDB or any other NoSQL database, it’s a must that you consider DynamoDB. In the MongoDB vs DynamoDB matchup, DynamoDB has a lot of brilliant features that help ease the pain of running NoSQL clusters. Below I give five reasons to choose DynamoDB over MongoDB.

Reason 1: People don’t like being woken up in the middle of the night

One sure-shot way to motivate someone to rethink their priorities in life, and reconsider their choice of becoming an IT professional, is to hand them pager duty for a MongoDB cluster. Maintaining a MongoDB cluster requires keeping the servers up and running, keeping the MongoDB processes up and running, and monitoring the performance of the cluster. Check this image for example (times there are in UTC). In the middle of the night, a client’s MongoDB cluster generated a few automated CloudWatch alarms. At 4 AM, the conversation between a systems engineer and me went as follows:

Engineer: Hey, got woken up by the pager, seems like CPU utilization is spiking, but requests are running fine. I looked around but found nothing. Can I just resolve this issue and look at it tomorrow?
Me: You woke me up to just ask this?

Continue reading…

Sample Questions for MongoDB Certified DBA (C100DBA) exam – Part II

Posted on April 21, 2014 by Sankeerth Reddy | Comments(8)

Here are some more sample questions for C100DBA: MongoDB Certified DBA Associate Exam. Please give them a try; the answers are at the end of this blog post. If you have not yet attempted Part I of the sample questions, they are available here.

Section 1: Philosophy & Features:

1. Which of the following are valid JSON documents? Select all that apply.
a. {"name":"Fred Flintstone";"occupation":"Miner";"wife":"Wilma"}
b. {}
c. {"city":"New York", "population", 7999034, boros:{"queens", "manhattan", "staten island", "the bronx", "brooklyn"}}
d. {"a":1, "b":{"b":1, "c":"foo", "d":"bar", "e":[1,2,4]}}

Section 2: CRUD Operations:

1. Which of the following operators is used to update a document partially?
a. $update
b. $set
c. $project
d. $modify

Section 3: Aggregation Framework:

Questions 1 to 3

Below is a sample document of the “orders” collection:

{ cust_id: "abc123", ord_date: ISODate("2012-11-02T17:04:11.102Z"), status: 'A', price: 50, items: [ { sku: "xxx", qty: 25, price: 1 }, { sku: "yyy", qty: 25, price: 1 } ] }

Select operators for the below query to determine the sum of the “qty” fields associated with the orders for each “cust_id”.

db.orders.aggregate( [ { $OPR1: "$items" }, { $OPR2: { _id: "$cust_id", qty: { $OPR3: "$items.qty" } } } ] )

1. OPR1 is
a.

Continue reading…

Sample Questions for MongoDB Certified DBA (C100DBA) exam – Part I

Posted on April 5, 2014 by Sankeerth Reddy | Comments(20)

Below are some sample questions for C100DBA: MongoDB Certified DBA Associate Exam. You can read more about the MongoDB Certified DBA Exam here. Please give them a try; the answers are at the end of this blog post.

Section 1: Philosophy & Features:

1. Which of the following does MongoDB use to provide High Availability and fault tolerance?
a. Write Concern
b. Replication
c. Sharding
d. Indexing

2. Which of the following does MongoDB use to provide High Scalability?
a. Write Concern
b. Replication
c. Sharding
d. Indexing

Section 2: CRUD Operations:

1. Which of the following is a valid insert statement in MongoDB? Select all valid.
a. db.test.insert({x:2,y:"apple"})
b. db.test.push({x:2,y:"apple"})
c. db.test.insert({"x":2, "y":"apple"})
d. db.test.insert({x:2},{y:"apple"})

Section 3: Aggregation Framework:

1. Which of the following is true about the aggregation framework?
a. A single aggregation framework operator can be used more than once in a query
b. Each aggregation operator needs to return at least one or more documents as a result
c. Pipeline expressions are stateless except accumulator expressions used with the $group operator
d. The aggregate command operates on multiple collections

Section 4: Indexing:

Below is a sample document in a given collection test. { a

Continue reading…

1000 jobs for BigData Analytics posted in 1 week !!

Posted on February 20, 2014 by CloudThat | Comments(2)

I teach a BigData Analytics course in Bangalore, and I routinely check for jobs in this domain on the no. 1 job site in the country. You must be hearing about BigData, Cloud technologies and Analytics being the ‘hottest’ jobs of the century. Quoting Harvard Business Review, “Data Scientist: The Sexiest Job of the 21st Century.”

So who is a Data Scientist?

“It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data. The title has been around for only a few years. (It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.) But thousands of data scientists are already working at both start-ups and well-established companies. Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before. If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a

Continue reading…

Facebook Open Sources Presto SQL Query Engine

Posted on November 12, 2013 by Himanshu Sachdeva | Comments(0)

In June 2013, at the Analytics @ WebScale conference, Facebook announced Presto, which it was using internally to process petabytes of data. It has now been made open source, as per a recent post by Facebook Engineering.

So what is Presto? Hive, which was initially developed by Facebook, transforms a query into a chain of multiple MapReduce jobs. Presto is different: it does not use MapReduce and, according to Facebook, is 10 times faster than Hive for most queries. Presto allows querying data where it lives, including Hive, HBase, relational databases or even proprietary data stores. You can issue SQL-like queries on Presto that include left/right outer joins, subqueries and common aggregate functions. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

At Facebook, about 1000 employees use Presto internally to interactively query over a petabyte of data, running more than 30,000 queries a day. Currently it is also being used by leading internet companies including Airbnb and Dropbox.

You can find more about Presto here:

Presto Website
Facebook Blog about Presto
Gigaom Story
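To make the idea of one SQL query spanning several data sources more concrete, here is a minimal sketch. It does not use Presto at all: Python's built-in sqlite3 module stands in for the query engine, and two attached in-memory databases play the role of two separate sources. The database names (weblogs, crm), the tables and the sample rows are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Presto here. Each attached database
# plays the role of one independent data source (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS weblogs")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# One table per "source": page views in one, user profiles in the other.
conn.execute("CREATE TABLE weblogs.page_views (user_id INTEGER, url TEXT)")
conn.execute("CREATE TABLE crm.users (user_id INTEGER, country TEXT)")

conn.executemany(
    "INSERT INTO weblogs.page_views VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home"), (3, "/docs")],
)
conn.executemany(
    "INSERT INTO crm.users VALUES (?, ?)",
    [(1, "IN"), (2, "US")],
)

# A single query that joins across both sources with a left outer join
# and an aggregate. Views from users missing in crm.users are kept and
# counted under 'unknown'.
query = """
    SELECT COALESCE(u.country, 'unknown') AS country,
           COUNT(*) AS views
    FROM weblogs.page_views AS v
    LEFT OUTER JOIN crm.users AS u ON u.user_id = v.user_id
    GROUP BY COALESCE(u.country, 'unknown')
    ORDER BY country
"""
views_by_country = conn.execute(query).fetchall()
print(views_by_country)
```

In a real Presto deployment the two sources would be separate systems (for example Hive and a relational database) rather than two SQLite databases in one process, but the shape of the query, qualified table names joined in one statement, is the point of the sketch.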

Jobs – Big data and Analytics

Posted on August 20, 2013 by CloudThat | Comments(0)

A search on Naukri with the keywords Hadoop + analytics / Big data + analytics gave me some interesting insights into the type of jobs that are currently floating in the Indian market. My takeaways are as follows:

1. Job Description: this is very brief. E.g.:
Design and Implementation of Big data utilities for Business Analytics
Data mining and Explorative analytics for various Industry Benchmark Reports
2. Competency Required: contains terms such as technology understanding, logical thinking and ability to solve unstructured problems
3. Desirable Skill Sets:
Good understanding of Business Metrics and parameters.
Ability to apply the right statistical tools to a business case
Technical: Apache Hadoop, Pig, Hive, big data Domain Exp MANDATORY
4. Pay: Best in the industry, often 1.5 times that in pure-play analytics or Hadoop
5. Experience: 4+ years
6. Employer: Top MNC

Terms that crop up include:

Identify pilot opportunities
Drive early development
Build reference architecture and blueprints
Network extensively with the ecosystem, lead and participate in conferences and workshops, and establish thought leadership

So what is big data analytics? As a domain, it is very new and encompasses ETL, data visualisation and statistical analytics on ‘big data’. Big data

Continue reading…

Career opportunities in Cloud Computing and Big Data

Posted on August 14, 2013 by CloudThat | Comments(0)

“The global analytics market is expected to reach $25 billion by 2015 and the global cloud market is expected to be ~$675 billion by 2020. Indian IT players need to capitalize on its already well established IT/BPM market presence by increasing their services portfolio beyond the typical IT offerings.”
Social, Mobile, Analytics & Cloud – The Game Changers for the Indian IT Industry, June 2013, Dinodia Capital Advisors.

Organizations have realized the importance of Big Data and are now looking for ways to glean insights from it to their advantage. The large quantity, velocity and diversity of this data have given rise to the need for Data Scientists who are trained to analyze data at this magnitude.

Cloud Computing, on the other hand, uses the internet to provide software and hardware capacity to businesses through third-party vendors. Services involving the Cloud promise to reduce cost and complexity and to save time. Players in the IT sector are seriously looking towards offering their services through the cloud as a result of the continuous and growing deployment of businesses on the cloud. This has created opportunities for firms and individuals who

Continue reading…

Will Amazon’s Data Warehousing Solution – “Amazon Redshift” Change the Game?

Posted on August 1, 2013 by CloudThat | Comments(0)

Its core message comprises price, performance and simplicity, which AWS considers core principles for all its services. According to AWS, “Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.” It is hard to believe that a product operating at that scale and billing itself as “economical” works, and works well. But initial responses have been overwhelmingly positive. Within a few months the numbers have crossed the thousand mark. There must certainly be more than meets the eye here, and certainly something to find out more about. If you are hungry for proof, then you will find this interesting. Click here