What is Big Data? It is definitely the most talked-about ‘new kid on the block’ in the analytics fraternity. Everyone seems to be discussing it. So what exactly is Big Data?
Big Data consists of data sets that grow so large that they become difficult or awkward to work with using existing database management systems (Oracle, Sybase, MySQL, Teradata etc.). The difficulties include capturing, storing, searching, sharing and mining the data for analytics. With an explosion in sources of data (internet forms, cookies, sensors, mobile applications, satellite data etc.), the quantum of data is growing and will continue to grow at an astronomical pace. The cost of storing data is falling steeply too. A 4 GB pen drive now costs about 10% of what it did a couple of years ago. Together, these two trends will fuel the growth in the quantum of data that we will have access to.
The world’s technological per capita capacity to store information has roughly doubled every 40 months (a little over 3 years) since the 1980s. Some people say it is now going to double every 1.5 years. Every day, 2.5 quintillion bytes of data are created.
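To get a feel for what these doubling times mean, here is a small Python sketch of the arithmetic. The function names are my own, and the only inputs are the two doubling periods quoted above (40 months and 1.5 years); it is a back-of-the-envelope illustration, not a forecast.

```python
def annual_growth_rate(doubling_months):
    """Yearly growth rate implied by a given doubling time in months."""
    return 2 ** (12 / doubling_months) - 1


def growth_factor(doubling_months, years):
    """How many times larger the quantity becomes after `years` years."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    # The two doubling periods quoted in the text: 40 months and 18 months.
    for months in (40, 18):
        rate = annual_growth_rate(months)
        factor = growth_factor(months, 10)
        print(f"Doubling every {months} months: "
              f"{rate:.1%} per year, x{factor:.0f} over 10 years")
```

A 40-month doubling time works out to roughly 23% growth a year, while an 18-month doubling time means close to 59% a year, which is why the shorter figure changes the picture so much over a decade.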
As you can imagine, most of these new sources produce non-relational data, which is stored in non-relational database systems. Transactional data within an organization will, by necessity, stay in the traditional relational database, while this ‘other’ data is where the maximum growth will be. This ‘other’ data will need to be mined and fed into MIS and reports, analyzed for trends and used to build probability equations.
This ‘other’ data is generally called Big Data.
As systems and processes stabilize and mature on capture and storage of Big Data, the focus is shifting to ‘WHAT NEXT’?
Logically, the next step is to mine the data for information: business intelligence and analytics.
I will specifically look at Apache Hadoop in this context. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
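To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are my own illustration of the three stages, using the classic word-count example. On a real cluster the framework would run the map and reduce steps on many nodes at once and handle the grouping in between.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all the counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data about data"]
    print(word_count(sample))
```

The point of the split is that each mapper only ever sees its own slice of the input and each reducer only sees one key, so the work can be spread over as many machines as are available.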
Many vendors have caught the Hadoop bug and released their own versions of the software, such as Cloudera, Hortonworks and Microsoft (with HDInsight).
Figure: Cloudera’s Hadoop schematic diagram
Sounds simple, but for a data analyst with no Java coding skills, it is all Greek and Latin. Until you take a look at Pig, a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the programming away from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for RDBMS systems. A high-level language reads much closer to natural English, which makes it far friendlier for the non-coder.
Wow... this makes life so much better for data analysts, and it lets us analysts look forward to many more projects where we will effectively crunch ‘Big Data’.
One piece of software that competes with Hadoop is Google’s BigQuery. The comparison between these two giants is a story for another day. But in the real world out there, Hadoop is the current favorite Big Data management and analysis system.
Interesting aside: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in the Java programming language. It was created by Doug Cutting and Mike Cafarella, and Doug named it after his son’s toy elephant! Many components in the Hadoop ecosystem, Pig among them, have names you would commonly find in a zoo.
Since 2002, Subhashini has built a decade of experience across analytics roles in Retail Finance and Banking. These roles have spanned Risk Management, Collections Strategy, Fraud Control and Marketing at GE Money, Standard Chartered Bank, Tata Motors Finance and Citi GDM. Her area of interest is the integration of analytics results and outputs with business decisions, both tactics and strategy.
She is currently active in the Analytics Training and Consulting arena.
(Link to LinkedIn profile – http://in.linkedin.com/pub/subhashini-s-tripathi/3/405/77b)