V’s of Big Data
The V’s of big data name the various problems that we face when working with data.
Volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage and a distributed approach to querying. At its core, Hadoop is a platform for distributing computing problems across a number of servers.
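The idea of distributing a computing problem across servers can be sketched in plain Python. This is not Hadoop's API; it is a minimal illustration that assumes the data is already split into partitions (the `partitions` list and both function names are hypothetical), with a map step that runs per partition and a reduce step that merges the partial results.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Map step: count words within one partition (one 'server')."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Hypothetical data, split as if it were stored on three servers.
partitions = [
    ["big data is big"],
    ["data flows in"],
    ["big volume"],
]

# On a real cluster each map call would run on a different machine.
partial_counts = [map_partition(p) for p in partitions]
total = reduce(merge_counts, partial_counts, Counter())
print(total["big"], total["data"])
```

Because each partition is counted independently, adding servers lets the map step scale with the data; only the small partial counts need to travel over the network.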
Velocity refers to the high rate of data and information flowing into and out of our systems, in real time.
Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analyzing data.
Veracity refers to having necessary and sufficient data to test many different hypotheses, vast training samples for rich micro-scale model-building and model validation, and micro-grained “truth” about every object in your data collection, thereby empowering “whole-population analytics”.
Value starts and ends with the business use case. The business must define the analytic application of the data and its potential associated value to the business. Use cases are important both to define initial “Big Data” pilot justification and to build a road map for transformation.
Visualization is the hard part of big data: making that vast amount of data comprehensible in a manner that is easy to understand and read. With the right analyses and visualizations, raw data can be put to use; otherwise it remains essentially useless. Visualizations, of course, do not mean ordinary graphs or pie charts. They mean complex graphs that can include many variables of data while still remaining understandable and readable.
Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data you need to determine at what point data is no longer relevant to the current analysis.
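A relevance cutoff of this kind can be sketched as a simple age filter. The record layout, the `still_relevant` helper and the one-day window below are assumptions chosen for illustration; the right window depends entirely on the analysis.

```python
from datetime import datetime, timedelta

def still_relevant(records, now, max_age):
    """Keep records whose timestamp is within max_age of now."""
    cutoff = now - max_age
    return [r for r in records if r["timestamp"] >= cutoff]

# Hypothetical records with made-up timestamps.
now = datetime(2024, 1, 10, 12, 0)
records = [
    {"id": 1, "timestamp": datetime(2024, 1, 10, 11, 30)},
    {"id": 2, "timestamp": datetime(2024, 1, 9, 12, 0)},
    {"id": 3, "timestamp": datetime(2024, 1, 1, 0, 0)},
]

recent = still_relevant(records, now, timedelta(days=1))
print([r["id"] for r in recent])
```

The same cutoff could also drive a storage policy, for example archiving or deleting whatever the filter rejects.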
Variability is often confused with variety. Say you have a bakery that sells 10 different breads. That is variety. Now imagine you go to that bakery three days in a row and every day you buy the same type of bread, but each day it tastes and smells different. That is variability.
Variability is thus very relevant in performing sentiment analysis. Variability means that the meaning is changing (rapidly). In (almost) the same tweets a word can have a totally different meaning. In order to perform a proper sentiment analysis, algorithms need to be able to understand the context and decipher the exact meaning of a word in that context. This is still very difficult.
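A toy example makes the problem concrete. The lexicons and the `score` function below are hypothetical and far simpler than any real sentiment model; they only show how the same word can contribute opposite polarity depending on the words around it.

```python
# Hypothetical lexicon: polarity of a word given a context word in the same text.
CONTEXT_POLARITY = {
    "sick": {"beat": 1, "track": 1, "feeling": -1, "flu": -1},
}
# Hypothetical fallback polarity when no context word is present.
DEFAULT_POLARITY = {"sick": -1, "great": 1, "awful": -1}

def score(text):
    """Sum word polarities, letting context override the default."""
    words = text.lower().split()
    total = 0
    for word in words:
        context_rules = CONTEXT_POLARITY.get(word, {})
        hits = [p for ctx, p in context_rules.items() if ctx in words]
        if hits:
            total += hits[0]  # first matching context rule wins
        else:
            total += DEFAULT_POLARITY.get(word, 0)
    return total

print(score("that beat is sick"))
print(score("feeling sick today"))
```

A context-free lexicon would score both sentences the same way; the hand-written rules here stand in for the context understanding that real algorithms have to learn.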
Viability and value have been proposed as distinct missing V’s. Biehn’s take on viability is similar to Press’s. “We want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses.”
Venue refers to distributed, heterogeneous data from multiple platforms and from different owners’ systems, with different access and formatting requirements, on private vs. public cloud.
Vocabulary refers to the schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.
Vagueness refers to confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.).
Validity refers to data quality, governance, and master data management (MDM) on massive, diverse, distributed, heterogeneous, “unclean” data collections.