What is Big Data and how is it Useful?
As the name implies, the term Big Data refers to the large volumes of internal and external enterprise information that can be used to make business forecasts, improve existing infrastructure, manage smart power grids, drive business intelligence and support other applications.
This phenomenon is characterized by three main factors, often called the three V's:
Volume – the sheer amount of data, often more than traditional systems can handle
Velocity – the speed at which data flows in and out, which makes timely examination difficult
Variety – the range of data types and formats, which can be too broad to take in at once
Enterprises typically use big data to manage their business intelligence processes and programs. With the right analytics, however, it can be harnessed to gain richer insights into business practices, drawing on a range of sources and transactions to unearth hidden trends and relationships.
Big Data Analytics in Action
There are several types of analytics that big data depends on.
Prescriptive Analytics
These analytics reveal what actions should be taken and help determine future rules and policies. They are quite valuable since they allow business owners to answer specific queries. Take the bariatric healthcare industry, for example. Prescriptive analytics can measure how many patients in a population are morbidly obese, and that number can then be filtered further by adding categories such as diabetes or LDL cholesterol levels to determine the exact treatment. Some companies also use this kind of analysis on sales leads, social media and CRM data to shape their forecasts.
Diagnostic Analytics
These analytics analyze past data to determine why certain incidents happened. Say you end up with an unsuccessful social media campaign; a diagnostic analysis of the number of posts, followers, fans, page views, reviews and pins lets you separate the wheat from the chaff, so to speak. In other words, you can distill thousands of data points into a single view to see what worked and what didn't, saving time and resources.
Descriptive Analytics
This phase is based on current processes and incoming data. Such analysis can reveal valuable patterns that offer critical insights into important processes. For instance, it can help you assess credit risk, review past financial performance to gauge how a customer might pay in the future, and even categorize your clientele according to their preferences and sales cycle. Descriptive analytics is usually consumed through a dashboard or simple email reports.
Predictive Analytics
These analytics involve the extraction of current data sets to help users anticipate upcoming trends and outcomes. They cannot tell us exactly what will happen in the future, but they do tell a business owner what to expect under different scenarios. Predictive analysis is an enabler of big data in that it amasses enormous amounts of information, such as customer details, historical records and customer insight, in order to model future scenarios. In this way it allows organizations to use large volumes of information to anticipate their clientele's future behavior.
Inferential Analytics
This type of analytics uses a small sample of information to draw conclusions about a much larger whole, such as an entire population. It estimates the quantity the analyst cares about, along with the uncertainty in that estimate, and relies heavily on how the population and the sample were chosen.
Causal Analytics
These analytics allow big data analysts to figure out what happens if they change one component or variable in a bigger scheme. The method typically relies on randomized studies, although non-random studies are sometimes used to infer causation as well. Causal analytics with randomized trial data sets is considered the 'gold standard' for analyzing large volumes of data.
Mechanistic Analytics
These take the most effort but pay off with clear results. Mechanistic analytics, as the name implies, let analysts understand exactly how changes in one variable lead to changes in other variables for individual objects. The results are typically governed by equations, as in engineering and the physical sciences, but they can also be hard to infer. If the analyst knows the equation but not its parameters, the parameters can be inferred through data analysis.
In a nutshell, harnessing the potential of big data can help entrepreneurs add context to their business data and get a more in-depth, focused view of their needs. With analytics, those massive volumes of information can be distilled into actionable steps that support accurate business decisions. In other words, if you can understand and demystify big data, you can increase your business value tenfold and leave your competitors in the dust to boot.
Big data is here to stay, and business owners couldn't be happier. The term emerged as a way to describe the large volumes of information that databases hold, manage and maintain, but the concept has taken on a life of its own in the modern era. It now refers not only to the information itself but also to a range of technologies that handle those tasks to solve complex problems.
It's because of this flexibility that investment in big data continues to grow on a global scale. In fact, according to Forbes, it will amount to a whopping $40 billion this year alone and will expand by almost 14% over the coming five years. Big data is already a $5 billion business in its own right and will easily reach new heights if new business models emerge that leverage it to create more powerful analytical capabilities, along with state-of-the-art applications that can solve critical business issues in minutes rather than hours.
New approaches to this concept have made the IT sector better than ever, enabling game-changing business models, in-depth business analytics and app development that are just the tip of the iceberg. Big data has changed the way businesses compete to get ahead, along with the business models that support those efforts. It has also altered how enterprises view their databases, warehouses and especially their business intelligence operations.
It’s no wonder big data is such a huge hit in Silicon Valley and is well on its way to becoming a global phenomenon in the next couple of years. Needless to say, the concept has changed the way businesses develop and is just beginning to gain momentum as a significant movement.
The term itself is not new, and there is a very good reason for that. Companies across the globe, both large conglomerates and small startups, are utilizing its potential to gain valuable insight into existing operations, guide future development and improve their customer service.
Take today's data, for instance. According to a study conducted by scientists at UC San Diego, by 2024 most businesses across the globe will have processed the digital equivalent of a stack of books that, placed on top of each other, could stretch from Earth to Neptune and back. At the rate global enterprises are focusing on big data, that feat will be repeated 20 times each year!
What is Big Data Analytics?
However, just why are so many enterprises dependent on this phenomenon? This is where analytics comes into the picture. The process refers to the examination of big data to reveal hidden patterns, significant correlations and other useful information that business owners can use to improve decision making and unearth new opportunities. Data scientists use it to access and simplify huge volumes of information where traditional analytics falls short.
To understand its importance, say your company has already collected large amounts of data in a multitude of combinations, formats, and stores. Analyzing billions of rows of data to figure out what is important is not possible manually. Big Data analytics allows you to go through that information in context.
These processes have allowed countless business owners to streamline their decision-making and pinpoint the best options for enterprise development. Additionally, there are a number of ways entrepreneurs are harnessing the power of Big Data analytics to improve their businesses:
Big Data Business Intelligence
Business Intelligence, or BI, refers to the standard business reports, ad hoc reports, alerts, OLAP queries and notifications that are based on this process. Its main aim is to analyze the static past in order to determine future actions. When such reporting involves extracting data from huge data sets, we call it Big Data business intelligence. However, the decisions that result from both methods are largely reactionary.
Big Data Analytics
This method, by contrast, is largely proactive and requires a hands-on approach, involving optimization, predictive analytics, modeling, text mining and statistical analysis at scale. These processes allow analysts to pinpoint strengths and weaknesses and to develop better decision-making practices for the future. This is where it gets interesting: using big data analytics, business owners can home in on and extract exactly the information that matters for analysis.
In other words, big data analytics is more than a one-time endeavor. Business owners who stay proactive with it can do wonders for their enterprises and remain ahead of their competitors with tactics the latter are not privy to.
What is a NoSQL Database?
Sometimes read as "non SQL", a NoSQL database offers a mechanism for storing and retrieving important data. The term covers a number of database technologies created to accommodate large volumes of data about users, products and objects, along with demanding access patterns, performance metrics and processing requirements. For many modern applications these are more suitable than their relational counterparts, which struggle to match the scale, speed and agility such applications demand.
Types of NoSQL Databases
There are basically 4 types of these databases:
1. Key-value store – These are the least complex options and are designed to store data without a schema. Every item is stored as a value indexed by a unique key, hence the name.
2. Column store – Column (or wide-column) stores are designed to hold large volumes of data organized into columns rather than rows, hence the name, which makes for high performance and a scalable structure.
3. Document database – These databases hold more complex, semi-structured data, and each document has its own key for easy retrieval. They are designed to store, manage and retrieve information that is mainly in document form.
4. Graph – As the name implies, these NoSQL databases are based on graphs: the data consists of interconnected elements with a variable number of relations between them.
A NoSQL database allows quick and easy retrieval of complex data and keeps it consistently available. Because these databases are built on a distributed architecture, failures can be handled quickly and effectively; if one node goes down, the others continue operating without data loss, ensuring round-the-clock performance.
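The key-value model described above can be sketched in a few lines of Python. This is only a toy, in-memory illustration of the idea (the class name and keys are invented for the example); real NoSQL engines add persistence, distribution and replication on top of it:

```python
# Toy illustration of the key-value store model: schema-less values
# indexed by a single key. Real NoSQL stores add persistence,
# distribution, and replication on top of this basic idea.
class ToyKeyValueStore:
    def __init__(self):
        self._data = {}          # the "index": key -> opaque value

    def put(self, key, value):
        self._data[key] = value  # no schema is enforced on the value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = ToyKeyValueStore()
store.put("user:42", {"name": "Ada", "plan": "pro"})  # a document-like value
store.put("hits:/home", 1024)                         # a plain counter
print(store.get("user:42")["name"])                   # prints "Ada"
```

Note that the two values have completely different shapes; the store never inspects them, which is exactly what "schema-less" means here.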
What is MapReduce?
MapReduce is a Java-based programming model in the Hadoop framework that provides scalability across Hadoop clusters.
How does MapReduce work in Hadoop?
MapReduce distributes the workload into two kinds of jobs, the Map job and the Reduce job, which can run in parallel. The Map job breaks the data sets down into key-value pairs, or tuples. The Reduce job then takes the output of the Map job and combines those tuples into a smaller set of tuples.
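The two phases can be simulated in plain Python with a word-count example. This is only a single-process sketch of the idea; real Hadoop runs many map and reduce tasks in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map job: break each input record into key-value pairs (tuples).
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce job: combine each group into a smaller set of tuples.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big deal"]))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

The map step here is embarrassingly parallel, which is exactly why Hadoop can spread it across a cluster: each line can be processed independently of every other line.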
What is a 'key-value pair' in MapReduce?
A key-value pair is the intermediate data generated by the map tasks and sent to the reduce tasks to produce the final output.
What is the difference between the MapReduce engine and an HDFS cluster?
An HDFS cluster is the whole configuration of master and slave nodes where the data is stored. The MapReduce engine is the programming module used to retrieve and analyze that data.
Is a map like a pointer?
No, a map is not like a pointer. A map is a processing task that transforms input records into key-value pairs; it does not merely reference data.
Why is the number of splits equal to the number of maps?
The number of maps equals the number of input splits because one map task is launched per split, which ensures the key-value pairs of every input split get processed.
Is a job split into maps?
No, a job is not split into maps. Splits are created for the input file, which is placed on the datanodes in blocks. One map task is needed for each split.
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
This is a trick question. You cannot set it directly; the number of map tasks is driven by the number of input splits.
How can you set an arbitrary number of reducers to be created for a job in Hadoop?
You can either do it programmatically, using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.
How will you write a custom partitioner for a Hadoop job?
The following steps are needed to write a custom partitioner:
– Create a new class that extends the Partitioner class
– Override the getPartition method
– In the wrapper that runs MapReduce, either add the custom partitioner to the job programmatically, or add it to the job as a configuration setting (if your wrapper reads from a config file or Oozie)
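The steps above are Java API steps, but the routing logic that getPartition implements can be sketched in any language. Here is a Python sketch; the "EU-" rule is a hypothetical example invented for illustration, not part of any real job:

```python
def get_partition(key, num_reducers):
    # Hypothetical custom rule: send every "EU-" key to reducer 0 and
    # spread all other keys across the remaining reducers by hash.
    if key.startswith("EU-"):
        return 0
    return (hash(key) % (num_reducers - 1)) + 1

# All "EU-" keys land on the same reducer; other keys never use slot 0.
assert get_partition("EU-berlin", 4) == 0
assert get_partition("EU-madrid", 4) == 0
assert get_partition("US-austin", 4) in (1, 2, 3)
```

A custom partitioner like this is useful when some keys must be co-located (for example, all records of one region in one output file), at the cost of potentially uneven reducer load.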
What is the difference between the TextInputFormat and KeyValueInputFormat classes?
TextInputFormat: Reads lines of text files and provides the offset of each line as the key and the line itself as the value to the Mapper.
KeyValueInputFormat: Reads text files and parses each line into a key-value pair. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line as the value.
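The difference can be illustrated with a small Python sketch of how each format turns one line into a key-value pair. Real Hadoop works with byte offsets and Writable types; this only mimics the splitting rule:

```python
def text_input_format(line, offset):
    # TextInputFormat: the line's offset is the key, the whole line the value.
    return (offset, line)

def key_value_input_format(line):
    # KeyValueInputFormat: split at the first tab; key before it, value after.
    key, _sep, value = line.partition("\t")
    return (key, value)

assert text_input_format("apple\t3", 0) == (0, "apple\t3")
assert key_value_input_format("apple\t3") == ("apple", "3")
# With no tab, the whole line becomes the key and the value is empty.
assert key_value_input_format("no-tab-line") == ("no-tab-line", "")
```

In short, TextInputFormat never inspects the line's content, while KeyValueInputFormat imposes a tab-delimited structure on it.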
What is a Combiner?
The Combiner is a “mini-reduce” process which operates only on data generated by a
mapper. The Combiner will receive as input all data emitted by the Mapper instances
on a given node. The output from the Combiner is then sent to the Reducers, instead
of the output from the Mappers.
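A word-count combiner can be sketched in Python as a local pre-aggregation step. The sum here stands in for the reduce logic, which must be associative and commutative for a combiner to be safe:

```python
from collections import defaultdict

def combiner(mapper_output):
    # "Mini-reduce": locally sum the counts emitted by the mappers on one
    # node, so far fewer records cross the network to the reducers.
    totals = defaultdict(int)
    for key, value in mapper_output:
        totals[key] += value
    return list(totals.items())

node_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
print(combiner(node_output))  # [('big', 3), ('data', 1)]
```

Four records shrink to two before the shuffle; on real workloads with skewed keys, this local aggregation can cut network traffic dramatically.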
If no custom partitioner is defined in Hadoop, how is data partitioned before it's sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
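That default behavior can be sketched as follows. Python's hash() stands in for Java's hashCode() here, and since Python's modulo of a positive divisor is never negative, the sign-bit masking Hadoop's HashPartitioner performs is unnecessary in this sketch:

```python
def default_partition(key, num_reducers):
    # Sketch of Hadoop's default HashPartitioner: hash the key and take
    # the result modulo the number of reduce tasks.
    return hash(key) % num_reducers

# The same key always maps to the same partition, so all of a key's
# values meet at a single reducer.
p = default_partition("big", 4)
assert p == default_partition("big", 4)
assert 0 <= p < 4
```

The modulo step is what guarantees every value for a given key ends up on the same reducer, which is the property the reduce phase depends on.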
Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who has worked on a real Hadoop project is expected to have used Counters; a common scenario is counting the number of malformed or skipped records encountered while parsing the input.
Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
Yes. The input format class provides methods to add multiple directories as input to a Hadoop job.
Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
Explain the basic parameters of a Mapper.
A Mapper is parameterized by its input and output key-value types, for example:
LongWritable and Text as the input key and value
Text and IntWritable as the output key and value
Explain the function of the MapReduce partitioner.
The partitioner makes sure that all the values for a single key go to the same reducer, which in turn helps distribute the map output evenly over the reducers.
Explain the difference between an Input Split and an HDFS Block.
The logical division of the data is known as a Split, while the physical division of the data is known as an HDFS Block.
Mention the main configuration parameters that a user needs to specify to run a MapReduce job.
The user of the MapReduce framework needs to specify:
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
In the present era, big data analytics is no longer used only for experimentation. Many companies have begun to achieve real results with this approach and are expanding their efforts to cover more data and more models. The term describes the collection, availability and processing of huge volumes of streaming data in real time, characterized by the three V's: volume, velocity and variety. To make more accurate decisions, companies are combining marketing, sales, transactional and customer data with external data and social conversations, such as stock prices, news and weather, to identify correlations and build statistically valid models.
Timely: It can save plenty of time, since on every working day 60% of knowledge workers spend time attempting to find and manage data.
Accessible: Half of senior executives report that accessing the right data is difficult, so anything that makes the data easier to reach helps.
Trustworthy: On average, 29% of companies measure the monetary cost of poor data quality. Even simple things, like monitoring updates to customer contact information across multiple systems, can save a company millions of dollars.
Relevant: Keeping irrelevant data is a curse for a database, since it complicates filtering. Yet statistics say around 43% of companies have tools that are unable to filter out junk data. Something as simple as filtering customers out of your web analytics can provide insight into your acquisition efforts.
Secure: With secure data hosting and technology, companies can protect their infrastructure; the average security breach costs a company $214 per compromised record, so this technology can save up to 1.6% of revenue per year.