Is big data the right solution?
From: Developing big data solutions on Microsoft Azure HDInsight
The first step in evaluating and implementing any business policy, whether it’s related to computer hardware, software, replacement office furniture, or the contract for cleaning the windows, is to determine the results that you hope to achieve. Deciding whether to adopt a Hadoop-based big data batch processing approach is no different.
The result you want from your solution will typically be better information that helps you make data-driven decisions for your organization. To obtain this information, however, you must evaluate several factors, such as:
- Where will the source data come from? Perhaps you already have the data that contains the information you need, but you can’t analyze it with your existing tools. Or is there a source of data you think will be useful, but you don’t yet know how to collect it, store it, and analyze it?
- What is the format of the data? If it is highly structured, you may be able to load it into your existing database or data warehouse and process it there. If it is semi-structured or unstructured, a Hadoop-based mechanism such as HDInsight, which is optimized for textual discovery, categorization, and predictive analysis, is likely to be more suitable.
- What are the delivery and quality characteristics of the data? Is there a huge volume? Does it arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of data cleansing and validation of the content?
- Do you want to combine the results with data from other sources? If so, do you know where this data will come from, how much it will cost if you have to purchase it, and how reliable this data is?
- Do you want to integrate with an existing BI system? Will you need to load the data into an existing database or data warehouse, or will you just analyze it and visualize the results separately?
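The structured versus semi-structured distinction in the questions above can be illustrated with a short sketch. The sample records below are entirely hypothetical: some lines fit a fixed schema and could go straight into a relational table, while the rest would need text-oriented processing of the kind Hadoop is suited to.

```python
# Hypothetical raw feed: a mix of clean, schema-shaped records and free text
raw = """2024-01-05,London,42
2024-01-06,N.Y.C.,,note=retry later
user said "where is my order??" (no timestamp)
2024-01-07,New York,17
"""

structured, leftovers = [], []
for line in raw.strip().splitlines():
    fields = line.split(",")
    # Treat a line as "structured" only if it matches the expected
    # three-field shape (date, city, value) with nothing missing
    if len(fields) == 3 and all(fields):
        date, city, value = fields
        structured.append({"date": date, "city": city, "value": int(value)})
    else:
        # Semi-structured or unstructured: keep aside for text-oriented analysis
        leftovers.append(line)

print(len(structured), len(leftovers))  # 2 structured rows, 2 leftover lines
```

The point of the sketch is that forcing the leftover lines into the rigid schema would discard information, which is exactly the situation where a schema-on-read approach such as Hadoop pays off.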
The answers to these questions will help you decide whether a Hadoop-based big data solution such as HDInsight is appropriate, but keep in mind that modern data management systems such as Microsoft SQL Server and the Microsoft Analytics Platform System (APS) are designed to offer high performance for huge volumes of data—your decision should not focus solely on data volume.
As you saw earlier in this guide, Hadoop-based solutions are primarily suited to situations where:
- You have very large volumes of data to store and process, and these volumes are beyond the capabilities of traditional relational database systems.
- The data is in a semi-structured or unstructured format, often as text files or binary files.
- The data is not well categorized; for example, similar items are described using different terminology, such as variations in city, country, or region names, and there is no obvious key value.
- The data arrives rapidly as a stream, or in large batches that cannot be processed in real time, and so must be stored efficiently for processing later as a batch operation.
- The data contains a lot of redundancy or duplication.
- The data cannot easily be processed into a format that suits existing database schemas without risking loss of information.
- You need to execute complex batch jobs on a very large scale, making it necessary to run the queries in parallel across many nodes.
- You want to be able to easily scale the system up or down on demand, or have it running only when required for specific processing tasks and close it down altogether at other times.
- You don’t actually know how the data might be useful, but you suspect that it will be—either now or in the future.
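The "not well categorized" criterion in the list above can be made concrete with a minimal sketch. The variant spellings and the mapping here are purely illustrative; in practice this kind of normalization is one of the text-processing tasks a Hadoop job would perform at scale.

```python
# Hypothetical variant spellings of the same cities, as they might
# appear in uncurated source feeds with no shared key value
CANONICAL = {
    "nyc": "New York",
    "n.y.c.": "New York",
    "new york city": "New York",
    "london": "London",
    "london, uk": "London",
}

def normalize_city(raw: str) -> str:
    """Map a raw city string to a canonical name, if one is known."""
    key = raw.strip().lower()
    # Unknown names pass through unchanged rather than being dropped
    return CANONICAL.get(key, raw.strip())

print(normalize_city("N.Y.C."))      # New York
print(normalize_city("London, UK"))  # London
print(normalize_city("Paris"))       # unknown: passed through unchanged
```

With well-categorized relational data this lookup table would be unnecessary; the need to build and apply such mappings across millions of records is a signal that a batch-oriented big data approach may fit better.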
Note
In general, you should consider adopting a Hadoop-based solution such as HDInsight only when your requirements match several of the points listed above, not just one or two. Existing database systems can handle many of the tasks in the list, but a batch processing solution based on Hadoop is likely to be a better choice when several of these factors apply to your own requirements.