BIG DATA: interviews with the experts [KM World]
(KM World Via Acquire Media NewsEdge) Big data is a hot topic these days, and with good reason. The emerging technology provides new ways of analyzing massive amounts of data and extracting business value for a multitude of purposes. KMWorld senior writer Judith Lamont had an opportunity to interview four experts in this fast-moving field who share their insights here. They include: Kapil Bakshi, chief architect for Cisco Public Sector; Anjul Bhambhri, VP of big data, IBM; Charles Zedlewski, VP, products, Cloudera; and Dan Vesset, program VP for business analytics solutions, IDC.
Q Lamont: How did big data get its start?
A Vesset: Much of the initial excitement about big data was triggered by companies like Google (google.com), Yahoo (yahoo.com), Amazon (amazon.com), Facebook (facebook.com) and Twitter (twitter.com), which all produce a lot of clickstream data that is only valuable if it is collected and analyzed. The volume and flow of information was such that traditional Web analytics methods were not capable of handling it.
Q Lamont: Why has big data become so important lately?
A Zedlewski: Data volume is growing faster than Moore's Law now, and the old approaches through which companies met the challenge of increased data are not sustainable. Also, there is a whole class of problems that has gone unaddressed because no solution was scalable, economical and flexible. With the new Hadoop (http://hadoop.apache.org) technology, which can scale across thousands of commodity servers, these solutions become feasible.
Bakshi: The amount of digital information being collected and stored is growing exponentially, especially unstructured data. According to one Cisco study, global IP traffic will reach 1.3 zettabytes annually by 2016, which is a fourfold increase from 2011. By 2016 there will be 19 billion global network connections, the equivalent of two-and-a-half connections for every person on earth. This new tsunami of data is being generated from new types of sources, mostly machine-generated ones, like sensors, smart phones and other Internet-connected devices. All these trends together mean that a huge amount of data needs to be moved, collected, stored and analyzed to create value out of it.
Bhambhri: Businesses are realizing that they need to make decisions based on all the data that is available, particularly the 80 percent that is unstructured. Information is now coming from Facebook, Twitter and many other sources that did not exist before. People can express themselves in ways they previously could not. This information is valuable, particularly in consumer markets. Companies that do not tap into this information, but only look at point-of-sale information, are missing out on a lot of insights, and they are increasingly recognizing this.
Q Lamont: What are the primary drivers for big data?
A Vesset: One is efficiency. Finding the right tool for a given workload is important. Relational databases are not the most efficient way to store and process large semi-structured or unstructured data sets, so users are looking for alternatives. Another is innovation. Big data analytics lets organizations do things that were not feasible before, either because the technology was not there or because it was too expensive. Finally, compliance is a large and growing problem because large amounts of data need to be stored for longer and sometimes retrieved relatively quickly.
Q Lamont: What roles do Cisco, Cloudera and IBM respectively play in big data?
A Bakshi: Cisco enables the connected systems of the Internet of Things, which is the main source of (machine-generated) big data. Second, we are addressing the data-in-motion aspect of big data, and Cisco's networking products support capturing and moving around large sets of big data. Third, with our ecosystem analytics partners, we are providing data center network and unified computing-based architectures and solutions for big data analytics. Here we are focusing on MapReduce, NoSQL, in-memory databases and massively parallel database systems architectures.
Zedlewski: Cloudera is an open source data management platform company. We provide a system that includes Apache Hadoop and other subsystems that let enterprises store, process and analyze large volumes of data. We have 400 partners that develop packaged software applications for our platform, ranging from business intelligence vendors such as MicroStrategy (microstrategy.com) to Hadoop startups. IBM's BigInsights runs on top of our platform, as do its other big data tools such as InfoSphere DataStage for ETL and data integration.
Bhambhri: IBM is making it easier for customers to handle big data through several offerings. InfoSphere BigInsights is a platform built on top of Hadoop that complements the open source product by providing analysis and visualization of data. InfoSphere Data Explorer is a discovery and navigation product that allows users to access and analyze big data along with data from enterprise applications. InfoSphere Streams analyzes data streams on a continuous basis, monitoring them for information in real time. Vivísimo can federate and integrate information from other enterprise sources to allow it to be incorporated into big data analyses.
Q Lamont: What are some of the viable use cases for big data analytics?
A Bakshi: Many government organizations have large amounts of data that can be analyzed productively. Hence the use cases are all around the government verticals of DoD, intelligence communities, healthcare, citizen services and scientific research and experimentation. Some of the more commonly discussed use cases include big data analytics for cybersecurity, intelligence, full motion video, electronic health records, financial fraud detection, scientific experimentation and many more.
Vesset: Big data supports improved fraud detection, not only in banking but also in government and retail. It is also effective at complex optimization, such as the sale of airline tickets where many variables can affect the price. Sensor data is also a very big area. Railroads, for example, have sensors on the train cars to monitor the performance of wheels to assist in maintenance, the goal being to fix problems when they are small. Utilities that are now using smart meters are looking for ways of using that data to improve the availability of power, load balancing and responding to outages.
Q Lamont: What does IDC expect from big data in terms of growth of the market?
A Vesset: We expect the market to reach $16.9 billion by 2015, up from $3.2 billion in 2010. Different segments will grow at different rates - we expect the annual growth in software to be about 40 percent, just under 30 percent for servers, and about 60 percent for storage. There are legitimate and appropriate uses right now across industries, as well as some growth due to hype and the fear of being left behind. But the demand for analytics in general is strong - the traditional data warehouse market grew 18 percent during 2011. Whether through new technologies such as Hadoop or mature technologies such as business intelligence solutions, companies want to use the data they collect in order to support business decisions.
Q Lamont: What is the fundamental difference between Hadoop and relational databases?
A Zedlewski: They operate in different ways and have different applications. Relational databases are structured with well-defined schema. When data sets are constantly changing, analysis using databases is difficult because they were designed for optimization of repeated queries. Hadoop breaks the information up into different blocks, does not need a defined schema and is designed for flexibility and experimentation. It is ideal for looking for patterns in data and dealing with unpredictable data sets.
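The schema-free, pattern-oriented style Zedlewski describes can be illustrated with a toy MapReduce-style job in Python (a sketch only, not Hadoop itself; the record fields are invented for illustration):

```python
from collections import defaultdict

# Schema-free records: each one may carry different fields,
# which a fixed relational schema would struggle to accommodate.
records = [
    {"event": "click", "page": "/home"},
    {"event": "click", "referrer": "twitter"},
    {"event": "purchase", "amount": 19.99},
    {"event": "click", "page": "/pricing"},
]

def map_phase(record):
    # Emit a (key, 1) pair for whatever pattern we are exploring.
    yield record["event"], 1

def reduce_phase(pairs):
    # Sum counts per key, as a Hadoop reducer would.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [pair for record in records for pair in map_phase(record)]
counts = reduce_phase(pairs)
print(counts)  # {'click': 3, 'purchase': 1}
```

The point of the sketch: no schema is declared up front, and a record with an unexpected field is processed without any migration, which is what makes this style suited to experimentation on unpredictable data sets.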
Q Lamont: Is Hadoop easier to use than other technologies for this type of application?
A Zedlewski: That depends on what you are comparing. If you compare the complexity of building a petabyte-scale Hadoop system with building a petabyte-scale RDBMS, Hadoop is the much simpler technology. It's free to download, runs on any kind of hardware or cloud platform and doesn't require a lot of preplanning of data models. On the other hand, if you compare Hadoop to, say, downloading MySQL, sure, Hadoop is more complex, but then they are not serving comparable functions. For users, it's still the case that there are more tools and interfaces in the traditional database world than there are for Hadoop, and there's a large industry of DBAs that are familiar with them. Also, if you are using Hadoop for advanced analytics, or "data science," you can get supply constrained because there are only so many people who are both math savvy and big data savvy. Cloudera is working to improve Hadoop in both of these areas.
Q Lamont: Are there limitations to Hadoop versus relational databases in terms of speed?
A Vesset: Today, Hadoop is good for storing large data sets and for performing batch processing on that data. Typically, a small number of data scientists are analyzing data held in Hadoop clusters. Sometimes, data pre-processed in Hadoop is migrated to relational analytic databases for further ad hoc analysis. Compared to traditional relational databases, though, Hadoop is not good at having many people query at the same time and getting instant responses. If that is your goal, a data warehouse based on a relational database is a better solution. There is increasing realization that at least for now, the two are complementary, and that each one has a role.
Q Lamont: How would you compare the performance of Hadoop to that of other data processing systems?
A Zedlewski: Hadoop has great scale, performance and elasticity properties when it comes to processing. Processing data might mean aggregating log data, reconciling trades or calculating value at risk. It is not uncommon to see data processing times collapse from hours to minutes when these processes are implemented in Hadoop. That's the power of an architecture based on scale-out commodity hardware. On the other hand, users who are accustomed to BI tools that respond in seconds are a different story. BI tools are designed to answer a finite range of questions for which the queries and answers are already defined, so the response can be much faster. We recently released the public beta version of Cloudera Impala, which is the first real-time query engine for Hadoop. It does not yet have an extensive feature set that matches that of advanced SQL, but it is an effective tool for exploratory analyses.
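As a rough illustration of the log aggregation Zedlewski mentions, here is a minimal Python sketch (the log format and field positions are invented; in Hadoop the same per-line parsing and per-key summing would be distributed across a cluster of commodity machines):

```python
# Aggregate bytes served per HTTP status code from raw log lines --
# a typical batch-processing job that Hadoop parallelizes at scale.
log_lines = [
    "10.0.0.1 GET /index.html 200 5120",
    "10.0.0.2 GET /missing 404 310",
    "10.0.0.1 POST /api/orders 200 880",
]

def aggregate(lines):
    totals = {}
    for line in lines:
        # Hypothetical format: ip method path status bytes
        _ip, _method, _path, status, size = line.split()
        totals[status] = totals.get(status, 0) + int(size)
    return totals

print(aggregate(log_lines))  # {'200': 6000, '404': 310}
```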
Q Lamont: How would you describe the flexibility of Hadoop compared to relational databases or data warehouses?
A Zedlewski: If you ask developers how long it will take them to add a field or dimension to a database to answer a question that does not fit the original schema, you'll typically hear it takes from a few weeks to a few months. You have to pull that new field or dimension data off of archive storage, you have to append it to all your historical data, you have to update your data dictionary, you have to update your ETL jobs and you may also have to update parts of your batch reporting infrastructure. With a big data approach, you can keep all the original data granularity and add certain information in later, rebuilding it on the fly. Adding new variables and new dimensions is not a problem.
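Zedlewski's point about adding dimensions after the fact can be sketched with schema-free records in Python (the field names and mapping are invented for illustration):

```python
# With schema-free records, a new dimension is just another key --
# no archive restore, schema migration or ETL rewrite is needed.
sales = [
    {"sku": "A1", "amount": 120.0, "country": "DK"},
    {"sku": "B2", "amount": 45.5, "country": "US"},
]

# Derive a "region" dimension later from the retained granularity.
region_of = {"DK": "EMEA", "US": "AMER"}
for row in sales:
    row["region"] = region_of.get(row["country"], "OTHER")

print([row["region"] for row in sales])  # ['EMEA', 'AMER']
```

Because the original records keep all their granularity, the new dimension can be rebuilt on the fly whenever the derivation rule changes.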
Q Lamont: Can you give an example of a specific current use case?
A Vesset: In Denmark, a wind turbine manufacturer was testing different sites to see if they were feasible locations for a turbine. The company used to have to spend a year gathering data about wind speeds and other environmental factors. Now, they get huge data sets from the National Weather Service (weather.gov) and analyze them themselves using big data solutions. As a result, they have cut a year off their decision process and have saved money because they don't need to collect the data themselves. The results are as accurate as the ones they were getting from their previous method.
Bhambhri: At the University of Ontario, IBM's big data solutions were used in a project called Data Baby. Premature babies are routinely connected to sensors that collect all kinds of data that provide insights into their medical condition. The volume of information is staggering - 1,000 pieces of data per second, which was being aggregated into several ratings that the doctors and nurses check on during their rounds. However, infants were coming down with infections despite positive health indicators. In this big data application, patterns of data were correlated with the development of subsequent infections. Based on these patterns, researchers were able to predict the potential onset of infections 24 hours in advance, allowing preventive treatment. This work is now being used at other hospitals to predict events such as brain hemorrhaging in stroke patients. When you can process information this quickly, a lot of options open up.
Q Lamont: What obstacles stand in the way of big data applications?
A Bakshi: To start with, keeping up with the sheer volume, variety and velocity of big data is a challenge in itself. Hence, from an enterprise and agency perspective, one needs a starting point for a big data strategy. Issues of data governance such as data privacy, ownership and lifecycle need to be reconsidered in the context of new and growing sources of big data. Second, the notion of enterprise data is going to be broader than traditional sources. It will include a lot more unstructured data, which generally does not have a well-defined schema. To extract the most value from the different types of data, the agencies will need to adopt an integrated approach for analyzing traditional structured and unstructured data, which would imply extracting that information from multiple silos.
Bhambhri: One of the biggest challenges businesses face is finding enough data scientists to make sense of the petabytes of available data. Businesses need marketers, managers and IT staff with data analysis skills to allow them to leverage data and use it to improve business management and marketing. IBM is training the next generation of data scientists by helping universities integrate big data into their curriculums and creating project-focused case studies through which students can gain hands-on experience.
Q Lamont: What is the most common mistake when organizations first start working with big data?
A Zedlewski: A common error from the IT department is not identifying a first use case that has compelling business value. The technology sweet spot should align with the business sweet spot.
Vesset: One problem is that the IT department may look at the technology first and become enthusiastic about how much information they can store, without thinking about how they can use it. IT groups need to work hand in hand with the business side to identify specific use cases and proceed from there.
Q Lamont: How should an organization get started with big data?
A Zedlewski: Hadoop is an open source product that can be downloaded and installed from http://hadoop.apache.org, and the documentation is available there. Users can try it out. A preinstalled virtual node is available on the website, or users can build their own node, which is an individual computer in a cluster of computers. Start with a relatively narrow project to demonstrate what it can do. A good use case to try out is a recommendation engine for a product based on consumer preferences, or an analysis of an event log that records network performance data, looking for abnormalities. These applications are ideal for Hadoop.
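As a toy version of the recommendation-engine starter project Zedlewski suggests (the baskets and logic here are invented for illustration; a real Hadoop job would distribute the same pair counting across a cluster):

```python
from collections import Counter
from itertools import combinations

# Baskets of products bought together, as might come from
# clickstream or point-of-sale data.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

# Count co-occurring pairs -- the core of a simple
# "people who bought X also bought Y" recommender.
pairs = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pairs[(a, b)] += 1

def recommend(item):
    # Rank partners of `item` by how often they co-occur with it.
    partners = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    return [p for p, _ in partners.most_common()]

print(recommend("milk"))  # ['bread', 'eggs']
```

Even at this scale the shape of the computation is the same as in a cluster job: a map-like pass that emits pairs, followed by a reduce-like pass that counts and ranks them.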
Bhambhri: The organization should start an experiment in its own shop. Companies should take a look at the sources of data that they care about but are not tapping into now. Then, bring the data into a platform and start the discovery process, with a few people dedicated to the job. The analysts can then see what the data is telling them, and present the insights to their lines of business. If there is value added, they can expand beyond the pilot and build out their infrastructure.
Q Lamont: What do you see in the future of big data applications?
A Bhambhri: We expect big data to continue to grow at a very rapid rate. So many things are instrumented today, such as traffic information - GPS devices, traffic lights, sensors in the roads, for example - and they generate a tremendous amount of information. This information can be used by city planners to solve traffic problems and improve transportation efficiency. Consumer devices such as set-top cable boxes are generating logs. The number of users on smart devices has exploded in the last couple of years, along with data derived from social media. Meanwhile, technology breakthroughs have come along that allow organizations to make use of the information.
By Judith Lamont, KMWorld senior writer
Judith Lamont, Ph.D., is a research analyst with Zentek Corp., e-mail firstname.lastname@example.org.
(c) 2013 Information Today, Inc.