By Cassandra Balentine
As business data continues to grow exponentially, management strategies are critical. Data lakes store raw data in its native format until it is needed. Unlike data warehouses, which organize data hierarchically in files or folders, data lakes use a flat architecture for data storage.
As companies begin or continue their digital transformation journeys, enterprise data lakes are becoming an important element. Constantly growing data needs to be organized, managed, and accessed in a way that best supports the business. Modern data lakes provide a different, freer approach to data management, which is particularly important when it comes to unstructured data.
The Modern Data Lake
The modern data lake complements the data warehouse by providing capabilities that make up for its shortfalls. It is also used to offload data and workloads that no longer need to reside in the warehouse.
While organizations are collecting more data than ever before, it is important to remember that not all of it has value. “Companies need to dig through the mounds of data to figure out what has value. The data lake serves a key role of allowing companies to collect all the data, dig through it, find out what has value, and then move that data to an analytics platform,” offers Mark Shalinman, marketing director, Teradata.
The role of the data lake within the enterprise is to consolidate the data in one place to address any type of query and business-oriented use case. “This is the start for any advanced analytics or artificial intelligence (AI)/machine learning project in the enterprise. A modern data lake is based on an architectural style that allows multiple services to coalesce as a single unit and react to surroundings while remaining aware of each other. They are event driven, asynchronous, and scale dynamically,” offers Jerome Sandrini, senior vice president and head of big data and cybersecurity, Atos North America.
Today’s data lakes need to give internal and external users dynamic access to data from one location, comments Ilya Pupko, chief architect, Jitterbit. The traditional approach to data lakes, like the data warehousing approach before it, originated from the issue of siloed data. However, that approach still required consolidating all data in one place, which created complications around privacy, data protection, and other issues. “The modern solution is what we call a virtual data lake, which allows users to access data in a proper and structured way, with proper access controls and without having to worry about storage and transmission issues. The data remains in its original location and owners are able to connect different data sources incrementally rather than taking on a massive project that doesn’t provide value until all the data is transferred over into a new physical location.”
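To make the virtual data lake idea concrete, here is a minimal Python sketch. It is not Jitterbit's product or API. The names `Source`, `VirtualDataLake`, `connect`, and `query` are hypothetical, chosen only to illustrate the three properties Pupko describes: data stays where it is, sources are connected one at a time, and every read passes an access check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set

Row = Dict[str, object]


@dataclass
class Source:
    """A data source that stays in its original location (hypothetical type)."""
    name: str
    fetch: Callable[[], Iterable[Row]]  # reads from the source only when asked
    allowed_roles: Set[str]             # roles permitted to read this source


@dataclass
class VirtualDataLake:
    """One access point over many sources; no data is copied at connect time."""
    sources: Dict[str, Source] = field(default_factory=dict)

    def connect(self, source: Source) -> None:
        # Sources can be added incrementally, one at a time.
        self.sources[source.name] = source

    def query(self, source_name: str, role: str) -> List[Row]:
        source = self.sources[source_name]
        # Access control is enforced on every read.
        if role not in source.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {source_name!r}")
        return list(source.fetch())


# Usage: connect one source now, others later; data is fetched on demand.
lake = VirtualDataLake()
lake.connect(Source("crm", lambda: [{"customer": "A"}], {"analyst"}))
rows = lake.query("crm", role="analyst")
```

A real deployment would replace the `fetch` callables with connectors to databases, APIs, or file stores, and the role check with the organization's own identity and policy system.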
A modern data lake offers a number of advantages for organizations across a range of industries. Tajinder Pal Singh Ahluwalia, product marketing lead, unstructured data storage, Dell, says that by consolidating file-based, unstructured data to form a data lake, organizations eliminate costly storage silos, simplify management, increase data protection, and enable the use of data analytics to unlock the value of their data assets.
“Broadly speaking, an analytical environment may consist of a number of technologies, and the center of that environment might be a data warehouse. A data warehouse is helpful for highly responsive business intelligence analytics or structured or traditional data, but it does not do well with modern analytical requirements,” comments Dale Kim, senior director, products/solutions, Arcadia Data. He adds that in particular, data lakes help create new opportunities that would otherwise prove too costly to do on data warehouses, such as analyzing unstructured data at scale, or deploying advanced analytics in machine learning frameworks.
With nearly all businesses looking to use the cloud in some way, a modern data lake should combine on-premises data storage with cloud storage from multiple public and private cloud service providers. “This allows organizations to address rapid data growth and optimize data center storage resources by using the cloud as a highly economical storage tier with massive storage capacity for cold or frozen data that is rarely used or accessed. In this way, more valuable on-premises storage resources may be used for more active data and applications,” says Ahluwalia.
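The tiering approach Ahluwalia describes can be sketched as a simple placement rule. The Python below is an illustration only: the function name `choose_tier`, the tier labels, and the 30-day and 365-day thresholds are assumptions made for this example, not figures from Dell or from any storage product.

```python
from datetime import datetime, timedelta


def choose_tier(last_accessed: datetime, now: datetime,
                hot_days: int = 30, cold_days: int = 365) -> str:
    """Pick a storage tier from how recently the data was accessed.

    The thresholds are hypothetical defaults; real policies vary by business.
    """
    age = now - last_accessed
    if age <= timedelta(days=hot_days):
        return "on_premises"   # active data stays on faster local storage
    if age <= timedelta(days=cold_days):
        return "cloud_cold"    # rarely used data moves to a cheaper cloud tier
    return "cloud_frozen"      # data that is almost never read


# Usage: a file last touched 100 days ago falls into the cold tier.
now = datetime(2019, 1, 1)
tier = choose_tier(now - timedelta(days=100), now)
```

In practice, the decision often also weighs data size, compliance requirements, and retrieval cost, not just access age.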
The data lake has transitioned from a simple repository of structured, semi-structured, and unstructured data to a single data repository that helps drive modern workloads such as self-service analytics and AI projects to provide business insights that were never possible before. “However, only the enterprise-ready modern data lakes must provide critical capabilities such as data cataloging, data quality, data prep, data relationships, data masking, and data governance. These data management capabilities need to be automated with AI for agility, accuracy, and collaboration. Without the support of any of these critical capabilities, you’ll end up with a modern data swamp,” says Vamshi Sriperumbudur, head of marketing, big data and analytics portfolio, Informatica.
Kumar Thangamuthu, senior technical architect, SAS, says the idea of the data lake focuses on storing all analyzable data sets in raw or lightly processed form in easily expandable, scale-out Hadoop infrastructure so that the fidelity of the data is preserved. “The modern data lake will be a managed data lake, meaning one uses a data lake management platform to manage data ingestion, apply metadata, and enable data governance so that users know what’s in the lake and can use the data with confidence. These factors help to avoid turning data lakes into data swamps,” he says. Modern managed data lakes allow businesses to explore data quickly and easily, identify opportunities for business and process improvements across the organization, and better understand their customers.
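As a rough illustration of what "managed" means here, the Python sketch below stores raw data untouched but refuses to ingest anything without basic metadata, so the lake always has a searchable catalog. It does not represent SAS software or any specific data lake management platform. The class names, fields, and the rule that owner and description are mandatory are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata recorded for each object in the lake (hypothetical schema)."""
    path: str
    owner: str
    description: str
    contains_pii: bool
    ingested_at: datetime


class ManagedLake:
    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}          # raw data, stored as-is
        self._catalog: Dict[str, CatalogEntry] = {}   # what is in the lake

    def ingest(self, path: str, raw: bytes, owner: str, description: str,
               contains_pii: bool = False) -> CatalogEntry:
        # Governance rule (an assumption): no metadata, no ingestion.
        if not owner or not description:
            raise ValueError("owner and description are required")
        self._objects[path] = raw
        entry = CatalogEntry(path, owner, description, contains_pii,
                             datetime.now(timezone.utc))
        self._catalog[path] = entry
        return entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        # Users find data through the catalog, not by browsing raw files.
        kw = keyword.lower()
        return [e for e in self._catalog.values()
                if kw in e.description.lower()]


# Usage: ingest a raw log file with metadata, then find it by keyword.
lake = ManagedLake()
lake.ingest("logs/web.log", b"GET /index.html", owner="web-team",
            description="Raw web server access logs")
matches = lake.search("access logs")
```

The point of the design is that the catalog, not the storage layer, is what keeps the lake from becoming a swamp: undocumented data never gets in.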
At one point, the data lake was synonymous with Hadoop, but that has changed with the evolution of the modern data lake, explains Shalinman. “The lake is no longer confined to just Hadoop or one platform. It has evolved to consist of multiple platforms and technologies within a larger analytical ecosystem. The modern data lake is an architectural design and decision, not a technology,” he offers. The modern data lake can consist of data in Hadoop, in object stores like Amazon S3 or Microsoft Azure Blob Storage, or in Teradata Vantage. “It is more about what the type and value of the data is and what it is used for versus the technology used.”
Additionally, cybersecurity and personal data protection play a vital role in every organization. “This is a very big data problem as threats come from internal and external sources of an organization. The General Data Protection Regulation is intended to standardize expectations and protect personally identifiable information on employees, clients, and applicable data subjects. This means cybersecurity data collection and analysis has to become proactive and always on. Data governance is important in securing the data in a data lake with an organization’s procedures and policies that manage the data usage, availability, privacy, and security of data,” adds Thangamuthu.
Explosive Growth
According to recent research published by MarketsandMarkets, the market for data lakes is estimated to grow from $2.5 billion in 2016 to $8.81 billion by 2021, a compound annual growth rate of 28.3 percent.
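For readers who want to check growth figures like these, the short Python sketch below shows how a compound annual growth rate relates a start value to an end value. The function names are illustrative. Run on the rounded figures quoted above, the calculation gives an implied rate slightly higher than 28.3 percent; the small gap is presumably due to rounding or to the base-year convention used in the report, which is an assumption here rather than something the report states.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at a fixed annual rate."""
    return start_value * (1 + rate) ** years


# Figures quoted in the article, in billions of dollars, 2016 to 2021.
implied_rate = cagr(2.5, 8.81, years=5)
projected_2021 = project(2.5, 0.283, years=5)

print(f"Implied CAGR: {implied_rate:.1%}")
print(f"2021 value at 28.3%: ${projected_2021:.2f} billion")
```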
Data lakes are primed for growth because organizations are turning away from the hype that data lakes are a panacea for all analytics problems, shares Kim. Instead, they are applying the due diligence that any new software system requires. “As a result, organizations get real value from data lakes, especially for the new data initiatives they seek that cannot be efficiently handled by traditional technologies.”
Pupko believes the market is even larger than the study predicts, considering the new role that virtual data lakes will play in the enterprise. “Instead of standing apart as a separate piece of infrastructure, the virtual data lake will serve as the backbone of all data-driven business processes, which also happen to be the fastest growing areas of focus for most companies.”
Ahluwalia adds that in today’s increasingly digital economy, data is a valuable business asset that needs to be stored, managed, and protected accordingly. “Various market research organizations indicate data, especially unstructured data, is continuing to grow rapidly—well over 20 percent per year. This trend is expected to continue for the foreseeable future. This means that many organizations will see their data storage requirements doubling every three to four years.”
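The link between annual growth and doubling time is simple compounding arithmetic, sketched below in Python. The function name is illustrative, and the only assumption is that growth stays at a constant annual rate.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Growth rates from about 20 to 26 percent a year bracket the
# "three to four years" range mentioned in the quote.
for rate in (0.20, 0.26):
    print(f"{rate:.0%} per year: doubles in {doubling_time_years(rate):.1f} years")
```

At 20 percent a year, storage needs double in a little under four years; at about 26 percent, they double in roughly three.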
The data lake market is growing fast. “This is not surprising when you start peeling the data lake market onion a bit. New data types (social, mobile, and Internet of Things), formats, and use cases drive exponential growth in the amount of new data generated; this data is coming in at various latencies, and new users access all this big data. These are the exact market drivers we see from our customers globally, and across industry verticals,” says Sriperumbudur.
Thangamuthu says there has been explosive growth of streaming data from high-velocity, high-volume sources at companies both small and large. Streaming data can be generated constantly, accumulates quickly, and is unstructured or semi-structured in its original form.
Driving Growth
While growth is evident, it’s important to look at what is driving demand for data lakes and pushing their adoption in the modern enterprise.
“Above all, the interest and need to provide new analytical tools to more users is promoting market growth. Businesses see their competitors gain value from modern analytical technologies and they know they need to jump in to keep up,” comments Kim. He says that as data lakes move past the initial hype, they gain recognition as tools that a wider audience of users can leverage.
Digital transformation is disrupting every industry. Data is the backbone of this transformation and, for that reason, one of the most important assets for most organizations today. “This is why an increasing number of businesses use data to identify new opportunities and efficiencies well beyond traditional business assets.”
In Sandrini’s experience, clients go from pilot projects or specific business unit deployments to broader, enterprise-wide data lake deployments. “This is triggered by digital transformation acceleration with massive cloud adoption, automation projects, and business driving the need for new data-driven business services.”
The demand for data lakes is driven by many factors. Thangamuthu says these include the evolution of data lakes on cloud infrastructures such as Amazon S3, which offload maintenance effort and take care of automatic scaling, encryption, and ease of management. Further driving demand are greater choice and better accessibility, with the emergence of data lake querying tools based on SQL and ad hoc querying. Another push comes from varied pricing options for agile businesses, especially with the separation of storage and compute in data lakes. Finally, Thangamuthu sees the desire to give data scientists access to enterprise data for exploration, discovery, and insight creation as another driver for adoption.
Most modern businesses will adjust how they think of their data, and the majority will start looking at it as a growth opportunity rather than a challenge that needs to be solved, points out Pupko. “For example, in the world of old data lakes, we have GM’s OnStar system where they have a decade of data collected but haven’t done anything with it. Now they’re considering selling external access to that data as a business model and so they need a way to control their data properly. That means creating a virtual data lake with variable user access and permissions instead of forcing everything in one place and then handing buyers the keys to the proverbial castle. They simply had no choice on this, that’s how they had to solve the pain. Now they have a huge opportunity, one that may actually turn the whole company around, modernize it, and truly make it profitable long term.”
Sink or Swim
The amount of data today’s enterprises must track, organize, and store is overwhelming, and it’s just the tip of the iceberg. Data lakes are one tool that gives businesses more capacity to store this data. Their role is evolving as their capabilities support modern functions such as self-service analytics and AI driven by machine learning. SW
Jan2019, Software Magazine