By Olivia Cahoon
Data virtualization is a data management strategy that allows applications to retrieve and manipulate data without needing technical details such as how the data is formatted or where it is located. Deployable in the cloud and on premises, data virtualization is available as standalone software, data integration solutions, and application tools.
According to MarketsandMarkets research, the data virtualization market is projected to grow at a significant rate over the next five years, driven by the need to integrate large volumes of data stored in heterogeneous sources across enterprises into a unified view that enhances decision making. The research firm expects the market to grow from $1.28 billion in 2016 to $4.12 billion by 2022, a compound annual growth rate of 21.1 percent over the forecast period.
Accessing Data
Data virtualization enables real-time access to data regardless of location, source system, or type. It delivers a data services layer that integrates data and content on demand from heterogeneous sources in real time, near real time, streaming, and batch, says Lisa Spagnolie, senior director, product marketing, SAP. She believes that what differentiates data virtualization from other data integration methods is that it does not move data at all. Instead, it reads the data in place using connectors, federates the queries that business applications need, and caches the results in memory as virtual views.
“These virtual views can then be exposed to consuming applications through common web service protocols for data sharing,” explains Spagnolie. Data virtualization is often used to complement other data delivery styles to improve flexibility, agility, reusability, and cost savings as a component of an end-to-end data integration architecture.
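As a rough illustration of that read-in-place, federate-and-cache pattern, the following self-contained Python sketch uses two SQLite databases as stand-ins for heterogeneous sources. All table and function names are invented for the example; commercial platforms do this with connectors and a distributed query engine rather than local files.

```python
# Minimal sketch of federation: read two independent sources in place,
# join on demand, and cache the combined result in memory as a "virtual
# view." Nothing is copied into a warehouse.
import sqlite3

# Stand-ins for two heterogeneous sources (e.g., a CRM and a billing system).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme Corp"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 1200.0), (1, 340.5), (2, 99.0)])

_cache = {}  # virtual views cached in memory, as described above

def virtual_view(name):
    """Federate both sources on demand; no physical copy is created."""
    if name not in _cache:
        customers = crm.execute("SELECT id, name FROM customers").fetchall()
        totals = dict(billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices "
            "GROUP BY customer_id"))
        _cache[name] = [(n, totals.get(i, 0.0)) for i, n in customers]
    return _cache[name]

print(virtual_view("customer_revenue"))
# [('Acme Corp', 1540.5), ('Globex', 99.0)]
```

The point is that the customer revenue view exists only as a cached, on-demand combination of the two sources; no extra copy of the data lands anywhere.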
It also reduces the risk of data exposure by not creating additional physical copies of data while providing reverse proxy access to data, keeping physical data sources anonymous. “With data virtualization, customers securely access and integrate data without the need to physically move the data from its current location,” adds Spagnolie.
Additionally, data virtualization provides analysts and applications with easy, yet governed access to the widest possible range of data sources including enterprise data, big data, cloud data, and device data, shares Bob Eve, senior director, TIBCO Software. “Data virtualization transforms data from source structures and syntax into business-friendly formats and terminology.”
Core Functions
Traditionally, organizations stored data in various datastores throughout the enterprise, resulting in data silos and difficult data access. For example, a data scientist attempting to build a machine learning model against customer data needs to discover the data sources and what information is stored, develop the data pipelines to copy data into a centralized location, and then build models to complete the work, says Tanel Poder, chief evangelist and co-founder, Gluent Inc.
However, with data virtualization, customer data is accessible from anywhere. “This simplifies the process and shortens the time to data access for our data scientist, allowing her to develop results faster and more efficiently. Plus, once she’s ready to share the final dataset with others, they can transparently query the data from any location,” explains Poder.
Data virtualization improves the agility of the data platform to support a variety of business use cases, including providing an alternative to moving data for regulatory reasons. Spagnolie believes data virtualization is useful when regulations prohibit the physical movement of data from one system to another. For example, if an organization’s policy restricts the physical movement of personal data outside of its firewall, Spagnolie says the organization can use data virtualization for abstracted data sharing without making the additional copy that the policy prohibits.
Another emerging use case involves large amounts of event data generated through Internet of Things (IoT) sensors installed on equipment. Spagnolie says this sensor data is only truly useful when combined with other related enterprise data to give the perspective needed for predictive maintenance and advanced analytics. “Data virtualization tools can potentially federate data that is either at rest, in motion, in message queues, or directly from the IoT edge nodes, with relevant enterprise data.”
Extract, transform, and load (ETL) has been the common method for transforming and moving data in an enterprise for years, but building out ETL processes to move data from point A to point B can become a bottleneck in providing timely access to enterprise data, says Poder. “Developing data pipelines often requires launching a new ETL project, expending the valuable time of data engineers, and ends up as a complex spiderweb of data flows throughout the organization,” he explains. With data virtualization, the extract and load steps are virtualized, leaving only the transform to be performed.
Data virtualization also allows transparent access to enterprise data without exposing details such as the data’s physical location. According to Poder, it requires zero code changes to existing applications, allowing the same database-specific SQL syntax and approach to data access regardless of where the data physically resides. Furthermore, queries are performed against tables from various data sources using the data virtualization platform, all without ETL processes.
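A minimal sketch of the zero-code-change idea Poder describes, under the assumption that the virtualization layer speaks the same SQL dialect and connection interface as the original database; the dv:// endpoint in the final comment is hypothetical.

```python
# The application function never changes; only the connection target does.
import sqlite3

def monthly_revenue(conn):
    """Identical SQL whether 'invoices' is a local table or a virtual
    table the virtualization layer resolves against a remote source."""
    return conn.execute(
        "SELECT substr(billed_on, 1, 7) AS month, SUM(amount) "
        "FROM invoices GROUP BY month ORDER BY month").fetchall()

# Today: a direct connection to the source system (simulated in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (billed_on TEXT, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [("2018-06-01", 100.0), ("2018-07-15", 250.0)])
print(monthly_revenue(conn))  # [('2018-06', 100.0), ('2018-07', 250.0)]

# Tomorrow: the same function pointed at a virtualization endpoint, e.g.
# conn = dv_driver.connect("dv://virtual-layer/finance")  # hypothetical
```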
Other functions associated with data virtualization include assimilating data from the cloud, centralized data governance and security, data quality, metadata management and tracking, and rapid prototyping for physical data movement.
Driving Demand
One primary driver of demand for data virtualization is the need for speed: enterprises can no longer afford latency in the data supply chain, and analytical, reporting, and other operational applications need access to data as it is created, shares Ravi Shankar, chief marketing officer, Denodo.
It also offers a cost-effective alternative to other data integration tools that copy data from multiple sources and house it in a central repository, duplicating the data, delaying delivery, and increasing warehousing costs. “In contrast, data virtualization provides real-time data access without replication and its associated costs,” offers Shankar.
Other data integration technologies may also require numerous developers due to intensive coding requirements. Because data virtualization combines data using virtual views, Shankar says it requires an average of one-fourth the number of developers used in other technologies.
The demand for data virtualization is also driven by growing data volumes and a proliferation of enterprise data silos. Michael Rainey, technical advisor, Gluent Inc., reveals that data volumes are growing exponentially, with customer transactions, IoT devices, streaming data, and other sources constantly creating data. Additionally, the cost to store, transform, and share data is increasing.
The need for IoT data access also drives enterprises to data virtualization. IoT data access can be difficult for existing applications and reporting systems based on relational database management systems (RDBMS). “Typically, due to the sheer volume, this data lands in a modern, centralized distributed data store, such as Hadoop, rather than an RDBMS,” says Rainey. With data virtualization, IoT data can be shared as virtual tables and transparently accessed in real time via the RDBMS.
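To give a feel for how that mapping is typically declared, the strings below sketch Hive-style DDL for exposing a Hadoop-resident Parquet dataset as an external table, followed by an unchanged reporting query. The exact syntax varies by vendor and tool, and the paths and table names are purely illustrative.

```python
# Illustrative only: Hive-style DDL mapping a Parquet dataset in Hadoop
# to an external table. Vendor platforms express this differently.
VIRTUAL_TABLE_DDL = """
CREATE EXTERNAL TABLE sensor_readings (
    device_id    STRING,
    reading_ts   TIMESTAMP,
    temperature  DOUBLE
)
STORED AS PARQUET
LOCATION 'hdfs://datalake/iot/sensor_readings'
"""

# Existing reporting SQL is unchanged: the join spans an RDBMS asset
# table and the Hadoop-resident virtual table transparently.
REPORT_QUERY = """
SELECT a.site, AVG(r.temperature) AS avg_temp
FROM assets a
JOIN sensor_readings r ON r.device_id = a.device_id
GROUP BY a.site
"""
```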
Evolving Solutions
Originally designed for data engineers, data virtualization is becoming easier and smarter, while evolving to better support analysts and other less technical users, says Eve. Today’s data virtualization uses machine learning to automatically catalog data, generate data models, and optimize queries.
Furthermore, as data volumes, variety, and velocity grow exponentially, Eve believes data virtualization capabilities are continuing to expand rapidly. “Massive parallelization, in-memory grids, streaming engines, and more are just a few of the technologies most data virtualization solutions now use to add performance, reliability, and scale.”
Eve shares that according to the Forrester Wave: Enterprise Data Virtualization, Q4 2017, many enterprise architects see the opportunity in data virtualization and are not standing idly by: 56 percent of global technology decision makers say they have already implemented data virtualization technology or are implementing, expanding, or upgrading their implementations, up from 45 percent in 2016.
Data virtualization is projected to become a major component of enterprise data fabric. “With a centralized, one-stop shop for enterprise metadata, the entire organization will be able to seamlessly access all enterprise information,” shares Poder.
In the next three to five years, he predicts a major push toward data virtualization in the cloud. “To take advantage of infrastructure as a service, relational databases that were previously on premises migrate to the cloud.” Because on-premises databases rely on performance features specific to the local infrastructure, Poder believes a database can lose horsepower once it migrates away from the local enterprise data center. “Data virtualization provides a powerful backend for computation in order to achieve comparable, or better, query performance.”
Data-Driven Businesses
To succeed in today’s competitive landscape, near real-time analytics is essential. A variety of businesses, if not all, benefit from data virtualization, as there are data silos across all large enterprises, says Madhu Kochar, VP, product development, IBM Analytics.
“Increasingly varied data sources—old and new—must be accessible by lines of business (LOB), on request. The abstraction layer of data virtualization provides unified access for LOB applications and self-service tools, as well as the enterprise-grade performance those different applications need,” explains Kochar.
The first adopters of data virtualization included data-driven businesses like investment banks, pharmaceutical firms, and communications services providers. But today, all businesses are data-driven and can benefit from data virtualization. “From a time-to-solution and ROI basis, data virtualization is an easy sell,” admits Eve. In his book, Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, Eve profiles ten customers that achieved time and cost savings of up to 80 percent.
Additionally, business users like customer service agents, data analysts, and marketing managers are main beneficiaries of data virtualization. Shankar says data virtualization provides them with real-time, holistic, and up-to-date data, served from the applications they use most. IT teams also benefit as data virtualization enables them to focus on core tasks.
Tools in Demand
While a variety of data virtualization tools allow applications to retrieve and manipulate data, organizations also seek solutions that offer enterprise data sharing without the need for complex ETL processes. “Moving data can become expensive, both to build and maintain. Data virtualization provides a simple and rapid approach to data sharing,” says Rainey.
Equally as important is transparent access to data silos without the need to change application code. Rainey believes data virtualization should require zero code changes to existing applications in order to access data across heterogeneous sources. “Ensuring the process for data sharing is transparent is key to user adoption.”
In the data virtualization market, organizations look for tools that provide greater agility to query data across data silos. According to Kochar, organizations are interested in using the technology to reduce the heavy costs associated with tasks like data duplication and storage. They also look for ways to make information easier to view across the organization and to simplify complexity. “In addition to all of this, they’re looking for a trusted environment through user security and data curation.”
Organizations typically seek query performance comparable to or better than direct access to the data. All too often, Poder says, data virtualization tools that promise connectivity to any data source fall short of query performance expectations. With the proper tools, data is offloaded to a centralized, distributed storage and compute engine. “Queries that run against these virtualized tables can now have joins, filters, and aggregations pushed down to use the power of the distributed computation engine.” This reduces resource usage on the source database and speeds query performance.
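A small sketch of the pushdown idea, with SQLite standing in for the distributed engine; the contrast between fetching every row and shipping the predicate and aggregate to the source is the point, not the specific engine.

```python
# Query pushdown: run the filter and aggregation inside the source
# engine rather than extracting all rows and computing client-side.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (device_id INTEGER, temp REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(1, 20.5), (1, 21.0), (2, 35.2), (2, 36.1)])

# Naive federation: extract every row, then filter and aggregate locally.
rows = source.execute("SELECT device_id, temp FROM readings").fetchall()
hot = [t for d, t in rows if d == 2]
print(sum(hot) / len(hot))     # 35.65 -- but all rows crossed the wire

# With pushdown: the predicate and AVG() execute inside the source
# engine, and only the one-row answer travels back.
print(source.execute(
    "SELECT AVG(temp) FROM readings WHERE device_id = 2").fetchone()[0])
```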
Challenges
Cases remain where data virtualization is generally unsuitable, including situations that require complex data cleansing and transformation, as well as high-volume transactional workloads. Spagnolie suggests organizations avoid data virtualization if source systems have limited spare capacity to accommodate the additional workloads it generates. Additionally, data access speed depends on the remote database’s capabilities and the amount of data transferred across the network.
Shankar reveals that some older data virtualization solutions only support data federation, which is a simple form of data virtualization in which two or more sources appear as a single source. However, he says these solutions don’t provide advanced data virtualization functionality like a unified security and governance layer. “Some data virtualization solutions do not perform dynamic query optimization or provide massively parallel processing capabilities to boost performance.”
Additionally, Poder warns that most available data virtualization tools do not truly perform virtualization end to end. “True data virtualization must allow existing enterprise applications to continue operating unchanged or without major reengineering or code rewriting efforts.” Otherwise, he believes, virtualization software becomes a data federation tool that requires source application code, including all dependencies, batch jobs, and data feeds, to be modified to query the new engine.
Data Virtualization
Today’s data is ever-changing and increasing in a variety of formats. To navigate this complex data supply chain and ease the viewing of information, businesses invest in data virtualization solutions with real-time analytics, user security, and transparent access to data silos.
Jul 2018, Software Magazine