By Cassandra Balentine
Businesses are in tune with the need to better manage and leverage data. Business intelligence (BI) provides opportunities for everything from reducing operational expenses to up-selling relevant products to existing clients in real time. However, none of it is possible without the root level of data and the database management systems (DBMS) that house it.
With all of the development and buzz surrounding big data, we are still just scratching the surface. “The challenge of managing and bringing together large amounts of data is only step one,” says Ryan Rosenberg, VP of marketing and services, FileMaker, Inc. “Managing data effectively is not enough to drive business change. The biggest challenge organizations face is how to make use of the data and how to unlock its potential to enable real-world enhancements to the business.”
DBMSs exist to house huge stores of data for their respective organizations, and a variety of tools are designed to plug in and turn that information into actionable assets. Gartner defines a DBMS as a product used for the storage and organization of data that typically has defined formats and structures. Further, the research firm notes that DBMSs are categorized by their basic structures and, to some extent, by their use or deployment.
At the top level, DBMSs divide into SQL and NoSQL solutions. Relational database management systems (RDBMSs) account for the major share of the market, while NoSQL solutions are built on other models, including—but not limited to—document stores, wide-column stores, and key-value stores.
According to DB-Engines, a knowledge base of relational and NoSQL DBMSs, the top ten ranked systems at press time were Oracle (Oracle, RDBMS), MySQL (Oracle, RDBMS), Microsoft SQL Server (Microsoft, RDBMS), PostgreSQL (PostgreSQL Global Development Group, RDBMS), MongoDB (MongoDB, Inc., document store), DB2 (IBM, RDBMS), Microsoft Access (Microsoft, RDBMS), SQLite (Dwayne Richard Hipp, RDBMS), Cassandra (Apache Software Foundation, wide-column store), and Redis (Salvatore Sanfilippo, key-value store).
The DB-Engines Ranking is a list of DBMSs sorted by current popularity. The site gauges a system’s popularity by the number of mentions on websites, general interest, the frequency of technical discussions about the system, the number of job offers in which it is mentioned, the number of profiles in professional networks in which it appears, and its relevance in social networks.
The role of DBMS is essential to the future of big data. DBMS providers are keen on trends and consistently update systems and support new offerings to remain relevant.
Trends in DBMS
Specifically as it relates to big data initiatives, several trends stand out for DBMS. Key goals involve access and movement of data, hybrid approaches that mix unstructured and in-memory options, cloud integration, and simplification.
Tiffany Wissner, senior director, data platform marketing, Microsoft, says customers are interested in handling and processing data quickly and efficiently to stay ahead of the competition. “The quick but deep insights delivered by real-time analytics capabilities, such as in-memory, are in demand. As customers use all kinds of databases with various frameworks and languages, including SQL, Hadoop, and NoSQL, they want to use these technologies together seamlessly.”
Lynnette Nolan, media contact, Apache CouchDB, adds that ensuring high availability for applications is increasingly important. She therefore expects to see organizations adopt strategies for moving data between data centers and devices to keep applications highly available. “Enterprises are adopting offline-first design principles for Web and mobile applications (apps). Organizations need to have the flexibility to move their data to locations both locally and around the world as well as the ability to enable app usage regardless of network connectivity.”
Nancy Kopp-Hensley, director of business data strategy, IBM, expects to see clients branching out to leverage different types of DBMSs as they advance in capability. “Just a few short years ago, clients were experimenting with mixing in unstructured capabilities and in-memory. Today, many clients are leveraging both capabilities in production. Doing so will not only enhance existing capabilities in analytics, but also allow them to leverage operational data faster and provide overall faster service to the business.”
Integration into the cloud is also apparent. “We expect to see many workloads moving to the cloud to enable self-service and speed development for key business applications,” says Kopp-Hensley.
“Additionally, we’re seeing a greater need for hybrid technologies as companies look to take advantage of the scalability and flexibility offered by the cloud for their big data solutions,” says Wissner.
The simplification of data access is another important trend. “There will be a continued trend toward simplifying data access in the enterprise. That can come in many forms, from simplified technology that removes heavy administration to fully managed data services in the cloud,” adds Wissner.
The Options
Many DBMSs exist, and new ones emerge constantly. Depending on the needs they serve, they range from SQL to NoSQL and come in a variety of types. Another differentiating factor is whether a solution is commercial or open source. Here, we highlight a few examples of popular DBMSs.
Apache
Apache delivers a range of DBMSs across a variety of types, one of which is CouchDB.
Nolan says it is an open source NoSQL database that moves data through a unique model of replication and synchronization. It allows data to be distributed across multiple data centers and devices, whether they’re on premises or across the globe—bringing application data to users wherever they need it. The database is well suited for high-traffic applications with heavy reads and writes, and CouchDB’s replication and synchronization allow for continuous use—even when devices are offline. Always-on applications can withstand network problems and maintain uptime.
Specific to big data, CouchDB stores data as JavaScript Object Notation (JSON) documents—a lightweight data exchange format—rather than in the structured tables found in RDBMSs. “This approach lends itself to a more de-normalized data model where all the data related to a particular record is encapsulated in a single document. Because all data in CouchDB can be logically grouped together into individual documents, CouchDB implementations provide the ability to shard the database for better performance and scale horizontally across many nodes,” says Nolan.
She explains that at the heart of CouchDB’s distributed system design is peer-to-peer, or master-less, data replication. This means that the same mechanisms that allow CouchDB to create readable and writable database replicas within a single cluster also allow it to move and synchronize data between clusters around the world, or directly with Web browsers and mobile apps. As apps grow, so does the amount of data users interact with and create.
“Eventual consistency works well for the type of applications CouchDB is built for, but can be limiting for applications that rely on databases to support transactions that prioritize hard consistency over high availability,” she offers.
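To make Nolan’s description concrete, the short Python sketch below stores a de-normalized JSON document and starts continuous replication through CouchDB’s HTTP API. The hosts, database name, and document contents are hypothetical, and the third-party requests library is assumed.

```python
# A minimal sketch of CouchDB's document model and replication, using its
# HTTP API via the "requests" library. Hosts, database, and data invented.
import requests

BASE = "http://localhost:5984"

# Create a database and store a de-normalized JSON document: everything
# about the order lives in one document, not across joined tables.
requests.put(f"{BASE}/orders")
requests.put(f"{BASE}/orders/order-1001", json={
    "customer": {"name": "A. Smith", "email": "asmith@example.com"},
    "items": [{"sku": "X-42", "qty": 2}, {"sku": "Y-7", "qty": 1}],
    "status": "shipped",
})

# Kick off continuous, master-less replication to a second cluster; the
# same mechanism syncs data to Web browsers and mobile apps.
requests.post(f"{BASE}/_replicate", json={
    "source": "orders",
    "target": "http://replica.example.com:5984/orders",
    "continuous": True,
})
```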
FileMaker
FileMaker solutions help organizations unlock business potential that Rosenberg says can become “trapped” in big data systems. “Using FileMaker’s External SQL Sources, FileMaker solutions can connect directly to SQL systems such as Oracle, Microsoft SQL Server, or MySQL, and enable real-time, two-way data exchange. Establishing integration is easy and fast. Once connected, organizations can leverage FileMaker’s strength in rapid application development to create custom workgroup solutions for Apple iPad and iPhone, Windows, Mac, and Web browsers,” he explains.
These solutions can deliver real-time reporting dashboards to executives, connect mobile workers using Apple iPads and iPhones to centralized systems, and automate departmental projects. Additionally, FileMaker solutions can be developed by departmental citizen developers without the need for programming experience. “Due to the ease of use and speed of FileMaker development tools, solutions can be created by those with the most direct knowledge of business problems, or by technical staff working directly with the business experts,” he adds.
Specific to big data, FileMaker offers a database platform that is tuned to the needs of real-time workgroup solutions. FileMaker is capable of holding and managing significant amounts of structured and unstructured data, but FileMaker itself is not a big data system. Rather, it connects to and extends big data systems to help unlock the business potential.
IBM
IBM features an extensive portfolio of more than 30 DBMS offerings, including its flagship DB2 database. Kopp-Hensley says DB2 is designed to be the next generation of database, fueling “the new era of big data, providing in-memory technology, industry leading performance, scale, and reliability on the user’s choice of platform—Linux, UNIX, and Windows.”
DB2 continues to innovate on IBM’s z/OS platform as well. “In fact, 80 percent of the world’s corporate data still resides on the mainframe because of its reliability and performance,” she adds.
DB2 is designed to provide continuous availability to data in order to keep transactional and analytics workloads operating at maximum efficiency.
Specific to big data, DB2 delivers analytics at the “speed of thought and includes a range of innovations including BLU Acceleration, a next-generation database technology that changes the game for in-memory computing,” says Kopp-Hensley.
BLU Acceleration is IBM’s in-memory technology that loads terabytes of data in memory and enables business users to perform real-time analytics on big data and transactional data in the same operational system using shadow tables. This eliminates the need to maintain one database for transactional workloads and another for analytical workloads.
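As an illustration of the shadow-table approach, the sketch below shows how such a table might be declared from Python using IBM’s ibm_db driver. The connection string and table names are hypothetical, and a real deployment would also need the replication layer that keeps the shadow table in sync with its base table.

```python
# A sketch of declaring a DB2 shadow table via the ibm_db driver.
# Connection details and table names are hypothetical; replication
# between base and shadow tables must be configured separately.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=sales;HOSTNAME=db2host;PORT=50000;UID=user;PWD=secret;",
    "", "")

# The shadow table is a column-organized copy of a row-organized base
# table, so analytic queries scan it in memory while OLTP work continues
# against the original.
ibm_db.exec_immediate(conn, """
    CREATE TABLE orders_shadow AS (SELECT * FROM orders)
        DATA INITIALLY DEFERRED REFRESH DEFERRED
        ENABLE QUERY OPTIMIZATION
        MAINTAINED BY REPLICATION
        ORGANIZED BY COLUMN
""")
```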
Extending its capabilities, Kopp-Hensley notes that the company acquired database-as-a-service startup Cloudant in March 2014. “IBM Cloudant is a NoSQL JSON document data store, delivered as a fully managed Web service or an on-premises cluster. Designed to move data freely between data centers, branch offices, and mobile devices, Cloudant provides a seamless experience for users by bringing the data close to where they need it,” she says.
IBM Cloudant features a clustered database designed to scale horizontally to meet high loads of concurrent users of an application. Geared towards high availability needs, Cloudant distributes users’ databases over many nodes, allowing users to scale past what could be held on a single server.
Microsoft
Microsoft’s DBMS solutions include SQL Server 2014, the Analytics Platform System (APS), Azure SQL Database, and HBase in Azure HDInsight, a NoSQL offering.
Microsoft SQL Server 2014 is the latest release of the company’s relational DBMS. It enables customers to build mission-critical applications and big data solutions using high-performance, in-memory technology across online transactional processing (OLTP), data warehousing, BI, and analytics workloads. It also uses a common set of tools to deploy and manage databases both on premises and in the cloud.
The in-memory features of SQL Server 2014 are what Microsoft says differentiate it from the competition. For example, the in-memory capabilities for OLTP built into SQL Server provide up to a 30-times performance improvement. It also has proven results with its in-memory column store technology, which has delivered more than a 100-times performance increase.
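As a rough illustration of the in-memory OLTP feature, the following Python sketch creates a memory-optimized table through the pyodbc driver. The server, database, and table are hypothetical, and the database is assumed to already have a MEMORY_OPTIMIZED_DATA filegroup.

```python
# A sketch of SQL Server 2014 in-memory OLTP via pyodbc.
# Server, credentials, and table are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};"
    "SERVER=sqlhost;DATABASE=Sales;UID=user;PWD=secret")
conn.autocommit = True

# A memory-optimized table lives entirely in RAM; SCHEMA_AND_DATA keeps
# a durable copy on disk so the data survives a server restart.
conn.execute("""
    CREATE TABLE dbo.SessionState (
        SessionId INT NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        UserId    INT NOT NULL,
        Payload   VARBINARY(2000)
    ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
""")
```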
It also supports both on-premises and cloud deployment. Microsoft offers SQL Server in a managed virtual machine on Azure, as well as Azure SQL Database, a platform-as-a-service offering supported by common development and management tools.
Specific to big data needs, SQL Server is a leading database offering. At press time, Microsoft reported more than 1.2 million downloads of SQL Server 2014 since its release. It is the cornerstone of Microsoft’s comprehensive data platform, which provides a variety of ways to work with data outside of the standard DBMS model. For example, SQL Server is able to connect to Hadoop infrastructures through connectors. And the company’s broader data platform includes all the building blocks customers need to take advantage of big data—whether they use SQL Server to store and retrieve data, Microsoft Azure HDInsight to deploy and provision Hadoop clusters in the cloud, Power BI to analyze and visualize data, or Azure Machine Learning to build predictive analytics solutions in the cloud.
The company’s APS is described as a modern data warehouse solution for storing massive amounts of data. APS includes built-in big data and BI tools, and features direct integration between data in Azure and on-premises data in the appliance.
APS brings both structured and unstructured data into a single, pre-built appliance. APS offers integration between Hadoop—HDInsight—and high-performance relational database management, tier-one performance, cloud integration with Microsoft Azure, and insights accessible to all end users through BI tools such as Microsoft Excel.
Microsoft’s Azure SQL Database is a relational database as a service, which makes tier-one capabilities easily accessible for architects and developers building business applications in the cloud. Customers using Azure SQL Database can scale up to 90 terabytes of data across hundreds of thousands of databases, supporting millions of log-ins per day. Its tiers are based on a predictable pricing model where features are baked in rather than bolted on.
Azure SQL Database delivers predictable performance, scalability, and self-managed service—all backed by Microsoft Azure. It is well suited for cloud-designed applications where near-zero administration and enterprise-grade capabilities are key.
Microsoft also supports HBase in Azure HDInsight. HBase is a NoSQL database component of the Apache Hadoop ecosystem. Support for HBase in Azure HDInsight adds to the list of features in Azure HDInsight, Microsoft’s open and flexible platform offering 100 percent Apache Hadoop as a service in the cloud.
HBase is a columnar NoSQL database built to run on top of the Hadoop Distributed File System. As a low-latency database, it can perform OLTP operations such as updates and inserts on data in Hadoop. An HBase database holds a set of tables that contain rows and column families that users must predefine. However, it provides flexibility in that new columns can be added to the column families at any time, so the schema can adapt to changing requirements daily.
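The sketch below illustrates that schema model with the third-party happybase Python client, which talks to HBase through its Thrift gateway; the host, table, and data are hypothetical.

```python
# A sketch of HBase's schema model using the happybase client
# (requires HBase's Thrift gateway; names and data invented).
import happybase

connection = happybase.Connection("hbase-host")

# Column families must be declared when the table is created...
connection.create_table("clicks", {"user": dict(), "page": dict()})

table = connection.table("clicks")
table.put(b"row-001", {b"user:id": b"42", b"page:url": b"/home"})
# ...but new columns within a family can appear at write time,
# with no schema change required.
table.put(b"row-002", {b"user:id": b"43", b"user:region": b"EMEA"})
print(table.row(b"row-001"))
```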
MongoDB
MongoDB is a fast-growing database ecosystem designed to help businesses transform their industries by harnessing the power of data. According to the company, MongoDB has the largest ecosystem of modern databases, with more than nine million downloads, thousands of customers, and over 700 technology and service partners.
When compared to traditional relational databases, Kelly Stirman, director of products, MongoDB, says three differentiating factors stand out—flexibility, scalability, and total cost of ownership (TCO). The solution provides flexibility with a dynamic schema and a JSON document data model that allow users to build and adapt applications faster, while providing seamless support for new and rapidly changing semi-structured and unstructured data. The solution also features the ability to scale horizontally using cloud platforms and commodity hardware. As for TCO, savings on time, software, and hardware mean projects delivered on MongoDB “cost a fraction of the cost of Oracle,” says Stirman.
When compared against NoSQL databases, she says that MongoDB offers developer productivity and ease of use, general purpose, operational simplicity, and a large ecosystem.
Relative to big data, Stirman notes that MongoDB is designed to support variety, volume, and velocity, and enables users to unlock value from data.
Rather than volume, a significant challenge presented by big data is diversity. “Big data applications incorporate a variety of data, bringing structured, semi-structured, and unstructured data together from mobile, social, cloud, and sensor-enabled applications. This diversity is a far cry from simple general ledger and address book applications that helped to popularize the relational database,” she explains. “Organizations must embrace database technologies that provide the flexibility to model, store, process, and analyze these new complex data types.”
MongoDB stores data as documents in a binary representation known as BSON—Binary JSON. Each document is a rich data structure, including sub-documents and arrays, “making them much better suited to the rich data generated by big data applications,” suggests Stirman. Each document can vary in structure without imposing changes to other documents in the system. Users can adapt the structure of a document’s schema easily, making it simple to handle the rapidly changing data generated by fast moving big data applications.
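For illustration, the following Python sketch (using the pymongo driver, with hypothetical database and collection names) stores two documents of different shapes in the same collection:

```python
# A sketch of MongoDB's flexible document model using pymongo.
# Host, database, and collection names are hypothetical.
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["appdb"]["events"]

# Two documents in the same collection can differ in structure: the
# second adds a nested sub-document and an array, with no schema change.
events.insert_one({"user": "alice", "action": "login"})
events.insert_one({
    "user": "bob",
    "action": "purchase",
    "cart": {"total": 59.90, "currency": "USD"},
    "items": ["X-42", "Y-7"],
})
```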
To support volume, MongoDB provides horizontal scale-out for databases on low-cost commodity hardware using a technique called sharding, which is transparent to applications. “Sharding distributes data across multiple physical partitions called shards. This allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the applications. MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases,” says Stirman. “Unlike relational databases, sharding is automatic and built into the database. Developers don’t face the complexity of building sharding logic into their application code, which then needs to be updated as shards are migrated. Operations teams don’t need to deploy additional clustering software to manage processes and data distribution.”
She explains that unlike other NoSQL databases, multiple sharding policies are available, including hash-, range-, and location-based, which enable users to distribute data across a cluster according to query patterns or data locality. “As a result, you get much higher scalability across a diverse set of workloads,” she adds.
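As a sketch of how transparent that is to operate, the commands below (issued here through pymongo against a hypothetical sharded cluster; the database name and shard key are invented for illustration) are essentially all that is needed to shard a collection on a hashed key:

```python
# A sketch of enabling transparent sharding by issuing MongoDB's
# administrative commands from pymongo. Router address, database,
# and shard key are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")

# Enable sharding for the database, then shard the collection on a
# hashed key so documents spread evenly across the cluster's shards.
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events",
                     key={"userId": "hashed"})
```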
Stirman admits that the ability to store data is only part of the challenge. “Organizations also need to analyze that data to make it useful to the business, whether enhancing the customer experience, guiding product development, or driving operational efficiencies.”
MongoDB is not limited to key-value operations. “You can build rich applications using complex queries and secondary indexes that unlock the value in structured, semi-structured, and unstructured data for highly scalable operational and analytical big data applications,” she says.
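A brief sketch of those capabilities, continuing the hypothetical events collection from above, pairs a compound secondary index with an aggregation pipeline that summarizes purchases per user:

```python
# A sketch of a secondary index and an analytical query in pymongo,
# reusing the hypothetical "events" collection.
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["appdb"]["events"]

# A compound secondary index speeds up queries beyond the _id key.
events.create_index([("user", 1), ("action", 1)])

# An aggregation pipeline: count purchase events per user.
pipeline = [
    {"$match": {"action": "purchase"}},
    {"$group": {"_id": "$user", "purchases": {"$sum": 1}}},
]
for row in events.aggregate(pipeline):
    print(row)
```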
PostgreSQL
The PostgreSQL Project is a not-for-profit, worldwide association of software developers that creates the PostgreSQL database. It is not a single company; rather, multiple companies support and contribute to the project.
Josh Berkus, contributor, PostgreSQL, explains that PostgreSQL is a general-purpose relational database, which means that it’s not specifically adapted to big data uses. However, many users find PostgreSQL works for smaller data warehouses—up to 10 terabytes—especially when they need to combine OLTP activity with analytics in a single package.
Unlike proprietary systems, PostgreSQL releases a new major version every year. Version 9.4, which was due out at press time, contains support for concurrent refreshes of materialized views and several new windowing query operations. In 2015, Version 9.5 will support block range (BRIN) indexes for very large tables and other big data features.
“As a liberally licensed open-source project, PostgreSQL is frequently forked and used as the basis for dedicated data warehousing systems, including Netezza, Greenplum, RedShift, Aster Data, Hadapt, Translattice, Postgres-XL, and CitusDB. By a count of platforms, PostgreSQL has been the genesis of more big data database systems than it hasn’t. This provides an advantage, because it allows users to move data between core PostgreSQL and various data warehousing systems using a uniform set of tools,” says Berkus.
PostgreSQL is also used as a user-friendly SQL interface to big data systems, which Berkus suggests, “may be awkward and idiosyncratic to work with directly.” In version 9.1, PostgreSQL added Foreign Data Wrappers, a feature that allows users to interface with external data sources as if they were local tables.
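As a sketch of Foreign Data Wrappers in action, the Python snippet below uses the psycopg2 driver and the postgres_fdw extension (which ships with later PostgreSQL releases) to expose a remote table as if it were local; all server names, credentials, and tables are hypothetical.

```python
# A sketch of PostgreSQL Foreign Data Wrappers via psycopg2, using the
# postgres_fdw extension. All connection details are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=local user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
cur.execute("""
    CREATE SERVER warehouse
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'dw.example.com', dbname 'analytics')
""")
cur.execute("""
    CREATE USER MAPPING FOR CURRENT_USER SERVER warehouse
        OPTIONS (user 'report', password 'secret')
""")
cur.execute("""
    CREATE FOREIGN TABLE sales_facts (id bigint, amount numeric)
        SERVER warehouse OPTIONS (table_name 'sales_facts')
""")

# The remote table now queries like a local one.
cur.execute("SELECT count(*) FROM sales_facts")
print(cur.fetchone())
```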
Extending the discussion of PostgreSQL, database startup Citus Data has open sourced a tool, called pg_shard, that lets users scale PostgreSQL deployments across many machines while maintaining performance for operation workloads. pg_shard is a PostgreSQL extension that evenly distributes the database as new machines are added to the cluster.
According to Citus Data, having an elastically scalable, open RDBMS that runs on commodity machines complements commodity scale-out infrastructure, whether in a public, private, or hybrid cloud. CitusDB and the new open-source transparent sharding extension, pg_shard, allow users to tackle heavy mixed analytic and short-request workloads using PostgreSQL.
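A minimal sketch of what that looks like in practice, again via psycopg2, follows; the function names reflect pg_shard’s published interface, while the table and cluster details here are hypothetical.

```python
# A sketch of distributing a table with the pg_shard extension via
# psycopg2. Table, column, and shard counts are hypothetical; worker
# nodes are assumed to be listed in the cluster's membership file.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION pg_shard")
cur.execute("CREATE TABLE events (user_id bigint, payload jsonb)")

# Declare the partition column, then carve the table into 16 shards,
# each replicated to 2 worker nodes.
cur.execute("SELECT master_create_distributed_table('events', 'user_id')")
cur.execute("SELECT master_create_worker_shards('events', 16, 2)")
```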
DBMS for You
It is important to acknowledge that there is no one-size-fits-all database technology to service all big data application requirements. MongoDB’s Stirman points out that as organizations expand the range of data-driven applications they deliver to market, CIOs are rationalizing their technology portfolios down to a strategic set of vendors they can leverage to more efficiently support their business. These trends must be balanced when adding to or enhancing a big data strategy.
“We believe modern databases will attempt to address this dynamic by supporting multiple data storage engines within a single database architecture. These innovations will enable users to deliver a wider variety of applications—from read- to write-heavy analysis to low latency in-memory—while using the same database query language, data modeling, and tooling,” she adds.
In order to get the most out of data, organizations looking to leverage BI should consider how accessible the data stored within a DBMS is, as well as how easily it integrates with other systems. SW