By Amber E. Watson
Industry analysts and software vendors agree that certain skill sets are necessary for employees collecting and translating big data within their organizations. To address growing concern over the shortage of such expertise, software developers are working to simplify user interfaces so that novice users can more easily extract and analyze data.
Thanks to this new breed of business analytics products, there is opportunity for everyone in an organization to see and understand data. Still, the right mindset and certain skills play an important part in the emerging role of the “data scientist.”
While a move to simplified systems is expected to reduce the need, it is still important that companies train or hire a skilled team to effectively analyze big data.
The U.S. Department of Labor predicts a 25 percent growth in the need for analytics-trained workers through 2018. At its Symposium/ITxpo 2012, research firm Gartner estimated the creation of more than 4.4 million big data-related jobs by 2015, of which only one-third are expected to be filled. These statistics show a need both for future talent and for new technologies that eliminate or reduce the drastic demand. By building solutions that mask complex processes and automate analytics, developers can help ease the shortage of skilled data scientists.
The Role of the Data Scientist
With the spotlight on big data, the role of culling and analyzing it continues to evolve.
“Over the last five years, the term data scientist has become more widely applicable to a growing number of people throughout an organization as opposed to a narrow team of statisticians, due to the increasing availability of business intelligence platforms, and the strategic use of analytic frameworks across the organization,” states Andres Parker, senior consultant, ElementOne Analytics, Neustar, Inc.
According to Tom Wheeler, senior curriculum developer, Cloudera, today’s data scientist is a multidisciplinary role, which is found at the intersection of several other roles such as statistician, software engineer, and business analyst.
“In a glorified way, today’s data scientist is yesterday’s business analyst,” compares Puneet Pandit, founder/CEO, Glassbeam. “What’s changed is the landscape on the amount of data, technical skills required to mine data, and an aptitude to be iterative on finding the right answers to questions from the data itself. Gone are the days where questions are well-framed up front and data is structured to find answers right away.”
Daniel Ziv, VP, voice of the customer analytics, Verint, agrees that in the past, when the role was more statistics-based, there were specific tactical questions for which an organization tried to find answers. Now, however, defining what the issue is or what the question should be is a key part of the equation. Less focus falls on technical expertise and more on the business problem.
Not only must a person be experienced in IT to implement the best big data technologies, but he or she must also have a strong business sense to be able to identify opportunities to glean value from big data. “I am an evangelist inside and outside the company for technical thought leadership and domain expertise. I also spend a good chunk of time with customers, then work with our technical and business teams to develop solutions to address their business problems using our software,” says Michael O’Connell, chief data scientist, TIBCO.
According to Donald Farmer, VP of product management, QlikTech, data scientists currently play many roles in an organization. First, they must discover or establish the data sources available that are most useful for analysis. The second role is to analyze and discover patterns, trends, and exceptions within the mass of interrelated data. “This requires an understanding of what is significant in terms of the data itself, statistically perhaps, but also what is potentially important to the business,” says Farmer. Finally, the data scientist communicates and explains the findings. He or she may also have a role in operationalizing the discoveries.
Data scientists represent the evolution of data-driven business analysis. “They understand where critical data is, how to access and integrate it, extract critical information from it, and how to best communicate so that it drives business action,” says Tamara Dull, director of emerging technologies, SAS. Today’s role is more discovery based, and data scientists often find themselves playing a key part in a business’ decision-making process.
The Skills
Data scientists possess a diverse mix of skills. “A true data scientist is a ‘renaissance’ individual in their field, possessing skills across a variety of disciplines,” shares Anjul Bhambhri, VP of big data, IBM. “While core skills are rooted in computer science, statistics, math, and analytics, a data scientist is set apart by the addition of strong business acumen and an understanding of their industry that allows them to apply business decision-making processes to the insights gained from their analysis.”
Parker highlights programming and statistics as the two most important skills. “At a minimum, he or she must understand basic mathematical theory and be able to connect the dots when the answer is not clearly spelled out. True data scientists take critical thinking to the next level,” he states.
Data scientists need to know what type of data to collect, how to analyze it, how to interpret the results, and how to act on what is found. “A good data scientist also needs to be able to write code, especially in Python, R, and Java,” advocates Wheeler. “He or she does not have to be a programming expert, but needs to be able to get the job done. A data scientist is more likely to write prototypes and custom processing scripts than the infrastructure of production systems.” He attests that versatility is integral, since a typical day might involve an arsenal of tools ranging from traditional databases or statistical packages to big data systems like Impala, Apache Hadoop, and machine learning libraries.
Pandit adds that a data scientist should exhibit a passion not only for analytics but also for discovering, and in some cases even creating, new insights that were untapped in the past. “Besides the soft skills, hard skills require statistical data modeling experience with open source tools like R Systems, or well-known commercial applications like SAS or SPSS.”
Slava Pastukhov, content marketing specialist, Dundas Data Visualization, Inc., agrees that good data scientists have data modeling experience for data warehousing, including star and snowflake schema designs. “He or she should also have experience with the extract, transform, load processes, and a number of data integration tools. Obviously, the more databases the person is familiar with, the better.” Since almost no database is fully complete and consistent upon export, data scientists should also be familiar with data cleansing, performing low-level cleaning to ensure data is ready for users.
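To make the idea of low-level cleaning concrete, the short Python sketch below tidies a small export. It is an illustration only, not a method attributed to anyone quoted here: the sample data, the column names, and the `clean_rows` helper are all hypothetical, and a real project would likely lean on a dedicated data integration or data quality tool.

```python
import csv
import io

# Hypothetical export with typical problems: stray whitespace,
# inconsistent capitalization, a missing value, and a duplicate row.
RAW_EXPORT = """customer_id,region,revenue
101, East ,2500
102,west,
101, East ,2500
103,WEST,1800
"""


def clean_rows(raw_text):
    """Return de-duplicated, normalized rows from a CSV export."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Trim whitespace from every field.
        row = {key: value.strip() for key, value in row.items()}
        # Normalize categorical text to one capitalization.
        row["region"] = row["region"].title()
        # Represent a missing number explicitly rather than as "".
        row["revenue"] = float(row["revenue"]) if row["revenue"] else None
        # Skip exact duplicates of rows already kept.
        fingerprint = tuple(row.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    for record in clean_rows(RAW_EXPORT):
        print(record)
```

In this sketch the duplicate record is dropped, region names are brought to a single spelling, and the missing revenue figure is flagged as missing instead of being silently treated as zero, which is the kind of decision a data scientist must make deliberately before any analysis begins.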
Furthermore, managing unstructured data such as video, texts, and emails requires someone who comes from the business side of an organization. “These employees are more comfortable with the concept that there is not one precise answer, which is often the case when ‘adding up’ unstructured data,” adds Ziv.
Farmer has seen traditional database developers extend into data science roles by sharpening their analytic skills. “Similarly, I know analysts that had to acquire new skills in data access and querying, especially in big data, because they needed to explore sources that were not readily available within the organization’s existing infrastructure,” he shares.
Ravi Chandran, CTO, XtremeData, Inc., advises organizations to look for someone with the ability to identify new data sources, such as clickstreams, social media, and mobile data, that provide useful information to the business, as well as the capacity to understand new technologies such as Hadoop, NoSQL, MPP, and SQL, along with the competence to apply appropriate technology to design a data flow pipeline for analysis.
Ziv describes a business-based data scientist as someone who falls somewhere between a research scientist and a business consultant. “He or she is able to aggregate seemingly disparate data and translate it into business context. They also have the ability to deal with the non-specific information within unstructured data. Because they can orally and visually communicate findings in a way that relates to business issues, they make the data actionable,” he says. This is where solid communication skills come into play.
“A data scientist must communicate effectively with a variety of people including technical staff and business executives,” suggests Wheeler.
Analytic skills are simply not enough. “As data science continues to reach almost every field of business, data scientists must develop the skills of collaboration and storytelling—a unique discovery is not effective unless the business can understand it,” shares Farmer. Storytellers impart meaningful and actionable descriptions around data.
Inside Job
While hiring new talent is an option, it may not always be feasible. Bhambhri recommends that organizations look within at a number of data professionals, from application developers to database administrators to IT managers, who possess the foundational mathematical, statistical, and computer science knowledge to develop more targeted big data skills. Since data science involves aspects of several roles already staffed, it is worth looking internally first.
Several tools, frameworks, and libraries help lower the skills barrier. Gary Nakamura, CEO, Concurrent, Inc., advises organizations to pick tools that are widely deployed and hardened, and that can be incorporated into standard development processes. “Once you have a clear understanding of what is available, assess the skills gap you face,” he suggests. “Organizations that do due diligence realize they can easily leverage their existing enterprise Java developer skills, data analysts, data warehousing ETL experts, and data scientists to execute big data strategies.”
Wheeler poses the following questions: “Do any of your statisticians know how to write code? Do any of your programmers have a scientific or mathematical background? Are they versatile and creative? Are they as comfortable pitching an idea to a senior executive as they are leading a technical session to their team?” Once candidates are identified, training opportunities help develop and extend skills.
IBM’s Big Data University, for instance, is a free online set of tutorials and classes designed to help working professionals grow big data skills. Courses range from traditional analytics to newer and more disruptive technologies like Hadoop. IBM also works closely with academia and universities and provides access to software through the academic initiative program so that students and faculty can use the software as part of the course work.
Cloudera University likewise offers online certification and a range of training courses targeted at specific job roles, including developers, system administrators, and data analysts, as well as a data scientist certification program.
Pandit suggests conducting regular “hackathon” sessions among different engineering teams. “Give them broad directions and lots of data, turn them loose on analytics with different tools. Of course, teams must be organized by domain or vertical expertise so that the experiments are constructed in a meaningful way.”
Pastukhov also recommends that employees practice with whatever non-confidential business data is available, with the added bonus that they may discover important facts along the way. “If no business data is available, there are hundreds of open databases that are free to experiment and use,” he says. “With dozens of great resources online that deal with the best practices of visually displaying data, any novice can get a tentative grasp of what goes into building decent data visualizations.”
Matt Quinn, CTO, TIBCO, reminds people that most big data technology providers have teams that work to train customers’ employees. “The technologies themselves have become fairly easy to implement, so with minimal training a company is up and running and realizes benefits from their big data in a matter of months.”
To do this, it is important to partner with a well-known big data technology company.
While someone can learn specific technical skills in a training class, certain soft skills are harder to learn. “If you have a candidate who is naturally curious, creative, and versatile, with an analytical mindset, and knows the basics of data analysis, hire them and send them to a training class to learn Hadoop and Impala,” suggests Wheeler. “On the other hand, if you have someone who has the technical skills but cannot articulate how to apply it in a meaningful way, it may be better to pass.”
In many cases, an organization does not need to set up a new competency center just for big data. Instead, Dull suggests focusing on individual roles and skills that add to your existing competency center(s). “This may mean the addition of data scientists, Hadoop experts, or others—and it may be possible to expand the skills of current members,” she says.
When it comes to analytics, the more the merrier. As Pastukhov points out, “If only certain employees in your company are able to analyze the data, it not only puts a strain on them to find relevant information for multiple departments, but it also increases the likelihood of something falling through the cracks.”
Instead of focusing on hiring an employee with a specialized skill set, research potential business tools to help everyone in the company spot trends and patterns.
Devise a Team
It is rare that all the skills necessary for effectively analyzing big data are available in a single person. More often, a team is involved, with different members possessing strengths in specific areas.
“The requirement is not simply for one type of skill, such as deep analytics; it is as important for the organization to have the right skills to collect, integrate, and prepare the data as well as the skills to interpret the results and take action based on the result,” points out Dan Vesset, program VP, business analytics and big data, IDC. In other words, while critical to the analytic workflow, data scientists alone do not suffice.
Bhambhri explains that data scientist teams are usually composed of statisticians (or modelers), domain experts (or analysts), and cognitive scientists (or interaction designers), bringing together capabilities for deriving information from data, applying the insights appropriately to the domain, and presenting them in an innovative visual form.
Chandran recommends forming a small three- to five-person team, and giving them a short, three-month, hands-on pilot project that includes clear deliverables. “Identify new data sources and propose technologies for an analysis pipeline.” Understanding the scale and objectives of the data collection and analysis helps define the right size of the team.
Additionally, Ziv encourages organizations to define their top business issues by examining the data, then assemble a cross-functional team that involves people from the business side, including experts with typically 10 to 15 years’ experience in the field. “Pick tools that are relevant to the business issue, then train employees to put their resources to work. It is typically faster to pick practitioners familiar with the business issue and educate them on how to use the technology, versus selecting someone who is prolific with the analytics tools, but uneducated on critical business issues.”
According to Pandit, software teams looking to leverage big data analytics should be well versed in a cloud-based model. “It is almost impossible to play with lots of data in the old Web-based development tool methodology where data resides inside a firewall,” he cautions. “The challenge in such use cases is the infrastructure needed to handle peak loads on compute, storage, and networking bandwidths. A cloud, like Amazon, provides these resources dynamically, scaling up or down, and at a competitive price.”
Furthermore, a culture of change is important. “To act differently based on new discoveries from big data goes beyond the system and analysis interdependencies, to that of a willingness to alter business processes based on discovered insights,” says Dull. “Such teams therefore not only need the skills, characteristics, and systems, but also the power to advise an organization that is willing to make the necessary changes for greater efficiency or effectiveness.”
In many cases, a business analytics team already exists. “It is up to the organizations to change the paradigm of data management from focusing on the low-level technologies and the process of retrieving the data into one that supports questioning by non-experts,” says Elissa Fink, CMO, Tableau Software. User-friendly software aids everyday business workers in analyzing data and making decisions themselves.
Smart Solutions
Many smart tools and features now provide proactive capabilities that help users navigate powerful analytic tools.
Dull notes that while there is debate regarding the skills for a data scientist, technology is advancing to lower the barriers to entry, with simplified interfaces and drag-and-drop exploratory environments. “Moreover, skills can be taught to willing individuals,” she says.
However, IDC’s Vesset cautions against using too many shortcuts, as mistakes are often made by taking results of analysis out of context, applying inappropriate algorithms, or introducing a variety of biases into the analytics and decision-making processes. “There are, however, inefficiencies in the analytics workflow that technology helps minimize or eliminate,” he shares. “For example, about three quarters of a typical big data and analytics project may involve preparing data for analysis. Data integration, data quality, data access, and master data management technology helps automate this process. Similarly, modern visualization tools help simplify interaction and manipulation of data sets.” Some of the latest databases have automation functionality to help store data in the most appropriate storage layer, eliminating certain administrative tasks that had to be performed by people in the past.
“With role-based interfaces, analytical solutions are easy for novice users, information consumers, and data scientists,” says Dull. “One system configured for different audiences in one cohesive platform expedites the interdependencies of business decisions, and analytical intelligence built into the software makes it more accessible to less sophisticated users and eases some of the analysis work. Once a method, model, or process is defined, standardization of routine tasks that automate activity is a key feature of powerful analytic technologies.”
A final element, notes Dull, is the automatic documentation of detailed coding specifications behind the scenes of a drag-and-drop, role-based interface. “For novice users this provides a way to learn the underlying technology, and teaches them how an open system operates. For organizations, it provides a complete audit trail of the analytical system,” she adds.
Parker finds the ability to show segments that are relevant and precise, along with dynamic data visualization, most helpful. “For example, if the goal is to define a business’ ‘best customer’ and then to quantify the potential to reach more customers like them within a market, one can create a stand-alone schema from scratch, tie that to market potential, and identify which audiences, messages, or venues to focus marketing efforts. Alternatively, one can link disparate data sets from internal and external sources, then visually explore what the data is saying in order to efficiently target consumers or measure results. Developing or testing a hypothesis with just a few clicks saves time and resources, and the more precise the data, the more actionable the results.”
Several systems support the ability to analyze and leverage big data in real time. “Analytics and visualization software turns rows of data into interactive graphics helping businesses more quickly spot opportunities and risks. In-memory computing uses shared RAM resources, enabling immediate real-time response from systems and services. Business process management works to optimize processes end-to-end to drive operational excellence, and master data management enforces processes and policies to uphold data accuracy and consistency,” adds Fink.
Hadoop and surrounding technologies evolve to meet the demands of enterprises, reducing the need for specialized skills or experience. “A few years ago, doing analytics on Hadoop required specialized skill and custom code. Leading analytics and business intelligence tools today already support Hadoop, Hive, and now Impala,” shares Wheeler. “Where the data comes from is just an implementation detail for an end user of these applications. It is no longer necessary to choose between scalability and ease-of-use, meaning it is as easy to query data in Hadoop with these tools as it was to query it in a relational database,” he adds.
Data scientists who have evolved with the market and possess the new skills for technologies like Hadoop are in high demand, especially those with software development skills in tools such as Cascalog, Scalding, Pig, Hive, and MapReduce. Tools such as Concurrent’s Cascading Pattern help close the gap, enabling any data scientist to deploy machine learning applications and predictive models on Hadoop with little or no new training.
A More Approachable Role
While there is a need for people with expert statistical and programming skills to help with highly sophisticated analytical questions, new tools for analysis make the job easier and more approachable for everyone in the organization.
Fink reminds us that it is not about a few smart features slapped onto heavy, hard-to-use BI products; instead, it is about a new breed of easy-to-use business analytics software that helps people see and understand data. “Self-service analysis is democratizing data across the organization. People can connect to data in a few clicks, then visualize and create interactive dashboards with a few more.”
With today’s tools and technologies, many people can become analytics experts. Still, the importance of training in big data and the proper analytics skills cannot be overlooked.
Valuable information hides within the growing and varying amount of big data collected every day. Organizations that have the ability to effectively and efficiently decipher big data and translate it into actionable business intelligence are at an advantage.
As the role of data becomes increasingly important to every organization, a mix of savvy tools and skilled analysts is a smart recipe for success.
Feb2014, Software Magazine