What Is a Data Scientist?
Hilary Mason, Chief Data Scientist at bitly
Although some bristle at the conjoining of two highly charged words, the term “data science” appears to be sticking. It’s sticking because it represents something real going on in the world of business, which is increasingly data-driven, no matter what that business does. Data has been useful to business for some time, and statistics and modeling techniques have always been useful for wringing value out of data. What’s new is the greater volume of data, and a greater diversity of data than there ever was before. Additionally, there is now a wide range of tools available to study data, from Hadoop to SAP HANA to Tableau. “Data science” recognizes that there is a significant opportunity to combine some business functions that had not been combined in the past, and the people who will do this are Data Scientists.
We continue our series, “What Is a Data Scientist?” by speaking with Hilary Mason, chief data scientist at bitly. (See the CITO Research problem statement “Growing your Own Data Scientists” for other articles in the series.)
Mason is charged with finding value in the data collected by bitly and building systems from it. bitly is best known for making short URLs from long ones, providing more convenient data transfer for users. But it is also a giant data collector, grabbing a large slice of the URLs being shared across the Internet, including email, Twitter, Facebook, Tumblr, Blogger, and WordPress.
“My team builds the mathematical models, the code, and some of the production systems to make products out of this information,” Mason says. One of these products is a real-time search engine that indexes every document shared through bitly. It builds a relevance model based on how frequently the document is shared and the number of social networks that drive traffic to the document, among other factors. Bitly can also make predictions on how many clicks the document will receive in the near future.
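To make the idea of a relevance model concrete, here is a minimal sketch in Python. It is not bitly's actual model, which the article does not describe in detail. The function name, the weights, and the sample documents are all hypothetical, and the only inputs are the two factors named above: how often a document is shared and how many social networks drive traffic to it.

```python
import math

def relevance_score(share_count, network_count, share_weight=1.0, network_weight=2.0):
    """Toy relevance score (hypothetical, not bitly's model).

    Log-scaling dampens very large share counts so that a document shared
    across several networks can outrank one shared heavily on a single network.
    """
    if share_count < 0 or network_count < 0:
        raise ValueError("counts must be non-negative")
    return (share_weight * math.log1p(share_count)
            + network_weight * math.log1p(network_count))

# Made-up sample data for illustration only.
documents = {
    "doc-a": {"shares": 1200, "networks": 1},
    "doc-b": {"shares": 300, "networks": 4},
    "doc-c": {"shares": 15, "networks": 2},
}

# Rank documents from most to least relevant under this toy score.
ranked = sorted(
    documents,
    key=lambda doc_id: relevance_score(
        documents[doc_id]["shares"], documents[doc_id]["networks"]
    ),
    reverse=True,
)
print(ranked)
```

With these assumed weights, "doc-b" ranks ahead of "doc-a" even though it has fewer shares, because it draws traffic from more networks. A production model would presumably learn such weights from data rather than fix them by hand.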
What Is Data Science?
For Mason, “data science” may not be a brand-new phenomenon, but it is one that is gaining business importance.
“I think of ‘data science’ as a flag that was planted at the intersection of several different disciplines that have not always existed in the same place,” Mason says. “Statistics, computer science, domain expertise, and what I usually call ‘hacking,’ though I don’t mean the ‘evil’ kind of hacking. I mean the ability to take all those statistics and computer science, mash them together and actually make something work.” The following diagram shows the intersection of skills that Mason believes are central to the role of a data scientist:
The power and availability of massive amounts of commodity computing hardware have made large-scale analytics possible, which has led to the development of frameworks such as Hadoop. This is one of the prime factors driving enterprise interest in data science. The capability has spurred interest in its possibilities, and the results gleaned from using it have in turn generated more interest in the capability, Mason says.
“I can get a hundred machines and I can sequence my own DNA for $150, or analyze the rise of Justin Bieber links across the Web over the last two years really quickly,” she says. “If we didn’t have that kind of commodity access to computer power and commodity access to analytics tools, we wouldn’t be able to do the things we’re able to do, and we certainly wouldn’t be able to do them at startups with small budgets.”
Mason divides data science into two halves. One half is analytics, which she simplifies as “counting things.” The other half is the invention of new techniques that can draw insights from data that were not possible before. “Data science is the combination of analytics and the development of new algorithms,” says Mason. “You may have to invent something, but it’s okay if you can answer a question just by counting. The key is making the effort to ask the questions.”
“The job of the data scientist is to ask the right questions,” Mason explains. “If I ask a question like ‘how many clicks did this link get?’ which is something we look at all the time, that’s not a data science question. It’s an analytics question. If I ask a question like, ‘based on the previous history of links on this publisher’s site, can I predict how many people from France will read this in the next three hours?,’ that’s more of a data science question.”
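The difference between the two kinds of question can be sketched in a few lines of Python. The click log, the link names, and both functions below are hypothetical. The prediction is a deliberately naive baseline (an average over earlier links), standing in for whatever model a real team would build.

```python
from collections import Counter
from statistics import mean

# Hypothetical click log: (link_id, country_code, whole hours since first share).
click_log = [
    ("old-1", "FR", 0), ("old-1", "FR", 1), ("old-1", "US", 1), ("old-1", "FR", 2),
    ("old-2", "FR", 0), ("old-2", "US", 0), ("old-2", "FR", 3),
    ("new-1", "FR", 0), ("new-1", "US", 0),
]

def count_clicks(log, link_id):
    """Analytics question: how many clicks did this link get?"""
    return sum(1 for link, _, _ in log if link == link_id)

def predict_clicks(log, history_links, country, start_hour, end_hour):
    """Prediction question, answered with a naive baseline.

    Averages, over the publisher's earlier links, the clicks from `country`
    that arrived in the window [start_hour, end_hour) after first share.
    """
    per_link = Counter(
        link
        for link, click_country, hour in log
        if link in history_links
        and click_country == country
        and start_hour <= hour < end_hour
    )
    # Counter returns 0 for links with no matching clicks.
    return mean(per_link[link] for link in history_links)

# Counting: a plain tally of what already happened.
print(count_clicks(click_log, "new-1"))

# Predicting: expected French clicks on a new link in hours 1 through 3,
# estimated from how the two earlier links behaved.
print(predict_clicks(click_log, ["old-1", "old-2"], "FR", 1, 4))
```

The first function only tallies past events. The second has to generalize from history to something that has not happened yet, which is where modeling choices, and the chance of being wrong, come in.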
It is far more important that organizations collect good data scientists than it is that they acquire any particular tool, Mason says. Her advice to organizations is to find the people first, then let them use whatever tools they deem useful.
Modern data science also bridges disciplines in a unique way, bringing together academia, the startup community, and, to a limited but growing degree, the corporate world, Mason says. The gaps between these communities’ takes on data science can be characterized by their degree of openness. Academia is known for its propensity to share information and conduct peer reviews. Startups are commercial in nature, but are just as likely to share findings with the open-source world as they are to consume open-source solutions. Corporations, unsurprisingly, tend to be the most locked-down and opaque about their use of tools and their findings.
“The results of data science are peer reviewed in the capitalist marketplace,” Mason says. Even when the results being communicated are mostly marketing, without a great deal of statistical rigor, Mason says it’s worthwhile to perk up one’s ears, as it’s a good indication that the company is beginning to take data seriously.
Bringing Data Science Down from “the Priesthood”
One of the great promises of the new crop of data-exploration software is that the inflection point of value in the data chain moves toward the right—out of heavy-duty processing systems that are expensive, complicated and must be maintained by IT, and into lightweight solutions almost anyone can use. Thus, the “priesthood,” ivory-tower academics or deep programmers, won’t need to be consulted about every data-related question going forward.
“We’ll see increasingly commodity-based analysis frameworks, where pretty much anyone can pick up some of the data and be able to use the tools in a visual way, without having to do math or write code to come to a useful conclusion,” Mason says.
But that won’t necessarily answer the bulk of an organization’s questions, she cautions. Anyone can buy a Fisher Price guitar and play real notes, but that doesn’t make them Jimi Hendrix. The “priesthood” will be able to focus on the most difficult, high-value tasks, while many more people will be doing simple data exploration on their own.
“The tools we are creating will restrict the kind of analysis you’re able to do immediately, but they will open a capability for analysis to a much wider population,” she says. “The amount of learning you need to do to be effective does decrease from a multi-year process to a multi-hour process.”
Creating a Data-Driven Culture
Cultivating an organizational and recruiting culture that will support data science is an essential task. That will involve finding people who are driven to solve problems and who can use a multitude of skills to build an infrastructure for handling large amounts of data, Mason says.
Closing the skills bottleneck around data science will be the combined job of easier-to-use tools, educators, and employers seeking out people with the right combination of skills, Mason says. She cites several top master’s programs in statistics, such as Carnegie Mellon’s, as leading the charge.
Mason also sees a lot of self-education: people who have two of the three essential components for conducting a productive inquiry teach themselves the third once they sense how close they are, and that there is a market for that combination of skills. Startups are ahead of many large enterprises on this front, because many large corporations have heavy technology legacies and multiple layers of “priesthood” to negotiate.
“Startups don’t have the luxury of hiring somebody with only one skill,” she says. “So you hire people who are smart, who have at least one skill, but who can figure out anything else you might throw at them.”
Mistakes to Avoid When Building a Data-Driven Organization
Organizations that are invested in building a data-driven culture do well to avoid two common pitfalls, Mason says.
Over-investment in one tool or skill set
Mason advises data-driven organizations not to make the mistake of “standardizing” on one tool, or hiring a phalanx of people skilled in one technology, such as Hadoop. And, just as importantly, data is useful only in the context of the problem that needs to be solved.
“It is very easy to find the things you are looking for in data, and not find the things you are not looking for, but perhaps you should have,” she says. This is why finding people who can ask the right questions, and supporting them, is more important than finding people who seem to have all the answers, or limiting oneself to the latest technology.
Data doesn’t speak for itself
Another core problem that holds back the data-driven culture is assuming that advanced algorithms will solve everything without skilled human contextual interpretation, or that data will speak for itself.
Mason cites the recommendation engine of Netflix as a limited technology that could be vastly improved. “Netflix recommendations are good, but not great,” she says. “Netflix only knows about the universe of things you have watched on Netflix. So if Netflix algorithms could know about everything you see in your life (all the media you’ve seen, all the books you’ve read, all the articles you read, the music you listen to), the recommendations would be much better. But the data that that algorithm has explored is just a tiny component of the whole problem, and I think that that’s true for most of the problems we try and solve with data, particularly as they relate to business. The machine might explore only one dimension, and so it’s really important to have a human contextualize it and understand what it really means.”
Additionally, data scientists are responsible for effectively communicating the things that they learn. That might be creating visualizations or telling the story of the question, the answer, and the context beautifully.