Spark and YARN (Transcript)
- Speaker: Jeff Markham, APAC CTO at Hortonworks
- Transcript: 王宇熙
This article is the transcript of the talk given by Jeff Markham, APAC CTO at Hortonworks, at the Beijing stop of China HADOOP Summit 2016. Thanks to 王宇熙 for preparing the transcript. Jeff's English is clear and well articulated, and he deliberately slowed his pace for the local audience. Readers will get more out of this article by following along with the audio recording of the talk, and it is also excellent real-world material for anyone who wants to practice English listening.
Click here to download the audio file and the slides.
What an incredible turnout. It is really hard to believe that two years ago I was in this room. It was the first time I met the conference organizers, right in this hotel, right in this room. I accepted the invitation to come to this conference and speak, and I thought I would be speaking in front of a handful of people, but there was a huge room of people just like this. So congratulations to the conference organizers for getting a turnout just like this. I wish I could speak Chinese, and I wish I could have my QR code put up on the screen just like everybody else, but I will hand my phone around later on so that you can scan my WeChat code. Yes, this is a great conference to be at. I really appreciate you taking the time to come here and talk about all the new things that are happening in Hadoop. Whenever I come here, I like to give an update on some of the new things that are going on in the Hortonworks Data Platform.

Of course I want to start out by saying that this year is the ten-year anniversary of Hadoop. It's hard to believe Hadoop has been around for ten years. Ten years ago, a group of engineers at Yahoo decided to do something different, and do something disruptive. The storage they were using at that time, the processing they were using at that time, just wasn't good enough for their use cases, just wasn't good enough for the requirements of their day-to-day jobs. So they did what a lot of people do in Silicon Valley: they created their own tools. That tool, for them, was Hadoop. And ten years ago, I don't think any of those engineers could foresee what that tool would turn out to be today, or the adoption it would have worldwide, not just in China and in the US. The worldwide adoption of Hadoop has been phenomenal.

About four years ago, those engineers left Yahoo to create their own company, and that company is Hortonworks, the company I work at. I have been there for over three years and serve Hortonworks as a technical director for Asia Pacific. We have a number of customers here in China, and we also have a number of customers throughout Asia Pacific. But I don't really want to talk about the business aspect. When I come to China, when I come to conferences, a lot of people want to know not only what the new technology is within the distribution today, but also what some of the new roadmap features of the technology are as we go forward.

Even though Hadoop has been around now for ten years, there has been significant change. When I first came here to speak at this conference more than two years ago, Hadoop 2 was the brand-new thing. Hadoop 2, YARN, the new features in HDFS 2: those were all brand new, and those were the radical changes. For the first six or seven years of Hadoop, almost everything related to Hadoop was related to MapReduce. MapReduce did its job very well, but it was also very limited, so a number of runtimes were created to be better than MapReduce. MapReduce earned a good reputation for batch processing, but as the requirements of your organizations and of other organizations worldwide increase, we don't see a requirement for batch processing in every use case. And one of the technologies that is new on the scene, well, it's not new anymore, but one of the technologies that has been very widely adopted recently, is Spark. So we want to talk about Spark, how it fits into the platform, how we view it at Hortonworks, and how we are integrating Spark into the Hortonworks Data Platform.
So first of all, what we want to clear up right away is that of course we love Spark. We view Spark as a great processing engine, and we view Spark as more than just a great processing engine: we view Spark as a means to give a lot more tools to developers and to data scientists than we would have otherwise. Oftentimes, especially in the very early days of Spark adoption, we would get a question a lot, like how does Spark compete with Hadoop, or how do you guys feel now that Spark is going to replace Hadoop. That's not the right question, simply because if you look at Hadoop in the very beginning, there were lots of things that were missing. One of the first things that was missing was operations: how easy is Hadoop to install, configure, manage, and monitor? All of that we put into the category of operations. What about security, what about full-featured, fine-grained security? Yes, Hadoop had some basic security from the very beginning; there were some means to apply some security in Hadoop, but that wasn't the full-featured security that enterprises require today. So how do we integrate that type of requirement into a distribution like HDP? All of these things were added onto distributions, along with data governance and along with other types of technologies like streaming and so forth. So how does this whole distribution fit together now that we put in something like Spark?

We love Spark at Hortonworks simply because of the reasons you see up here. It's made for data science; it makes data science use cases much easier to tackle for developers and for data scientists, and now developers can use Spark to deliver data science capabilities like never before. These kinds of tools use Spark to deliver data science not only for the developer but also for the power user, for the data analyst. Spark really enables us in a way that no other component in the distribution does. So it really opens up the world of data science and machine learning like no other component in the distribution, and it democratizes machine learning. What we mean by that is this: in the very beginning of Hadoop there was MapReduce, and in order to use MapReduce you had to be a Java developer. To make MapReduce easier to use, Pig was invented as a scripting language to generate MapReduce, and then Hive came along so that SQL developers could write in a language very similar to SQL, something that would resolve into a MapReduce job. We see Spark as something that can do that today for machine learning. Through its developer APIs it is much easier to begin building machine learning into your applications. Because today we need to get past batch processing, we need to get past the standard transactional web-application type of processing, and we need to add a more predictive element to the applications we deliver to our end users. Predictive analytics, some kind of recommendation capability: these are almost always the very first starting point for data science and machine learning, and for virtually every application, Spark makes that much easier. And it makes it easier through the developer APIs that it has. Spark addresses not only the Java developer community but also the Scala community and the Python community. We have multiple language entry points that we can give to developers to start developing Spark-based applications right away.
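A minimal Scala sketch of what this developer API looks like in practice, using the MLlib API of Spark 1.6 (the version discussed later in the talk). The application name, the HDFS path, and the binary-label data set are illustrative assumptions, not details from the talk:

```scala
// Minimal Spark 1.6 MLlib sketch: train and evaluate a binary classifier.
// The path and data layout are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object ChurnModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("churn-model"))

    // LIBSVM-formatted, labeled training data on HDFS (illustrative path).
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train/churn.libsvm")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Train a logistic regression model with the L-BFGS optimizer.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)

    // Score the held-out split and report simple accuracy.
    val accuracy = test
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    println(s"held-out accuracy = $accuracy")

    sc.stop()
  }
}
```

The same few lines could equally be written against the Python or Java APIs, which is the multiple-language point made above.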
And then there is what we always talk about at Hortonworks: our philosophy on how this whole distribution fits together. We'll talk about that in more detail in a couple of slides, but what we call YARN is the data operating system. The data operating system means that we can have a number of different applications run in a single distribution. That's the value of YARN, and that's how we address putting Spark into the distribution. And then of course the community. Spark has recently been noted as the most active open source project in the world, replacing Hadoop as the most active open source project in the world. And with that kind of community involvement, it is not only the big businesses; we see IBM making a big investment in Spark, and among the other large organizations, Microsoft is making a big investment in Spark. But it's not just that, it's the users. It's the everyday developers, system administrators, and everyday users of Spark that create that large community involvement.

So Spark in the Hortonworks Data Platform: where does it fit in? Well, as we mentioned on the last slide, we refer to YARN as the data operating system, the resource management layer. What is resource management? Resource management is of course the memory used by a specific job, or the CPU that might be used by a specific job. And what we are able to do by putting Spark on top of that is take advantage of all those capabilities of YARN. We will address those in a little bit more detail later on. But because we've deployed Spark on YARN, we are able to take advantage of Spark being deployable everywhere. We don't necessarily have to stand up Spark as a separate cluster; we run it on top of YARN. Therefore we can put Spark jobs, and Spark itself, wherever YARN can be, and YARN can be on premise, it can be on Windows, it can be on Linux, it can be in the cloud, it can be virtualized, it can be anywhere. So because of that ability to be on YARN, we get a number of different deployment options right away. And self-service Spark in the cloud: we have a technology at Hortonworks for that. We purchased a company called SequenceIQ that delivers a project called Cloudbreak. Cloudbreak (12:58) allows us to deploy clusters out to any cloud, Amazon, Google, any OpenStack-based cloud, even your own internal cloud. It allows you to stand up a Hadoop cluster in minutes, and of course that Hadoop cluster is going to have YARN, which will allow you to run Spark. It's a very simple way to get up and running right away.
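As a rough sketch of what running Spark on YARN looks like from the application side, here is a minimal Spark 1.6 driver in Scala that asks YARN, rather than a standalone Spark cluster, for its executors. The queue name, executor sizing, and HDFS paths are assumptions for illustration; in a real deployment these settings are usually passed to spark-submit rather than hard-coded:

```scala
// Minimal Spark 1.6 driver that runs on YARN instead of a standalone cluster.
// Queue name, executor sizing, and paths are assumptions for illustration.
import org.apache.spark.{SparkConf, SparkContext}

object YarnWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("yarn-wordcount")
      .setMaster("yarn-client")               // "yarn-cluster" is the other deploy mode in 1.6
      .set("spark.executor.instances", "4")   // each executor becomes a YARN container
      .set("spark.executor.memory", "2g")
      .set("spark.yarn.queue", "analytics")   // hypothetical CapacityScheduler queue

    val sc = new SparkContext(conf)
    sc.textFile("hdfs:///data/logs")          // illustrative HDFS input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///data/logs-wordcount")
    sc.stop()
  }
}
```

Because the executors run as ordinary YARN containers, the same scheduler that governs MapReduce, Tez, or any other YARN workload also governs this job, which is the resource-management point being made here.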
It's a little bit hard to see the graphic here, but what we want to point out is the middle blue space there. That middle space is YARN, and again we refer to YARN as the glue, the data operating system that holds all of this together. There are a few different categories that you see in the graphic. Over on the far left, we see something called governance and integration. And what is governance? One of the use cases that has emerged over the last couple of years is the requirement to track data: data lineage. Where does the data come from? Who is consuming that data? It goes hand in hand with security, of course, but a lineage use case goes even further than that. It also addresses how long we keep certain data sets that might be part of a data flow, a data processing flow. Data flows are very common: you have one step that maybe cleanses data, another step that maybe joins other data, and another step that presents it to the end user. How do you actually treat that intermediate data? How long do you keep it? Do you keep it for six months? Do you replicate it three times? Do you replicate it once? All of these questions can be addressed with the tools you have for governance and integration, and all of those tools are easily incorporated into the Hortonworks Data Platform because of our focus on YARN, because of our focus on delivering all these components on top of YARN. In the middle piece there, we see all the data access components: we have HBase, we have Tez, we have Storm, we have Spark. All of that is the data access part of the YARN platform. And then over to the right we have security and operations. Security of course is delivered by a project we call Ranger, an open source project that resulted from an acquisition we made early on of a company called XA Secure. Because we have YARN as a focus, we are able to deliver all the different components that we put into the Hortonworks Data Platform as something that runs on top of YARN.

So let's just take one of the examples there, and that's Spark; of course we are going to talk about Spark running on top of YARN. Again, with YARN as the architectural focus of the entire distribution, as the glue that holds all those components together, we are able to take that philosophy and apply it to any component that we include in the distribution. That's important because if you have another component that you want to include in the distribution, that component can oftentimes be its own separate cluster. Every major vendor, every major Hadoop vendor, delivers Spark as a separate piece, as a separate cluster, the same way many other vendors will put HBase as a separate cluster, or Storm or Kafka as a separate cluster. But what we do is have all of that running on the same cluster so that YARN can manage the resources of all your nodes effectively. One of the early limitations of Hadoop 1 was that even the most finely tuned clusters had about a 60% utilization rate, and that was with the most finely tuned clusters; the average was probably around 50%. 50% utilization. And when you talk about a cluster of 200 nodes, 500 nodes, 1000 nodes, that's a lot of resources, a lot of power, not being utilized. YARN allows you to utilize the entire resources of your cluster: the entire memory, the entire CPU, the entire disk. It allows us to have much more control over all the jobs running on top of that cluster. That's the big reason why we focus on YARN as the architectural focus of HDP, so that when we include a new component, we don't include it as a separate cluster.

What happens when we have a separate cluster? Has anybody here had the experience of running HBase as a separate cluster, or running Storm or Spark or something else as a separate cluster? What happens? We increase our operational complexity. It means the system administrator has to administer multiple different clusters. What happens with security? Do we have one security access point for all these different clusters? That simply increases complexity. If I can define one area, one spot where my users are going to come in, the Hadoop cluster, and I can also define one area inside the Hadoop cluster where I define my Hadoop security, that is much more secure, that is much more optimal, and that is far less prone to error. That is what we focus on when we deliver each component, each new component, into HDP. What about data lineage, data governance?
When you have separate clusters, you eventually have to copy or move data from one cluster to another so that you can do the processing each separate cluster is responsible for. Again, we don't want that. The concept of a data lake is that we keep data in one location as much as possible and then process it in multiple different ways. We can process it in batch, we can process it in more of a real-time way, we can process it more interactively with Spark. We can do some stream processing, some ingest, with Storm and Kafka. And later on today we have a colleague from Hortonworks talking about NiFi and how we incorporate that into our platform. All of these are distribution requirements that YARN allows us to incorporate quickly and easily, simply by being the resource manager and the data operating system.

So first of all, what is Spark? There might be some people in the crowd here who are not familiar with Spark. Spark is itself not just one thing. When we talk about things like Storm or Kafka or HBase, each of those is a single component that addresses, by and large, a single type of use case: streaming, or a key-value store, or SQL, something like that. But Spark itself is a collection of community projects. Spark has Spark Core, Spark has MLlib, Spark has SQL and streaming. There are different components to Spark. Some of them are more mature than others; some of them have more life in production than others. But nevertheless, Spark itself is a kind of community of different components. So what we do is make sure that as we introduce Spark, we introduce each one of these components while making sure it is tested at scale and making sure it integrates well with what we have done to put Spark on top of YARN. We want to make sure there are no issues when we introduce each of these new components from the Spark community. Everybody here who is familiar with Spark knows that some of these components are more active than others; some of them are brand new, some of them have been around for a long time. So we want to be careful, when we introduce each of these kinds of components, that we do it in a responsible way, responsible to customers, making sure that the capability, the security, and the operational enablement are all there. We don't want to simply add on each new Spark component as it comes along.

So again, just talking about how we integrate with HDP: the latest version of HDP is 2.4, and the latest version of Spark that we include is 1.6. We do have the centralized resource management capability such that we are able to run Spark on YARN and give you some capabilities that you might not have otherwise. The first capability is that running Spark on YARN is the only way to have Spark be Kerberized in the same Kerberized cluster as Hadoop. So running it on YARN gives you some advantages that you would not have if you ran it all by yourself in a separate cluster and then somehow had connectivity between it and a Hadoop cluster. Consistent operations again: when I talk about Ambari. Ambari (23:23) was one of the first open source projects that we had at Hortonworks outside Hadoop. Of course, Hadoop is the open source project that started it all, but if you have any experience back in Hadoop 1 of trying to install it, trying to download the tarball, trying to edit XML files, trying to edit property files and hoping you got it right, that was the way to install Hadoop perhaps three or four years ago, before Ambari.
Now that we have Ambari, we not only have a much simpler process to install Hadoop, but also to install all the components in the distribution. The architecture of Ambari has, at a very high level, a similar philosophy to the architecture of YARN. YARN wants to have all the processing in a single cluster, and Ambari wants to have all the installation, management, and configuration in a single pane of glass, all within a single UI. So Ambari gives us the ability to simply add Spark during the installation process. Ambari gives us the ability to adjust configuration parameters for Spark and for Spark jobs, all through a nice UI.

I'm running out of time. I told myself I was going to speak slowly, but perhaps I'm speaking too slowly. I'm from California, so I speak slowly naturally, right? I will try to speed it up. MapReduce versus Spark. I think we have a good grasp of what MapReduce versus Spark is. The core concept is the difference between disk and memory: Spark does its processing in memory. We now have the capacity and capability of taking advantage of both with the most recent versions of Spark. In the beginning, the criticism of Spark was that everything was in memory, so if the data being processed exceeded memory capacity, we might run into problems there. But these days, with the newer versions of Spark, we are able to address that issue. One of the key use cases that we can have now with Spark is the design pattern we see in a lot of cloud-based environments, where we can scale the compute nodes independently of the storage nodes. Before, when we used to scale out a Hadoop cluster, we would scale out one node that had both the DataNode process and a NodeManager process, so that we added storage and compute capacity on one additional node. Now we can scale out the compute nodes independently of the storage nodes. That gives us the ability to take advantage of the Spark executor logic such that we don't have to move data around. We don't need to move data to a new node; say we stand up a tenth data node, we don't need to move data there, because we are simply not going to stand up a new data node. What we do now is just stand up a compute node, and we pull the data into the Spark executors on that node from the existing data nodes. How does it work? If you want to know in detail how it works, come see me afterwards.

I just want to touch a little bit on a couple of use cases related to Spark usage at Hortonworks. One is an insurance company. One challenge is that they were overwhelmed by their ingest rates. Again, compare that with what existed in Hadoop 1, which was taking huge files and processing over them; now we have a lot of streaming use cases. Those use cases are addressed by Storm and Kafka, in combination, or by bringing in Spark Streaming, depending on the ingest rate. The challenge there is also that the team has a lot of expertise in just R, not a lot in Java, not a lot in Scala, and the key features they require just aren't there in R. So the solution was to introduce Spark, and once we introduced Spark, the developers were able to use not just Java but multiple different languages, so that they were able to process claims in a way that was comfortable for them for their initial use case. The initial use case was to detect: are my claims being overpaid, are they underpaid, what's going on? So the impact is that the insurance company can be sure that all their claims are processed with timely payment and accurate payment, and of course that impacts the bottom line.
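A small sketch of how a claims-style scan like the one above might look in Spark 1.6, and of the memory-versus-disk point from earlier in this section: the persistence level tells Spark to keep partitions in memory and spill to local disk only when they do not fit, which is how the early "everything must fit in memory" criticism is addressed. The file path, the filter condition, and the application name are invented for the example:

```scala
// Minimal Spark 1.6 sketch of a claims scan that reuses one data set in memory,
// spilling to local disk when partitions do not fit. Path and filter are invented.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ClaimsScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("claims-scan"))

    val paidClaims = sc.textFile("hdfs:///data/claims")
      .filter(_.contains("PAID"))
      .persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if needed

    // Both actions below reuse the persisted partitions instead of rereading HDFS,
    // which is where much of the speed-up over disk-based MapReduce jobs comes from.
    println(s"paid claims: ${paidClaims.count()}")
    paidClaims.take(5).foreach(println)

    paidClaims.unpersist()
    sc.stop()
  }
}
```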
Moving on to Webtrends: Webtrends is a marketing analytics firm, and the storage and the processing they had were not economical at scale. It is a common story for anyone trying to adopt distributed processing and storage like Hadoop: the existing storage and the existing processing just aren't going to work at scale. They also didn't want to have duplicate clusters. What they didn't want was the situation where they started going down the path of a different cluster for Spark, a different cluster for HBase, a different cluster for this and this and this. They wanted to have it all in one location so that their data could be in one location. So the solution in their case was to use Spark Streaming and machine learning: have the small data sizes, arriving in large volume, ingested via Spark Streaming and processed with machine learning. The result was that they saved a significant amount, anywhere from 25% to 50%, on their hardware costs. They were able to avoid duplicate cloud deployments and unify onto the smallest number of clusters possible. And they were able to increase what they could process: they went to 10 billion events daily, at about 20 milliseconds per event. There are a lot of different use cases like this with Spark. The bottom line with Spark is this: Spark is going to give you a processing engine and a machine-learning library that you would not normally have outside of a Hadoop distribution. What we are trying to do, and what we have done successfully so far at Hortonworks, is to integrate that capability in such a way that you still have the same operations, you still have the same security, you have the data governance, and you have the data lake concept, i.e. all of your data in one location.

We will skip to the end of the slides because I think we are out of time here. Anyway, we have here a little use case from Uber, but I don't think anybody here uses Uber. Is it still illegal? I don't know. But the thing for Uber was that they just wanted to optimize where the driver would be so that the driver could maximize revenue, using Spark and geospatial libraries. Those libraries are commonly used today, and they are something Uber was able to use to optimize driver location as it relates to revenue.

So, the Hortonworks Data Platform again: what is our philosophy on Spark? Our philosophy on Spark is that it complements, not replaces, the components that are in the distribution. Our philosophy is that we want to deliver it to you so that it is easy to manage, monitor, and secure, and so that it has data governance. And we want to make it easy for you to process the data that you keep in one location, not to move that data all around. I'll just wrap up by saying thank you for your time and thank you for your patience. I understand that maybe at some point during the presentation I spoke too fast; I'm very sorry. If you do have any questions concerning Spark, Hortonworks, or Hadoop and where it is going, I'll be here all day. The session this afternoon goes a little bit deeper into YARN and how to develop applications on YARN; we can touch on more of the detail around how YARN makes all of this possible. So if you are interested in that, I would love to have you at my session. And again, I want to say thank you for your invitation and thank you for sitting there very kindly for an English speaker. I really hope to see each and every one of you come to me and ask questions later on.
Thank you very much. Thank you for your time.
- The copyright of this article is jointly held by the speaker and the China HADOOP big data news network. When reposting, please retain the original source link and official account information; violations will be pursued.
- China HADOOP Summit 2016 Shanghai will be held in Shanghai on July 29 and 30, and we are now calling for talks from the industry. If you are interested, please contact us.
Topics include, but are not limited to, the following:
- Big data ecosystem: big data security; storage; YARN; HDFS namespaces; etc.
- Big data and Industry 4.0: electric power, power grids, energy, steelmaking, etc.
- Big data and e-commerce: applications and architectures from China's mainstream internet e-commerce companies
- Big data in finance: banking, securities, personal credit scoring, corporate credit scoring, quantitative investment and big data
- Smart cities and big data: transportation, healthcare, public security, taxation and commerce administration, tourism, etc.
- Compute engines and real-time computing: Spark, Tez, Impala, Flink, Google Mesa, Storm, Kafka, etc.
- Big data as a service: Azure, AWS, Alibaba Cloud, Docker/containers, Mesos, etc.
- NewSQL/NoSQL: HBase/Druid; MongoDB/CouchDB; VoltDB; SequoiaDB; HANA; etc.
- Data mining and graph computing: the R language, GraphLab, GraphX, OrientDB, etc.
- Data warehousing and visualization: eBay Kylin, LinkedIn Cubert, QlikView, Tableau, etc.
- Big data startups and investment: startup teams and stories from the big data space