In Palo Alto (CA), we meet Raymie Stata, Founder & CEO of Altiscale. Raymie talks about how he came up with the idea and founded Altiscale, how the current business model works, and offers some advice for young entrepreneurs.


Martin: Hi. Today we are in Palo Alto in the Altiscale office. Hi, Raymie. Who are you and what do you do?

Raymie: Hi. Yes, I’m Raymie Stata, the Founder and Chief Executive Officer of Altiscale. We do big data in the cloud using both Hadoop and Spark.

Martin: What did you do before you started this company? And how did you come up with the idea for Altiscale?

Raymie: Well, in many ways, the roots of Altiscale go all the way back to AltaVista. If you remember that search engine from the 1990’s. I was lucky enough to have a chance to work on that. I actually got out of college thinking that I would work in a research lab, and that research lab was at Digital Equipment Corporation where they created the AltaVista web search engine. So while I was there, I think pretty much everybody in that laboratory got pulled into the world of web search, and I was no exception. So I spent quite a few years working on AltaVista, getting experience in web search.

And then after the dot-com crash in the early 2000s, I actually started my first company, which was a desktop search company, and we ran that for about four years and started to get some traction. Around 2004, that company was acquired by Yahoo. And about that time, actually slightly earlier, Yahoo had acquired a bunch of companies. They acquired Overture, and Overture, ironically enough, had just acquired AltaVista. So Yahoo had brought in all the old tech that I used to work with through Overture. They also bought the web search assets of FAST, which is a European search engine company, and a few other assets. So they were building up search technology to go compete head to head with Google. And a little after those major acquisitions, they decided they wanted to have some desktop search as well, so they acquired my company.

So that was 2004, and in the history of Hadoop, that was a significant year. That was the year that the MapReduce paper was published from Google. And for people in the search world, that was a really fascinating paper. I think for us, the basic technique that they were using, which was a cluster-wide search, was something that pretty much everybody in the search world had developed in a way. But the MapReduce paper took that basic technique and put it in a very elegant package where the mechanism of the MapReduce paradigm was very clearly separated from the application code. And so one could very easily develop and improve the framework code, the heart of the distributed system part, if you will, independent of the application code. And that very clean separation, I think, was missing in most of the search world.
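The separation Raymie describes, between framework code and application code, can be illustrated with a minimal single-process sketch. This is not Google's or Hadoop's implementation: the names `run_mapreduce`, `word_count_mapper`, and `word_count_reducer` are hypothetical, and a real framework would distribute the work across a cluster and handle failures.

```python
from collections import defaultdict

# Framework side: generic plumbing that knows nothing about the application.
# A real system would run this across many machines; here it is one process.
def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        # Map phase: the application emits (key, value) pairs per record.
        for key, value in mapper(record):
            # Shuffle phase: group all values by key.
            groups[key].append(value)
    # Reduce phase: the application combines the values for each key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Application side: a word count, written without touching the framework.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1

def word_count_reducer(word, counts):
    return sum(counts)

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(docs, word_count_mapper, word_count_reducer))
```

Because the mapper and reducer are passed in as plain functions, the framework function could be reworked (for example, to run in parallel) without changing the word count code at all, which is the point being made about the paper.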

So it got our attention for sure at Yahoo and I think in other search companies at the time. But what was fascinating about that paper as well is it got the attention of, I think, a broader universe. At that time, Google was certainly in ascendance. People were beginning to think, “There’s some magic over there. They must know something we don’t know.” And so that paper, I think, was seen as an insider’s look into the magic of Google. And so there was a lot of very broad fascination with that paper. Interestingly, the world of search was a little bit more competitive at that time, so there were other search engines. I think other companies were interested in doing better than Google, so they all embarked on these very ambitious internal projects to do something better than Google.

But there were a few of us at Yahoo who said, “Instead of trying to out-engineer Google, maybe what we can do is demystify this a little bit by basically taking that concept and implementing it in open source.” And I had known Doug Cutting prior to getting to Yahoo. We did some work with the Internet Archive, and I got to know his work with the Nutch Foundation at the time. He had his own foundation. And so I was able to get Doug to come work at Yahoo, first as a contractor and then ultimately as an employee, and we decided, “Yes, we’re going to do the open source MapReduce.” And that complemented well the work that was already in place in the Nutch search engine, where they had also implemented their own version of this thing called the Google File System.

So that was how Yahoo got involved with Hadoop. I think Yahoo over the years made massive investments in Hadoop that really made it the enterprise strength software that it is today. So it was actually just at the very beginning of my tenure at Yahoo that I was getting that project under way. Through the years at Yahoo, I had a number of positions. So I started in the algorithmic web search team where I was the chief architect of that team. And then over the years, I was the chief architect of the search and advertising team, and then the chief architect of all of Yahoo, and ultimately became the CTO of Yahoo. And as I got broader and broader responsibilities at Yahoo, I helped drag Hadoop along with me to be used in broader and broader use cases across Yahoo. And as a result, by the time I left Yahoo, we had a very large central installation for Hadoop. We had 40,000 nodes of Hadoop, with over a thousand users, being used for a very wide variety of use cases.

When I started thinking about leaving Yahoo, since I had a lot of experience with Hadoop inside of Yahoo, I took a look to see, “Hey, what’s it like to use Hadoop outside of a company like Yahoo?” And not just Yahoo. I knew how folks were using it at Facebook. Twitter wasn’t so big at the time, but eBay. And it was the same model, a large central cluster run by a very competent professional team. Outside of those larger Internet companies, what I saw was something very different, which was these small clusters, 20 nodes. Very often they were being supported, if at all, by an operations team. It was like a discretionary effort. They weren’t really full time. Usually there were other production activities that had more of a priority. So the users of these clusters were often stuck dealing with problems on their own.

And one of the things that I didn’t fully appreciate until I left Yahoo and started looking outside Yahoo was just how much internal support our end users had for using Hadoop. So if something went wrong and they were not sure why, Hadoop likes to throw stack traces, for example, every time something goes wrong. At Yahoo, people would just turn and say, “Hey, have you seen this before?” and there were a lot of people inside the company that had used it and had the experience for answering those questions. And ultimately, the central team that was operating Hadoop had seen pretty much everything that could ever go wrong and so could always answer a question for you.

Again, on these small 20-node self-served clusters, it’s a very different experience and people were spending days searching the web, desperately asking questions in these email forums to get some help in figuring out what’s going on. So in contrasting that experience of those larger Internet companies, where they had those professionally run large-scale Hadoop clusters, to what I was seeing in the industry, it became pretty clear that there was an opportunity to create a company that can offer Hadoop as a service the way you would experience it at Facebook or Twitter, eBay or Yahoo. And that’s what we do at Altiscale.

Martin: So Raymie, this means that after you left Yahoo, you tried to validate your assumption that there was a business need, because you had seen smaller companies that were not very focused on managing the cluster and so on. What was the next step then? So did you build some kind of MVP or did you just raise some money? What did you do?

Raymie: By that point in time, I was a second time entrepreneur. There were investors who were interested in seeing what I was going to do next, so I had the benefit of having folks who were interested in backing me. So, I was able to raise a seed round fairly quickly. And we used that money to start hiring because at the end of the day, getting the right talent is really the hardest part of the job and often causes the most delay, so that was very useful. And while we were doing that, even before I left Yahoo… In fact, my last title at Yahoo was entrepreneur in residence. I think I was Yahoo’s first and last EIR. And for various reasons, they gave me an opportunity to spend a few months starting to think about what my next steps were going to be. So I had the benefit of a couple of months of research upfront. And so I was able to validate that the issues that I saw indeed were pretty widespread.

Martin: So once the investor wrote you a check, how did you go about acquiring your first employees? So how did you actually find them? Did you tap into a network? Did you post a job? Did you ask friends? What did you do?

Raymie: By and large, it’s through networking, through people I knew directly. But there tends to be a transitive nature to that. So somebody I know introduced me to Ricardo Jenez, who’s now our VP of Engineering. And even though we both went to MIT, we didn’t quite overlap in time and we didn’t know each other at school, but we got to know each other through a mutual friend. So sometimes it’s not quite that direct. I didn’t hire the friend but I hired the friend’s friend. So there are levels of indirection there. And then once I hired Ricardo, our networks were slightly different. He spent some time at Google, for example, and was able to tap into the Google network a little bit more effectively than I could. So there is that spread of the corporate network as you bring people on. Your reach gets broader and broader.

Martin: So Raymie, if you look back in time just by memory, the first 6 to 12 months, what was it really like? What type of obstacles did you really perceive in the day to day business where you say, “I think I’m not sure that this is the right business idea,” or “There’s some kind of issue that I need to solve but actually I don’t know,” or something like that?

Raymie: Yes. Let me think back. Again, getting that core initial team is always a big focus and at those stages where it’s literally three people and nothing, it’s a pretty big leap for people to sign up for that. And so that certainly took a fair bit of time. Altiscale is a bit unique as a technology company. At a lot of technology companies, you’ve got some idea for some core IP and you hire a bunch of developers and they all go away, and you work on that IP for six months or so, and you really develop this relatively small compact piece of software that is really great in some dimension, and then you go out and you start to use it. And then you spend the next seven years putting layers of gunk around that beautiful center.

At Altiscale, we inherited the Hadoop ecosystem, millions of lines of code. And so when you’re at that first three going on six technical people and you’ve got to own millions of lines of code, that’s a unique challenge for a startup. And so another aspect of our early struggles was to figure out how are we going to make that work and where are we going to start, because it felt like the problem was so big. And there were some false starts. One thing we ultimately decided to do was to restrict how much of the Hadoop ecosystem we would start with, with the idea of growing over time. And then really focusing in on deployment automation, and not on multi-tenancy, as the first technical problem we were going to tackle was another way in which we started to reduce the problem and the scope to something that a small team could work on.


Martin: Raymie, let’s talk about the business model. So you briefly touched on your target customers basically, which is more, as I understood, the smaller companies, right?

Raymie: No. We don’t think of them as smaller companies per se. We really think of companies more in terms of how much data they have versus how many people they have. So ultimately when people say a big company versus a small company, you can measure in revenue, but often revenue is pretty directly correlated to the number of people. That’s the more traditional metric of “Hey, how big is a company?” But for us as a big data service provider, what matters a lot more is how much data you have versus how many people you have. It turns out that some of the smaller companies have much more data than some of the largest companies.

So from that perspective, you can look at the amount of data, if you will, that an organization is ready and willing to put into a data processing system like Hadoop. You can maybe call it small, medium and large, where small is, say, 10 terabytes or less. Medium, maybe we can say it starts at 15 or 20 terabytes, is when you get into many tens of terabytes to hundreds of terabytes, maybe to a petabyte or so, but it’s not multiple petabytes. And large is multiple petabytes. And for us, we tend to target the ones in the middle, the folks that have more typically a hundred to a few hundred terabytes to a petabyte or two. It’s a good range. As you get smaller than that, I think the problems of Hadoop become a little bit more self-manageable. As you get bigger than that, additional complexities start to come in that we’ll get to tackle over time, but again as a startup, you need to be focused. So that’s how we measure the size of our customers. It turns out that it’s a good swath of companies and it covers some of the largest companies in the world and it covers some of the smallest ones. It’s all over the map in terms of how big the company itself is.

Martin: Raymie, can you briefly describe what is the value proposition that you are delivering to those target customers?

Raymie: Sure. Well, at those scales, when you start to get into a hundred plus terabytes of data, maintaining your big data infrastructure becomes a real challenge. In our case, that infrastructure is anchored in the Hadoop Distributed File System. The Hive Metastore actually is an important component, and then Spark and YARN sit on top of that. So it’s the core platform. As you get to that hundreds to many hundreds of terabytes of data, keeping that operating well and, in particular, keeping jobs running fast and running reliably, completing successfully, turns out to become harder and harder as you scale up. And I think it’s just the nature of distributed systems. There’s this kind of exponential issue where as you add more and more pieces, it gets exponentially more difficult to keep it all running reliably. And so the value proposition of Altiscale is to keep your big data infrastructure running well as you grow, as you scale, allowing our customers to focus on what they’re doing with Hadoop and not get wrapped up in running it day to day or worrying about how they’re going to grow it over time, which itself is a significant issue.

Martin: But still the customer is providing the jobs, or are you supporting and writing the jobs? Or is it only that you are trying to have some kind of maintenance or monitoring on whether the jobs are running correctly or not?

Raymie: That’s a good question. I think that what we do not typically do is actually write the jobs themselves. Either the customers do that or there are lots of software services companies out there who will help you write big data applications. So we’re really focused on keeping those jobs running well, which includes, as you point out, monitoring the jobs and helping to deal with problems. However, a lot of times you’ve written a job but it’s not quite right, it’s not scaling well, it has performance problems, and so I think one unique service that we do provide is that where there are those particularly problematic jobs, we’ve got enough experience where we can say, “Hey, it looks like you have a skew problem or a memory problem or this or that.” We can give advice to help people quickly get that job working. And again, going back to those founding stories of Altiscale where people were spending weeks at a time wrestling with Hadoop over what turned out to be a relatively trivial issue, we could really save customers a lot of time that way. And if what they had to do is go out and find some external consultant to come in to look at that, it would just be ridiculously expensive. But the fact is we’re sitting there and we’re monitoring the jobs. There are certain telltale signs. There are certain problems, so we can very quickly and easily say, “Hey, here’s a problem that you might want to look at.”
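As an illustration of one such telltale sign, a skew problem often shows up as one task receiving far more input than its peers. The sketch below is not Altiscale's monitoring code: the function name `find_skewed_tasks`, the input format (task id mapped to input record count), and the default threshold of 5x the median are all assumptions chosen for the example.

```python
from statistics import median

def find_skewed_tasks(task_records, ratio_threshold=5.0):
    """Return ids of tasks whose input size is far above the median task.

    task_records maps a (hypothetical) task id to its input record count.
    ratio_threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    if not task_records:
        return []
    typical = median(task_records.values())
    if typical == 0:
        # Most tasks got no input at all, so any task with input stands out.
        return sorted(t for t, n in task_records.items() if n > 0)
    return sorted(
        t for t, n in task_records.items() if n / typical > ratio_threshold
    )

if __name__ == "__main__":
    counts = {"reduce-0": 1000, "reduce-1": 1100, "reduce-2": 950, "reduce-3": 48000}
    print(find_skewed_tasks(counts))
```

Comparing against the median rather than the mean keeps a single oversized task from masking itself by pulling the average up.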

Martin: Raymie, are those problems identified by humans or are you using something like machine learning and then automatic recommendation engines on what to do if that problem occurs?

Raymie: Yes. Well, that’s a great question because in many ways it incorporates, I think, a standard assumption that people make. You either use humans or you use algorithms.

Martin: Or both?

Raymie: Yes, but I think the “or both” is not an insignificant point. I think that having a blend of people and machines is really the secret of doing a lot of things well. Part of that is indeed how we help customers. There are some automatic alerts that look for certain patterns, but there are also more manual things, and there’s a whole spectrum there. But even if you look at how we operate the clusters themselves, going back to “Hey, what were some of the challenges upfront?” I think we decided not to start way over on the side of too much automation but rather to start at a more moderate point, really have the automation and the human operators work hand in hand, and based on actual real world experience, use that to drive the technology and also use that to say, “Hey, what are people best at? And let’s use what they’re good at,” whereas “Where can we use machines to really ultimately just pull labor out of the activity?” So getting a good man-machine symbiosis, as we like to call it, is really core to our approach to a lot of problems here at Altiscale.

Martin: Raymie, imagine you are going to a potential client and ask him whether he would like to buy an Altiscale product. How do you sell this kind of product over competitor products like other big data platform providers?

Raymie: Sure. Well, I think it depends on who the customer is and what their existing experiences with Hadoop are. In the early days, all of our customers were actually what I would call Hadoop veterans. And what I mean by that is they’re not only people who have used Hadoop before but as an organization, they actually had Hadoop in production fairly successfully and yet they shifted over to us, which is a little bit counterintuitive, but there’s a reason for that. The reason is that over time Hadoop doesn’t get easier. It actually gets harder because you do more and you do more and the technology, it’s hard to believe but it actually changes faster and faster over time. The rate of improvement in the Hadoop ecosystem is just stunning. But if you’re trying to operate a Hadoop cluster, keeping pace with all that change is very challenging.

So if a customer prospect is already fairly experienced with Hadoop, we talk the same language. It becomes a lot easier. You can go in and say, “Hey, are you having this problem? Are you having that problem? Is it a pain? When is your next upgrade?” A lot of times, actually it’s the upgrade thing which tips them over because they’ll be a year out of date and they’ll say, “Hey, the latest Spark has all that stuff. Can we use it? Oh, no. My internal users are very upset.” And we have a common language that we can use and common experiences, and the dialogue actually is a little bit easier.

A variation on that theme is a lot of folks have set up a Hadoop cluster initially for the purpose of supporting some kind of what you might call production pipeline, some kind of ETL process where data comes in, it gets processed, it goes some place. That cluster was originally set up to just do this very mechanical thing, and that software itself is not evolving very quickly, so the people in charge of the cluster view it as a maintenance problem. But then what happens is some data scientists start to show up because the data that’s running through those pipelines is very interesting to them. And before you know it, you’ve got a bunch of data scientists who are trying to use this cluster and it’s not working because it wasn’t set up for their active use, it doesn’t have the tools that they want to use, and they find themselves frustrated that they don’t have the kind of data science environment that they really like. So there too we can relate to the actual grounded experiences that they have. A lot of times we say, “Hey, why don’t you get a separate data science environment, do that at Altiscale, and leave the production part?” That’s fine to have too, and that’s very successful for us.

I think it has been more challenging with organizations that are new to Hadoop. They say, “Hey, I think we ought to do Hadoop.” And in that context, they don’t know what they don’t know. They don’t know how complicated Hadoop is going to be. They think of it more as a traditional software application where “Yes, in the first year or two, it’s a big pain and it’s going to cost us a lot of money. But once you get over the hump, it becomes easier. It goes into maintenance mode and we can be done with it.” And educating those customers that it doesn’t look like that, that it just actually gets worse and worse, becomes a challenge for us because they don’t have those experiences. They don’t have the vocabulary. Fortunately, since we had those earlier customers, we can use them to provide some degree of testimony, and that helps. Over time, we’ve been working on other ways to help educate people who are new to Hadoop that they should really think twice before they get too deep into running it themselves.

Martin: So thinking about the Hadoop ecosystem, what are from your perspective the biggest misconceptions or mistakes that end users or companies are making when they are interacting with Hadoop? You briefly touched on one.

Raymie: Yes. Obviously for us, we think mistake number one is running it yourself. Let the experts do that for you. It saves you a lot of time and effort. Putting that one aside, I mentioned before that the rate of innovation in and around Hadoop is quite high, and I do think that another mistake people make is that they think they need to have the very, very latest, and a lot of times the latest new project that gets announced is really not ready to be used in a production way. So I think that’s actually another fairly common mistake.

I think another mistake that people make is they don’t understand very well what I would call the performance characteristics of various components. They understand the functional characteristics. When I put this query in, I’m supposed to get those results out. But they don’t understand the performance characteristics depending on how much data is in the table, for example. Is there a skew in the data? There are a lot of factors that will determine what kind of performance you’re going to get. And I think there’s this unrealistic expectation that it’s just going to be fast for everything, and they don’t engineer their applications around a realistic expectation for what can be done. I think HBase actually is a particularly challenging technology in that regard. It’s great in many dimensions but I think people tend to think that it can do anything, and it can’t, and then they get themselves in trouble. So understanding what the performance characteristics are, and the failure characteristics to some degree, and making sure that you engineer your application around those instead of around unrealistic expectations is another lesson for first time Hadoop users.


Martin: So Raymie, over the last two companies that you started, what have been the major learnings that you got and that you can share with other people interested in becoming an entrepreneur?

Raymie: Well, I think one thing that you’ll hear over and over again and in some sense you can never do enough of it is that the more you can talk to and engage potential users of your product, the better off you are. And there’s no point of diminishing returns there. And so I would say that that would go to the top of the list because I think people always think they’ve done enough of that, and I’m no exception to that, and yet if you push yourself a little bit harder, you find out, “Wow, this is even better.” So I would say never be satisfied with the amount of customer input you get.

On a related note, I’ve been involved with startups in various capacities going back for many years actually, and one of the things that I’ve observed is that you can start to actually engage a customer in some type of activity, even as a paid customer, but if you put that aside, just “Hey, give this a try,” way, way earlier than you think is at all realistic.

I think there’s a tendency to feel like, “Oh, you have to get to a certain point,” especially if you’re more of an engineering-driven company. You have to get the thing to a certain point, and if you engage too early, it’s going to be wasteful to the customer or to you. And I think that what time has shown, especially the last 10 to 15 years, is that in fact you can engage with folks by actually giving them software way earlier than the typical engineer thinks is possible and benefit from doing that. So I think it’s not only talking to a large number of customers early on. Here you have to be a little bit mindful, you can’t do too many of these, but engaging with one or two almost from the very beginning is really, really critical as well.

Martin: When did you start engaging with customers?

Raymie: When did we start engaging customers? Let’s see. I’d have to go back. I don’t know what the timing was. I do think it was probably six to nine months into starting. It’s pretty early but maybe it could have been even faster. Our first customer was a nice little tiny startup itself. It was actually a hedge fund. And it was interesting because they had not that much data. It was less than a terabyte. But they were expecting that it was a sample of a much larger data set they were going to get, which would have measured in a few tens of terabytes. Still not huge. And they were expecting that to come any minute now. So they were a little bit in a panic. As a hedge fund, it was the perfect customer for us because they were not at all interested in running Hadoop. Any smart people they had, they wanted figuring out trades, not figuring out Hadoop, which was actually not the case in many other companies. In some sense smart people get bored in a lot of companies, and so you throw them on Hadoop to keep them challenged. And so that can sometimes be a challenge for us because we’re saying, “Hey, I know that’s interesting, but let us take that away from you. Find something else to do.” At hedge funds, that’s not an issue, so it was a great first customer for us.

Another thing that was great about it is that the alleged tens of terabytes of data that was going to show up any minute now took a couple of months to show up, so we had time to get seasoned and to get ready for that larger data set.

Martin: What other type of advice can you provide, Raymie?

Raymie: As I indicated, on the one hand, I was lucky to have a little bit of a track record and therefore the ability to raise money upfront, but I think that’s a mixed blessing. As soon as you raise money, there’s a clock that starts to tick and people want to say, “Okay, how many customers do you have?” And I think a more modern approach, quite honestly, is for two to five people to really take a year, to somehow figure out how to make that work for themselves personally from a financial perspective, but to really take a longer period of time to find that idea and to do the kinds of things that I was talking about, which is to talk to a number of people and to maybe even engage with a couple. So I think all things being equal, in this regard I’d suggest doing something a little different from what I did, which is to be a little bit more patient in the beginning phases.

One thing that I am happy that I did, it might seem a little bit mechanical, but in both cases, I took time to educate myself on some of, you might call it, the formalities of running a company in terms of getting the IP paperwork in place, getting a documentation system in place, having your records in good shape. For example, if you go down the path I just recommended, which is you spend a year doing stuff and you find pay dirt, like here is this thing, and you find an investor that wants to go in, they’re going to put you through a due diligence process. And if you had all of your friends come in and hack on the weekends and there’s no clear ownership of the code and there are no records or anything like that, you could find yourself either spending a lot of time trying to put all that back into place or taking a lot of risk, where essentially you say, “Hey, I promise that if my friend who came to hack for that weekend comes and sues the company after we’ve become a billion-dollar company, I will own that and I will take all of that risk,” which is not a good thing to do. So I think while I would recommend not necessarily being in a big hurry to raise a lot of money in an early company without seasoning your idea, I would say getting the formalities of the company in place early on and being very clean with respect to your records is important.

Martin: Raymie, thank you so much for sharing the knowledge.

Raymie: Sure. My pleasure.

Martin: And if you are a company with lots of data but you don’t want to bother managing it yourself, check out Altiscale.

Raymie: Thank you.
