Welcome to the 12th episode of our podcast!

You can download the podcast to your computer or listen to it here on the blog. Click here to subscribe in iTunes.

CP12: Podcast with Girish Pancha from StreamSets about Performance Management of Data Flows


Martin: Hi, today we are here with a very interesting data start-up entrepreneur. Hi, Girish, who are you and what do you do?

Girish: Hi there, my name is Girish Pancha. I am the co-founder and CEO of StreamSets. Prior to starting StreamSets, I worked at Informatica, which is the independent leader in data integration. I actually spent quite a few years at Informatica. I was its first VP of Engineering, beginning in 1997. I went and did another start-up in the late 1990s and early 2000s. Then I rejoined Informatica in the early 2000s and spent a dozen years there, holding roles ranging from general management (GM) roles to, finally, Chief Product Officer for the last few years that I was there.

Martin: Cool, how did you come up with the business idea for StreamSets, because from my perception it is closely related to your experience at Informatica?

Girish: Yes, absolutely. I think when I left Informatica a couple of years ago in 2013 I didn’t envision that I would necessarily stay in the space. I happened to reconnect with a Senior Architect that had worked for me at Informatica, Arvind Prabhakar. He and I ultimately ended up co-founding this company.

What ended up happening was that Arvind and I were trading notes. Arvind was actually an early employee at Cloudera. He and I started trading notes on some of the challenges that Cloudera customers were having when it came to ingesting data into Hadoop. Initially, I kept believing that this problem had been solved or was being solved by the previous generation of technologies. The more Arvind and I talked, the more we came to the realization that the approaches of the past were really inconsistent or incongruent with the needs of the emerging use cases. That caused us both to get very excited about potentially solving this problem from the ground up, the right way, for the emerging big data world.


Martin: Girish, what is the traditional type of solution to this kind of problem, and how are you trying to solve this problem with StreamSets?

Girish: Sure. In the past, the focus was very much on having a schema-centric or a model-centric approach to solving data integration needs. Let me explain what I mean by that. The problem in the old days of data warehousing was that these data warehousing projects used to fail. They used to fail because people were manually developing SQL scripts, COBOL scripts, Java programs, etc. to move data from databases into data warehouses. So the solution that we came upon at Informatica was to allow the users to specify the schemas and then have an engine interpret them and generate the appropriate logic to move data from source to target.

That strength itself has become kind of the weakness for the new world, because in the new world the types of data we are trying to ingest into big data stores are no longer just transactional database and application data, but much more multi-structured data in the form of application logs and information from devices and sensors. These data sets are subject to what we describe as data drift, much more so than transactional data.
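As an illustration outside the conversation itself, the idea of data drift can be made concrete with a minimal Python sketch. It assumes that "drift" means fields appearing, disappearing, or changing type between records, which is one plausible reading of the term as used above. The function names `infer_schema` and `detect_drift` and the sample log records are hypothetical and are not part of any StreamSets product or API.

```python
from typing import Any


def infer_schema(record: dict[str, Any]) -> dict[str, str]:
    """Map each field name to the name of its Python type."""
    return {field: type(value).__name__ for field, value in record.items()}


def detect_drift(expected: dict[str, str], record: dict[str, Any]) -> dict[str, list[str]]:
    """Compare one record against an expected schema and report the differences."""
    actual = infer_schema(record)
    return {
        # Fields present in the record but unknown to the expected schema.
        "added": sorted(actual.keys() - expected.keys()),
        # Fields the expected schema has but the record lacks.
        "removed": sorted(expected.keys() - actual.keys()),
        # Fields present in both whose type has changed.
        "retyped": sorted(
            f for f in actual.keys() & expected.keys() if actual[f] != expected[f]
        ),
    }


# Hypothetical application-log records: the second one has drifted.
baseline = infer_schema({"host": "web-1", "status": 200, "latency_ms": 12.5})
drifted = {"host": "web-1", "status": "200", "region": "eu"}
report = detect_drift(baseline, drifted)
print(report)
```

A pipeline built around a fixed schema would typically break or silently drop data on the second record; a drift-aware one would at least surface the difference so that someone can decide how to handle it.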

Martin: When you started out, how long did it take you to develop the MVP so that you could get in touch with customers?

Girish: We spent about two months noodling on the need for this new technology before we started the company. Once we started the company we actually spent about six weeks just talking to customers and prospects, validating the idea. So we did not have any developers on board for the first couple of months of our existence as a company. By validating up front we were able to identify not just that the product idea we had would solve this particular problem, but also our V-1 scope, you know, the MVP for our Version 1. So then we started developing it and within, I would say, 10 months we got it from zero to our V-1 GA version.

Martin: From my experience, what I have seen is that lots of companies are building manual or homegrown solutions for their data pipelines, for example from the access logs to the Hadoop cluster using streaming technology like Kafka, and so on and so forth. Is this what you are trying to solve, by offering a modular way of getting from the source to the target?

Girish: Yes, that is actually exactly right. There are technologies such as Kafka and Flume, and a number of other lower-level open source ingest frameworks. All of these transport technologies still require people to code the data logic in some form, either manually or using other tools. So we interoperate with these technologies to provide better resilience, better operational characteristics, and better agility when it comes to dealing with change.

Martin: This would mean that you would need fewer data engineers for example?

Girish: Well, I think the way we think about it is that this will allow data engineers to focus more on innovation than troubleshooting problems in their pipelines on an ongoing basis.

Martin: Okay, cool. How did you acquire the first customer? Did you know them before and reach out to them? Did you get some introductions? How did it go?

Girish: The process I described, where we went and talked to a lot of customers, was one where we really didn’t try to use too many preexisting contacts, because we wanted to make sure we were not ending up with false positives. The way we approached it was, we actually characterized the types of businesses and the types of decision makers we would want to talk to. By having this very open-ended discussion around what their needs were and validating our product ideas, what we found was that a good percentage, close to 60% of those conversations, ultimately ended up being candidates for our charter customers. They effectively helped shape product definition and product scope.

To put numbers on that 60%: we ended up talking to 30 customers in total, out of which 18 to 20 were interested in the product. Out of those naturally fell a dozen or so logos that we were able to engage with during our beta cycle and that became our charter customers.

Martin: At what point in time did you add the first engineers and how did you fund them?

Girish: So as I mentioned, we started hiring our engineers about two months after we were in business. Prior to that, we were able to get some seed funding from a couple of Silicon Valley VCs. So we collected a small amount of money that we could use to hire our “C” team.

Martin: Cool, what is the value proposition that you are trying to deliver and how are you trying to monetize that?

Girish: Sure, the main value proposition from our perspective is that we are focused on delivering higher quality data into big data stores on a continuous basis. While historically the value proposition was around developer productivity, what we are talking about is consumption readiness of data. So from a business model perspective, instead of charging for the amount of StreamSets technology that we are deploying, what we are doing is charging for the amount of data that is under management in these big data stores. The way to think about this is that you have an unlimited ability to deploy as many StreamSets pipelines as you want to get the job done, thereby delivering value to the end user, the data scientist, or the line of business you serve.

Martin: I am wondering why did Cloudera not develop something like this themselves?

Girish: If you look at where these technologies have historically fit in, there was always a need for an independent vendor to deal with what I described as the “any to any” problem. Any data store vendor that focuses on this typically ends up not worrying about the breadth of sources and the breadth of destinations, and from a customer’s perspective, they really want a single piece of technology that can solve the any to any problem. When we look at Cloudera, Cloudera is very comfortable partnering with us because they understand that by keeping us kind of at arm’s length we will be the best of breed in solving this problem, rather than focusing on a technology that is just going to deliver data into Cloudera.

Martin: Cool, so as I understand it, you also enable companies to transport data to other providers like MapR, and so on and so forth?

Girish: Exactly, and in the end it is about more than just Hadoop. One of our other key value propositions is to deliver data to technologies like Elasticsearch, and that is what customers want. They want a single way to manage their data movement between Hadoop, search, and other types of technologies.

Martin: If I am looking at this big data ecosystem, there are some storage layers, there are some analytic layers, and then there is something like what you are doing, an ingestion layer for example. Do you have in mind adding some analytics on this ingestion pipeline? When I looked at your website it seemed to be something like this, which would be nice because you are earlier in this kind of pipeline than the analytics players.

Girish: Good question. We are very much focusing on providing analytics, but around data in motion. There are two ways to think about analytics. One is the analytics the business cares about, which is typically done on data at rest or data in the store. There is another set of analytics that you can think about while data is moving. These analytics typically have to do with data availability: did the data actually get there, how quickly did it get there, etc.? Then there is what we call data fidelity: did something get lost in the process of getting from A to B? That is the area of focus for us when it comes to analytics.
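As an illustration outside the conversation itself, the two kinds of data-in-motion analytics described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: availability is reduced to delivery lag, fidelity to a count of lost records, and the `BatchStats` class, its fields, and the sample numbers are all hypothetical rather than anything StreamSets actually exposes.

```python
from dataclasses import dataclass


@dataclass
class BatchStats:
    """Hypothetical bookkeeping for one batch moving from source A to destination B."""
    records_sent: int
    records_received: int
    sent_at: float      # seconds since epoch, when the batch left the source
    received_at: float  # seconds since epoch, when it landed in the destination


def availability_lag(stats: BatchStats) -> float:
    """Availability: how long did the data take to get there?"""
    return stats.received_at - stats.sent_at


def fidelity_loss(stats: BatchStats) -> int:
    """Fidelity: how many records were lost between A and B?"""
    return stats.records_sent - stats.records_received


# Made-up numbers for a single batch.
stats = BatchStats(records_sent=1000, records_received=997, sent_at=100.0, received_at=102.5)
lag = availability_lag(stats)
lost = fidelity_loss(stats)
print(f"lag={lag:.1f}s lost={lost}")
```

Both numbers describe the movement of the data rather than its business content, which is the distinction drawn in the answer above.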

Martin: What have been the biggest surprises for you when scaling the company?

Girish: I guess as an entrepreneur everything is a surprise to a certain extent. Probably what I would say is fundraising. It’s an art form, not a science, so there is no magic formula. I’ve raised two rounds now, and what I have come to realize going through this process is that you can’t optimize for too many variables; you have to double down on a smaller set of variables.

Despite everything I said about how we developed the product, I would say that finding a repeatable product market fit is still, and always is, a bigger challenge than you think it is. You have got to be very, very focused and intellectually honest about whether you have found that repeatable product market fit. The worst thing you can do is to invest in scale before you know it is repeatable. Then of course, I think attracting talent has also been surprisingly challenging. Silicon Valley is full of people who all want to set their own thing up. So getting an “A+” starter team was a little bit harder than I thought it would be.

Martin: How did you go about that? Did you reach out to your network or did you write a job blog post? What did you do?

Girish: I think ultimately we decided for our first fifteen or so employees that we would effectively ensure that there was at least one degree of connection. So we have very much used our network to get our core team in place.

Martin: Cool, what are the major trends in the big data sphere from your perspective?

Girish: The last few years there has been a lot of experimentation when it comes to big data. There have obviously been application vendors that have used big data analytics and big data techniques to deliver specific, fit-for-purpose applications. But with respect to enterprises, I think there has been a lot of experimentation with the technology stack. As I look out this year, I feel like we are hitting an inflection point where the focus is much more on operationalization: basically, extracting value from all the experimentation and investment that has happened. What I think this is going to mean is that the technology stack is going to need to focus on being always on and always trusted.


Martin: Cool. Girish, imagine your child comes to you and says: “Daddy, I would like to start a company.” What advice would you give to your child?

Girish: Well, my children are pretty young so it is probably going to be a few years before they begin to do that.

Martin: We can start a lemonade shop also.

Girish: That’s right. Well, I think from my perspective they have to answer the question: “Why me?” By that what I mean is that you need to develop a differentiated vision that is sustainable. I think it means going through the thought process and asking, you know, what is somebody else doing? Why is it that you are going to be able to do it differently or better than them? It is something you want to do as a thought exercise well before you decide that you want to get both feet wet and jump into the actual act of starting up and growing a business.

Martin: How did you go about this yourself? You cannot check every person in the world to see whether they might be better than you are. How did you answer this question, “Why you?”

Girish: Yes, so from our perspective we probably thought about this along two different axes. There are a number of different people already out there who are doing something. The way we looked at it was, we would say: “Ok, what is the fundamental philosophy of each of those vendors in terms of what they were trying to solve? What kind of design decisions did they make? What were they optimizing for?” We wanted to make sure we were different from that. That’s one axis.

The other axis, typically, is: will somebody else be able to “copy us” or out-execute us? I think from that perspective, in our case, what we felt was that there was a very small set of people who have lived through the first few generations of this particular problem in this problem space. So we felt that we were uniquely suited to solving this, having that 20+ years of experience thinking about this particular space.

Martin: So were your assumptions:

  1. I am a deep domain expert. I have seen it all and know quite a bit, so there are few people around like that.
  2. If I raise enough money I can outscale the other competitors, thereby owning the market.

Girish: I would say that is a good way to put it. I would actually say that it is 80% the former and 20% the latter.

Martin: Understood. Great. Girish, thank you so much for your time.

Girish: You are welcome.

Martin: Good. Thanks, have a nice day. Bye.

Girish: Bye-bye.


Thanks so much for joining our 12th podcast episode!

Have some feedback you’d like to share? Leave a note in the comment section below! If you enjoyed this episode, please share it using the social media buttons you see at the bottom of the post.

Also, please leave an honest review for The Cleverism Podcast on iTunes or on SoundCloud. Ratings and reviews are extremely helpful and greatly appreciated! They do matter in the rankings of the show, and we read each and every one of them.

Special thanks to Girish for joining me this week. Until next time!
