Remember when you used to keep all of your data packed into data boxes and stacked up on a bunch of data shelves in your state-of-the-art data warehouse? Well, it might be time to fire up the data forklift and haul all of those boxes out of the structured order of your data warehouse and dump them into a data lake so that it can float and sink and swim around in semi-structured and unstructured waters. On this episode, Rohan Dhupelia joins the gang to talk about his thoughts and experiences from engineering just that sort of move at Atlassian. So, pop in your earbuds and strap on your data swim trunks and give it a listen!
Links to Items Referenced on the Show
- Decision Support Systems (DSS)
- Extract, Transform, Load (ETL)
- Amazon S3
- Google Storage
- Kimball vs. Inmon
- Apache Spark
- Amazon Redshift
- AWS Athena
- Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department
- Oculus Go
- #092: A Special Report – Data Journalism Meets Business Analytics with Walt Hickey
- Numlock News
- CDP Institute Daily Newsletter
- Inbox Zero
00:04 Announcer: Welcome to The Digital Analytics Power Hour. Tim, Michael, Moe and the occasional guest, discussing digital analytics issues of the day. Find them on Facebook at facebook.com/analyticshour and their website analyticshour.io. And now, the Digital Analytics Power Hour.
00:26 Michael Helbling: Hi, everyone. Welcome to the Digital Analytics Power Hour. This is episode 101. Or is it one zero one? Or a hundred and one? I don’t know. It’s gonna have to get, take a while to get used to that. Alright. All of that aside, data, big data, data science. I mean, our understanding of all these things exists somewhere on the spectrum of massive industry buzzword all the way down to the electronic storage of a zero or a one. And it defines the era in which we live. We produce it faster and faster, which probably means most of it’s garbage. Nevertheless, we need to store it, connect it, understand it, find meaning in it and most importantly, translate it into value. You’ve heard of data warehouses, data lakes. Well, it’s time to dive in, get it? And learn along with us on this topic. Tim, the data shows that you are a host of The Power Hour. Welcome.
01:37 Tim Wilson: Hey Michael, I’m glad you decided to wear a Speedo given the nature of the topic we’re covering.
01:44 MH: Really?
01:45 Moe Kiss: Yeah, really?
01:49 MK: Such a low note to start with.
01:51 TW: So it can only go up from here.
01:54 MH: I’m not ashamed of my body, Moe. Alright, there is a high correlation, Moe, that you are also a co-host on this show. Welcome.
02:02 MK: That I am, but I am fully dressed, thankfully.
02:05 MH: And I am the third data point that makes a trend. I’m Michael Helbling. But we needed another data point to round up this show so that is where we bring in our guest. His name is Rohan Dhupelia. He’s a data engineering manager at Atlassian. You may have probably… Well, you have definitely used some of their products. And prior to that, he’s also held data engineering roles with companies like Westfield, Coca-Cola and Mars. But right now, he is our guest. Welcome to the show, Rohan.
02:39 Rohan Dhupelia: Thanks, everyone. Thanks for having me.
02:42 TW: Also fully clothed.
02:43 MH: Yeah, [chuckle] but working on three hours of sleep, apparently. Alright. I think probably, definitions are in order. Well, actually, a better place to start. Rohan, love to hear a little bit about what you do day-to-day at a company like Atlassian ’cause I think that’ll help give context.
03:03 RD: Yeah, sure. I’ve been at Atlassian for about three years now and I run their analytics platform team. That’s the team that runs the data lake or the data warehouse or whatever you wanna call it. And so, we’re a team of probably around nine software engineers and we build services that help ingest, transform, and present data better for everyone in the company. That’s kind of what I do and what I like doing.
03:31 TW: So, data warehouses and data lakes are basically the same thing?
03:33 RD: Well, I kind of feel like they are.
03:36 TW: Alright everyone. Thanks for listening, everybody. Great show.
03:45 RD: There are differences but I think that one is probably like a maturation of the other. It’s kind of like the next step of the other.
03:51 TW: So which, data warehouses came first, is that right?
03:54 RD: Yeah. I mean data warehouses have been around probably, I guess they probably started… I remember when I was back in university. We used to call them data/decision support systems, right?
04:06 TW: Okay.
04:08 RD: That’s how I learnt about them and I thought, “God, that looks like a boring thing to do. I never wanna work in one of those things.” And somehow, like 10 years later, I ended up managing a team which is building these things. But a data warehouse is kind of the mutation of that, in that it started to become more than just supporting decisions or the scientists, with data that potentially didn’t always support decisions, like you started storing more and more. And, I guess, data lakes is the further mutation of that, in which you start to store unstructured and structured and semi-structured data sets and it’s potentially not always easy to read it straight away. You probably need to do some further preparation of the data before you can actually get any sort of value out of it.
04:53 MK: The way that I’ve thought about it, and I’m really grateful if you can correct me if I’m thinking about this in an incorrect way, is that in data warehouses, the data tends to be structured. It’s ETL’d before it goes in, so it’s in a format that an analyst can just tap in and query it a lot easier; versus the data lakes where, like you said, it’s thrown in more unstructured. There’s little to no ETL work before it’s pushed in there. And then whoever is using that data, whether it’s your data scientist, your analyst, whoever, it’s kind of their responsibility to get it to a state that they can use it in on the other end. Is that in the ballpark?
05:39 RD: I would say even with data warehouses, you do tend, it’s a good practice to throw data in, in its raw form. It’s always been a good practice to just extract from sourcing systems in their rawest form because you never know what level of detail you need to get down to or if you need to re-build a model or something like that. It’s not worth going back to that sourcing system and doing that. You wanna be able to be agile and rebuild things quickly. You always want the data in its rawest form. I think the difference is that data lakes, it’s probably the technologies is what’s the biggest factor in what’s different.
06:15 RD: I think, traditionally, data warehouses were always built on RDBMS databases, and not entirely scalable. There’s always things you could do to make them scalable, but they’re never gonna be to the same sort of scale as what you can get to with something that’s backed by Amazon S3 or Google Storage or whatever, right? You’re not gonna have that same level of capacity to actually store what you need to store. I think what’s different to me is the technology behind the scenes.
06:45 MH: Alright, I’m gonna pretend that there’s listeners out there who need you to describe what the definition of RDBMS is.
06:53 MH: And that way, I can also find out.
06:55 RD: Oh, sure.
06:56 MH: No, I think I know, but I wanna make sure.
07:00 MH: And sort of what other more modern technologies would be, kind of like, “Well, this is the modern equivalent as well.” Maybe it could be a good thing… Just as a foundational knowledge for everybody.
07:13 RD: Okay. Well, RDBMS basically means a relational data store. So something like Microsoft SQL, I’m sorry, MySQL or Postgres or Oracle database, whatever. And I wouldn’t say that there’s really… Like it’s a dead technology. I would say that it’s just no longer as valuable for the purpose of analytics, as it used to be. I think, for the purpose of analytics, especially when you’re querying or storing lots and lots of data, you go beyond the boundaries of what you can do with a relational data store.
07:49 TW: In that relation then… And to put it really simply, and I remember it given the debates over your Kimball versus Inmon with your data warehouse approach, ’cause you were fundamentally figuring out what were your tables and each table had rows and columns. And the records were the rows and the columns were the data in those records and you join. And a lot of the work with the challenge was scaling some of that, and I have a very, very loose understanding, was, “Oh, which of these… What are we gonna index? How do we manage when we have 20 billion rows in a table and we wanna join it to another table?” Is that kind of what the… Ultimately, the limitation, the non-RDBMS, which I don’t really understand, starts to get around that limitation, that if you just have these massive tables with rows and columns, that they’re just clunkier to work with?
08:48 RD: I think that’s pretty much right. Typically, when you have a system or when you’re ingesting a billion records a day, and you’re then trying to join that data set with another data set, which is ingesting 100 million records a day or something like that, just big numbers. Whenever you’re trying to do that, it starts to hit the limits of something that’s running on one node, like a single-node environment, as opposed to taking advantage of distributed compute environments such as an analytical engine such as Spark or Presto or something like that, where you can scale up to hundreds of nodes, to crunch that data and get to an outcome a lot faster.
09:34 MK: So, but the thing that I struggle with, I guess, from the analyst’s perspective, and I know that a data warehouse structure doesn’t solely fix this problem, but ultimately, if you’re pumping all the raw data in, how do you protect against different people running queries in different ways, to produce different results to not create this whole level of complexity around some of your key metrics? Have you guys actually put anything in place to help navigate this or…
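To make the single-node versus distributed point concrete, here is a toy Python sketch of a partitioned hash join. It runs in one process and only imitates the idea: both data sets are split by the join key so that each bucket could, in principle, be joined on a different node. The function names, field names, and partition count are assumptions for illustration; this is not how Spark or Presto are actually implemented.

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Split rows into buckets by hashing the join key."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets


def join_partition(left_rows, right_rows, key):
    """Hash join inside one partition: index the right side, probe with the left."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[key]].append(right)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[key], [])
    ]


def partitioned_join(left, right, key, num_partitions=4):
    """Join two data sets bucket by bucket.

    Rows with the same key always land in the same bucket, so each pair of
    buckets can be joined independently (here sequentially, on a cluster in
    parallel).
    """
    left_parts = partition(left, key, num_partitions)
    right_parts = partition(right, key, num_partitions)
    joined = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        joined.extend(join_partition(left_rows, right_rows, key))
    return joined


# Hypothetical sample data
events = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
    {"user_id": 3, "event": "click"},
]
users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "paid"},
]
joined_rows = partitioned_join(events, users, "user_id")
```

Because no bucket depends on any other, adding nodes lets more buckets be processed at once, which is the scaling property a single-node database cannot offer.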
10:11 RD: I think it’s a problem everyone still has. What we try to do now in our data lake, and we did the same thing when we had a Postgres data warehouse, as well, but what we try to do is just create separate areas in the data warehouse or the data lake, like separate… Potentially, like a raw zone where every schema is prefixed with raw, so you know that the data hasn’t been touched or manicured in any sort of way. Then we have model schemas, which you know that a data engineer has come by and prepared that data in some form or fashion. And then you have playground schemas, or something like that, where an analyst can have their own sandpit where they can do their own shit, pretty much whatever they wanna do.
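The zone convention described here could be sketched roughly as follows in Python. The exact prefixes (`raw_`, `model_`, `playground_`) and the function name are assumptions; the conversation only says that raw schemas are prefixed with "raw" and that modeled and playground schemas exist.

```python
# Hypothetical prefix-to-zone mapping; real naming conventions may differ.
ZONE_PREFIXES = {
    "raw_": "raw",                # untouched data as landed from the source
    "model_": "modeled",          # prepared by a data engineer
    "playground_": "playground",  # an analyst's own sandpit
}


def zone_for_schema(schema_name):
    """Return the zone a schema belongs to, based on its name prefix."""
    for prefix, zone in ZONE_PREFIXES.items():
        if schema_name.startswith(prefix):
            return zone
    return "unknown"


zones = [zone_for_schema(name) for name in ("raw_billing", "model_sales", "tmp")]
```

A convention like this lets anyone judge from the schema name alone how much trust to place in a table before querying it.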
11:01 TW: So how does that work? You’ve got the raw data comes in, it comes in quickly, it’s in extract and load. Does that mean that there then is transformation where you’re transforming and doing some level of de-duplicating, maybe adding structure, maybe aggregating? Like are you basically… Is that what you’re saying, you’ve got… You then have stuff that’s a little cleaner, more inspected, more monitored. I guess, does it fall on a spectrum? You get all the way to the point where the casual user hooking in with Tableau is gonna be relatively safe.
11:38 RD: Yeah, so I think with the raw data sets and the raw schemas that we create at Atlassian, typically, the most manicuring or preparation we do to the data is trying to get it back to how it was presented in the sourcing system. If it was an event, for example, an event that occurred, we would try and make sure that event is still represented as it was in the sourcing system, but we’ll make sure that there’s no duplicates. So our pipelines potentially sometimes duplicate data and we wanna just make sure that that doesn’t happen in the raw dataset. So what you see in the raw dataset is exactly what you would see in a sourcing system.
12:17 RD: And then as you go up, you just get to different levels of maturity in that dataset. The modeled one will be where we just start to do the traditional Kimball-style modeling, where we’re joining facts and dimensions together and creating Type 1 and Type 2 dimensions or whatnot, to create value and make it easier for people to query and explore the datasets with whatever reporting tool that they have.
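The raw-zone clean-up described above, where events stay exactly as the source emitted them but pipeline duplicates are removed, might look something like this minimal Python sketch. It assumes every event carries a unique identifier in a field called `event_id`; that field name and the function name are hypothetical.

```python
def deduplicate_events(events, id_field="event_id"):
    """Keep the first occurrence of each event, preserving arrival order.

    The event payloads themselves are not modified, so the output still
    mirrors what the source system produced.
    """
    seen = set()
    unique = []
    for event in events:
        event_id = event[id_field]
        if event_id in seen:
            continue  # a pipeline retry delivered this event twice
        seen.add(event_id)
        unique.append(event)
    return unique


# Hypothetical sample: the second delivery of "a1" is a duplicate
landed = [
    {"event_id": "a1", "type": "page_view"},
    {"event_id": "b2", "type": "click"},
    {"event_id": "a1", "type": "page_view"},
]
clean = deduplicate_events(landed)
```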
12:43 MK: What made you move in this direction in the first place? What, was the decision made by your team or was it kind of the business driving the decision to move away from a more structured data warehouse, to a more kind of data lake with lots of different areas for specific purposes?
13:03 RD: I think if you step back like about maybe four years now, we had a data warehouse and we had a Postgres data warehouse and we had a Redshift data warehouse. And we had like two things. [chuckle] And the reason we had two things was because we had these massive data sets and we had these more traditional financials and licensing datasets, and we wanted to bring them all into one single platform. And we knew that both of those platforms had scaling problems. We couldn’t really scale Redshift to where we wanted to get to and we couldn’t scale… You obviously couldn’t scale Postgres to where you wanted to get it to.
13:40 RD: We started exploring the market and seeing what are the big players doing, what are, like, the Netflixes of the world doing? And we saw a common trend, that was like, people are relying on separated storage and compute, so that they could scale both independently of each other. You could scale your compute as and when you need to do, like, large loads or large transformations, and you could infinitely scale your storage, right? That’s kind of why we decided to go down that route. It was mostly that we just wanted to have one platform and to have it infinitely scalable and not have to worry about moving for a little while again.
14:18 TW: And storage is literally data source piped in in a raw, granular format and mucked with as little as possible, and compute is then transformation. Compute is kind of everything else. Is that… Or does compute have a stop point as well? And if you’re hooking Tableau or some frontend into it, is that still part of compute or is there a level beyond?
14:44 RD: I guess when I say compute, I mean more just like compute that’s transforming data. The big data query languages, query engines like Spark or Presto or Hive or whatever it be, that’s what I mean when I say compute, I guess.
15:00 TW: And that gets so… If done really purely and well, that’s so separated that you could swap out the compute?
15:08 RD: That’s right. That’s kind of what we have done in the past. I think as the technologies start to evolve and we start to see a general trend in the industry moving away from Hive, and moving towards Spark, we can easily jump on that and say, “Okay, let’s start transitioning our jobs across it every time we touch them.” And the impact of the data sets is unknown. We don’t need to even [15:31] ____ doing this, but the benefits are maybe a bit faster or a bit more efficient with the usage of CPUs or something like that. It’s those kind of things that’s the real benefit.
15:42 RD: Or the other example is, for example, we recently moved from Presto, which is a query engine built by Facebook and has become quite popular in the open source community and we moved to something provided by Amazon called AWS Athena. And we did that because it was completely managed. But that transition from Presto to Athena was as simple as flicking a switch for us. We didn’t have to do really anything to do that. It was just…
16:13 MK: I don’t wanna give away my last call, but one of the, I guess, hot topics that a lot of people in the industry are talking about and full of buzzwords, is the concept of… Maybe you don’t even need ETL at all or you don’t need a data platforms team. Data scientists can just manage all of this and I don’t have a fully formed opinion yet. I’m doing lots of reading on it.
16:41 RD: Oh, man.
16:42 TW: You literally can’t live without an E, right? You have to get… Unless you’re hooking directly into the source systems.
16:49 MK: Well…
16:50 RD: I hope you and I don’t have the same last call. [chuckle]
16:56 MK: Whoa! This is interesting ’cause that’s exactly what mine’s about. But it’s about the concept of basically letting the data scientist own the process end to end, and the data platforms team then is just to help, I guess, build and manage the infrastructure that they may…
17:15 RD: That’s… That’s like the Nirvana for me, as well. I think that’s why my team exists. My team exists to try and make it… We don’t build the pipelines. We try and facilitate the process of building the pipeline to others in the company. So anyone can ingest data in the company, anyone can create their own transformations, and visualize their own data. We just step back and provide the tools and we measure how successful we are based on how many people are using our platform in the company.
17:44 TW: But when you say anybody can ingest data, they’re ingesting it from the data lake, you’re still getting it from source system into the data lake?
17:51 RD: No they ingest data into the data lake.
17:55 TW: Okay.
17:55 RD: Again, this is probably not data scientist. I guess, something to remember, I guess, is we are a software company, so people are pretty tech-savvy here, and they know how to get data. They know how to use APIs and whatnot, so if they can figure out how to do that, then it should be simple enough for them to get data into a data lake.
18:15 MH: So people can just show up and say, “Hey, I’ve got an API I’m gonna push it into the data lake?”
18:20 RD: Yeah. If you have an event stream, just being able to push that somewhere to some endpoint, and we’ll listen to that endpoint and land it in a nice table for you.
18:30 MH: Okay.
18:30 MK: Here’s what I don’t get, ’cause we’re having lots of these exact same discussions. I don’t wanna say whose responsibility is it, but when you build a new feature or a new piece of tech or whatever it is, and that connection needs to be made from the API to wherever you’re pushing all of this data, I kinda feel like sometimes when no one’s accountable for something, then lots of times, it’s just that step gets overlooked, and ultimately, there are lots of people in the business that will need to use that data at some stage in some capacity. How do you make sure that it doesn’t just get missed by someone being like, “Oh, but I don’t really care about data so whatevs.”
19:10 TW: Or how do you deal with if somebody is to set one of those up, it becomes heavily used, and then the source system changes, or breaks, or whatever? How do you maintain? I don’t know if that’s kind of another side of the same question.
19:23 RD: Yes. This is a challenge I don’t think we’ve really overcome yet. But my thoughts are that it’s sort of a, almost a tiering kind of process. As data becomes more critical to the business and more people start to use it, we start to tier up that data. I suppose, it probably starts like a tier three level, which is basically used by one team for some random purpose, not critical. And as it tiers up, potentially, you start to make sure it follows certain procedures, like the code is maintained in a repository, and there’s [19:58] ____ and green builds against that repository or… There is alerting and whatnot assigned to that push, those kind of things. And I think as you tier it up, then potentially, eventually, it might fall into a centrally managed data engineering team to take care of the real tier one things, the ones that really are critical and drive the business.
20:18 TW: Do you have… Or is just a feature of a data lake typically that there is some sort of monitoring of the use and access of the data? If somebody says, “Oh, there’s this really easily available event stream.” I hook it up. It turns out it’s pumping a lot of data in, maybe still not a big high storage cost, but that could just run forever and nobody’s looked at any of that data ever? Is that… Is there kinda a monitoring of it, some way of which parts of the lake are just kinda sitting out there dormant being unused?
20:57 RD: Yeah, this is… We have rudimentary monitoring, but this is something we are trying to improve as well. It’s… We have a good idea of how much data is getting pumped in and how much data is getting moved around. Where we would want to get to is like, how much is the data getting used and viewed? And then tie them back together and say, “Your data is getting viewed by three people a week, but it’s costing us $3000 a week to land. [chuckle] Is it really worth having that dataset there?”
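The "three viewers a week versus $3000 a week" comparison can be expressed as a small calculation. The Python sketch below is only an illustration of tying usage back to cost; the metric names, the data set names, and the $100-per-viewer threshold are assumptions, not anything Atlassian has described.

```python
def cost_per_viewer(weekly_cost_usd, weekly_viewers):
    """Weekly landing cost divided by weekly distinct viewers."""
    if weekly_viewers == 0:
        return float("inf")  # nobody looks at it, so any cost is too much
    return weekly_cost_usd / weekly_viewers


def flag_expensive_datasets(stats, threshold_usd=100.0):
    """Return the names of data sets whose cost per viewer exceeds the threshold."""
    return [
        name
        for name, s in stats.items()
        if cost_per_viewer(s["weekly_cost_usd"], s["weekly_viewers"]) > threshold_usd
    ]


# Hypothetical usage stats; the first entry mirrors the example in the conversation
usage = {
    "clickstream_raw": {"weekly_cost_usd": 3000, "weekly_viewers": 3},
    "billing_model": {"weekly_cost_usd": 200, "weekly_viewers": 40},
    "old_experiment": {"weekly_cost_usd": 50, "weekly_viewers": 0},
}
flagged = flag_expensive_datasets(usage)
```

Here the first data set works out to $1000 per viewer per week, which is the kind of number that prompts the "is it really worth having?" conversation.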
21:30 MH: Well, and this even begs the question of sort of governance and security. As we get more regulatory structures around data, its storage and its use, how will you govern that? Or how do organizations govern that? Is I guess the question I have.
21:47 RD: I think… I like the idea. It kinda falls in line with Atlassian’s… One of Atlassian’s values, company values, which is “open company, no bullshit.” We try and keep everything open. As much as possible, we try and keep information open throughout the company. Obviously, there are things that we can’t share with everyone in the company, but where we can, we try and keep it open. My interpretation of that is let’s make the data safe to use. Where we might not present the raw data all the time, but what we could do is, in the manicured data sets, we would remove things that could make that data potentially harmful, remove any direct identifiers, or remove anything that’s financially sensitive, or anything like that. So then everyone in the company can use it and limit access to the financially sensitive data to a very small group of people. But the real goal for me is just to keep it open for everyone.
22:51 TW: Did your team get pulled in, because you guys are a global company with GDPR and kind of the traceability? Did that wind up hitting your team at all or in…
23:04 RD: Yeah.
23:04 TW: And is it, “Hey, we’ll have some stuff. Now, we have to manage some of the deletion?” Or was it, no we’re just gonna make sure that we draw a line and we don’t allow the… Or we’re very, very stringent about the personal data that we allow into the data lake?
23:19 RD: Yeah. I think the real goal for us is to have no data, no personally identifiable data in the data lake, and that’s how we’re thinking of solving the problem. It’s like, just don’t let it in in the first place, and then you don’t have to deal with any of that stuff. If it’s completely anonymized, then it’s safe for everyone to use. That’s the real goal for me, is just making the data lake safe.
23:44 TW: But it’s interesting, ’cause then that automatically introduces some transformation, right? ’Cause if you… That means you’ve gotta do some hashing or something, right?
23:52 RD: Yeah, and that’s… That’s where I think it comes down to a platform team like myself to provide those kind of tools on ingestion, on transformation, that… We can say flag particular fields which you don’t want to be landed or you want to be obfuscated and have automations around monitoring landed sources and doing samples and making sure that we’re not actually, we’re adhering to our policies.
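Flagging fields to be dropped or obfuscated on ingestion could be sketched along these lines in Python. The policy format (`"drop"`, `"hash"`, default keep), the field names, and the salt handling are all assumptions made for illustration; the conversation does not describe the actual tooling.

```python
import hashlib


def apply_field_policy(record, policy, salt):
    """Return a copy of the record with flagged fields dropped or hashed.

    policy maps a field name to "drop" or "hash"; unlisted fields are kept.
    """
    cleaned = {}
    for field, value in record.items():
        action = policy.get(field, "keep")
        if action == "drop":
            continue  # never land this field
        if action == "hash":
            # Salted SHA-256: the same input always maps to the same token,
            # so joins and counts on the hashed value still work.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical policy and event
policy = {"email": "hash", "credit_card": "drop"}
event = {
    "email": "user@example.com",
    "credit_card": "0000-0000-0000-0000",
    "plan": "free",
}
safe_event = apply_field_policy(event, policy, salt="example-salt")
```

Salted hashing of this kind is generally treated as pseudonymization rather than full anonymization, so it fits alongside the sampling and monitoring of landed sources mentioned above rather than replacing them.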
24:22 MK: The other topic that I wanted to touch on, I’ve kind of been, especially since my time at The Iconic. I’ve been learning more and more, and working more closely with our data platforms team, just kind of a little bit out of necessity, but also the lines of roles are really blurred here. It doesn’t seem to be a topic, though, that lots of analysts, I think maybe more kinda people with the title data scientists, maybe know a little bit more about data warehousing and data lakes. But in your view, how much do those that are using the data lake actually need to understand about it in order to, I guess, use it correctly and to be able to… How much should we be investing our time, as people that are not software engineers, to learn about these systems? Or is that part of your team’s role to help coach people on that?
25:14 RD: Yeah, I think this is kind of a really interesting point. I think where we’re… There’s probably a line somewhere, where I think I don’t want people to have to worry. In fact the… And I don’t think people need to know how to use distributed compute, how to spin up a cluster or something like that. I think that’s… It’s…
25:32 TW: But man, that’s the easiest part. You just need a console and a credit card, right? It’s just…
25:42 RD: But then, choosing what is the right instance type or all that stuff. It sounds so terribly boring and I think once you’ve… That’s the kind of stuff that… Like an engineering team should figure out once and make it a pattern and just say, “Okay, just press this button and this will do all the work for you.” What I think people, like analysts, need to actually start learning a little bit more is how to write the queries more effectively, and actually understand how the technology works and how the technology will interpret their queries. I think, back in the day of relational databases, you had cost optimizers and things like that, which you could kind of explain your query and see how it’s gonna run. These days, it’s a bit harder to do, but if you can understand the data set that you’re querying, and understand how it’s partitioned, and how the data is stored on disk, then you can query it more effectively and use your compute more effectively. And I think that’s what will make your data engineer very happy.
26:42 MK: And have you been managing that through your team coaching analysts or has that been more analyst-driven, as they’ve had to do specific tasks they’ve just kind of picked it up?
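Why knowing the partitioning helps can be shown with a small Python simulation. It assumes a table partitioned by date, with each partition held as a separate chunk; the structure, field names, and functions are hypothetical and only imitate what a query engine does when a filter on the partition column lets it skip whole chunks.

```python
def select_partitions(partitions, start_date, end_date):
    """Keep only partitions whose date falls inside the requested range.

    Dates are ISO strings (YYYY-MM-DD), which compare correctly as text.
    """
    return [p for p in partitions if start_date <= p["date"] <= end_date]


def count_events(partitions, start_date, end_date, event_type):
    """Count matching events and report how many rows had to be read."""
    rows_scanned = 0
    matches = 0
    for part in select_partitions(partitions, start_date, end_date):
        for row in part["rows"]:
            rows_scanned += 1
            if row["event"] == event_type:
                matches += 1
    return matches, rows_scanned


# Hypothetical table: one partition per day
table = [
    {"date": "2018-10-01", "rows": [{"event": "click"}, {"event": "view"}]},
    {"date": "2018-10-02", "rows": [{"event": "click"}, {"event": "click"}]},
    {"date": "2018-10-03", "rows": [{"event": "view"}, {"event": "click"}]},
]

# Filtering on the partition column reads one day's rows instead of all three
one_day = count_events(table, "2018-10-02", "2018-10-02", "click")
all_days = count_events(table, "2018-10-01", "2018-10-03", "click")
```

The narrower query reads a third of the rows in this toy table; on a large data set the same habit translates directly into less compute used per query.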
26:54 RD: I think we would do like monthly boot camps on our platform, where anyone can sign up in the company and they just come along and we talk about what our platform is, how they can get access and then give them a bit of a briefing on how to use SQL, and how to effectively query the data, how the data is stored, and whatnot. What we’d like to do is take that one step further. And it’s something I’ve seen like Airbnb do. They actually have a data university. And I’d love to get to that stage where you actually have a full curriculum that you can sign up to, like once a semester, or once a quarter or something like that, where you just have a full course of things that you need to do to be successful with data at Atlassian in the company.
27:44 MH: That’s interesting because, really, at the core, that’s sort of a component now for an analyst of data literacy, which I haven’t often associated that aspect with data literacy. I don’t know about you guys, but that’s really kinda cool.
28:00 TW: Well, I think if you start thinking of the university concept internally, that you then start thinking about different degree programs, so you start mapping what’s your role, what are you trying to do, what’s the course work, what are the general requirements. It seems like various platforms, I mean, like Domo has Domo University, that’s kind of geared around learning their platform, which is a big tall order ’cause that’s basically them training you to lock in, whereas, if you’re inside at a company saying, “This is the way we’re gonna think about our internal training, and we’re gonna formalize that,” it probably forces a lot of thinking about what do we really need to do. And a lot of that… I feel like the questions we’re asking you, Rohan, are ones that you’d say, “Yeah, that would probably need to be part of a class. You can’t… You know what? This is probably general ed.”
28:49 RD: Yeah.
28:51 TW: We’ve gotta have people having a basic understanding of what data to never, never, never, never pump in.
28:57 RD: And I think that’s what we’re thinking. This is really, probably in the last two or three weeks, I’ve actually started putting this course together in our company. And what we’re thinking is, there’s probably, I think you mentioned it a bit earlier, but not everything is relevant to everyone. We’re thinking of having probably three separate courses. And one is like a data creator’s course, like where you’re actually pumping data in or transforming data in the data lake. And then there’s probably a data explorer’s course, where you just wanna know how to use the tools to visualize data, how to query more effectively and things like that. And lastly, there’s probably like a data wizard’s course or something like that, where you wanna do magical shit with Python or R or something. And that’s probably where we’ll lean on our data science group in the company to help us out there ’cause we don’t know that…
29:48 TW: And R is going away anyway. At least, that course won’t have to be taught, so…
29:52 RD: Yeah, yeah, yeah.
29:54 MH: Yeah. Feel free to bring in Tim for any R questions you have.
30:01 TW: I don’t know how much of a… Given your role, I don’t know if you have a mental catalogue of everything that’s piping into the data lake. Is web analytics data or web behavioral data from whatever platforms are used in Jira or Bitbucket?
30:18 MK: Self-built, self-built.
30:21 RD: Yeah. I think we do two things. What we have for our web analytics, I guess what you call in marketing assets, they go through Segment and that’s not self-built. We use those client libraries and everything like that. And then for our internal behavioral analytics for our products, we’ve built our own solution there. And that’s a recent thing that we just probably finished in the last year or so.
30:49 TW: And so with that, in that sort of world, if that’s being built, is that being built with a mind to, we’re going to pump this into our data lake and where all of the reporting and analysis will therefore, the building of that is really figuring out a data structure and a data collection and then saying “but we have a data lake and we’re gonna handle the actual use of it” or do those tools get built with there is direct access to reporting and analysis within the transactional system? And maybe where I’m really going with the question is I feel like every platform out there, be it Salesforce as a CRM or be it Adobe Analytics, is doing that data collection bit which is being done in one way, then it’s also doing transformation aggregation and providing some sort of interface for dashboards and reporting and analysis. And it seems like a mature data lake starts to say, “Well maybe we only need this collection piece.” And so I guess I’m wondering if when you build it on your own, do you say really this is really just a collection and taxonomy and scheme of figuring out, we’ve got this other infrastructure where we are actually gonna do stuff with the data? Did I phrase that as a question?
32:11 RD: I think I get what you’re saying. Are you saying that basically, do you rely on these vendors like each one of these sourcing systems, these CRM systems and whatnot, and their reporting mechanisms, or do you try and pull that into the data lake and say “no, we’re not gonna use this stuff, we’re just gonna use our own thing.” Is that what you are saying?
32:36 TW: Well, I just feel like every, those platforms, they’re inherently wired to see their universe as more of the universe than it really is for their end users so they wind up building… Adobe Analytics is the perfect example, right? They build Analysis Workspace and in their universe, the solution is pump more data into Adobe and then you’ll use our web based environment. And there are plenty of companies that are like well, that’s well and good but really, I want your raw data feed and I’m gonna pump it into somewhere else and that is 80% of the value that I’m getting. I don’t feel like the platform… And that’s all, that’s a pure analytics platform.
33:19 TW: If you look at a Salesforce, there’s a need to see my opportunities. There’s a blurrier line because the operation of the system is interacting with the data and drilling in to actually act on records. I guess I’m thinking out loud that the more robust enterprises get with that in-house ability to bring that data together, the less they really have any interest in whatever the operational system has built as a reporting and analysis platform because it’s inherently in a silo with whatever that system is doing.
33:54 RD: Yeah. I think…
33:56 MK: I think you’ve answered your own non-question.
33:58 TW: Yeah, we’ll just cut all this out.
34:00 MK: By just continuing to talk, as Tim does. [chuckle]
34:03 RD: I think we did both, to be honest. So we’ve recently started using Amplitude because they just have great reports for clickstream analysis, like pre-built funnels and things like that. Things that someone would have to actually create in Tableau or something, to actually get any value out of, whereas, you can just get this stuff off-the-shelf with these pre-built reports that you can just take immediate advantage of. While we do pump into our data lake everything, we also pump to some of these services where we see value and then we think we can leverage their tooling and get more people data-driven faster. If those tools are easier for PMs to use, for example, product managers to use, then why not just enable them?
34:49 TW: There’s not a black and white answer. It’s judgment calls.
34:51 RD: Yeah.
34:51 TW: It’s grey area and it’s one, the other, or both, depending on what makes sense.
34:57 MK: Rohan, I did wanna ask. I know when you, when Atlassian moved from a data warehouse to the data lake model that you have now, there was a lot of learnings, and teething pains. And one of the things that I was really interested in when I heard you speak at Web Analytics in Sydney a while back was about you guys turned one off one day and turned one on. I mean the one that was off was still on but you just didn’t show it to people anymore, so to kind of force them to migrate. What was the reaction from the teams like when you made that decision?
35:33 RD: We tried to tackle things use case by use case. All we did, we never really got rid of our old data warehouse. In fact, it’s still there today. It’s just like part of the ETL now. It’s just sitting in the background crunching old data sets and we’re slowly migrating off it, but it’s still there in some form or fashion. But what we tried to do is when we got off it, we just picked every use case. We looked at the analytics basically like the Postgres query logs, and then tried to figure out who is using it for what, then basically picked on them and tried to say, “Okay, why are you using this? Why can’t you use the data lake for this? We’re turning off your access.” And we just went through one by one and did that, until we could find, like we got down to the 10%, which were the hardest ones, and we left them on there for a bit longer until we found better solutions for them in our new roles.
36:28 RD: That’s how we migrated off. The hardest part is moving the pipelines. I think that’s always gonna be the hardest thing to do. The real value for us was we wanted to get people on the platform as fast as possible, so we could take advantage and learn things faster. We had that feedback loop in place, and if we could start seeing how we can improve that platform, as opposed to transition one user at a time and having some people still supporting this old environment, but having to support this new environment as well. We just wanted to get going as fast as possible.
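The query-log step Rohan describes (working out who is using the old warehouse for what, then going through the use cases one by one) could look roughly like the sketch below. It is not Atlassian's tooling. The log line shape, the `usage_by_table` and `migration_order` helpers, and the sample users and tables are all hypothetical, and real Postgres log output depends on the `log_line_prefix` setting. Ordering tables by fewest distinct users is likewise just one assumed way to sequence the conversations.

```python
import re
from collections import Counter, defaultdict

# Assumed log shape (hypothetical; real Postgres output depends on log_line_prefix):
#   <timestamp> user=<name> db=<name> LOG:  statement: <sql>
LINE_RE = re.compile(r"user=(?P<user>\S+)\s.*?statement:\s*(?P<sql>.*)$", re.IGNORECASE)
# Naive table extraction: the identifier that follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def usage_by_table(log_lines):
    """Return {table: Counter({user: query_count})} from statement log lines."""
    usage = defaultdict(Counter)
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a statement line (e.g. connection messages)
        # Count each table at most once per query.
        for table in set(TABLE_RE.findall(match.group("sql"))):
            usage[table.lower()][match.group("user")] += 1
    return usage


def migration_order(usage):
    """Tables with the fewest distinct users first (an assumed heuristic)."""
    return sorted(usage, key=lambda t: (len(usage[t]), sum(usage[t].values()), t))


# Made-up sample lines for illustration only.
sample = [
    "2018-10-01 09:15:02 UTC user=alice db=warehouse LOG:  statement: "
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cid = c.id",
    "2018-10-01 09:16:40 UTC user=bob db=warehouse LOG:  statement: "
    "SELECT count(*) FROM sales.orders",
    "2018-10-01 09:17:05 UTC user=alice db=warehouse LOG:  statement: "
    "select * from finance.invoices",
    "2018-10-01 09:17:06 UTC user=alice db=warehouse LOG:  connection authorized",
]

usage = usage_by_table(sample)
for table in migration_order(usage):
    print(table, dict(usage[table]))
```

A regex pass like this misses subqueries, CTE names, and views, so it is only a first cut at a list of people to go and talk to before turning off access.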
37:02 MH: Alright. I gotta jump in because we’ve gotta start to wrap up. This has been very informative for me and I think probably for our audience as well. So Rohan, thank you so much. One thing we love to do on the show is a SQL query we call Select Last Call from podcast, where podcast guests… [chuckle] Okay, sorry. It’s been a long time since I’ve written any SQL.
37:31 MK: Worst joke ever. Worst.
37:35 MH: Rohan’s laughing, so my job’s almost done. Okay. Anyways, we do a last call. It’s just we go around the horn and we talk about something we’ve seen or anything we think is interesting we’ve seen recently. Rohan, you’re our guest. We’ll give you first crack at it. What have you got as a last call?
37:52 RD: Okay. I’ve been watching… I like to watch other companies and see what everyone else is doing. One company I like watching, because I like their visionary thoughts on ETL and engineers and data scientists, is Stitch Fix. I don’t know if you’ve ever heard of Stitch Fix. This is your call, isn’t it, Moe?
38:11 MK: This is totally the same last call.
38:13 TW: Well, then I will be delighted to learn your two different perspectives on it. Wait. What does Stitch Fix do? What’s the company?
38:22 MH: They send you a box of clothes, Tim.
38:22 MK: Stitch Fix is in the US, Tim.
38:25 TW: Oh, yeah. It used to be…
38:28 RD: I think it’s a machine-learning-driven delivery of clothes. They decide, you give them some preferences. They send you some clothes based on like…
38:37 MH: I tried to sign up and they were like… Yeah, there’s no help. They actually, they ran through it and said, “You’re just not… You’re off-brand. We cannot help you.” [laughter]
38:43 TW: For those keeping score at home, you can now get your clothes delivered, your food delivered, and everything else delivered via data science in Amazon. Okay, so Stitch Fix.
38:56 MK: Michelle Kiss is a very big fan of Stitch Fix ’cause you never walk into a store and they just send you like seven items and then you pick to keep four or five.
39:04 MH: They’re your fit, and like it’s a really cool… Yeah.
39:08 TW: Quick aside. I actually, a couple years ago, was like, “Look, get me one of these subscription things.” My sister, who has, for years, is my fashion fallback person, worked with my parents, so they could for Christmas… Literally, the company that I went with went out of business between my first and second shipment. So I do think I have the ability to actually bring down these organizations. Sell your Stitch Fix stock right now ’cause if I check it out, they’re going down.
39:39 MH: Alright. But back to the actual topic.
39:40 TW: Back to the actual last call.
39:42 MK: But anyway, yeah.
39:43 TW: Okay, Rohan, back to you.
39:45 RD: I’m interested to hear what Moe’s opinion on all this is now, but… What I really like about them, and this is what really inspired us, I guess, to spin off and have a separate analytics platform team in the first place when I first read this article, I don’t know, three years ago, two years ago, was that they actually… They had this pretty bold statement that engineers should not write ETL. And I can see based on your smile that you probably read the exact same shit.
40:14 RD: And I was like that like, “Oh. Fuck, yeah, that sounds cool. I don’t wanna do my job anymore. I want someone else to do it for me.” [chuckle] That’s kind of why we built the services we’ve built. We’ve built, made it easier for people to do ETL without us having to do it. We’re trying to figure out how to make it easier for people to ingest data without us having to do it. And I just really like how they’ve approached it and said they wanna empower the data scientists in their company to be completely autonomous, and not, like you know, have the zero friction with their engineering team. And it’s kind of like, I think that’s a good goal to have for any data infrastructure team, I think.
40:54 MK: Yeah. I can’t believe you stole my thunder. I actually was so proud because I never read stuff on ETL and I was like, “This is so appropriate for this episode.” But…
41:04 TW: You could always just do your last call was The Undoing Project. I mean, you know.
41:10 MK: Oh. The dude that wrote it, his name’s Jeff Magnusson, I think.
41:14 RD: Yeah, I think so.
41:15 MK: And it’s “Engineers Shouldn’t Write ETL: A Guide to Building a High-functioning Data Science Department.” And I think he made some really interesting points, and like I said, I’m still mulling all of this over, I don’t have a set opinion yet. I think his point about, you think you have big data but really, you don’t. And if you make software engineers spend their time doing ETL, they’re just gonna go to bigger companies that do actually have interesting data problems. I thought that point was really interesting. I think the concept of who in the company are thinkers, I’m not sure how I feel about that quite yet. The thing that really struck me was he was like, “Well, no one likes to do ETL.” And I actually wanted to ask you, Rohan, is that the case that really no one does wanna do it?
42:00 RD: I think there’s like a particular…
42:00 MK: Because I’m like, there’s heaps of people that, that is their job. Surely, someone enjoys it.
42:04 RD: There are some people who love modeling data.
42:06 TW: There are people who like to do QA.
42:08 MH: Yeah.
42:08 TW: And they like to teach kids.
42:09 MK: Yeah.
42:09 TW: I mean, I can’t explain humanity.
42:11 RD: Yeah.
42:12 MH: There are people out there who are actuaries. [laughter] So, yes.
42:19 RD: I think data engineers, in my opinion, they wanna build. They see a pattern and they wanna build a framework. And I think that’s kind of what, the experience I’ve had at Atlassian at least. They do it once or twice, they get bored of doing it and they wanna try and figure out how they can just automate the way they’ve done it. And that’s like what an engineer should be doing, right? You shouldn’t have to do things over and over again.
42:43 MK: Yeah, I guess. I think the bit that’s still, I’m a little unstuck on is, I can see… I actually got sent this article by one of our data scientists. I can see why data scientists would be pro-this approach. I’m still struggling a little bit about why data engineers would necessarily see, I guess, stepping back and letting data scientists and analysts do all of this would necessarily be a… Like I feel like there could be some pushback from that of like, “Hey, actually, we’re SMEs in this. We should be owning this space.” Yeah, so I’m still mulling.
43:18 RD: I think it’s… We’ve taken it with a grain of salt. I think at Stitch Fix, they strongly believe this and I don’t think that they actually have… They probably have data infrastructure engineers and have data scientists. I don’t believe they have data engineers, like modelers, data modelers, whereas we’ve taken it virtual. Yes, we need… We wanna allow people to be their own data engineers, but we also understand that there are some data sets which we wanna make sure are pristine and well-modeled and we would rather get someone who’s familiar with the best modeling techniques to come in and do the right things with that data set. So we’re taking the best of both worlds, but that article was a pretty good inspiration for us to go on our journey as well.
44:03 TW: Interesting.
44:03 MH: Alright, Moe, what’s your last call.
44:04 MK: Yeah, nice.
44:05 MH: No, I’m just kidding.
44:08 MH: No. It’s a joint…
44:11 MK: Honestly.
44:11 MH: Aussie last call, so good job. Way to show unity.
44:17 TW: You wanna go next?
44:19 MH: Sure, I’ll go. I’m sure I’m stealing yours too, Tim. No, I’m just kidding. Mine is actually a product, of all things. Recently, I managed, I found some points in a point system that I had nothing, I had no idea what I would do with them so I bought something frivolous, which was an Oculus Go. Oculus is a company in the VR space bought by Facebook. It was never apparent to me why Facebook bought this company until I actually put the thing on and then I was like, “Oh, that’s why they bought it.” But here’s why you should consider getting one of these, because it is the holiday season coming up soon. This is why.
44:57 MH: Not to use, ’cause who cares? It just sits there. You bring it to parties and watch your friends ride a roller coaster in VR for the first time and laugh uproariously. [laughter] It’s one of the funniest things you will ever see. Ideally, there are these amazing use cases for exploring data in VR in three dimensions, which I’m super, super excited about. We’re years away from that, but for right now, watching your friends try and ride a roller coaster in VR will provide you a lot of laughs. Until there’s data, we’ve got you. Alright. So Oculus Go. It was a surprising, interesting little… I don’t know, it’s a cool technology. I don’t know.
45:38 TW: I’m actually literally going to repeat a last call that one of our guests from a few episodes back had, when we had Walt Hickey on, and his last call was email newsletters. And since then, I actually took his last call to heart, and so I definitely subscribed to Numlock News, which still is my favorite little gift every morning.
45:58 MH: It’s getting better and better, it’s really good.
46:02 TW: Oh, it’s… My 13-year-old, she also gets it, so she… Pretty likely I’m getting texts from her at school, where she’s sharing something. But part of his admonition was, I’ve actually signed up for some other ones that, and they’re a variety of… They’re not all daily, some are weekly, but I get a… CDP Institute has a daily one, which is not really that fun to read but is, I get some quick headlines, so because I’ve gone the zero inbox route, that’s literally what I’m doing for five to 10 minutes when I sit down at my desk and it’s been wonderful. I endorse those, and I’m kind of also endorsing the Numlock News because it’s all, they’re all set up. They’re always gonna come in. You can set up a rule so it’s auto flagged, so you can read it, archive it, and it’s filed away. There’s my last call.
46:50 MH: Thanks Tim, thanks everybody. Now, you’ve probably been listening and been like, “Wow, I wish I could find out more about data lakes and how to use them and not doing ETL as a Data Engineer and all of the above.” And so, the best place for you to interact with us is on the Measure Slack or on our Facebook page, and we would love to hear from you. Rohan, it’s been extremely informative from the foundational knowledge of RDBMS to getting deeper into the actual ins and outs of streaming data in the APIs and managing all those things. Thank you very much for sharing your knowledge and coming on the show. I appreciate it very much.
47:34 RD: Thanks for having me.
47:36 MH: And for my two co-hosts, namely Tim Wilson and Moe Kiss, not Michelle, related, not the same. Just good for us to keep saying that, apparently.
47:50 MH: I wish all of you the ability to keep analyzing in the data lake of your choice.
48:02 Announcer: Thanks for listening. And don’t forget to join the conversation on Facebook, Twitter, or Measure Slack Group. We welcome your comments and questions. Visit us on the web at Analyticshour.io, Facebook.com/analyticshour, or @analyticshour on Twitter.
48:20 S6: So smart guys want to fit in so they made up a term called analytics. Analytics don’t work.
48:30 TW: This is the only aspect of the show you have any control over.
48:35 TW: But people use it to figure out what they think of you as a person.
48:38 TW: No, I’m just kidding.
48:45 MK: On the podcast, ’cause I find it really interesting, but we’ve never actually had someone that’s literally going through the process, to kind of talk to us about their experiences… [chuckle] What you’re doing helps.
48:58 TW: Sorry.
48:58 MK: Seriously? Is that your ring tone?
49:02 TW: Yeah for that. I don’t know. Sorry, turning my phone off. Is it a unicorn? Are you guys a unicorn?
49:08 MH: Oh yeah. We’re totally a unicorn.
49:10 TW: Oh well, there you go. Is that Australia’s only unicorn? That would be cool.
49:16 MH: Once you stop growth hacking, you’re no longer a startup.
49:22 RD: Sorry, do mine one last time.
49:24 MH: Yeah, yeah, sure, that’s fine. That’s why we edit. If you can say, “Yes. That’s correct.” Or nod, and then elaborate, that would be great.
49:32 MH: R-D-B-M-S. RDBMS.
49:37 MK: No, don’t like literally old. Like, good old analyst.
49:42 MH: Oh yeah. That’s what I took away.
49:44 MK: Man, throwing it out there again.
49:46 MH: Yeah. We’re pretty triggered right now.
49:48 MK: I feel like there should be some decorum in Twitter.
49:53 TW: Moe, you have a face.
49:54 MK: I always have a face. I don’t know what you’re talking about.
49:57 TW: Rock flag and data mashes.