What does it really take to bring data science into the enterprise? Or… what does it take to bring it into your part of the enterprise? In this episode, the gang sits down with Dr. Katie Sasso from the Columbus Collaboratory…because that’s similar to what she does! From the criticality of defining the business problem clearly, to ensuring the experts with the deep knowledge of the data itself are included in the process, to the realities of information security and devops support needs, it was a pretty wide-ranging discussion. And there were convolutional neural networks (briefly).
References Made During the Show
- R-Ladies Columbus
- Women in Analytics Conference
- Convolutional Neural Networks
- Logistic Regression
- Julia Silge – Text Classification with Tidy Data Principles
- #095: The Rise of BI with Taylor Udell
- (Video) Using Electron and node.js with Shiny to make standalone, interactive, Shiny apps (#useR2018 conference presentation by Katie)
- (Video) Ad fraud deep dive: what is the true impact of digital ad fraud?
- Pawel Kapucinski
- Jupyter Notebook for Beginners: A Tutorial
- The NFL’s Inaugural Big Data Bowl
- Will Sasso
- @katiesasso on Twitter
Interested in our newly posted Producer job that pays in friendship and whisky recommendations? Here is what you need to know.
00:04 S?: Welcome to the Digital Analytics Power Hour. Tim, Michael, Moe, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com/analyticshour, and their website, analyticshour.io. And now, The Digital Analytics Power Hour.
00:26 Michael Helbling: Hi, everyone. Welcome to The Digital Analytics Power Hour, this is Episode 107. Have you ever thought to yourself, “Well, of course, we wanna do data science, but we are but a 120-year-old aluminum company or something, and no one here has even contemplated the invalidity of the pie chart, much less the inner workings of centralized code repositories for commonly used R snippets.” But that’s what this episode is about, not just data science, but data science at the enterprise level. And in the next 40 or so minutes, we will be slightly more useful than all those Watson commercials you’ve seen on TV. I used a predictive algorithm to arrive at that conclusion. Hey, Moe, have you ever worked at a company like the one I was talking about?
01:17 Moe Kiss: Just a little bit, some growing pains, for sure.
01:20 MH: Oh, okay. And, Tim, I believe you were the first person in the world who recognized the invalidity of the pie chart. Is that correct?
01:27 Tim Wilson: As long as nobody tries to Google that, we’ll go with it.
01:31 MH: Okay. And I’m Michael Helbling, as someone who thinks about data science but is content to let others actually do it. [laughter] But to really get this conversation over that IBM Watson hump, we needed a guest, someone who could bring some real perspective to the enterprise application of data science. And so we found our guest. Katie Sasso. She is a data scientist at the Columbus Collaboratory. She is the organizer of the R-Ladies Columbus Meetup, and is also one of the planners of the Women in Analytics Conference. She has her PhD from the Ohio State University in Experimental Psychology, and we are pleased to have her on our show as our guest. Welcome to the show, Katie.
02:14 Katie Sasso: Great. Thanks for having me, guys. I’m really excited to be here.
02:17 MH: Yeah, we are, too. I think one thing to get us kicked off, would be great, is just to get a little better understanding… The word “data science” and “data scientist” gets thrown around a lot, so maybe just give folks a sense for your day-to-day what you do at the Columbus Collaboratory.
02:33 KS: Yeah, absolutely. Like you said, I’m a data scientist at the collaboratory, and day-to-day, I do a variety of different things. Typically, from a data science development perspective, I’m usually working full-time on about two projects. And at the collaboratory, we try to time box those a bit, so it varies, but on average, they’re about six to eight weeks in length. And so on a typical day, I’m doing a mix of meetings with clients to review goals and targets, and interim deliverables for those projects. I’ll be doing a lot of programming in R as well for those projects. And then more recently I’ve started to have a heavier hand in scoping new projects and bringing those in. And I’ve also started to run some things we’re calling workshops, so those are basically half-day use case discovery sessions. It’s growing to be kind of a mix of all of those activities, but I’d say on a typical day, it’s meetings, it’s programming in R, and it’s talking with clients and figuring out what is gonna be best for helping them answer the question or improve the process that they’re focused on.
03:42 MH: So just as you’re referring to your clients, can you maybe give the quick rundown on what the Columbus Collaboratory is and how it works? ‘Cause that’s kind of… It’s unique and interesting, and I think your clients are literally different enterprises, right?
03:55 KS: Yeah. Yeah. The collaboratory, we are kind of considered a start-up, and in a lot of ways, we’re not like a start-up at all. We were founded about four years ago by seven major non-competing companies in Columbus. Those include AEP, Battelle, Cardinal Health, Huntington Bank, L Brands, Nationwide Insurance, and OhioHealth, so seven companies in completely different industries, and the goal of our charter was basically to focus on bringing these seven companies together and building each other up, and also the Columbus area specifically in advanced analytics and cybersecurity, and then also the intersection of those two. The long-term plan was, let’s harness the power of the collective of these seven non-competing companies to improve one another, and then eventually let’s go out into the marketplace and start to sell our services and products in a way that can just move the area forward in these two paths and also their intersection. My clients for the past two years have been folks at these large companies, but then more recently we’ve also started to take on external clients.
05:04 MK: And so how exactly is your team structured? Or is it more that every person in the team works for different clients? How exactly is it structured work-wise?
05:15 KS: Yeah, good question. I think what we really try to do is keep things fresh and interesting for all of our developers and practitioners. Typically, how you decide what project you’re working on as a data scientist is a combination of your expertise and your skill set, and then also capacity, too, is something we factor in. But the goal is to keep things fresh such that you’re not working on the same project for years and years, which was really exciting to me when I started. Basically, we have projects, you can view them as coming in from all parts of the business. And then this past year, we started to focus a little bit more into certain areas, of course, cybersecurity and applying advanced analytics there. The IT space naturally falls out of that. And then we’ve done work across all different areas, but we’ve built up a great collection of work broadly in areas of risk and finance. So, just depending on where there’s demand from members and where we see opportunities that will be the content of the project that falls on your lap or in the queue.
06:18 TW: My impression is that some of the enterprise clients I work with, they really don’t have any in-house data science, and to Michael’s joke at the beginning, they’re still struggling with… They don’t even have a BI tool, they’re barely doing stuff in Excel. Other enterprises have a data science team, but for whatever reason, it’s kind of the scope of what they work on, there’s a boundary drawn around it somewhere, even though, from a high level, you’d say, “Well, there are probably all sorts of other areas in enterprise that can use it.” What is driving the need or the recognition that, “Hey, we can use this pseudo-outsourced… ” You’re effectively an external consultant, even if you’re known and therefore presumably blessed from a data access and NDA and all that sort of stuff. Is it, “We’re not doing any of this stuff,” or, “We just don’t have the capacity?” What makes them stand up and say, “Hey, this would be a good six to eight-week sprint of a project for you guys to work on?”
07:24 KS: So, I’d say it kind of… It varies quite a bit. I’d say within all of our members, so those are pretty large companies, within all of them, there is some sort of analytics function somewhere. So, I’d say it’s not the case that none of them have any expertise in analytics. I’d say quite the opposite. Some of them have very advanced expertise in analytics, and others have expertise in their more specific domain of problems that they see come across that are more industry specific. I’d say, to answer your question, for members, it varies in terms of how or why they come to us for a problem. I’d say certainly, for cybersecurity, that is probably the main area where the members might look to us as having this expertise or skill set that perhaps might be harder to find in-house. That would be one reason, I think, they come to us on the cyber side. And cyber projects look all sorts of ways. It can be anything from trying to detect insider threat or trying to basically detect attacks early in the stages of them happening, rather than reacting to them, and all sorts of other different things in the IT and cyberspace are things that folks might turn to us for ’cause they feel like we have an expertise built up in that area, which is definitely the case.
08:44 KS: I’d say, in other areas, oftentimes folks will come to us because, A, they might have the expertise but maybe they don’t have the capacity, or maybe it’s a risky project that they have a highly skilled analytics team, but maybe this project has an uncertain ROI on it, and they want us to move quick to just prove our value, and then they might decide to take it from there and run with it, or maybe they want us to work on it. And then I’d say the other… So it’s expertise capacity needs. And then I’d say the third bucket, sometimes maybe folks within the enterprise, you don’t actually typically have access to the analytics team, so it’s kind of a combination of, in that case, I guess, expertise and lack of availability. So, there’s this internal prioritization typically in a lot of these companies and a lot of demands on where the analytics team spends their time. For those folks, I’d say they’re maybe in a totally different silo of the business, like a call center or HR. They might have access to somebody who does really lightweight reporting, like Tableau dashboards, but it might be much harder for them to get resources that fall more in the predictive modeling domain. So, that’s probably the third big bucket.
09:57 KS: And then I think in the marketplace, for us turning to external clients, that’s been kind of… We’ve seen a mix of all those different needs. I think there are some folks that are really looking at us to be the experts that don’t have folks in-house, and I’d say there’s also been a lot of interest for folks who maybe do have quite a practice in-house, but they want a team that’s gonna be able to move quickly and ship them an R package in the rawest form, and then their team will take it and run with it. Hopefully, that helps a bit.
10:24 TW: That’s interesting. There’s so much demand for these internal resources. We know we’d like to try it as well, but we’re not gonna be able to make that case internally, that thinking about it from a… If there’s an analyst, using me as the example of an analyst, who’s trying to learn aspects of data science… Now, I’m not inside an enterprise, but if I was inside an enterprise, it seems like there’s the opportunity to say, “Look, we’re probably never gonna bubble up on the data science team’s radar. We’re gonna struggle to get those resources, but we can maybe tap into them a little bit.” Not circumvent them, just say it’s an opportunity for the traditional analyst who has now learned or has been digging into modeling to say, “Let’s try this out.” Now, they don’t have the necessary… That’s not a way to have the safety blanket of the actual knowledge, expertise, experience to do it, but it’s just what you’re making me think of, that those pockets where the call center, say, never has had really advanced analytics work being done, and it’s an opportunity to say, “We know our data, we know our problems, we just need to slide in and see if we can prove out,” without necessarily staffing up a full team presumably. Right?
11:42 KS: And I think one of… One of the offerings we’ve started to develop in the past year or so has been this idea of these half-day use case discovery workshops. We might actually go into a team, like the one you just described, where they know their data, they know their processes and they know where they wanna be, but they’re just not… Perhaps they don’t have the hard technical resources in all the areas, or even the time to get there to build more of that predictive modeling solution. So, we will come in there and we’ll lay out a roadmap of different analytics use cases. And then it’s really nice because it’s up to them to say, “Maybe we wanna engage the collaboratory on this use case, maybe we can go in-house and see if our internal enterprise analytics team can handle this one.” So, those have been nice ways of showing what’s possible and then arming them with the information to say, “Hey, here’s how we wanna tackle these.”
12:35 MH: Yeah. When you mentioned the half-day stuff, I was like, “I wanna come back to that,” so I’m glad you did.
12:39 KS: Yeah, yeah.
12:40 MK: I wanted to turn a little bit and ask your thoughts on a situation… Actually, over the break, it was really funny. Clearly, I have too much free time, ’cause a bunch of data nerd people in Sydney got together to hang out and just talk about data nerd problems. But surprised to say we were all talking about the issue of data scientists nowadays, they’re often getting tasked with building dashboards and reporting, and the way that a data scientist often does good work is to be left alone with a problem for two months and to have time to build something and iterate, and that doesn’t often happen in the business context. I’m just curious to hear how you are able to manage that because of the structure, or do you think you’re doing something inherently different that’s giving you, you and the team you work with, the time to actually tackle those big problems?
13:31 KS: Yeah, that’s a good question. I think we’ve definitely… I’ve definitely encountered the desire for the dashboard. And I think it’s maybe not that folks are just like, “I just want a dashboard.” But I think there’s something that you can touch and feel about something like a dashboard, something very tangible. So, I think that’s definitely come up and that’s a pitfall. I think the way to guard against that is maybe something, I think, that our team collectively in our process has done a really good job of hardening up. And I think really that is taking the time in the scoping phase to really identify the question or the set of questions you’re trying to answer, and trying to get the user to focus on those as opposed to the tool. It may not be the case that a convolutional neural net is the best approach to your problem. Maybe a nice little logistic regression will work, depending on what the use case is. I think we try to get the user to focus on their question and the processes, and we try to give them the visibility into why we select the algorithms we do, if they’re interested. So, that’s one big thing, is focusing on the question, and then picking apart, asking really specific questions about the process that might underpin that question they’re trying to address or that aspect they’re trying to improve on. That’s one thing.
14:51 KS: And I think the other thing that I really learned from my co-workers as well is the huge importance of knowing who your end user is and what that looks like. Is your end user, in the case of a company like L Brands, Victoria’s Secret, is the end user gonna be me, somebody shopping online who’s gonna see an ad? Right? Or in a totally different context, maybe your end user is internal, maybe I’m talking to the end user and they’re using this application to detect insider threat for cybersecurity. So, it’s really important, at the end of the day, to know your end user. And depending on who your end user is, I think that is gonna dictate the importance of something like a tangible dashboard that can really translate the results of maybe a very sophisticated statistical model, versus if your end user is maybe more of a data practitioner, they don’t want that at all. They just want you to build a package in R and give that to them maybe, and that will help them move quicker.
15:45 KS: So, I think those are the two big things, is knowing your question, knowing the data and the processes that underpin it, and then knowing who your end user is. And you’ll still get some of that push back sometimes in terms of folks just focused on the AI solution, or magic button, or the dashboard they want. But I found that most folks are pretty smart, if you work with them to help them understand their problem and our process, that they’ll be pretty receptive to opening up to different solutions and what a different end product might look like.
16:18 MH: The workshops, whether it was a formal workshop or whether it was really just the discovery, how do you balance there’s understanding the problem, but then presumably you have to understand what data might be available? It seems like it would be easy to get sucked down into talking about a convolutional neural network versus logistic regression in the discovery to make sure that the use case or the problem matches, the data matches. Some approach… Is there a tendency to get sucked down into the solution-ing when you’re trying to brainstorm around problems?
17:01 KS: I’d say typically, in a workshop, those sorts of technical details don’t typically come up. I will say we… I definitely have experienced before a focus on the solution versus understanding what the use cases are, so that’s where that walking through what the process is, “Hey, this is really about pulling out analytics use cases that are gonna be of the highest value to you, and then we’ll give you our best ideas of what those potential solutions might be to answering those questions.” One thing I try to do before these is I try to get a sense… It’s hard because the whole point of the workshop is to figure out what use cases you have, but depending on the space, if you know it’s an IT service desk or maybe a marketing department or a call center, you know roughly the type of problems they have. So, there’s a bit of pre-work involved, and then typically I’ll try to get a sense of what data they have ahead of time, and the most important piece is finding somebody who really understands and owns that data, and making sure they’re present at the workshop, as well as somebody who knows the process.
18:07 KS: Typically, if I can get a sense of that ahead of time, I’ll at least know, “Hey, there’s absolutely no data here to address this question,” or I’ll know, like, “There might be a sparse number of records for this type of question.” I think I have to really pull out what they’re looking for here, and look through the lens of, “Is this possible?” And then I’ll often do some followup work to actually pull in their data, see how many records there are, see if data sets even join to one another, before I tell them, “Hey, go after this use case,” where really they don’t have the data for it and the data doesn’t pair up. But, yeah, it’s definitely happened before where folks are interested in a specific end solution. But I found that I think with the appropriate prep work ahead of time and the appropriate conversations, that they’re more than receptive to finding the right solution, but there is that conversation up front, I think, to get everybody on that page.
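As a rough illustration of the "do the data sets even join" check Katie describes, here is a minimal Python sketch. It is not the Collaboratory's code (Katie works in R); the function name `join_coverage`, the `employee_id` key, and the sample HR and IT records are all hypothetical, chosen only to echo the HR-to-IT example that comes up later in the conversation.

```python
def join_coverage(left_rows, right_rows, key):
    """Return the share of left-hand records whose key also appears on the right."""
    # Collect the non-empty key values available on the right-hand side.
    right_keys = {row[key] for row in right_rows if row.get(key) not in (None, "")}
    if not left_rows:
        return 0.0
    matched = sum(1 for row in left_rows if row.get(key) in right_keys)
    return matched / len(left_rows)


# Hypothetical sample records; in practice these might come from csv.DictReader.
hr_rows = [
    {"employee_id": "E1", "dept": "Sales"},
    {"employee_id": "E2", "dept": "IT"},
    {"employee_id": "E3", "dept": "HR"},
    {"employee_id": "E4", "dept": "IT"},
]
it_rows = [
    {"employee_id": "E1", "ticket": "T-100"},
    {"employee_id": "E2", "ticket": "T-101"},
    {"employee_id": "E9", "ticket": "T-102"},
]

coverage = join_coverage(hr_rows, it_rows, "employee_id")
print(f"{len(hr_rows)} HR records, {len(it_rows)} IT records, "
      f"{coverage:.0%} of HR keys found in the IT data")
```

A low match rate at this stage is the kind of signal that a use case may not be supportable with the data as it stands.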
18:58 TW: I’m no data scientist, but you are sounding like you were describing some of the first couple of steps of the CRISP-DM methodology. Do you guys follow the CRISP… Are you familiar with the CRISP-DM? This is me reading just enough to be dangerous, but it’s like business understanding and then data understanding?
19:17 KS: I’m not. I don’t know the article you’re talking about. Actually in a…
19:21 TW: Well, Episode 108 is totally gonna be you trying to help us with our relationship…
19:26 KS: Love it.
19:26 TW: As podcast co-hosts. I guess that’s good to… Like you say, I think I have that initial piece… Forget the methodology. I just talked to one too many… I’ve heard it one too many times and thought everybody does this, and I was gonna sound cool, and instead I sound like a hack. [laughter] But to play that back, you were saying you have some familiarity of the problem beforehand or possible problems and the data when you say, “This use case feels pretty good, let’s do a little bit of exploration of the data to make sure,” just in case you don’t have… If somebody is not as knowledgeable as they think they are about the data, they may say, “Oh, yeah, we have all historical data going back and it’s super easy to access,” and then it turns out that it’s flat files stored on tape in Iron Mountain.
20:18 KS: Yes. I think that’s part of the prep work, too, is I say broadly what data is gonna be of use here, and then when I get the attendee list, I walk through and I try to say, “Hey, is this the person who knows this database? Is this the person who knows, I don’t know, your marketing data?” It would be more specific than that. And they say, “Yeah, I think so.” And then I try to actually give whoever I’m scheduling this with, which is often somebody who’s more in leadership, I give them specific examples of questions. I hope this person could answer, like, “So, would they be able to tell me if these are audio recordings or transcripts? Would they be able to tell me if your HR employee data matches up to this IT data? Would they be able to tell me that sort of information?” And if the answer is, “Gee, I don’t know,” then I schedule a call with the person who we think knows, and if they don’t know, I try to find the person who does.
21:11 MK: Damn, you’re really good. [laughter] Like everything you say, I’m like, “Man, yeah, that’s a great plan. Great.” Okay. On that though, what happens when you do get into the process and the data doesn’t actually look like what you anticipated or it’s just a really shitty data set that’s got heaps of missing values. How do you tackle it when you go in thinking one thing and along the way you’re like, “Well, this is gonna blow out by a month”?
21:42 KS: Right. Usually a quick litmus test is okay for knowing, “Hey, I think we’ll be able to generate something useful.” It might not be, “We might be getting you like, I don’t know, Toyota Camry versus a BMW, but I think we’ll be able to get you a car,” some mode of transportation. Usually a quick litmus test of me looking at the data can tell me that much. And I think most of that actually just comes from experience being a practitioner in data science and knowing what you need to build a certain type of model or answer a certain type of question. I’d say actually one kind of… From a data perspective, I’d say one of the bigger black holes that I think is hard to know till you really do more than a rudimentary pass is text data. Sometimes folks might have an idea that text data is really useful to get all these useful insights, and then you get in and it takes a few hours of work to get to the point where you’re like, “This is garbage. People are just saying the same thing over and over again, people drag and drop, copy and paste.” If there’s 10 records, sure that’s easy to spot, but if there’s 10,000, you’re gonna have to do some processing to get to that point.
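One way to picture the processing needed to spot copy-and-paste text at scale is a distinct-value ratio over normalized text. The Python sketch below is only an assumption about what such a first pass could look like; `normalize`, `duplication_summary`, and the sample notes are hypothetical, and real text fields would likely need fuzzier matching than exact comparison after lowercasing and whitespace cleanup.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def duplication_summary(texts, top_n=3):
    """Summarize how repetitive a free-text field is."""
    cleaned = [normalize(t) for t in texts if t and t.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    return {
        "total": total,                 # non-empty entries
        "distinct": len(counts),        # unique entries after normalization
        "distinct_ratio": len(counts) / total if total else 0.0,
        "most_common": counts.most_common(top_n),
    }


# Hypothetical call-center notes with heavy copy-and-paste.
notes = [
    "Customer called about billing.",
    "customer called about  billing.",
    "Customer called about billing.",
    "Password reset requested.",
    "",
    "Customer called about billing.",
]

summary = duplication_summary(notes)
print(summary)
```

A distinct ratio near 1.0 suggests varied text; a ratio near 0 suggests the field is mostly the same few strings repeated.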
22:50 KS: So, I’d say those are the typical black holes where a quick litmus test is not gonna be good. If you pull in the file, you can do a quick read on table on how many missing values you have, that’s gonna be easy to check. You get a quick sense of the number of records, how far back in time they go, if they link to one another, but things like the quality of certain fields and what predictive weight they’re gonna hold, those typically take more than just an hour or two of looking and pulling the data into R or whatever you’re using.
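The easy-to-check part of that litmus test (record count, missing values per field, how far back the data goes) might look something like the following Python sketch. Katie does this in R; this is just an illustrative stand-in using the standard library. The `profile` function, the set of missing-value markers, the column names, and the sample CSV are all assumptions, and the sketch assumes well-formed rows with ISO-format dates.

```python
import csv
import io
from datetime import date

# Assumed markers for a missing value; real files may use others.
MISSING = {"", "NA", "NULL"}


def profile(rows, date_field):
    """Count records, missing values per column, and the date range covered."""
    missing = {}
    dates = []
    for row in rows:
        for col, val in row.items():
            if val is None or val.strip() in MISSING:
                missing[col] = missing.get(col, 0) + 1
        raw = (row.get(date_field) or "").strip()
        if raw not in MISSING:
            dates.append(date.fromisoformat(raw))
    return {
        "records": len(rows),
        "missing": missing,  # only columns with at least one missing value
        "earliest": min(dates) if dates else None,
        "latest": max(dates) if dates else None,
    }


# Hypothetical extract; in practice this would be a file opened with open().
SAMPLE = (
    "ticket_id,opened,category,notes\n"
    "1,2017-03-01,billing,called about invoice\n"
    "2,2018-11-15,,NA\n"
    "3,,login,password reset\n"
    "4,2016-07-09,billing,\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile(rows, "opened")
print(report)
```

As the conversation notes, a pass like this says nothing about field quality or predictive weight; it only rules out the obvious non-starters.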
23:18 TW: Where are you actually typically working with the data? Given you’ve got cybersecurity is another focus, it seems like, “Okay, we’ve got a use case now, give me a sample of the data.” Are you having to typically work through VPN and keep the data in-house? Are you able to pull the data somewhere else? How do you have to… And I’m thinking you give them a… If I’m noodling around with it inside of the company, I’m still gonna have to probably put the data somewhere and I’m probably not gonna get IT to necessarily stand up an environment for me.
23:53 KS: Yeah, that’s a great question. I think hats are off to the collaboratory DevOps team who does a ton of work. But basically I’d say, when I’m doing development work, I’m never on my own computer, I’m always gonna be somewhere. So, it varies. The majority of our member companies, we actually have hardware that sits behind their firewall, so super secure. Basically, there’s one way in and one way out. The data never leaves their premises actually, so that’s really cool. Obviously, a huge investment to stand up those environments. I’d say, on the other hand, if it’s somebody who’s a newer client or depending on the circumstances, we might discuss if setting up an environment like that is something they have the appetite for. But typically it’s something like an SFTP transfer that our team will set up for them so they can securely get us data. And oftentimes use cases don’t require PII usually, it depends.
24:48 TW: Do you somehow use a data partner on there and know to actually do the scrubbing or the hashing to make sure that what they’re sending to you is compliant with whatever they need?
24:58 KS: Yes. We will help, if they need that sort of help. Typically, there’s somebody on their side who’s capable of doing that, so we’ll just set up that SFTP transfer for them, we’ll send them the information. But then on our side, we do almost always work within a VM. We call them stacks. The collaboratory has its own kind of set of hardware, so typically we’ll have a VM set up either per data scientist or per project. It’s pretty rare that I’m doing anything on my laptop, like RStudio Desktop. Usually if I do, it’s tinkering or simulated data just to make sure I can hand off something to somebody.
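The scrubbing or hashing step Tim asks about is not spelled out in the conversation. One common approach, shown here purely as an assumption and not as what the Collaboratory or its clients actually do, is to drop direct identifiers and replace join keys with a keyed hash before the file goes over SFTP. The function names, field names, and key below are hypothetical; a real secret would stay with the data owner and never be hard-coded.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 hash (HMAC) of an identifier, hex-encoded."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def scrub(rows, id_fields, drop_fields, secret_key):
    """Drop direct identifiers and replace join keys with stable tokens."""
    cleaned = []
    for row in rows:
        out = {}
        for col, val in row.items():
            if col in drop_fields:
                continue  # e.g. names are removed entirely
            out[col] = pseudonymize(val, secret_key) if col in id_fields else val
        cleaned.append(out)
    return cleaned


# Hypothetical key and records, for illustration only.
SECRET = b"example-key-held-by-the-data-owner"
records = [
    {"employee_id": "E1", "name": "Pat Example", "logins": "14"},
    {"employee_id": "E1", "name": "Pat Example", "logins": "3"},
    {"employee_id": "E2", "name": "Sam Example", "logins": "7"},
]

safe = scrub(records, id_fields={"employee_id"}, drop_fields={"name"}, secret_key=SECRET)
print(safe[0])
```

Because the same input and key always give the same token, records can still be linked across files without the receiving side seeing the raw identifier.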
25:35 TW: So, ball park, not to… I’m just curious. What’s the ratio of DevOps to data scientists? How many DevOps does it take to support? And maybe that’s not even… Maybe that’s a meaningless question.
25:48 KS: No, that’s a good question. Our DevOps guys work super hard. When I started which is, goodness, I guess two years ago now… When I started, there were basically three… There were two… our head of data science. And then there was our senior data scientist who was doing more of the scoping, bringing in projects. But basically there were three core developers, data science developers, one with a lot of software expertise. And then as we’ve grown, at this point, we have five to six data scientists, some are part-time transitioning from PhD programs. And at this point, DevOps, there are currently three full-time folks, but I think they might be expanding in the future as we expand. So, I’d say it’s… Even though it sounds like it’s not one-to-one, it’s close-ish, because some of our data scientists are part-time. So, I’d say it’s trailed back and forth in terms of being matched versus there being a few more data scientists.
26:50 TW: I feel like the data science, it’s not hard to stumble into where all of a sudden you’re trying to deal with AWS or Google Cloud Platform. And I’ve definitely lost hours I don’t want to get back where I was trying to read a step-by-step post to do something in the Cloud and thinking, “Man, I wish there was a DevOps person.” I just wonder what… I guess, if you’re in-house at a company, potentially you could say, “I’m doing kind of a proof of concept, let me get small enough data, scrubbed enough data within the company’s policies to put it somewhere.” But as soon as you have it locally on your laptop or your desktop, chances are somebody’s gonna get really, really squeamish. I guess maybe if the company just has the ability to set up VMs, you can say, “Look, that’s what I work off of.” I don’t know, that sounds like a blocker that can be unfortunate if you need that skill.
27:52 KS: They’re just different really. There might be a package in R that I have no problem getting on the OS on my Mac, this Linux-based Mac OS. But if I’m working in CentOS or Red Hat on a stack, it’s a different process for installing that package. I think there is a growing curve for us as data scientist practitioners to learn that, and then I think… I’d say hats off to the DevOps team for continually evolving these environments to make them easier and easier for us to work on. So, it takes, I think, on both ends… From the data scientist and the DevOps team, I feel like we both go out of our comfort zone a little bit to… We take on some maybe more DevOps-type skills than a data scientist would maybe at a huge company where they don’t touch that, and they probably know a little bit more about R than they expected to know. But overall, I’d say it’s not, it doesn’t… It’s never reached the point where we can’t do our work, by any means. And I think a lot of that has been good communication between the two teams in terms of what’s working and what needs to be improved.
28:58 MK: You mentioned that you do, you have a lot of people coming straight from PhDs to work with you. I’m curious to hear how do you… It sounds like you have incredible people skills and stakeholder management, but I assume that everyone starting wouldn’t walk in with all of that knowledge and experience. How do you, as a team, manage, I guess, development of that sort of stuff? Do you buddy people up or do you just let the newer staff learn on the fly?
29:27 KS: Yeah. It’s kind of evolved over time as we’ve grown. I’ve been there two years and when I started there were three of us. It probably wasn’t until the past… I’d say this past year is when we’ve grown from three to six. I think we all have a mix of different skill sets, and to this point, I think our team has been small enough such that we naturally partner and communicate on our different skill sets, what somebody’s strength is versus not. So, I’d say very much, there’s somebody on our team who I consider the technical sherpa from a statistics standpoint, like if somebody gets a new problem and they really wanna go in to building their own model from the equation or matrix level, there’s one to two guys who you consult with on that. There’s folks who own more of the client management, so that’s something that I’ve started to do a bit. And then we also have one of our data scientists who’s just an expert in software and all things internet, so he’s who you go to for those questions. But I think, as we grow, we’ll probably have to formalize that a bit. Right now, it just works, we just all talk to one another. But as we get larger, we might have to think through how that plays out in a more formal way.
30:40 TW: I guess I’ve got a question around the time horizon that companies should be using. If you’re a large company, then there’s a lot of things happening, sometimes it’s harder to turn a ship than a motorboat kind of a concept. But what kind of time horizon should companies realistically expect, in other words, in problems to work on with data science initiatives? In other words, there’s a lot of today problems. Our email marketing isn’t producing enough results, but there’s a lot of tomorrow problems like how we solve some of the talent gap that’s gonna happen in the next 20 years. Where do companies, how close to today can they work and how far to the future should they be looking and thinking? If I’ve got a bunch of data problems that I wanna bring, as an enterprise organization, what’s my horizon look like? Or what have you observed, I guess, is what I’m looking for?
31:36 KS: That’s a good question. You’re not asking basically how long should I expect a solution take to be built, but you’re more asking what solution do I focus on, the thing that’s gonna solve the fire that’s currently burning or gonna put out the fire in two years?
31:53 TW: You gotta look at it from both sides, right? There’s one that is like, “I’ve got these burning problems that I’d love to solve right now.” But there’s also a reality of, well, it takes a little bit longer than 30 minutes to build an appropriate model, test it, validate it, and actually give you back something that’s gonna drive value for that particular problem set. That’s kind of the… It’s the intersection. And so do we tell businesses, “Hey, listen, we wanna help you solve your major issue that you came to us about, but actually we need to look at it from this window or from this perspective,” because there’s this sense of expectation? And, yeah, you could go further out and say… But, I mean, most businesses are like, “We wanna do one-to-one personalized marketing tomorrow,” and it’s like, “Well, that doesn’t happen, that’s just not how it works. So, let’s talk about how we step-wise get you there.”
32:44 KS: I think you kinda answered it a bit, I guess, with that last piece. Right?
32:49 TW: Well, no, but I mean I was hoping you could be like… Here’s my perspective on things.
32:53 KS: Yeah. No, that’s totally how we do it though. I think this gets back… This is baked in to what I said about finding the right question. This, to me, is part of that pre-scoping process where you’re trying to figure out… Somebody might come to you with the question of, “Hey, we want to boost revenue from our ads 200%.” That’s not gonna help me from a data science perspective. I need to chip away at the core pieces that make that up, and one of them might be starting at the point of, “Well, how much revenue are you attributing to your ads today and where did this number even come from?” And let’s get better and sharper at measuring what credit you’re taking from your ads. And maybe sometimes even like what is the outcome you should be attributing to the success of your ads. Maybe it’s not revenue, maybe it’s people buying more units or people buying different products. I think that’s part of how we solve that, is in this finding the right question piece. And I think for the most part, folks come to us with, I hate to say it, but like high-value return questions. And those tend to be a mix of something we could be doing better today and also something we know that long-term is gonna continue to be a process, continue to be a question we’re answering. And we might have something in place to address it now, but we feel like it could be better because of X, Y and Z.
34:13 MH: Yeah. And I was just curious how often do you feel like you have to temper expectations as opposed to in that regard as well. ‘Cause sometimes that’s a piece of it, is people come in, they’re like, “Here’s all my data problems, go make the magic data science happen,” and then you have to start rebuilding people’s concepts of what’s possible.
34:36 KS: Yeah, that definitely happens quite a bit. I’d say it’s still probably the minority. I think most folks understand this is a process, but I definitely encounter that. And I think usually, when you walk through a concrete example of what this is gonna look like, usually folks are receptive to that. But sometimes it is the case that there’s a solution that it’s gonna be really quick and everything’s gonna just solve the world’s problems. But I’d say most often, if it’s a complicated problem, most folks on the other side of the table recognize, like, “Hey, we’re not perfect at this. This must be a challenging problem to address.” But, yeah, that definitely does happen and I think that’s where you gotta pare back the possibility of what you’re trying to answer.
35:19 MH: That’s cool. And then, I guess, another question that’s been brewing… Sorry, it’ll take two at a time, but Tim’s talked a lot… [laughter] Obviously, my experience is more on the marketing side of things, digital analytics. And so in that world, we’re really dealing with two kinds of data that’s typically descriptive data, and then behavioral or transactional data. And I’m curious, do you run into other types of data in other areas, in other venues? I’m just not familiar enough with other practice of analytics. So, descriptive data, the stuff you get from third party places, just like a soccer mom who drives an Audi TT, and then the behavioral data is like, “They bought from us three times last year,” or, “They visited our mobile website last Monday,” that kind of stuff. ‘Cause you go into other areas of analytics, I’m just curious, are there other areas of data that people should be aware of?
36:19 KS: Yes. Yes, there’s a world of other data. I’d say actually though, what’s interesting though is seemingly… And I can give you examples, but seemingly unrelated data often from a statistic standpoint can have a similar approach be applied to solve a similar type of question, like classification, for example. So, what’s an informative predictor could look totally different, but I still could have the same model be the best solution for the problem. I’d say data can look all sorts of ways different than that. A lot of data, I think, is more on internal processes. In the finance space, it might be repeated measure time series data, so how much do we expect, how much do we spend on electricity every month or maybe, I don’t know, what does revenue look like month-to-month or day-to-day? What’s it gonna look like in the future? So, that’s more sales-y finance data, which I think everybody can wrap their head around.
37:12 KS: In the IT space, I think that’s the data that, before I started this job, I would have had no idea what that looked like. So, there’s a lot of data generated in bigger companies around things like IT Service Desk. Like if you work at Google or wherever, I don’t know, a big enterprise, and your computer crashes, you might submit a ticket to the help desk. So, that help desk typically has some way of aggregating and keeping track of all of the issues that are submitted at a tiering level for how critical they are. And then there’s often some process for trying to tie these back to changes that might have been put in place. So, that’s some IT data that we’ve come across, and there’s actually text in there as well, just like there would be in a customer feedback survey. The methods you might apply to it might look similar, the data might be more or less useful.
38:03 KS: I’m trying to think of what else is of interest here. From a cyber perspective, and I think this is why, for cyber analytics, a lot of folks will turn to us for our expertise, so data there might be something like NetFlow logs, folks might have logs from Palo Alto, basically things traveling across their network, so different IP packets going places. That data, I’d say of all our datasets, is the one that really requires a serious domain expertise, and I think that’s why folks might turn to us more so on that. So, some of these companies have very advanced analysts, but to really analyze that data, you really need to understand what it is. It’s not gonna make sense out of the gate.
38:47 KS: So, that’s where I think we’ve invested a lot of energy in pairing some of our most talented data scientists with some of our most talented attack and penetration analysts, who basically work for these companies to find their vulnerabilities and in a safe environment say, “Hey, here’s where I could see a back door for somebody to get into your network.” So, their knowledge combined with an analyst’s knowledge is really where you might need to go to get use of data that doesn’t make as much sense as sales data from a fashion company.
39:21 MH: Right. That’s awesome. So, outside the ones where you guys have the internal domain knowledge on cases where you’ve got a data set where… Say it’s a six to eight-week project, are you typically in regular couple of times a week communication ad hoc with the subject matter expertise? How are you rolling along to say what you’re finding makes sense? All of a sudden you wanna use this field, it seems like it’s really useful, but Spidey sense says “better check.” Are you usually paired with… Do you have partners on the inside at the client who are accessible, readily accessible for that sort of interaction?
40:06 KS: Yeah. That’s part of the requirement of starting the project and something we do in the scoping phase. In the scoping phase, before we even start that roughly six to 10-week clock, we have already identified somebody who knows the data. We’ve already gotten to the point where we know what the fields are. Now, something like cybersecurity, that’s definitely been a different type of investment and there’s been a lot of work that was not captured in that six to 10-week cycle, learning what something like NetFlow data even looks like and how to interpret it. For some data sets, certainly.
40:40 TW: But if it’s call center data, and there’s 47 fields…
40:44 KS: Yes. I need somebody on the other end of the table who’s gonna tell me what those are.
40:48 TW: And you’ll be able to understand it as long as they say, “Yeah, we do use this and it’s not… ” Okay.
40:52 KS: Right. Yeah. And there’s definitely fields in there that you’re like, “Oh, this looks good.” And they’re like, “Yeah, that field’s loaded.”
40:56 MH: Wait a second. Are you trying to tell me that not all of your clients are maintaining up-to-date to the minute data dictionaries for all the data sources?
41:05 KS: It’s a really tough business.
41:07 MH: I am astounded. I am astounded.
41:11 KS: Yeah. I mean, there’s a lot of data generated at a lot of places. And often there might be a data dictionary, but it doesn’t really make sense to an outsider. So, that’s definitely one thing we run into. And I’d say it just kind of varies. Often there’s somebody who knows the data, but you have to find them and you might have to have some conversations with them.
41:34 TW: I feel like, in the web analytics side, it’s not that uncommon now for there to be people who are really into the implementation and the data collection, and they’re collecting super rich sets of data. And in their minds, they’re saying, “These are the cool things that could be done with that data.” I’m not gonna do it, the analyst will probably do it. I kinda wonder, do you run into sometimes the data experts, the people who maybe live and breathe the system and the process that’s feeding that data, and they’re just hungry to say, “I’ve been collecting this crap for a year or this awesome gold mine for years, and we’ve never had anybody who can dig in. These are the 27 things you should be able to do with it.”
42:13 KS: Yeah. I’d say that’s kind of my gem of somebody on the other side of the table. And you do find that sometimes and it’s awesome, ’cause there’s somebody who intimately knows this data. They’ve been collecting it and curating it for years, and they just, for whatever reason, kind of business as usual task, just have not gotten around to mining the value out of it. So, that is definitely your ideal person on the other end of the table.
42:38 MK: What about the opposite though? Sometimes you have, I don’t know… Yeah. At my old work we had a partner who ran a model for us [42:48] ____ model, and they came to present and the findings were so enlightening and amazing, and I was like, “Yeah, we kinda already knew these three things were really important to high value customers,” that’s pretty obvious. We were very nice about it. But how would you manage that situation where you go back and they’re like, “Yeah, we already knew this, nothing special.” Or do you think that’s part of the process when you’re briefing in and scoping the project that you should be picking that sort of stuff up anyway?
43:19 KS: Good question. I’d say that’s definitely… I’ve been in that position before and I think typically when it’s happened, if I look back, there’s something differently that could have been done in the scoping process. And sometimes it might not be… It might be a two-way street, but typically there is something that I should have dug deeper on that I missed the boat on, for whatever reason. They wanted to move quicker or I felt like I had the right stuff, and I actually didn’t. That definitely can happen. I think your scoping phase is the best way to guard against it. And that’s the way I think the best… I really focus on knowing the end user and the end audience, because it’s happened before where I did have all the right conversations and asked the right questions, but it was just to the wrong people. I didn’t realize half of the other people were gonna be looking at this. So, that is definitely something I’ve learned along the way. It’s really getting the full landscape of who is gonna be using what you build and talking to them, taking the time to talk to them.
44:21 TW: That’s a great insight. All right. We’ve been chugging along, and this has been excellent. Thank you so much. Katie is awesome. One of the things, as we start to wrap up, that we love to do on the show is called “the last call,” and we like to go around, just share anything that we’ve seen recently that we think might be of interest to our listeners. Katie, you’re our guest, do you wanna share a last call?
44:47 KS: Yeah. Actually, just today, my co-worker and I were looking at this new blog post on text data, which we just talked about, and it’s by Julia Silge, I’m sure I butchered her last name, J-U-L-I-A S-I-L-G-E. And it’s about a package in R, and basically she is one of the bloggers that helped write the tidy text mining package in R. She’s got some great resources for mining text in R, and she focuses on the Jane Austen novels for… It’s a great resource for cleaning texts and doing topic modeling, so you could get up and running pretty quickly. And the recent blog post is on basically applying a modeling framework, [45:28] ____ to basically try to distinguish some of the words within your model that did the best job at distinguishing between two classes. So, this could be great for refining a model, if you’re trying to figure out, “How do I get more out of my predictors?” That was a super technical resource I just gave you, but…
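To make the idea of "words that did the best job at distinguishing between two classes" concrete, here is a minimal sketch in Python. It is not the method from the blog post (the framework named in the episode is inaudible in the transcript); it swaps in a much simpler stand-in, a smoothed log-odds ratio of word frequencies. The function name `distinguishing_words` and the sample texts are hypothetical, made up for illustration.

```python
import math
from collections import Counter


def distinguishing_words(docs_a, docs_b, top_n=3):
    """Rank words by smoothed log-odds of appearing in class A versus class B.

    Returns (words most typical of A, words most typical of B).
    Illustrative stand-in only; not the approach from the blog post.
    """
    counts_a = Counter(w for d in docs_a for w in d.lower().split())
    counts_b = Counter(w for d in docs_b for w in d.lower().split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    v = len(vocab)

    scores = {}
    for w in vocab:
        # Add-one smoothing so words unseen in one class do not divide by zero.
        p_a = (counts_a[w] + 1) / (total_a + v)
        p_b = (counts_b[w] + 1) / (total_b + v)
        scores[w] = math.log(p_a / p_b)

    # Sort by score, breaking ties alphabetically for a stable order.
    ranked = sorted(scores, key=lambda w: (scores[w], w))
    return ranked[-top_n:][::-1], ranked[:top_n]


# Hypothetical sample data: help desk tickets versus customer survey comments.
tickets = ["laptop crashed again", "laptop will not boot", "password reset for laptop"]
surveys = ["love the product", "product arrived late", "great product great price"]

top_tickets, top_surveys = distinguishing_words(tickets, surveys, top_n=1)
print(top_tickets, top_surveys)
```

In practice a regularized classifier would do this job better, since it weighs words jointly rather than one at a time; the sketch only shows the kind of output the discussion is describing.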
45:49 TW: No, that’s awesome.
45:50 MH: I remember seeing that, thinking, “I knew all this stuff.”
45:54 MK: Yeah.
45:54 MH: If I had a nickel for every time I wanted to refine my text so I could do better predictors…
46:03 TW: I don’t know. I just pretty much throw everything in a convolutional neural net, and I figure I’m good to go. Did I say that right?
46:09 KS: Yeah, I think so. I think so. English is really hard though, so who knows?
46:13 MH: Okay, Mr. CRISP… What was it, CRISP-DM?
46:17 TW: CRISP-DM?
46:18 MH: I thought that’s when you have a snappy reply on Twitter, you just send to one person.
46:22 TW: You send a crisp DM?
46:23 MH: Yeah.
46:24 TW: Oh, wow.
46:25 MH: All right. Since I’m making fun of you, Tim, do you wanna do a last call?
46:29 TW: So, I’m gonna do… First, I’m gonna call out a last call from Episode 95. For anybody who thinks, “I feel like I’ve heard of this Katie Sasso person before,” I will note that I used her presentation at the useR Conference as my last call for Episode 95. It’s still a really cool video on “Shiny meets Electron: Turn your Shiny app into a standalone desktop app in no time,” but I’m not gonna count that ’cause that would be a duplicate. I almost did it and said, “Wait a minute, I think I might have done that one before.” So, instead, I am going to go with the… There’s a video on YouTube… Michael, you’ll like it.
47:09 MH: Wow, I can’t believe it.
47:12 TW: That our China web optimization on Slack, the mystery person… I don’t know that we know who that is, it’s called an “Ad fraud deep dive: What is the true impact of digital ad fraud?” by Greg Miaskiewicz. It’s pretty interesting because we know ad fraud happens, but it actually does a great job, I think, of… I learned a number of things about some of the mechanics of it, both what drives it, the different reasons that ad fraud occurs, the obvious reasons in hindsight why it occurs, but then also some of the techniques, and then what is the arms race to try to combat it and trying to stay one step ahead. Pretty interesting talking about detecting mouse movements that are too straight and that sort of stuff. It’s an interesting video.
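The "mouse movements that are too straight" signal can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the technique from the video: the function names `straightness` and `looks_scripted` and the 0.99 threshold are hypothetical, and a real detector would combine many more signals.

```python
import math


def straightness(points):
    """Ratio of straight-line distance to total path length.

    A value of 1.0 means the cursor travelled in a perfectly straight line;
    human movement usually wobbles and scores lower.
    """
    if len(points) < 2:
        return 0.0
    path = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    if path == 0:
        return 0.0
    return math.dist(points[0], points[-1]) / path


def looks_scripted(points, threshold=0.99):
    """Flag a path as suspicious when it is almost perfectly straight.

    The threshold is an assumed value chosen for illustration only.
    """
    return straightness(points) >= threshold


# Hypothetical cursor samples as (x, y) pixel coordinates.
straight_path = [(0, 0), (1, 1), (2, 2)]
wobbly_path = [(0, 0), (1, 2), (2, -1), (3, 1)]

print(looks_scripted(straight_path), looks_scripted(wobbly_path))
```

A single straight segment is weak evidence on its own, which fits the arms race described in the episode: once one check like this is known, fraudulent traffic can add jitter to get around it.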
48:01 MH: Excellent. All right. Moe, what have you got?
48:03 MK: Okay. Well, as we established in the last episode, Pawel Kapuscinski always seems to be holding my hand as I learn something new. And recently, he sent me… It’s by DataQuest, it’s a blog, but he sent me a specific one on how to get started with Jupyter Notebook. And so if you haven’t had the joy of setting it up, which can be an absolute bitch, and this tutorial… [chuckle] I was complaining about it the other day to Tim. This tutorial walks you through it, and thanks to Pawel for sharing ’cause it’s… Yeah, that’s easy to follow.
48:36 TW: Are you doing R and Python in Jupyter, or are you just doing Python stuff?
48:40 MK: I just like the R interface. I don’t want to go… I try to move over to Jupyter Notebook but I really…
48:47 TW: The RStudio interface?
48:49 MK: Yeah.
48:49 KS: I used Markdown in R.
48:52 TW: R notebooks are Markdown.
48:55 MK: Yeah. I probably will get there at some point, but I’m not that motivated right now, too, so we’ll see.
49:02 MH: Oh, right.
49:02 TW: What have you got, Michael?
49:05 MH: Well, funny you should ask. And actually what’s funny is I think I’m pretty sure I learned about this from a tweet from Pawel. So, Moe, you and I both got one from Pawel, interestingly enough. It’s still going on. The National Football League announced an inaugural analytics contest this year, called the Big Data Bowl, where you can set up a team and you can compete with data that they’ll provide, and then the top eight entrants will be invited to the Combine to present their analysis to NFL executives, and players, and owners. Really interesting community sourcing of analytics talent in sports, which is a massively growing analytics field. As a lifelong Cleveland Browns fan, I have mixed feelings about analytics in the NFL, given that the Cleveland Browns pursued an analytic strategy and then produced one win in two seasons directly following that. Obviously, there’s a ton of opportunity to do analytics better, so you still have about a week left. Entries have to be submitted before the end of the day on the 22nd of January. So, if you’re interested in that, go check it out. It’s pretty interesting. I’ll be looking forward to seeing what the results of those are.
50:21 MH: All right. I’m sure you’ve been listening, and I’m sure, as you’ve listened, you found Tim’s comments both interesting and/or asinine… No, I’m just kidding.
50:32 TW: Mostly the latter.
50:34 MH: No. But I’ll probably raise some questions and maybe even have some comments or things like that, and we’d love to hear from you. And I’m sure Katie wouldn’t mind hearing from you as well, so feel free to look any of us up. You could easily find us on the Measure Slack, or on Twitter, or on our website. We’re delighted to hear from you, pass along any questions to Katie as well. Oh, Katie, do you have a Twitter? You wanna share your Twitter account really quick?
51:00 KS: I do. I’m not super active on Twitter, but…
51:04 TW: Okay. Don’t expect a fast response if you tweet @KatieSasso.
51:06 KS: Yeah, yeah. I’m sorry. I tried to get better after useR, and I just didn’t… I’m just not great… I don’t even know what my Twitter handle is right now.
51:16 TW: It’s @KatieSasso.
51:18 MH: Tim knows.
51:18 KS: Okay, great. Thanks.
51:20 MH: And then, obviously, just a natural question probably from our listeners, any relation to Will Sasso, the comedian and actor?
51:28 KS: No, I don’t believe so. Yes.
51:31 MH: That’s fine. It would be cool if you were like his little sister or something.
51:33 KS: That would be awesome, yeah. Not so much…
51:36 TW: No. Just making sure, checking all the tracks. Listen, Katie, we’ve never had Will on the show, but we have had you on the show and it’s been a pleasure. Thank you so much.
51:46 KS: Thank you guys for having me. It’s been great.
51:47 TW: Yeah.
51:48 MH: Yeah. It’s been awesome. And I’m sure I speak for my two co-hosts, Tim and Moe, when I tell all of you out there, keep raising the bar in analysis this year.
52:03 MK: Thanks for listening, and don’t forget to join the conversation on Facebook, Twitter or Measure Slack group. We welcome your comments and questions, visit us on the web at analyticshour.io, facebook.com/analyticshour, or @AnalyticsHour on Twitter.
52:22 S?: Smart guys who want to fit in, so they’ve made up a term called “analytic.” Analytics don’t work.
52:30 S?: Analytics. Oh, my God, what the fuck does that even mean?
52:37 MH: Oh, wait, actually, one quick thing for… Can you guys see behind me what my daughter…
52:44 TW: Yeah.
52:44 MK: Oh, wow.
52:46 TW: What on earth is that?
52:47 MH: It is the “I’ve been Matt Gershoff-ed, [52:50] ____ print, vinyl.
52:54 TW: It’s like the patron saint of Faustian bargain personalization… No. He’s a really smart guy.
53:05 TW: All right. Are we ready? Mostly asking for Moe’s benefit.
53:10 MK: Sure.
53:11 TW: Do a couple of vocal warm-ups.
53:13 MK: What could possibly go wrong?
53:15 TW: Mi, ma, mi, ma, mi, ma, mo.
53:18 TW: The problem is probably different, right, Moe? ‘Cause if I wake up late, then I just sound like some kind of DJ with super low voice. [laughter] I don’t know how it’ll work for you, so let’s see.
53:32 MK: Hurray. Okay.
53:35 KS: I kinda forget what the question was.
53:37 MH: Yeah, I’ll ask the question again and then you can… Then we’ll just take that whole thing out.
53:43 MK: Thanks, darling. I’ll start again.
53:47 KS: Hats off to kind of the…
53:49 MH: Maybe start back to “hats off to.”
53:52 KS: Yeah.
53:54 MH: ‘Cause could you imagine how boring it would be for Tim to just ask Katie questions for 45 minutes? [laughter] I bet her answer to them, so that made it not boring. Right?
54:06 MK: That’s an improvement. Right?
54:08 TW: Oh, wow.
54:08 TW: Rock flag and convolutional neural networks.
54:17 MK: Love it. Jesus, that’s scary.