Regression. Correlation. Normality. t-tests. Falsities of both the positive and negative varieties. How do these terms and techniques play nicely with digital analytics data? Are they the schoolyard bullies wielded by data scientists, destined to simply run by and kick sand in the faces of our sessions, conversion rates, and revenues per visit? Or, are they actually kind-hearted upperclassmen who are ready and willing to let us into their world? That’s the topic of this show (albeit without the awkward and forced metaphors). Matt Policastro from Clearhead joined the gang to talk — in as practical terms as possible — about bridging the gap between traditional digital analytics data and the wonderful world of statistics.
Items Referenced on the Show
- Adobe Target
- Platform Named After a Famous Literary Figure’s Sidekick
- Danger, Will Robinson
- Google BigQuery
- Adobe Analytics Data Feed
- p-hacking with fivethirtyeight.com
- @ourmagickfuture (Magick at Scale)
- (A New!) Sample Size Calculator
- #025: A/B Testing with Kelly Wortham
- Search Discovery Sample Size Calculator
- (Companion blog post) Sample Size Calculation – Myth Busting Edition
- R Shiny
- The UX Guide to Getting Consent
- #077: Lions and GDPR and Bears, Oh My! with Jodi Daniels
- Jeff Bezos’s 2018 Letter to Shareholders
- Marketing Evolution Experience
00:04 Speaker 1: Welcome to the Digital Analytics Power Hour. Tim, Michael, Moe and the occasional guest discussing Digital Analytics issues of the day. Find them on Facebook, at facebook.com/analyticshour, and their website, analyticshour.io. And now the Digital Analytics Power Hour.
00:27 Michael Helbling: Hi, everyone. Welcome to the Digital Analytics Power Hour. This is Episode 89. Have you ever been confused by a two-tailed ANOVA z-score with an extra shot of espresso and skim milk? We’re all hyped up on the R and the Python pour over, but are we forgetting the science in Data Science? Statistics. Moe Kiss, my co-host. How do you feel about lies and damned lies?
00:57 Moe Kiss: Well, a bit wonky to be honest, but plugging along.
01:01 MH: All right, and Tim Wilson, hello to you.
01:04 Tim Wilson: A standard deviation to you, too.
01:07 MH: Oh wow, thank you. And I, Michael Helbling. But we needed to make this trio a quartet because more sample size or something. But in truth, someone who spends all their time thinking in this vein, let me introduce you to our guest, Matt Policastro. He has said that the following is not a gross misrepresentation of his background. He’s a Data Scientist at Clearhead. It’s an experience optimization agency, it’s part of Accenture Interactive. Matt and Tim first met at MeasureCamp, Cincinnati a year ago or so, and by the middle of that afternoon they’d actually joined forces to run a session together where Matt was showing some linear models in R on some Google Analytics data while Tim was asking questions about what he was doing and how to interpret the results. And Matt at that time was wrapping up a Master’s Degree in Business Analytics at the University of Cincinnati. It was a program he joined after working for several years as Analytics Specialist at Global Cloud. Matt, welcome to the show.
02:10 Matt Policastro: Thank you for having me. I don’t have a fun stats response, so I’m sorry about that.
02:15 MH: That’s okay. Your response…
02:19 MP: You guys gave me a lot of time to think about that and I just…
02:22 MH: Yeah, and nor did I have a witty stats comeback to your lack of stats response.
02:29 MK: Don’t succumb to the pressure. It’s fine.
02:31 MH: Yeah, it’s highly correlated with it being a good show.
02:34 MP: Nicely done.
02:36 MH: All right. I think off the top, Matt, let’s talk a little bit about what you do day-to-day as a data scientist, and how statistics enters into it ’cause I think that’s a great place to start and also helps our audience understand you a little better as well.
02:49 MP: Yeah, so the folks I work with at Clearhead, we are doing a lot of onsite optimization, not so much online marketing piece. Just really looking at, “Hey, how do we modify things on the website for lots of our great little clients that we have. How do we work with them? How do we modify things on the site to help people get to whatever they’re trying to do”, be that getting to signing up for a membership, doing a subscription conversion, buying whatever new pair of shoes that they wanna do. Really whatever we need to do to get there. But then as one of the data scientists that we have on the team, my job is to come in and not only just be doing our sort of day-to-day, “Hey, we saw that there’s a difference between the A group and B group in this A/B test. We wanna determine whether there’s actual statistical significance there.” But also doing some more complicated things like “Hey, let’s look at the composition of an order and whether there is a significant effect on the amount of sale items in that order”, things like that. Just going a little bit further of really diving deep in the, say, data warehouse or BigQuery, getting more of the data out of there and not just doing the “Hey, we have two different percentages. Looks good, let’s go for it.”
03:55 TW: Do you 100% of the time not use whatever the results spit out by whatever the testing platform are? Or do you use that as kind of a, “Hey, that’s maybe a starting point, but now we’re gonna move into other stuff”?
04:08 MP: Yeah, I, maybe have a little bit of a… Well, fairly earned reputation of not looking at the testing tool ever. I think, I actually had a client call when I said, “I tried to confirm what you’re saying, and I honestly don’t know how to use the platform, so if you could have one of your people walk me through it, that’d be great.”
04:24 MP: Most of the time we are just working off of what’s on the Analytics Tool that we use as our canonical source of truth, just ’cause more often… Not that it’s anything out of ill will on the part of the vendors, but it’s just they approach things in their way and we’re fairly opinionated and we have our own methods so we tend to stick to those.
04:43 TW: So I’ve got to ask. Has that ever happened that a client or even an analyst inside your organization says, “I’m looking in Target or Optimizely, and it’s saying X” and you’re drawing either the opposite conclusion or you’re saying, “Wait, there’s too much uncertainty there.” Does that happen?
05:04 MP: Yeah, it happens, and I think it’s a lot of, again, how you measure the things that sort of adds up. If it is a… There’s, let’s say I don’t know, one and a half units per order, and that’s what the testing tool’s going off of; when we’re looking at very different success criteria, then there’s total disagreement of which variant is the better one. And that’s why we just wanna broaden out to, “Hey, holistically, what is the data telling us about the customer journey overall as opposed to the testing tools?” which again, I don’t think it’s anything that’s unfair on their part. It’s just they tend to say, “Hey, you set this as the primary metric inside this test. That’s what we think is making this one a winner” when there may be downstream effects that affect that. So, we get contradictions all the time about who’s winning or what’s actually happening and then that’s where analysts get to come in and do their jobs and sound all smart and clever and say, “Hey, the testing tool is telling you this one’s a winner. But as the analyst working on this, this is why this is actually valuable to you and why you might wanna think about this a little bit differently.”
06:01 TW: But I guess that’s I just… I feel like that’s where the tools are trying to say, we’ll just do all this in our magic black box for you, which is fine if we understand what it is. I mean that’s one of the 27 reasons that I’m kind of trying to pursue this path is I feel like even though it makes things less black and white, the reality is if we understood the data better we’d realize it’s never black and white. And it would be awesome if we actually inherently understood that, “Hey, they’re always doing a two-tailed,” versus, sometimes, maybe you wanna look at a one-tailed. But that’s the imperative where the platforms are trying to make it simple when sometimes, it really just shouldn’t be thought of as being simple.
06:50 MH: This show is already getting off on the wrong foot. We’re calling in to question all of the vendors, we’re gonna get so many emails and tweets…
07:00 TW: No, we’re not.
07:00 MP: I was gonna say that… I think it was [07:01] ____ test in Target really good white paper about how they do their significance calculations. I’m pretty sure that’s who it was. I was looking in those and it’s like, “This is all fantastic, but it’s also not a one size fits all scenario.” You want to have the right test in place, not just the A/B test, but the statistical test for what you’re trying to look at and determine whether that’s actually the thing that you want at that moment. And that’s why… Obviously, this is grossly self serving, but that’s why you want people working on these problems.
07:30 MH: There you go. And that’s a really good point because I think there is so much finesse, right? Nuance to this. There’s not certainty with, especially in this area. Or you’re not even supposed to have it probably, I guess. [07:48] ____ So let’s do this as we’re closer to the top of the show. Let’s just run through kind of a list of things that we hear a lot when we’re talking about statistics. And just do a good little definition together of those. Because I think what that’ll do is then, as they come up later in the show people have sort of a… If someone’s not as into statistics and a lot of our listeners are so this will be a review. But I don’t know what are… A t-test. What’s an example of a good definition or simple definition of that?
08:23 MP: Yeah, I mean the t-test probably, to be super reductive about it, and being a data scientist that gets me my old day job, is just taking all the nuance and boiling it out, so people don’t have to worry about it too much. But t-test is just looking at, “Hey, there’s an average of two different groups. Let’s say you have two different models of car and they have an average mileage on miles per gallon on each of them. Are they actually different or is it the one that has a different catalytic converter? Or something of that nature?” Is it just a, “Hey, they look different, but they’re not actually that different overall”? So that would just be your basic t-test. Is just saying… Averages between groups, whether that’d be a normally distributed thing, or there’s variations on that for non-normally distributed variables, which you do run into from time to time on the web.
09:08 TW: But an A/B test, in many cases, a t-test is what you’re doing. Group A is the A variation and group B is the B? Or no?
09:17 MP: Yeah, more often than not, it’s just, like I said, there’s a lot of approximations and elaborations you can do on those kinds of things. You know you can’t do an average of individuals or… No, that’s a terrible way of… Yeah, you totally can do that. You can’t say, “What’s the AOV of an individual who ordered something from our site?” And it’s like, “No, you can’t do the average order value of a single value.” And that’s when you start saying, “Oh, well what if we break this out by day and then look at the average order value? Can that be… ” You can then start t-testing that. And I’m kinda jumping the gun on some of this, but there’s ways to introduce variability and to conform things into the traditional statistical testing model so that you do get to do a basic t-test and say, “Hey, this looks like it’s normally distributed. We can do the t-test.” Which again, to be fairly reductive is, if you have two bell curves, two normal curves that are sitting on top of each other, and one has an average that’s slightly higher than the other, how much overlap is there on those bell curves based on the variability inside of them? And then, is that enough to say that they’re substantively different or is there too much overlap basically?
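The "overlapping bell curves" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily average order values are simulated, the names `welch_t`, `aov_a`, `aov_b`, and `aov_c` are hypothetical, and the p-value uses a normal approximation to the t distribution, which is reasonable only for fairly large samples.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    # Standard error of the difference, allowing unequal variances (Welch).
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(b) - mean(a)) / se
    # Assumption: normal approximation to the t distribution (large samples).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

rng = random.Random(42)
# Hypothetical daily average order values for three variants (made-up numbers).
aov_a = [rng.gauss(50.0, 5.0) for _ in range(200)]
aov_b = [rng.gauss(50.5, 5.0) for _ in range(200)]  # small gap, heavy overlap
aov_c = [rng.gauss(55.0, 5.0) for _ in range(200)]  # large gap, little overlap

t_ab, p_ab = welch_t(aov_a, aov_b)
t_ac, p_ac = welch_t(aov_a, aov_c)
print("A vs B:", round(t_ab, 2), round(p_ab, 4))
print("A vs C:", round(t_ac, 2), round(p_ac, 4))
```

The statistic is the gap between the two averages divided by the variability, so the same gap looks more or less convincing depending on how wide the curves are.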
10:20 MK: That was great. I literally had a visual image in my mind as you were talking through that. So what about the terms, one-tailed and two-tailed? Can you explain a little bit about the difference between those?
10:30 MP: Yeah, one-tailed and two-tailed would just be, one-tailed is just looking at one side of the distribution and saying, “Hey, is the average of one of these higher than the other? And is it gonna be higher? Or is it gonna be lower?” Two-tailed is just making a two-way comparison and saying, “Hey, is it somewhere in between? Or is it either greater than or lower than?” Which is nice ’cause in some situations, you just wanna say, “Hey, is this group actually different from the other?” That lets you be a little bit more sensitive to the change that’s actually happening and be a little bit more confident or reduce some of that uncertainty. Whereas with the two-tailed, they can just be, “Hey, is there something here that’s happening?”
11:03 MK: And so with vendor tools… Is there one or the other that they tend to use? Just out of curiosity. I’m not sure if any of the other guys know either.
11:12 MP: I don’t know if I could totally pull that off the top of my head. I know some folks have been better about disclosing those than others. But it’s also, that’s exactly why you would wanna have a little bit more nuance when you’re looking at that, because you don’t want to assume, “Oh, we’re just looking for the basic difference here as opposed to actually understanding.” Yeah. If you’re looking at, say, average order value for customers, just saying, “Oh, they’re not the same as each other,” is not nearly as satisfying when you go back to the client or the stakeholder, as saying, “No, the variation, you know Variation B actually had higher order value. Not just that it was different from what happened inside of A.”
11:48 TW: Well isn’t there… There is a subtle, I think, unless I’m not understanding it correctly that… Say you’re looking at just conversion rate and you’re looking at A versus B. And you’d say, “Look, I see that B had a higher conversion rate.” And if I say 95% confidence, if I don’t know if it’s one-tailed or two-tailed, when we’ve got the old, say we’re going with a p-value of .05, if you have it set as a two-tailed, you’re sticking basically a 2.5% on either end of that tail. Whereas if it’s one-tailed, you’ve got all of that 5% on one end. You may look at it and say, “Well, I’m only gonna call it a winner if it goes up. It doesn’t matter if I’m one-tailed or two-tailed.” But how you actually set it up will actually change what the threshold is. Is that right?
12:36 MP: Yeah. If you’ve got a real buzzer beater of a test where it’s right up on that line, and you want to be absolutely confident one way or another, it totally will make a difference of where you’re setting that threshold. Because you are breaking up your error of, it could be erring in either direction outside of the window that we’re comfortable with. Whereas the one-tailed approach is gonna be a little bit more sensitive and a little bit more reliable in saying, “No this is actually higher than the other thing that we’re talking about.”
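The threshold shift Tim describes (2.5% in each tail versus 5% in one tail) can be checked with the standard normal distribution. A minimal sketch, assuming a z statistic and a 0.05 threshold; the value `z = 1.8` is a made-up "buzzer beater" and `p_values` is a hypothetical helper name.

```python
from statistics import NormalDist

def p_values(z):
    """One-tailed (upper) and two-tailed p-values for a z statistic."""
    upper = 1 - NormalDist().cdf(z)
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return upper, two_sided

alpha = 0.05
# Critical z values: one tail holds all 5%, two tails hold 2.5% each.
z_one = NormalDist().inv_cdf(1 - alpha)      # about 1.645
z_two = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.960

z = 1.8  # hypothetical test statistic sitting between the two thresholds
one_p, two_p = p_values(z)
print("critical z:", round(z_one, 3), round(z_two, 3))
print("p-values:", round(one_p, 4), round(two_p, 4))
print("significant?", one_p < alpha, two_p < alpha)
```

A result that lands between the two critical values is called a winner under the one-tailed setup and not under the two-tailed one, which is why it helps to know which one a tool is using.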
13:04 TW: So can we hit the whole assumptions of normality? I know we were heading down one path, but now you’ve got me kind of thinking, “Is everybody who’s listening is now thinking of the times they’ve seen normal distributions?” On the one hand, I feel like we assume normality. Or how do we know… If all we have is aggregated data, is there any chance of us knowing whether it’s normally distributed or not? How big of a risk is it? Do we just have to make that assumption sometime?
13:32 MP: It’s an unfortunate thing ’cause you want, at least in my corner of the world, and there’s always going to be a lot of disagreement about this and I never want to be the person coming out and saying, “This is clearly the way to do it, any other way is completely wrong.”
13:48 TW: I don’t think anybody with a statistics background would ever say and make any absolute statement. They could be 3 feet deep in water and would say, “I’m not going to guarantee that we’re not in a drought anymore.”
14:00 MP: We’re a lot of fun to have around, I’ll say that much.
14:05 MP: The normal distribution is fantastic. It does a lot of great work for us. It’s a solid performer and can always kinda carry you across the finish line, but it does not fit in every scenario. Say looking in revenue per visitor, I think is probably one of the most infamous examples of this doesn’t adhere to any distribution. It is complete nonsense. And that’s what we would call nonparametric of, it doesn’t obey any of the rules, it doesn’t have regular parameters that we can sort of approximate to help us to understanding. And then we can just do… We have nonparametric tests which throw all those assumptions out the window, but they behave differently. It’s not quite as tidy, harder to communicate to people. So there are plenty of things which do adhere to normal distributions: The number of visits in a day or in a week that someone might make, the number of items in their order, order contents, things like that. But…
14:52 TW: You’re like, “Of course revenue per visitor doesn’t have… ” I totally know the answer, I’m just asking for a friend or for our listeners.
15:00 MP: Oh yeah, from our long discussions, I know that you know this by heart.
15:08 MP: Forgive me for… Call me out on that stuff ’cause those are total things which I don’t even think about at this point. But RPV, when you think about it, you have a lot of cases where there are going to be values of zero where you have a visitor coming to your site, they never order anything. And they never bought anything. And yet that is probably a majority of your customers or visitors coming to your site who don’t end up ordering anything at all while you’re running a test, or while you’re trying to do an analysis. And then also saying, but the average is somewhere up around, let’s say $50. But then there are all these people down at zero. But that’s still a normal distribution, right? And no, not at all. You get these weird peaks and valleys in terms of where people are kinda clustering around their order values, or how much revenue they’re providing. And that’s one of those cases where yeah you can squint and get it to conform into a distribution. But more often than not that’s very much the level of, “Oh, are we just bending the math to this to make it fit with how we want to think about this?” As opposed to what we’ve actually observed is happening. And yeah, that’s where the nonparametric stuff comes in, it’s more technical, but it does help you answer that with a little bit more confidence. And lets you say that without sweating and crossing your fingers and saying, “Haha, this is, yeah that’s… Yep, okay.”
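One way to see the revenue-per-visitor problem is to simulate it and compare groups without assuming any distribution. The sketch below uses a permutation test, which is one nonparametric option among several and not necessarily the one Matt's team uses. The 3% purchase rate, the lognormal order values, and the names `simulate_rpv` and `permutation_p_value` are all illustrative assumptions.

```python
import random

def avg(xs):
    return sum(xs) / len(xs)

def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(avg(b) - avg(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(avg(pooled[len(a):]) - avg(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (hits + 1) / (n_perm + 1)

def simulate_rpv(n, buy_rate, rng):
    """Hypothetical visitors: most spend 0, buyers spend a skewed amount."""
    return [rng.lognormvariate(3.7, 0.6) if rng.random() < buy_rate else 0.0
            for _ in range(n)]

rng = random.Random(1)
rpv_a = simulate_rpv(500, 0.03, rng)
rpv_b = simulate_rpv(500, 0.03, rng)

share_zero = sum(1 for x in rpv_a if x == 0) / len(rpv_a)
print("share of visitors with zero revenue:", share_zero)
print("permutation p-value:", permutation_p_value(rpv_a, rpv_b))
```

With the overwhelming majority of visitors at zero and a long tail of buyers, the data looks nothing like a bell curve, and the shuffle-based test never has to pretend that it does.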
16:23 TW: And just another terminology thing, you’re saying parametric and nonparametric, am I right? That parametric means non-numerical, non-continuous?
16:36 MP: No, it’s more the parametric, and I know I threw that out there, and I think it is… Parametric, so in capital S Statistics, there’s a couple of key terms of, “Hey, this comes from math. We wanna use math terms to talk about these things.” And parameter would be one of those. And more often than not, when you’re talking about statistics you’re trying to estimate a parameter based on a sample you’ve taken from a population. Which is basically, there’s an unknown value here that we’re trying to estimate. Say the average age of a group of people based on the sample that we’ve taken, and we’re trying to estimate that parameter for that group. It’s kind of a, “Hey, here’s a known entity or value.” So a parameter for a normal distribution would be the average. And another parameter for the normal distribution would be the variance inside of that distribution. And so parameters are kind of what, especially with the more formalized distributions, like normal, like Poisson, other things where they’re, “Hey it’s kind of these fun little values we just get to plug in.” And that’s going to basically influence how it’s gonna look when we plot it out, of how wide is the bell curve or how sharp is the Poisson distribution when you put those parameters in there. And then there’s all the other, “Hey, here’s variables which represent this unknown parameter” and things like that. Does that answer that question?
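To make "estimating a parameter from a sample" concrete, here is a small sketch. The population values (mean age 35, standard deviation 8) and the hourly order counts are made up for illustration. The normal distribution has two parameters, mean and variance, while the Poisson distribution has a single rate parameter.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(7)

# Normal distribution: two parameters, the mean and the variance.
# Hypothetical "true" population values that we pretend not to know.
population = NormalDist(mu=35.0, sigma=8.0)
sample = [rng.gauss(population.mean, population.stdev) for _ in range(500)]

# Estimate both parameters from the sample alone.
estimate = NormalDist.from_samples(sample)
print("estimated mean:", round(estimate.mean, 2))
print("estimated variance:", round(estimate.variance, 2))

# Poisson distribution: one parameter (lambda), which is both its mean and
# its variance, so the sample mean is the usual estimate of it.
orders_per_hour = [3, 5, 4, 2, 6, 4, 3, 5, 4, 4]  # made-up counts
lam = mean(orders_per_hour)
print("estimated lambda:", lam)
```

Plugging different parameter values into the same distribution changes how wide or how sharp the curve looks, which is the "fun little values we just get to plug in" part.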
17:50 TW: It does. Of course it raises the other question ’cause you went down to the statistics being typically a sample trying to estimate a population, and the other kinda question that has been vexing me for several years now is that, “But wait, don’t we have the whole population?” In testing we’re trying to estimate the future behavior, but when it comes to just traditional digital analytics, where does that… That seems like the sort of thing that breaks down in that like, “Hey, we have all the data and sampling is bad because we’ve been taught that sampling is bad.” How do we wrestle with that if we’re saying, “I just have my Adobe Analytics or Google Analytics data and I’m reading about sampling”, where does that fit? How do I reconcile that?
18:34 MP: I think a lot of folks’ responses to the sampling piece is a little bit knee-jerk gut reaction of, “Well, if I can have all of the data, why wouldn’t I just use all of the data?” And that’s where you get people doing all kinds of crazy stuff of, “I’m gonna hit the Google Analytics API and walk through the data on a day by day basis, and then stitch all that together into a dataset that I can then say I have full sample. I know everything I can about these people.” And I think that’s misplaced more often than not. It comes from a good place of, “I wanna be extra, extra, extra certain about this thing before I make a decision, or I go and do something for my business.” But that’s why statistics is there, it’s so that you don’t need the comprehensive measurement to get to that end point. But then is, for why you would go with the, “Hey, we’re treating this as a sample in the first place.” It’s maybe a little bit more navel-gazing but you are thinking about those questions in terms of… There are always gonna be, obviously, the future. You don’t know what’s going to be happening there.
19:35 MP: We don’t have data on the future, we’re trying to estimate what the effect of a change would be on future populations. If we roll out this variant for everyone on the site based on the test data that we got from this. But there’s also plenty of cases of, “Hey, someone has a plug-in blocker enabled, and maybe they’re behaving totally differently, but maybe they’re just otherwise a relatively normal person. Can we be certain about that?” And you’re covering those cases well. But also it just lets you say, it just lets you back up and have some… It encourages you to be more conservative with what you’re observing as opposed to just saying, “Hey, this checks out. We have 10% versus 15%. 15% is obviously great, but not when you’re talking about a sample size of 20 people, that’s not gonna tell you anything.”
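The 10% versus 15% example can be worked through with a two-proportion z-test. This is a sketch under assumptions: "20 people" is read as 20 visitors per group, the larger scenario of 2,000 per group is invented for contrast, and `two_proportion_p_value` is a hypothetical helper. The normal approximation is itself shaky at n = 20, which only reinforces the point.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 10% vs 15% with 20 visitors per group: 2 vs 3 conversions.
small = two_proportion_p_value(2, 20, 3, 20)
# The same rates with 2,000 visitors per group: 200 vs 300 conversions.
large = two_proportion_p_value(200, 2000, 300, 2000)

print("n = 20 per group:   p =", round(small, 3))
print("n = 2000 per group: p =", large)
```

The observed rates are identical in both scenarios. Only the sample size changes, and with it how much the difference can be trusted.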
20:18 MK: So, I’m just curious though, for the analyst, how do you know when to make that decision? Because I’m thinking about my own work and the majority of time, I would use everything… Every observation that I have, I would just use because why wouldn’t I? Like you said, it’s all there. The only times that I’ve reverted to sampling is when the data set just gets too big for my machine, which doesn’t sound like the best way to be making this decision about whether to sample or not. So, do you have any advice about in what circumstances…
20:50 TW: And to be clear, you’re working with BigQuery, so basically hit-level, session-level data and Snowplow data. So you are in the realm of, you have the totally detailed data just…
21:04 MP: Yeah, I think I’ll do a giant bulk export from Adobe Data Warehouse or BigQuery to get a really low level, hit level data and then do a bunch of processing to get that back into data that sort of resembles what you would be seeing in Google Analytics, a little bit more granular, or Google or Adobe, whatever the platform is, you get your nice display upfront, but stitching those things back together, you can definitely go down the rabbit hole of building the infrastructure to get that going. Getting a Hadoop cluster running on your laptop, which is usually a good sign that you’ve come too far.
21:44 MP: It’s one of those things where, at least in data science, a lot of it is just, “Hey I’m trying to hack something together just to answer this question.” When it feels like you’re reinventing the wheel just in practical terms of, I am building large cloud scale infrastructure on my laptop, and I’m expecting it to work fine, you’re probably approaching the problem the wrong way. So, totally practically speaking, that’s there. But then also the way that statistics operated for so long is just in many ways the discipline was not created with the kind of data that we work with in mind. You’re talking about getting a sample of a couple thousand people, not getting a few million in a week or something outrageous like that.
22:21 MP: There is a point where you are getting diminishing returns. Like if you’re trying to say, I am trying to detect a 0.1% change between two groups, then getting all the data that you possibly can is fantastic. But, maybe another more practical way to approach it is if I pull this same report a couple of times and it is being sampled by Google or Adobe, am I consistently or semi-consistently getting the same result out of that. Doing re-sampling, there’s all kinds of methods for re-sampling your data, of taking samples out of the same data set over and over and over again, and making sure that what you’re seeing is reproducible and that’s usually a pretty good indicator that you’re doing okay. I don’t know if that’s the…
23:03 MK: Actually, the last time I ran something, I did try doing that. I ran four different sample groups, and then I compared the results, and the results were very consistent or so similar. It wasn’t gonna change, there wasn’t gonna be any difference in what action we took.
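The "pull several samples and see whether the answer holds" check that Matt and Moe describe might look like the sketch below. The data set is simulated (200,000 visits at roughly a 4% conversion rate), and the sample size of 10,000 and the four repetitions are arbitrary illustrative choices.

```python
import random

def conversion_rate(visits):
    return sum(visits) / len(visits)

rng = random.Random(11)
# Hypothetical full data set: 1 = converted, 0 = did not (true rate near 4%).
all_visits = [1 if rng.random() < 0.04 else 0 for _ in range(200_000)]

# Draw several independent random samples and compare their estimates.
sample_rates = []
for _ in range(4):
    sample = rng.sample(all_visits, 10_000)
    sample_rates.append(conversion_rate(sample))

full_rate = conversion_rate(all_visits)
spread = max(sample_rates) - min(sample_rates)
print("full data rate:", round(full_rate, 4))
print("sample rates:", [round(r, 4) for r in sample_rates])
print("spread across samples:", round(spread, 4))
```

If the spread across samples is too small to change the decision, the sample is probably good enough for that decision.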
23:21 TW: I thought one of the knocks was taking too much data that you’ll overfit a model. So if we take something specific like propensity modeling, where we’re saying, “We wanna try to figure out which characteristics are most indicative of somebody is likely to purchase or convert.” And if you take all the data, you have probably very precisely described that for all of your historical data, but not necessarily the best possible, going forward. Whereas if you take a sample and figure it out, then you’ve got the whole idea of saying, “I’m gonna train the model, and now I’m gonna pull another sample and see if that model still holds up pretty well.” The whole training versus testing a model, which seems like it’s foreign to the world of aggregated digital analytics data.
24:09 MP: Absolutely, and it is something that I think is becoming more relevant as more folks are getting exposed to platforms which are doing more sort of model training, sort of the machine learning or artificial intelligence approaches that we’re seeing in a lot of places. I think that is getting more relevant and it is a better question of, “Is this becoming hyper responsive to the needs of a few and sacrificing the needs of the many?” Hopefully, the platform that you’re using is counterbalanced against that. But if we are doing something as simple as a t-test, then you’re not really gonna be running the risk of, “Oh, you’ve overfit something.”
24:41 MP: It’s really when you’re in the realm of categorizing or classifying or predicting future cases based on what we’ve observed that overfitting becomes more of an issue. And that can happen at any size of data of, “Oh, we took a really, really small sample size, and then we’re gonna use this to predict on… We took 5% of our users, trained this model against it. Now we’re gonna roll that out to everybody without doing any additional quality checking.” That’s really fertile ground for making bad assumptions about your users based on that. Yeah, it sounds reductive and maybe a little bit too easy, but just going and redoing that process over and over again, five, 10, 20 times, and just making sure you’re getting results out on the other end that seem reasonable, that’s usually a pretty good indicator if you’ve got something good going.
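The training-versus-testing idea can be illustrated with a deliberately crude propensity "model" that just memorizes conversion rates. Everything here is a hypothetical setup: the visitors are simulated, `device` truly drives conversion, `noise_id` is a meaningless high-cardinality field, and the Brier score (mean squared error of the predicted rate) is one reasonable yardstick among many.

```python
import random
from collections import defaultdict

def make_visitors(n, rng):
    """Hypothetical visitors: device drives conversion, noise_id does not."""
    rows = []
    for _ in range(n):
        device = rng.choice(["mobile", "desktop"])
        rate = 0.02 if device == "mobile" else 0.06
        rows.append({"device": device,
                     "noise_id": rng.randrange(2000),
                     "converted": 1 if rng.random() < rate else 0})
    return rows

def fit(rows, key):
    """'Train' by memorizing the conversion rate for each key value."""
    totals = defaultdict(lambda: [0, 0])
    for r in rows:
        totals[key(r)][0] += r["converted"]
        totals[key(r)][1] += 1
    overall = sum(r["converted"] for r in rows) / len(rows)
    rates = {k: c / n for k, (c, n) in totals.items()}
    # Unseen key values fall back to the overall training rate.
    return lambda r: rates.get(key(r), overall)

def brier(model, rows):
    """Mean squared error of predicted rate vs. outcome (lower is better)."""
    return sum((model(r) - r["converted"]) ** 2 for r in rows) / len(rows)

rng = random.Random(3)
train, holdout = make_visitors(4000, rng), make_visitors(4000, rng)

simple = fit(train, key=lambda r: r["device"])
overfit = fit(train, key=lambda r: (r["device"], r["noise_id"]))

for name, model in [("simple", simple), ("overfit", overfit)]:
    print(name, "train:", round(brier(model, train), 4),
          "holdout:", round(brier(model, holdout), 4))
```

The memorizing model looks better on the data it was trained on and worse on the fresh sample, which is the pattern a holdout check is meant to catch before anything gets rolled out to everybody.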
25:26 TW: Which I guess is one of the cool things that in R or Python, you can say, “I have all this data. Let me pull just 10,000 random samples” and it does that well. So you’ve got a smaller data set to work with and you can do that kind of repetition with it.
25:41 MP: Yeah, I’ve been working with a couple folks where I get handed a data set of, say, 20 million customers, and there’s no way you can get that working in a way which is expedient without doing a bunch of, really engineering your way around that data, at least on consumer grade laptop or even professional grade. And then it is totally, yeah, it’s a reasonable thing to do, that’s why we have statistical inference, which is to quantify the uncertainty that we might have from a lack of data that we might otherwise have.
26:18 MK: I really want to touch on basically this view of throwing data at the machine. And the reason I think that’s interesting is because it seems to be like everyone’s just, “Oh, if we just throw this data at some machine learning algorithm, everything will be great.” And I’m just worried about the danger of… And I’m thinking, particularly from the context of myself, everyone has to start somewhere, right? So you need to start playing around with some of these models so that you learn them, but then you start doing these pieces of analysis and sharing it with the business and the analysts are not always placed to be able to… They don’t actually always understand what’s going on in the background.
26:53 TW: To me, the throwing the data at the machines is there’s such a fundamental misunderstanding of what the raw data is, the world of feature selection, that if you take raw hit-level data and say, “I’ll just let the machine figure it out.” And this is one of those other things that the light bulb’s gone on a little more recently. If you think about a raw server log and say, “The machine will just figure it out”, there’s still, what’s missing there is the machine has no idea what those different things mean or what would make sense. So if you take and say, “I’ve got the raw data, I can convert that into,” to use something that I learned at Summit, “Let’s use how many unique days a visitor has viewed my site.” I have all the data to derive that, but there’s no way that a machine is going to stumble across combining the timestamp rolled up to a day, or maybe it’s how many unique weeks has a user visited. And so there’s that piece, that throwing the raw data without saying, “What’s some level of creating features or variables from that raw data and then have those features and variables be assessed to say, ‘Do they matter or not?'” They may not, but that to me is one of these insidious things that like, “Oh, once the machine gets smart enough, we can throw the raw data at it.”
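Tim's "unique days" feature is a good example of something a person has to decide to build. A minimal sketch, assuming hit-level rows of visitor ID and Unix timestamp; the row layout, the sample values, and the function name `unique_days_and_weeks` are hypothetical, and dates are bucketed in UTC for simplicity.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical hit-level rows: (visitor_id, Unix timestamp in seconds).
hits = [
    ("v1", 1525132800),  # 2018-05-01 00:00:00 UTC
    ("v1", 1525136400),  # 2018-05-01 01:00:00 UTC
    ("v1", 1525219200),  # 2018-05-02 00:00:00 UTC
    ("v2", 1525132800),  # 2018-05-01 00:00:00 UTC
    ("v1", 1525737600),  # 2018-05-08 00:00:00 UTC
]

def unique_days_and_weeks(rows):
    """Roll raw timestamps up into per-visitor (unique days, unique weeks)."""
    days = defaultdict(set)
    weeks = defaultdict(set)
    for visitor_id, ts in rows:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        days[visitor_id].add(moment.date())
        iso = moment.isocalendar()
        weeks[visitor_id].add((iso[0], iso[1]))  # ISO year and week number
    return {v: (len(days[v]), len(weeks[v])) for v in days}

features = unique_days_and_weeks(hits)
print(features)  # {'v1': (3, 2), 'v2': (1, 1)}
```

The raw timestamps contain everything needed, but the roll-up to days or weeks only exists once someone with business context chooses to create it.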
28:22 MK: Well, but surely, you still need someone to clean and prep data.
28:27 TW: That prep data, that world of a data scientist actually having some business context and saying, “What are the features I might need to design?” Or maybe Matt can put it more.
28:38 MK: And also, what do I need to remove?
28:40 MP: Yeah. Absolutely, and a lot of folks can tell you this when they’re doing any sort of data science work, but the overwhelming majority of the work that you do is incredibly boring drudgery of picking through a text file that was sampled from a database and saying, “Is this relevant or not? This isn’t formatted right. I need to go back to the developer.” It’s so much of the boring drudgery of getting to that point. And I think it is helpful to back up and say, “Any sort of statistical inference is not valid if you do not have a good explanation for it.” And that’s one of the things which a lot of machine learning and computer, the way that computers approach these problems is fraught. Because the computers will rarely, if ever, be able to tell you why something is happening.
29:24 MP: They can tell you that there’s a strong association, these are highly correlated, or, hey, this is a very highly predictive feature for the behavior that we’re trying to classify in customers, but it’s never gonna be able to tell you exactly why. Materially speaking, there’s very little difference between a computer trying to figure out the best way to organize boxes to fit in the back of a delivery truck, and then figuring out which features are more useful in trying to predict the way the customers are gonna behave when you present them with a call to action or a discount on their order. It’s an optimization problem and it’s just trying to find the most expedient and most reliable way to get to a solution in that situation. But it doesn’t actually necessarily solve the business problem.
30:09 MK: Yeah. I do agree with that. And when we were talking earlier about some of the different A/B testing vendor tools and your own work, it sounds like the context that gives credibility to any of the work is that the analyst or the data scientist has the context of the business to be able to explain why something’s happening coherently. And that’s where you get the stakeholder buy-in?
30:34 MP: Yeah. In my work, that has been where a lot of that has been: it’s not just to say, “Well, the machine said it did okay, so let’s just go with that. That sounds like the right thing for you to do with your business.” One, that doesn’t feel very good when you’re talking to a client that way. And two, it doesn’t help you decide what to do next and what makes sense, other than getting feedback from the machine and saying, “Well, we made the button green and it did better so let’s just keep moving. Let’s make more things green.” But it’s really the process of discovering those insights, the causal relationships there that you can then go back and reverify, that help you actually provide value to people at the end of the day. At least that’s how I’ve approached my work and how I try to restructure those problems. ‘Cause you can get in some really technical wonky, weird places where, “Well, we did the 5% lift on the revenue, but then people are exiting the site more and maybe they looked funny when they left and things like that.”
31:31 MK: The issue I have though, is there seems, and maybe this is just within my own sphere of people that I’m spending time with, there does seem to be this appetite for, “Okay, we’ve thrown this into some machine learning model, and therefore that’s superior to anything that an analyst could do.” And I don’t know where that’s coming from and I don’t necessarily understand it ’cause it sounds like your experience is really different.
32:00 MP: Yeah. Have you ever seen Terminator? I think there’s an overwhelming bias, especially for people who’re not necessarily working from an analytical perspective, say you’re a product manager or you are in charge of managing the ecommerce operations. You’re not necessarily seeing those things on a day to day basis, and I do think there’s a massive, massive problem in our industry of people over-hyping machine learning and artificial intelligence tools. Not to throw something too pointed but… There’s a product named after a very famous literary character’s apprentice or sidekick that is a really good piece of marketing, not so much a good piece of data work because there’s nothing comprehensive to it. It’s a suite of very well-produced tools, in my opinion.
32:44 MK: Guess that tool!
32:49 MP: No idea what that could be. Everything I’ve seen from it has been, “Wow, this is a great application of this tool in this context, but it seems like it’s been purpose built for this”. As opposed to, “Oh, it’s an artificial intelligence, it can handle anything you throw at it”.
33:05 MP: I don’t think that’s the case at all. And I think it is a lot of, “Hey, how do we back this up?” And one, I’m a big fan of the KISS principle, Keep It Simple, Stupid. If you’re overcomplicating things to the point that people can’t understand it anymore. It’s one thing if you’re Facebook and you’re trying to decide what stories to show on someone’s newsfeed, yeah, those decisions are probably never gonna be comprehensible to a normal human. But in terms of, “Hey, we’re trying to influence customer behavior to get them to a very clear end point in mind”, a lot of that uncertainty is not doing you any favors, at least in my opinion. And it is very much building that relationship and building that trust as an analyst or as a data scientist to say, “Hey, we did a lot of the quantitative work to qualify and to back up what we’re trying to say to you right now, and we can give you all those facts, we’re more than happy to. But in terms of you and your business and your goals right now, here’s what we found, here’s why we think that’s relevant.”
33:56 TW: And it’s also of course, we’ve talked about in the past with GDPR that having something that’s totally opaque and not explainable can wind up getting you in trouble. Why are you doing this? As well as the other, even just from an ethical perspective, that if the machine says you should be racist or ageist or you name it, sexist, that doesn’t necessarily mean that that’s a good thing to do. If the machine’s just running, it may be behaving that way, without any sorts of checks in the process.
34:28 MP: Yeah, there’s a lot of implicit bias that’s there. I lean more towards agreeing with that side of things of, “Hey, we should be really careful about how we apply this stuff” while also saying, “Yeah, when done correctly, this can be great. We can get to answers and we can get to solutions much more quickly than we could have before.” Just ’cause computers are a lot better at math than any human being can be. But then in terms of judgment and actually doing the right thing, not so much. Someday, maybe, who knows? But for the moment, it is really, let’s recognize these things as tools that help us get to goals as opposed to these things are self-sustaining processes that just mint money for your business.
35:06 TW: Well, I think that my soapbox, ’cause on the one hand, say, “So what we’re saying is statistics is dangerous and scary so stay away from them.” But the fact is, the more you understand about statistics, the more nuanced your thinking and understanding of the data is, which may mean that at times, you’re actually backing off and saying, “I’m not gonna go crazy with what I’m applying or some little obscure, fancy, crazy model, I’m gonna actually make a judgement call to do something that’s a little more comprehensible and probably is still pretty good.”
35:41 MP: Yeah. I think I have a fairly well-earned reputation for being a total wet blanket because I do do [35:45] ____…
35:47 MP: And say, “Hey, we’ve been talking about how… to monkey this data around into a format that’s gonna be able to give us the result that we’re looking for, are we doing the absolute wrong thing that we shouldn’t be doing right now just because we do want to get to that goal that… ” Everybody wants to get to ring the bell and say, “Hey, we did a great thing.” But then you have to just be careful of, “Hey, we introduced a lot of complexity here and we didn’t account for that in our results.” Oops. That’s a bad place to be especially when you do get somebody coming back and saying, “Hey, I reran this analysis and got something totally different. What’s up with that?” You wanna be able to answer that with a degree of confidence other than, “I don’t know.”
36:29 MH: Darn those computers again.
36:34 MH: Danger, Will Robinson.
36:35 MP: The computers are at it again.
36:39 MH: Oh man, this is good. And I’m trying to figure out where to get a question in and I’m like…
36:45 MK: I really wanna talk about correlation. But I feel like it’s such a like… I always take this down the garden path.
36:49 MH: Well hold on just a second. ‘Cause I think where I was trying to go was sort of… One of the things I feel like I’m hearing and honestly have felt in my career is sort of the balance of this really comes to life when you have the statistical skillset balanced with sort of the application or understanding of what’s most likely to happen in any kind of business context or scenario, which then equips you to use the right methods. So how do you suggest people fill both sides of that equation?
37:23 MP: I’m very much in the camp of being a generalist is not a bad thing. And I know at times that’s been an unpopular thing to say. I do think it is important that you sort of understand kind of soup to nuts of how you get from, how am I going to perform this analysis to, how am I going to communicate this in a way that makes sense to people and expresses why doing one thing, why making this decision may result in some good things, but may have some drawbacks. I really do think that that is important and that’s also a big part of why you work in teams of people. Everyone has their core competencies that they’re good at. But really fundamentally, if you’re not able to communicate that to someone, then I think you have a bigger problem on your hands, which is that “Oh, I can talk your ear off about the P values and the relative importance of what’s coming out of this and the accuracy, precision, recall, whatever about the classification model you’ve built.” But if you can’t bring that back to the fundamentals, then I think that’s a real issue. I’m sorry. I feel like I got away from that question.
38:21 MH: No. I think that’s certainly an aspect of it for sure.
38:25 TW: But I think sometimes we hide behind and say “Oh, being a generalist is fine. I’m just gonna generally understand aspects of the business and the data collection.” And I think there is that… The idea of being broad but shallow, we try to be in a millimeter of water when it comes to the statistics aspect of things. I think that’s where almost everybody coming from a digital analytics background would be well served to get to a half an inch of water.
38:55 MH: Just switch from metric to English…
38:58 MH: I did so…
39:01 TW: I’m trying to tap into the whole audience.
39:03 MH: Yeah. Just reaching out to the whole world there on that one.
39:05 MP: You just A/B tested that sentence.
39:14 MP: Yeah, I feel like it is very much a… Yeah. Just how do you communicate this in normal terms? Because as much as I wish everyone could have taken the Intro to Statistics class when they were in high school or when they were in college and that they had a basic understanding of these or even better, that they took that class and then actually remember something from it 10 years later, being able to speak competently to those things would be fantastic. But realistically, that’s not gonna happen anytime soon. And while that’s great ’cause I get to do my day job and get paid for it, it’s also, yeah, just being able to meet people in the middle and speaking in fairly plain terms about what’s happening.
39:50 MH: The only lasting…
39:51 MP: Again, sorry. I feel like I’m going off on an unhinged rant here.
39:53 MH: You’re on the right track. I told Tim earlier today, the only lasting memory I have of my statistics class in college was looking in the back of the book to look up z-scores. That was what we did… [chuckle]
40:07 TW: And there’s a table. But who makes the table? Where does the table come from?
40:12 MH: Where does this come from? Just look up the value. Just go with it.
40:14 MP: Shh. Don’t worry about it. It’s all fine. Just look at the book.
40:19 TW: Trust me. I wasn’t questioning it. But that is a challenge. To me this fundamental thing, that having taken two formal statistics courses, 10 years apart, college level and did fine in them. But then when I pivoted to web analytics data and correlation comes in there, I can say, “Great, I can correlate visits to revenue.” That’s kind of a… Yeah, no, duh. And I can see how tight I can take these metrics. But then I’m like, but that doesn’t tell me what by channel. All of a sudden, to me, the light bulb that went on was because I’m so conditioned to looking at aggregated data and then I’m kinda stuck doing these simple things of correlation where I’m just looking at metrics.
41:05 TW: Even if it’s a… Say maybe I wanna compare the correlation of visits to revenue by channel. Fine, I can do that. That has very, very limited utility generally and that comes down to everything in statistics. We were talking about observations and variables and all of a sudden, I’m in a world looking at these aggregated dimensions and metrics. And that to me is still the super big challenge. Once you’re in BigQuery, you’re good. Once you’re using Adobe’s data feed, fine. But that’s a big jump to get to and I’m still not articulating well how fundamental of a switch that is.
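For concreteness, here is roughly what the "correlation of visits to revenue by channel" exercise looks like on aggregated data. It is only a sketch: the `daily` rows and channel names are made up, and the Pearson coefficient is written out by hand so the example needs nothing beyond the standard library.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily aggregates: (channel, visits, revenue).
daily = [
    ("email", 100, 500.0), ("email", 150, 750.0), ("email", 200, 1000.0),
    ("search", 300, 900.0), ("search", 320, 700.0), ("search", 340, 950.0),
]

# Group the aggregated rows by channel, then correlate visits with revenue.
by_channel = defaultdict(lambda: ([], []))
for channel, visits, revenue in daily:
    by_channel[channel][0].append(visits)
    by_channel[channel][1].append(revenue)

correlations = {ch: pearson(v, r) for ch, (v, r) in by_channel.items()}
print(correlations)
```

Each row here is a day-by-channel total, not an observation of a visitor, which is the limitation Tim is pointing at: the per-visitor variables are already gone by the time the data looks like this.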
41:47 MP: No. It absolutely is. And I have not been in the industry that long so I can’t speak terribly authoritatively to this. But I do think, one, the way that measurement happened on the web was very idiosyncratic and it was very much a sort of engineer, computer science mindset of, capture all of the data. We can do all the analysis that we want. And it comes from a very different mindset of statistics, which has always been, we can never know the full truth.
42:14 MP: We can never get the full picture, let’s work off of what’s available. And I think that’s why you have, that’s why you have a lot of folks who are working on the web, doing more things with Bayesian things and building [42:25] ____ mail filters and things that way ’cause it does handle that sort of question much better. But then on the other side, I also think a lot of folks who do sort of frequentist or more traditional statistics have done a really bad job of building a path to getting to that successfully.
42:39 MK: Sorry, can you just elaborate on that? What do you mean they’ve done a path?
42:44 MP: I think there are lot of aspects of traditional stats which are legitimately counterintuitive or difficult to explain. Ask anyone who would say that they’re a statistician or that they know statistics to define P values, and you will have a very, very long conversation ahead of you.
43:01 MH: Fivethirtyeight.com had a lot of fun with that. [chuckle]
43:03 S?: Yeah, yeah.
43:04 MH: So, P values. How do we define those? Oh sorry, go on.
43:08 MP: Yeah, exactly. And that’s where you get a lot of, “Oh, they’re P hacking that study, it doesn’t make any sense.” And a lot of it is arbitrary, but then a lot of it has also been a lot of folks who work with this do not do a great job of boiling this down and frankly, one of my favorite things is when I’m working with people is just say, “Hey, we just started working together. Do you wanna take 30 minutes for me to just walk you through how I measure when we’re doing a test and how I arrive at these conclusions? Because I feel like that helps us get to a lot better results on the other side of it”, other than just saying, “It’s math, don’t worry about it. I’m gonna tell you to do the right thing.” Again, it’s a very kind of, let’s talk about this as people, not just as, “Here’s a process, let’s bat some numbers out” and that’s where you should base your decisions off of.
43:52 MH: Well, it strikes me as sort of interesting potentially, that our ability, ’cause the promise of the web was that every behavior was trackable, every activity was trackable. And I think maybe we all kinda took a next logical step of thinking that everything was knowable or understandable as a result of that. And I wonder if that has really influenced the way people have approached digital analysis over the years.
44:23 MP: Yeah. I think there’s this notion of false positives, and I don’t know if that’s common outside of, it tends to, it’s kind of the lens that I’ve been thinking about a lot of things through for a while. And so would it be helpful to define what a false positive is or a false negative or is that kind of well trod territory?
44:41 S?: It’s good to touch on. Yeah.
44:41 MP: Good. Basically, it’s just, hey, we observe that there was a, we’re doing our A/B test, we observed that there’s a difference between these groups. Is that actually the case? If we, the frequentist or traditional statistics mindset is if we repeated this test a hundred more times, a thousand more times, would it consistently still be the case across those? That’s why we have P values, and that’s why we try to determine uncertainty in the way that we test things. False positives are, hey, we observed this when we ran this, but is it actually going to be the case that even though our confidence was high enough, even though we had a P value of .01, everything looks great. If we did re-run this again, would this actually turn out the same way? And you would point to that original case that told you that this was the case as a false positive. False negative would be the exact same thing just in the other direction of, we saw that there was no difference, but there actually is a difference between these groups. And I think that with the way that people think about web data of, oh, we know everything that we could possibly want to about our users or our customers.
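One way to get a feel for false positives is to simulate many "A/A" tests, where both groups share the same true conversion rate, and count how often a test still comes back significant. The sketch below assumes a pooled two-proportion z-test and made-up traffic numbers; nothing here is a method described on the show.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to distinguish
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=1000, n_per_group=1000, rate=0.10, alpha=0.05, seed=42):
    """Share of A/A tests (no real difference) that still look 'significant'."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate for _ in range(n_per_group))
        if two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            false_positives += 1
    return false_positives / n_tests


false_positive_rate = simulate_aa_tests()
print(false_positive_rate)
```

With a threshold of `alpha=0.05`, the share of "significant" A/A tests should land somewhere near 5%, even though there is never a real difference between the groups.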
45:39 MP: We have all these different ways to measure success coming in after the fact and trying to apply different metrics to get to a result is the easiest way to get yourself into that territory of just desperately trying to extract meaning from something. Even if it was, we adjusted the line spacing on the descriptions of our products by one em, and now we’ve seen sales go through the roof and it’s like, “Well, hold on a minute, something does not seem right here.” Are you just grasping on to any sort of metric that you’ve gotten inside of a test that you think may have been affected, saying that’s the difference, that happened. We need to be able to latch onto them, when realistically, it’s just people rolling the dice or flipping coins all day long on your website or your campaign that you’re running. You’re gonna get flukes that happen, sit there with the coin. You eventually will get a streak that’s going hot on the heads, but in the long run, it will even out. Hopefully, if you have a coin that is supposed to be working the way it is. But if you have a non-functioning coin then you should probably go talk to someone.
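The risk of hunting through metrics after the fact can be put in numbers. Assuming the metrics are independent (a simplification; real web metrics are usually correlated), the chance that at least one of them crosses the significance threshold by luck alone grows quickly. The function names below are illustrative, and the Bonferroni adjustment is one common, conservative correction rather than anything prescribed in the conversation.

```python
def chance_of_any_false_positive(n_metrics, alpha=0.05):
    """Probability that at least one of n independent metrics is 'significant' by chance."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics, alpha=0.05):
    """Per-metric threshold that keeps the overall false positive chance at or below alpha."""
    return alpha / n_metrics


for k in (1, 5, 20):
    print(k, round(chance_of_any_false_positive(k), 3), bonferroni_alpha(k))
```

Under these assumptions, checking 20 metrics at a 0.05 threshold gives better-than-even odds of at least one fluke.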
46:41 MH: A whole different problem with that.
46:42 MP: Yeah, yeah, exactly.
46:44 MH: Alright. Well, I can’t wait to go back and listen to what we’ve just been talking about because I think it’ll be highly correlated with me learning a lot of new things again. But we do have to start to wrap up. One thing we do like to do on the show is what we call last call. That means we just go around the horn, share something we’ve seen recently that we thought was interesting or [47:05] ____ that would be of interest to our audience. Matt, you’re our guest, do you have a last call?
47:11 MP: I do, I feel very silly saying it out loud, but I’m a very big fan of bots on Twitter, not the meddling…
47:19 MP: Let me qualify that. Not so much a fan of the types that manipulate elections or do other less than ideal behaviors. More the ones that spout nonsense, which help me get through the day. And there’s one in particular, which is Magick at Scale. Magic in the old English with a C and a K. That really makes me happy. It’s Our Magick Future. And yeah, I don’t know if you all wanna take a look, but it blends a little bit of the old school mysticism with the late and breaking with how we do web analytics work. And it’s a good reminder that a lot of this is just black magic and trying to find meaning in nothing. And that helps keep things in context a little bit. Especially after a call where you’re just like, “I don’t know what just happened,” but on the bright side, Ontology just announced VBFT Consensus Alchemy. So that sounds exciting.
48:17 MH: Wow, never be at a loss for words again.
48:18 TW: I’m scanning it now thinking, “I’m gonna… This is a way to feel really stupid.” And then realized that it’s intended to be a little silly.
48:29 S?: Nice.
48:31 MH: Well, Tim, what’s your last call?
48:32 TW: So I’m gonna have a last call that could not be more on the nose with the topic of this episode. It led to some of my questions.
48:40 MH: As per usual.
48:41 TW: No, usually I’m off, way off, on whatever. So there is a sample size calculator, and we’ll have a link in the show notes that is kinda geared towards, I think by my count, four past guests. I count Matt, our current guest, as one of them who’ve had input into this thing, but trying to provide some intuition around the idea of measuring significance and where does alpha for confidence and p-value, and power, the beta and trying to sort of visualize, ultimately started with a question of, “How many visitors do I need to get if I’m running this test with these criteria?” But it’s kind of been spearheaded by Kelly Wortham from Episode 25, that Elea Feit has weighed in, Matt Gershoff has weighed in, Matt Policastro has weighed in. So it’s a really, I think, kind of useful and fun tool to try to get some intuition around the various levers we can pull, and some of the complexity of trying to balance the right level of uncertainty with the cost of data collection.
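The question "How many visitors do I need?" has a standard textbook approximation for comparing two conversion rates. The sketch below is not the calculator mentioned on the show (which was built in R Shiny); it is a generic normal-approximation formula, and the function name, defaults, and example rates are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in each group for a two-sided two-proportion test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # guards against false positives
    z_beta = z.inv_cdf(power)           # guards against false negatives
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Example: detect a lift from a 5% to a 6% conversion rate.
needed = visitors_per_variation(0.05, 0.06)
print(needed)
```

Tightening `alpha`, raising `power`, or looking for a smaller lift all push the required traffic up, which is the trade-off between uncertainty and the cost of data collection that Tim describes.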
49:46 MK: I don’t think you drew reference to the fact that you and the help of those other people built it.
49:52 TW: Yeah. There’s a whole R component. It’s all built in R, shiny…
49:55 MH: Ooh, shiny R.
50:00 TW: And I’m very excited about that ’cause I got to try to flex a few muscles on the shiny front, but…
50:05 MP: I would also take issue with saying my feedback was useful, but that’s neither here nor there.
50:14 MH: Again, try to embrace the uncertainty, which is fairly subjective, intuitionally.
50:23 S?: No, it’s not bothering me now, the feedback that you gave I’m like, “Son of a bitch. There needs to be… ” I need a stool.
50:29 MH: Alright, moving right along.
50:31 MK: Anytime, anytime.
50:33 MH: Hey, Moe, what’s your last call?
50:35 TW: Sorry, GDPR comes into effect in three days’ time, so I hope everyone actually knows what it means. If it doesn’t, I’m just doing call out…
50:43 MK: Dull, darn data protection.
50:45 MH: You’re in deep trouble, or hopefully someone else at your business is managing it. But [50:50] ____ shared a really good resource on the UX guide to getting consent a fair while ago. And, also, I’ll do a shout out to Episode 77 with Jodi Daniels. If you need a refresher on GDPR, 25th of May.
51:04 TW: Wohoo! Yehey!
51:06 S?: What’s that again?
51:08 MH: A day that will live in infamy.
51:11 TW: Michael?
51:11 MH: I’m excited to see the fallout of that. Okay, my last call, a little while back, every year, Jeff Bezos writes a letter to shareholders, and every year I enjoy reading it. And so I always make it my last call, at least, I think maybe two years running now. But in this particular letter, he spent quite a bit of time talking about how Amazon has tried to embrace high standards. And a couple of things really stood out to me, and that one is high standards require by domain. So it’s not someone has universal high standards across everything. They have high standards in some areas, but not in others. And I was like… For some reason that really flipped the light switch for me.
51:54 MH: And then the other thing is he walked through how they tried to then create space for people to adhere to those standards because to achieve a high standard in something may take a significant amount of time or effort, and if you don’t allow for that in your business, you won’t be able to hit those high standards. And I really liked that they thought about the before and after, and what was necessary to create the standard. And he told a really great story about a friend of his who was trying to learn to do a handstand and the timeframe necessary to be able to learn to do a perfect handstand. So well worth a read, especially if you’re like me and want to increase your standards over time, which I do. So that’s my last call, not necessarily for this podcast, however. That’s one of the area most standards that I will continue to hold, much to Tim’s chagrin.
52:50 MH: Well, Matt, it has been a pleasure. And actually, as I was listening and thinking about your name, Matt Policastro, I kind of came up with a play on your name, which is more like a Polly Matt or a Poly Math. You’re just very knowledgeable and really great to talk to. So thank you so much for coming on the show. And if you’ve been listening, what you may not be aware of is that half of you have heard version A and half of you have heard version B and through your comments on the Measure Slack, and our Facebook channel, we’ll be able to tell who’s had the best response rates. So get out there, talk to us about it, especially Tim, he definitely wants to talk to you about all of your statistical ideas and concepts. So we’d love to hear from you. And I don’t think there’s anything else to say. Just to know that for my two co-hosts, one of whom is Tim Wilson, my close personal friend, and Moe Kiss, also, my close friend…
53:54 MK: No close personal friend. [laughter]
53:56 MH: My growing personal… There was a tweet… Tim, okay, yeah… There’s…
54:00 MK: I know, I know.
54:00 MH: There’s context for this.
54:02 MH: It’s not that I don’t want to be your close personal friend, Moe, but we have a lot of work to do, apparently. A lot of hours which, let me just state, that one of the ways that we’re going to clock some of those hours is going to be in a couple weeks at the Marketing Evolution Experience. I don’t think it’s too late for you to come. Be there or be square. The Digital Analytics Power Hour will be in effect, doing a live recording of the show at that conference, and we would [54:32] ____.
54:33 TW: It will be a recording in front of a live audience, so that Moe doesn’t have to jump in. These are all live.
54:37 MH: It will be a recording in front of a live audience. So many details matter to Tim. And that’s what makes him great. That’s what makes him the quintessential analyst.
54:49 TW: Fuck you.
54:55 MH: But even if you’re not quite to Tim’s level of quintessentiality, you can get there if you just keep analyzing.
55:07 S1: Thanks for listening and don’t forget to join the conversation on Facebook, Twitter, or Measure Slack group. We welcome your comments and questions. Visit us on the web at analyticshour.io, facebook.com/analyticshour, or at Analytics Hour on Twitter.
55:27 S?: Smart guys want to fit in so they made up a term called analytics. Analytics don’t work.
55:36 MH: Statistics and coffee. Yeah, makes perfect sense.
55:40 TW: If you start saying something and…
55:42 MH: And Tim talks over you.
55:43 TW: Which I am very guilty of.
55:46 MK: Oh, shit. Damn it.
55:48 TW: And then Ready Player One, which I would have been oblivious to if Michael hadn’t said a year ago, “This is the book everybody’s reading now.” So, now people are like, “Have you read the book?” I’m like, “Yeah, of course. I read it a year and a half ago.”
56:03 MH: So good for you. There’s a couple of people you should be aware of in the pop music industry who are really blowing up right now, Tim. One of them is named Cardi B, and you should probably check out her music.
56:14 TW: Yeah, the music’s not gonna…
56:16 TW: It’s not gonna happen.
56:19 MK: Okay, so everyone’s gonna stop talking and I am going to start. How do you get time to build shiny apps on the side? I’m just trying to do my job for Christ’s sake.
56:30 TW: Two out of three kids not living here and wife out of town.
56:35 MK: I don’t have kids.
56:36 MH: Moe?
56:36 MK: I watch Friends reruns.
56:38 MH: Nobody… Oh, Moe, let me give you the virtual list of high fives right now.
56:43 MK: I drink wine. I drink wine.
56:44 MH: No, because that’s healthy.
56:48 MH: That’s healthy. What Tim does, not healthy.
56:51 TW: I have watched the entire, to be fair, I’ve watched the entire Lost in Space.
56:56 MH: Oh, I’ve seen… I saw, I think, the first two episodes.
57:01 TW: It got slammed in the review I read and I thought it was actually… Really good…
57:04 MH: Oh, it’s really good so far, it’s crazy. Parker Posey.
57:06 TW: Parker Posey wow.
57:10 MH: Wow, yeah.
57:13 MK: What is going on, Tim?
57:15 MH: One thing Matt, that really, that Moe and I wanna express to you is that you don’t have to agree with Tim.
57:24 MP: Oh, I have no problem with that. Don’t worry.
57:28 MH: Yeah, okay. Great. Just making sure.
57:34 MP: Sorry. Go ahead, Michael.
57:38 MH: You were gonna say something very substantive and I was gonna say…
57:40 MK: Can I bring us back to the actual?
57:42 MH: I don’t know, Moe. Can you?
57:45 MP: Actually, as much as I wish I did, I do not know Tim terribly well. We had that one magical day at that conference and then never again, so…
57:54 TW: Rock flag and regression.