The company: “Hey, you. That’s a mighty nice test you’ve run. We should be doing a lot more of those.” You: “Um…okay. But, I’m only one person.” In this episode, the gang chats with Pinterest’s Andrea Burbank (Twitter | Pinterest) about how she (loosely) dealt with this scenario: from sheer force of will to get some early wins to strategic thinking combined with late nights, an obsession with checklists, and a willingness to be flexible as she slowly, but firmly, pushed the organization to steadily increasing test volume and test reliability. And sweet potato gnocchi.
Links To the Mentionables from the Show
- Ronny Kohavi
- (Book by Atul Gawande) The Checklist Manifesto: How to Get Things Right
- Bonferroni Correction
- Sweet potato gnocchi on Pinterest
- AirBnB’s north star metric
- Generative World Cup Posters
- #021: Analytics in Sports with Ben Gaines
- The Explain xkcd Wiki
- #standardSQL BigQuery query prefix
- YOW! Conferences
- Stop letting the data decide — Niels Hoven
00:04 Announcer: Welcome to the Digital Analytics Power Hour. Tim, Michael, Moe, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com/analyticshour, and their website analyticshour.io. And now, The Digital Analytics Power Hour.
00:27 Michael Helbling: Hi, everyone. Welcome to The Digital Analytics Power Hour. This is episode 94. In marketing and digital groups all across this great land of ours, and yes, I mean the earth, all too often we get so excited about the great A/B testing and optimization tool that we just bought. And we get it all set up, we run a few tests, and then I don’t know, interest wanes; and two years later, when it’s time to renew, we’re just looking at each other like, “Well, what should we do?” And the reality is there’s a lot of work that’s required to scale your program. And so that’s what we’re gonna talk about on this episode. And in a new test format, let’s introduce you to our hosts.
01:14 Tim Wilson: Hey, I’m Tim Wilson. I’m the elderly Director of Analytics at Search Discovery.
01:20 Moe Kiss: And I’m Moe Kiss. I’m a youthful analyst over at THE ICONIC.
01:26 MH: And I’m the somewhat middle-aged Michael Helbling. Alright. Yeah, the jury’s out on that one, guys. Okay. But in addition to us, we needed a guest, somebody with a lot of experience with this topic, and it just so happened, Moe met the perfect person at the YOW! Data Conference over in Sydney, Australia. Andrea Burbank is a Data Scientist at Pinterest, where she’s had many different roles across that organization as it’s grown from 50 people to now over 1,000. Prior to that, she was part of the Powerset team, which was then acquired by Microsoft, where she spent some time, but now, she’s our guest. Welcome to the show, Andrea.
02:07 Andrea Burbank, Guest: Thanks for having me.
02:10 MH: Awesome. Well, I think a great way to get started is just, how did you end up in the role that you’re in, and how did you end up in the experimentation and testing area, and maybe some of the things that you experienced along the way?
02:24 AG: Yeah. I think two different things happened that kicked that off. I joined Pinterest almost exactly six years ago in July, 2012; and we were working on changing the categories we showed to users. So should it be animals and tattoos and art, or history and vacations and travel, or whatever else? And we shipped that the first week of the London Olympics, and we didn’t really have an A/B testing platform, and then our metrics went down a lot, and we had no idea: are the categories not helping our users, or are people actually off watching the Olympics?
02:56 AG: And so we said, “Gee, it sure looks like A/B testing is something we should invest in.” And so we did, and we had an A/B testing framework that was limping along and semi-functional. And so the next thing that happened was that I went to actually use it for our next launch, which was changing how we onboarded new users; but our A/B testing framework at that point randomized all the users who showed up everyday, so you couldn’t test something specifically on new users, and that it only measured the outcomes when they actually showed up, so you couldn’t measure a lot of other things we wanted to measure. And so as I realized that I couldn’t measure the thing I was building, I started building out the A/B testing framework, and then once I built it, it turned out it was useful for a lot of other things. That was how it happened.
03:36 TW: Was that a homegrown A/B testing, or was that a third-party tool?
03:41 AG: It’s always been a homegrown one for us. Really, all you need for A/B testing is to randomize users, and then write down what you randomize them into, and then track some stuff. And since we already were logging things…
03:53 TW: And be able to crunch the numbers in a responsible way, although…
03:55 AG: Sure.
03:56 TW: Some would say maybe the third party tools don’t actually help you do that, but…
04:00 AG: Yeah. So we had a user hashing system, and we’d implemented logging on Pinterest something like a few months before that, ’cause we had grown really quickly in 2011, 2012. And so we very simply randomized users, we wrote down which group every user was in on each request, and then we could count how many repins happened, how many clicks happened, how many visits happened. So very basic in-house.
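A minimal sketch of the kind of hash-based randomization described here, not Pinterest's actual implementation. The function name `assign_group`, the experiment name, the group names, and the choice of SHA-256 are all assumptions made for illustration. Hashing the user ID together with the experiment name gives every user a stable position in [0, 1), so the same user lands in the same group on every request.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def assign_group(user_id: str, experiment: str, groups: Dict[str, float]) -> Optional[str]:
    """Deterministically map a user to a group, or None if not in the experiment.

    `groups` maps a (hypothetical) group name to its share of all users,
    e.g. 0.05 for a 5% group.
    """
    # Salt the hash with the experiment name so that different experiments
    # bucket users independently of one another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex characters into a point in [0, 1).
    point = int(digest[:8], 16) / 16 ** 8
    upper = 0.0
    for name, fraction in groups.items():
        upper += fraction
        if point < upper:
            return name
    return None  # user is not exposed to this experiment


# Hypothetical experiment with two 5% groups.
groups = {"control": 0.05, "new_onboarding": 0.05}

# "Write down what you randomized them into": an exposure log of (user, group).
exposure_log = [
    (uid, assign_group(uid, "onboarding_v2", groups))
    for uid in (f"user{i}" for i in range(100_000))
]
counts = Counter(group for _, group in exposure_log if group is not None)
```

Because assignment depends only on the user ID and the experiment name, no per-user state has to be stored to keep users in the same group across visits; the exposure log is what later analysis joins against.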
04:23 MK: And as you were on this, I hate saying the word “journey,” it’s so corny, but as you were down this path of, “Okay. I need this experimentation because it’s ultimately gonna help me do my job,” how did you learn the right and the wrong way to do things? Was it just trial and error? Did you try and educate yourself as you went? I don’t know, ’cause I think sometimes it is tough with experimentation, where you learn so much by making a mistake, but then you also don’t want that mistake to be replicated by other teams.
04:55 AG: Yeah. I think that’s a really, really good point. I think in some ways, I was a little bit lucky in terms of where I started out because I was working on the new user onboarding experience, things like novelty effects were super important since they were very top-of-mind, even though maybe if I had been working on the experimentation framework with another application in mind, I wouldn’t have thought of that until later. But because with onboarding, there’s such a big drop-off after the first day, I knew, okay, I’m gonna expose users on the first day, but what I really care about is how they’re coming back to Pinterest more a week later.
05:22 MK: Yeah. I know.
05:23 AG: And then I think we did make a bunch of experiments upfront, which actually helped strengthen the culture of experimentation. I think A/B testing can have a bad name of like, “Why are you running experiments on your users? Isn’t that unethical?” And I think the counterpoint for that is, “Well, where I was building something for our users based on our best guess, in this way, I get to actually validate our best guess.” The alternative to experimentation isn’t that everything stays exactly the way everyone would like it, the alternative is we ship things, and have no idea what you like or hate. And so hopefully by running experiments, we can. But one of the early things we shipped, we hadn’t really thought about, “Well, what if it doesn’t work? How will we back out of that?” And so that led to pretty strong buy-in from all sides of the company of, “We need to be thoughtful about what we’re going to experiment on, make sure we’re never intentionally shipping anything that hurts users.” But also think upfront of like, if our hypothesis is wrong, that this will be helpful, how will we make sure that the users who were exposed can be sort of reintegrated into the normal experience without any harm?
06:22 MK: Particularly in those early days, where the culture kind of wasn’t at the same point that it’s at now, what was the business like when things did fail? There are often tests that are inconclusive, or there are no results. What was the attitude around that? Particularly, as you were kind of trying to get people to understand the importance of testing.
06:43 AG: And then I think I was really trying to drive testing as a tool for sort of figuring out which way to drive, so we tried out sort of five really different onboarding experiences. And even if one turns out better in some ways than others, at least we can eliminate three of these ideas. ‘Cause we were pretty new, we didn’t really… We had an orientation that someone had built at some point, but we had no idea if it was a reasonable choice. And so the goal wasn’t necessarily that every change we made would triple our sign-up conversion rate or whatever else, but that it would give us an idea of what sorts of directions were worth pursuing. And so I think by framing it as a tool for learning, people were less invested in, “Oh gosh, should we ship this today? Oh no, are the metrics… ” etcetera.
07:24 TW: Well in a kinda perverse sort of way, was it the fact that you had… Something got rolled out, not tested, and was a problem? Did that wind up kinda buying you future patience with tests? Like if people feel like everything’s going fine, and you try to introduce testing, they’re like, “Well everything’s working fine, we’re doing okay. We roll things out, things stay flat, or things get better, and we’re good.” Whereas it sounded like you kinda had the perverse gift of a little bit of a crisis, where things weren’t good, and they’re like, “Well shoot, why? Is this because of an external factor, or was it because of what we rolled out?” That seems like it could actually have bought, for some period of time, bought you some ability to say, “Yeah, we need to test this stuff, ’cause we wanna not have that happen again.” But that’s me totally projecting, I have no idea how it actually [chuckle] played out.
08:24 AG: I think it’s exactly right. I think it wasn’t as powerful as it sounds like it could be, and I think some experiences like that could be for some organizations sometimes. For us, it was one of 20 things we were doing in parallel, and we kind of decided to do it anyway, and we were just trying to make it more reasonable based on some other things. We weren’t super concerned about the metrics, we mostly wanted to make sure we weren’t breaking anything terribly. And then you’re watching two weeks later the Olympics, and then everyone comes back, and everyone’s all happy.
08:51 AG: But I do think the idea of sort of a catalyzing story that helps people understand why A/B testing is valuable is really important. I did some of that internally, just by showing, “Okay, which of these two designs do you think will be better from some of Ronny Kohavi’s examples that he’s done?” And everyone votes wrong, and nobody votes right on all three examples. And you say, “Okay, given that you’re always wrong in guessing which design will be better next time.”
09:18 TW: Actually, I don’t think I’ve ever really thought about that as kind of keeping in the back of your mind, trying to get testing rolling if you actually do have a minor or an unpleasant crisis. And that could be a landing page on a campaign, it could be a new feature rolled out, because everybody sort of mobilizes and runs after, “Let’s address that now,” there may be the opportunity for someone to say, “Okay, we gotta get through the immediate crisis, but now let me use that to feed into a story to say, ‘Hey.'” And part of our post-mortem of how we could have avoided that, could have been through testing.
09:54 AG: So I think even though the Olympics thing wasn’t a huge catalyst for us, we did have one of those lucky confluences that convinced everyone to do testing. I guess “lucky” might not be the right word, ’cause it was sort of an Andrea-forced confluence, where we decided to do testing. But we were redesigning our whole website, and it was the sort of thing where you committed a lot of resources to it. So I think we had devoted 20% of our engineers for six months to re-doing the whole thing from the bottom up. And so we were absolutely gonna ship it no matter what, but I kept saying we should run an A/B test anyway, because we wanna know when the metrics change, “Hey, remember the Olympics back then?” If that’s due to this redesign then.
10:29 AG: So we’d like to know what size effect to expect from the redesign. And because I kept pushing for that, we ended up running that test, and ended up finding some pretty severe regressions in a lot of the metrics we cared about, that we wouldn’t have found otherwise. And because this was such a huge sort of company investment we had made at such a long scale, and yet we ended up pushing it out further because of A/B testing, that sort of brought the A/B testing framework and its capabilities to the forefront of everyone’s mind. And people said, “Oh geez, it was valuable that we learned these things before we pushed it out.”
11:00 TW: So in that case, you actually delayed the full launch… I mean, it had been launched to some users, but so you delayed it and said, “Let’s figure out what’s going on there.”
11:12 AG: Right. The original plan was to roll out to 5% one day, and then 20% the next day, and then 100% the next day. So, “Why don’t we keep it at 5% for a few days, and look at the metrics?” The metrics are way down, that sort of thing.
11:21 TW: So did you take it back down to zero, and then address things, and then…
11:26 AG: So we didn’t. In part, because we wanted to see… Well we’d still have enough users that we could roll other people in once we thought it was fixed. But we also wanted to see, as our fixes went out, was people’s behavior changing? And also, it wasn’t that the experience was horribly broken. It basically worked in most of the ways, but some functionality was gone, and as we added that back in, do people use it?
11:46 TW: So where did… When things were not working, was it a matter of saying, “Well these metrics are being impacted, let’s sit around, sit with the UX, figure out why that might be going.” Or did you…
11:57 AG: Yeah, so…
11:58 TW: Like, how did you then take it to the next level of actually figuring out what the root cause in the change was?
12:02 AG: At the time it was honestly super stressful. ‘Cause here I was sort of opposing this moving ship of this very important product. But I think it did lead to a lot of really good discussions of, “Okay, this metric is down. Is it because we made this button red? Is it because we moved this from the side bar to the bottom? Is it because… ” And so we actually… I still have the docs of eight different hypotheses we had for why this metric was down, and then the subsequent experiments we ran, where we tried re-introducing all those things, or moving buttons around, or changing button colors to see which of them was having the effect. And so we learned a lot about sort of how users were using Pinterest that we didn’t already know, and also managed to not ship this thing that was removing a feature that it turned out people cared about, that we hadn’t realized.
12:46 MK: It does sound like you had a very good culture around you though because there are some companies, that once the ship starts in that direction, you can tell them anything in the world about why it’s wrong or right, and they’re like, “But the ship’s going that way, there ain’t no turning it around.” And it does sound like you were…
13:04 TW: Well they would say, “Users will get used to it.” So your A/B test is no good, because you’re testing, and we’re changed, and that’s what’s caused it. Customer satisfaction went down because we made a change, but it will recover. But it sounds like you didn’t… So you didn’t really have any of that, or…
13:20 AG: I have heard all those stories many times. I think our leadership at Pinterest has always been very thoughtful and really, really, really cares about our users. And so when we could say, “Hey, it seems like maybe users are trying to do something and failing, or maybe it’s really not working the way people expect it to,” that they really care about. We don’t wanna tell a story of, “Well, in six months, everyone will be fine, and they’ll be used to it.” We were trying to re-design it to make it intuitive, and so to the extent that we were failing, we wanted to at least try to make it better. We weren’t gonna not ship, but we were willing to push the ship out a little bit to try to improve that experience.
13:54 MK: I do also really like that idea, the concept you were talking about, about do no harm to your users. This is not about, it’s right or it’s wrong, it’s like what’s in the best interest of our users? And I think when you can frame your A/B tests around like, “We’re looking after our customers,” I assume that would help with buy-in, and that’s a pretty hard argument to fight with, I guess.
14:17 AG: I think it’s a hard argument to fight with, but both sides get to use it.
14:21 MK: Yeah.
14:21 AG: So we saw by pushing this out, we’re also delaying the shipment of these other features which rely on the new framework. And so it’s a constant tradeoff. And so we did ship, I think, when our metrics were still a little bit down. But instead of 40% down, which is kind of ginormous, it was like 2% down. And we kind of understood what was driving it, that sort of thing. In that way, we weren’t gaining these features any more. So you try to do the thing that’s best for users. But only by measuring it, can you really have a guess of what that is.
14:46 MH: It makes me curious, just as part of the evolution of an organization and their ability to test, or having a culture around optimization, is there a turning point where the analyst isn’t gonna feel like they’re that person in opposition to some big moving thing, or does the organization start to embrace that more holistically? And I’m curious because you actually sit in a really great spot, Andrea. I think having seen a small organization become a big organization, and having seen this all along the way, were there times where… Like that first time, you felt stressed, in your words, being kind of in opposition to this big moving ship, if you will, with this big initiative. Does it still feel that way, or is it now kinda like, “No, we’re all on board the same thing. And now testing is… ” And where was that switch?
15:41 AG: I think that’s a really good question. I think the first switch came sort of from the value we demonstrated from that first really stressful pressure cooker of this rewrite. The whole company’s attention was on it. We showed how running the A/B test had actually been really valuable. And then we had pretty high level executive buy-in for, “Okay, we should be running tests on sort of anything we think is a substantial change to the service.” I don’t know how to answer your question about if you can avoid that first part of being the only person in the way, ’cause that’s how it happened. And I don’t know if you can get buy-in first, but…
16:14 MH: It’s almost like you have to have somebody be that champion, if you will, wherever they sit in the organization. I think it’s great if that person is sitting way, way high up in the organization, saying, “This is how we’re gonna run things.” But a lot of times, I think a lot of analysts, and people sitting down doing the work are those first people that kinda raise their hands, and be like, “Hey, we’ve gotta find a better way.”
16:37 AG: Yeah. I was pretty junior and a pretty new hire at the time, and I sort of just decided to stick my neck out and say, “I think this is really important.” And of course, it could have been that the A/B test showed nothing, and everything was fine, but then it also would have been lower stakes. Because we would have just shipped it. That wouldn’t have been this stressful. So it kinda works out.
16:55 TW: Well, and I do think, Pinterest having all just… In the scale of, let’s just call it massive scale [chuckle] from users. And it’s very, very consumer-focused. And you sort of alluded to kind of multiple times already that there were also kind of key metrics identified that the organization rallied around. As I think through some of the experiences I’ve had with organizations that really kind of… And you also said that it’s homegrown. So you started with kind of the need, and you just figured out a way to kinda get to it. To me, what I see, have seen more often is stakes are lower. Somebody sees a vendor talk, so they kind of jump to buying a tool, and saying, “I get the idea of testing, we should be doing testing.”
17:46 TW: It may be the analyst saying, “We should be doing testing, we should be doing testing.” But then when you say, “Well what are we gonna test? What are we grappling with? What significant changes are we making to the user experience?” And really struggling. I’ve had multiple clients that have bought a testing tool first, and then literally just cannot come up with ideas that they’re willing and able or even interested in testing. So I sort of feel like it’s a totally different kind of challenge, ’cause it’s very tempting a lot of times for us to say, “Everybody should be testing.” And I’m like, well, there’s the culture of accepting testing, but then there’s kind of also the, “Do you have enough people with independent thought and hypothesizing and ideas and changes being made often enough in order to make that worthwhile?”
18:36 AG: Right.
18:36 MK: I feel like we have the opposite problem. Our company is like, “Let’s test everything!” And you’re like, “Whoa, hold back a minute. Only test the stuff if you’re gonna make a decision off it. If you’re gonna ship it regardless, there is no point in testing it.” Well, from my perspective, but feel free to disagree.
18:50 TW: Well, what’s similar with… What Pinterest and THE ICONIC have in common is that both are pretty much pure-play online. The transactional… Whatever the transaction is, is occurring online. So I think there is a commonality, even though one’s a community, one is a product company. I don’t know. I feel like… ‘Cause there’s the scaling, like what, Andrea, I think you had to go through was it sort of took off, and you championed it and pushed it and made it happen, and then you said, “Oh, now I gotta scale this up. And how do we do this where we don’t just kind of descend into chaos, where there’s no organization and plan?” Hopefully, we’ll get to that as well. Looks like you’ve got some great thoughts and checklists and sort of where the bottlenecks are. But I think it’s also kind of responsible to point out that you’ve gotta have the right factors that say, “Hey, we’re ready to move into testing,” or you’re gonna be… It’s not even really a culture, it’s… Well, maybe it’s cultural or structural or business model or something else.
19:56 AG: Well, and do you have metrics that you can measure relatively quickly, so that you can get quick results on your A/B test and iterate quickly, which is usually the goal of running a lot of these.
20:06 TW: Yeah, do you have a meaningful outcome that’s low latency, that you could actually get to something?
20:14 MK: As you started ramping up your A/B testing efforts, did you find that each test was using fairly consistent metrics, or was it completely different metrics depending on what you were testing?
20:24 AG: We had only implemented about, I don’t know, 40 or 50 metrics that were tracked automatically, so everyone was sort of constrained to using those ones. You could always run custom queries, but it was… Most people don’t wanna run SQL queries or Hive queries, when they could look at a dashboard. I think it was also pretty early days, so we could only detect changes of a certain size, and it actually was reasonable to expect that your change to onboarding would get 10% of all users to come back, or would increase repins by 2% or 3%. And so it was fairly coarse, but it met our needs.
21:00 TW: So that… Is your analytics also homegrown? I should know that…
21:04 AG: When you say our analytics, so…
21:06 TW: Your web analytics.
21:07 AG: I will use kind of log things. Yeah, yeah, we don’t have Google Analytics or anything. We have front-end client logging, or we’ll log to Kafka, and then we aggregate things.
21:16 TW: Which is… So because you’ve got users logging in, so it sounds like you’ve got that, and then it almost… Because there’s an upside to having that in-house built behavioral tracking and logging, is that it presumably forced some discipline around what actually matters, what do we really care about? They’ll make the Adobe and Google people cringe horribly, but you’re like, “Hey, these are the metrics we have. We’re tracking these because they matter, therefore when you’re testing, it’s integrated, tied into that… ” Again, I think…
21:50 AG: Right. And I think one of the challenges with some of the tools, like Optimizely six years ago, was that the outcomes you had to measure had to be fairly immediate, and fairly front-end-oriented. Like “10 people click the sign-up button” might be your success metric. And for us, it was much more about saving content they are interested in from a variety of surfaces on Pinterest. And so either we would have to hook these external tools up to our internal databases somehow, or honestly, since we are already tracking all these things, all you have to do is join the user ID in the experiment to the user ID who took the action, and you can do all the analytics yourself.
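The join described here can be sketched in a few lines. This is an illustration under assumed data shapes, not Pinterest's pipeline: the `assignments` mapping, the `actions` log, and the function `metric_per_user` are hypothetical names, and the data is made up purely to show the mechanics.

```python
from collections import defaultdict

# Hypothetical experiment assignments: user ID -> group.
assignments = {"u1": "control", "u2": "treatment", "u3": "treatment", "u4": "control"}

# Hypothetical action log: (user ID, action type). "u5" is not in the experiment.
actions = [
    ("u1", "repin"),
    ("u2", "repin"),
    ("u2", "click"),
    ("u3", "repin"),
    ("u5", "repin"),
]


def metric_per_user(assignments, actions, action_type):
    """Join the action log to experiment assignments on user ID and
    return the number of matching actions per assigned user in each group."""
    users_per_group = defaultdict(int)
    for group in assignments.values():
        users_per_group[group] += 1

    actions_per_group = defaultdict(int)
    for user_id, kind in actions:
        group = assignments.get(user_id)  # the join: None if user not in experiment
        if group is not None and kind == action_type:
            actions_per_group[group] += 1

    # Divide by everyone assigned, not just users who acted, so that users
    # who never came back still count against the group.
    return {group: actions_per_group[group] / n for group, n in users_per_group.items()}


repins = metric_per_user(assignments, actions, "repin")
```

Using every assigned user as the denominator matters for the onboarding case discussed earlier, where the question is whether exposed users come back at all.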
22:24 MK: So can you talk to us about the point you got to… You’ve gone on this incredible… Oh God, journey… [chuckle] Of A/B testing, but then you got to some point where you’re like, “Okay, I can’t be the only person in the company doing this.” What did that look like?
22:46 AG: It looked like me being really stressed out, and having to do a lot of code reviews, and not getting anything else done. And I think I was on a team of three, and the other people on the team sort of saw this happening, and so I said to them, “I just can’t do this. What can we do instead?” And so together, we came up with this idea of me training them to do what I did, which required me to write down all the things I was thinking about. And then once I had that written down, we said, “Okay, well now that three of us understand it, and we’ve written everything down and clarified your notes and so on, we can spread that beyond the three of us, and try to get people across the whole team to work on it.”
23:22 TW: And what were the buckets of what you were writing down? How much was, “This is the sequence of steps”? How much was identifying a good experiment? How much was, “These are the mechanics of not screwing something up”?
23:36 AG: Yeah, so basically, I had set up an email filter that would move to my inbox anything whenever anyone was changing an experiment, and I would just poke my nose in on the code review, and look at how they’d set it up, and try to get them to not screw things up. So what that looked like was me writing down basically all of the screw-ups I’d seen, and what had happened, and then we started thinking about how you would avoid them. So people thought they were allocating 5% groups, but they were actually allocating half a percent groups or 4% groups. Or people would ramp up, and then ramp down ’cause they realized they’d broken something, but then they’d put those same users back in the next day, without realizing that maybe having the site be broken for two hours yesterday while you were testing would affect the metrics today for those same people. That sort of thing.
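One of the mistakes mentioned here, a group that was meant to be 5% but is really 0.5% or 4%, is the kind of thing an automated check can catch. The sketch below is one possible check, not something described in the episode: the function name `allocation_looks_right` and the threshold of 4 standard deviations are assumptions. It compares the observed group size with the size expected under the intended fraction, using the normal approximation to the binomial.

```python
import math


def allocation_looks_right(observed_in_group, total_users, intended_fraction, z_threshold=4.0):
    """Return True if the observed group size is plausible for the intended fraction.

    Uses the normal approximation to the binomial: under correct allocation the
    group size has mean n*p and standard deviation sqrt(n*p*(1-p)).
    """
    expected = total_users * intended_fraction
    std = math.sqrt(total_users * intended_fraction * (1 - intended_fraction))
    z = (observed_in_group - expected) / std
    return abs(z) <= z_threshold


# Intended 5% of 100,000 users: about 5,000 expected.
ok = allocation_looks_right(5_020, 100_000, 0.05)         # close to 5%
too_small = allocation_looks_right(500, 100_000, 0.05)    # really a 0.5% group
off_by_one = allocation_looks_right(4_000, 100_000, 0.05) # really a 4% group
```

A check like this only covers group sizes; the ramp-down-and-reuse problem described above needs a record of which users were previously exposed.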
24:21 MK: And so when people were reading, and particularly your colleagues were reading all of these, I guess, common mistakes that had been made, did they get it right away, or did they have to kind of, as they were learning, make some of those mistakes themselves for it to actually really sink in?
24:36 AG: So that was where we tried to sort of turn this list of mistakes I’d seen into a checklist of how would you anticipate those mistakes? And then we also turned our training program into an apprenticeship. So you were actually on the hook for looking at all of the experiments going out on Pinterest, and looking at the code reviews the same way I had, and trying to catch those mistakes. ‘Cause it’s one thing to theoretically know, “Okay, these things might happen.” It’s another to be told, “Hey Andrea, I know you looked at that code review for the French experiment, but you didn’t notice that it actually wasn’t restricted to users speaking French. See how that’s here on the checklist. How can we make that more clear, so it doesn’t happen?” This sort of thing. Not in a confrontational way, but in a, “We all wanna try to run experiments that aren’t either messing the user experience up, or that throw away all of our hard work. And so what can we do to keep improving these checklists, to make fewer things fall through the gap?”
25:28 TW: Oh, I just feel like I need to read Atul Gawande’s book.
25:32 AG: It’s so good.
25:33 TW: I’m such a fan of checklists that…
25:34 AG: Well and it’s fascinating ’cause he talks about…
25:36 TW: I feel like, I’m like, “What’s it gonna tell me?” It seems so obvious like, “Yes, of course, checklists.”
25:40 AG: Right… It’s like surgeons and pilots who have years and years of advanced training, and yet having these simple checklists of, “Did you count the scalpels on the table before, and did you count the scalpels on the table after?” turns out to improve outcomes dramatically. And so what we are doing is much lower stakes, thankfully, than flying an airplane or doing surgery, and much, much simpler. But even so, just reminding people of the things they need to check, and giving them a simple thing to work through as they’re doing a code review proved to be really valuable.
26:10 MK: And you’ve had a few different checklists, correct?
26:13 AG: We did. So basically, we made an analogy between running an experiment and flying a plane, which again, oversimplified; I would much rather run an experiment than fly a plane. But basically, for every experiment, there’s some point when you start it, there’s some point when it’s already running, and you might wanna change it, and then there’s some point when you think you wanna end it. And the pitfalls of each of those stages are different. And so at the start, a lot of the things we wanted people to check for were, “Did you write down what your hypotheses are? Are you triggering the experiment in the right place? Did you write down what your experiment does, and what each treatment group actually does?”
26:44 AG: And then partway through, it’s, “Does your experiment look like it’s doing what it’s supposed to? Are you ramping up the groups because you actually need more power, or do you already have enough statistical power, and you could just wait a week?” These sorts of things. And then at the end, “Do you have enough data to make a decision? Are you making the right decision? Were there pre-existing differences in your groups that negate your decision?” These sorts of things.
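The mid-flight question of whether a group already has enough statistical power can be made concrete with a standard sample-size calculation. This is a textbook approximation for comparing two proportions, offered as a sketch; it is not a description of the checklist or tooling used at Pinterest, and the function name, the default significance level of 0.05, and the default power of 0.8 are assumptions.

```python
import math
from statistics import NormalDist


def users_needed_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect a relative lift in a rate.

    Standard two-sided, two-proportion formula:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical numbers: a 10% baseline rate.
n_for_10_percent_lift = users_needed_per_group(0.10, 0.10)
n_for_2_percent_lift = users_needed_per_group(0.10, 0.02)
```

With these hypothetical inputs, detecting a 10% relative lift needs on the order of 15,000 users per group, while a 2% lift needs far more, since the required size grows roughly with the inverse square of the effect. If the current groups already exceed the required size for the effect of interest, waiting longer may be enough and ramping up is unnecessary.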
27:06 TW: Wow, that is… I don’t think I’ve… I think that’s… ‘Cause you’ve also inherently baked in a… Well, onboarding for additional resources, as well as some ability to govern… A process that’s being governed, which means that you don’t have to do some massive knowledge transfer. And I think the airplane analogy is awesome, ’cause I feel like there are… It does seem like in testing, there are… You have to fight against the people who say, “We’ll look at the tests every day, and just end it when we’ve hit significance.” And everybody at this point hopefully knows that’s not good. And then you have the, “Oh no. We decided we’re gonna run it for two weeks. We’re not gonna look at it at all. We did our calculations,” in which case they may say, “Oops, we actually screwed it up. We should have been able to identify that the test was not working, not working as intended two days in, and correct it, but now we’ve kind of gone nuts.”
28:05 TW: So shifting a little bit, and maybe this is… I’m not super clear on exactly what your role and the other two people on your team, that when it comes from an analyst side of the organization, I’ve seen people think, “Hey, I’m an analyst. I know how to crunch the data, and I just need to get a tool, and I’m set,” and kind of underestimating the need for UX and creative resources, as well as for development and IT resources. Did you run into that at all? That you can’t just scale necessarily just more of you, you need you and the other parties that are involved.
28:44 AG: Right. I think we never had a shortfall of ideas of things we wanted to try building. And so we had UX engineers who were already planning to build this new experience. It was just a matter of getting them to build experimentation into that. And so I would help with coding that part, or helping them set up that part, but I didn’t need to try to convince people that we should run a test on building this thing. We were already planning to build a different search relevance engine, or a different UI for Pin Close-up, or whatever else it was.
29:14 TW: But did it open up to where they were already building a new experience, but instead of building a new experience, they would build a couple variations of the new experience?
29:23 AG: Sometimes. I think one thing we learned from this huge redesign was that when you lump together 3,000 different changes, it’s really hard to tell which of them made the difference in your metric. And so sometimes we would work with them and say, “Hey, you’re changing a lot of different things here. Maybe you should add another treatment that isn’t one you’re planning to ship, but that’s sort of an intermediate in between them, so you know which of these two big things you’re changing was more important.”
29:46 TW: Is that the sort of thing that multivariate should… A multivariate test, in theory, you’d chunk up those different things, and be able to do your partial factorial, or whatever magic you use…
29:58 AG: Bonferroni correction, etcetera.
30:00 TW: Oh Bonferroni correction, ooh, yeah.
30:02 AG: Yeah. I think there’s a lot of complexities that people can get really hung up on with multivariate testing. To be honest, Pinterest is really lucky in that we have a lot of users, and as I said early on, like a lot of the changes we were making had a pretty big impact. So we didn’t have to worry too much about a lot of the details of that. It was much more valuable to just try a couple things out, and you would see 5%, 10% differences between them. And then with tens of thousands or hundreds of thousands or millions of users, and then the Bonferroni correction changes your p-value from 0.0000001 to 0.0000002, and you don’t care.
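To make the arithmetic in that last remark concrete, here is a minimal sketch of a Bonferroni adjustment. It assumes two comparisons against control and uses made-up p-values; the function name is hypothetical and this is not Pinterest's tooling.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values for two treatment-vs-control comparisons.
raw = [0.0000001, 0.03]
adjusted = bonferroni_adjust(raw)

for before, after in zip(raw, adjusted):
    print(f"raw p = {before:g}, adjusted p = {after:g}")
```

With a very large effect and a very small raw p-value, doubling it does not change the decision, which is the point being made. A borderline result such as the second value is where the correction starts to matter.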
30:38 TW: So given you’ve got so much scale, do you still… Do you set kind of a minimum length on the test, to say at least you need to go to a one-week… I need to cycle through a weekend, even though that’s gonna give us more… I guess, do you control the… Say we’re willing to do 5%, we don’t need to do it 50-50 split ’cause we’re swimming in users?
31:01 AG: We definitely almost never do a 50-50 split. We do try to get people to run for at least a week, both because our usage and users vary a lot by day of week, but also because a lot of what we care about is sort of long-term effects. And especially for UI changes, as you said, sometimes it would take a while to settle in, but even for relevance changes. One analogy I make is like, let’s say we figured out the best 20 recipes for pie, and you search for pie, and we show those 20 results to you, like you’re pretty happy. But then if you come back the next day, and search for pie, and get exactly the same results, now you’re not very happy anymore ’cause you were hoping to get some variety, or at least be able to scroll down and see different things.
31:38 AG: And so I think things like novelty effects matter for a whole lot of different experiences, even if it’s not, “I change the button, and everyone clicks it ’cause they don’t know what it is,” even if it’s much more backend, understanding what the effects are over a longer period was really important to us. That’s always a struggle. Pinterest, probably like everywhere else, always wants to move quickly, always wants to ship something and move on. Sometimes we’ll do that via longer term holdouts, or whatever else, but we do try really hard to get people to wait for anything, where we think that effect might be important.
32:09 TW: The challenge of novelty effects, that’s actually… ‘Cause I think a lot of times, we think, “No, we wanna find the right experience, and we want it repeated,” but you’re kind of like a news media site in that sense, that I don’t wanna keep going back to a news site, and keep seeing the same headline four hours later. Wow, that’s gotta be another fun challenge, yeah.
32:30 AG: It’s a real challenge, like we’re trying to balance the things you’re most interested in, and which you’ll repin the most.
32:34 TW: Or the flip side is, I saw that cool thing, I didn’t pin it, let me go back and find it again. If you have too much variety, then they’re like, “Ugh, I know it was here. I just didn’t have time… ”
32:43 AG: Right.
32:43 TW: To pin it.
32:45 AG: And if we only show you sweet potato gnocchi like, yes, I love sweet potato gnocchi, but there’s only so many ways to prepare that. So how do we sort of introduce more diversity into your feed? That’s like people who like sweet potato gnocchi might also like charred enchiladas, or whatever else.
33:00 TW: Wow, I love gnocchi, and I hate sweet potatoes, so I’m now conflicted.
33:02 AG: Oh you’re missing out.
33:04 TW: Oh my God. Sounds healthy, a lot of vitamin B. It does…
33:08 AG: Not when you put the brown butter sage sauce on it.
33:10 MK: Okay. [laughter]
33:13 MH: So I’ve been thinking as we’ve been talking, this is maybe even just a question for all of us to ponder. So obviously Pinterest as a platform, as a website, they’re… You guys aren’t necessarily trying to get people to do something beyond sort of they use the product, and use the various functionality, which could be construed as a little bit different say than… Compared to like what Moe’s doing at THE ICONIC, which is they want people to buy products, and they’re trying to merchandise specific things. Do you feel there’s a difference in the way that people perceive, or approach testing? If it’s how do we enhance our product, which is the website itself, versus how do we get people to buy products which are displayed on the website?
34:01 AG: I actually disagree a little bit with your premise.
34:02 TW: Ooh. Nice. I like it. [laughter]
34:05 MH: That’s good. No, that’s good.
34:07 AG: So Pinterest’s mission is to help people discover, and do things they love. And so we actually have a lot of debate about things like spending more time on the site, whether that’s a positive outcome or not. Because we do want people to find the right recipe, and go out and make it; or figure out how to build their coffee table, and go out and build it. So we do have metrics around click-throughs and purchases as well. You know, retailers who link back to us, and say we can attribute this purchase to Pinterest, or just we know this user bought something after having used Pinterest, those sorts of things. So it is maybe more action-oriented than you perhaps think.
34:41 MH: Okay. Well yeah, and that’s me not knowing a lot about how everybody uses Pinterest ’cause I use it very sparingly. [chuckle]
34:50 AG: I think Pinterest is also interesting in that way ’cause I think everyone uses it in really different ways. ‘Cause I use it a lot for cooking, and I’m thinking a lot about how we could improve that. But other people use it just for fashion, or just for collecting beautiful photos and places they wanna travel. And so when we think about the metrics of our A/B tests, we have to think about… I talked a little about how we have millions of users, it’s not a problem. That’s a bit glib on my part ’cause we have millions of users, but we wanna be helpful to both people in Japan and the people in the US, and also people who use it to collect beautiful pictures, and don’t really care where they’re linked, versus people who are doing recipes, and really wanna be able to click through the thing to find good stuff. So there’s all these different use cases, and segments of people, that we try to keep in mind, and we try to understand what metrics we can develop that will help us know if we’re improving the product for them.
35:34 MK: That is super interesting though, the idea that you would have such different segments that could use the same product for a completely different purpose. Like to be honest, I’ve never thought about using it for recipes, and now that’s gonna be my thing that I do later today.
35:47 TW: Andrea has an entire board that is nothing but sweet potato gnocchi recipes. It’s pages and pages long.
35:54 MH: But actually, there was something you said in there which I really think it was a big takeaway, which is, you guys spend a lot of time thinking about what would be the best way for a user to use the product or platform. Is it best for them to spend a lot of time, or is spending a lot of time, does it hit its maximum, and now it becomes a detriment? And so I thought that was really good because I think sometimes it’s hard to sort of be like, “Oh yeah, we wanna make this metric go up, up, up, up, up,” without thinking about if there’s a cliff that you can go over that might not be optimal anymore.
36:31 AG: We think a lot also about not getting the number of repins, the number of clicks to go up because there are people who just are on Pinterest all the time, all the time. What we really care about is making it useful to you, for whatever you wanna use it for, making it useful for more people. So people who repin, people who click is often our key metric, as opposed to absolute number of repins or clicks ’cause it’s much easier to get someone from their 1,000th repin to their 1,001st, than from their zero to their first. But there’s a lot more sort of user value in that very first one of, “Oh, I finally understand what this weird Pinterest thing is, and I found something I like, and I wanna save it for later.”
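The difference between counting actions and counting people who act can be sketched in a few lines. This is an illustration only: the event format, the user IDs, and the function name are assumptions, not Pinterest's actual metric definitions.

```python
from collections import Counter


def action_metrics(events, action):
    """Return (total actions, distinct users with at least one such action)."""
    per_user = Counter(user for user, act in events if act == action)
    return sum(per_user.values()), len(per_user)


# Hypothetical event log: one heavy repinner, one clicker, one first-time repinner.
events = [("u1", "repin")] * 3 + [("u2", "click"), ("u3", "repin")]
total_repins, repinners = action_metrics(events, "repin")

print(f"total repins: {total_repins}, people who repinned: {repinners}")
```

A heavy user adding one more repin raises the first number but not the second. The second number moves only when someone repins for the first time, which is the behavior described above as carrying more user value.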
37:09 MK: Do you guys have a North Star Metric? I’m really… I’ve become obsessed about this ever since I discovered that Airbnb have this North Star Metric and everyone, every test, everything in the company is kind of aligned to that. And I really liked the idea, but just like in a use case like Pinterest, where everyone’s using it for different purposes, I feel like that would be really tough.
37:33 AG: I think it is really tough. To be candid, probably each year or two or three, we switch what the North Star Metric is, but they’re all oriented around the same idea, which is, as I said, our mission is helping people discover and do things they love. So sometimes we think that’s more click-through, sometimes we think that’s more saving for later, but the basic idea is we at least want people to be doing one or both of those things. And so our North Star Metric is something like repinners or click-throughers, or repins or click-throughs, depending on sort of the flavor of the year or two. But I think it’s also… We do track it for every experiment, and you absolutely can’t decrease it by much.
38:06 AG: But a lot of things you do won’t necessarily move that North Star Metric ’cause it is millions of users, and all you’re doing is changing how we treat five-word search queries. And so figuring out the smaller metrics that are directionally aligned with the North Star is often as valuable as knowing what that North Star is.
38:26 MK: Yeah, that makes sense.
38:28 TW: Jumping off of that, I assume that there is massive scale on mobile apps, as well as massive scale on the website. The actual balance is neither one of them is immaterial. So where does… Do you wind up in the mobile app testing world as well? Is that similarly… Are there unique challenges there? Did that come later?
38:51 AG: Yeah, that’s a good question. It did come a few months later. I guess Summer 2012 was when I implemented our A/B testing framework on the web. For the app, you need to think about sort of latency to the API, where you don’t want it to have to wait for the API to reply to get the experiment groups. So we implemented that it would fetch it at the beginning, and then have a sort of asynchronous callback to say when the user had triggered into one of the groups, and so on. It’s evolved now some, so I think the clients actually can do the bucketing locally with the same hashing logic as we have on the backend.
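Deterministic, hash-based bucketing of the kind mentioned here can be sketched as follows. The hash function (SHA-256), the salt format, the group names, and the 5% allocations are all assumptions for illustration; the episode does not describe the real hashing logic.

```python
import hashlib


def assign_group(user_id, experiment, groups):
    """Map a user to an experiment group using only the IDs and a hash.

    groups is a list of (name, fraction) pairs whose fractions sum to at
    most 1.0. Users falling outside every range get None (not in the test).
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for name, fraction in groups:
        cumulative += fraction
        if point < cumulative:
            return name
    return None


# Hypothetical allocation: 5% control, 5% treatment, everyone else untouched.
groups = [("control", 0.05), ("enabled", 0.05)]
first = assign_group("user-42", "blue_button", groups)
second = assign_group("user-42", "blue_button", groups)
print(first, second)
```

Because the assignment depends only on the experiment name and the user ID, a client and a backend running the same function agree on the group without a network round trip. Salting with the experiment name keeps assignments in one experiment from lining up with assignments in another.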
39:22 AG: Many of our tests run across all platforms simultaneously. That can be tricky ’cause the release schedules are obviously different. It used to be that with fast prototyping, we would often do things web-only, ’cause obviously you can ship the web in 30 seconds, as opposed to waiting for the app store to approve your new version. But because so much of our traffic has moved to mobile, we now do many more tests that are mobile-only. See how it looks on one of our platforms, either iOS or Android. And then…
39:47 MK: So wait, I’ve got to ask, so how… ‘Cause this is exactly what we do, we test on web because apps are just too hard, and it’s like how do you time our iOS developer’s changes to be in sync with the Android? And then iOS always takes longer to approve. How are you doing this? Are you timing them, running them for different periods, or are you… Basically, you wait until you get iOS approval, and then everything goes?
40:12 AG: So often we’ll test it on whichever platform is ready first, but build in sort of two experiments. So one will be iOS blue button, and one will be Android blue button, but then there will also be a blue button. And so if you’re in either iOS blue button or blue button, you get the blue button. We measure on the iOS one, make sure everything looks reasonable; and then when we’re ready, we can switch over to that parent experiment, to make sure that the consistency across platforms is enforced.
40:37 MK: Okay, so this is a really naff question, but I’ve got to ask. So we’ve done the same a few times, and then when it comes to analyzing your results… ‘Cause there’s always this question then, is the result different on Android to iOS, and should you be checking that, or should you be combining them all? ‘Cause ultimately, they’ve all been exposed to a blue button. How have you been managing that?
41:02 AG: We manage it two ways. The easiest way is that our dashboard lets you split the results by platform that the action happened on. So you can look at whether repins on Android are down or up, versus on iOS. That works pretty well, ’cause most people don’t switch devices all that much. But you can also break it down. We have a script that we run by the platform on which they were first exposed, and do the same thing, as then you know that there’s no sort of cross-platform switching that’s messing things up. Like if everyone switches to their iPhone because their experience is so terrible on Android, you’ll see that in those metrics.
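The two breakdowns described here give different answers when users switch devices. A small sketch under assumed data shapes (the tuples, user IDs, and function names are hypothetical, and this is not the dashboard or script mentioned):

```python
from collections import Counter


def repins_by_action_platform(repins):
    """Count repins by the platform where each repin happened."""
    return Counter(platform for _user, platform in repins)


def repins_by_first_exposure(exposures, repins):
    """Count repins by the platform where each user was first exposed."""
    first = {}
    for user, timestamp, platform in exposures:
        if user not in first or timestamp < first[user][0]:
            first[user] = (timestamp, platform)
    return Counter(first[user][1] for user, _platform in repins if user in first)


# Hypothetical data: u1 first sees the test on Android, then moves to iOS.
exposures = [("u1", 1, "android"), ("u1", 5, "ios"), ("u2", 2, "ios")]
repins = [("u1", "ios"), ("u1", "ios"), ("u2", "ios")]

by_action = repins_by_action_platform(repins)
by_first = repins_by_first_exposure(exposures, repins)
print(dict(by_action), dict(by_first))
```

Split by action platform, Android shows no repins at all. Split by first exposure, the user who started on Android still counts toward Android, so a migration away from a bad experience stays visible in that platform's numbers.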
41:33 MK: Nice.
41:33 AG: There’s a lot of complicated questions like this, of how do you figure out what segments are important to look at? Once you’re looking at 20 segments, of course, some of them have statistically significant differences that are spurious. It gets more and more complicated, for sure, that I think at the start, you said, “Now that you’ve solved the experiment culture problem… ” No, we have not solved by any stretch of the imagination the experiment culture problem.
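The remark about 20 segments can be put in numbers. This sketch assumes each segment is tested independently at a 5% significance level and that there is no real difference in any of them; both are simplifying assumptions, and the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


rate = familywise_error_rate(0.05, 20)
print(f"chance of at least one spurious 'significant' segment: {rate:.2f}")
```

Under those assumptions the chance of seeing at least one spurious difference is closer to two in three than to one in twenty, which is why a single significant segment out of many deserves skepticism.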
41:55 MH: Once you get your culture wrapped up, then you could start tackling the hard problems.
42:00 AG: Well that’s one of the things I say, is like, “If you build the right tools, I think the hard thing is getting people to be thoughtful.” “Why am I running this? What will it do? What do I expect it to change? Can I measure that thing?” And tools can’t really help with those things. They can help you move faster, and see metrics faster, and detect anomalies faster, and realize you bucketed people wrong faster. And so to the extent we can build tools that do all the things tools can do, then we can use our brains and our time to try to get people to be more thoughtful about why they’re doing these things, and how they’ll measure success.
42:29 MH: That’s great. All right, we have gotta wrap up ’cause we are running out of time, and this conversation is really, really good. Okay. So one of the things we love to do on the show is called “the last call.” It’s where we go around, and we just mention something we found recently that we think is interesting, or we think people might find interesting. So Andrea, you’re our guest, would you like to go first?
42:53 AG: Uh-oh. So this is not super analytical…
42:55 MH: That’s okay.
42:56 AG: But some guy on the internet is using canvas and generative statistics to make posters for the World Cup, and I am a big soccer fan. And so he makes these really cool graphics which show how many goals were scored by which teams and when, and it comes out in an art form to represent the soccer game, which I think is pretty cool. And you can Google, “Generative World Cup posters canvasing,” and you’ll find them.
43:17 MK: That’s awesome.
43:18 AG: But they have… They’re very neat. I love soccer, I love statistics, so it’s my thing for the week.
43:23 MH: Nice. Well then… And we already know that France won the World Cup, so that’s pretty exciting. [laughter] Based on when this was recorded. Okay.
44:01 TW: Although I think we had Ben Gaines on a couple years ago, who’s an Adobe Analytics Product Manager there, just to talk about sports analytics. And he’s a big NBA fan, but I can see a glorious return of Andrea for a discussion of…
44:16 MH: Soccer analytics? All right. Tim, you wanna go next?
44:20 TW: Sure. Mine’s kind of also sort of fun. So a couple weeks ago, I pushed out my explanation of distributions and binomial metrics in video form, on our site. This is not meant to be a logroll for that, ’cause that’s already up. But I did the whole thing in a xkcd format, and in the course of just checking some things on the xkcd front, I found the explainxkcd.com. There’s an entire Wiki that basically explains every single xkcd comic, or it’s 80% of them. Which is just kind of fascinating, I’m intrigued by it. It’s funny in its own right, the fact that Randall Munroe has the level of following, that there’s an entire site that just explains his comics. Which most of them I generally explain, but it’s just a fun little site to poke around. And it also has some very useful sort of categorization of the comics as well. I think that’s how I wound up at it, is that I was looking for a specific comic topic, and I found it on explainxkcd.com.
45:31 MH: Very nice. Alright. Moe, what do you got?
45:35 MK: Okay. I’m gonna just slowly change people’s lives, like one tiny trick at a time. So…
45:44 TW: What is the latest BigQuery shortcut you got?
45:47 MK: Yes, it is. It changed my life. So I migrated from legacy SQL to standard SQL a while back. And despite all of my ranting and raving, it doesn’t seem like the Google Cloud team is gonna change the default from legacy SQL to standard.
46:06 MH: They know you have a podcast right, Moe? [chuckle]
46:10 MK: Anyway, what I did discover from them is a trick that apparently all of the Googlers know about, which is that if you do a hash, and then standard SQL at the start of all of your code, it will always automatically run standard SQL instead of legacy.
46:27 TW: So you do hash, and then just standard SQL, one string?
46:32 MK: So yeah, basically you just do hash, and then you write standard SQL, it doesn’t matter if it’s caps or lower case, and you put it at the first line of all of your code. Any comments have to be below it, it literally has to be the first line. And it will always run, even if you have the legacy SQL box ticked. And this blew my mind, and it’s been pretty amazing. But I also have a tiny twofer, because the other one that I wanna do is just a shout-out, because the YOW! Conference where Andrea and I met, have two conferences coming up, one in Perth and one in Singapore. And they’re basically conferences and workshops for developers, by developers. So there’s one in Perth, in Australia, September 5 to 6; and there’s also one in Singapore on September 8. And they’ve still got early bird tickets available.
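For anyone building query strings in code, the prefix trick described above can be applied with a small helper. This is a sketch: the helper name and the sample query are made up, and it only prepares the text; it does not call BigQuery.

```python
def with_standard_sql_prefix(query):
    """Put the #standardSQL marker on the very first line of a query."""
    body = query.lstrip()
    first_line = body.splitlines()[0] if body else ""
    # The marker is described as case-insensitive, so accept any casing.
    if first_line.strip().lower() == "#standardsql":
        return body
    return "#standardSQL\n" + body


# Hypothetical query text; any comments go below the marker line.
prefixed = with_standard_sql_prefix("-- daily check\nSELECT 1 AS x")
print(prefixed)
```

The helper leaves a query alone if the marker is already on the first line, so it is safe to run over query text that may or may not have been prefixed by hand.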
47:18 MH: Outstanding. Okay. My last call is my new Pinterest board, called “Analytics Podcasts,” where I’m capturing… Oh no, I’m just kidding. I did create one while we were on the show today, but that’s not really it. So I recently read an article by a guy by the name of Niels Hoven, who’s a product manager. And he was talking about how, as he ran product management for some mobile game companies over the last 10 years, he noticed how data and analytics could sometimes take over and create what he called sort of a misrepresentation of what you should actually do to capitalize on pleasing users, and deliver the right experience to your mobile games. And so things like paywalls and how often you put up advertising and all these things were all being optimized by the data because it drove revenue to other targets, but actually created negative outcomes in the lifecycle of the apps that they would be managing.
48:25 MH: So it was a really good article. It is called Stop Letting the Data Decide. I actually found it really interesting, and not too off-topic from what we were just talking about on the show. So definitely check that out. It’s very thoughtful writing. Okay. We obviously did not nearly cover enough on this journey, or should I say walkabout, Moe, through understanding what it takes to scale experimentation in a culture of testing. But it was a great discussion. We would love to hear from you. And the best way to reach us is through the Measure Slack, or you can reach us through our Facebook page, or our website, or on Twitter. And Andrea, thank you again for being on the show, and sharing your knowledge with us and our listeners. You have a wealth of information, so thank you for sharing it. Or wisdom on your long trek through experimentation. See, there’s lots of synonyms, Moe. Lots of them. You have voyage…
49:33 MK: Lots.
49:37 MH: Anyways… But yeah, thanks so much for coming on the show. And I know that I speak for both my co-hosts, Moe and Tim, when I tell all of you out there, keep analyzing.
49:50 S?: Thanks for listening, and don’t forget to join the conversation on Facebook, Twitter, or Measure Slack group. We welcome your comments and questions. Visit us on the web at analyticshour.io, facebook.com/analyticshour, or at Analytics Hour on Twitter.
50:09 S?: So smart guys want to fit in, so they made up a term called analytic. Analytics don’t work.
50:19 S?: We bring a lot of impostor syndrome to this podcast, [50:23] ____.
50:25 S?: I have so much of that, I should fit right in.
50:31 S?: And Michael will likely illustrate how to not do a hard stop before starting. He’ll see me, even have a little minor stroke. Correct him, and then you’ll realize that it totally makes sense. It’s just a hard stop. And then you’re good to go.
50:43 S?: I would just go with no editing whatsoever, but Tim is a stickler for… No, I’m just kidding. [chuckle]
50:50 S?: My connection got bad. I’ll see you guys later.
50:56 S?: No. I was trying to not record in my bed, because I think it’s unprofessional. But it’s the best Wi-Fi spot in the house.
51:04 S?: Is your pillow properly fluffed? Do you have the comforter draped across?
51:08 S?: Yeah. Oh, geez.
51:11 S?: I wanna kinda shift and ask another sort of question. I’ve got a phone ringing in the background, which apparently my phone is… This is like the fourth time it’s rung during this. I did not shut my office door.
51:24 S?: I moved location, so no judgment.
51:26 S?: That’s okay. I’m gonna take my shirt off. Is that a problem? Cool. [51:28] ____.
51:33 S?: Oh, my God, Tim, if you don’t put that in, I’ll kill you.
51:36 S?: Rock, flag, and sweet potato gnocchi.