#040: Google BigQuery with Michael Healy

Published: Jul 5, 2016

In this episode, we dive deep on a 1988 classic: Tom Hanks, under the direction of Penny Marshall, was a 12-year-old in a 30-year-old’s body… Actually, that’s a different “Big” from what we actually cover in this episode. In this instant classic, the star is BigQuery, the director is Google, and Michael Healy, a data scientist from Search Discovery, delivers an Oscar-worthy performance as Zoltar. In under 48 minutes, Michael (Helbling) and Tim drastically increased their understanding of what Google BigQuery is and where it fits in the analytics landscape. If you’d like to do the same, give it a listen!

Technologies, books, and sites referenced in this episode were many, including:

Google BigQuery and the BigQuery API Libraries
Google Cloud Services
Google Dremel
Apache Drill
Amazon Redshift (AWS)
Rambo III (another 1988 movie!)
Hadoop
Cloudera
Observepoint Tag Debugger
Our Mathematical Universe by Max Tegmark
A Brief History of Time by Stephen Hawking
A video of math savant Scott Flansburg

Episode Transcript

The following is a straight-up machine translation. It has not been human-reviewed or human-corrected. We apologize on behalf of the machines for any text that winds up being incorrect, nonsensical, or offensive. We have asked the machine to do better, but it simply responds with, “I’m sorry, Dave. I’m afraid I can’t do that.”

[00:00:24] Hi everyone. Welcome to the Digital Analytics Power Hour.

[00:00:28] This is episode 40. Big 40. Big 40. Tim Wilson, my cohost: did you ever think we’d make it to 40, big Tim? We’re here to talk about BigQuery. Big 40th episode. Yeah, it’s our episode 40, and we’ve almost never talked about big data, specifically Google’s BigQuery, and it’s been around for a long time.

[00:00:53] It’s very popular. We thought we would talk a little bit about that, and maybe other cloud data platforms if we feel like it. So we needed someone who knows a little something about it. Luckily, we know a guy. Michael Healy is a data scientist at Search Discovery. Prior to SDI he was at [unclear], where he built analytics tools from the ground up, and he is our friend and colleague. Welcome to the show, Michael. Well, hello to both of you. So here we are to talk about BigQuery. First off, why is it named BigQuery?

[00:01:28] That’s an excellent question. It’s named BigQuery because you’re allowed to run very large queries against it. BigQuery is what they call analytics as a service internally at Google, and they describe the acronym, unfortunately, as AaaS, which is what it is. And so their goal is to provide a perfectly scalable, managed service which allows you to scale your analytics services to whatever you need to do.

[00:01:51] Somewhere at some point there was a possibility there would be big ass.

[00:01:55] I can’t speak to that because I never worked at Google, but I’m sure, my guess is, it came up. You’d have to search random config files on servers for that. So yeah, that’s what BigQuery is: a way to store and query data at basically any scale, on a completely managed platform, so you have very low overhead in terms of managing data.

[00:02:15] But it is a big data store for large amounts of data, even though the name is kind of referring to the query, and that’s just because of the nature of the way the store is structured, which allows it to be.

[00:02:26] It may be a product differentiation. When we talk about Google Cloud services internally at Search Discovery, it’s more about the Google Cloud Platform, which has in its platform a number of things: something called Compute Engine, which is a way to run your own servers, sort of like Amazon’s cloud services; Google file storage, or Google simple storage, I forget what it’s called, which is a way to do dumb storage; and Google BigQuery, which allows you to query data at scale. So you’re right, it actually is both storage and query. Exactly why it’s called BigQuery is that it is built on top of a tool which Google calls Dremel. If you know what a Dremel is, it’s that fancy little grinder which seemingly everybody is buying or has bought to do cool maker projects. So they have a tool called Dremel, and from that they’ve built BigQuery, which allows them to effortlessly query data, but in fact it is able to store and query data both at the same time.

[00:03:21] So I’ve got a picture of a Google engineer who’s got a bubble but with a dremel working away on silicone to make a data.

[00:03:29] So I think I’ve got this nailed. This may be a very short episode. I’m not sure you’d want to use a Dremel on silicone; you might want to use a hot knife or something. But is Dremel a Google platform? Dremel is an internal Google technology which I don’t believe has been exposed to the public in its entirety yet. Pieces of it have been discussed in Google’s academic papers. I don’t believe the whole thing is exposed as of yet, but parts of it are. So if you’ve ever used Apache Drill, which is a way to query anything at very large scale, it’s somewhat similar to that. Apache Drill is built on the same sort of technology that’s inside of Dremel. Dremel plus the storage becomes BigQuery.

[00:04:11] So is there any... and again, this is all speculation. I don’t think any of us have inside knowledge, but you’re more equipped to speculate than I am. When it comes to querying a massive set of data and returning something potentially very small based on what you’re querying, is there a link to the actual Google search engine technology, like, are those... No, no.

[00:04:34] So the Google search engine technology is built for a specific purpose, which is to provide lightweight search. This is built to basically query a set of data in a very conveniently scalable fashion. And so, a little bit of behind the scenes for database nerds: it’s all stored as column-based data, which means that it’s easy to do scans and it makes table scans very cheap. Something I more recently found out is that in the BigQuery table structure each column is actually an individual file. They do that for compression, so that it’s easier to scan the whole thing. And so the data are stored in very specific manners which make it very easy to query. Okay, and I’m not sure I answered your question, because I kind of lost my train of thought. My train of thought left me at the station.
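A toy sketch of the column-based idea described above, not BigQuery's actual storage format: the same records are held once as rows and once as one list per column, and a scan over a single field only needs to touch that column. The field names and values are made up for illustration.

```python
# Toy illustration only; real columnar engines add compression and encoding.
rows = [
    {"date": "2016-07-01", "browser": "Chrome", "pageviews": 3},
    {"date": "2016-07-01", "browser": "Safari", "pageviews": 1},
    {"date": "2016-07-02", "browser": "Chrome", "pageviews": 5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field still walks every full record.
total_from_rows = sum(record["pageviews"] for record in rows)

# Column layout: the scan reads only the single column it needs.
total_from_columns = sum(columns["pageviews"])

print(total_from_rows, total_from_columns)
```

Both totals agree; the difference is how much data has to be read to get there, which is why a per-column layout tends to make whole-table scans cheap.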

[00:05:22] So you did. I mean, I remember back, and it had to be ’99 or 2000, we were reporting from a bunch of Lotus Notes databases hacked together with some static pages, static files, and some Oracle stuff, and we were trying to get a better content management platform for a website. And I remember the one architect was lobbying for us to architect this whole thing in a column-based database. So the idea that column-based storage can be faster and more efficient still kind of blows my mind a little bit. I can barely grasp a couple of tables.

[00:06:05] Another product, which is not specifically the subject here, is Amazon Web Services’ Redshift database. Amazon Redshift is likewise a column-based database. Now, they do some other things over there. That’s more of a database as a service, and there’s more legwork in getting that up and running.

[00:06:25] What’s really nice about BigQuery is that it’s a completely managed platform. So you don’t necessarily need a database administrator to say, hey, is my server up, what’s my server uptime? That’s taken off your plate 100 percent. You don’t have to sit down and say I need to reboot the server, or I need to rebalance the tables. Google manages all that for you. So you’re basically coming down to a few discrete tasks. One is you want to have some sense of how your tables are set up, so some sort of design. Two is how are you going to get your data in there.

[00:06:58] And three is how are you going to query your data. Basically three jobs, which is not impossible. So the technical overhead is completely taken care of for you on BigQuery, which is another nice feature of their platform. Now, what you can see if you go to YouTube, or if you try it yourself, is that you can query terabytes of data in a second. So this happens not infrequently: customers come to us and say, we have really big data, you don’t understand. Then we talk to them and they’re like, yeah, we have terabytes of data, and you’re like, that’s really not that... that goes on one hard drive. It’s cool.

[00:07:30] Don’t worry about it. That’s not a problem. We can do that. It’s a lot if you try to print it.

[00:07:37] Yeah, if you print that out, that’s a lot of paper. But it crashes Excel every time I try to load it. Yes, if Excel is the upper reaches of your technology, terabytes is big data.

[00:07:48] If you want to think in current terms, a terabyte is not very big at all. I think Google’s advice is something like: if you have over 10 petabytes, they want you to reach out to them.

[00:07:59] So you said three things and I missed one of them was how you get the data out.

[00:08:03] One is how you get the data in, and I guess the third is how you store it. You want to have some concept of... we’ll speak specifically to the Google Analytics data in a second. Because aren’t two of those taken care of?

[00:08:17] Like if you’re taking a premium. Right. Do they not. It’s kind of flipping a switch.

[00:08:22] Yes, so let’s kind of dovetail into that. So the first is, if you don’t have structured data, if you’re getting data from an API or from another source, my initial recommendation would be to just shove it into BigQuery in the way that you get it. The less you handle the data, the less overhead you have on your end. So forget the transform and just extract and load; that’s typically what we do. Now, in saying that, you do want to think about structure when you’re working with data that’s not structured like Google Analytics data. We work with AdWords data and other marketing channel data, search engine marketing channel data, and for those we structured the API exports correctly so that they would be easy to query later on. So that’s what I’m saying: you want to think about this when you’re extracting the data. If you have the option to have input there, that’s a great time to step in and say, here’s what we want to do in the future, so let’s structure it so that we’re putting it into BigQuery in the right format.

[00:09:24] And conceptually, are you potentially saying: I’ve got an API, one call to the API builds or adds to one table, another call adds to another table, and I can relate them because I’ve got a common key in those? Exactly. It’s really simple, so joins are very simple.

[00:09:41] Any sort of table query is very simple. If you’ve ever worked with a database system, what we call a relational database management system or RDBMS, like MySQL or PostgreSQL or Microsoft SQL Server.

[00:09:54] So with one of those, you know, there’s a lot of overhead in making sure your tables are normalized, which means, in very specific terms, that an entry can only exist once in a table. And so you have to spend all this work normalizing the tables so that the query engine can run efficiently. You also have to prune your data sets and archive your data sets. One definition of big data is when it’s cheaper to store the data than it is to try and figure out what data you need to get rid of. So with BigQuery, definitely the mindset is: just dump it all in there, and you can prune whatever you need to in the query on the way back out.

[00:10:30] OK. So let’s talk about how this works with GA. You’ve got GA free, with which you can still use BigQuery, but in that case you’d just be making API calls and pushing stuff in from the original. That’d be kind of weird.

[00:10:43] It would be less than ideal.

[00:10:45] You’re better off they’re already storing it.

[00:10:47] So just make your make your queries yeah sketchy premium or. Well I mean so.

[00:10:54] So when you make a query from the API, essentially you’re not getting what we call event level, and this is not Google Analytics events. In database terminology, event level is essentially the irreducible record of what happened, so we can’t simplify it any further. It would be each server call, essentially, in a nutshell. So if you want to really do more advanced analysis, you do want to have essentially those server calls at some level. And you can’t get that out of the GA API, very intentionally, because even for a small company that would be pretty significant, right? For each visit you’d have an average of 10 events. If you have 5 million, how am I going to get that out of the API and serve that up in a scalable fashion? That’s not its intention. Its intention is to provide summary numbers. For Google Analytics Premium, yes: if you sign up for BigQuery, they will actually automatically pump the event-level data into an instance for you.

[00:12:00] And then what’s involved? So say you’ve got that set up, and then somebody says, I want to see visits by day, something that’s definitely a rolled-up, aggregated thing. Is that one where you say, yeah, for that stuff still go back and use the standard API or whatever? Or would you actually say, no, in BigQuery I can set up a bunch of those common things? You’re basically reengineering the processing logic that GA is doing, right? To a degree.

[00:12:28] So let’s take a step back to how you get it enabled, first and foremost. Hopefully you’re working with a partner or with Google, and you have to get with your contact at Google, tell them you want to turn it on, and then they can enable it for you. Let me just mention the pricing structure. For Google Analytics Premium, the way it works is that you get a credit for Google compute services per month. It’s a little awkward for a Google Analytics Premium customer, because you pay for it but you get a credit. And in every case that I’ve seen, the credit covers what BigQuery costs. So you get this credit to use BigQuery, they charge you against that credit, and then at the end of the month you settle up.

[00:13:15] The one component you do need is to set up billing, just in case you do go over. I personally have not seen that yet, because it’s just so darn cheap, and every time there’s a price cut by Amazon or by Google, they kind of go on a price war every once in a while, and it just gets cheaper and cheaper. Well, and just so we can kind of rattle it off, I remember being a little blown away by that as well.

[00:13:37] It’s like you get a $500-a-month credit. And we’re talking five dollars per terabyte queried and two cents per gigabyte per month for storage. So, right, you look at that.

[00:13:51] I mean, yeah, it’s really small. It’s so ridiculously small. Now, is it possible? I 100 percent believe that if you had massive data in there you would be paying more than this, or if you weren’t a Google Analytics Premium customer you’d have to pay a nominal fee. However, in every single case that we’ve implemented it, I would argue that the benefit to the company from having access to the entire data stream greatly outweighs the small cost. So even if you have to pay, you are achieving so much more value from your data that it is well worth it.

[00:14:27] So do you have any cases that are not Google Analytics Premium? Yeah. Yes. You know, they’re on GA and they are using BigQuery? Yes. Is that web analytics data going in, or is that because there’s other large data?

[00:14:40] Well, some of it’s search engine data, some of it’s other web analytics data, and I don’t want to go into specifics, but the cost is so marginal it’s silly. Yeah. Now, if you utilize it a lot, like everybody is doing queries, then it’s possible your costs rise. But Google also helps you by caching query results, and they’re not out to ding you on pennies. When you query, it’s five dollars per terabyte, and the first terabyte per month is free. So if you’re under a terabyte, you’re free. I had one that I ran for a while, and I think it was like ten dollars a month or something. It’s so inconsequential for most companies that it doesn’t really matter much.
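A back-of-the-envelope sketch of the pricing arithmetic quoted in the conversation. It assumes exactly the figures mentioned in this episode (five dollars per terabyte queried, two cents per gigabyte per month stored, first terabyte queried each month free); current list prices may differ, and the function name and usage numbers are made up for illustration.

```python
# Prices as quoted in this 2016 episode; treat them as assumptions.
QUERY_PRICE_PER_TB = 5.00     # dollars per terabyte scanned by queries
STORAGE_PRICE_PER_GB = 0.02   # dollars per gigabyte stored per month
FREE_QUERY_TB = 1.0           # first terabyte queried each month is free


def monthly_cost(tb_queried, gb_stored):
    """Estimate one month's bill from query volume and stored data."""
    billable_tb = max(0.0, tb_queried - FREE_QUERY_TB)
    query_cost = billable_tb * QUERY_PRICE_PER_TB
    storage_cost = gb_stored * STORAGE_PRICE_PER_GB
    return query_cost + storage_cost


# Hypothetical usage: 2 TB scanned and 250 GB stored in a month.
print(round(monthly_cost(2.0, 250), 2))

# Staying under the free terabyte leaves only the storage charge.
print(round(monthly_cost(0.5, 100), 2))
```

Under these assumed numbers, a small project lands in the range of a few dollars a month, which is consistent with the scale of costs described above.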

[00:15:22] So what about and how when you can jump in any time. Because I wanted to get to and get that out to you.

[00:15:27] Keep it going. This is one of the first conversations we’ve had on this podcast where I feel this urge to be taking notes.

[00:15:39] I want to point out, Tim, that I gave a very detailed presentation about BigQuery internally probably a year ago, and I won’t say who missed it, but they’re probably taking notes right now.

[00:15:49] They were. You know, we’re all a team. So what we haven’t talked about, which is another huge curiosity area, is how you get data out, right? Can you talk through that third piece that you referenced earlier?

[00:16:05] So the four ways to really get data out are the web UI, the command line, the REST API, and then ODBC, which stands for Open Database Connectivity.

[00:16:14] So you and I can literally take Microsoft Excel and hook into BigQuery with ODBC.

[00:16:19] To a degree. So let’s be careful, though.

[00:16:21] Either their calendar. Yeah.

[00:16:24] So this literally happened to me before. I’ve literally seen this, where somebody made a query against a very large-scale database and said, I’m having a problem with my computer. I’m like, well, what’s the problem? It just won’t... Whoa. Oh wow. Hey look, your disk is full. Well, what did you query from the database?

[00:16:40] Well, what did you think was going to happen if you did a huge query on a big, huge database? Of course it’s going to kill your computer. That’s just common sense. So could you connect with Excel? Yeah. Is it a really good idea? I’m not sure.

[00:16:55] But in that case, when you’re making a query, I’m assuming regardless of which one of these mechanisms, but let’s say that it’s one that I’m initiating locally through the API or ODBC, that query may actually run on two petabytes of data.

[00:17:12] But it is going to return 10 rows. The ODBC standard includes things like row numbers, so you would actually get a concept of how much data you’re getting back. But you mentioned the web UI. What’s great about the Google ecosystem is that they handle your logon fantastically, so if you’re already using Google Analytics or some other thing, and you’ve connected the identity that you use for normal Google purposes with this BigQuery project, then you can just say take me to my console and it takes you right to the web UI. That’s where I would say almost everybody should start, and you should basically live there, for a couple of reasons. One, it’s a great way to build or debug queries. Two, they do some limiting of the data returned, so you’re not going to run into these data scalability issues: if you happen to get back 10 million rows, you don’t have to worry about killing your computer. It’s all based in the browser. So it’s a much better way if you’re doing data exploration or building or debugging queries.

[00:18:14] So I’d totally use the web UI, and then, kind of like building database queries elsewhere, you can take what you’ve got and say, okay, now I want to run it through Python, through the REST API. Is that right? Exactly.

[00:18:26] And you can basically copy and paste that, or if you’re actually building the query in a toolkit, build the query there. It also gives you some diagnostics, like, hey, this query took this long. So if you want to be really concerned about pricing, and about how long these queries are running, and whether there is anything we can do to make these faster, then it gives you some diagnostics inside the tool. A feature of BigQuery is that it actually allows nested data sets.

[00:18:53] So you mentioned the Google Analytics API, or the Google Analytics Premium data set, rather. It comes across as event-level data, which means each row is one event. Now, embedded in that event there are actually JSON objects. What that means is that instead of, like in MySQL, where each row has columns and in each column there’s a value, in BigQuery a column like browser type may actually contain browser type dot build date or browser type dot platform. So some of these are nested data structures inside BigQuery. And that’s something that, if you’re trying to get used to BigQuery, is actually a little bit harder to wrap your head around: how queries work with these nested data structures. It’s not always intuitive, especially if you’re coming from a relational database system. And so the web UI really gives you a concept of, here’s how the data are structured.

[00:19:40] Here’s where the nested data elements are and how you want to start to unpack those. And so I would definitely go with the web UI for 99 percent of your building, debugging, or data exploration.
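A small sketch of what unpacking nested data can mean in practice, written in plain Python rather than BigQuery SQL. The record shape and field names (`browser_type`, `hits`, and so on) are hypothetical stand-ins, not the real export schema: nested objects become dotted column names, and a repeated field is exploded into one flat row per element.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into a single dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def explode(record, repeated_field):
    """Yield one flat row per element of a repeated (list) field."""
    parent = {k: v for k, v in record.items() if k != repeated_field}
    for element in record[repeated_field]:
        row = flatten(parent)
        row.update(flatten(element, repeated_field + "."))
        yield row


# Hypothetical session record with one nested object and one repeated field.
session = {
    "visitor_id": "v1",
    "browser_type": {"platform": "desktop", "build_date": "2016-06-01"},
    "hits": [
        {"type": "PAGE", "page": "/home"},
        {"type": "EVENT", "page": "/home"},
    ],
}

rows = list(explode(session, "hits"))
for row in rows:
    print(row)
```

Each output row repeats the session-level columns next to one hit, which is roughly the mental model needed before aggregating across nested values.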

[00:19:53] So to that point I give you in the way of you I will you actually get results and you have a little 20 years of the next to it where you can.

[00:19:59] Yeah, and it’ll give you the feedback: hey, you need to sum by something, or, hey dummy, you forgot to explode this value and you can’t combine that. So it can actually expose what’s going on with nested values. They do have a command line. I always feel like the command line is for maybe batch operations. I don’t really use it at all. That’s just the way it is. Maybe if you’re doing backups or something simple that you want to shell script, that’s fine. We use the REST API. Our workflow is to build it and debug it in the web UI, and then pull that out and put it into the REST API inside a deployed application.

[00:20:38] Whichever language we’re using at the time. And so, this is one of the things I’ll listen back on and say, Tim, you’re silly: I’ve been hearing REST API for years, and am I right in thinking that means it can be Python, it can be R, it can be... I can use JavaScript?

[00:20:55] I believe you can use JavaScript today, so there are supporting libraries.

[00:20:59] I think it’s all linked to where they live.

[00:21:01] There is a BigQuery connector for Excel, so hey, good news for you.

[00:21:08] But no, no, I wouldn’t be doing it with that at this point. So it’s possible, but I would say REST API.

[00:21:16] You know, Google has tried to build basically a Google services package which connects all these different things. So in Python there’s a Google service package, and you basically authenticate, and what’s cool is that the authentication, if you set it up correctly, can then be used to query Google Analytics, query BigQuery, whatever the case may be, as long as you understand how the API works, and the REST API largely.

[00:21:40] I mean, there’s syntax, but in a large sense you’re putting in this kind of SQL-like query.

[00:21:47] Like you said you can basically copy your not you’re not directly copy pasting but you’re close to copying and pasting from them.

[00:21:53] No, you’re basically copy-pasting the query, but it is kind of embedded within the middle of the API call. Yeah, okay.

[00:22:01] Yeah it’s embedded. I mean you may want to like there’s no sequel.

[00:22:05] In other tools there’s what’s called a SQL engine. Tools like Tableau, I think, actually have some kind of SQL engine, where they try to build the query for you based on the table. But if you’re using the REST API, then it’s kind of up to you to say, I want to build this query and have this component. Obviously you can parameterize things like dates or ranges or whatever to meet your needs, parameterized within the confines of the language you’re working in. But that’s how we do it. Yeah.
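A hedged sketch of the "query embedded in an API call, with parameterized dates" workflow described above. It only builds and prints a request body in the shape I understand the BigQuery REST `jobs.query` method to accept for named parameters; it does not authenticate or send anything. The table name, column names, and function name are hypothetical.

```python
import json


def build_query_request(start_date, end_date):
    """Build a jobs.query-style request body with named date parameters."""
    sql = (
        "SELECT date, COUNT(*) AS events "
        "FROM `my_project.my_dataset.events` "  # hypothetical table
        "WHERE date BETWEEN @start_date AND @end_date "
        "GROUP BY date ORDER BY date"
    )
    return {
        "query": sql,
        "useLegacySql": False,      # named parameters need standard SQL
        "parameterMode": "NAMED",
        "queryParameters": [
            {
                "name": "start_date",
                "parameterType": {"type": "DATE"},
                "parameterValue": {"value": start_date},
            },
            {
                "name": "end_date",
                "parameterType": {"type": "DATE"},
                "parameterValue": {"value": end_date},
            },
        ],
    }


body = build_query_request("2016-07-01", "2016-07-05")
payload = json.dumps(body, indent=2)
print(payload)
```

The SQL text itself is what would be pasted over from the web UI; only the date range changes between runs, and keeping it in parameters avoids string-building the query by hand.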

[00:22:32] Okay. So I want to turn our discussion a little bit and talk about why companies would start to use this, right? What does it look like for an organization to say, you know what, we’ve got our analytics tool set, we’re using it to analyze, so what’s making us take this next step and think about using BigQuery? What are some of the things we want to do, or should be thinking of doing, as a business?

[00:22:57] Well what yeah what do they have to day like. What is their environment today. Yeah and then what they’re both what they’re trying to do and then isn’t a migration or something new. Yeah.

[00:23:07] So typically it’s a couple of different things that companies are looking for when they get into BigQuery. They may already be a Google Analytics Premium customer, and they may want to enrich that data set at scale. So if you’re already a Google Analytics Premium customer and your data are being uploaded, and then you want to upload a bunch of customer service data, you know what, it’s fast, cheap, and scalable to put it inside BigQuery, so we’re just going to go all in on BigQuery. That’s definitely one component of it. As well, if you’ve ever been involved in a data warehouse project, data warehouse projects tend to be very long, complex, and not inexpensive at all, as opposed to BigQuery, which is basically a very nominal fee.

[00:23:48] You spend six months debating and bring up for you.

[00:23:54] I’m not saying that’s wrong, but that’s the old archetype, which is set up to be very specific, and it’s the way it’s done. Whereas with BigQuery, I see a lot of companies, particularly marketing organizations but also a lot of other more agile organizations, that want to be able to get a new data source, whether it’s seasonal or just something they’re going to try, and they want to look at the entire scan of the data. So maybe they’re doing some sort of seasonal marketing message with a new vendor which they want to try. Is that worth spinning up a data warehouse project internally? No, probably not. To be honest, it’s going to take longer to do the project than it is to do the test. If you have your data stored in BigQuery, then you can just upload that data and analyze it as needed, and bada bing bada boom, you’re done with the day. It’s much cheaper, faster, and more flexible than an existing data warehouse product. So that’s another component where people are seeing the value: hey, it’s going to take us millions of dollars to build an in-house data warehouse, so it makes more sense for us to just put it up in the cloud, put it in BigQuery, and we don’t have to worry about it. We have the same access to the data.

[00:25:01] So is it ever the marketing organization saying, yes, we have a big IT, a whole infrastructure thing, we’re never going to be prioritized above the operations group or whatever, and they just decide, fuck it, we’re going to go and stand this BigQuery thing up and start throwing stuff in it that we care about, basically?

[00:25:19] I mean it might be it’s probably a little more organized like they would hopefully involve the I.T. team because they may have a questions about what’s going on. It’s not a whole like you know Rambo jump out of a helicopter or shooting missiles or you know the Russians Rambo 3 you’re not quite new in that was that Rambo.

[00:25:36] Or was that Rambo.

[00:25:39] Sorry I have no I don’t even know why but Novak you just ran off to the exact scenario that you had seen in your mind where that’s always like my mind my nightmare when people are like try to be like Rambo programers or Rambo you know Rambo program people are just like hey I’m going to jump out and do it all by myself I’m like go work here.

[00:25:57] Well, I’m not implying going rogue, but I definitely have worked with organizations where, for the historical data, you know, it’s an Oracle data warehouse, and they’ve got their rack space stuff, and they’ve got to provision everything, and after a year or two somebody comes in and says, that’s fine, we’ll play nice with you, but, you know, we’ve got to get something done in the next month.

[00:26:22] Yeah. And so that’s a lead-in for them to get started. As well, a lot of companies have technical debt, right? They’re already waiting for requirements to be filled in the data center. Would you say it’s true that in most places there are requests for data warehouse upgrades that are already in the queue?

[00:26:41] And then you come along. Oh here’s another one that I need but I need it right away. What’s going to go the end of the queue. So you know what is a better way to do this or the for the fourth thing in the queue stays.

[00:26:51] The fourth thing in the queue forever, because there’s always something that’s rising above it.

[00:26:55] So there’s a little more to it than just, hey, let’s turn it on. Obviously, we really leverage the REST API a lot, so you need to have some developer resource in there to operationalize it. Now, could you go in and manually upload some subset of files?

[00:27:11] Great, you could totally do that. You could upload it into the storage and then put it in BigQuery and query against it. You could do ad hoc work basically by hand, but if you want to operationalize it you need to have a little bit of developer resources, but not a lot. And after that, like I said, you don’t have a whole data warehouse infrastructure team supporting you. You just need to make queries and get your data out.

[00:27:35] It seems like people will say, oh, I’ve got Premium. If I’m an analyst, I’m excited because, hey, we finally went to Premium, maybe for something fairly obvious, you know, we were hitting limits, and so, great, I’m going to get unsampled data. That seems like one of those misnomers out there. Unsampled data is really not the same thing as the event-level data you get from flipping the switch and loading it into BigQuery, right? So if an analyst says, no, I really have the capacity and the ability to work with basically the hit stream, and I want to access that: I may have Premium, but I’m not really going to get at that data until I flip the switch and turn on BigQuery. You made the point earlier that even with a relatively small site, the actual raw data can still be a decent-sized data set, and I would think that for doing true analytics on it, the more atomic you’ve got it, the more data points you have, the more opportunity you have to actually find something. If you just stack all your dimensions up with whatever metric, that may be a bigger flat table, but it still is stopping short of my actual atomic-level data.

[00:28:55] Yes, yes. Earlier you asked if you’d have to rebuild reports in BigQuery, basically rebuild Google Analytics reports via query. To some degree that’s probably true. Like, how many page views or unique visitors: you’d have to have that query, run it in BigQuery, and do some sort of SQL aggregation function. But that’s super simple. If you can’t figure that part out, feel free to reach out to us or read a book, because it’s really easy and basically anybody should be able to do it. So that is true. Now, with regards to deeper-level analysis, yes, it’s 100 percent true: you’re able to do further event-level analysis of everything that’s going on on your site. Where we really see the value proposition explode, almost exponentially, is when you start looping in other data. So suppose you have call center data, or some sort of customer returns data, or some sort of offline data which is not captured in your web analytics tool. By linking up your web analytics data to other types of data via BigQuery, you become this magician that is able to do amazing things. Like what? So, customer lifetime value: you could do very simple customer lifetime value if you identify your own customers. So you end up doing cross-device stitching? I mean, I would take it even further. So let’s assume that you’re basically doing all your device or web analytics tracking inside Google Analytics. Uber landed on that.
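A minimal sketch of rebuilding a rolled-up report from event-level rows with a SQL aggregation, as described above. It uses Python's built-in `sqlite3` as a local stand-in so it runs anywhere; it is not BigQuery, and the `hits` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for an event-level (one row per hit) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (day TEXT, visitor_id TEXT, hit_type TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("2016-07-01", "v1", "PAGE"),
        ("2016-07-01", "v1", "PAGE"),
        ("2016-07-01", "v2", "PAGE"),
        ("2016-07-01", "v2", "EVENT"),
        ("2016-07-02", "v1", "PAGE"),
    ],
)

# Roll the raw hits up into a per-day report: page views and unique visitors.
report = conn.execute(
    """
    SELECT day,
           SUM(CASE WHEN hit_type = 'PAGE' THEN 1 ELSE 0 END) AS pageviews,
           COUNT(DISTINCT visitor_id) AS unique_visitors
    FROM hits
    GROUP BY day
    ORDER BY day
    """
).fetchall()

for day, pageviews, unique_visitors in report:
    print(day, pageviews, unique_visitors)
```

The same `GROUP BY` pattern carries over to any SQL engine; the point is that summary reports are a simple aggregation once the atomic rows are available.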

[00:30:23] So we’re using Google Analytics for the web site, for a device, or whatever. But we also have voice call data, or we also have customer supply chain data, you know, things like back-end order tracking. Basically what would be covered here. So, like, when was the order placed, when was the order shipped, when was the order delivered, this sort of thing. Connecting that data all together, stitching a complete customer life together inside Google BigQuery, is really where the value is. So the more you can flesh out your customers, you know, take a step back; don’t just say we’re going to capture your device interactions. Those are table stakes.

[00:30:59] Well, no, no, no. I mean, to be fair, he jumped in a little quick, going down the path of, “And then there’s an online transaction, and so therefore you can get cross-device.” Where I was kind of heading is that you still have to have a key, right? I mean, no amount of technology is going to magically make it do the right join. You have to have a key. And sometimes it seems like that has to be thought through in how your operational processes are designed. If they’re making an in-store purchase, have you thought through how that in-store purchase would be linked back to the call center, to the IVR system, to your digital data? It’s not magic just because you have the two data sets in one spot. You still have to have a key to join them together, right?

[00:31:49] Right. So that’s the preplanning that is required, and sometimes your vendor is going to really have an opinion about how the data is going to be structured, so you kind of have to figure it out, along with some of your implementation from the analytics side. And obviously we’re assuming that someone is an identified customer in both instances. So you’re assuming that on the website you’re identified, in the store you’re identified, and we can get a key and stitch those two together. If you don’t have that, you can also do some sort of meta-analysis to say, okay, we had this in-store performance.

[00:32:19] Now we have this bucket of anonymous visitors. How did the behavior of our anonymous visitors as a whole change and potentially affect our identified in-store purchases? So potentially you could work around that, but it would get messy and be sort of complicated. Ideally you’d have some sort of key that would link all these tables together. That way you could say, hey, great, here’s your customer ID number from once you logged in; we know your customer ID. So if you have a vendor that is maybe tracking online e-commerce purchases, they can track you by email, or, if they actually have you log in, they may have a customer ID.
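The point about needing a shared key can be sketched in a few lines. The example below is an assumption-laden illustration, not anyone’s production setup: it again uses Python’s built-in sqlite3 as a stand-in for a warehouse, and the web_sessions and store_purchases tables, their columns, and the customer IDs are all hypothetical. Each side is aggregated first and then joined on customer_id, so anonymous web sessions (no key) simply cannot be stitched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_sessions (customer_id TEXT, device TEXT);
    CREATE TABLE store_purchases (customer_id TEXT, amount REAL);
""")

# Hypothetical web data: the last session is anonymous (no customer_id).
conn.executemany("INSERT INTO web_sessions VALUES (?, ?)", [
    ("c1", "desktop"), ("c1", "mobile"), ("c2", "mobile"), (None, "desktop"),
])
# Hypothetical in-store data keyed by the same customer_id.
conn.executemany("INSERT INTO store_purchases VALUES (?, ?)", [
    ("c1", 40.0), ("c1", 10.0), ("c3", 25.0),
])

# Aggregate each source separately, then join on the shared key.
# Aggregating before joining avoids double-counting purchase amounts
# when a customer has several web sessions.
query = """
    SELECT w.customer_id, w.devices, COALESCE(p.total_spend, 0.0)
    FROM (
        SELECT customer_id, COUNT(DISTINCT device) AS devices
        FROM web_sessions
        WHERE customer_id IS NOT NULL
        GROUP BY customer_id
    ) AS w
    LEFT JOIN (
        SELECT customer_id, SUM(amount) AS total_spend
        FROM store_purchases
        GROUP BY customer_id
    ) AS p ON w.customer_id = p.customer_id
    ORDER BY w.customer_id
"""
rows = conn.execute(query).fetchall()
for customer_id, devices, total_spend in rows:
    print(customer_id, devices, total_spend)
```

In this toy data, customer c3 bought in store but never identified on the web, and the anonymous session has no key at all, so neither can be linked; that is the gap the speakers describe.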

[00:32:58] Now we’re headed down the path, I feel like, a little, of the vendors saying, “Oh, you have to have a key.” Yeah. They don’t like what they get yet you haven’t yet to yet you come here. Let’s just assume you have the key, and just ignore the reality of the other 85 percent of the business models out there.

[00:33:13] I’m not saying it’s always easy. What I’m saying is that I’ve seen it accomplished, and we’ve accomplished it. So it’s not impossible, but it’s definitely something that you have to think about and understand from the beginning: that we want to understand the customer lifecycle. If it’s a situation where they have a CRM system in place already, then they really should have, hopefully, fingers crossed, thought out a customer model. So if they’re using Salesforce, then they have customer IDs in Salesforce, and they’ve kind of thought about all the different touch points. If you don’t have that kind of environment, then it’s more, how do we make that Salesforce ID, or how do we expose this back out to BigQuery? So it kind of depends on the industry.

[00:33:56] So say I’m an Adobe user, and Adobe has got their data feed. You’re not going to customize it at all, and I think it’s included with most versions, if you are willing to take that unfiltered, unsanitized data feed from Adobe. Have you seen that? Or I guess the simple question is, are there cases where somebody is using Adobe Analytics where it makes sense for them to go into BigQuery? Is that when they’ve got other data that they’re already pumping into BigQuery, or are there other reasons that they should bring it in? What’s kind of the right Adobe-to-BigQuery case?

[00:34:27] I think there are two data feeds. One is the data which everybody calls the data feed from Adobe, which is the flat file you get. Then there’s the Adobe streaming data, which is called Live Stream, I want to say, where they actually send the event out to you.

[00:34:43] So could you put either of those into BigQuery? Yes, absolutely you could. Would you ever do that? Typically that would be a decision made on a new install. When we’re working with clients who are on the data feed, if it’s an existing Adobe client, they may already be getting the data, and this may be a question that they asked and answered with their IT organization, you know, five years ago. In that case, if they already have the process to read this data into Netezza or Teradata or some other on-site device, then are we going to go in and say, “Hey, you guys are morons, you’ve got to put it all inside BigQuery”? No, of course not. They already have a whole process to read and analyze the data, so it doesn’t make any sense. In the case where it’s a new install, there is definitely a conversation about where do you want to store this data and how do you want to access it. And it really comes down to one of three options, basically. One is Google BigQuery. Some companies do have concerns about either putting their data with Google, or they want to stay with Adobe, so they look at the Amazon Redshift product, which I mentioned, which is very scalable. It’s less of a platform as a service, so you do have to actually understand how the databases are spun up and managed, and do a little bit of management. You don’t have to have them on site.

[00:35:56] They’re in the cloud, but you do need to do some load balancing or whatever the case may be. And the third option is that they may want to buy and build an internal database. For that final one, I would say the people who are using internal on-site databases would be on Netezza or Teradata; those are typically the ilk. If I’m forgetting one, please excuse me. If you’re talking about a new install, not a legacy install from five or ten years ago, if you’re starting new today, there may be infosec, information security, requirements, or some sort of legal or just old-guard paranoia.

[00:36:34] I mean, potentially it’s just some sort of logistical requirement where they say, “You know, we can’t put our data in the cloud. We just can’t.” And I’ve seen this; typically it’s with financial firms, where they really need to be on site. And so that ends the conversation: there are infosec requirements. So the first question is, can you put your data in the cloud or not? If you can put it in the cloud, then it becomes, are you comfortable with the way Google handles data? Or, if there is an IT organization that’s already knee-deep in the Amazon Web Services cloud, so they’re using Amazon Web Services compute or what have you, they may be really interested in Redshift just for logistical concerns. Network I/O is sort of the carbon monoxide in the room. Network I/O is just the cost of transporting the data from point A to point B, right? So if you have a ton of data already being produced by your Amazon Web Services, then it’s pretty smart to just send it into Redshift. That’s the way it works. Now, if you’re coming in fresh with nothing to worry about, then I would say, you know, consider looking at BigQuery. It’s the way to go for a lot of companies. The lack of infrastructure and organizational overhead that you have to expend to get up and running is quite nice, and it makes it very, very affordable.

[00:37:55] There’s one word we haven’t heard uttered at all during this and I don’t know where it fits. So where does Hadoop fit in to this entire universe.

[00:38:04] So Hadoop is actually an ecosystem of numerous things. When Google published a paper which described the Google File System, GFS, some very smart people read that and said, “Hey, we could basically implement something like that; we could kind of figure that out. The technology isn’t exposed, but conceptually we understand it, so we’re going to use Java to build what’s called HDFS, the Hadoop file system.” So it started with GFS, Google’s file system; then they built the Hadoop file system. Now Hadoop is actually the umbrella term for not just the Hadoop file system but also the MapReduce engine, which Google also wrote about in a paper. So they built these two components and put them together. Basically, Hadoop would be for someone who also wanted to be on site but not in a relational database server, so it’d be a different way to go. Okay. So if you wanted to be on site but not necessarily in a relational database format, you could go with a Hadoop vendor like Cloudera.
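For readers who have not met the MapReduce idea mentioned above, here is a toy, single-process sketch of its three phases (map, shuffle, reduce) counting page views per page. This is not Hadoop’s API and nothing here is distributed; the function names and the sample records are made up purely to show the shape of the computation.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (key, value) pair for each input record.
    yield (record["page"], 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse each key's values into a single result.
    return key, sum(values)

# Hypothetical input records standing in for files on a distributed file system.
records = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce functions run on many machines next to the data, which is why the file system and the processing engine are described together as one ecosystem.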

[00:39:00] Okay that’s it. I’m totally I’m fully knowledgeable now.

[00:39:05] You’re ready to go. It’s the client I don’t get to talk about a Web analyst Wednesday. Let me tell you how I am.

[00:39:12] What is your what our discourse on the Google file.

[00:39:15] Well, no, it’s really good. And actually I’m just sitting here basking in the glow of proof that I’ve gone out and gotten people way smarter than me to work at Search Discovery. Anyway, Michael, thank you so much for coming on the show. This has been great, and I think it’s something people who listen to our show will get a lot out of, maybe not as much out of it as Tim Wilson, but I think it’s a great topic, and I think it’s timely. We’ve talked a lot on this show about how data science is becoming more and more central to what it means to be a digital analyst, so I think there’s a lot of interest in these kinds of topics. So thank you very much. I certainly learned quite a bit that I could have learned a year ago, apparently, if I had attended the internal... I’m sure I wanted to, anyway.

[00:40:13] Thanks for having me. It was awesome.

[00:40:15] Yeah, well, so we do a thing on the show, Michael. At the end of the show we do a thing called last call, where, you know, you share something cool you’ve seen in the last few weeks or whatever you think is worth noting. So we just go around and do that. Tim, why don’t we start with you? What’s your last call?

[00:40:34] OK, so I’ve got a couple I can do. I think what I’m going to go with is going to be appropriate, because now my brain is full, and I’ll momentarily wake up tomorrow morning and realize that I don’t have any understanding of any of this, though I feel like I’ve got a better sense of BigQuery. I’m going to go with something sort of simple and basic, but I realized how much I am using it now compared to a year ago, when I wasn’t using it. So, plain old web analyst: we always have to check the tags, see what’s firing. For years, I mean, I’ve used Fiddler, I’ve used Charles Proxy, I’ve used the Google Analytics Chrome debugger for GA, I’ve used the DigitalPulse debugger for Adobe. But a while back I hit a spot where I needed to capture POST data, and that was not being captured anywhere, and Josh West at Analytics Demystified pointed out, “Oh, use the ObservePoint tag debugger, a free plugin for Chrome.” So it’s simple. I realize I use that 99 percent of the time now, if I’m not trying to debug a mobile site or mobile app where I need to run it through Charles. I had a client a few weeks ago where I sent a screen capture and they’re like, “What is that? That’s not the DigitalPulse debugger.” I’m like, “No, it’s way more awesome.” So that’s a plug; it’s free. It gets both your GA and your tag management and your Adobe tags all in one spot. It’s not perfect.

[00:42:00] I’d love to have the ability to do a little filtering in it, but yeah, it handles it all. A little tool, a little Chrome plugin, and it’s right there where you might be inspecting elements anyway. So that’s my last call. Nice. What about you, Mr. Healy?

[00:42:16] So I have a book recommendation. Max Tegmark wrote a book called Our Mathematical Universe, which discusses his hypothesis that our physical reality is a mathematical structure, and his theory of the ultimate multiverse. Several people have been discussing it lightly: why does math so appropriately define our universe? I can’t say I understand it 100 percent or can speak eloquently about what he’s talking about, but it certainly is very thought-provoking.

[00:42:47] It’s like reading A Brief History of Time, which I was awesome with for like the first two chapters, and then I’m like, how can this little fucking book make me sad?

[00:42:58] Yeah. It’s good. I read in A Brief History of Time.

[00:43:05] He was advised that for each equation, do you guess how many readers, what percentage of readers, fall off?

[00:43:11] There you go. Analytics can solve this problem. 50 percent.

[00:43:15] So yes, yes, 50 percent of your readers. So he only has one equation in the whole book, and it’s E equals MC squared. Our Mathematical Universe has quite a few more equations.

[00:43:28] So what’s interesting is my last call is actually not very dissimilar, Michael Healy, but it’s totally coming from my direction of not being good at math.

[00:43:41] Or not as good as some people. Recently I stumbled across this YouTube video by this guy by the name of Scott Flansburg, who is one of these guys who’s like a human calculator, so he can add things up really fast and do math really quickly. He had a YouTube video that I watched about how he learned math and some of the things he’d found, and it was just fascinating. I guess maybe I’m getting old or something, but I just found it really interesting. So if you’re ever looking for a way to waste 45 minutes or so, go check out Scott Flansburg’s videos on YouTube; he’s got some pretty cool stuff. Sort of like, the number nine means something. And it ties into the mathematical universe, because, like, math, it’s totally universal. Anyway, if you’ve been listening to the show and you want to get in on the conversation, we would love to hear from you on our Facebook page or on the Measure Slack. Michael Healy is also on the Measure Slack, dropping some science here and there. So come check us out, ask questions, hang out with us. We’d love to hear from you.

[00:44:46] Can I throw in, on the Measure Slack, because we’ve had a gazillion people asking: if you go to bit.ly/measureslack, one word, you can make it all lowercase or you can capitalize the M and the S, they both work, there’s a Google form, and you can jump on in there. We know we have contributed to that community through this show.

[00:45:06] That’s a good thing to add. And once again, Michael Healy, thank you very much. And for my cohost Tim Wilson, keep analyzing.

[00:45:20] Thanks for listening. And don’t forget to join the conversation on Twitter or the Measure Slack group. We welcome your comments and questions.

[00:45:28] facebook.com/analyticshour or @analyticshour on Twitter. Our Shetler according to. Your guess work with some divas. We do. We absolutely do. Oh God. Mr. Wilson I may. Feel like maybe that’s one of the one of the entry kind of where Michael was starting. When is the first time we’re going to. Yeah I’m sure. What’s what’s your what’s what’s your guy’s internal convention doctor and doctor. I go by my nickname. We do West Coast Michael in East Coast Michael West. And now. You stopped recording right. What’s the old saying a prophet is respected except in his hometown. So knowing we can we can fix this and most of it’s the world’s most dipshit question that there’s data. It’s so big and I’m like let’s pump the brakes here. You’re not like either one of them and neither of the two of them like each other. So what was your question again. Pretty pretty pretty pretty good. We don’t do this as a YouTube video because nobody could stand to look at me and Tim for that long. So I’m having trouble right now. And bada bing bada boom you’re done with the day. West Coast there’s still not realize on talking I realized I just don’t care. OK. Just tell I mean this is obviously like a body. Yeah this is the one must submit to all Chaston work or show.

[00:47:31] Rock flag and BigQuery.
