Panel:
Table of Contents:
- Introduction
- Working Remote
- Black Friday
- The Memory Leak
- Large Scale Mobile Deployments
- Testing and Code Coverage
- Security Testing
- Sharing the Data
- Event Loop Monitoring
- Plugs
00:17 - Daniel Shaw: Hello and welcome to NodeUp. It is a very node Black Friday show. Today I'm joined by Eran Hammer, Ben Acker, Wyatt Preul, and Kevin Decker. They are the Walmart Labs team behind #nodebf, the hashtag on Twitter. If you haven't gone through and read all the amazing live tweets, be sure to check that out. There's been an exciting thing happening over the last six months or so: Walmart has become the new Voxer. What I mean by that is, for a long time the service that was breaking node in ways that no one else was seeing was Voxer. It really helped evolve node core. Now Eran and the Walmart team have taken over that mantle and that responsibility and have been breaking node in new and interesting ways. It's really exciting to see the load and the traffic that's going through node at Walmart. It's really important, not only for Walmart, but for the node community in general. So I'm really excited to get this show going. Why don't I start by letting everybody introduce themselves. Eran, you want to kick it off with who you are, for those who don't know?
02:02 - Eran Hammer: I'm the chief node instigator at Walmart. I joined about two years ago with the sole goal of getting Walmart to switch over to node. I'm the architect for our hapi.js framework. And I'm the guy who says naughty things on stage.
02:30 - Daniel Shaw: And everyone enjoys it. Wyatt?
02:35 - Wyatt Preul: I'm Wyatt. I live in Kansas City and I work on Eran's team and help with hapi.js and any spumko-related projects.
02:44 - Daniel Shaw: Awesome. Kevin?
02:46 - Kevin Decker: I'm Kevin. I'm in charge of the mobile web front-end team. And I'm based out of Chicago, myself.
02:53 - Daniel Shaw: Awesome. And Ben Acker?
02:55 - Ben Acker: Hey, I'm Ben and I'm up in and around Portland, Oregon. I've been there about as long as Eran, doing stuff on a couple of different teams over the past two years.
03:10 - Daniel Shaw: So before we dive into the discussion around what happened with node and Black Friday, there's a really interesting thing to note: you guys are insanely distributed! That's fantastic!
03:25 - Eran Hammer: Yeah, we really believe in remote. Not only do we have two main offices in Portland and San Bruno (right outside San Francisco), but the mobile web team is all over the place, our team is all over the place. We actually have one team member located in San Francisco and we say "you really shouldn't go into the office anymore." It's been great. It's been a really good way of finding talent and also being really productive; not wasting time on commute and other silliness.
04:07 - Daniel Shaw: I love not having to get in a vehicle and travel two or three hours a day. Those are two or three hours that I can use for much more productive things, including just spending time with my family.
04:24 - Eran Hammer: This is not even the full team. We have team members in New York, in Minnesota, I'm in the Santa Cruz Mountains, so we really are all over the place.
04:36 - Daniel Shaw: That's fantastic. It's great to be able to pull in all the amazing talent that you have on your team. Before I get into what Black Friday is about, I want to give a quick shout out to our sponsors. Today, NodeUp is sponsored by &yet, Modulus, and Clock. Let me tell you real quick about &yet and all the amazing stuff they're doing. &yet builds node APIs and JavaScript SDKs for clients like AT&T, as well as building their own product, And Bang. And Bang is a service that offers real time chat and task management for teams. You can use it on the web, in IRC, or you can roll your own using their API. It's free during the private beta. You can check out And Bang at andbang.com. It has both RESTful and Socket.IO APIs. &yet has been a leader in WebRTC and you should check out their SimpleWebRTC library. Their Talky.io service -- @usetalky on Twitter -- is an evolution of the experimentation they've been doing over the past year with WebRTC. They're really driving that forward. And you can't miss their security contributions with Lift Security and their leadership position in the Node Security Project. So, lots of love for &yet. I really appreciate all they do to support NodeUp and the node community.
06:22 - Daniel Shaw: So, Black Friday. Black Friday is important to Walmart, so why don't we set the stage here and get a better understanding of why it's such a big deal.
06:37 - Ben Acker: I think for any retailer, Black Friday is kind of nuts and things go a little bit crazy. There's loads of traffic, both on foot and online. There are gonna be loads of spikes, loads of stuff going out at multiple times during the day, and multiple peaks throughout the day that you're going to have to prepare for. Overall, it's a harrowing experience for everybody involved -- both for consumers and for those providing stuff for consumption.
07:13 - Eran Hammer: It's not just important to Walmart, it's really one of the pillars of the American economy. It's the busiest shopping day of the year. Most retailers do a big chunk of their annual revenue in the time between Thanksgiving and Christmas. So if you're down for even a little bit of time, and you're unable to serve customers, it has a major impact. First, because those people might not wait for you to fix it; they might just go to your competitor. But also, it's an opportunity loss and we can't afford any of that when traffic is soaring and shopping is focused on two and a half days, between half of Thanksgiving, Black Friday, and Cyber Monday.
08:11 - Daniel Shaw: And this year was unusual in that Black Friday came particularly late, right?
08:18 - Eran Hammer: Yeah, so everybody gets a week less to shop. All the retailers have been freaking out about it for the last six months. And I think next year's going to be similar. People count the time between Thanksgiving and Christmas, and the one week off throws them off. That's why this year it's no longer "Black Friday," it's "Black November." Everything is painted "Black" because everyone is freaking out trying to get that lost week back.
08:59 - Daniel Shaw: How has that gone for Walmart ecommerce? Let's define the scope of what we're talking about. Your team is primarily interested in mobile and #nodebf -- what part of the Walmart infrastructure are we dealing with at this point?
09:21 - Eran Hammer: Our team, specifically, is all mobile. We serve all the mobile web experiences. That's what Kevin's primarily working on. So if you go to mobile.walmart.com you'll see the mobile web site. We also have native clients on Android and different versions for iOS. Our system is serving all those clients for the mobile customers.
09:50 - Daniel Shaw: Got it. How has that gone in the past? The past online experiences with Walmart?
09:58 - Eran Hammer: I think the key here is not necessarily comparison to past seasons, especially for mobile, but just the insane growth we're seeing. The report that came out from IBM said that 31 or 32 percent of all Black Friday traffic came from mobile devices. We have seen a 53 percent share for mobile across Walmart's Black Friday, which is pretty amazing when you think about the Walmart profile. We are on the low-cost end of retail, and even with that, our customers are way more sophisticated than the national average, which is awesome. We're basically looking at a doubling of traffic and a doubling of revenue year after year, in terms of mobile share. And that is the scary part. So the fact that we survived last year really means nothing for this year, because everything is doubling, everything is tripling. The mobile share of the business is getting bigger and bigger.
11:18 - Daniel Shaw: So how did the Walmart Labs team get ready for this year, especially with all the constraints that were in place. How did you get node ready for this launch? This is node's big public launch at Walmart, right?
11:43 - Wyatt Preul: We actually started pretty early on in the year and that's probably why we found that memory leak so early. We just started deploying out node servers and seeing how they handled traffic. And then we also did some stress tests as we got closer to Black Friday, to see if we could handle the anticipated load that we were expecting. On top of that, we also made sure that we had appropriate monitors in place on all of our applications and the servers, so that if there were any issues we caught them early on. Or even anticipated issues that we were able to predict were going to happen. So it's been a long process. We started early in the year, and it never really stops.
12:29 - Daniel Shaw: What did you go through to set up these tests? What were the phases of testing?
12:38 - Eran Hammer: So, there's our own testing throughout the year. For those of you that don't know, our node proxy servers sit between the mobile clients and our legacy Java services, and some third-party services. They're basically acting as proxies. The idea behind that is it allows us to migrate services from Java or other providers to node, based on their value and our priorities. We don't have to rewrite the entire system from scratch and do a big swap. It's a very convenient model, but it also means that -- in order for this model to work -- the proxy has to work really well. It has to be invisible and not add any latency. So there's a lot of testing involved just ramping up the volume and trying different server configurations -- trying it out with very few servers. When we launched it I believe we had something like six or ten servers in two data centers. And that was driving 100% of mobile traffic back in April and May. That was very few VMs driving all the mobile traffic. We learned a lot from trying different configurations, different scales of deployment. And the memory leak was really the main focus over the last few months. That drove most of our testing, most of our monitoring. That was the one thing we were worried wasn't going to work during Black Friday.
14:26 - Daniel Shaw: Let's hold off on the memory leak for a second. I want to fill in a little piece of context that I'm not sure people realize. Now, hapi.js is known as a general web framework. Is hapi also powering this proxy?
14:48 - Eran Hammer: Yeah, hapi is our basic building block for anything we do in node. The proxy itself is really hapi plus twenty lines of code. I tweeted a gist over Black Friday that's a pretty accurate cut and paste of what's running in production.
15:13 - Daniel Shaw: So if other people wanted to build out a similar proxy layer, they could go to hapi and find an example for a good proxy?
15:27 - Eran Hammer: There are plenty of good examples there. I'd also point out PayPal and their new framework. One of the components they have is their npm proxy, and they use hapi to build that.
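[Editor's note: for readers who want a sense of what "hapi plus twenty lines of code" might look like, here is a minimal sketch of a catch-all proxy route. It is an illustration, not the production gist: the upstream host and port are placeholders, and the built-in proxy handler shown here is how older hapi releases exposed it (newer versions moved it into the h2o2 plugin).]

```javascript
// Hypothetical sketch of a hapi reverse proxy: every method, every path,
// forwarded to a placeholder upstream service.
var Hapi = require('hapi');

var server = new Hapi.Server(8080);

server.route({
    method: '*',                          // proxy every HTTP method
    path: '/{p*}',                        // ...on every path
    handler: {
        proxy: {
            host: 'legacy.example.com',   // placeholder upstream host
            port: 8080,
            passThrough: true,            // forward the client's headers
            xforward: true                // add x-forwarded-* headers
        }
    }
});

server.start(function () {
    console.log('proxy running at', server.info.uri);
});
```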
15:42 - Daniel Shaw: And there's this feature in hapi to basically define a component in JSON and it uses that to build it up. It's quite interesting to see. Lots of great things are coming with hapi, and new and different ways to use it. A lot of people coming to the node community are used to Django and Rails and these web-centric frameworks. They might not realize how network-focused node is. It's great to see hapi also being such a core node citizen, being a much broader network framework. I want to get into the memory leak. It was a fantastic story. But first I want to talk about Modulus. Modulus is a PaaS service. They have integrated MongoDB hosting that ties in really nice performance statistics and leverages that tight integration to help visualize that data, so you can see how things are performing. In their PaaS, they offer free SSL, websockets support, and they support nearly every node version. You can sign up for the Modulus PaaS service at modulus.io and use the promo code nodeup1.
So check them out. They are @onmodulus on Twitter. They had a very exciting announcement at NodeSummit. They now have an enterprise offering, where you can take the PaaS service and do your own hosting. Leverage the experience and service of the Modulus PaaS in your own data center. Curvature is their offering around hosting that offers rapid development, easy scaling, and real time analytics. And Inflection is their instrumentation around analytics. Check out enterprise.modulus.io for more information about that. Thank you, Modulus, for all the support.
18:25 - Daniel Shaw: So, there was a memory leak. And no one believed Eran. We all thought it was back pressure and that Eran was wrong. Eran, can you tell us about this leak?
18:40 - Eran Hammer: I'm just gonna let Wyatt talk about the leak. He's the guy who suffered mostly through it. I just pointed fingers and said "look at charts, look at the charts!"
18:40 - Wyatt Preul: It started pretty early on, whenever we got the proxy deployed. At first we had some crashes that we resolved, and then everything seemed to sort itself out, except for some slight memory growth here and there. I was one of the people who did not believe Eran that it was a full-on leak. He spent a lot of time watching the graphs and charting them out. And it definitely ended up being a leak. We were able to share some core dumps with Joyent, and TJ Fontaine at Joyent was able to reproduce the issue. He created a really sweet DTrace script that was able to nail down where the leak was located. He got it fixed and released in 0.10.22, so that's the version we had running on our servers. Eran, did I miss anything?
19:52 - Eran Hammer: I think the coolest part of this leak was that it was a 4-byte leak that was only happening when you created certain stressful situations on the HTTP client. So you had to make a lot of small HTTP client calls in a particular load to trigger it. But then, what's really interesting, is that after the leak was found, the number of people coming out and saying "that was my problem, too!" has been crazy. I just heard from the guys at YUI that one of their servers had been suffering from that. And now their RSS charts are flat. There's been a few of those. But everybody just assumed that there was something wrong with their stuff, because if it wasn't then everybody would have this problem. And it actually turned out that a lot of people did have this problem; they just didn't want to believe that it was somewhere else.
20:55 - Daniel Shaw: That shape of the graph is very familiar. So it's crazy to see that actually shift now. So, what were you guys doing to make sure that Black Friday was going to be successful? You did extensive testing to make sure node wasn't going to fuck up.
21:27 - Eran Hammer: Over six months we had been building a pretty intense monitoring system. We practiced rolling the servers, putting fail-overs in place, using load balancers. And for the longest time, we actually assumed we weren't going to run this during Black Friday. So the freak out period was really just the last three or four weeks where we said "we're going to run this on Black Friday, so now we have to make sure it's ready for prime time." It was really just a long journey of constantly stabilizing and improving what we had -- without fully knowing how prepared we were for quite this much traffic all at once.
22:24 - Daniel Shaw: Because it's probably pretty hard to stage something like that, right?
22:33 - Eran Hammer: Yeah. I like saying that Walmart is too big to stage. And part of the challenge in reproducing the memory leak, for example, was for the longest time we could not reproduce it outside of production. We needed that particular environment to trigger it in terms of the usage patterns and the hostile services both upstream and downstream. Just that nature of real, massive mobile traffic that is very hard to simulate at scale outside of production.
23:08 - Ben Acker: That's what I was going to say. Even with all the tests that we did ahead of time, none of them came close to the traffic we actually saw in production. And it also took a lot of time determining how we were going to get stuff set up. We've got monitoring systems across loads of different platforms. Eran's mentioned Splunk in the past; there's a whole bunch of stuff that feeds into Splunk. We've also got some Graphite stuff going on. And in one of our systems we've got a whole bunch of monitors providing information on all of the system resources for one entire system, which was ridiculously useful in pointing out some problems we were having with certain elements of our analytics system -- problems we were able to react to and fix during the Black Friday weekend.
24:07 - Daniel Shaw: I've built large scale mobile deployments, but I don't know that everyone is familiar with the challenges that are involved in mobile clients. You described them as hostile clients; what does that mean?
24:27 - Eran Hammer: You're trying to get me to give the quote from the stage, right?
24:33 - Daniel Shaw: Not necessarily! But there were some choice moments.
24:40 - Eran Hammer: You have to look it up if you want to see more. But it was a one-time, um... Mobile clients, between latency, long connections, disconnects, incomplete socket transactions -- they're not very polite in terms of how TCP is being used, so you don't always get your heartbeat or your acknowledgment or the termination of the connection. It creates quite a lot of stress on your servers because they end up with a lot of crap they have to maintain. Node does a really great job of doing it all for you. But it does take some time, on the software side in your application, to really fine-tune your app to deal with it. That's a lot of the smarts that we have in hapi. It's doing all those things, accounting for all those edge cases where you're proxying but then something goes wrong in the middle: the connection terminates or closes or errors, or multiple of those at the same time, or you have the socket close on you. All those different things. You end up with this insane state machine that has to handle all those different events. If you look at most people's applications, they have this one big error handler for everything. They don't really care if what happened was the socket closing or anything else. They're just like "Oh, it failed? Just ignore it." But you can't really do that in a more complex system, especially not a proxy, because you're going to end up wasting a lot of resources if you don't terminate all the dependencies once something goes wrong.
26:30 - Daniel Shaw: Absolutely. When building the backend for a mobile service, there's a disproportionate amount of code around those paths. So what works well in handling that? What has been effective in cleaning that stuff up? Do you have to go through an extensive audit, making sure you know all the stuff that's there so you're closing everything? How have you gone about doing that?
27:15 - Eran Hammer: Most of it was actually done by Wyatt. I'll let him talk about that but really it was just an exercise of dealing with stuff as it's happening.
27:28 - Wyatt Preul: Some of it seemed to be a response to what we thought was the memory leak. We thought it was related to disconnects. So as we tried to solve that problem, I added code to track the upstream requests whenever the mobile client disconnects to make sure that all the upstream requests disconnect. We do have to track all of that, through pools for all those requests. That's about it.
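[Editor's note: a minimal sketch of the kind of cleanup Wyatt is describing, using plain node http rather than hapi. The host, port, and route are placeholders; the point is simply that when the mobile client goes away, the upstream request gets aborted too so sockets and memory aren't leaked.]

```javascript
// Hypothetical proxy handler: tie the lifetime of the upstream request to
// the lifetime of the (possibly hostile) mobile client connection.
var http = require('http');

http.createServer(function (req, res) {
    // Forward the incoming request to a placeholder upstream service.
    var upstream = http.request({
        host: 'upstream.example.com',
        port: 8080,
        path: req.url,
        method: req.method,
        headers: req.headers
    }, function (upstreamRes) {
        res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(res);
    });

    // Client disappeared (flaky radio, closed app, dropped socket):
    // abort the upstream call instead of letting it linger.
    req.on('close', function () {
        upstream.abort();
    });

    // Upstream failed: tear down the client side rather than hanging.
    upstream.on('error', function () {
        res.destroy();
    });

    req.pipe(upstream);
}).listen(8000);
```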
27:56 - Daniel Shaw: So at this point you have all that instrumented. Have you experimented with reducing some of that code surface area and loosening things up, or do you feel like you've potentially over-engineered there and could reduce some of that?
28:15 - Wyatt Preul: That's the great thing with having Eran on your team: he won't let it be over-engineered before it gets committed. He does a good job of checking all the pull requests and he probably did trim out a lot of extra over-engineering that I tried to add in. So it's pretty slim right now.
28:33 - Eran Hammer: You can go and look at the hapi changelog and you'll see some unfortunate "fixes" by me, trying to simplify some of his handling, only to have another version of hapi go into production an hour later where we put it right back in. There have been plenty of times where we said "Why are we doing this? Why are we listening on this event? It doesn't look like something we should be doing." And then as soon as we put it into production we're like "oh, yeah, that's why that's there." The way we handle that is to immediately go in and add more unit testing that will try to recreate that particular thing. Then the next time somebody tries to remove a piece of code that doesn't look like it's doing anything, some test will break, and they can go see what that test is doing and what it's describing. It's been our approach that relying on our unit tests is a method of documenting why certain things are in place -- not just a way of testing, but also a way of documenting the knowledge.
29:50 - Daniel Shaw: Awesome. I love it. I think that's essential and often where I'll go look in a module before even diving into the docs. What are the important things that the module developer was concerned with? What have they addressed in their tests? So in this process, you also went through a complete analytics rewrite. Wyatt, do you want to talk about what went into getting ready on the analytics side?
30:23 - Wyatt Preul: Last year we actually did have node running on Black Friday for our analytics server. This year, it needed a lot of updating. It was running on old versions of node and hapi, so I went in and took out some of the code that we could re-use, and then updated it to use the latest hapi version on node 0.10. Then we also re-architected it somewhat. We split it into workers and collectors, which made it a lot easier to tell where our bottlenecks were and to scale out appropriately. We ended up having to do that for the workers: as we had a lot more analytics traffic to process, we were able to scale that out and add more worker processes.
31:10 - Daniel Shaw: For those who aren't familiar with the hapi ecosystem, it's not just hapi core that's doing things. Hapi is designed to be plugin-based and very extensible, so what are we talking about when we're looking at the collecting and the analytics stuff?
31:41 - Wyatt Preul: The collector just became a plugin for hapi so it was then pretty easy to deploy out. The deploy itself just became a config file pointing to that new plugin. That was about it for the collector.
32:00 - Eran Hammer: I think Wyatt's being a little modest here. We had a pretty solid solution before, done by Bradley Miller from our team, who is really good at putting together really quick solutions. The challenge has been that, with the growth since last year, it looked like we'd have to go to a couple hundred servers just to keep up with the load using the architecture that we had in place. That was fine for last year, but it wasn't really going to work this year. We had one web server that was getting all the requests, parsing all the payloads, parsing the JSON in the payloads, doing some schema validation on this data, and then passing it along to Mongo, Splunk, statsd, Omniture, or whatever else the destination was.
That was creating quite a lot of strain on the servers because they were doing all of this as in-memory processing, so it was heavy on CPU and heavy on memory. And the only way to handle the overload of data was to keep bringing more and more of these servers up. So we had no way of handling peak other than putting more capacity into the system for peak, which is very expensive for analytics, because we had about 10x growth over Thursday, Thanksgiving, and that went down to 3x the rest of the time. So there was one day when we had an insane amount of analytics coming in, and investing in all these VMs just for that one day would be really wasteful. So we were looking for a way to scale this thing up.
And then Wyatt started playing with different architectures and different solutions to create this queue architecture, which was really interesting. We learned some things about saving through a queue -- we ended up using RabbitMQ as our queue -- we tried using AWS directly, we tried writing to files, and the queue was the fastest. Which really shows just how crazy-fast node is when it's doing just network stuff. If it's network in, network out, those processes that are receiving all the data are doing almost no work; they're just sitting there with low CPU and low memory, piping data right back into the queue. Then we had the other processes take the data from the queue, and when we were getting too much data we would just fall behind on the queue, but we weren't losing data. It bought us more time to get over the hump of the increased traffic until things quieted down enough to catch up.
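[Editor's note: a rough sketch of the collector/worker split around a RabbitMQ queue described above, using the amqplib client. The queue name, URL, and downstream fan-out are placeholders, and the collector and worker are shown in one process only for brevity; in practice they would be separate processes that scale independently. This is an illustration of the shape of the idea, not the Walmart code.]

```javascript
// Collectors do network-in/network-out only; workers drain the queue and do
// the expensive parsing, validation, and forwarding at their own pace.
var amqp = require('amqplib/callback_api');
var http = require('http');

var QUEUE = 'analytics';

amqp.connect('amqp://localhost', function (err, conn) {
    if (err) throw err;
    conn.createChannel(function (err, ch) {
        if (err) throw err;
        ch.assertQueue(QUEUE, { durable: true });

        // Collector: accept the beacon payload and drop it straight onto the
        // queue -- no JSON parsing, no schema validation here.
        http.createServer(function (req, res) {
            var chunks = [];
            req.on('data', function (c) { chunks.push(c); });
            req.on('end', function () {
                ch.sendToQueue(QUEUE, Buffer.concat(chunks), { persistent: true });
                res.writeHead(202);
                res.end();
            });
        }).listen(8000);

        // Worker: if traffic spikes, the queue grows and we fall behind,
        // but nothing is lost; we catch up when things quiet down.
        ch.consume(QUEUE, function (msg) {
            var event = JSON.parse(msg.content.toString());
            forwardToDestinations(event);   // hypothetical Splunk/statsd/Mongo fan-out
            ch.ack(msg);                    // ack only after processing
        }, { noAck: false });
    });
});

function forwardToDestinations(event) { /* send to the real destinations here */ }
```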
34:49 - Daniel Shaw: So why RabbitMQ?
34:53 - Eran Hammer: It wasn't something that we looked at; it was another team that we worked with who was building a much more customer-facing system. They looked at multiple solutions, and of all the ones they looked at, RabbitMQ came up as the most likely queue that they'd use. So we said "okay, we'll use the same thing, then" to keep things more consistent. And it did the job really well.
35:23 - Daniel Shaw: There's a note about tracing, or a tracer for production?
35:31 - Ben Acker: It's just one of the monitoring systems that we built. Something that would make a bazillion requests of all our external analytics collectors and make sure that we could trace stuff going all the way through to identify problems. Ultimately, this was our big test of the analytics system; we had a little bit during the load tests, but Black Friday was the main time. We wanted to have something more than the load tests, so this was something simple we could do super easily in node. It traces stuff going all the way through our analytics system and makes sure everything is alright. Kind of like a mining canary.
36:19 - Daniel Shaw: Is any of this open-sourced? The collector and those things. Will any of those plugins be open sourced, or is that inside plumbing?
36:33 - Eran Hammer: It's probably going to be a blog post, rather than open-sourced. Just because there's not that much code there and to abstract our internal logic -- validating our own analytics schema and the stuff we do to make it compatible with some of the third-party vendors we use -- I don't think it's useful. And for us to abstract that away is just adding complexity where it's not necessary. But I think there will be a blog post where we share a prototype example that's doing the same thing, only without all the Walmart-specific logic. We'll definitely do that.
37:10 - Daniel Shaw: Awesome. That would be fantastic. I'm sure many people are looking forward to that. In your talk at NodeSummit you had a really interesting quote that I loved. In one of the last testing phases, there was a moment where you'd taken node out of the mix and you were seeing that those backend Java processes were performing less well than with node in the middle. Why is that?
37:46 - Eran Hammer: What node was doing very, very well was acting as a sort of back pressure for the legacy services. It was queuing things up, doing a better job at handling the heavy socket activity coming from the mobile clients, and offering a little buffer. Load balancers are very good at spreading load, but they're not very good at being patient. And so, if there's a little hiccup somewhere in the system, the load balancer will just reject all those requests immediately until everybody is up and ready. Putting node in place was basically giving the Java servers a little more breathing room, throttling some traffic for them without actually losing that traffic. It did result in increased response times, but during Black Friday people are already immune to slow websites. Everybody knows some things will be slow to load. So it's a little more acceptable in terms of customer expectation in the industry to be slightly slower. It's not good, but people are a little more patient on Black Friday than on a regular day. So I think that's what node was doing. We actually didn't get to fully investigate why everything was looking so much better with node in front. We also noticed we could reach higher page view rates and order rates when this was fully deployed. So I think it's mostly the queue behavior, which is also why things were dying without some back pressure -- it would just become this infinite queue and at some point it would run out of memory.
39:46 - Daniel Shaw: Exactly. That can be the most disastrous thing. So, before we talk about how things went on Black Friday, let me talk about our final sponsor today. Clock is a digital development agency on the outskirts of London. They make beautiful websites and web-based applications. They're known for integrating legacy systems, devices, and all the things that really don't want to be integrated. They are experts in publishing, customer insight, and bringing customer loyalty into the development experience. They have developed in-store hardware devices for retailers. They make a 100% node.js solution called SwipeStation -- you can learn more about that at swipestation.co.uk. Clock's been around since 1997, they've been using node since 0.4, and they've fallen in love with it and try to use node in everything that they do. Their clients include BBC, Newscorp, Nielsen, Joyent, Eddie Izzard, Shortlist, Hearst Media. They have a publishing platform called Catfish that is used throughout Europe, with a number of publishing companies using it. You can learn more about Clock at clock.co.uk. Be sure to follow @clock on Twitter and hit them up for a quote at [email protected].
41:28 - Daniel Shaw: So, #nodebf. Black Friday happened. And what went down?
41:35 - Ben Acker: Not much.
39:46 - Daniel Shaw: What? All this and nothing!?
41:44 - Ben Acker: Yeah. We had it set up. It's like we wanted to have this big fire drill -- there have been loads of people begging for all kinds of coverage schedules and all kinds of different stuff. So we had one set up. We had a coverage schedule so that if anything happened we would be ready. We were all excited to see how it would perform, so we were all on anyway, just on a hangout pretty much from Wednesday all the way through Saturday. Just taking a little time to sleep here and there. But for the most part it was listening to Eran DJ to us through the hangout.
42:35 - Eran Hammer: It was so boring. It was driving me nuts. But the thing is, it was edge-of-your-seat boring. It's like, you're bored but at the same time your body is completely tense. You're waiting for something to blow up and nothing blows up, so it's this constant conflict in your own brain.
42:58 - Ben Acker: And to give you some kind of idea: before everything went down we were talking about this new analytics engine that Wyatt put together, and we have these Nagios alerts that'll hit us based on queue sizes. Sometimes we'd get heavy traffic and the queues would get into the hundreds of thousands. And then we made some changes so it could handle more traffic, and I got a call at some point during that, when I wasn't on the hangout. It was Eran, and I answered and said "What's up, man?" and he's like "Ben, one of the queues is up to 16! You need to check this out!" So, yeah.
43:40 - Daniel Shaw: Ha! Could be dangerous!
43:43 - Eran Hammer: I think it was like a hundred.
43:47 - Ben Acker: It may have been. Granted, it probably was like a hundred when he called, but it was back down when I got it.
43:59 - Eran Hammer: And the thing was, throughout this whole thing, everybody was coming up with suggestions on how we could -- without getting in trouble -- destabilize things just a little bit to make it interesting. I think Wyatt wanted to move to node 0.11.
44:22 - Kevin Decker: I liked your suggestion of reverting the memory leak fix.
44:27 - Eran Hammer: Yeah, let's just go to 0.10.21. We had a few ideas. So we found a caching bug that we had to fix before Monday. Certain promotions would not have been visible otherwise. And we were like "let's just wait until Sunday when everything is quiet" and Kevin and I looked at each other and said "no, no -- now!" We knew it was basically zero risk for us -- so it wasn't like we were being brave -- but it was this psychological thing that we're going to do a release in the middle of Black Friday. Just because. And it took a few minutes, it was completely uneventful, it worked exactly as everything was rehearsed. It was really great. It just showed what you can do, even through your busiest day, when you have a good foundation and a solid team behind it. You're not even really taking risks, you're just doing what you've been preparing for.
45:47 - Daniel Shaw: Speaking of the team there: if you have a constantly-on situation, I'm sure that being remote is actually beneficial. Being able to loop in and out, but still having some contact with reality.
46:07 - Eran Hammer: I think the culture of remote basically does away with the culture of nine to five. Everybody is managing their own time and working the crazy hours that work for their life. If you look at my GitHub activity, you'll see I'm very active between 10:00 pm and 2:00 am. It really enhances this kind of behavior; you never have to apologize for asking someone to work at a crazy hour or put in more time, because everybody knows they'll balance it at some point. It will all even out. We don't need to say "oh, you're going to work four more shifts today and you'll take next week off to compensate." You don't have to play those accounting games that most other companies have to play. And the thing was, during Black Friday itself, most of us were up all night. We only had two people listed as responsible, but it was just so much fun. We were just having a good time, like a team building exercise -- we're on a hangout, we're all talking. Yeah, I think remote is the future. It just takes time for people to understand that and embrace it fully.
47:25 - Daniel Shaw: And build the trust. Having home runs like #nodebf is a great way not only to move node forward but also to serve as a positive indicator of what a remote team can accomplish, and how it's beneficial. So I know you're dying to share some numbers, Eran. What can you share, in terms of details about what happened on Black Friday?
48:09 - Eran Hammer: Actually, that's one of the few things that we can't really share. We were able to push the boundaries on what's acceptable sharing for the company, especially live. But the numbers part, in terms of revenue and volume and number of users? Unfortunately those are all SEC-regulated pieces of information and we have to wait a little longer. We'll be able to share a lot of this stuff more as soon as the quarterly reports are out. But what I can share is that we've seen doubling and tripling of volumes. We have seen a very obvious migration away from desktop to mobile.
49:05 - Daniel Shaw: There's a number there, right?
49:07 - Eran Hammer: Fifty-three percent. That's the number that was put in the press release, so we can talk about that. And the variety of devices we're seeing is significantly greater. We're seeing that people are more serious about buying online and buying on mobile than before. The pattern in previous years has been more about looking up the store or looking up store inventory, or just bringing up the coupons or local ads for the stores. A lot of that has moved to making the actual transaction and purchase on the mobile device. Which is really great for us because that's what we do. And then we have this hybrid experience. Through most of the year, our focus is not so much on mobile commerce but on the hybrid in-store experience. Like self-checkout: you can scan items from your phone, pay on your phone, and leave the store without going through a cashier line. So that's more of the non-Black-Friday focus for the mobile group. But Black Friday is its own beast.
50:18 - Daniel Shaw: So what are you using node for, specifically?
50:22 - Eran Hammer: Oh, it's all over the place. I just today found a new place where node is being used at Walmart that I didn't know about. We're using it for the mobile API proxy, which is the main piece of machinery that we were talking about during Black Friday. We're using it for the entire analytics system. So: the servers that receive the analytics data, that process it and forward it on to its destination --
50:51 - Daniel Shaw: Which is Splunk.
50:53 - Eran Hammer: Splunk is one of them. We use Splunk, we use Omniture, we use statsd, we use MongoDB, and we have our own secondary reporting system that was built in-house -- it's pretty awesome. And we're really excited about the Voxer analytics system that Matt has promised to open-source any minute now.
51:11 - Daniel Shaw: Everyone needs to send Matt a message on Twitter expressing how excited they are to see that. Having used that, it's really great.
51:25 - Eran Hammer: It's magic. I went there and I was drooling when he was showing me. We spent two hours at the Voxer office playing with that system and digging into stuff. And of course, Matt being Matt, he repl'd into one of the servers where we didn't understand what's going on and he started playing around in live code right there with live traffic. It's great fun, and I think when Matt open-sources that it will give some heartburn to startups that are thinking about doing node operations stuff. They'll look at it and go "Oh. Crap." I think it will give some Splunk use cases a run for their money. It's awesome. It's really a fantastic home-brewed system that's going to make a huge difference in how node ops are done. So I'm really excited about that.
The other things we do with node... we have a whole bunch of things. Today I learned about another team from Walmart in the labs area. They do all the social data mining activities. So their focus, for example, is recommending other products for you to look at. You know, relevant products based on what page you're on, and it'll say "hey, look at these pages." That engine is running on node and hapi. It's been live through Black Friday. I didn't know that. They also said it was really boring, their servers never went above two percent CPU. And we tweeted it today from the hapi.js account -- there's a little picture. So if you go to the Walmart.com website and you go on the left side you'll see "other related products" and that little panel is coming from node. It's really cool. People are doing it in other teams and it's growing quite naturally within the company.
But the other thing that's really cool for us is we've recently moved our mobile web experience -- which is still our biggest driver of traffic and transactions -- from being hosted on Java to node, and we also built a pretty nifty A/B test framework into it. Kevin can talk more about that.
53:49 - Daniel Shaw: Yeah, I'd love to hear about that.
53:51 - Kevin Decker: Over the past couple of weeks, Eran and I were tackling how we could do A/B testing. Our previous solutions weren't going to scale for Black Friday; in years past, we actually just had to turn them off completely. But through a variety of -- I don't want to say "interesting hacks" -- smart uses of data, we were able to do all this without any I/O, have it be completely self-contained, and you can actually run it on the client side if you want. It performed flawlessly during the Black Friday time period. And we've open-sourced all of it under the Spumko project. "Confidence" is the name of it. We'll also be posting about this on the Walmart Labs blog in the not-too-distant future.
54:56 - Eran Hammer: We're just waiting for the machinery of PR to put it up there. It was really one of those fun moments where Kevin and I were talking over IM and saying "well, we can't put it there and we can't put it there," because everything would break something that's already being used. And someone came up with "well, we already have this random UUID that we use; why can't we use that as our random data set for the submission?" So we started looking at that and found a pretty clever solution. The idea came from Adam Barth, who's one of the leads on the Blink engine at Google. He's my go-to crypto guy whenever I have anything related to crypto. So that's a really cool hack. The way it worked out, we were generating evenly distributed numbers for A/B tests without actually changing the footprint of our -- [possible drop out from 56:01-56:06].
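[Editor's note: purely illustrative sketch of the general idea described here -- deriving an evenly distributed A/B bucket from an identifier the client already carries (such as a session UUID), with no extra I/O and no server-side state. The function name, hash choice, and bucket scheme are assumptions for the example; this is not the Confidence implementation.]

```javascript
// Map an existing UUID to a stable, evenly distributed experiment variant.
var crypto = require('crypto');

function pickVariant(uuid, experiment, variants) {
    // Hash the UUID together with the experiment name so that different
    // experiments get independent (but repeatable) assignments.
    var digest = crypto.createHash('sha1')
        .update(uuid + ':' + experiment)
        .digest();

    // Interpret the first four bytes as an unsigned integer; a decent hash
    // keeps the resulting distribution close to uniform across variants.
    var n = digest.readUInt32BE(0);
    return variants[n % variants.length];
}

// The same visitor always lands in the same variant, with no lookups.
console.log(pickVariant('9f1c2a4e-0000-4000-8000-123456789abc', 'checkout-button', ['A', 'B']));
```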
Our team is actually moving into a no-QA process. It's still gonna get QA'd by virtue of us providing an API for mobile clients that get tested, and they test our stuff as well. But as for the services layer itself, we feel pretty good that we don't really need much more than a bunch of automated smoke tests and integration tests that run automatically whenever we do a build or push stuff. Between Travis and the blanket.js node coverage module, at this point we have thousands of tests. Our test-to-code ratio is crazy. No one can commit a change without passing the tests, and one of the checks is one hundred percent code coverage. So if you're making a change and you're not fully testing it, that'll break and it won't get merged. By keeping it at one hundred percent we can keep that going, and you'll know really quickly if something new hasn't been thoroughly tested. What we've also done is, any time there's a problem or a bug, the first thing we do is add another test that will fail. Over the last six months, we have skipped that step a couple of times and we've paid for it. Every time we've said "oh, it's just a small bug, we'll just fix it and go on," we've had to fix it like three times because we skipped writing the test first.
The other thing worth mentioning is that we worked really hard, within hapi, to implement injection functionality that allows you to fully simulate a working server without actually starting one. We use that extensively when we write all these tests, and we have a large collection of integration tests as part of every module we write, along with all the unit tests. We're leaning toward doing more on the integration side and less on the unit side, because integration tests take less effort to maintain. Every time you change something on the [call dropout] side, something breaks on the unit side, which is annoying. But you also get more thorough testing by doing that.
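[Editor's note: a small sketch of the injection style being described, exercising a route through hapi's server.inject() without binding a port. The route, payload, and assertions are placeholders, and the exact server and handler signatures vary across hapi versions; in the team's setup this kind of test would run under lab, but plain asserts are used here to keep it self-contained.]

```javascript
// Simulate a full request/response cycle against an in-memory hapi server.
var assert = require('assert');
var Hapi = require('hapi');

var server = new Hapi.Server();

server.route({
    method: 'GET',
    path: '/health',
    handler: function (request, reply) {
        reply({ ok: true });              // older callback-style reply interface
    }
});

// No server.start() needed: inject() runs the request through the whole
// routing and extension lifecycle without touching a socket.
server.inject({ method: 'GET', url: '/health' }, function (res) {
    assert.equal(res.statusCode, 200);
    assert.deepEqual(JSON.parse(res.payload), { ok: true });
    console.log('health route behaves as expected');
});
```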
And to enable that we used to use mocha for a long time, but when node 0.10 came out with domain support, mocha wasn't really doing well with that -- it just was not handling that particular error-handling scenario. So we forked off and trimmed it down significantly, we removed all the neat, crazy functionality that mocha provides and focused on our very narrow case and came out with a module called lab that we use now. It has all the built-in coverage and tooling we need for testing and it's mostly compatible with mocha. If you have mocha tests you can use them largely unchanged with lab. Our aim was not really to compete with mocha or replace it but just to say "hey, this is something that's very specific to our needs." So that's the general approach.
We've had really great success with putting more and more of the testing burden on the engineers who are building it. And also as a team, we require that any change gets reviewed by one other team member. There's no requirement that a manager has to review it or any of that crap; we just need to make sure that at least one more person looked at it. And we have this constant challenge, started by me being really mean to people when I do code review. Now it's like a game: if you find a comment to make about someone's pull request, it's showing off a little bit, but it's a healthy showing off. Wyatt's favorite pastime is to do code reviews on my changes, and he always finds something. Doesn't matter how small it is, he always finds something, just to stick it back to me.
1:00:34 - Daniel Shaw: Nice! You're trying to encourage people to get beyond "looks good to me."
1:00:40 - Eran Hammer: It's become a game within the team, you know; showing how smart you are by finding little things to nit-pick. Which is great, it keeps everyone on their toes but because we're all doing it to each other, it doesn't make anyone feel upset or stupid. We're all stupid, you know? If you're writing code you're gonna write bugs. You're not always going to see the abstraction or simplification you can do. A fresh pair of eyes is really, really important. That has been a really good process for us and the quality has been great.
The other thing that has been really fantastic is the recent uptick in hapi adoption. The community has been really awesome. And we don't even care if people don't do pull requests; the quality of the bug reporting we're getting has been so great. People will put up code to reproduce it, they're engaged, they don't just report a bug and then leave. That has been a significant factor, especially since our own use of our framework lags behind some other people's. We're not using all the features that other people are using, and when we do have to use them, they've already been battle-tested for a year. It makes our life much better, because we can go to production so much faster with a brand new feature that we haven't tested or put much traffic through ourselves -- somebody else has already had it in their production. Which is really the beauty of open source.
1:02:19 - Daniel Shaw: Excellent! I have a couple of questions from the #nodeup IRC channel and want to get them in. They're riffs on testing. Oseti asks: what did you do in terms of security testing for your node stuff?
1:02:36 - Eran Hammer: Wyatt has some security background. I have some questionable security background.
1:02:47 - Daniel Shaw: Ha! You know better than that.
1:02:50 - Eran Hammer: In terms of security, our approach has been to follow common-sense security best practices: encoding everything we echo back, and whenever there's an optional configuration piece, picking the stricter version as much as we can. I think we have a good mindset for it in the team. We haven't done anything specific just for security; it's something more embedded in the process. When we look at a piece of code we ask ourselves "Is this good? Is this not?" We also do a lot of research. When we implemented basic auth support, we went and looked at the recommendations. Our first version of basic auth support allowed you to use plain text passwords or store the passwords in the clear, and we decided we're not going to support that; we're going to make it really hard to do a poor implementation of your password storage. So we're just keeping an eye on stuff.
For example, there was the recent report about the express file exploit where you can basically fill someone's disk. We had the same thing: we were using a different module than express, but the way we were using it basically created the same situation. So we keep an eye on everyone else's security issues in the community, and as soon as anybody reports anything we immediately check whether we're vulnerable as well and patch it. Some of our users have found, in testing, some unintentional ways of bringing down a hapi server, and we fixed those. Up until very recently this wasn't a big deal because it wasn't really running in any significant production capacity. We had about a year and a half of being able to solve problems without the pressure of "oh, we have to keep this secret until everybody's updated" and all that kind of stuff. But now we're moving into a little more constrained environment where we have some customers. At NodeSummit, MasterCard mentioned that they're using hapi. Conde Nast is using it for an upcoming application that they're building for their magazines. Walmart is using it big time. It's no longer this toy that we can play with, so we are constantly reviewing how we're going to do it. We also have the internal Walmart security team that is constantly trying to break into our servers inside the VPN, so whenever they find anything they report it to us. Then we look into it and see if it's a framework issue or not. So that's the focus on security: it's part of the process, not something we do as a step.
1:06:00 - Daniel Shaw: And you have a team that has a lot of experience in that as well. Another question: how did you decide to do the live tweet stream? The #nodebf, basically? Who did that?
1:06:22 - Eran Hammer: So that came out of a promise I made two years ago at Node SummerCamp, where I stood up and tried to shame everyone in the community for not sharing any real data. And it was really funny if you think about it. The only data point we had until very, very recently was that two years ago LinkedIn did something and said they went from this many servers to this many servers. It was a really cool piece of data because it got us really excited: "Here! Somebody's doing something real!" But at the same time it was really meaningless in terms of decision making. And then everybody knew about Voxer, but nobody knew what Voxer was doing; they just knew it was something really big and great. So that was kind of useful. But at some point that stopped being useful in terms of selling it to your boss, in terms of having faith in it, in terms of making sure your shareholders and investors don't think you've lost your mind.
So I kept saying "I will share as soon as I can share." I felt that Black Friday was the first opportunity for us to really put a spotlight on something we felt good about. I felt like, either it's going to be boring as hell and that's going to be impressive by itself. Or it's going to start to break but we have such a great team that we'll be able to pick it up and fix it on the fly and kick some ass. We didn't get to kick some ass, but we got to broadcast the lack of news. Which was great by itself.
And since then, I've been asked by a lot of people "Why would you even do it? It sounds like a lot of work that doesn't give you any benefit." But I completely disagree. I think there is great value in giving back to the community -- highlighting what you've done, what worked well for you, and getting feedback from other people about what they've done. But also, let's face it, from a purely PR point of view, I think that just by showcasing what our team has done we're accomplishing quite a lot.
Giving people that reward is not something you get to do very often. Hiring for a large company is hard. A lot of people are looking at the sexier, smaller companies. Even startups today can compete on pay and benefits; it's no longer just big companies doing that. We're all engineers. We all love to write code and create stuff, and as engineers we want to talk about what we're doing. We want to share with our friends and our community. And we want to show off. What's wrong with showing off? When you work at a little startup and you get to go and do a talk about a hack you just did this weekend, it's really cool! It gets more complicated in large companies. So, just setting that stage and saying to people "Hey, you know what? You can do it! And it's not risky and it's not that complicated." And maybe other companies will now say "if Walmart can do it, we can do it, too."
1:09:46 - Daniel Shaw: That's fantastic. There's another great question that I want to get into the podcast. Eric Toth over at PayPal asked about the monitoring that you're doing on the event loop. Can you explain a little bit more? You mentioned that at NodeSummit.
1:10:05 - Eran Hammer: So the Achilles heel of any central event loop system is that if the event loop starts taking a long time then everything gets starved. In node, if you're not careful about how you're using nextTick versus setImmediate, you can very easily starve your application from ever getting to the next point.
For example, let's say you're setting a timeout for five seconds. And then after you set a timeout you are calling a function that takes six seconds to run. So it takes six seconds before that function returns control back to the event loop. What will happen is that your timeout will now take six seconds; that wasn't what you wanted. If you set multiple of these six-second timeouts and each one of them takes ten seconds, then at some point your event loop is going to get really, really behind. And you might get minutes behind.
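[Editor's note: a tiny self-contained demonstration of the starvation just described -- a five-second timer that fires a second late because a synchronous function blocks the event loop for roughly six seconds. The durations are simply the ones used in the example above.]

```javascript
// Block the event loop on purpose to show how timers get delayed.
function blockFor(ms) {
    var end = Date.now() + ms;
    while (Date.now() < end) { /* burn CPU; nothing else can run meanwhile */ }
}

var scheduled = Date.now();
setTimeout(function () {
    // Prints roughly 6000ms, not the 5000ms we asked for.
    console.log('timer fired after ' + (Date.now() - scheduled) + 'ms');
}, 5000);

blockFor(6000);   // the timer cannot fire until this returns
```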
So one thing we've done is monitor how far behind we're getting. There's a module that Lloyd from Mozilla put out called toobusy. It does it using native C++ code and it's a very neat little module. We've decided to limit how much native C++ stuff we include. We just try to avoid it whenever we can. So instead, what we're doing is very simple. You can look in the hapi source code and you can see it -- it's not proprietary, it's right there in the framework.
What we do is we just set a timeout and we say "that timeout is for one second," and when the timeout comes back we ask "how long did that actually take?" Usually it'll take one second and maybe two milliseconds, so now we know that we're two milliseconds behind. And be careful: don't use nextTick, because it does not go to the bottom of the queue. You want something all the way at the bottom of the event loop so you can really measure it. And if you're doing any kind of back pressure you absolutely have to monitor your delay. Again, all you have to do is set a timeout somewhere in your application. If you're not using hapi, you should. But if you don't want to, just set a timeout somewhere. Every time it comes back, do a delta from when you set the timeout to when it fired, deduct the timeout itself, and you get the delay.
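[Editor's note: a minimal sketch of the delay sampling just described. hapi has this built in; the one-second interval and the 100ms threshold below are arbitrary placeholders, and the comment marks where a real back-pressure policy would go.]

```javascript
// Measure how far behind the event loop is by timing a repeating timeout.
var SAMPLE_MS = 1000;

function sample() {
    var scheduled = Date.now();
    setTimeout(function () {
        // How much later than requested did this callback actually run?
        var delay = Date.now() - scheduled - SAMPLE_MS;
        if (delay > 100) {
            console.warn('event loop is ' + delay + 'ms behind');
            // back-pressure hook: start rejecting or queueing new work here
        }
        sample();   // schedule the next sample
    }, SAMPLE_MS);
}

sample();
```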
For most systems, if you're building a brand new system and you're getting above 70-100 milliseconds then something is wrong. Your software is not designed correctly; you're taking too long with some of your functions before you return; you're doing too much CPU stuff. For us, we're working with a really large number because we have all these really slow upstream services. So 70 milliseconds? Yeah, I wish! So our timeout is measured in seconds but even that is good enough to provide us with the back pressure we need.
1:13:18 - Daniel Shaw: Is that line something that you do across the board on your services or something that you tune per service?
1:13:28 - Eran Hammer: Now it is. It's actually a brand new thing that we added. We always thought about doing it, but we always had excess capacity. So we'd say "we don't need to do back pressure, we'll just add more VMs." But Black Friday kind of changed our tune on that, because during the stress tests we saw how we were losing servers. They just got too much load and ran out of memory. So we said "okay, we don't want to do that." And that's when we implemented the back pressure controls. The event loop was the last thing that we added, because sampling memory without paying attention to your event loop was clearly not enough. At some point your sample rate gets so skewed that by the time you get to the next sample it's been a minute and you've already died.
1:14:16 - Daniel Shaw: Okay, well that was fantastic! It's extraordinarily inspiring to see that evolution and have the story of two years ago really tie into what's happening now. It's really great that you've taken on the responsibility of being that primary breaker of node, and also the responsibility to the node community to share that information. As an individual, I deeply appreciate it, and I know everybody else in the node community does too. And we appreciate the effort that's going into hapi.js as well. It's a great framework, and it really gets us ready for the growing enterprise business -- whatever term you want to use -- and for node becoming a reality and a first choice for building applications for the network.
1:15:40 - Daniel Shaw: So we have plugs for our upcoming events. So please, do you want to lead off plugs, Eran?
1:15:54 - Eran Hammer: Sure. Of course I want to plug our hapi.js framework. If you haven't, I challenge you to try it and tell us why you don't like it. And if you convince me I'll buy you something. I'll buy you coffee. If you have a solid argument why you don't like it versus express or restify or other frameworks, then I'll send you a Walmart gift card. But really, please try it out. We've put two years of work into it. We're really proud of it, it's awesome, and we really want more people to use it. Because the more people who use it, the better it gets for everybody.
I also want to plug the scalenpm.org fundraiser. We reached our goal, but at the same time it's never bad to have some cushion. The campaign is still going on and I do want to issue a challenge. Walmart has donated $40,000 -- no one has come close. And I want to challenge all the multinational companies that have been on stage at NodeSummit -- you know who you are -- and I have not seen your names showing up on that list. For all of you, $40,000 is the number to beat. But I also want to mention that the node developers on my team, on Kevin's team, and on other teams within Walmart Labs have really stepped up, and we donated about $2,000 of our own money to this campaign, on top of the $40,000 that the company donated. So it's not just that we got our employer to pay; we all love npm and node so much that we felt really strongly about it. If you haven't made a donation yet, go donate anything. You really don't want to be that guy whose name is not on the list when the campaign is done.
1:18:18 - Daniel Shaw: Fantastic. So, Wyatt?
1:18:20 - Wyatt Preul: Sure. I'm in Kansas City, so I was just going to plug our local meetup, NodeKC. If you're in Missouri or Kansas or Iowa or Arkansas, just come on up. We meet every month. That's it.
1:18:35 - Daniel Shaw: Awesome. Kevin?
1:18:37 - Kevin Decker: I wanted to plug the Thorax framework that we're using to actually power all of the frontend experiences here. It's at thoraxjs.org.
1:18:47 - Daniel Shaw: Awesome. Ben?
1:18:49 - Ben Acker: I wanted to throw in a plug for knode, which is a thing we put together up at PDX Node with a whole bunch of CascadiaJS folks. It's a place to store information about local meetup groups and the different types of organizing you're doing for meetups, nodebots days, or any type of hack days. It's also putting together a whole bunch of information about people who are willing to speak in different areas or are willing to travel and speak at these meetup groups. Check it out at https://github.com/knode.
1:19:34 - Daniel Shaw: That's fantastic. That's a really great resource. My plug real quick. The Node Firm just completed our first beta of performance analysis training with Trevor Norris. It's really great. And we've scheduled the next iteration for January 8th. So head over to http://thenodefirm.com and you can get signed up for that. It should sell out pretty soon, so don't linger too long. So, real quick, very high level, there aren't too many upcoming events. JSFest is in March; be sure to get ready for that. And DHTMLConf (part of JSFest) -- tickets are selling out for that, so you want to get in there soon. In Melbourne, CampJS is back in March or April 2014. Check campjs.com. JSConf is coming; tickets will go on sale the 16th of December and they will sell out in seconds. Get your bots or whatever ready for that. We'll be back in Amelia Island, Florida again. It's a great family experience. I went there this year with my entire clan and everybody loved it. NodeConf is going to be July 4th this year; it should be back at Walker Creek Ranch. Looking forward to it. Thank you so much, everybody. Please leave a review for us on iTunes, even if you aren't using iTunes; it helps us get in front of people and they can find us more easily in Apple's rankings. So leave a review and let us know what you think. I really appreciate it. Be sure to follow @NodeUp on Twitter.
Right now we're booked up on our sponsorship slots, but there should be some opening up in the future. So if you want to support NodeUp and the node community, we'd love to have you around. Eran, Ben, Wyatt, Kevin: guys, I really appreciate you sharing node Black Friday, Walmart being a really great steward in the node community, and all the great work you're doing with hapi.js. Thank you so much, and it was great talking to you guys.