
Don't Let Just Node.js Take the Blame

Daniel Kahn speaking at London Node User Group in June, 2017

About this talk

Daniel will cover: how to trace down problems inside Node; the challenges operations teams face in today's highly heterogeneous applications; how to protect the boundaries of your service to pinpoint problems in other tiers; and new tracing features in upcoming versions of Node.


Transcript


- When I started at Dynatrace two and a half years ago, I had already been doing Node for about three years. Node was this big next thing, and everyone thought it was mostly for startups, and my company didn't take me too seriously either. They hired me for the Node stuff, but it was more of a marketing checkmark so we could say we covered that as well, and when the CTO talked with me at contract signing he said, "Yeah, we'll give you another topic as well, because what you do isn't really that important." But we found out very quickly, also among our customers, that Node.js is what the company outlaws use to finally introduce change. Node.js was always a tool of digital transformation, and that's a really big thing currently. You can picture some brave folks going on a quest; in reality they look more like this. Alex Balesh at Intuit founded something he called a pirate ship: his TurboTax team worked after hours, used Node, and just wanted to move quickly, trying to do something differently inside what was until then a pure legacy shop. Or PayPal, this is Trevor Livingston; everyone knows the PayPal story by now, right? They were a Java monolith, and it took them around six months to deploy a simple UI change to production, because with a monolith even a text copy change means deploying everything anew. So they used Node.js at first purely as a templating and prototyping engine, but soon they moved to Node big time.
That mattered for our company, because PayPal is a customer of ours, and this was the first time Node.js got really important for us. PayPal was a Java shop, they were good customers, everyone was happy, and then they started to introduce Node.js, and our agent wasn't there at that point. To do monitoring, you need a proper agent that monitors everything, and they said, "Sorry, but what you have is not what we need." That's when we realised Node.js was getting big. Another example: Eran Hammer, at that time at Walmart, not exactly a startup. He used Node.js as a transformation tool, putting Node.js in front of the whole Walmart legacy stack to transform the company. Over time it came to us through PayPal; I worked with Intuit, I worked with Autodesk, and they are all Node shops now. This is just a short list of companies using Node.js, but you know that already. When I look at our enterprise customers and how they are using Node.js, they don't throw out Java entirely or dump the mainframe and replace everything with Node.js, because seriously, that's not what JavaScript is made for. What they do is build new offerings, some mobile app or HTML5 single-page application, and use Node.js as a glueing tier to collect data from all those different back ends, transform it, and present it to an Android client or a single-page application. The problem is that any time something goes wrong on those back-end tiers, the first place the problem shows up is Node.js.
It may even be, and I will cover that later, that it looks like a Node.js problem, because Node.js starts to error out, and soon people say, "Yeah, Node.js is what the company hipsters threw in to break everything." And then you're this guy, trying to fix things in production, and that's not so funny. But let's talk about what Node.js really is. Hopefully everyone has seen this presentation already; this was when Node.js was introduced, and it is maybe one of the worst presentations I've ever seen with the biggest impact ever. It was at JSConf EU, and Ryan Dahl introduced what Node.js should be: evented, non-blocking, et cetera. From a technical perspective, Node.js is nothing else than a C++ program controlled by V8 JavaScript. You have some JavaScript code, you throw it at the V8 engine, and on the other end machine code comes out, running directly on the target host machine; it's not bytecode as with Java, so it's simple and performant. Whenever we talk about performance, or about processes running somewhere, we have to deal with two things: every running program has something going on on the CPU and some state stored in memory, so when we talk about performance problems, we have to look at both. Why is this important? Because the people I talk with in my job are not Node developers; they sit in front of dashboards and just have to make sure this whole Node.js thing runs, and they have no idea what's really going on in there. So it's important to enable them to find problems in Node.js quickly. Let's start with memory problems.
In Node.js, memory works much like in almost every programming language. The whole area Node.js consumes is the resident set. Then we have the code segment, which is your code; the stack, where local variables, pointers, et cetera are stored; and then there is the heap. The heap is separated into an allocated heap and the used heap, the part actually in use within that allocation, and that's where everything dynamic lives: your objects, closures, everything you create. The nice thing is that there is a function you can call on process, called memoryUsage, and it spits out the resident set size, the heap total and the heap used. You can call it frequently while your program runs, so you don't need a monitoring tool; you can write the values out into a CSV file, and when you point Excel at it you get a chart like this. Here we have the resident set, here the allocated heap, and here the heap used, and we see this little sawtooth pattern: memory within the used heap is constantly consumed but also constantly freed. Everyone knows which process is to blame for that, why it looks like that? Exactly, that's an Austrian garbage collector, because I'm from Austria. So everything should be fine, actually: we have a garbage collector, so what could go wrong? As JavaScript developers we don't really have to deal with memory management when we allocate something, not even to the degree you care about it in Java. But there are problems, and to understand them we have to visit my hometown, Linz in Austria.
This part works way better when I tell it in the US, because this is the house I live in, and it was built in the 1600s; that's nothing special for you, but for the US folks it's like, oh my god. Here is where I live, there's the backyard, and there's this door, and when I open it there's the garbage room, with some stuff standing around, this old TV set, and this sheet, already old as well, that says, "Bitte nicht hier abstellen, sondern gleich selbst entsorgen," which means: please dispose of it yourself, we won't do that for you. And if this happens in your code, it looks like that. This is how Node dies when there is memory exhaustion. From a monitoring perspective this is an interesting class of error, because you currently cannot hook in with an exception handler to figure out that it happened; we would like to show it to our customers, so we are about to open a pull request to make these garbage collection problems visible too. And then the garbage collector broke, and there goes your weekend. But let's see how we could actually build a memory leak in JavaScript, and not one of those trivial ones like storing the IP address of every visitor in a global array; it's clear that will become a memory problem at some point, after a few million visitors. Let's build something a little more complicated and less obvious. I found this example in the Meteor project; I have to say that by now it may not even trigger a memory problem any more, but you can find a similar example, I'm sure. What we're doing is: we have an Express route called /leak, and it calls a function called replaceThing.
In replaceThing we have a variable, not really global, more module-global, and we set it here. Then we have this function replaceThing that is called, and we take a reference from the thing into originalThing. Then we have an unused function that is never, ever called; it references originalThing and just logs out "hi", but no one cares, because we never call it. Then we have this object that allocates a long string, a million or so asterisks, and a method that is never called either and would log something out as well. Looking at it, replaceThing apparently doesn't do anything; it just falls through, and you would think that nothing will happen, that the memory will be allocated and de-allocated, because we don't do anything with it. But the reality is different, so let's look at a schematic. We have the thing here, attached to the root context of our Node application, with the long string and someMethod inside. originalThing, inside the unused function, references the previous thing, and someMethod in turn references the unused function's scope, because a closure method has to know its enclosing context, so it keeps a reference to it. At the time I created this slide, the V8 engine simply could not resolve this number of indirections: there was nothing wrong with your code, but V8 was like, "I'm not sure I can really remove that, because there are so many references going on."
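A minimal sketch of that closure chain, adapted from the well-known example; as noted above, modern V8 may collect this just fine, so treat it as an illustration rather than a guaranteed leak:

```javascript
// Each call keeps a reference chain alive: someMethod -> enclosing
// scope -> unused -> originalThing -> the previous theThing, so the
// big strings can accumulate instead of being collected.
let theThing = null;

function replaceThing() {
  const originalThing = theThing;
  const unused = function () {
    // Never called, but it closes over originalThing.
    if (originalThing) console.log('hi');
  };
  theThing = {
    longStr: new Array(1000000).join('*'), // ~1 MB of asterisks
    someMethod: function () {
      console.log('someMethod'); // shares the scope that holds originalThing
    }
  };
}

// e.g. an Express route: app.get('/leak', (req, res) => { replaceThing(); res.send('ok'); });
```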
I really created that, and even without a tool you can see it: the committed memory rises, hits some boundary, and within that boundary Node allocates and allocates until it hits the limit, and then you get the exception and your process dies. So how can we find such memory leaks, how can we introspect what's going on in memory? The nice thing is that Node.js runs on the V8 engine, and V8 exports APIs that let you collect telemetry data. For instance, you can use the package called v8-profiler, which does nothing more than export V8's C++ profiling API to JavaScript so you can call it from Node.js. You take a heap snapshot and then serialise it. The cool thing is that the format it spits out is the same JSON format Chrome developer tools use, so you can feed it into Chrome dev tools and get something like this. To really find a memory leak you always need a delta, because memory leaks are about progression, about something that rises, so you need the state before and the state after. In your application you can build something that looks at memory thresholds every few seconds or half minutes and collects a memory dump when usage is constantly rising; then you can see what's actually growing in memory.
If we do that with the example I showed you, we can select the delta between one dump and the other, sort by it, and very quickly see: okay, this is the structure that keeps growing, and there are our asterisks again. This example is easy to find because of the asterisks, but this is how you do it. Down here you see the references this object has in memory, because memory in Node.js, as in Java for instance, is constructed as a graph: as long as an object has inbound references, it stays in memory. Here you see how deeply nested this whole structure is, and that's the reason the V8 engine couldn't collect it. So that's how you find memory leaks, and not only in Node.js; it works much the same in most languages. Memory leaks are covered, so let's talk about CPU problems. And if we talk about CPU problems in Node, you always have to talk about the Node.js event loop. There was this nice talk by Bert Belder, who wrote a lot of libuv, at least the Windows part, and libuv is the library that stands behind the whole Node event loop idea. It was funny: when he prepared his talk he did what every presenter does and looked at Google Images to see whether there was already a picture of the event loop he could reuse, and he went through what came up and said, "Wrong, wrong, wrong, wrong." These were all from blog posts by people trying to explain the event loop, and everything was plain wrong. It's interesting: the Node.js event loop is very misunderstood. So he came up with something like this.
This one has unicorns and is really nice; look up Bert Belder's talk from Node Interactive Amsterdam, it's really worth watching. I take a simpler approach to understanding Node's event loop. The event loop is not organised like a stack, as it is often represented, a stack of functions and callbacks constantly being called; that's the wrong idea. The event loop is a process that runs through stages. First it works through the timers; then the I/O callbacks, which are JavaScript callbacks for instance; then I/O polling, where it polls the threads and sees if there is something new; then it works through all the setImmediate calls our application scheduled; and then the close events. That is the loop that runs constantly. We also put some research into how we can monitor this better, and this is a healthy event loop. The interesting thing about event loop latency, and why it is a little misleading as many tools show it, is that the event loop looks the same under highest load and when it's idling. When the event loop is under high load it tries to make more cycles, like speeding up a car; under no load it basically idles along and also shows a long latency, so from latency alone you don't really know whether there is a problem with your process. One metric alone does not help you.
So we have the loop count, the number of loops, and the total time one run of the event loop takes; this is a totally sane process. Then we have the Node.js loop latency and the worker latency. We measure latency like this: we set a timer at, say, 200 milliseconds, and if the function we want to call comes back after 250 milliseconds, we know we have 50 milliseconds of latency; the event loop took 50 milliseconds longer than we wanted to get to this task, which is an indicator that it was already busy. Then we have the work process latency, the time that passes from adding a function to the asynchronous queue until it is actually processed; this tells you how busy the asynchronous pool is, and gives you a clear metric for how busy the event loop is at this moment. A good example of an event loop problem: you call something asynchronously, and this runs on the main thread event loop and sometimes on the thread pool. That is also misunderstood: Node doesn't always use the thread pool. Node uses heuristics to find out whether something can run asynchronously through the system kernel, because many system functionalities are already asynchronous, so it doesn't have to go through Node's thread pool; it figures that out and uses either the thread pool or the system, and then the result comes back and everything is fine. Then someone comes along and says, "That's great, and as Node is so lovely and asynchronous, I can simply calculate Fibonacci in here." The problem with Node.js, as we all know, is that CPU-heavy tasks are not what it is made for, because they block the main thread.
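The timer-overshoot measurement described above can be sketched like this; the 200 ms figure from the talk is just one choice of baseline:

```javascript
// Schedule a timer for a known delay and compare against when it
// actually fires. The overshoot is time the event loop spent busy
// with other work before it could get to our callback.
function measureLoopLag(expectedMs, callback) {
  const start = process.hrtime.bigint();
  setTimeout(() => {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    callback(elapsedMs - expectedMs); // lag in milliseconds
  }, expectedMs);
}

// e.g. measureLoopLag(200, lag => console.log(`event loop lag: ${lag.toFixed(1)} ms`));
```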
This means that while this Fibonacci is calculated, Node cannot accept any more requests, and that's really different from, for instance, PHP, because in PHP every request spawns a thread in Apache, and if one thread is blocked, it's no problem, up to a degree: when a new request comes in, Apache simply spawns a new thread and keeps going until the whole thread pool, I think it was 250 threads, is exhausted, and only then do you really have a problem. And if we look at the event loop in this situation, the total time rises, the number of loop runs really slows down, and the latency gets really big; then we have a problem. So what can we do to find out about CPU problems? One interesting showcase I created, also with v8-profiler: I used startProfiling and stopProfiling, which let you dump out a CPU profile, and I wanted to know what's really going on when you run Express in development versus production mode.
I got data out of it and fed it to D3, the JavaScript charting library, and created some sunburst charts, because I like sunburst charts, and it looked like this. I won't try the live demo, because it fails most of the time, but what you can do here is drill into every segment via D3 and go deeper, and if you hover over it, you see that everything going on here is done by Jade, which I used as the template engine. That makes sense: if you run an Express application in development mode, your template gets recompiled for every request, and that's a lot of work; we're talking about the application being roughly half as fast as in production mode. And this is production mode; you don't need to know much to see that there is far less going on on the CPU in a given time segment. The funny thing, which I didn't know, is that Express actually defaults to development mode when you don't explicitly set production mode. This means that if you run Express in production and don't explicitly set NODE_ENV to production, you are wasting a lot of performance without even knowing it; that was a learning for me. So let's talk about a few other real cases of Node performance problems. There was this "Node.js in Flames" blog post; who knows it? Good, then I can talk about it a little. This was Netflix, and Netflix used, or uses, Express, and they had the problem that request latency really went up after a few hours every day.
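A small guard against the default-to-development pitfall might look like this; `warnIfDevMode` is a hypothetical helper, and the `'development'` fallback mirrors Express's own behaviour when NODE_ENV is unset:

```javascript
// Warn at startup if the process is running with Express's implicit
// development default, where views are recompiled on every request.
function warnIfDevMode() {
  const env = process.env.NODE_ENV || 'development'; // Express's default
  if (env !== 'production') {
    console.warn(`NODE_ENV is "${env}" - views are recompiled on every request`);
  }
  return env;
}

// e.g. call warnIfDevMode() once before app.listen(...).
```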
They had to restart the process to get everything back to normal, and then latency would rise again, so they tried to figure out what it was, and in the end they found out that Express.js was to blame. They wrote this blog post, which caused some flame war because the name fit so well. What they did was run a script that was constantly creating new routes, which makes sense, because they have this huge, dynamic infrastructure, and they assumed that registering the same route again would override the old route. That would be typical hash map behaviour, one key overriding the other. What was really the case: here we have all these routes coming in, and route A appears twice, because it's not a hash map, it's a simple array, and if you put route A in there a thousand times, you get an array with a thousand items. The reason Express doesn't use a hash map is also quite clear: routes can be patterns, and you can't build a hash map that matches on patterns. So you get a typical O(n) problem: Netflix was creating huge routing tables, and a single request had to run through thousands of array items until it found the right route to call. I tried to reproduce that: I created an application that built up a larger and larger routing table over time, did some CPU profiling, and got this. Here you see again, when you hover over it, that everything going on in all these segments is route matching, running through this array; a typical CPU problem in Node.js.
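A toy router makes the failure mode concrete; this is a simplified sketch of the behaviour described, not Express's actual implementation:

```javascript
// Layers live in an array (as in Express, since patterns like /user/:id
// can't be exact-match map keys), so re-registering a path appends
// instead of replacing, and dispatch is a linear O(n) scan.
function makeRouter() {
  const stack = [];
  return {
    add(path, handler) { stack.push({ path, handler }); },
    dispatch(path) {
      for (const layer of stack) {
        if (layer.path === path) return layer.handler(); // first match wins
      }
    },
    size() { return stack.length; }
  };
}

// Registering '/a' 1000 times yields 1000 layers, and every request
// for a path near the end of the table scans all of them.
```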
And it is a problem actually caused by application code, so I think it's quite an interesting case. Netflix used flame charts to figure this out; of course, once you know the problem it's easy to find it again, as I did, but even with regular CPU profiles it's possible to find this problem. Another topic I want to cover briefly is security. The nice thing with Node is all those npm packages; this slide is already old, we are now at 500,000 npm packages available, not all for Node, meanwhile also for React or whatever. And the thing is, Node.js developers are quick to install an npm package. I don't have to mention left-pad in this audience, right? I think everyone knows that story. Very often we install stuff because we found it via Stack Overflow, npm install, and then it magically works; code from someone else on the internet is always better than mine, so I use it. But you have to be aware that anyone can publish an npm package, and bad things can be done with that; most of the time people aren't intentionally bad, they just screw up when they code and create security problems. For that I really recommend you look into Node Security, which is by now even a chartered working group within the Node project. This is Adam Baldwin, a really nice guy; if you see him doing a talk somewhere, listen to him, because he founded this whole security project, and it's actually free to use. And Snyk, another company, is another option: you can throw your package.json at their API and it will tell you whether there are known security problems in your module base. I think that's important and we should always consider it; that is Snyk, by the way.
So those are two ways to do it, and Snyk also consumes parts of the Node Security Project. These are important things to consider when using Node, and in a larger enterprise you should also have some kind of auditing in place that looks at what you are actually installing, which third-party code you are putting into your own code base. But what if Node isn't the source of all evil? I was hinting at that before. To understand it, we have to look at a little bit of history. In the 80s things were really easy for people working in IT: there was the mainframe, and when something went wrong, you went to the mainframe guy, because it was most probably the mainframe. Then Java came, and Java talked to the mainframe, so it already started to get hard to find out what was really going on: was it the Java client, or something in the mainframe? Around 2000 we had websites, which no one cared much about, and then all those SharePoint and Java intranet applications, where people run real business processes and transactions through some browser interface that again talks to a mainframe. And now we are here, and this is the perspective an operations person has: you have Node in the back end, then the single-page applications and iOS or Android clients, then you pull in some jQuery, some React, whatever, maybe from CDNs and maybe from Amazon S3, and Amazon S3 is never down, right? So nothing to fear there.
The thing is, the number of stakeholders constantly grew over the years, and it's now really hard to find where the root cause is, and this hits Node.js even more, because as I showed, Node.js is mostly used to consume data from different sources. One problem we came across very often is called backpressure. I hope there are no car lovers here, because the next slide will be a little hard: this is backpressure, something really fast pulling something really slow that was not made for it. Transferred to code, it looks like this: you have your fast Node application that can do a thousand requests per second, and a legacy back end that can do just ten requests per second. You would think things just get slow and you can easily see the back end is to blame; the problem is that at some point the I/O polling queue of Node.js gets overloaded, because it's waiting for all those requests coming back from the back end, and then Node starts to behave non-deterministically and to error out. So the first one to see an error is you, and then it's you who gets the blame. Here is another graph of backpressure on the event loop: a lot of loop runs are happening, the work process latency is really growing, and the Node event loop latency stays low because the loop itself has nothing to do; the work is all queued up in the asynchronous pool. That's a real problem, and it causes a lot of trouble in companies.
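The 1000-vs-10 requests-per-second mismatch can be made concrete with a per-second simulation; the rates are the talk's example figures, and the queue stands in for Node's pending I/O:

```javascript
// Whatever the backend can't drain each second stays queued,
// so the backlog grows without bound while the producer outpaces
// the consumer.
function simulateBackpressure(producedPerSec, drainedPerSec, seconds) {
  let queued = 0;
  for (let s = 0; s < seconds; s++) {
    queued += producedPerSec;                 // requests issued this second
    queued -= Math.min(queued, drainedPerSec); // backend drains what it can
  }
  return queued; // requests still waiting after `seconds`
}

// After one minute at 1000 req/s against a 10 req/s backend,
// 990 requests pile up every second.
```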
What I would say is that as Node developers, in larger applications we need a holistic view, to see the whole picture, and maybe some border protection. I had Trump on this slide at some point, but for political reasons, when he became President, I was not brave enough to keep him in, because I also give this talk in the US and I don't know if they'd let me back in. What you can actually do in your Node application is simple: use hrtime to take a timestamp before you call an asynchronous function; the function runs, the callback fires, you take hrtime again, and the delta tells you how long the call took. Of course you also have to handle errors: write the error out to some external system, maybe Elasticsearch, or if everything went fine, log a success along with the method duration. This can and will help you at least find out whether your back end caused a slowdown; that's no big deal. The problem is you still don't know where the request came from, and you don't have much metadata about it at all. And of course everyone can do this, for the browser, web server, Node, Java, Oracle, whatever, so everyone collects this data, but I would say we need a more holistic view, and this is where application performance monitoring comes into play. As I said, everything I covered before is totally doable in your own code. When it gets larger you may want a bigger picture, meaning you want to follow transactions through your whole stack to know how they pass through your tiers; this is the full-stack idea.
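The hrtime-delta timing described above can be sketched as a wrapper; `reportMetric` is a hypothetical sink (stdout here, Elasticsearch or similar in practice), and the promise-based shape is one choice among several:

```javascript
// Wrap an async function: timestamp before the call, timestamp in the
// callback, report the delta plus success/failure.
function timeAsync(name, fn, reportMetric = console.log) {
  return (...args) => new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    fn(...args)
      .then(result => {
        const ms = Number(process.hrtime.bigint() - start) / 1e6;
        reportMetric({ name, ms, ok: true });
        resolve(result);
      })
      .catch(err => {
        const ms = Number(process.hrtime.bigint() - start) / 1e6;
        reportMetric({ name, ms, ok: false, error: String(err) });
        reject(err);
      });
  });
}

// e.g. const timedFetch = timeAsync('legacy-backend', callBackend);
```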
Then you can create something like this, and you know how your whole infrastructure looks, how the different tiers communicate with each other, or you can produce something like this and find architecture problems. Node.js is really hard to monitor for us, because of all those asynchronous callbacks. I think it took us three years, and we wore out two or three development teams, to get it really right, because passing a transactional context through a chain of callbacks is really hard; you have to do a lot of wrapping, a lot of monkey patching of your code. There are currently a lot of initiatives to improve this, but still, Node.js is maybe one of the hardest code bases to monitor, and we have a lot of customers who say, "You do it so well for Java; be serious, you just don't take Node.js seriously, that's why you don't do it properly." But it really is hard to do. This is just a picture of a larger application that we call the Death Star, because it looks a little like it; this is how today's infrastructures look and how complex things can get, and this is our mushroom cloud, the progression of a problem when the shit really hits the fan. It's really complicated, so I'll skip that; I want to show you some awesome stuff at the end.
One thing is that the Node project and V8 work really closely together, which is not a given, because per se, for Google, Node is not the most important platform; Chrome is. So working closely with Node and putting new functionality in there takes real effort, and there are a few people at Google who are closely tied to the Node.js community and keep giving us new functionality we can utilise, like the APIs I showed you before with v8-profiler. And now we have node inspect, which is a great thing; you could install it with npm, and I think with Node 8 it's even officially part of Node. You call node inspect main.js and you get a URI; you throw this URI into Chrome dev tools and it shows you your code with a stepping debugger. You can even change your code in place in Chrome developer tools, and it changes the code that actually runs, so you can play around with your JavaScript just like in the browser. I'd say that's really great, and we are constantly working on new things. The next thing coming is a trace event API, so module vendors will be able to send tracing events to the new V8 tracing engine; this means we can trace out telemetry data about Node.js on a separate thread, which will also be a very nice thing.
If you're more interested in this whole topic, look at Node.js Diagnostics; that's a working group. The nice thing about the Node project is that it is a really welcoming project: everyone can contribute, every contribution is very welcome, and at all those Node Interactive events around the world new contributors are onboarded, so it's really possible to work with these groups within Node.js. I'm working with the diagnostics working group, where we meet every few weeks and discuss how we can make monitoring of Node.js better, and I invite every one of you to have a look; we have documented everything that's going on around it. That's it from me, thank you.