About this talk
The Pizza Hut website is powered by a system of 20+ microservices, all Dockerised and running in AWS. Learn about the challenges the team has faced: splitting logic into microservices; making changes to several microservices at once and deploying them; testing a huge products API; running over 20,000 unit tests on one microservice; and a system of logging, monitoring, and support.
Hi everyone. I'm Anna. I work as a software engineer at Pizza Hut at the moment. That's my Twitter handle. Feel free to tweet at me, follow me, and I'll follow you back, all that kind of stuff. I'm here today to talk a little bit about what we have done at Pizza Hut. It's one of those talks where I will show you what we have done with Node, rather than go into details of any specific technology, but I will also go into some of the challenges of microservice and Node infrastructure and architecture, and hopefully draw some conclusions that are applicable to everyone who is using Node. So about a year ago, Pizza Hut had this very out-of-date, ugly website, which is a little bit weird, because they were the first e-commerce website ever created, some time in the 90s. So you would think that they would be ahead with digital, but a year ago it wasn't so, and that's why Pizza Hut created Pizza Hut Digital Ventures, where I work: to create a new, flashy, awesome website, to begin with for the UK and later on to go global. They invited McKinsey to help them out with the kickstart, and they came in and said that they could start taking first orders in six weeks, which is a little bit crazy, but they did do it. They did do it in six weeks. The way they did it is they created a super-simple, lean, targeted infrastructure full of microservices. To begin with, there weren't that many. Over time it grew. Now we have around 20 microservices and we cover two regions in the UK, so we have fully covered the UK market at the moment, and we are moving to France and going global over the next three years. Obviously the system doesn't only have to be lean and easy to develop, it also has to be super resilient and scalable. To do that, we created a microservice infrastructure, which is really suitable for this kind of task, and we decided to use Node fully, 100%, for several reasons, all of them super pragmatic. 
That's not necessarily because Node is the best language in the universe, but because there is a huge community around it. It's easy to get good libraries. It's easy to get tested, well-documented code that you can just pick up and use. It's easy to do front end and back end in the same language. All of us at Pizza Hut do full stack, because there is no need to switch context. We all work in the same language. And, last but not least, it's easy to hire. It's easier than hiring for, for example, Clojure or something like that. So we stayed very much on target. So there will be two parts to this talk. The first one will be shorter, and it will be about the general infrastructure and how we organised the system, and the next one will be about the challenges we have faced while developing the original product and the current one. So at Pizza Hut, as I mentioned, we do have a microservice infrastructure. It pretty much looks like this. We've got the UI on the front, then a set of public-facing microservices which you can talk to directly via requests, and then there is the VPC, or the private section, for microservices that talk to our internal network, take payments, all the sort of secure stuff that you should not be able to call directly, and that all sits behind queues. So it's a fairly standard microservice infrastructure. There is nothing weird about it. You can kind of imagine it as a normal infrastructure where the UI is separate and the back end is all one chunk; it just happens to be slightly separated out. It might live in different containers, but at the end of the day, most of it is kind of the same. We are fully on AWS, and that was a conscious decision to stick to one vendor, which might be weird, because when a lot of people talk about infrastructure, we talk about vendor lock-in: is that a bad thing? How do we make sure we can move from AWS to Google Cloud Platform easily, or to Azure? 
That kind of makes sense, I guess, in certain ways, but if you want to be lean and you want to really make use of everything a certain platform can give you, then you really need to commit to it, and so far that has proven to be the right decision for us. Everything is written in Terraform. There's actually a link at the bottom. I'll also share this slide later on. It's an open-source Terraform repository written by our solutions architect, who created the whole system, and you can literally just take that and run your own microservice infrastructure if you want to give it a go. So, theoretically, we could deploy it to Azure because we use Terraform, but obviously we are tied in to DynamoDB and S3 and all these similar things, so you can never really take any kind of architecture and lift it off and put it somewhere else, even if you code in something like Node, because you always have external dependencies, even if they are minor. So that's about the main infrastructure. Yep, it really allowed us to move fast, to iterate really quickly. We do continuous delivery in the sense that we do not even have a staging environment, because if you want to replicate 20 microservices in exactly the same state in another environment, that is really tricky. But actually, we don't need another one, another staging environment. I can see really puzzled faces now. And when I first came to Pizza Hut, I was like, "Well, we don't have a staging environment." I was really afraid to push live, because I felt like I would break everything immediately. 
But actually, having separate microservices and separate repositories really de-risks everything, because if you screw up, you screw up on like one small part, and quite often we use feature flags and all kinds of systems around them that allow us to swap back and forth really quickly. So even if we manage to screw up something in production, A, it usually doesn't have a big impact, and B, it's really easy to roll back, to roll forwards, change things around. It takes a little bit of inner discipline to make sure that you have tested everything well. We have 100% test coverage, and we make sure that our tests cover the right things as well, so that we have relative certainty in moving forwards and iterating really quickly. And that's how we got to a stage where within half a year we had a completely new website, because it allowed us to move very, very fast. Cool, now for the challenges. It all sounds really good, because you can, you know, create something resilient and secure and scalable very fast, so what's the catch? I believe that the catch is mainly in changing the way you think about developing your code and the way you do operations, the real-life stuff. So there are three main challenges that I have personally found at Pizza Hut. One is optimization: how do I make sure that things are fast? Another one is splitting code: how do I make sure that I draw the line right? How do I split the microservices into the right bits? The last one is monitoring, because obviously when we don't have a staging environment, we need to know very, very fast if something has gone wrong in production. So, the first one is optimization. As I said, we want to ship fast. Most of the time I maybe, I don't know, ship to production five times a day. I want that to be very fast. If I want to roll back, that also needs to be very fast. So we want to be able to do all these things quickly, but also we want to make them safe. 
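A feature flag of the kind mentioned above can be as simple as a config lookup checked at runtime. This is a minimal sketch under assumed names (`flags`, `isEnabled`, and the flag names are all made up for illustration, not Pizza Hut's actual flag system):

```javascript
// Minimal feature-flag sketch (hypothetical names, not the actual
// Pizza Hut implementation). Flags live in config that can be flipped
// without a redeploy, so a bad change is switched off instead of
// rolled back.
const flags = {
  newCheckoutFlow: false, // new code path, off until we trust it
  showDealsBanner: true,
};

function isEnabled(name) {
  // Unknown flag names default to off, so a typo fails safe.
  return flags[name] === true;
}

function renderCheckout() {
  return isEnabled('newCheckoutFlow') ? 'new checkout' : 'legacy checkout';
}
```

Flipping `newCheckoutFlow` back to `false` is the "roll back" here: no deploy, no rebuild, just a config change.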
We don't want to break production, because we have, I don't know, 1,000 sales every 10 minutes in peak hours. We have a lot of sales going through, and if we screw up, we lose a lot of money, basically. So we want to stay safe and fast. So how do you strike the balance? In our main microservice, which serves products, we have about 20,000 tests, which is insane. That's like an incredible amount, and most people when they see it, even new starters at Pizza Hut, are really puzzled by that. They're like, "Why would you do that?" Well, we have around 700 Huts, and each Hut has about 100 products, and we need to make sure that every single Hut has the right products. Just the number of combinations that can go wrong in such a system. There are a lot. So we need to test it really, really well. But how long are we willing to wait for something to build? Right now our build, just to run these tests, takes about 5 minutes, which is quite a long time in the continuous delivery world. So, when I'm talking about optimization here, I'm not talking about optimising the microservice. That's super fast. It serves in, I don't know, under 100 milliseconds. That's fine. The problem is the build and the tooling around it. When we had the 20,000 tests, I started wondering how to make it faster, and I went into Node profiling, and that's really cool. With Node you can really easily determine what your code is spending time on. So as an example, we can say that 100 times we will read a JSON file and parse the JSON. That is all we will do with it, and I originally thought that the read would take a lot of time and then the parsing maybe a little bit, but not that much. Now, our JSON files in the tests are obviously a little bit big. They are mock files of pizzas, where each pizza has like 30 different toppings and five different cheeses, and you know, that's a lot of data for every single thing. The files are quite chunky. 
There is an example of what you run when you want to do the profiling. You just add --prof whenever you run the script. That will create a really weird, super long log file that is not very readable. Then you need to run the profile processor on it, with --prof-process, which is also built into Node, and then you get some kind of a file, which looks something like this. That's not much more readable, I guess. That's about one tenth of the file, by the way. That's not even everything. But let's focus on the main bit in the middle. I'm not sure if it's super readable, but I'll walk you through it. You see that there are ticks on the CPU; you see how much CPU each process in your Node application takes, and on the third line we can see the JSON parser, which for me was quite a surprise. I thought that reading the files would take much more energy for the computer than parsing the JSON. But since the JSONs are relatively big, we actually spent a lot of time parsing them. You can see it's only 4% of the actual CPU usage, and you could say that's not very much, but if you have 20,000 tests, then 4% is actually quite a lot. So for us, shaving off 4% was really crucial. So the lessons learned from optimization: how to do profiling. In Node it's super easy. It's built in. JSON parsing is expensive, surprisingly. We memoise until we run out of memory, and I think Node has by default about 1.7 gigabytes of memory to use. We actually go over it, especially now when we are running different campaigns and we have like three different versions of everything for different dates. That kind of, like, happens. But we also figured out that we need to trim a lot of the data that we use in our tests, to target tests specifically at the data we actually need and not waste time with mocks that are super long and contain maybe 70% data that is not very useful for us. Cool, so that was optimization. The next one is splitting code. Be lean, as a spoiler. 
I think this one is about a little bit more than just microservices. It applies to Node as well, quite a bit, because Node, by default, is quite modular, and I think every single Node developer faces the question in daily life: "Where do I split it? How do I split my files? How do I split my modules? Does it make sense to extract one thing into a completely separate module so that I can reuse it in two different places?" I feel like this is something that we need to deal with all the time, not only in microservices, but in Node as well. As an example, I've got four microservices that we use at Pizza Hut. Really, we have only three of those, but I'll get to that. So there is the products microservice, which is our core microservice. It contains all the data about which products we have and which Hut has which products, including availability in terms of time. Some products are only available during lunch time. Some are only available on Sundays. So there is a lot of different logic around that. Then we have, as a separate microservice, content, which serves images and descriptions and titles, etc. Another one is POS, which is point of sale, which is the system that sits in the actual Hut, where we send the orders to. Now, the reason why we split it this way is because, well, products is the main one. That's kind of the core that gels everything together. Then there is content, which makes sense to have separate because it does not have any business logic. It doesn't really know anything. It's quite dumb. It just gives you descriptions for products. So that kind of makes sense to begin with to have separately. Then, the POS microservice is separate because it talks to the outer world, and we know that the POS system can be changed. We already have a new system coming up. We know that when we go global, we'll have like three new POS systems in France, etc. So we know that this is quite a moving target. 
That doesn't really necessarily have anything to do with the user, with the user experience, or with our business logic. It only tells us how to map our products to their SKUs or their IDs. I also have a pricing microservice there, even though really we only have three so far. That's the one we're splitting off at the moment, because products started doing way too much. We figured out that there was pricing in there as well for all the products, and that doesn't actually necessarily need to live in the same place, because when we are working on it, we usually work on it separately. It has separate business logic. Having it together in one place just doesn't make sense and makes things harder for us. So generally speaking, the approach is to be lean, not to over-split to begin with. We didn't come in and say, "Oh, we'll have 25 microservices and each one will do exactly this." We created it step by step. Some of the things were there to begin with because they made logical sense, but most of them were created over time. The other hint on splitting is that you don't really want to make two monoliths, because that's not what microservices are about, and that's not what Node modules are about. You don't really want to have everything living in just one ginormous thing that does ten different things. You want to separate content as much as possible. That takes us on to a controversial topic of to DRY or not to DRY, or code reuse, because, you know, Node is modular, and quite often we get to a point where maybe something could be extracted into its own module and could be reused by three other files or modules, and it's the same with microservices. 
Sometimes it would be easier, or seemingly easier, to take something off and put it into its own microservice, because then we would not need to change one thing in three different places. But as we figured out over time, it's actually much easier to copy-paste and even, a few times a year, go and change it in three different places than to figure out a whole system that we would need to maintain around reuse. So most of the time-- Well, we do have a microservice template that has all these reusable utils and things that we generally speaking do everywhere. Based on that, we create new microservices, but most of the time, we just don't want to make things DRY. We are happy to iterate over the same thing over and over, if that makes things easier in the long run. Yeah, as I already said, lessons learned: don't over-split it, don't under-split it, and just do it as you go. I feel like a lot of people have the approach that you want to engineer everything to begin with and then just go with it, but that never works in real life, especially in an agile environment, when you know that every single week somebody will come and say, "Can we change that? Can we change this?" So be lean and be agile. Last but not least, there is monitoring, which is maybe a little bit more operations than infrastructure, but maybe because at Pizza Hut all of us do everything, it feels like a natural part of Node microservices to me. The question is, "If we are shipping all the time, how do we make sure that everything is okay?" And how do we make sure that if something is not okay, we can make it okay again very fast? For that we have quite a complex system of monitoring. We are hosted on AWS, as I mentioned, but using CloudWatch itself, the monitoring system that AWS provides, would not be enough, because CloudWatch can be down. It happened like a month ago: CloudWatch suddenly went down and we were blind in that respect. 
So for that reason, we have New Relic, which is also monitoring our systems and checking that they are up when we would expect them to be up, that there are no error alerts, etc. We also have an extensive system of logging throughout all of our microservices and the UI, and we collate those logs in Loggly, where we can easily sort through them and see if something is going wrong. Obviously, doing all of this manually would be a lot of hassle, and so for all of these things we have an integration with Slack. It's quite a badass robot. We call it Hardbot. It's basically a slackbot that sits in its own microservice. It's also written in Node, and it just monitors everything. It shows us alerts if there is an alert in CloudWatch. It shows us alerts whenever an alert is raised in Loggly. It immediately gives us feedback on what is happening. But also, because it's a microservice and it's quite a clever thing, you can teach it to do stuff, so we implemented auto-triage: when an alert comes through and we have already seen it 100 times and we know that it's a minor bug that we don't really want to address, the auto-triage handles that for us and just says, "Oh yeah, we've already seen that one, you can ignore it," which really helps us get to the actually important failures that we have not seen before. It does much more. It's not only about reporting things, it's also about changing things. It allows us to close and open Huts whenever we are told to do so by the area manager, and it allows us to create reports for the estimated time of arrival for pizzas and stuff like that. So it does a lot, and we are still building it to be able to do more and more, and the hope is kind of that in the end we don't actually have to go to AWS to check things. We just ask Hardbot, "Hey, is everything okay? Can you give me a report for something? Can you check if this instance is up?", etc. So that's kind of where it's going. 
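The auto-triage idea described above can be sketched as a list of known-alert patterns checked before anything is escalated. Everything here (the patterns, messages, and the `triage` function) is made up for illustration; it is not Hardbot's actual code:

```javascript
// Hypothetical auto-triage sketch: alerts matching a known, low-priority
// pattern are acknowledged automatically; anything unseen goes to a human.
const knownIssues = [
  { pattern: /ETIMEDOUT.*pos-gateway/, note: 'known POS timeout, safe to ignore' },
  { pattern: /cache miss rate high/, note: 'expected during campaigns' },
];

function triage(alertMessage) {
  const known = knownIssues.find((issue) => issue.pattern.test(alertMessage));
  return known
    ? { action: 'auto-ack', note: known.note }
    : { action: 'notify-channel', note: 'unseen alert, needs a human' };
}
```

The pay-off is the same as described in the talk: the channel only pings people for failures nobody has seen before.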
And the lessons learned here are: don't rely on one service for monitoring and logging. I feel like that's maybe like a 101 of infrastructure and DevOps. You don't really want to rely on AWS always telling you when it's down. Integration with Slack is awesome. It saves us so much time, and it makes it just much more enjoyable to roll out things and roll back things, and-- Not roll back, but roll out and make sure that everything is fine. Another one is: log a lot and log often. We are figuring out that although we have a log on every other function in our code base, it's still not enough. There are still things that are escaping us, and the more logs, the better. There is no such thing as logging too much, I don't think. Last but not least, have a designated alerts overseer. We found out early on that if there is an alert on Slack, 10 different people jump on it immediately, even if it's 10pm on a Friday night. So we decided to always have an alerts rota, with one person always responsible for checking that the system is up, so that not everyone constantly needs to check all of the microservices. So in other words, Node microservices are awesome. They are not a silver bullet, obviously. Nothing is. But the challenges that have arisen during our work with them have all been definitely manageable, and there are good approaches out there that help you manage a relatively complex system like that. Not only microservices but Node give us great flexibility to scale and to make sure that we deliver the best user experience. Thank you very much.