Sessions is temporarily moving to YouTube, check out all our new videos here.

Knock Knock. Who's There. File Compression

Irina Shestak speaking at London Node User Group in June, 2017
Great talks, fired to your inbox 👌
No junk, no spam, just great talks. Unsubscribe any time.

About this talk

In this talk we will walk through file compression algorithms in Node as well compression standards. We will cover working with streams, audio buffers, and typed arrays to get us to compress and decompress files, and yourFavPhilCollinsSong.wav.


My name is Irina. I am a developer over at Scripto. We write scripting software and we do a lot of work, well, all of our work is in Javascript, so Node and client-side Javascript, and we do a lot of work with operational transforms. So I write Node, and I do operational transforms. Cool. So Scripto. Today I would like to talk about compression, and when I start off kind of this talk, I go into kind of a bit of research phase, where I try to figure out what it is I wanna do, and I was reading this paper on what it is like to minify and compress your code and kind of this is what, one of the first lines of the paper was this, which I thought was kind of really cute, but that brought me to think about the fact that, you know, most of our work has changed in the way we handle JavaScript, and we create a lot of JavaScript traffic, and I kind of wanted to figure out, because every time we go through creating software nowadays, we go through minifying and compressing our code before we serve it up to the client, and so I wanted to look into that portion of it because by being able to kind of work with a particular part of track of traffic, and being able to compress and minify the code, we can save up to 50%, kind of from the same paper, same takeaway. So what does compression look like? This is taken straight off from my bundle code that I'm working on right now. You've got some gzip going on, you're working with a minified version of code, you wanna log some errors, you wanna write it to a file, you wanna serve it up afterwards, maybe you're doing that with NginX, maybe you're doing straight up with your code, either or, you're compressing. But let's actually, what I want to focus on during this part of the code is HTTP compression, and in fact, I wanna talk about zlib, and I wanna talk about gzip. So HTTP compression. So what it looks like is you've got an HTML file, you gzip that HTML file, and what it looks like in the end is something like this. So if you've gzipped the file previously, you look at it, it's a bunch of mumbo jumbo, but what you're essentially ending up with is a much smaller file. Pretty cool, pretty basic, but let's actually dig into it. In regards to specifically HTTP compression, what we have is we have a client and a server set up, what client does is it says, "I want to work with some coding, and these are the encodings I can work with." Nowadays, they're either gzip, DEFLATE, or Brotli. These are the ones that are accepted by all your major standard browsers. Brotli was just accepted into Edge, I think two months ago or something like that. So you're able to work with those, and what the server then does is says, "Okay, the content in the coding I'm gonna send to you what's most convenient to me based on what you're saying you can accept, is Brotli, so I'm gonna choose and send you over some Brotli." And this is what it looks like if you wanted to look at it inside your browser, right? So you're looking into the accept encoding, these are your request parameters, and what you get back is your response parameters, which say, "The content coding you're sending back is gzip." Cool. But kind of before we get into gzip, Brotli, or anything of that sort, let's talk about DEFLATE, because DEFLATE is what is the stem of these compression algorithms. And what DEFLATE is made up of is two different algorithms. One's called the Huffman, and the other one's LZ77. So let's kind of dig into what Huffman algorithm does, because that's kind of what all those will come down to. So essentially what you do is you're dealing with streams, you got a data stream that comes through, it's got a bunch of values, we want to parse through the data stream, and what Huffman encoding does is it first generates a frequency table. So given that particular data stream, we're then able to see how many values and what of values get repeated, and how many times. So we can see that C gets repeated once, given that very simplified data stream, A gets repeated three times, so on, so forth. And then what you end up doing is creating a probability table. And that probability table is based on the total values within that stream. As they go on, increase. And then what you end up doing is building up a Huffman tree, and they're called trees cause you kind of build up this tree as you go up and go through your values in the stream. So what you start off with is the fact that you've got all these values and they have probabilities, and these are the number of times they appear, and then you build your tree up. So A and C all together come up, and then you get that top portion off as you build up, and you get the values, and you've got the frequency from previous frequency tables, and you've got the probability, and it's a tree because each individual stem is a leaf. So the right hand side starts off always being a zero, and the left hand side starts out as being a one, and you keep going down the tree. Our right hand branch can't go any further so it stays zero, but we prepend a zero to the right hand side of the bottom branch, and a one to the left hand side of the branch. And your bottom value is the code that comes up at the end of the tree. So that's Huffman enconding. And what LZ77 is is this other algorithm that creates a length and difference vars, and this is how it works. So you also have a data stream, so every thing as you guessed, is strings, but you create like a window. You go through a window through the stream, and the window is made up of two parts. One is a search buffer, and the other one's a look-ahead buffer. You're currently in the look-ahead portion of the buffer, and you go back and look through the search buffer as you move on. And that windows moves as you go through the stream. And the distance portion of that is given a particular value, you want to find out where was it before that you've seen it, and your search pointer stays in your search buffer, and then you're able to see whether or not, what's the distance between the previous time you've experienced this particular portion of your code. The length is the number of characters that gets repeated. So again you have your search buffer, and you look back in your search buffer and you see how many characters get repeated, and that would be your length. And the entire thing is made up of length and distance vars, basically those two values we just talked about. So that's DEFLATE. And DEFLATE's like the basics that come back into working with PNGs, working with the basic ZIP, working with gzip, everything comes back down to DEFLATE. And they all take down, it's kind of a similar thing. And what DEFLATE does is that there's other parts of it, is that you have made up of three different bits. And the first bit is letting you know whether or not there is anymore things to come in the stream, so back in the data stream we can find out whether or not it's ended or not. And then you get your raw data and you get your static Huffman blocks in your second bit. And the third bit gives the dynamic Huffman blocks and the table, the frequency table that we generated in the first place. Cool. And so the compression is basically made up of the same pointers that are LZ77, and the weighted symbols that Huffman blocks come back. Cool. And so gzip, what gzip does is that it's a combination of what DEFLATE is, but the Node interpolation, because we're talking about JavaScript and we're talking about Node, the Node's interpolation is zlib, so you require zlib and you're able to kind of get that in. And what zlib does is that it allows you to have more processing power and more memory use, more control over those two items than gzip or just using DEFLATE from then on. And then you can create better tradeoffs and you're able to kind of, I guess make different presets as to how more compressed you want something. So if you go with a compression level nine, it's gonna take a lot longer for it to get processed, but it'll be a lot more compressed than a compression level one. And like I said before, because if we're dealing with Node, we're able to require zlib, and be able to use that. So given all those, I kind of wanted to show you an example as to what these things actually look like. So normally I would kind of live code this entire thing, but we have 20 minutes, so I'm gonna show you what it actually looks like, and we'll see kind of the size of something like that, and what's the difference between some of these things. Essentially what we're doing is we're gonna create a server, and what that server's gonna do, it's gonna listen for some of the things that our client is requiring, and then based on the things that we're passing into our server, we're gonna compress that data. So I'm using a couple of things, I'll get into Brotli in a bit, but I'm using Pump, which is an awesome library to work with streams. It handles your errors unlike... The streams are a little bit broken so it doesn't do the error handling that Pump provides. We're working on it. And we're gonna use zlib, which is just Node's zlib implementation. But let's go through this. So what we want is based on the server we create, we want to go through request and response, and we want to grab the headers that we pass on. What we do then is we create a read stream from the source, and that read stream is basically just a plain HTML file, and that HTML file looks like this. Just a plain old, nothing interesting. And then what we want to do is look at those accept headers and see whether or not something on DEFLATE comes in. And once we get through this, we will respond back and we'll Pump down the source, and zlib's got an implementation for create DEFLATE, and it's got an implementation for create gzip. So we'll work with those two. Awesome. So let's look at kind of what this returns to us. So what we want is make sure that this server's running. And then what we want to look at is just a simple kind of, what it looks like on ADAD. So simpleindex.html looks nice and long and nice and long. But we can see that we've got no encoding is passed, and it's just a connection. But what we want is then kind of see what it looks like with DEFLATE, for example, without the word count. So word count, this is a compressed version, kind of cool, gzip's gonna look exactly the same, just compressed content. But let's actually look at the difference in word count, or kind of an easy way to measure what looks like what with word count. So I'm just gonna type it into WC. So let's look at the most basic of these without header, just curl. So that comes back with 8,772. Let's look at what it looks like with DEFLATE. DEFLATE comes back with 2,047. And if we look at gzip, comes back with 2,059. So a little bit smaller than DEFLATE. - Sure. So regular, DEFLATE, gzip. So gzip's pretty much the same. Kay. So that's what I wanted to show you. So Brotli's been accepted into most major browsers as an encoding format, that's why I wanted to take a look at it. SDCH doesn't get used anymore, in fact Chrome discontinued it, so that's done, Brotli's a much interpretation of that. So what's Brotli? Brotli's been specifically put together to work with HTTP compression, so it comes with a bunch of cool things that you can work with within HTTP compression. Started off, again, still the basics of DEFLATE, start off with a stream. You have some literals that you can't compress, and you still go on with length and distance vars. And that comes back to being a single block within DEFLATE. And what Brotli does is that you actually work with commands rather than static blocks. So your actual entered literals are just things that come back from dictionaries that are a lot more dynamic than they are in DEFLATE, because DEFLATE comes back with more static dictionaries. But a dictionary that comes with Brotli is specifically made for HTML terms, so you will have things like doc type and head and div that are already kind of come up for you, you don't have to recompress them, so it's easier, but the only thing with that is just because you have such a larger static dictionary, you actually go a lot slower in terms of the way you put that together. But there is actually a really neat implementation in Node, I can't pronounce that, but it's Brotli backwards, which is cool, and it's pretty well put together and I really quite like it. And it's interesting to kind of play around, in comparison to what gzip and DEFLATE gives you, because it's definitely a lot smaller. So with Brotli what you have is also the same thing, and I've used Pump even though, so there is streams and perl notation for it. So you just require the library and compress stream will returns a stream that you can pump into Pump, or just your regular stream. And then what you end up getting is a much smaller compressed size, so it's like 1,669 in comparison to 2,059. The only thing is this is like a pretty small HTML file, it just takes like a lot longer. There's a bunch of other implementations for kind of specifically HTTP stuff. So Snappy is a pretty neat one. And you can find it also by Google and they specifically were made for HTTP compression. This one's like a lot faster but doesn't do very well in terms of compression, it's just like really fast but not as compressed as gzip is. I didn't write up an implementation but you can look at it there. Yeah, they're way faster and there's a node implementation you can look at, node-snappy, that does that for you. Cool. That was it for me for today. I wanted to show you a little bit, tell you a little bit about algorithms, and I hope you enjoyed this. I'll put the slides online because there's a bunch of libs in there and a bunch of articles on Brotli and accept encoding, and things like that you can look at. Awesome, thanks.