
How Bionode.io Uses Node.js Streams to Handle Genomic Big Data

Bruno Vieira speaking at London Node User Group in April, 2017

About this talk: Bionode.io is an open source community that aims to provide highly reusable code and tools for genomics by leveraging the Node.js ecosystem. Its contributors are big fans of Node.js streams: find out what they are and how you can use them to process large data in chunks.


- I'm gonna talk a bit about bionode, but before that I'm going to give a bit of context. I'm a PhD student at the WurmLab at Queen Mary University and I'm also a Mozilla Science Lab Fellow. We're trying to promote open science and open source in academia. And the problem is this: this is the volume of genomic data that's being generated, and if our predictions are correct, it's going to surpass everything. In this paper, you can check the links; the slides are all online and you can look at the paper in detail if you want. But they're predicting it's going to surpass astronomy, Twitter, YouTube. We're going to be generating zettabytes, exabytes of data a year. This is not fiction, because for my PhD thesis my raw data compressed, so the things that I took at the start, was seven terabytes, and I'm currently using 40 terabytes on the cluster at the university. The machines I typically need to run this stuff have 500 gigs of RAM, and we're now getting new machines with two terabytes of RAM. So how did this happen? Well, there are two reasons. One is the cost of sequencing, which keeps decreasing a lot with each new generation of sequencing technology. We already have sequencers you can just plug into your laptop, and they're working on sequencers you can plug into your phone. So sequencing is going to be accessible to anyone, which means we're going to be flooded with genomic data. The other problem is that bioinformatics software usually sucks. The reason for that is that the incentive in science is to publish papers, not to publish beautiful code, or good code, or to make your tool highly reusable. So science is kind of broken right now, but at the same time there are a lot of people trying to fix it, and that's what we call open science: we're trying to promote open source and best practice in science. What bionode is trying to do is build tools that try to suck less.
And for that we're trying to make bioinformatics modular and universal. We're trying to create tools that do one thing well, provide highly reusable code and tools so that we don't keep reinventing the same functions in every project, be able to scale with big genomic data, and also run our code everywhere, because that's really hard: we need to rerun our experiments across several machines, and it's really hard to get everything you need onto those machines, so we're using things like Docker and virtual machines a lot. But the cool thing is, if we use Node.js, we can have highly modular code. There are a lot of modules on npm, the community is very open, and everything is on GitHub, so it's really easy to contribute and change things. And Node.js provides a native implementation of streams, which we want to use to scale for big genomic data. And also, since it's JavaScript, you can run it in the browser or on the command line. Not that you would want to run a 500-gigs-of-RAM analysis in your browser, but there are small things you can do with web apps, or you can look at a sequence and do a small analysis. Right now bionode is basically a volunteer community, so we rely on contributions; we have a few people who are more or less active on the project. We're also being used in a bunch of other web apps, tools, and projects at some universities, and there's also a startup in Cambridge using bionode. Some of them are here; you should talk to them later. And there are other open source projects using bionode. We currently have around 13 tools, but a lot of them are still kind of a work in progress. We're also trying to get students from Google Summer of Code to work on bionode for three months. We got one last year, and we're going to get another one this year. Every year we apply, and until now we've been successful, but the message is: we need more contributions.
If you want to have a look at this, if you're interested, just check our project board and pick an issue, or come talk to us. That was the introduction to the project. So how do we use streams in this project? I'm going to give an example with one of the modules. The bionode-ncbi module is for accessing databases from the National Center for Biotechnology Information. What we do is require this module, and then usually in JavaScript you use a callback pattern. That means you get all the data, and then once you have all the data, you do something with it. But if you use the event pattern, you can process the data on each data event, so you don't have to wait for everything to finish; you can start doing something with that data right away. And the cool thing with streams is that you can then just use the pipe pattern. You can actually pipe things together, the same way you would on the Unix command line. For example, here we are searching for the human genome, and we pipe into a JSON parser, and then we just pipe into the command line, to standard out. This makes it very easy to build pipelines to analyse genomic data and to combine all these pieces together. Streams are a first-class construct in Node.js. So how do we implement them? Well, the easiest way is to use a module, for example Mississippi. Or you can subclass the appropriate stream class in Node.js, which is the hard way to do it, and you also get bound to a specific version of Node.js, so you should try to avoid that. You should use another module to write your own streams; that way you're not bound to a specific version of Node. For example, the Mississippi module provides you functions to write readable streams, writable streams, transform and duplex streams, and to create pipelines. I'm going to explain that in a second. You can also check the awesome-nodejs-streams repository, which has a bunch of cool modules that are based around streams. It was created by our Google Summer of Code students.
With streams you can process data in chunks. So you can have a readable stream, let's say reading from a file or a request, or just reading from standard in. Then you can pipe that to a transform stream: a parser, a filter, some kind of multi-threaded analysis, a query to a database, anything. And then you can pipe that to a file or to standard out. Those are usually the types of streams you have. What streams give you is that if somewhere in your pipeline you have a bottleneck, for example network latency or an issue accessing the database, you don't have to worry about timeouts or the pipeline crashing, because streams handle back pressure. That means that if one of the steps of the pipeline is slowing down, the whole pipeline, all the streams, adjust to that. For example, say you have a big chunk of data that's blocking one of the steps, or you're waiting for a response from a database or a request. With the initial implementation of streams you would lose data, because they were push streams: as you were getting data, if you were not ready to read it, the data would get lost. So they changed to pull-based streams, which means that if your pipeline is blocked, when you try to do this.push, it returns false and the data goes to a buffer. And if the buffer is full, the whole thing stops until the data downstream gets processed. When that happens and downstream is ready again, you use a callback to get more data. The cool thing is you can call this.push several times, so you can split something into several chunks to send downstream, and then when all those chunks are processed, you use the callback. As for what this looks like in code: if you use a module like Mississippi, Mississippi has through as a dependency, so you can also just use the through module directly. The way to write the stream is that you provide the transform function.
And in this case we're using through.obj because we're going to deal with JSON, but you could be dealing with buffers. The transform function is going to get an object and an encoding. You can modify that object, you can do whatever you want with it; here we're adding a property to the object. And then once you're done, you push it downstream, and if you want to ask for more data, you use the callback. So this is how you write a stream in nine lines. Now, if you want to, for example, write a pipeline, let's say you want to filter some genomic data. For example, here we are creating a filter function, so we're just going to look at the sequence property and see if it matches these three letters. It's kind of a dumb example, but it's just to illustrate how you could do that. If the object is a match, you push it; otherwise, you just ignore it. Once you've done everything you wanted, you call next to ask for more data. You can then use another module, called pumpify, to combine all these streams into one. What pumpify is doing here is using the bionode-ncbi module to fetch from the nucleotide database, which is called nuccore, then piping whatever it gets to the filter stream we just wrote above, and then you can stringify that and send it to standard out. This is a very basic pipeline. Then you can write data through that pipeline. So here we write queries to bionode-ncbi, but that's basically how streams work, and you could write anything to it. Another kind of stream you can have is the duplex stream. That's when you combine a writable stream and a readable stream together, so it looks like a transform stream, but it's two independent streams combined. It could be, for example, that you're writing to and reading from a database, or writing and doing requests. So it's two separate streams, but you can combine them together into one duplex stream, and Mississippi provides a method to do that.
Another kind of stream you can make is the passthrough stream. What they do is just look at the data; for example, you could implement a counter with one. They don't modify the data, but they do something else each time data goes through. For example, again with bionode-ncbi and the through module, and let's say you have your own database module, you can implement a simple counter. You create an empty stream: to implement a passthrough stream, you don't provide the transform function. Then every time you get a data event, you increment the counter. That way, when you're doing a query to look for all the spider genomes available on NCBI, we're just counting them, and all the data we get, we're writing to a database. So the counter doesn't touch the data. Once your script ends, you can also log the count if you want. The cool thing with passthrough streams is that you can then use them to create forks in your pipeline. You can observe the same data and do something else with it. This is how you could, for example, implement some kind of multi-threading, or send your request or your function to another machine or another cluster. This is what it looks like: it's the same thing, you just create a passthrough stream. Here we are doing a query, we're stringifying it, and then we pipe into the fork. But then you can read the fork from two different places: one is piping to the command-line standard out, and the other is writing it to a file. That way you can create more complicated pipelines. That was what I had to tell you about streams. If you're interested, check our project and contribute. And that's it. Thank you.