DAVE:
Should we get this thing rolling?
CHUCK:
Yeah, let's do it.
[This episode is sponsored by Frontend Masters. They have a terrific lineup of live courses you can attend either online or in person. They also have a terrific backlog of courses you can watch including JavaScript the Good Parts, Build Web Applications with Node.js, AngularJS In-Depth, and Advanced JavaScript. You can go check them out at FrontEndMasters.com.]
[This episode is sponsored by Hired.com. Every week on Hired, they run an auction where over a thousand tech companies in San Francisco, New York, and L.A. bid on JavaScript developers, providing them with salary and equity upfront. The average JavaScript developer gets an average of 5 to 15 introductory offers and an average salary offer of $130,000 a year. Users can either accept an offer and go right into interviewing with the company or deny them without any continuing obligations. It’s totally free for users. And when you’re hired, they also give you a $2,000 bonus as a thank you for using them. But if you use the JavaScript Jabber link, you’ll get a $4,000 bonus instead. Finally, if you’re not looking for a job and know someone who is, you can refer them to Hired and get a $1,337 bonus if they accept a job. Go sign up at Hired.com/JavaScriptJabber.]
[This episode is sponsored by Wijmo 5, a brand new generation of JavaScript controls. A pretty amazing line of HTML5 and JavaScript products for enterprise application development. Wijmo 5 leverages ECMAScript 5, and each control ships with AngularJS directives. Check out the faster, lighter, and more mobile Wijmo 5.]
[This episode is sponsored by DigitalOcean. DigitalOcean is the provider I use to host all of my creations. All the shows are hosted there along with any other projects I come up with. Their user interface is simple and easy to use. Their support is excellent, and their VPSes are backed by solid state drives and are fast and responsive. Check them out at DigitalOcean.com. If you use the code JavaScriptJabber you'll get a $10 credit.]
CHUCK:
Hey everybody and welcome to episode 176 of the JavaScript Jabber Show. This week on our panel we have Aimee Knight.
AIMEE:
Hello.
CHUCK:
Dave Smith.
DAVE:
Hey-oh.
CHUCK:
I'm Charles Max Wood from DevChat.TV. This week, we have a special guest and that is Slava. I don't have your last name, so I'll just…
SLAVA:
It's Slava Akhmechet. Hi, guys.
CHUCK:
I totally was going to guess that.
SLAVA:
[Chuckles]
CHUCK:
Do you want to introduce yourself?
SLAVA:
Hey everybody. Thanks for tuning in. My name is Slava. I'm one of the founders of a company called RethinkDB. I've been programming pretty much since I was born, or since I could type. So, I'm very excited to be here to tell you about Rethink and to tell you about all kinds of other programming things that hopefully we're all excited about.
CHUCK:
Alright. Do you want to give us an overview of what RethinkDB is?
SLAVA:
Yes. So, RethinkDB is an open source distributed NoSQL database. It's designed for building real-time applications. And when we started RethinkDB we noticed that the world is moving in the direction of real-time apps. So, for example, if you've ever used an app like Google Docs, when you're looking at a single document and someone else, another user, is modifying it, you can see the changes in real time. So, when we saw that we were blown away and we basically realized that that's where the world is going. And sooner or later every single application will be built in this way. But all the tools weren't really designed for that stuff.
So, we set out to build a database that makes building these kinds of apps really, really easy. And building them right now, it's clearly possible but it's very, very hard. It requires a lot of effort, engineering effort, a lot of know-how. It's just fundamentally a challenging project because all the tools weren't designed for that. And as people are moving to building these kinds of applications all the development stacks are being retooled from traditional pull architectures to push architectures. And there's a lot of innovation in the browser like Socket.IO, React, Angular, that support that kind of stuff. There's not a whole lot in the data layer.
So, we decided to build RethinkDB. RethinkDB is the first database that's built around the push architecture. So, instead of sending a query and getting a response, what you do is you subscribe to data and then [Rethink] [inaudible] computation. And then RethinkDB pushes changes to you, to the developer, anytime that something changes in the database. And we can get into a lot more detail on that. But what that does is that makes building real-time apps dramatically easier. And that's why RethinkDB exists.
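[Editor's note: a minimal sketch of the push model Slava describes here, using the official rethinkdb Node.js driver. The messages table and connection details are illustrative assumptions, not from the episode.]

```javascript
var r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }, function(err, conn) {
  if (err) throw err;
  // Instead of polling with repeated queries, subscribe once and
  // let the database push every subsequent change to us.
  r.table('messages').changes().run(conn, function(err, cursor) {
    if (err) throw err;
    cursor.each(function(err, change) {
      if (err) throw err;
      // change.old_val is the document before the write;
      // change.new_val is the document after it.
      console.log('changed:', change.new_val);
    });
  });
});
```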
CHUCK:
Yeah, I saw the streaming example on RethinkDB's website. And I thought that was interesting. But then it occurred to me. Is this then a Backend as a Service? Or is this something that you're going to put Node.js or Ruby or something else in front of and then that's going to proxy the streaming and maybe munge the data a little bit to send it back out to the web client?
SLAVA:
Yeah, well RethinkDB is an open source project. So, it's not a Backend as a Service in the way a cloud service is. We do have partners that host RethinkDB as a service. So, Compose.io is a notable one. And they host all kinds of NoSQL and even SQL databases now, I think. So, they do hosting. But fundamentally it's open source. Anybody can download it, install it on their laptop or on [very] beefy servers. And generally, the way our users build with RethinkDB is they spin up RethinkDB. It's very easy to set up. Then they have a middleware layer. And it could be Node.js which is extremely common. It could be Ruby. It could be Python. It really could be any programming language.
The web server then communicates with the database the way you traditionally would, except using the streaming architecture. And then you can push the data back up to the browser using Socket.IO. Right now at least, we don't allow connecting the browser directly to the database because it's just a security nightmare.
CHUCK:
Right.
SLAVA:
And this is something users really, really want. We get requests for this every single day. And we're actually working on this but we're just very careful about it because security is extremely important and we want to make sure it's bulletproof before we release that stuff.
DAVE:
So, one of the things that attracted me to RethinkDB maybe a year ago was that it seems like Rethink was one of the first NoSQL databases that actually allows joins between objects. Can you talk about that a little bit?
SLAVA:
Yeah, definitely. So, when we were designing this, we started thinking about these push streaming architectures because it's really important. And it's the first database that does that, and that's the thing that differentiates RethinkDB from every other product or project. And that's why most of our users pick it. But there's a lot of other… you know, databases are very horizontal pieces of software. There's a lot going on and there's a lot we don't necessarily talk about all that much. So, joins are one of those things that we really cared about, because most RethinkDB developers came from a traditional relational database background. And joins are extremely, extremely useful. Like, anytime you want to develop anything really, once you get past the five-minute blog stage or an example project, joins become extremely important.
And there are a lot of other things you want the database to do. You want it to be able to do subqueries. You want to query across tables. Just generally once the app gets complicated you want to be able to do complicated queries in the database. So, we wanted to bake that in. It was really, really important. So, we built the whole thing in a way where the user basically writes a query in their native language. So for example, in Node.js it looks kind of similar to jQuery. You start with a table and then you can chain commands that transform data. Also, if you use Bash it's pretty similar to pipes. So, the data flows left to right. You transform it. And then what happens is that query gets packaged up and gets sent to the database server. So, everything runs on the server.
And furthermore, because it's distributed you could have a cluster of servers, say 10 servers with data split up across them. And all of this will get run on the server side, on the database side. So, the database will take this query, analyze it, compile it into this distributed map-reduce program, send it out to all the shards, collect all the data, and then send the data back to the user. And the idea was that you should be able to write arbitrarily complex queries. It should be intuitive and easy to do. And the database should handle all of the complexity and all of the performance considerations so you don't have to do that in the client.
So yeah, we wanted to do joins. It was super important to us because that's the kind of database we would want to code against. And we added that pretty quickly. And internally, they actually compile to these map-reduce queries, which is really, really cool. So, you could write map-reduce directly or you could use this higher level language for the joins and sub-queries and things like that. Does that make sense? I'm not sure if I'm communicating this well.
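[Editor's note: a sketch of what a ReQL join might look like in the Node.js driver, reusing the open connection conn from the earlier sketch. The comments and posts tables and the post_id field are hypothetical.]

```javascript
// Join each comment to its parent post. 'post_id' is assumed to
// reference the primary key of the 'posts' table.
r.table('comments')
  .eqJoin('post_id', r.table('posts'))
  .zip() // merge each {left, right} pair into a single document
  .run(conn, function(err, cursor) {
    if (err) throw err;
    cursor.toArray(function(err, results) {
      if (err) throw err;
      console.log(results);
    });
  });
```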
DAVE:
Yeah. No, totally good, yup.
CHUCK:
Mmhmm.
AIMEE:
I actually have a little bit of maybe a naive question about that, because I also have those questions about joins. So, I've used traditional relational databases and then I've used Mongo and Cassandra. But I've always been taught that when you're using NoSQL you should create your data in a way that you wouldn't need to do joins. So, what are some of these complex cases where you would need to do a join?
SLAVA:
Well, so I think joins aren't even that… you don't even need to get into cases that are that complex to do joins. If you have something very simple… it starts out when you even have really simple apps. Like for example, if you have a blog post and you have comments in that blog post, you might want to structure it in a way where the comments are attached to the blog post in the same document or you might want to structure it where the comments are in a different table. And with blog post examples, it's really, really easy to structure it where everything's in the same document.
But then the moment you need to scale, if you run into cases where your post may have tens of thousands of comments, which isn't actually uncommon on social media, that happens with a lot of forums, you can't really stick all that into the same document. And with blogs, you could probably get away with it. But the moment you have a more complex app where you have a lot of users, a lot of things going on, companies and employees, that's a common example, you really can't put everything into the same document because that's limited. You can only put so many things in one document. So, joins become really, really important.
So, many people made the argument that you shouldn't use joins. They're not scalable, right? Like if you're using NoSQL you could design your apps not to use joins. But I completely disagree with that. And I think that happens for a couple of reasons. I think people say that because when NoSQL first got started, it was obviously a huge deal, distributed databases. But it was quite immature. And building in the architecture to do joins is actually really, really hard. So, at the beginning all these products hit the market and people really wanted them. So, they started rationalizing. Okay, let's not use joins because they're really bad. But I think the reality is that it was just very early and it takes an enormous amount of engineering effort to add that to the database.
And the databases just didn't have that, so people started rationalizing why joins are bad.
DAVE:
[Chuckles] Yeah. That's what I thought.
SLAVA:
But I think in practice…
CHUCK:
Yeah.
SLAVA:
In practice, what ended up happening is you still need them. So, what would happen is you pull a bunch of data into the client. Then in the client you would look through that data. And then for each document, you'd send a request to the server. So, basically what happens is you, the application developer, you end up implementing joins yourself in the client kind of ad hoc. And it's inefficient because you go back and forth over the network and you have to pay latency costs.
DAVE:
Yeah.
SLAVA:
And you can't optimize it and stuff like that. So, I think joins are really important. I think saying, “Don't use them,” is kind of a rationalization and it's pretty dead. And I think people end up doing it anyway in a less efficient way. Does that make sense?
DAVE:
Yeah, totally. In fact in my app, I'd say there are three kinds of joins. There's inner joins, outer joins, and Python joins, which are the ones [chuckles] you just described.
SLAVA:
[Laughs]
CHUCK:
Where you do it in code?
AIMEE:
[Laughs]
DAVE:
Yeah, you write your own joins.
CHUCK:
Yeah.
AIMEE:
[Laughs] That's pretty good.
CHUCK:
Yeah. And the thing where I've seen people doing these gymnastics with their data to avoid joins, what winds up happening in a lot of cases is they wind up storing it in two places or three places, and then keeping it all in sync turns into a major nightmare. Because you have the comments under the post and you also have the comments under the users who posted them and you have the comments… and so, if somebody updates something or needs to change something or delete something, you have to go find all of the instances and make sure that they're all in sync. Or you wind up doing like you said, the in-code join, because you just don't have another way to handle it.
SLAVA:
Yeah, I think that's extremely common. Actually, so back when we started Rethink, some of us, we wrote that kind of code because we used some NoSQL products and it just never really made sense. So we thought, “Hey, this is an enormous engineering effort to bake that in. But once you have that, it just makes the users' lives dramatically easier.” And that's why we added them. And I think it took a long time to get all the architecture right, but I think it was the right decision.
CHUCK:
Now, does it return the data out of it in JSON? And does it store it in JSON?
SLAVA:
Yes, it does. So, everything in Rethink is JSON. It's actually an extended version of JSON because JSON proper doesn't support certain things. Like it's not that great with dates for example. It doesn't have a native date format. It doesn't have a native format for binary files, a couple of other things. So, we use an extended version of JSON. But everything can fall back to JSON. And JSON is what's stored in the database. JSON is what you get back. And also the driver protocol is itself JSON. So, people who write the drivers communicate with RethinkDB using JSON. And by the way, we tried a lot of different protocols. We tried Google's protocol buffers. And it turned out to be pretty bad for large documents in many languages. We tried a couple of other things. JSON actually surprisingly, it turns out to be the most compact and efficient and performant for all these things.
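[Editor's note: a small illustration of the extended JSON types mentioned above, again assuming an open connection conn; the events table and its fields are hypothetical.]

```javascript
// Plain JSON has no date type, so ReQL adds one: r.now() is
// evaluated on the server and stored as a native time object,
// not a string.
r.table('events').insert({
  name: 'deploy',
  created_at: r.now()
}).run(conn, function(err, result) {
  if (err) throw err;
  console.log(result.inserted); // 1
});
```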
CHUCK:
Well, and every language just about has a JSON [inaudible] somewhere.
SLAVA:
Right, right. It's so common that all the JSON libraries in pretty much every language are super optimized.
CHUCK:
Yeah.
DAVE:
Yeah. So, going back to the real-time aspect of RethinkDB. This is maybe a little bit of an adversarial question, but you mentioned that browser connectivity directly to RethinkDB today is not a thing, which makes sense to me. But that means you're going to have to put a backend in front of RethinkDB that communicates with your frontend, typically a web browser. So, to me the hardest part of real-time data is getting data consistently back all the way to the client. But it sounds like with Rethink, the architecture that it would push you to is basically building your own pub/sub for the browser or using one off the shelf that's not Rethink. What is the advantage of having Rethink push real-time changes to the server if they can't go all the way to the client?
SLAVA:
Yeah, that's actually a great question. By the way, that doesn't seem adversarial at all. [Laughs] So, I don't know. Maybe I have a higher bar.
DAVE:
I'll try harder next time.
SLAVA:
Yeah, yeah, yeah. No, so I think that it's a good question. So, I think the hard part of building… So, getting the data to the browser consistently is definitely hard. And there are libraries that handle a lot of that. So, Socket.IO is notoriously good at this. And then the frontend frameworks also do quite a bit to help you out with that. But it is hard. But I think a much harder part is getting the changes in the first place. So, imagine you're writing a real-time app. The way people generally do it naively is they say, “Okay, we have to get data in real time so we're going to query the database in a loop. Like let's say we're going to query it every couple of seconds.” Now, if you start querying a database every couple of seconds, then what happens is first of all the user experience is pretty bad because the latency is high. It's a couple of seconds.
CHUCK:
Mmhmm.
SLAVA:
And second of all is, the moment you start getting lots of concurrent users, all of these requests just keep hammering the database and bring it down. So, it's really, really hard to scale. And people very quickly realize, “Okay, so I can't do it this way. I have to do something smarter.” So then, typically the second stage is they start writing custom code in the web server to deal with changes. So, generally when a change happens, they have two code paths. One to write it to the database and another to send it back, route it back to the client.
So, they do that for a while and then realize, “Okay, that doesn't work so well because my code gets messy and complicated. And also if I start scaling up web servers, if I have multiple web servers, they have to communicate, because my users that need to see each other's data may be connected to different web servers.” So then, at that stage, people start adding pub/sub mechanisms. Like, they'll add some kind of a pub/sub system and then the web servers will write to the database, also communicate with pub/sub, figure out the routing, where all the clients are. And at that point, your code gets pretty complicated. You have to do a lot. You have to maintain these extra pieces of software and infrastructure. You have to write everything twice. You have to deal with sync, with routing. So, it's certainly doable.
And by the way, Quora, the question and answer site, has a really good technical presentation on how they did it. And I'm going to look this up and send you guys a link. They built the app roughly in this way. So, it's certainly doable. But I think for most teams it's just really complicated and it's a bad abstraction. And generally, you could use an abstraction that wasn't designed for something to do something else. But it just gets harder and harder over time.
So, the reason why you want to use Rethink is because we eliminate all of that complexity. What happens is if a browser connects to the web server, the web server just opens what we call a change feed. You can think of it as a stream from the database. And it says, “Give me all the data I care about and keep pinging me about it when it changes.” So for example, if you're writing a multiplayer game you could say, “Open a stream. My player's in this location. Give me everything within five miles, like game miles, of my player.” And then any time something happens and gets written to the database, the database pushes an event up to the web server saying, “Hey, this data has changed.” For example, the players around you have moved or they picked up some items or things like that. And then you just push that back up to the browser via Socket.IO or WebSockets or something.
And what that does is it takes away all of the complexity. So, we deal with the routing, because the right client only gets the information it needs. We deal with all of the computation, because if you have some complex queries, for example if you say, “Give me the top ten players in my game,” the database can deal with all that. And it can tell you when that leaderboard changes. You don't need to add pub/sub. You don't need to write code that splits things and writes them to the database and somewhere else. So, everything just becomes much simpler.
And it's a little bit hard to explain verbally, but if you look at examples of what a good scalable real-time app looks like if you use a pull database versus if you use a push database, the results are just… the outcome is kind of staggering. It's so much easier if you use a push architecture.
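[Editor's note: a hedged sketch of the “top ten players” changefeed described above, assuming a players table with a secondary index on score, an open connection conn, and a Socket.IO server object io. The database maintains the ordered, limited result set and only emits an event when the leaderboard actually changes.]

```javascript
r.table('players')
  .orderBy({ index: r.desc('score') })
  .limit(10)
  .changes()
  .run(conn, function(err, cursor) {
    if (err) throw err;
    cursor.each(function(err, change) {
      if (err) throw err;
      // Relay the leaderboard change to connected browsers.
      io.emit('leaderboard', change);
    });
  });
```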
CHUCK:
Yeah, but don't you then have to have a push architecture up to your frontend?
SLAVA:
Yes, you do. But that's much easier because you can just use Socket.IO, right?
CHUCK:
Yeah.
SLAVA:
It's not that hard to communicate via Socket.IO from the web server to the web browser.
CHUCK:
Yeah, I tend to write my apps in Rails. And Rails itself doesn't really lend itself to that. Rails 5 is going to have Action Cable in there, which does WebSockets. But in the meantime, it seems like it may be helpful in certain circumstances but not in others because of the limitations of the system that I'm in. But if you're in Node.js and you want to use Socket.IO, this makes a ton of sense.
SLAVA:
Yeah, I definitely agree. I think if you're building real-time applications like that, Rails as it is right now is probably not the best tool for that. So, if you're building a Rails app, RethinkDB can be really useful as just a general purpose scalable database. But the push architecture is definitely a little bit hard. We do support that, so we support EventMachine in Ruby.
CHUCK:
Right.
SLAVA:
And you can do quite a bit. But it's definitely harder. With Node.js, it's just staggering how easy it is.
In Rails yeah, it's a little bit harder.
CHUCK:
One of the things that appeals to me about a lot of the other databases, NoSQL databases I've used, is the clustering. But it seems like the clustering in RethinkDB is ridiculously simple. Can you explain a little bit about that for us?
SLAVA:
Yeah, definitely. So, the idea was [chuckles] this was a long evolution. And what we wanted to do is we wanted clustering to be simple to use. So, simple to set up and use and extremely robust in real environments. And this took about four years to do and it took a bunch of iterations. So, what we found when we started Rethink is we looked at other NoSQL products and we found a couple of things. So, the first thing was that setting up clustering in most of these products is actually really, really hard. It's relatively hard to set up even two machines. But to really scale it out and have it work, you need to really know what you're doing. So, it was pretty challenging. And the second thing is, when things went wrong in production environments, it was very hard to figure out what was going on. Sometimes you got catastrophic failures, data divergence, stuff like that.
So, we wanted to build Rethink in a way where none of that would happen. And we definitely failed a couple of times. It was an evolution, an evolutionary process. So, the first thing we did is we made it really, really easy to use. And everything was transparent, [opaque] to the user. So, you just say, “I want three shards, five replicas, in these data centers.” You set it up and you go. And it was as simple as that. And underneath we'd do all the moving of the data and all the management and all those things.
And then when we shipped that, we started getting pretty big production users. We very quickly discovered that when things go wrong, you want to really understand what's going on. You want to know what's happening in the cluster and in the database. And we didn't allow for that early on. So, the second thing we did is we exposed all of the internals to the user. And we did it in a way where we really cared about the user experience. And I just don't mean like visual user experience [necessarily], although that too. But the kind of APIs that the database exposed.
So, what we did the second time around is we took all of the internals, we designed a really nice API, and we exposed it as JSON documents in system tables in the database. So, you could query the database through the query language and you could control the cluster. So for example, you could say, “I want to take this shard, the data on this shard, and I want to pin it to a different machine.” And literally, all you have to do is just write a ReQL query that does that and the cluster will respond.
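[Editor's note: a sketch of inspecting and controlling the cluster through the system tables Slava describes, assuming an open connection conn; the users table and the shard and replica counts are illustrative.]

```javascript
// Cluster state lives in the special 'rethinkdb' database:
// one JSON document per server.
r.db('rethinkdb').table('server_status').run(conn, function(err, cursor) {
  if (err) throw err;
  cursor.toArray(function(err, servers) {
    if (err) throw err;
    console.log(servers);
  });
});

// Resharding is just another query.
r.table('users').reconfigure({ shards: 3, replicas: 2 })
  .run(conn, function(err, result) {
    if (err) throw err;
    console.log(result);
  });
```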
CHUCK:
Oh, wow.
SLAVA:
So, that was the second iteration and it was really cool. It was way better because everything was easy to set up but if you wanted to script it, you could script it. If you wanted to find out what's going on, all of that was available. And then after that, once that was solid, we [inaudible] failover on top of that. And that was probably the most challenging part, because automatic failover has to account for all kinds of network failures, hardware failures, split-brain scenarios, all sorts of things like that. And it took us about a year to implement and test the Raft Consensus Algorithm which came out of Stanford and was designed for this kind of stuff.
So RethinkDB 2.1, it's extremely easy to set up clustering. If something happens or you need to control it in fine detail, you can. And everything handles failover, network failover, and hardware failover really well. But it took about four years to get here.
CHUCK:
So, I've noticed that with other database systems, and they've all evolved since then. I've used Mongo and Cassandra and a few others. But the clustering with Mongo in particular I found had some serious reliability issues, especially as you got into larger and larger clusters.
SLAVA:
Yeah.
CHUCK:
And I've also seen other systems where they had issues with performance as the cluster got larger because they're trying to get a quorum on data and things like that. How does RethinkDB handle in those situations where you have extremely large clusters? Do you get any kind of performance or reliability hit?
SLAVA:
Okay. So, we did a couple of things to make this work. The first thing is, are you familiar with Jepsen tests and the Call Me Maybe series of blog posts?
CHUCK:
I think I heard something about it.
SLAVA:
Okay. So, I can give a quick intro. So, this was done by a developer. He's with Stripe now. His name is Kyle but he goes by the nickname aphyr on Twitter and online. So, what he did was he built something called Jepsen. And Jepsen is a series of distributed system tests. They're general purpose and they basically test all kinds of edge case scenarios on distributed systems. So, this was just a personal project for him. And eventually I think he got hired at Stripe and Stripe now funds this research.
But at the beginning he just built this and he started taking existing distributed systems and running them through this testing framework. And he basically discovered that when you get into edge cases, and this isn't just theoretical, these are very real edge cases that could happen in real networks and happen all the time, a lot of these distributed systems really didn't do so well. They got data divergence, all kinds of problems, things like that. So, he kept writing these posts about different distributed systems.
And when we started designing automatic failover and we were redesigning the clustering architecture, I think we were the first database product designed after the Jepsen tests already existed and were established. So, we designed the system with the Jepsen tests in mind. And we made sure that the design and the implementation passed them. And it was an explicit goal of ours; it was very important to pass the Jepsen tests because it gives a certain guarantee. Or not a guarantee but at least a very strong probability that the system will perform well in different edge cases. So, RethinkDB's design accounts for the Jepsen tests. And I think that's part of what gives it a lot of its reliability.
As far as performance, so what we do is interesting. We use consensus algorithms to handle metadata but not data. What that means is every time you reshard, let's say you add a shard or you add a replica or you add a table, you need consensus. So, all kinds of administrative operations require consensus. And that can definitely take a performance hit, because consensus is fundamentally expensive. But then we don't use consensus algorithms for data. And we designed the architecture in a way where everything works out. And we have a lot of information on how that works at RethinkDB.com in the docs.
So, what ends up happening is as you get into larger and larger clusters, it still performs quite well on resharding and rebalancing and things like that, although we do have to obey the laws of physics. But with data, everything [inaudible] really, really fast. And as you scale up the cluster, you get near linear scalability. And you know, we have pretty big production deployments right now. And there's a lot we're testing. And as people push the boundaries, obviously we fix things and everything gets better and better and better. But I think it's a never-ending process. You just get the architecture right and then after that it's just a matter of engineering effort for a very long time.
AIMEE:
Okay. So, I guess next if we want to maybe talk about the query language, ReQL. I keep hearing everyone talk about it feeling very functional. And then in your docs as well…
CHUCK:
Did you say ReQL?
AIMEE:
Is that how you pronounce it? I'm not sure. [Chuckles]
SLAVA:
Yeah, yeah. Yeah, yeah, that's exactly how we pronounce it.
AIMEE:
Okay. So, it is ReQL. And then everyone says it feels very functional. So, do you want to talk about that and why you decided to implement it that way?
SLAVA:
Yeah. So, when we started designing Rethink, before that we mainly used SQL, which is the query language pretty much everyone uses. And if you look at SQL, if you ever try to write complex SQL queries and if you browse Stack Overflow questions about SQL, they have a very interesting quality that you don't get with any other programming language. For example with Python, people say, “How does this command work?” Or, “I wrote this program to do something and it doesn't do what I want. What's wrong?” Something like that; the kinds of questions you'd expect with any programming language.
With SQL, the questions on Stack Overflow are different. And they're phrased in a way where people very often say, “I want to get this kind of data, but I really just don't know how to do that at all. What kind of a SQL query? How do I even write this in the first place?” And I think the reason for this was that when SQL was invented, the idea was you need that for analysts and they tried to get this as close to a natural language as possible. But natural language turns out to be really bad for programming. So, SQL is this very weird strange mix of relational algebra which is very theoretical and bulletproof and also this natural language that doesn't really work all that well. People ask all these strange questions. It's hard to learn. It's hard to figure out. If I want to write a query, how do I even do that?
What we thought about with Rethink is we looked at a lot of different models and how we can build a query language. And we noticed that basically jQuery, and to some degree the command line with Bash and pipes, are a really, really wonderful abstraction for manipulating data. And it's been around… Bash has been around for decades, or at least all the concepts have. And it works really, really well. It's kind of bulletproof. And then jQuery did it in modern technology in JavaScript and proved again that this works really well.
So, what happens is you start, so data flows from left to right. When you start on the left, you specify your data source. For example, you say 'table users'. So, that's your data. And if you just run that query, you basically could get all the users in the table. But then what you can do is you can chain other commands on that. So for example, you could say 'dot join table companies' and then it will join these two tables. And then you get a new stream which is a combination of users and where they work. And then after that, let's say you wanted to drop some fields. So, you say 'pluck name, company name, address' or something. So, you've got the fields you want. And then after that, let's say you wanted to get distinct users. So, you say 'dot distinct'. And you can just keep chaining like that.
And what happens is it results in an extremely convenient and intuitive way to build queries because you start from left to right. And at any given point you have some data and you can just think about it in terms of transformations. Like, “Okay, I've got this. What's the next small logical step I can take to get to where I need to be?” And it's a very easy and convenient way to build up queries incrementally. So, we tried that and it worked really, really well. And we thought, “Hey, okay. This works really well. It's easy to analyze internally so we can make it performant. I'd worked in Bash for years. People seem to really love jQuery. So, let's use this model.” And that's how ReQL came to be.
So, if you're using Node.js it basically looks like JavaScript. [Inaudible] it kind of looks like jQuery. If you're using Python it's also very similar. But you can look at examples on RethinkDB.com. And it turns out that it's just a really convenient way to build up queries over time. So, that's why we picked the functional approach. And we didn't even think about it in terms of functional versus nonfunctional at the beginning. We just thought about it as, “Hey, what is the easiest way for us to get users a convenient intuitive query language?”
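[Editor's note: the chain Slava walks through, written out as a Node.js sketch against an open connection conn. The company_id foreign key and the plucked field names are assumptions.]

```javascript
// Data flows left to right, like a Bash pipeline.
r.table('users')
  .eqJoin('company_id', r.table('companies')) // join users to companies
  .zip()
  .pluck('name', 'company_name', 'address')   // keep only these fields
  .distinct()
  .run(conn, function(err, cursor) {
    if (err) throw err;
    cursor.toArray(function(err, rows) {
      if (err) throw err;
      console.log(rows);
    });
  });
```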
CHUCK:
One thing that comes to mind, we were talking about the query language. Before we were talking about joins. In traditional SQL if you want to speed up your joins you'll set up indices or indexes, whatever you're going to call them. Is there… it looks like there's a concept of that in RethinkDB.
And does that make a difference in your performance?
SLAVA:
Yes, absolutely. So, what's interesting about RethinkDB, and I think a lot of NoSQL projects in general, is… so I look at it from the point of view of knowing everything that's going on internally in the system. And to a user, RethinkDB feels totally different from every other relational database.
CHUCK:
Mmhmm.
SLAVA:
But you know, it's JSON. It's scalable. It's push architecture. There are all these differences. But what happens internally is a lot of the traditional database concepts, they all still apply. Because we use B-trees to implement everything, it's the same as every other database. And all these concepts you learn in relational databases, they're still applicable. So, with indexes, that's super important for performance. RethinkDB supports indexes the same way MySQL or Postgres would, for example. So, you have a primary index and you can set up secondary indexes. It's super helpful for performance because instead of doing full table scans, the database can optimize the queries.
So yeah, indexes are super important. They're easy to use in Rethink. There are some differences between how we do it and other databases do it. And I can get into that if you guys are interested.
But the general principles are very similar.
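[Editor's note: a minimal sketch of secondary indexes in ReQL, assuming an open connection conn and a users table with a company_id field.]

```javascript
// Create the secondary index once and wait for it to build...
r.table('users').indexCreate('company_id').run(conn, function(err) {
  if (err) throw err;
  r.table('users').indexWait('company_id').run(conn, function(err) {
    if (err) throw err;
    // ...then equality lookups go through the index
    // instead of scanning the whole table.
    r.table('users')
      .getAll('acme-inc', { index: 'company_id' })
      .run(conn, function(err, cursor) {
        if (err) throw err;
        cursor.toArray(function(err, rows) {
          if (err) throw err;
          console.log(rows);
        });
      });
  });
});
```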
CHUCK:
One other thing I'm seeing here is map-reduce. And that's something I'm much more familiar with being in Hadoop. To some degree, I've seen some implementations of map-reduce either through plugins or directly in PostgreSQL. I think MongoDB has some map-reduce capabilities. I believe that some of the other ones also do. I seem to remember Cassandra having one but I don't remember if I had to pull in some extra stuff for that or not. I guess how core is map-reduce to RethinkDB?
SLAVA:
So, from an implementation point of view, map-reduce is extremely important because it's at the core of the distributed computation engine. The reason why it's important is because once you get to a database cluster and your data is distributed across multiple machines, map-reduce is just a phenomenally simple and convenient way to implement really any kind of data processing. So, it's at the core of the computation engine.
Now, from a user's point of view, map-reduce is kind of complicated, right? If you have to write map-reduce programs, it's not quite like writing Assembly, but it's certainly not a high-level language the way Python or Ruby is. So, what we did with Rethink is at the core we built a map-reduce engine and that's what runs all the computations. And then on top of that we have what we call [inaudible] commands to make everything easier.
So for example, if you want to get distinct users in RethinkDB or distinct documents in RethinkDB there's a command called distinct and you can just call it on a stream and get a set of distinct documents. But you could also write a map-reduce query to do that yourself. Similarly, if you want to do a join or a sub-query, you could do that using a map-reduce program in RethinkDB or you could just call 'dot join' or 'dot equijoin' in RethinkDB. But what happens is a lot of these commands, higher-level commands that the user would write, internally they just compile to really efficient map-reduce queries.
So, map-reduce is at the core of the engine and you could write map-reduce programs in RethinkDB directly. And a lot of users do. But basically every time we see people writing map-reduce programs, to us it's an indication that there needs to be a higher-level command to make it easier. And the idea is that eventually no one will ever have to write a map-reduce query in RethinkDB because everything will be accessible with these higher-level more convenient commands. And I think we're almost there or probably already there, because ReQL evolved for a long time. And we kept adding these higher-level commands. And now everything a user, almost everything I think a user would want to do, they could do it with these convenient commands. And internally it gets compiled down to map-reduce. But no one has to write that unless they really want to.
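[Editor's note: one concrete pairing of the two levels described above, with assumed table and field names and an open connection conn: an explicit map-reduce that counts comments per post, next to the higher-level command that covers the same need.]

```javascript
// The low-level way: group, then an explicit map-reduce.
r.table('comments')
  .group('post_id')
  .map(function(comment) { return 1; })
  .reduce(function(a, b) { return a.add(b); })
  .run(conn, function(err, result) {
    if (err) throw err;
    console.log(result); // [{group: <post_id>, reduction: <count>}, ...]
  });

// The high-level way: count() compiles down to the same kind
// of distributed plan internally.
r.table('comments')
  .group('post_id')
  .count()
  .run(conn, function(err, result) {
    if (err) throw err;
    console.log(result);
  });
```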
CHUCK:
So, can I then run a map-reduce and then run a query against that? In other words, tack on more ReQL afterward to further narrow down my result set of something?
SLAVA:
Yeah, totally. So, all of that… so, one thing about ReQL I didn't mention and I really cared about is we wanted it to be composable in the same way Bash is composable or traditional programming languages are composable. So, you could keep chaining things and you can compose all the components. Like you could say, start with this table and then do 'dot map dot reduce'. And then you could say 'dot join' and you can compose all of the higher-level ReQL commands and lower-level ReQL commands. And it's really, really nice.
CHUCK:
Oh, cool.
SLAVA:
So, nothing ever breaks down. It's not like you do map-reduce and it's done. You could pipe it into another table. You could pipe it into another query. You could use it as a source for joins. So, you could do all these things.
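[Editor's note: a small illustration of that composability, with assumed table names and an open connection conn: one query's output feeds straight into another table's insert.]

```javascript
// Pipe a filtered stream from one table directly into another.
r.table('archive').insert(
  r.table('events').filter(r.row('level').eq('error'))
).run(conn, function(err, result) {
  if (err) throw err;
  console.log(result.inserted + ' rows archived');
});
```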
AIMEE:
Okay. So, I think we were going to maybe go on and ask a little bit about the community and such. So, I was just curious how much the community affects the features that you build and if you are big on trying to get contributors.
SLAVA:
Okay. So yeah, the RethinkDB community is hugely important. When we launched it, we launched it as an open source project. And it very quickly picked up steam. It became the most popular document-oriented database on GitHub very quickly, and I think the second most popular NoSQL database in general; the first one is Redis. So, it became popular. And because all the developers hang out on GitHub, they started using the Issue Tracker. So, it was super important to us even very early on to develop RethinkDB in public. So, all of the discussions that go on, all of the technical discussions, they go on in GitHub on the Issue Tracker. There's really very little that happens internally that we don't publicize.
And the reason why that's super helpful is because users come in and they can request features. As we design, as we go through technical discussions users chime in and say, “Hey, this works for me. This doesn't work for me. I have this use case. This design doesn't fit.” And when we ship products, people then chime in with bug reports and stuff like that. So, what ends up happening is throughout the entire development process, and we try to release new versions of RethinkDB about every six to eight weeks, we get continuous feedback from the community. And it's extremely important because it just fundamentally results in a much, much, much better product. It helps with feature planning. It helps with technical planning and design. It helps with all these things. And we're very grateful to the open source community for contributing in this way.
Now, as far as contributors to the core database, we actually have hundreds of contributors from around the world. And we're very grateful for everything they do and all the work they do, from the core database server to porting to docs. We unfortunately haven't done a very good job of documenting the C++ core database source code. And we're working on that. And hopefully the core engine will become more and more friendly to external contributors over time. So, we haven't been as good as we'd like. But we're definitely getting better.
DAVE:
So, I have a question about hype. [Chuckles] So, obviously Rethink right now is kind of at the very beginning of its hype cycle. I don't know if you'd agree with that. But it seems like it is when you compare it to some of the more established data stores in the industry like Postgres, which has been around for almost 30 years and is super reliable and battle-hardened and whatnot. What do you have to say about Rethink in its current state right now? Would you consider it production worthy and ready to go? Or is there stuff that it needs still?
SLAVA:
Yeah. So, RethinkDB is five years old, which is pretty old for a software project, but very, very young by database standards, compared to Postgres which has been around for decades. So RethinkDB, it took about five years for us to declare everything to be production ready. It's been in beta for about two years, or two and a half years. And we just announced that RethinkDB 2.0 is production ready in April. So now, we're up to RethinkDB 2.1. We definitely have a lot of production customers. Some of these are Fortune 500 companies. Tons of these are small startups. One of my favorite ones is Jive Software, which built a product called Jive Chime on top of RethinkDB to compete with Slack. There are some other really exciting ones we're going to announce really soon.
So, RethinkDB is definitely production ready. People can use it. People do use it in the wild all the time very successfully. But there's definitely an enormous amount to be done. And we've got, our user community has been growing pretty tremendously. It's actually been doubling every three months for the past I think two years. So, it's been a wild ride. And as we get more users, people want more and more and more things. So, the development pipeline, it keeps growing and growing and growing. And I think it's probably going to keep doing that [chuckles] even over the next 10 years, just because there's a lot left to do on pretty much all fronts. And we're very excited to do it.
CHUCK:
Didn't you just release a new-ish version of RethinkDB?
SLAVA:
Yes. We just released RethinkDB 2.1 a couple of weeks ago.
CHUCK:
So, if somebody's been using Rethink 2.0 or whatever the previous version was, what are they going to notice that's different?
SLAVA:
Okay. So, the biggest thing in RethinkDB 2.1 is automatic failover. And this is something we've been working on for over a year. It was a massive, massive engineering project. What happened was in RethinkDB 2.0 if you had a server failure, let's say you have a cluster of five database machines and one of the primary servers fails, then that required human intervention. So, an administrator would have to log in and either fix the hardware or take it out of the cluster. So, with RethinkDB 2.1 we introduced automatic failover and high availability. And the idea is if a server fails, the cluster will elect a new server. So, everything will run without interruptions, without any human intervention.
If there is a net split scenario, like let's say you have two data centers, one in the United States and one in Europe, and there is an interruption in network availability between these two data centers, everything will continue to run. And when the network rejoins the database will know what to do and will do the right thing.
The third thing that happens in 2.1 is if you want to add a shard or add a replica or remove them, you could do all of that live. So, there won't be any interruptions in RethinkDB itself. Before with RethinkDB 2.0, sometimes adding a shard would interrupt the database for potentially hours. And in 2.1 all of that is instantaneous. So, all of the features are basically around automatic failover, all of the consensus stuff and elastic adding and removing shards and replicas automatically.
So, it's really, really wonderful because people can upgrade from 2.0 and they don't have to learn anything new. They get these features out of the box. They don't have to change their applications. They don't really have to change their practices all that much, other than the fact that they don't have to wake up at 3am if something goes wrong most of the time. So, that's really, really nice. And all our users really love that release. So yeah, 2.1 just came out. It's very, very exciting. It doesn't add very many new features that are user-visible. But for all the ops people, it's really, really wonderful. It's a dream come true because if something goes wrong, everything just keeps running. And if you have to scale it up you can just add nodes and the application won't experience any downtime. So, that's 2.1. And if anyone's using RethinkDB 2.0 I kindly encourage people to upgrade. It's a very exciting release.
CHUCK:
And it'll run on Linux or Mac OS. Does it run on Windows?
SLAVA:
Yes. So, it runs on all kinds of flavors of Linux and Mac OS. We actually have a Windows port in progress right now. And we're very excited about it. I actually grew up programming in Windows. So, it was my first operating system and it's still kind of… I know it's not a popular opinion necessarily these days, but it's still near and dear to my heart. So yeah, we're shipping a Windows port pretty soon. But right now, usually if people use Windows, they run RethinkDB in VirtualBox.
CHUCK:
Very cool. Well, if people want to know more about RethinkDB or get involved in the community or whatever, where should they go?
SLAVA:
If you're interested in Rethink, please go to RethinkDB.com. All the resources are there. There are tons of docs, there's community links. We're on Twitter @rethinkdb. We're also on GitHub, same, rethinkdb. So, there are plenty of places to get involved with the community, ask questions, stuff like that. There's an IRC channel. There's a Gitter chat that's basically on top of GitHub. So just yeah, go to RethinkDB.com, RethinkDB on GitHub or Twitter, and there are tons of channels and resources for you to learn and ask questions.
CHUCK:
Alright. Well, let's go ahead and get to some picks. Before we get to picks, I want to take some time to thank our silver sponsors.
[This episode is sponsored by TrackJS. Let's face it, errors cost you money. You lose customers, server resources and time to them. Wouldn't it be nice if someone told you how and when they happen so you could fix them before they cost you big time? You may have this on your Back End Application Code but what about your Front End JavaScript? It's time to check out TrackJS. It tracks errors and usage and helps you find bugs before your customers even report them. Go check them out at TrackJS.com/JSJabber.]
[This episode is sponsored by Code School. Code School is an online learning destination for existing and aspiring developers that teaches through entertaining content. They provide immersive video lessons with in-browser challenges, which means that each course has a unique theme and storyline and feels much more like a game. Whether you've been programming for a long time or have only just begun, Code School has something for everyone. You can master Ruby on Rails or JavaScript as well as Git, HTML, CSS, and iOS. And more than a million people around the world use Code School to improve their development skills by learning by doing. You can sign up at CodeSchool.com/JavaScriptJabber.]
CHUCK:
Dave, do you have some picks for us?
DAVE:
Yes, I do. Just one for you today. I wanted to pick a Netflix series that was pretty cool. I'm a casual history buff and I enjoy casual history. And there's a pretty cool historical fiction/based on real events series on Netflix called Our World War. And it was quite fun. So, I enjoyed it a lot. And I'd like to pick it. That's all I've got for you.
CHUCK:
Alright Aimee, do you have some picks for us?
AIMEE:
Yup. I have two. It's been a while since I did a health one. So, I need to get back on that. So, my non-programming pick is going to be these protein bars that are called Quest Bars. So, a lot of different protein bars are not very good for you even though they sound like they are. They have tons of sugar. You might as well eat a candy bar. And these have one gram of sugar. So, if you're into that kind of thing and you want to eat healthy, these taste really good and they're good for you. They're called Quest Bars. You can usually get them at Whole Foods or GNC or something like that.
And my programming pick, I have been working with another girl on Saturday mornings. And we've been doing a lot with Kyle Simpson's course that he has up on GitHub, the 'You Don't Know JS' series. So, I know you can pay for this, but then it's also… there's a bunch of free stuff up on GitHub as well. And it's really, really good and very thorough. So, that is my programming pick. And that's it for me.
CHUCK:
Awesome. I've got a couple of things. First off, a quick reminder about Angular Remote Conf. It's going to be on September 24th through 26th. So, if you're interested in that, go check it out. You can use the code JABBER to get 20% off of the ticket price. So, go sign up.
Also, I've gotten into a TV series. It's a guilty pleasure of mine. It's called Orphan Black. It's on BBC America. And it's awesome. It's basically, you figure this out within the first few minutes, but this girl's coming back from rehab and she sees a girl that looks just like her jump in front of a train. So, she takes on her identity. And that girl turns out to be a cop. [Chuckles] And it turns out that they're clones. So anyway, you start to unravel all of this mystery behind it. So anyway, really enjoying that.
So, those are my picks. Slava, do you have some picks for us?
SLAVA:
Yes. So, I actually prepared a bunch of programming picks, but now that I know that it's okay to mention Netflix series and TV series in general [chuckles] I'm going to change some of mine and do that. So, I recently got into a TV series called Mr. Robot. And it's really, really cool. It's basically about a security engineer who lives in New York City. He works for a security company. He has paranoid delusions and he goes through the world perceiving it in his own very particular way. And he connects with other people by hacking them and learning things about them. And it's a surprisingly good show. It's really, really good. It's got a very realistic portrayal of what software engineering and hacking is like. And it's got a really good narrative around it. So, I like that a lot.
My second pick is also a TV show called Rick and Morty. It's an adult cartoon about a crazy time traveler, or I guess a crazy inventor that does all kinds of time traveling and multiple dimensions and stuff, and his grandson Morty. And they get in all sorts of really cool funny adventures. It's a lot of fun to watch.
And my third pick is the Rust programming language. It's been around for a while but I only recently got into it. It's been on my radar for quite some time, so I wrote a couple of programs last weekend. And it was tons of fun. So, I'm very excited about it, because I think it'll be probably a big deal in systems programming. I think right now it's the only viable alternative to C++ and it's going to become more viable over time. So, these are my picks.
CHUCK:
Yeah, we did an episode a couple of months ago with Dave Herman about Rust. So, definitely go check that out. We also had Steve Klabnik talk about Rust at Ruby Remote Conf. But I'll get links to all that stuff and put it in the show notes. Lots of cool stuff, looking at Rust. Alright. Well, are you on Twitter? Is there a way that people can follow you directly?
SLAVA:
Yeah. So, I am @spakhm on Twitter. And if you tweet at @rethinkdb, I watch that account all the time too. So, I'll just respond from my personal Twitter account, and you could do it either way.
CHUCK:
Alright. Sounds terrific. Well, thanks for coming. And hopefully some people go and check it out and find some great uses for RethinkDB. I'd also be interested if you're using it to know how you're using it and where it's paying off for you. So, leave a comment in the show notes.
SLAVA:
Thank you guys for having me. This was tons of fun.
JOE:
Real quick. [Inaudible] announcement. Tickets for ng-conf are now available through a lottery. So, if you head over to ng-conf.org you can register for the lottery for ng-conf tickets. The lottery entries are free, and if you win, you get the opportunity to buy a ticket. Because we just had so many people the last couple of years trying to get tickets that it was crashing servers and stuff. So, we're going to a lottery this year to make it simpler. So, head over to ng-conf.org and register for the chance to get a ticket for ng-conf.
And again, it's free to register.
CHUCK:
Yeah, I entered the lottery and I wasn't going to announce it because I wanted to win it.
JOE:
[Laughs] Awesome.
[Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.]
[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]
[Do you wish you could be part of the discussion on JavaScript Jabber? Do you have a burning question for one of our guests? Now you can join the action at our membership forum. You can sign up at JavaScriptJabber.com/Jabber and there, you can join discussions with the regular panelists and our guests.]