AIMEE_KNIGHT: Hey everybody, welcome to another episode of JavaScript Jabber. Today we have myself, coming to you from Nashville. We have Dan.
DAN_SHAPPIR: Hey from Tel Aviv where it's nice and warm.
AIMEE_KNIGHT: AJ.
AJ_O’NEAL: Yo, yo, yo from the Arctic office in Pleasant Grove.
AIMEE_KNIGHT: And we have Daniel, I don't know where you're coming from. So if you wouldn't mind introduce yourself, tell us a little bit about who you are and what we're talking about today.
DANIEL_LATHROP: So I'm Daniel Lathrop. I'm coming to you from Iowa City, Iowa, home of the Iowa Hawkeyes. Go Hawks. I am a former professor of journalism and media informatics at the University of Iowa, and I'm now an independent consultant and freelance journalist. I have been giving talks for the last several years about JavaScript for data analysis and data science. I'm a proponent of it and am trying to help build the community.
AIMEE_KNIGHT: Very cool. I don't think we've had anybody from Iowa on before. I'm actually familiar with that area. I had spent a lot of time there growing up, so small world. Yeah, does anyone have questions to get us going? It sounded like we were talking about some pretty interesting stuff when I jumped on a few minutes ago. I don't know if you want to start there or somewhere else.
AJ_O’NEAL: Was it even JavaScript related though?
AIMEE_KNIGHT: It wasn't. Still cool though.
DAN_SHAPPIR: No, I think we'll jump into the topic first. So you're saying that we're going to be talking about using JavaScript for data science. That's JavaScript as opposed to what? What do people usually use when they do data science?
DANIEL_LATHROP: Well, the dominant ones are R and Python. R tends to get used at the far end of the pipeline by statisticians and analysts, and the pipelines themselves are mostly built in Python. The heavy lifting is mostly done in Python, because R is super slow, except that Python, compared to JavaScript, is also super slow. So that's one of the issues: Python is, in its programming paradigm, fairly inappropriate for the problems that are being solved.
AIMEE_KNIGHT: Just to make sure I understand what you just said. So are you saying that JavaScript is faster than Python or Python is faster than JavaScript?
DANIEL_LATHROP: Uh, JavaScript is substantially faster than Python at everything except regular expressions.
DAN_SHAPPIR: That's not actually that surprising. I mean, when you've got like Apple, Google, Microsoft and the open-source community trying to see who can develop the fastest JavaScript engine, then you'll get some pretty amazing results. But again, before we delve into that, can you perhaps talk a little bit about what you actually mean when you say data science? Can you kind of provide the definition for that?
DANIEL_LATHROP: Sure. That's a great question. Data science is a huge bucket, because everything gets lumped in there, from making a dashboard in Tableau to dealing with petabyte-scale data in real time. To me, it's anything that involves gathering, analyzing, and delivering data for decision-making and understanding.
AIMEE_KNIGHT: Now, did you choose JavaScript yourself? Or were you kind of taking over a project where JavaScript was already the technology used?
DANIEL_LATHROP: So, I chose JavaScript myself, for probably an unusual reason. I joined the university about five years ago (I left this summer), and I began teaching an elective course on what's called data journalism, which is just data science for journalists. It's kind of like rocks for jocks, in a certain way, because journalists mostly hate math. But it turns out I could do it without a lot of math being involved.
AIMEE_KNIGHT: As someone with a minor in journalism, I will attest to that. But I've come to like math in my later years.
DANIEL_LATHROP: Yes. Well, I always loved math. I worked on my college paper, became a newspaper reporter after college, and I've used math since my first day on the job. But I began teaching this class, and I realized quickly that the way it was being taught at other universities, the way I learned it, was super confusing, because it's like: okay, we're going to do two weeks of Excel, two weeks of SQL, two weeks of Python, then two weeks of data visualization in something, and then a big project. That is a lot to absorb for people who have never written a hello world before, who've never done anything more in Excel than maybe sort stuff. Although my students were coming in able to make pivot tables, because our intro class required that. But that was it. That was the baseline. And I wanted to get them making interactive graphics in JavaScript, and to me, that meant I needed to start in JavaScript. I was working alongside another new professor, who joined at the same time as me, who was a front-end UX designer. When he wrote code, he wrote it in JavaScript, and he wrote a lot of data visualization code. We were both data viz folks. And he did all his scraping and analysis in JavaScript. I said, that's madness. How could you possibly do that? I came from using all of these different tools in all these different ways. And he said, well, I just don't have to change topics all the time. I can just use JavaScript for everything. And Node is awesome, and you should use it. So while I was getting ready to teach that class, I quickly tried to figure out how much I could actually do in JavaScript. I'd done a lot of D3 work, because if you do customized data visualization, D3.js is the tool to do cool stuff. And I realized I had actually already been doing a lot of data analysis, because my way of doing it tended to be:
I wanted to automate taking relatively raw data and displaying it, so that I could drop new data in pretty easily. So what I had typically done was figure out all my data analysis, and then recreate just the piece that led to the visualization using D3 and JavaScript. Once I actually looked at it, I realized: wow, I already know how to do all of this. I just haven't been doing it. And it's super fast, because I was running fairly complicated analysis in the browser. For example, resizing counties by population in what's called a pseudo-Dorling cartogram: you resize the counties based on their population and arrange them so that they look nice, like a jigsaw puzzle. Some of that was manual work; the prettiness of the locations I actually worked on with an artist. But the resizing had to be done dynamically, because we could have different numbers, we could have different election results, it could change over time. So having learned to do all these things, I realized, hey, I could just do it this way, and I could teach my students to do it this way. Instead of using SQL, I can just use an array of objects, and instead of writing a select query, I can write a filter. And so I had a bunch of journalists who had never taken an advanced math class, certainly never written a hello world, and I was able to get them writing their own interactive visualizations in D3 by the end of the semester. I felt pretty successful after that. And I said, well, if I can do this with journalism students, what could we do if we actually put our minds to it?
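The array-of-objects idea sketches out like this (the county rows below are made-up sample data, just to show the shape):

```javascript
// A "table" is just an array of objects.
const counties = [
  { name: 'Johnson', state: 'IA', population: 151140 },
  { name: 'Linn', state: 'IA', population: 226706 },
  { name: 'Dane', state: 'WI', population: 546695 },
];

// SQL: SELECT name FROM counties WHERE state = 'IA' AND population > 200000
const bigIowaCounties = counties
  .filter(c => c.state === 'IA' && c.population > 200000)
  .map(c => c.name);

console.log(bigIowaCounties); // ['Linn']
```

`filter` plays the role of `WHERE`, `map` the role of the `SELECT` column list; joins and group-bys fall out of `reduce` in the same style.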
DAN_SHAPPIR: Based on what you're saying, if I understand correctly, the way that you previously worked is that you would have some backend services doing the numerical computation, and for that you would use whatever other programming language, and then the front-end visualization was done with JavaScript. And the change you made was to start doing the numerical computation using JavaScript as well: initially perhaps to avoid jumping between different systems, and then just because you enjoyed doing it in JavaScript and it was really fast. Am I presenting this correctly?
DANIEL_LATHROP: That's about right. I think it's really helpful. Node obviously made a huge difference. Certain things don't work well in the browser: you don't want to deal with terabytes, or even gigabytes, of data in the browser. But you certainly can do a lot. Once in a while, on some of the projects I work on with a Node backend, I have to go to the command line and increase the amount of memory available, up from about one gigabyte. That can be a limiting factor, depending on what you're doing. But yeah, I just came to love JavaScript. I did not like JavaScript before it became my full-time language. To me, it was an obligation: I wanted to make things in the browser, I had to use JavaScript, and I avoided it at all other costs. But once I really started using JavaScript, I realized that it's actually better for pretty much all of these things. The programming paradigm, the sort of async possibilities, really lend themselves to doing complicated analysis. And there are other people who think so. Google has a TensorFlow implementation in JavaScript, so you can take any TensorFlow model, or develop a new model in JavaScript, and do pretty complicated machine learning stuff.
AIMEE_KNIGHT: This is actually interesting to me, because we're taking over a project, myself and someone else. We have an intern right now, an AI/ML intern, and when she leaves, I'll be taking over her project, and the other guy will be helping out as well. We're using Python. One quick question I wanted to ask, about what Dan was saying with all the computational stuff: you had mentioned Node, but is this in comparison to just Python out of the box? Because I don't know much about Python, but I know you can leverage async code and stuff like that there. So is that taking that into account? I ask this because I think it's going to take a lot to shift any sort of mindset in this space toward JavaScript being a viable option.
DANIEL_LATHROP: So yeah, there are async libraries that have been grafted on top of Python, but they're not natural or native to Python. Most of the high-end statistical stuff is actually grafted onto Python too, and the paradigms are a bit off from the fundamental structure of Python. Python wins on data science because of its community and because of its popularity in academic settings. People who learn this stuff came from academia or trained in academia, right? Computer science courses are taught using Python and...
DAN_SHAPPIR: But where is the heavy-duty computation actually performed? Let's say I'm more old school, not using JavaScript but using Python. Is the heavy-duty computation actually done in Python? Or is Python calling out to some external service or library that's actually implemented in C or Fortran or whatever, and that's doing the numerical heavy lifting?
DANIEL_LATHROP: There's definitely the possibility to call into a C++ library, and there are implementations on top of that. But the problem is that ultimately a lot of the logic that you're building is custom. The actual number-crunching engine may be in C++, but all of the other steps around it are what you're writing, and that code in Python is a lot slower than in JavaScript. That's not always the limiting factor, but it's an important factor. And in a lot of cases you're writing async code, you're trying to create callbacks, in a language that isn't designed for callbacks. You know?
DAN_SHAPPIR: Yeah, sorry to interrupt, but that's the thing I'm trying to understand. Callbacks to what? Callbacks to another system that's performing computation for you? What is it calling out to that you're waiting to receive a callback from?
DANIEL_LATHROP: Well, I see. A lot of times you have a data service and you're calling APIs, right? You're calling to get data from somewhere, or there's some bigger service, something like Amazon Redshift, where you have a bunch of streaming data coming in, and there, async and callbacks are really useful. There's also a whole world of what's called high-performance computing that's done with parallel processing, where you actually need to break code up into distinct pieces that can be map-reduced across multiple processing cores. Again, there are a lot of people trying to do that in R; IBM has spent a lot of time and effort on this. There's some implementation being done over Python. But you end up with the problem of a language that just isn't designed for it. R is procedural, only quasi-object-oriented, not really object-oriented at all. Python is sort of standard object-oriented, but with a synchronous model, so any parallel or async work is being done through a library, and those libraries themselves have a degree of awkwardness. Imagine: we have enough trouble learning callbacks and promises and async programming in a language where it's part of the language spec; in Python it's not. So you have a translation problem that these libraries have to solve, and a lot of them have to solve it in Python, which is, again, slow. Not super slow. I mean, Python's great. I love Python. I've written a lot of Python code in my life, and I still sometimes write Python code, because the libraries are a lot more mature and there's a much bigger community. But any time you have to either do things in parallel or call back from an external service, JavaScript is really better designed for that. And again, the speed is better.
Let me give you an example. This is a simple, trivial problem that I'm currently solving, but it's illustrative. My wife is a professional fundraiser, so she has a service that does her data collection and payment processing, and it streams out the contribution objects as JSON, which I'm working on taking into DynamoDB. I'm going to write a Lambda; it's going to execute on each contribution, and I'm going to be able to do some asynchronous things with that data, because that data then needs to get processed. It needs some analysis done on it. It separately needs to be stored in a different data set. It separately needs to be pushed to her newsletter software. Again, this is a fairly trivial task, and frankly, I could probably just download a spreadsheet every day and upload it every day, and it would take me not a huge amount of time. But by having this, when hopefully she signs up a giant client who's doing a hundred thousand transactions a day, all of a sudden we can have that in real time, and it can do all of those things, and it's got flexibility. I can drop all of them into one Lambda, instead of chaining a whole bunch of other things together, because I can do the parts that need to be done in order using async-await, and then all of the delivery at the other end, all of the sending things out that maybe only have to go to one place, can be done using promises, or can just be fired off and let go.
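A minimal sketch of the pipeline Daniel describes, with every external service replaced by a hypothetical stub (the real handler would use the AWS SDK for DynamoDB and the newsletter service's API; all function names here are invented for illustration):

```javascript
// Hypothetical stubs standing in for real services; each would be an
// SDK or HTTP call in production.
async function saveToDynamo(contribution) { return contribution; }
async function runAnalysis(contribution) { return { total: contribution.amount }; }
async function pushToNewsletter(contribution) { return 'newsletter ok'; }
async function archiveToSecondaryStore(contribution) { return 'archive ok'; }

// Lambda-style handler: steps that depend on each other run in order
// with await; independent deliveries fan out in parallel at the end.
async function handler(event) {
  const contribution = JSON.parse(event.body);

  await saveToDynamo(contribution);                 // must land first
  const analysis = await runAnalysis(contribution); // depends on the data

  // Newsletter push and archival don't depend on each other.
  await Promise.allSettled([
    pushToNewsletter(contribution),
    archiveToSecondaryStore(contribution),
  ]);

  return { statusCode: 200, body: JSON.stringify(analysis) };
}

handler({ body: JSON.stringify({ amount: 50 }) })
  .then(res => console.log(res.statusCode)); // 200
```

In a real Lambda the function would be exported as `exports.handler`; `Promise.allSettled` is used so one failed delivery doesn't abort the others.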
DAN_SHAPPIR: So let me try to summarize what you're saying. You're getting, let's say, a request from a front end. You're doing a bit of custom crunching on the input values. Then you're calling asynchronously to either get some data or to do some heavy-lifting but standardized computation. You get the data back, and you run it again through the custom pipeline to get the numbers that you want to display. Would that be a more or less accurate description of the flow?
DANIEL_LATHROP: That's about right, except that then I have to take those numbers at the end and also send them to several different places.
DAN_SHAPPIR: Oh, cool. So it might be, like you said, a sequence of operations, but each one of these operations is in and of itself asynchronous.
DANIEL_LATHROP: Exactly. And in some cases, I'll have one operation that can be done, and then I'll branch, right? Async-await to do this calculation, and then I can start a separate chain of things, either in a separate async function that I call, or just with promises, so that each pipeline that has to happen to go to some service, or to create some visualization, happens asynchronously from the others.
DAN_SHAPPIR: So, two technical questions that kind of jump to my mind. One: what kind of difference, if any, has the introduction of BigInt made for this? Is there any implication of that?
DANIEL_LATHROP: I think the heavy-duty, more numerical libraries like ML.js, and obviously TensorFlow.js, are going to make use of it. And there have been some libraries before that sort of hacked around it to create fake BigInts. So my assumption is that it's going to speed things up a lot. But from the userland perspective, mostly that's already been abstracted away into the libraries.
DAN_SHAPPIR: Oh cool. So what you're saying is that effectively, even though BigInt wasn't a built-in feature of the language, where it was needed it was shimmed, and now we can effectively get rid of the shims. So it will just make life easier for the people doing the implementation, and maybe make the stuff faster, but for somebody who's consuming the data, it's not going to make that much of a difference.
DANIEL_LATHROP: That's about right.
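The precision gap those pre-BigInt shims worked around is easy to demonstrate:

```javascript
// Number silently loses integer precision past 2^53 - 1...
const max = Number.MAX_SAFE_INTEGER; // 9007199254740991
console.log(max + 1 === max + 2);    // true: both round to the same float

// ...while BigInt keeps exact integers at any size.
console.log(9007199254740992n + 1n === 9007199254740992n + 2n); // false
```

Before the `n`-suffixed BigInt literals landed in the language, libraries emulated this with arrays of digits, which is the slow path BigInt now removes.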
DAN_SHAPPIR: And another question, have you ever played with trying to move some of that back-end computation into the front end, into a worker or something like that?
DANIEL_LATHROP: I haven't ever tried to do it in a worker. That would be an interesting problem to solve, actually. For most of the things I'm doing that end up in the front end, it's always made sense to create an API. To a point, I will say: I do have a fair amount of the transformations happen in the front end, and I have done that fairly often, because if you have a visualization, you want it to be flexible. And also, I fall into a certain category: so much of the internet is broken if it's more than five years old. Over my career, most of the interesting things that I've built no longer exist. So I have moved toward trying to have a lot of things where I can make it work off of static JSON, or make it work off of CSV. The light-duty computation is very easily done in the browser. It's shockingly fast, actually, up until you get to maybe 10 megabytes of data. Workers would be an interesting approach. It's a web API as opposed to a language feature, so I've never really dug into it, and I haven't seen a lot of library support.
DAN_SHAPPIR: Look, within the worker it's effectively just JavaScript. The main advantage of offloading heavy-duty computation to a worker, from the front-end perspective, is that if you run it within the main thread, it will block the user interface. So if somebody tries to interact with the graph, for example, while you're running a computation, it just would not respond, because it's too busy doing the computation. With a worker, it's kind of like an API call: you give it whatever you want to compute, and when it's done, it posts back the results. So it becomes totally asynchronous from your perspective. Not blocking, that's the main advantage. Compared to putting the stuff on the backend, I would say it's more or less a question of cost. If you're doing your computation in the backend, you're paying for the backend services. Cloud computing is significantly lowering the costs, but there's still a cost. If you're putting it in a worker, you're shifting the cost over to the consumer, because it's eating the battery on their device, but it's offloading the computation from the server. Now, it might be slower compared to a fast server, especially if they have a mobile device, and they probably have less memory than a high-end server. So if, like you said, for data science you need a lot of memory, that can certainly be a limiting factor. Still, it might be an interesting architecture to try. So, you mentioned D3 as what you're using for visualization, and TensorFlow as what you're using for machine learning. Can you give examples of some of the additional libraries you tend to use in this context?
DANIEL_LATHROP: Yeah. So there's a library called simple-statistics that's got a really nice feature set for basic statistical tests: means, medians, modes, run-of-the-mill margins of error, things that you have to do a lot in data science, or in data analysis in general. It's a fairly amazing library that has done a really good job of implementing a wide variety of data-analysis steps. There's a great library for dealing with larger CSV flat-file data sets called Papa Parse, and it's really nice. I use that whenever I'm ingesting large text files, which happens a lot on the server side, on the ingestion side. One example: I'm working on a project involving census data that has to ingest about 200 tables for each state and US territory. Each of those is a CSV, and some of them are very large. They're very wide CSVs, actually; mostly they're not super long. Wide CSVs are a particular problem, and Papa Parse is really great for that. Then there's ML.js, which is for more of your numeric computations. ML is machine learning, but it's more on the side of complicated statistics and implementations of underlying numerical things. There are things like logistic regression that fall into the machine-learning category but aren't supervised or unsupervised learning the way we usually think about it. And TensorFlow, obviously; Google puts a lot of energy and emphasis on that. It's hard to beat Google for engineering quality and talent, particularly when they then go and open-source it.
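To give a flavor of the run-of-the-mill descriptive statistics simple-statistics handles, here are hand-rolled plain-JavaScript versions (written for illustration only; the library's own tested functions are what you'd actually use):

```javascript
// Mean: sum divided by count.
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

// Median: middle value of the sorted copy (average of the two middles
// when the count is even).
const median = xs => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

// Sample standard deviation (n - 1 denominator).
const sampleStdDev = xs => {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1));
};

const data = [2, 4, 4, 4, 5, 5, 7, 9];
console.log(mean(data));         // 5
console.log(median(data));       // 4.5
console.log(sampleStdDev(data)); // ≈ 2.138
```

The library adds the tested, edge-case-hardened versions of these plus regressions, sampling, and the significance tests the transcript mentions.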
AIMEE_KNIGHT: I do have a question. Actually, two questions. I'll start with the one that's probably slightly more related to what we're talking about at this very moment, which goes back to what we were talking about earlier: Python versus JavaScript. At work, we were deciding between a couple of different algorithms for some time series data and settled on something called the Facebook Prophet algorithm. I just went to its documentation on GitHub, and I see that they only have implementations in R and Python. So I'm kind of curious: is this a problem that you face right now, that a lot of these algorithms don't have well-tested JavaScript implementations? And I guess the other question is: do you think JavaScript will just take time to catch up? I can think back two years ago to the Vue versus React debate, and Vue caught up over time. But I know that this Prophet algorithm is pretty heavily used, and I'm not seeing any JavaScript support for it.
DANIEL_LATHROP: That's a huge problem, and one of the reasons I'm out there trying to evangelize this. I actually had a short-lived JavaScript-for-data-science weekly newsletter, and there really wasn't enough content, enough new things happening, to have news every week. So it sort of petered out, because the community in JavaScript isn't there yet. I think JavaScript is the right language for data science, but right now the center of gravity is elsewhere: statisticians use R, computer scientists use Python. Those are really the places where things are implemented and developed, and those things then tend to be consumed at the last stage, in the browser or through a web interface. All of the actual work, and I would say the high-end stuff especially, is mostly being done in Python. Yeah, it's a frustration for me.
AIMEE_KNIGHT: I can see that. I'm excited that you're talking about it.
DANIEL_LATHROP: It's a frustration for me. I am not an AI researcher. I am not an algorithms guy. I am a consumer of those things. So I actually am working on a lot of projects where I do have to drop into R or drop into Python for pieces of what I'm doing. It's kind of hard to get there. There is now, I will say, a really good analyst's notebook tool that's finally been created in JavaScript, and I think that's going to make a huge difference, just because of tooling. That's the Data-Forge Notebook. And I'm going to forget the author's name just when I'm supposed to be saying it.
AIMEE_KNIGHT: Oh, it's okay. You can drop it in later if you need to. The other question I had, since you're touching on it now, and I guess this is my way of circling back to it: I would love to be able to use more JavaScript with the stuff we're doing at work, because that's kind of my bread and butter, but the Python ecosystem is massively larger for it at this point. I thought it was interesting: I asked Google Nest the other day, what's better, Python or JavaScript? That's a very open-ended question, but I was surprised that the answer was that Python is better because it's easier to learn. And for the people you're teaching JavaScript to, I think sometimes D3, or just the async nature of JavaScript, can make it a little bit more challenging for beginners than Python. Have you faced any of that in teaching students to use JavaScript?
DANIEL_LATHROP: So I would say that JavaScript is harder to learn for people who have already made the jump to writing code, when the code they're writing is in something different from JavaScript or another semicolons-and-curly-brackets language. But for teaching my students, in any case, the gap between nothing and something is really the large part. It could be five or seven or ten percent harder for them to learn JavaScript than it would be for them to learn Python. But I will tell you, I have students who, because of our media informatics program, are simultaneously taking my course and intro-level computer science courses, and my students had a much easier time with JavaScript than with beginning to understand list comprehensions in Python. So it's arguably easier; either way, I think the difference is smaller than people think. The other thing I will say about teaching new people JavaScript, and the ability to do things in it, is that at the end of the day, I can teach a student, in their browser: it goes, it gets data, it processes it, it does what we want to do with it, and it displays it in a graph or chart that they have created, that they can interact with, that they can see and touch. (They shouldn't touch it unless it's on a touch device, but if it is, they can actually touch it.) And they've only had to learn one language; it all works, relatively speaking, the same. Then I can take that student further. For example, I had a student who was into sports, and for his final project he scraped the basketball play-by-plays for every team in the Big Ten. The goal eventually was to do some analysis on that for an undergrad project. Just the scraping of ten or twelve fairly complicated team websites was a fairly large undertaking.
But he was able to do that in the same language that he could make a visualization in. And that's a huge difference, because the best moment in getting people to do this is how you get them committed. The moment that gets undergrad journalism students committed is when they make a bar chart, and then I show them how to create a CSS hover selector that changes the color of the bar they're hovering over. They'll just sit there and play with that for 15 minutes after they do it, going in and changing the CSS over and over again. That's not strictly JavaScript, but that's what you can do when you have JavaScript. Try that with Python. I dare you.
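For comparison, here is the kind of Python list comprehension the students struggled with, next to the equivalent JavaScript chain (a generic illustrative pair, not taken from the course materials):

```javascript
// Python: squares = [x * x for x in nums if x % 2 == 0]
// The same query in JavaScript reads left to right as a filter/map chain:
const nums = [1, 2, 3, 4, 5, 6];
const squares = nums.filter(x => x % 2 === 0).map(x => x * x);
console.log(squares); // [4, 16, 36]
```

The chained form spells out the two steps (keep the evens, then square them) in the order they happen, which is arguably what makes it easier for beginners.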
DAN_SHAPPIR: So suppose I'm convinced, and I want to do my data science in JavaScript. I actually even have JavaScript experience, which makes it that much easier. But what I'm lacking is data science experience. Where should I go to learn the math in general, and data science in particular?
DANIEL_LATHROP: There's a great new book that came out earlier this year: JS4DS, that is, JavaScript for Data Science. It's a textbook published by Pearson, and the whole thing is also available on the web. It's a great text for getting you started on all of these things, and it walks through everything from soup to nuts. I wish I had had it for my students the last time I taught this class.
DAN_SHAPPIR: Can you give the name of the book again?
DANIEL_LATHROP: Yes. The book is JavaScript for Data Science. The website is js4ds.org. And you can also buy it as a dead-trees book; it was published in January.
DAN_SHAPPIR: Awesome. I just put in the link; we'll have it as part of the show notes. That's awesome.
AIMEE_KNIGHT: One thing I was going to ask too. My eyes weren't really opened to this until being in the role that I am now. I always thought AI and ML were more for somebody with some sort of formal education in this, doing very specific things. But just given the economics of the world, and trying to do forecasting and things like that, do you think this is going to fall into the laps of more everyday developers, because they want to do things economically and efficiently? That's how it's kind of fallen into the day-to-day stuff that I'm doing.
DANIEL_LATHROP: Yeah, that's a great question. So there are certain things on the edge that researchers do. And as researchers do those, normalize those, and build tools for doing them, then normal people pick them up. I consider myself a dumb guy. I was a newspaper journalist; it doesn't get dumber than newspaper journalists. I say that with some self-deprecation, but I'm not a theoretician. I don't develop new tools; I use tools. And as those tools are there, of course we're going to take advantage of them, and we're going to learn the best practices for taking advantage of them, and that's what's happening. TensorFlow, which has a JavaScript version, is an example of this. Google put a lot of time, smart brains, engineering hours, and statistician hours into creating a framework for doing these kinds of workflows. And now, with some understanding of what you're doing, you can use that tool. You don't have to understand how to make the tool; you only have to understand how to use it. So I think that's, over time, going to be more and more of what developers end up doing, because we can.
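As a tiny illustration of the kind of numeric heavy lifting such frameworks wrap up for you, here is an ordinary least-squares line fit in plain JavaScript. This is just a sketch for intuition, with made-up data; it is not how you would use TensorFlow.js itself.

```javascript
// Ordinary least-squares fit of y = slope * x + intercept, in vanilla JS.
// Frameworks like TensorFlow.js generalize this kind of model fitting;
// the point here is only to show the underlying arithmetic.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  const slope = num / den;
  return { slope, intercept: meanY - slope * meanX };
}

// Points lying exactly on y = 2x + 1:
const { slope, intercept } = fitLine([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(slope, intercept); // → 2 1
```

Using the tool rather than building it, as Daniel says, means a library would hide all of this behind a `fit()` call; but nothing about the math requires Python.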
DAN_SHAPPIR: I have one last question on my part. Based on the way that you described it before, a lot of the computation you're doing is, let's say, running in a Lambda on the Amazon cloud. And you're writing potentially fairly complex JavaScript code, not necessarily complicated structures, but various numerical computations, which can be non-trivial. How do you debug it?
DANIEL_LATHROP: The correct answer is with unit tests. That's the correct answer. The other answer is by doing integration tests: just running it on things, seeing what's wrong, and going back and finding it in the code. Unit testing solves most of that. But in anything involving data analysis, you do have to run tests in full. You have to run test data through it where you know what the results should be, and see if you get the result you wanted it to get. And it should be gnarly edge-case data, so that you're not missing something that you're likely to see.
DAN_SHAPPIR: Kind of reminds me of that old joke about QA for a bar, where a QA person walks into a bar and orders one beer, a million beers, minus one beer, and a lizard. Then an actual customer walks into the bar, asks where the bathroom is, and the whole bar explodes and kills everybody. So yeah, I guess if you can get really good test coverage in advance, and if you can break the flow into units that can be unit tested in a fairly straightforward manner, that's definitely the best way to go. Hopefully you don't run into situations with some sort of convoluted data flow that you somehow need to step through.
DANIEL_LATHROP: Yeah, I think the answer is always breaking things down into small pieces, breaking things into separate, testable functions, and writing unit tests. That's always necessary. And then you still have to do the integration tests. You still have to test it with live data to make sure that, even with each individual part working, you haven't missed the odd edge case where two of them break in ways that lead you to still get data, and have that data be wrong.
DAN_SHAPPIR: Cool. Well, unless anybody has any additional questions, I think we can head over into picks now.
AIMEE_KNIGHT: I'm good. Yeah. AJ, did you have anything that you wanted to chat about?
AJ_O’NEAL: The time has passed. There were a couple of things earlier on, but the conversation went in different directions. It's fine.
AIMEE_KNIGHT: Okay. That was a great episode, though. Thank you for coming on.
When I first started taking computer science classes in college, I thought programming was just a joke. In fact, I changed my major over to engineering and started doing computer engineering and chip design. Then I found Ruby and I fell in love. I love Ruby. It was my first real programming language where I dove deep and really learned how to make software that makes a difference for other people. Since then, and the way that we got started with DevChat.TV, we started a show called Ruby Rogues. It's currently in the 400s of episodes. We've talked to hundreds of people in the Ruby community about the Ruby community, about the Ruby programming language, about Rails, and about what makes good programming. So if you're interested in RubyRogues or you just wanna hear a long series of experienced programmers talking about real problems, then go check out rubyrogues.com.
AIMEE_KNIGHT: But yeah, we do do some picks. So we'll go around, and since you are our guest, we'll have you go last. Dan, you wanna go first?
DAN_SHAPPIR: Okay, will do. So my pick for today is content by a favorite of the podcast, Kyle Simpson, Getify. I needed to get up to speed on service workers. It's a core technology of the web, but somehow I've managed not to do anything significant with it. So I decided the best way for me to get up to speed quickly was to watch the Frontend Masters course that Kyle has, titled Exploring Service Workers. And so far, it's definitely not disappointing; I'm enjoying it a lot. And Kyle, as always, is an amazing teacher. So that will be my pick for today.
AIMEE_KNIGHT: Awesome. AJ?
AJ_O’NEAL: Well, skip me for a second. Come back.
AIMEE_KNIGHT: Okay, cool. I'll go with mine. So mine is actually a white paper. I'm not sure if that's technically what you would call it, but it's basically an abstract for the different details along with what they're talking about, and it's written by some people at Facebook about forecasting at scale. This was what the intern we have at work passed to me, since she knows way more about this stuff than I do. But like I said earlier, I'm starting to level up on it, since I'll be taking over the stuff she does when she leaves and trying to automate it. So yeah, I will drop a link into our show notes for that.
DAN_SHAPPIR: You should have the system forecast that everything will be wonderful if all the devs are given a huge raise.
AIMEE_KNIGHT: I mean, I don't know. I'm actually very, very excited about learning this kind of stuff. It's very intimidating to me, but also really fun and interesting. So AJ, are you ready?
AJ_O’NEAL: Yeah, I'm ready. I've got some for you. So first of all, I've picked this before, and I'm going to pick it again: ripgrep, or rg. It's a drop-in replacement for grep that is git-aware, respects .gitignore and .ignore, and has some other nice properties. The readme is atrocious, because it's six pages of how this is faster than every other thing, which no one cares about, except for people who care whether their greps are 10% faster than other tools. So I've got a cheat sheet up on webinstall.dev for it, and an easy way to install it on Windows, Mac, and Linux. And then I'm also going to pick, well, I have a concern about the way things are going, and I've been trying to think about how to put it in a good way, and a conversation the other day kind of brought this up to the top. When you consider human history, first of all, I don't think that we're likely to change any more in the next 10,000 years than we have in the last 10,000 years. I think human nature is going to stay pretty much the same, and we're going to repeat history over and over again with all of its atrocities. That's what we've done for the past 10,000 years, and I just can't see that changing. And I want to point out that in history, the people who were the most adamant that we follow a morally superior path are generally regarded as history's villains. So when you think about, and I'll put these in air quotes, the "Christian Crusades," or the "Islamic beheadings," or the "German cleansing," all of these things were based on the moral superiority of making the world a better place. And when we look back on these things, we generally don't think of them as having accomplished that goal. And in line with that, Wikipedia has some interesting pages, you know, on history and whatnot.
And this is, despite what you may think, because no matter which of the six different political movements you are a part of or not a part of, you're going to feel like I'm targeting you, and I'm not. This is so scarily broad that it applies to just about everyone of just about every party or position. And I think it's in part due to social media and the way that kind of the worst of human nature is brought out. But if you look at the Wikipedia article on propaganda in Nazi Germany, I think it will be an eye-opening experience as to the kind of strong opinions and divisiveness and moral arguments that were made to create something so horrific. And I just hope that people will be open-minded and aware that, statistically, you're more likely to be on the side of the oppressor than the side of the victim, no matter which political stance you're taking. Historically speaking, more people were the silent ones or the oppressors than were the peaceful people. And if you have that in mind, then perhaps you don't become one of the people that you didn't think you'd ever become.
AIMEE_KNIGHT: Daniel, do you want to go next?
DANIEL_LATHROP: Wow, that's hard to follow. So I already mentioned js4ds.org and the Data-Forge Notebook, which is something people should take a look at to start doing their data workflows in JavaScript. And I also want to give a shout-out to Claudia.js, which is a serverless tool for JavaScript devs that I've started working with, and that is way easier than something like the Serverless Framework, which makes my head hurt even more than directly working with AWS does. No offense to the Serverless Framework folks; they're amazing, they're doing the Lord's work. But Claudia.js is something that JavaScript devs should take a look at.
AIMEE_KNIGHT: Awesome.
DAN_SHAPPIR: Now, Daniel, if people want to contact you, follow you, read stuff you write, maybe reach out to you, what would be the best way to do that?
DANIEL_LATHROP: So all my information can be found at daniel.buzz. It has links to all of my other presences on the internet.
AIMEE_KNIGHT: Awesome. Thank you so much for coming on. This was really good. Is there anything else?
DANIEL_LATHROP: Thank you so much for having me here. I really appreciate it. I really want to urge people in the JavaScript community to start seeing JavaScript as a first-class place to do your work with data.
AJ_O’NEAL: Cool. I second that motion, having used Python and knowing how, although libraries like pandas and FB Prophet, et cetera, exist in Python, it's a real pain to use if you need it to work with more than one thing at a time.
AIMEE_KNIGHT: I guess with that, we will wrap up and say bye and we'll see everybody next week.
DAN_SHAPPIR: Bye bye,
AJ_O’NEAL: adios.
Bandwidth for this segment is provided by CacheFly, the world's fastest CDN. Deliver your content fast with CacheFly. Visit c-a-c-h-e-f-l-y.com to learn more.