Full-Text Search in Ruby - RUBY 600

Ruby Rogues

Our original panel podcast, Ruby Rogues is a weekly discussion around Ruby, Rails, software development, and the community around Ruby.

Published: Jun 21, 2023
Duration: 53 minutes

Show Notes

Dave, Chuck, and Valentino join this week's panelist episode to talk about "Full-Text Search in Ruby". Dave takes the lead as he explains full-text search, how it works, and its purpose. They also dive into Meilisearch and Elasticsearch.

Sponsors


Links


Picks


Transcript


Charles Max Wood:
Hey folks, welcome back to another episode of the Ruby Rogues Podcast. This week on our panel we have Dave Kimura.
 
Dave Kimura:
Hey everyone!
 
Charles Max Wood:
Valentino Stoll.
 
Valentino:
Hey now.
 
Charles Max Wood:
I feel like I've been gone forever. I'm Charles Max Wood from Top End Devs. And yeah, this week, it's just us. We're gonna talk about full-text search. And Dave, you've been playing with that a bit. I've played with probably just one angle of this lately. But yeah, since you recommended it, do you kind of want to set the stage for us?
 
Dave Kimura:
Yeah, so I think the first thought when I hear full-text searching is: don't do it. And I say that jokingly, but I'm also very serious about that. It's one of those premature optimizations that we often run into. So just as a preface, only listen to this if you know that you have reached that precipice point where now you do need it. But if you're starting out a new application, I would not dive into full-text search right away, because it does add a certain level of complexity to your infrastructure and to your application. But assuming that you've been doing just the normal querying, you've done as much optimization as you can, and now you are reaching a point where you have enough data where it actually makes sense to do full-text searching, or if you have the very specific need of weighted results and stuff like that, and you've kind of outgrown just the normal database querying side of things, then it does make sense to introduce some kind of full-text searching. And as far as I know, there's three or four different major paths that you can go with, the easiest being pg_search, which is
 
Charles Max Wood:
Right.
 
Dave Kimura:
the full-text search within the database, if you're on PostgreSQL. And MySQL also has a full-text search built into it now. Actually, it's been in there, I think, since version 5.7 or something, which is long deprecated now, so version 8 does have it in. But then you can switch over to a third-party service, which is adding more into your infrastructure, but it also has more capability. So the old tried-and-true one has been Elasticsearch. And anyone who has hosted an Elasticsearch instance or application for some period of time will have ultimately been utterly burned by it, with it just dying or running out of space or something horrible. But then, after enough pain points that people have experienced with Elasticsearch, there have been some new companies come up. And one that I've used a lot lately is Meilisearch. And it boasts to be very powerful like Elasticsearch, without all the complexity.
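For reference, a minimal sketch of the database-level option Dave mentions, using the pg_search gem on a hypothetical Episode model (the model and column names are illustrative, not from the show):

```ruby
# Gemfile: gem "pg_search"
class Episode < ApplicationRecord
  include PgSearch::Model

  # Search the title and description columns with Postgres tsearch;
  # "A"/"B" weights rank title matches above description matches,
  # and prefix: true lets partial words still match.
  pg_search_scope :search_content,
                  against: { title: "A", description: "B" },
                  using: { tsearch: { prefix: true, dictionary: "english" } }
end

# Usage:
# Episode.search_content("full text search")
```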
 
Charles Max Wood:
Yeah, so I've been using Searchkick, and it works with OpenSearch and Elasticsearch. And to be perfectly honest, I don't even know what the difference is, if one's a fork of the other or whatever, but.
 
Dave Kimura:
OpenSearch came from some licensing thing, I think, with AWS. They had to shift and do something different and call it something different. I mean, correct me if I'm wrong there.
 
Charles Max Wood:
I don't know.
 
Dave Kimura:
But I think it's an AWS thing.
 
Charles Max Wood:
It looks like it's an Apache licensed, oh it's Apache licensed, I thought it was Apache Foundation. Yeah, I don't know. But yeah, I've used it. I used it in my last client. And yeah, it seems to work pretty well. But yeah, I mean, I implemented it. So I hadn't gotten to the crashing and dying and burning up my app. So yeah.
 
Dave Kimura:
Yeah, and it's one of those things where, if you do use Elasticsearch, Meilisearch, whatever kind of full-text search, do not use it as a data store. And I think that's one thing that I've seen in multiple situations where people have just been utterly burned: the index is no longer recreatable because records have been deleted on the database side, or something else has happened, or they were just using that full-text search engine as its own data store, which I think is a huge mistake. I think that you should always be able to recreate your index, which is basically like the, um, the storage of all the data that is now searchable within these
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
full-text search mechanisms. And if it's not recreatable, then you have to be tiptoeing around, very careful in how you interact with that full-text search, which is just a horrible experience. And it makes lifting and shifting that environment very complicated, because now you have to extract all that data to reload it.
 
Charles Max Wood:
Right. Yeah. I've only used it in the sense of having my primary store in Postgres and then running a re-index periodically. Right. So when I save something or, you know, on a cron job or, you know, every so often on Sidekiq or something like that.
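The setup Chuck describes (Postgres as the source of truth, Searchkick over Elasticsearch/OpenSearch as the index, background reindexing) usually looks roughly like this sketch; the Episode model, fields, and boost settings are illustrative:

```ruby
# Gemfile: gem "searchkick"
class Episode < ApplicationRecord
  # callbacks: :async pushes index updates onto a background job (e.g. Sidekiq)
  # so saving a record doesn't block on the search cluster.
  searchkick callbacks: :async, word_start: [:title]
end

# Full rebuild, e.g. from a cron/scheduled job:
# Episode.reindex

# Query with newer episodes boosted, while older relevant matches still return:
# Episode.search("rails caching",
#                fields: [:title, :description],
#                boost_by_recency: { published_at: { scale: "30d", decay: 0.5 } })
```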
 
Dave Kimura:
Yeah, and I think Searchkick is one of, like, the most pleasant experiences when
 
Charles Max Wood:
Uh huh.
 
Dave Kimura:
using a full-text search, specifically with Elasticsearch. And that's the one that I used for a long time, and it's still running on some of the apps in some regard. But it's no fault of Andrew Kane, who created Searchkick. It's more my issue with Elasticsearch in general,
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
because I have had situations where the managed service that I was using, so this was hosted with AWS and they were managing the Elasticsearch on their own, I was just consuming it. So you pay a nice, pretty penny for that, but it just died. It would not come back up. I could not connect to it. I had not made any changes.

Charles Max Wood:
Oh man.

Dave Kimura:
It just happened randomly, and I'm like, well, crap, now you cannot search for any episodes on Drifting Ruby. So I yanked it out and just did normal querying of the database for a long time, and that actually worked really well. But it's gotten to a point now where I have enough content where I'm kind of outgrowing that phase where you can just paginate through
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
a bunch of results, and I need more targeted results, or weighted results,
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
so I've reintroduced full-text search in there. And for a year or so I was using MySQL, and I did some fancy MySQL queries with weighted results using their full-text search, and that worked okay, but it wasn't ever really great and it wasn't really getting accurate results. So a few months ago I actually switched it out for Meilisearch. And I'm self-hosting my own Meilisearch instance, which is very lightweight on resources. And because I'm not populating it with too much data, it's actually very manageable to work with. One drawback is that it is a single-node instance, so you cannot cluster it. You don't have high availability,
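As a rough idea of the kind of weighted MySQL full-text query Dave alludes to (not his actual code; the table, columns, and weights are assumptions, and it requires FULLTEXT indexes on those columns):

```ruby
# ALTER TABLE episodes ADD FULLTEXT INDEX ft_title (title);
# ALTER TABLE episodes ADD FULLTEXT INDEX ft_description (description);
class Episode < ApplicationRecord
  # Weight title matches 3x heavier than description matches, order by the score.
  def self.full_text(query)
    relevance = sanitize_sql_array([
      "(3 * MATCH(title) AGAINST(:q IN NATURAL LANGUAGE MODE) +
        MATCH(description) AGAINST(:q IN NATURAL LANGUAGE MODE)) AS score",
      q: query
    ])
    select("episodes.*", relevance)
      .where("MATCH(title) AGAINST(:q) OR MATCH(description) AGAINST(:q)", q: query)
      .order(Arel.sql("score DESC"))
  end
end

# Usage: Episode.full_text("background jobs")
```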
 
Charles Max Wood:
Okay.
 
Dave Kimura:
but, um, it's also something where, you know, it's a great solution. So, uh, that's why I've been using it lately, and it's actually been much more accurate and better. And we can dive into it a bit, but I've also, basically, you know, let's throw in all those buzzwords: it is AI-powered now. So.
 
Charles Max Wood:
Ooh. So I have a couple of questions, right? Because I've been looking at OpenSearch and Searchkick for Top End Devs, right? Because I'm in the same boat as you are as far as the amount of content. I think at this point we've published more than 4,500 podcast episodes over the last, I don't know how many years. And so, yeah, I mean, I want people to be able to search it. I'd like to be able to prioritize it with the more recent stuff, right? At a higher priority. But if you're searching for something and the more relevant keyword matches come up with earlier stuff, right? I still want those to show up, because they are relevant. Um, and so, yeah, I've been trying to figure out exactly how to put that in. I mean, lately I've been working on some other angles on Top End Devs, so I haven't even put it in yet, but, um, it's definitely on my list of things to get done this summer. And so that's what I've been trying to figure out. One thing that I have a question about is, we've been getting transcripts done on all the past episodes that didn't have them before.
 
Dave Kimura:
Mm-hmm.
 
Charles Max Wood:
We did transcripts for a long time and then we quit doing them for a while, just because it was too expensive. But yeah, now we're just throwing it at, I think it's Whisper, AI-powered something or other that we're using. And yeah, so now that we have all those transcripts, you know, most of them are pretty decent. Not always. If it's, like, the name of a gem that's not a standard word, or if it's a person's name, it doesn't always get that right. But yeah, you said that your Meilisearch model wasn't too large, and so it wasn't as fragile or something like that. If I'm feeding it 4,500 transcripts... am I gonna overwhelm the thing?
 
Dave Kimura:
No, because you're still talking about 4,500 records.

Charles Max Wood:
Ha ha ha! Right.

Dave Kimura:
You know, and within there you're talking about one attribute of one record having one hour's worth of transcribed data, and I don't think that's too big.

Charles Max Wood:
Well, I'll probably pull in the show notes too, but yeah.
 
Dave Kimura:
Yeah, that's all relatively small, and there is an actual, uh, there is a limitation. So, uh, with Elasticsearch you do not have any kind of limitation per record size, but with Meilisearch there is a two-gigabyte limitation. So if your
 
Charles Max Wood:
Oh, I'm not gonna hit that.
 
Dave Kimura:
transcription was, like, a year's worth of someone reading the Bible or something, yeah, you're not gonna be able to store that. That's gonna be over two gigs, probably. But since you're talking about just a one-hour segment of transcribed audio, that's gonna still be in the kilobytes
 
Charles Max Wood:
Yeah,
 
Dave Kimura:
of data
 
Charles Max Wood:
right. Right, so I can just shove all that stuff in there, and then it's gonna make API calls into the Meilisearch server and get the data back. How fast is it? Because the OpenSearch stuff seemed to be reasonably fast, and it got me good results most of the time.
 
Dave Kimura:
I would say you're probably going to have about 50 millisecond search time.
 
Charles Max Wood:
I can live with that.
 
Dave Kimura:
50 to 100. Yeah.
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
I mean, it is really quick
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
and the more memory you throw at it, the faster it's going to be. So a few years ago, I purchased a reserved instance. That's a two-core, four-gig-RAM AWS instance, and that's where I'm hosting it. And it's not even using 400 megabytes. So, I mean, it has so much room to grow.
 
Charles Max Wood:
Oh nice.
 
Valentino:
So it's been a while since I've had to like prop up a full text server
 
Charles Max Wood:
Mwahaha
 
Valentino:
from scratch. But the last I remember doing it, the free way was with Solr, right? Apache Solr.
 
Dave Kimura:
Ooh, I forgot about Solr.
 
Charles Max Wood:
Solr and Lucene.
 
Valentino:
So, I mean, yeah, they all use the same engine as Elasticsearch, right? Lucene.
 
Charles Max Wood:
Yeah.
 
Valentino:
I mean, most of them are probably going to use that, because why not? But yeah, I mean, I'm curious why you didn't reach for something like Solr, you know, which has Sunspot as an example.
 
Dave Kimura:
Java.
 
Valentino:
Java. I mean... Yeah, I'm just curious.
 
Dave Kimura:
Java
 
Valentino:
Java, because
 
Charles Max Wood:
Ha ha
 
Valentino:
Java?
 
Charles Max Wood:
ha.
 
Dave Kimura:
No, Sunspot, it's an interesting one and I like it. But for me, if you've ever set up a Sunspot Solr instance with a Rails application, it works, and it works really good, or really well, sorry. But it's one of those things where, much like Elasticsearch, it's a matter of when something goes wrong, and then you're trying to figure out what the heck is going on. So while it is really powerful and really good, I think it just has way too many moving parts, especially if you were using it with multiple web services and you've set up your own Sunspot Solr instance. Because I think the default path, if you followed like the Ryan Bates episodes from way back when, or
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
the documentation, it's all just kind of assuming that you're running this locally on your own machine, because you're essentially running a rake task to start up the Sunspot Solr instance instead of having it hosted or through a managed host. So it's just one of those things where I've actually had to go deeper into the running Sunspot Solr instance to troubleshoot and debug and set up some things. And it was just really annoying, because now you're having to deal with Java and the Tomcat files and just the nastiness that's all behind there. Unless this has changed recently.
 
Valentino:
Looking at the repo, they now have some Docker containers and Chef scripts and stuff, even Ansible support and Kubernetes. So maybe this is easier to do with a lot of the more modern infrastructure tooling,
 
Dave Kimura:
Yeah,
 
Valentino:
right?
 
Dave Kimura:
sounds like it.
 
Valentino:
But I haven't used them, so you may still have those problems. I do remember Solr upgrades being a thing.
 
Dave Kimura:
Mm-hmm.
 
Valentino:
Which is why maybe using something like Meilisearch or Elasticsearch, not that you don't have problems, because you still have to upgrade Elasticsearch as an example, but having that higher-level support definitely changes things.
 
Charles Max Wood:
I will admit, I never really did anything with Solr, so.

Dave Kimura:
And Meilisearch isn't all, you know, rainbows and sunshine either. I actually just did an upgrade from Meilisearch 1.1 to 1.2 on my servers. And it was kind of a pain, because they do not have a direct upgrade path for your database.
 
Charles Max Wood:
Oh really?
 
Dave Kimura:
And the database is locked to the version that you're running. And, you know, it's just the same story that you find with PostgreSQL. So what I ended up having to do was do a dump, start up the new instance, import in the data. And in my case, the way I have Drifting Ruby set up is that I can take down the running Meilisearch server, and it's going to add about 500 milliseconds of latency to your request, but then it'll fall back to normal Active Record querying. So I have a redundancy plan in place for when Meilisearch goes down. But what I did was I just deleted all the data, changed the version, and then re-indexed everything, and everything just came back up. So that was my approach to upgrading. But if you're dealing with tens of thousands, millions of records, that's not feasible, especially if you were seeding in a lot of data. That's just not realistic. So you would have to go through an upgrade path. I just wasn't interested in going down that path for the few thousand records that I'm actually hosting on there.
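A simplified sketch of the fallback idea Dave describes, where the app rescues a failed Meilisearch call and falls back to a plain database query. This is an assumption of how it might look, not Dave's actual code: the error classes assume the meilisearch-ruby gem's MeiliSearch::Error hierarchy, and the ILIKE query is just a stand-in:

```ruby
class EpisodeSearch
  # Try Meilisearch first; if the search server is unreachable,
  # fall back to a slower, unweighted database query so the site keeps working.
  def self.call(query)
    Episode.search(query).to_a  # may be ms_search depending on gem version
  rescue MeiliSearch::Error, Errno::ECONNREFUSED
    Episode.where("title ILIKE :q OR description ILIKE :q", q: "%#{query}%")
           .order(published_at: :desc)
           .limit(50)
  end
end
```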
 
Charles Max Wood:
Right. That makes sense to me. That's probably what I'd wind up doing too.
 
Valentino:
What has been, like, your biggest win with using Meilisearch?
 
Dave Kimura:
More accurate search results, I would say, because the way they're doing the weighting is really interesting. So you set it up and configure it very similar to how you would with Searchkick, but the weighted results are basically just the order that they appear in your configuration. So if you want the title and description to be the highest-ranked thing, you just put those at the top of your searchable attributes, and then they're going to be the heaviest-weighted results. And so after I implemented Meilisearch, I started adding in other things. So that's when I started using some of the hosted AI things that I've been doing with OpenAI Whisper that I'm self-hosting. So I started doing all the transcriptions and closed captions, and that then gets uploaded into Meilisearch as well. So, similar to what you were doing, Chuck, my Meilisearch instance is now hosting one attribute that's very large, because it's the full transcription. So if you did a search for anything that I verbally said within a video, then that's going to show up in the search results. If there's something in
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
the title, description, or in the code show notes, those are going to have higher weight than something I verbally said. But if I was mentioning one little thing, and hopefully the transcription was accurate, because it's all done by AI, then it's going to show up in the search results.
 
Charles Max Wood:
That makes sense. One thing that I'm wondering about, because it sounds like we've kind of gone from doing the database searching with something like pg_search or whatever, to using an external system like Meilisearch or OpenSearch or Elasticsearch, to now seeding it with data that we get from an AI system that's doing some transcriptions and things. It seems like the next logical step is ChatGPT for your website. Is that even a realistic thing at this point?
 
Valentino:
I think Algolia just released something around that. Yeah. So Algolia is like a
 
Charles Max Wood:
Yeah.
 
Valentino:
an Elasticsearch competitor, right? They just have, they have this AI search and discovery platform now, which looks pretty interesting. It seems like the step you're talking about.
 
Charles Max Wood:
Yeah. Just to go back though, Dave, with your setup, so when you post a new video, does it just automatically generate the transcript? You don't have to go and submit it to anything or do anything.
 
Dave Kimura:
Correct. Yep.
 
Charles Max Wood:
Oh, I would love to figure out how to do that.
 
Dave Kimura:
Yeah, it does it in my basement. So, because, I mean, hosted services are expensive when you talk about
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
machine learning. It is not cheap, especially depending on what kind of model you're wanting to run. If you're talking any kind of large language model, if you're able to do it on consumer GPUs, that is going to be exponentially cheaper than having a cloud instance running with that amount of GPU VRAM needed. So.
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
I am using OpenAI Whisper, and I'm using their large model, which takes about six to eight gigabytes of VRAM to host on an RTX 4070 GPU, and that's where I have the model running. I have not fine-tuned it. I've been playing around with fine-tuning it, because it is just my voice I'm transcribing and doing the closed captions for. So that's going to be, like, the best-case scenario for fine-tuning a model, but I'm just using their default large model. And I'm taking that in, doing the transcriptions. But to interface with it from my Rails application, I am using API calls. So I have a Flask application wrapping that model. So my Ruby on Rails application just makes an API request to the Flask app that then does all of the transcribing, and then gives the response back. I store that in an Action Text storage or blob, and then that gets automatically updated and uploaded to Meilisearch.
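On the Rails side, the call into that Flask wrapper could be as small as the sketch below. The endpoint URL, parameter names, and response shape are made up for illustration, since Dave doesn't describe the exact API:

```ruby
require "net/http"
require "json"

class TranscriptionClient
  # Hypothetical endpoint for the Flask app wrapping the Whisper model.
  ENDPOINT = URI("http://gpu-box.local:5000/transcribe")

  # Sends an audio URL to the transcription service and returns the text.
  def self.transcribe(audio_url)
    response = Net::HTTP.post(ENDPOINT, { audio_url: audio_url }.to_json,
                              "Content-Type" => "application/json")
    JSON.parse(response.body).fetch("text")
  end
end

# e.g. from a background job after an episode is published:
# transcript = TranscriptionClient.transcribe(episode.audio_url)
# episode.update!(transcription: transcript)  # then gets re-indexed into Meilisearch
```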
 
Charles Max Wood:
That's cool. So just to kind of open up some of this a little bit. So your Rails app's hosted in the cloud, so it just hits your home IP address and it gets tunneled through to your server in your basement and then, why did you put a Flask app around it instead of like a Sinatra app or something?
 
Dave Kimura:
Well, just for interfacing with it, because it's all Python code.
 
Charles Max Wood:
Oh, it's all Python to begin with.

Dave Kimura:
So, yeah, yeah. So having the Flask app, it's just easier to set it up that way. And the code for the Flask app is like 20 lines of code. It's very manageable and stuff, so it wasn't that big of a deal.
 
Charles Max Wood:
Right, because essentially what you're doing is you're just capturing whatever it sends in, handing it off
 
Dave Kimura:
Mm-hmm.
 
Charles Max Wood:
to Whisper, and then responding with whatever Whisper gives you, right? Makes sense to me.
 
Dave Kimura:
And so I was joking around, and, you know, I do feel this way: I think microservices are an anti-pattern for architecting an application. But in the sense of machine learning, when you have something like a Whisper model and that kind of stuff, turning that into a microservice so you can interact with it with any kind of API, that's, I think, the real use case for microservices, where you have a bunch of these small microservices, or in this case, large language model microservices, and you're able to interact with them from multiple applications. So I actually have multiple
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
apps using these large language models that I'm hosting. So not only is that cheaper, because I don't have to spend the same amount of VRAM to do that, but then I don't have to, you know, change anything up. They all use the same API. They're all using the same interface, so it's very easy to work with and manage.
 
Charles Max Wood:
Makes sense. I still, I have so many questions. I'm just gonna keep going
 
Dave Kimura:
Ha ha.
 
Charles Max Wood:
unless Valentino says, I have one, but I guess the other question I have, because I definitely wanna set something like this up, and I know we're kind of deviating from the full text search a bit, but so if I wanted to set this up here and I have a computer that could run it, I'd probably have to put a little bit nicer video card in it.
 
Dave Kimura:
Yeah.
 
Charles Max Wood:
Does it just... Is it smart enough to say, Oh, you've got a video card that has enough VRAM to do this stuff? Or. Yeah.
 
Dave Kimura:
So I'll let you in on some of my dirty secrets that you can do to make this so much easier. One, you need a way to play around with this. And let's say, you know, first you need a computer. So, you know, check, you got your computer. You need a graphics card, check.

Charles Max Wood:
Yeah, I have one of those. I have a couple of those.

Dave Kimura:
But now let's say an Nvidia graphics card. You can use AMD, but you would just be going against the grain. So, an Nvidia graphics card. And let's say this computer, what are you using it for? Is it now dedicated to AI? Or do you want to play Diablo 4 or some other game on there? Is it going to be multipurpose? So here it comes, as Valentino mentioned earlier: Docker. Dockerize this stuff. You have a little bit of overhead, but it is so much easier to work with as far as your large language models and stuff. So as I'm developing these, here's what I'm doing. I have my Windows machine, and my Windows machine has a stupid, ridiculous graphics card in it. It's a 13900K CPU with 64 gigs of RAM and an RTX 4090. I mean, it is a balling machine. It's awesome. But it's also my gaming machine. So I bought Diablo 4 and I've been playing that, but I also want to continue doing my machine learning without having to boot into Ubuntu. So what I've done is I have Docker installed on there that I'll spin up when I'm actually wanting to do some machine learning stuff, and I run Portainer on there to just easily remotely manage the images, because I don't want to be on my Windows machine doing the development. I want to be on my Mac doing the development. So I have a Jupyter instance set up on there. And Jupyter is a notebook system. So it's a markdown-compatible notebook that is compatible with Python, in the sense that if you have some Python script, you can just hit Shift-Enter, and then it'll execute that script in its own little isolated environment. And that includes the ability to run machine learning models. So you can do all your testing out in your browser of choice on your computer of choice, while it's all running in Docker on this other machine. And so that's how I kind of just played around with this stuff. And once I got it to a point where, okay, yes, this is working the way I want, I then just copy that notebook, or the relevant bits of code from that notebook, into its own project. I then create a Git repository for it, push it up, and then I pull it down on my other machine that I have behind me, which, if you're watching the video, is the little matrix thing going down, but that runs an RTX 4080. So it's a bit lighter GPU. Still crazy stupid expensive, but that GPU is then hosting the actual model, running the Flask app. And then it also has the ports open to be able to communicate with it over an API. So that's where I'm actually hosting it. And if I need to do any kind of tweaks or anything on there, then I can just launch VS Code with Remote SSH and shell right into that machine. It sits headless on my shelf, so it doesn't have a keyboard, monitor, or mouse plugged into it, but it doesn't really need it, because it's just my little headless server that I play around with.
 
Charles Max Wood:
Cool. I might have to pick your brain a little bit more about it. But yeah. That's cool stuff.
 
Valentino:
Yeah, we did an episode where Dave broke down a lot of this. I forget which one that was now. We'll have to find it.
 
Dave Kimura:
If Chuck ever implemented Meilisearch and transcriptions on DevChat, then we would be able to find it.
 
Charles Max Wood:
Well, give me a couple of
 
Dave Kimura:
Hahaha!
 
Charles Max Wood:
weeks.
 
Valentino:
I'm curious how you're handling the ingestion process of Meilisearch. I'm seeing in the documentation here they have this internal queuing system for ingesting documents with tasks, which seems like Sidekiq,
 
Dave Kimura:
Mm-hmm.
 
Valentino:
but for Meilisearch specifically. What's been your experience using that?
 
Dave Kimura:
So in the models where I'm using Meilisearch, I'm using that meilisearch helper and I pass in enqueue: true. You know, that's probably the first and foremost thing, because if there is ever a problem with your running instance, you don't want that happening inline in the request that is updating a record, because then you're going to get yourself into a situation where the whole website appears to be down. So I do push it into a background job. And then I have a couple of different helper methods within there. I have attribute, which is saying what I want to be indexed, and that's where I have the list of all the different attributes, including the one that has the transcription body. And then I have the searchable attributes, which is basically like the weighted results: what are the attributes I want to have searched on, and which ones are the most important. And that's where I have, like, the episode name as the highest, then the summary, then the content or the code body of the show notes, and then the transcription body last. So those are the most important ones. And then you can get into some of the ancillary things like filterable attributes or sortable attributes, depending on what you want there. But, I mean, when I first saw it, it really gave me Searchkick vibes. It's almost like the Meilisearch team saw what Andrew Kane was doing and loved it, and then just re-implemented it with their own version of a much more simple Elasticsearch. You're muted, Valentino.
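To make that concrete, here is roughly what a meilisearch-rails model block along the lines Dave describes can look like. The attribute names mirror what he mentions (name, summary, show-notes body, transcription) but are otherwise illustrative, and the exact options may vary by gem version:

```ruby
class Episode < ApplicationRecord
  include MeiliSearch::Rails

  # enqueue: true keeps indexing out of the request cycle (background job).
  meilisearch enqueue: true do
    # Which attributes get sent to the index, including the large transcription body.
    attribute :name, :summary, :body, :transcription

    # Order here is the weighting: earlier attributes rank higher in results.
    searchable_attributes [:name, :summary, :body, :transcription]

    # Optional extras Dave mentions: filtering and sorting.
    filterable_attributes [:published]
    sortable_attributes [:published_at]
  end
end
```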
 
Valentino:
I was saying that, for those not familiar with Elasticsearch, like, the process is, you define all of your mappings up front, right? You say, okay, these are all the fields that I want to have for this particular document, which is in the index. And only then can you insert data into those documents in the specific fields that you want to index, right? And you can do partial indexing on those fields. I really like how Meilisearch has it, so it seems like you define those as you insert them, right?
 
Dave Kimura:
Yeah, and let's say, for example, with DevChat TV, Chuck wants to go ahead and implement Meilisearch because that's the easiest bang for the buck to get going with.

Charles Max Wood:
It's Top End Devs now, but yeah.
 
Dave Kimura:
Oh, yeah, Top End Devs, DevChat, sorry. So you implement it, and now you're like, you know what, I want to make this better. I want to add in transcriptions here as well. So now you have basically a schema change within your Meilisearch data, which, if you're doing Elasticsearch, it's not going to like that too much. But with Meilisearch, it's really nice, because each record will basically have its own schema. That's not really accurate, but let's go with it. So as you are updating a certain model within Meilisearch and you do a full reindex, it's going to start reindexing new records. Well, if you've added in the transcription attribute as something that you want to be indexed, then some records will have it, others will not. Elasticsearch just freaks out with that kind of thing. But with Meilisearch, it'll just start adding it in. So you have records with different schemas, essentially, or different attributes. So that's why I say it's not really accurate, but it kind of is. So doing changes like that, especially on an application that you're maintaining and evolving over time, is really important. And that's one thing that really drew me to Meilisearch.
 
Charles Max Wood:
So can you re-index old records with the new attributes?
 
Dave Kimura:
Yes, absolutely. And there is a model.reindex! that will do all of that for you, just like Searchkick does.

Charles Max Wood:
Right, Searchkick does, yeah.

Dave Kimura:
So it does that in batches of 1,000, I think. But you can do it in much smaller batches if you want. Which, Chuck, if you are going to be doing the transcriptions on Top End Devs, then you are going to have to do it in batches of something like 10, because you're talking about doing the full transcription of episodes, and there are post limits on what you can send to Meilisearch, which I don't think is a Meilisearch issue, but more just an HTTP issue.
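For reference, the batched reindex Dave describes looks roughly like this with meilisearch-rails (the method may be ms_reindex! depending on gem version, and the batch size of 10 is just his suggestion for transcript-sized records):

```ruby
# One-off rebuild, e.g. from a Rails console or a rake task.
# The first argument is the batch size; small batches keep each HTTP payload
# under the server's request size limits when records carry full transcripts.
Episode.reindex!(10)

# The Searchkick equivalent full rebuild would be:
# Episode.reindex
```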
 
Charles Max Wood:
Right. Which is fine. Yeah, I'm really starting to get into the AI end of things as far as what the capabilities are, but it's interesting to just see, oh, okay, you know, this search problem can mostly be solved with something like Meilisearch. Now, um, Searchkick is kind of an interface library between Ruby or Rails and Elasticsearch or OpenSearch. So does Meilisearch work with Searchkick, or do they have their own gem, or how does that work?
 
Dave Kimura:
They have their own gem. It's meilisearch-rails.
 
Charles Max Wood:
Uh huh.
 
Dave Kimura:
And then that adds in the Meilisearch library, which is kind of like the Elasticsearch library that Searchkick adds in, to interface with the actual running instance of Elasticsearch or Meilisearch, whatever.
 
Charles Max Wood:
Cool. Well, folks, if you see search show up on top end devs anytime soon, it's Dave's fault. I did the work, but it's still his fault.
 
Dave Kimura:
You're welcome.
 
Charles Max Wood:
It's something I've wanted to add in for a while, but I've mostly been just focused on the UI and making it easy for people to play with and use the thing. But, um, one thing that I'm finding is that, yeah, for my own stuff, right? If I want to link out to something, or I know we talked about this in an episode, just being able to find it and reference it so other people can use it is where I want it the most. So.
 
Dave Kimura:
Oh, and this may be the deal sealer for you, Chuck. So if you ever do a Google search and you're searching for a phrase or something,
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
it's... Google itself is not very good at searching for a phrase, because
 
Charles Max Wood:
No.
 
Dave Kimura:
it'll try to group the words together, but it'll just find does this page contain those words?
 
Charles Max Wood:
Right.
 
Dave Kimura:
But you can actually put double quotes around that phrase, and then it does a search for that exact phrase. Meilisearch has that built in with the double-
 
Charles Max Wood:
How nice.
 
Dave Kimura:
quote search. So you can do that double-quote search for the must-include phrase, and then you can also add in or chain any other word, and those are also then going to be a part of your search. No extra work for that.
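The double-quote syntax Dave describes is part of the Meilisearch query language, so it passes straight through the Rails search method; a small illustrative example (model and query are assumptions, as in the earlier sketch):

```ruby
# Exact-phrase matching: the quoted part must appear verbatim,
# while the unquoted word is matched loosely alongside it.
Episode.search('"active storage" caching')

# Compare with a plain query, where the words can match independently:
Episode.search('active storage caching')
```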
 
Charles Max Wood:
Double quote, "fracking Cylon," double quote. And there, I just... it'll get transcribed and it's going to mess up my results. I'll get this and wherever that other quote is. Very cool. Anything else that we should dive into with this?
 
Valentino:
I'm curious if you've hit any of their known limitations yet, Dave.
 
Dave Kimura:
On Meilisearch? No. No, because, I mean, there is a really good article, and I'll post it in the chat here to include in the show notes. But it's basically a table from a company called Typesense, which is another type of search engine, but I
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
haven't used them, so I didn't want to say much about them. The last I checked, they did not have a Ruby plugin, so you can't really easily use it with Ruby. But what they do have on there is a really good chart comparing Typesense with Algolia, Elasticsearch, and Meilisearch, and what some of the limitations are. And so if you look at some of the limitations around them, I mean, there's not too much that Meilisearch can't do that Elasticsearch does. There are differences. So you can't do negative keyword searches with Meilisearch. So if you want to exclude something explicitly from the search, you can't do that. But, I mean, overall it's pretty darn comparable for a lot of the things that I care about. So there is a maximum index size: an index can be up to 500 gigabytes on Meilisearch, which is insanely huge. I mean, you're talking like super-large companies, and at that point you've outgrown Meilisearch, probably. But the maximum number of indexes, which is basically the number of models or things that you're indexing, so not the records, but the actual indexes, so, for example, like episodes or users or companies
 
Charles Max Wood:
Uh-huh.
 
Dave Kimura:
that you're wanting to search, is 200 on Linux or Mac and 20 for Windows. So, I mean, 200 is still a pretty large amount that you're able to do. I'm only using a couple of indexes. And then the maximum record size is two gigabytes. So each record within one index can be a maximum of two gigs, which is insanely huge. If you're hitting those limitations, I think you have bigger problems with the record size.
 
Valentino:
Yeah, I mean, I bring it up. The biggest thing that jumps out to me is their concurrent request limitation of 1,024 requests at the same time. Which, I guess if you have that problem, maybe you should be looking at something else anyway.
 
Dave Kimura:
Yeah, it's kind of crazy.

Charles Max Wood:
Still cool.

Dave Kimura:
But I mean, even then, yeah, 1,024 requests per second is huge. I mean, that is, yeah.

Valentino:
I mean, that's just at the same time. They could probably process more than that.

Dave Kimura:
Yeah, I mean, that is a huge chunk, and honestly, if I had that problem, number one, I would be very happy. But
 
Charles Max Wood:
Hehehehehehehe
 
Dave Kimura:
realistically, if I wanted to still use Meilisearch for this application and that was my only limitation I was hitting, then I would just set up a load balancer which balances the load between multiple of them, and whenever a record changed, I would just explicitly have it upload to the different servers. So it would just run that one request four times, once on each of those Meilisearch servers. So, I mean, that seems like the...

Valentino:
Yeah, I did notice that they tie it directly to your memory footprint. So, yeah, it makes sense to just distribute it that way. Just balance it. This is pretty cool.

Dave Kimura:
But yeah.

Charles Max Wood:
Yeah. Yeah, really cool.
 
Dave Kimura:
I just like how simple the hosting of it is. There's not many moving parts. It's a self-contained binary. And then it has its own data store. So I mean, that's really it. You don't have to have this whole slew of junk like you do with Elasticsearch.
 
Valentino:
Yeah, I mean, the queuing system too, like, built into it, is definitely very appealing, right? Like where you don't have to manage that yourself as far as indexing, you know, you just throw it on the queue and let it deal with it itself. And you even get, like, telemetry data on it, right? So you can see how fast it's taking to process. It's really cool.
 
Charles Max Wood:
One other question I have on this is, so can you run Meilisearch on a server and then have multiple apps create their own separate indexes or whatever on it?
 
Dave Kimura:
I don't know. I'm just using the one, but.
 
Charles Max Wood:
Yeah, I've got a couple of websites I'd like to set up. And so it'd be nice if I could run one instance and then just have everything talk to the same setup.
 
Dave Kimura:
My concern would be cross talk
 
Charles Max Wood:
Right.
 
Dave Kimura:
and
 
Charles Max Wood:
I agree.
 
Dave Kimura:
that's what you would want to avoid. I mean, I would have a similar issue with Elasticsearch. But there is a sense of API keys. So I'm wondering, if you use a different API key, if that would then...
 
Charles Max Wood:
if it silos it.
 
Dave Kimura:
Yeah, but it would still probably count toward your maximum number of indexes on
 
Charles Max Wood:
Okay.
 
Dave Kimura:
that server.
 
Charles Max Wood:
Oh, so you might have to set up more than one server anyway.
 
Dave Kimura:
Yeah, if one application has a hundred indexes that you're doing, and then you have three others that are 50 each, then you're probably going to exceed that capacity.
 
Charles Max Wood:
Makes
 
Dave Kimura:
But
 
Charles Max Wood:
sense.
 
Dave Kimura:
I mean, in reality, not everything in your database needs to be indexed. So if you're just talking like episodes, then you have your episodes
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
or show notes or something like that. And that's one index. You don't have to have an index for each individual one or each table.
 
Valentino:
Do you use their paid version or are you just on the open source?
 
Dave Kimura:
I'm on open source, so I'm self-hosting it. I don't know if I can self-host a paid version or what the differences would be. But oh, um, originally I was on their, okay. So their paid version is their hosted environment. So I originally went to that. That was my first approach and, uh, you get, uh, 10,000 searches included per month, um,
 
Charles Max Wood:
Mm-hmm.
 
Dave Kimura:
for free. So the problem was I hit that in one day. So, uh,
 
Charles Max Wood:
Oh really?
 
Dave Kimura:
yeah. I then started doing the math. I'm like, holy crap, this is going to get expensive. So
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
I switched off after less than a week of being on their hosted environment into my own environment. So I'm like, I can't do this because that's going to be very expensive.
 
Charles Max Wood:
Yeah, I ran into a similar problem with Algolia. I hit their free threshold pretty fast.
 
Dave Kimura:
Which I don't mind paying, but
 
Charles Max Wood:
Yeah.
 
Dave Kimura:
when a service becomes more expensive than the application hosting itself, then I got a problem.
 
Charles Max Wood:
Right, yeah. Yep, and that was kind of the limit that I hit too. Yeah, I was paying more for the search, or I would have ended up paying more for the search than I would have paid to just keep the app running. And it wasn't by any means the most valuable part of the app. Good deal. Well, should we scoot along to picks? Alright, Valentino, what are your picks?
 
Valentino:
Ah, let's see. Well, we just released at Doximity a project that I've been working on for the past month, kind of impromptu, called Best Doc AI, D-O-C. And it's a new AI product that lets you find the best doctor or specialist for a diagnosis that you get. And so you enter a diagnosis that you've gotten from your primary care physician, and it'll go and figure out what specialist exactly you should see for that, and then match you up with the physicians. And since we have over 80% of US physicians on our network, we're able to identify those and service them. And it's so, so fricking cool. I recommend you check that out. Best Doc AI.
 
Charles Max Wood:
Cool. How about you Dave, what are your picks?
 
Dave Kimura:
I think lately my biggest things that I've been playing around with are WLED, which is an open source software project that allows you to control individually addressable LED lights. And so I've been playing around with that. I wrote a Stream Deck plug-in so I can control them from my Stream Deck, which is just like a macro keyboard. And then I also wrote a VS Code plug-in that will do syntax highlight checking. So as I type a syntax error, then I have a little lamp sitting next to my desk that'll light up red. When I fix that syntax error, then it lights up green. So just a fun thing. And then Jupyter is my second pick. I have that self-hosted and running, and it's that Python notebook that allows you to just execute stuff. It is so much fun to just tinker around with. I've even gone to the kids' Taekwondo lessons with my iPad and we're just playing around with it and stuff. So it's a lot of fun.
 
Charles Max Wood:
Cool. I've got a few picks. So last week I was at JS Nation and then React Summit in Amsterdam. So fun, just good stuff. Great connecting with people. Did some interviews for JavaScript Jabber. And anyway, so I'm gonna pick those. Board games. I'm just gonna go to one that I played last night with my buddies. I think I picked it on here before, but I'm not sure. And it is, here, let me find it... uh, gonna take me a minute.
 
Dave Kimura:
While you're looking for that, Chuck, I have to say, you got me on to Dice Forge and my kids love that game.
 
Charles Max Wood:
It is a fun one, definitely a fun one.

Dave Kimura:
We got the expansion, the Rebellion expansion, we've been playing that lately, they love it.
 
Charles Max Wood:
I don't think we have that one. We have a different expansion. The game that I'm going to pick is Star Wars: The Clone Wars. If you've played Pandemic, it's a Pandemic-style board game. And it's pretty fun. It's got a little bit different mechanic, because you actually have to fight the villain at the end, you're not just completing missions. It's a relatively simple game. It has a board game weight of 2.09. And so anyway, it's a lot of fun. So we were playing that last night. It was nice. As far as Dice Forge goes, I'm trying to remember. Maybe it is the Rebellion one that we have. Yeah, I think it is. I think that's the expansion that we have as well. So it's definitely fun, definitely a fun game. And then a few other things I'm going to pick. So one of the things that was kind of fun about the trip to Amsterdam was the folks from the conference put me up in a hotel that was actually a boat. And so whenever I went back to my hotel, I'd take the ferry or the boat, because they chartered a boat from where we were staying up to the venue, because there really wasn't a convenient way to get there otherwise, but then you'd have to walk down the pier to this boat. And so I'll put a link to the boat. If you ever wind up staying in Amsterdam, it was really fun. The rooms were rather small, but, you know, you kind of expect that on a boat. The boat was called Captain Anna, and captain was spelled K-A-P-T-E-I-N, Anna. And anyway, it was awesome. So let's see, let me see if I can find a proper link for it. But yeah, I really enjoyed that. While I was there, I also went to the Rembrandt House, and basically it was the house that Rembrandt lived in, and they restored it, so it's not... none of it is original. And that was pretty cool. And then I also went to the Dutch, I can't remember, Verzetsmuseum, which is about the Dutch resistance during World War II, about all the people who resisted the Nazi occupation. That was really, really cool, really fancy... not fancy, but really interesting, just to see all the people and the different ways that they stood up to the Nazi regime and things like that, and some of the history as far as how and what they did to resist when the Nazis first came in, and then, as time went on, right, how they continued to fight them. I mean, obviously the Allies had to come in and liberate them. But anyway, it was really, really fascinating just to kind of see all the history there. So I'm going to pick both of those. And then, yeah, today we did the book club, because I was out of town on Tuesday. We usually do the book club on Tuesday morning, and we're doing Seven Languages in Seven Weeks. So we were talking about Prolog, and I actually found a Ruby Prolog library. And I thought that was pretty interesting as far as, you know, kind of having an engine where you do the more declarative programming instead of the procedural programming that we're used to in Ruby. And so I'm gonna pick that as well. So lots of picks, but anyway, those are my picks. And yeah, I guess we'll wrap it up here. Until next time, folks, Max out.
 
Dave Kimura:
Talk to you later.