Navigating SQL Data Changes: Tools and Techniques for Data Recovery - RUBY 645

In today's episode, Valentino is joined by Evgeny Li, a seasoned software developer and founder of Bemi, to delve into the complex world of data recovery and auditing within SQL databases. They explore advanced solutions like paper trail and refi for querying historical changes, point-in-time recovery features offered by cloud-managed databases, and the use of AI to humanize data structures and enhance monitoring.

Hosted by:

Valentino Stoll

Special Guests:

Evgeny Li

RSS Spotify Apple Podcasts YouTube Amazon Music

Show Notes

They highlight the critical role of Change Data Capture (CDC) in tracking and troubleshooting data changes, with real-world applications across logistics, healthcare, and more. They also discuss the pros and cons of different CDC implementations—query-based, trigger-based, and log-based approaches.

Valentino and Evgeny shed light on the challenges of managing large data volumes, emphasizing proactive design and the importance of strong data management systems. They touch upon essential tools like GitHub Copilot, OpenTelemetry, and various auditing gems, offering valuable insights into maintaining data integrity and performance.

Tune in as we navigate the intricacies of data management, the impact of AI, and best practices to ensure reliable and efficient database operations.

Socials

Transcript

Valentino Stoll [00:00:04]:
Hey, everybody. Welcome to another episode of the Ruby Rogues podcast. I am your host today, Valentino Sol, and I'm joined by a very special guest today.

Evgeny Li [00:00:15]:
Hey, I'm Yevgeny. Yep. Yep. I'm so sorry. Yeah. No worries. That's Yevgeny. Yeah.

Valentino Stoll [00:00:21]:
Even yep. Welcome. I appreciate you taking the time to come on today. I'm gonna just introduce yourself and, tell everybody why, you've gotten gotten so famous you're on the show today.

Evgeny Li [00:00:35]:
Not sure about famous, but I'm very happy to be here. And, yes, I'm Evgeny, a long time Ruby Rogues podcast listener. Thank you for inviting me. I'm a software developer, worked for many years, using Ruby as my primary programming language, and created some open source libraries, and gems, like batch loader, for example. Recently, I started my own company called Demi, and this company helps other companies track data changes in their databases. And I recently wrote a blog post about different audit trail approaches, that can be used in Ruby, and, yeah, today I'll be happy to discuss it and share my knowledge.

Valentino Stoll [00:01:24]:
Awesome. Thank you so much. And, yeah, I appreciate Dash Loader. I definitely use that every day. So very cool. Yeah. So, I mean, audit audit logging, such a popular topic. To be honest, like, one of those, like, kind of things that you, over time, will just end up using.

Valentino Stoll [00:01:45]:
Right? It's something that you wanna just also drop in and kind of hope that it works. Right? Like, and just, like, don't have to think about it again, I think for the most part. What kind of got you started down this path of, like, you know, really, really diving into this?

Evgeny Li [00:02:07]:
Yeah. It's funny that, yeah, you mentioned that everyone is familiar with this problem or maybe at some point heard about gems like paper trail and things like that. But I noticed that it's mostly in the Ruby community. When you step aside and, for example, go into, like, JavaScript world or Golang world, like, people have not heard about tools like that. They would use completely different approaches. And because Ruby has Rails and active record as a, like, main ORM library, there are some lots of unified tools. And how I first encountered like, first of all, I used all these gems maybe for many years. For example, paper trail is the one of the most famous ones.

Evgeny Li [00:02:52]:
What it does, it basically, at a high level, just connects, to, like, your model. In your model, you you have some callbacks, and these callbacks record changes in a separate table. And recently, I worked at a company called AngelList, AngelList Venture. And AngelList used Ruby, and one of the main problems that I tried to solve there was related to tracking. And there are many different reasons why people would want to track their data changes. At this company, specifically, it was related to compliance and regulation. So the company operated in the financial space. It was important.

Evgeny Li [00:03:29]:
There are some certain compliance requirements. For example, SOC 1, SOCs, and many different weird acronyms, similar to SOC 2, which is mostly about security, but those are related to, like, financial reporting and things like that. And it was super critical to be able to record what happens, with data in the company as well as be able to audit it and trace back, like, why this change was made. But I'm noticing that, yeah, full disclaimer. I'm working at a company called Demi that we try to build a productized solution, something like that. And there are many different other reasons, Like, we are noticing why people care about audit trail in general. The most maybe straightforward one is rollbacks. Like, maybe you, like, made some changes by accident or maybe by design, and you want to allow people to revert the changes they made recently.

Evgeny Li [00:04:27]:
That's one of the use cases. Troubleshooting is another one. So there are lots of companies that want to understand what happened. Maybe some companies implement on calls for their engineers, and they do something manually, go in and do something for their customers, and that needs to be tracked as well. Regulation, as I mentioned, in certain, like, financial sectors, banking, and things like that, it's kind of, like, required. Some companies build, activity feed as a security feature or table stakes features. For example, if you use, like, Notion, maybe enterprise plan would have, like, activity feed for admins to see what happens within the organization. And people can also use this data for fraud detection, analytics, and things like that.

Valentino Stoll [00:05:19]:
Yeah. You you make a couple great kind of crucial points here. Primarily, like, there's activity. Right? Like, that people like the trap, which I feel like often is, like, separated. And it's funny that you mentioned that this is, like, mostly, like, Rails and Ruby, specific. And I think may maybe the other realms, like, go toward, like, the log capture approach, where just, like, logs are getting blasted and you try and, like, resolve all of the the time stamps as as they come, which I think is crazy. Yeah.

Evgeny Li [00:05:54]:
There there are many different, like, views on how to use, for example, data and interact with the database in the Ruby community, like, conventional reconfiguration. We have active record and many other ORMs, data mappers. In for example, if you take a look at in the Go community, it's comes from simplicity of the language. That is why many people try to stick with, like, low level adapters and write SQL queries themselves. And that is why there it's even, like, hard to build a unified solution like that. In the GS community, for example, there are lots of many ORMs that, I don't know, work on edge and, like, work in serverless environments and things like that. So there are lots of lots of crazy technologies, which is also hard to build a unified solution. But, yeah, communities like Ruby, Python, I'm noticing that, yeah, they're more mature, and they have already lots of great tools around that.

Valentino Stoll [00:06:53]:
Yeah. That's that's really interesting. I mean, definitely, we could take some learnings there. But I would like to just think about, like, you know, why people even like, you mentioned a lot of, like, you know, mostly, it seems like reporting and, like, trying to, you know, align with, you know, requirements regulatory requirements or or things like that. Even internal, you know, regulation where you want to be able to, confirm that things are happening, why they changed. Right? And I I feel, circling back to this, like, notion of, like, activity is happening on your app, and you wanna, like, be able to isolate how the activity affected your data. Is that kind of like the approach that you're taking with Demi of trying to visualize how those two things match up? Or, you know, where where do you focus on?

Evgeny Li [00:07:52]:
Yeah. We take a slightly different approach. So there are many different approaches. So for example, I wrote this blog post article that describes some of the approaches. At a high level, there are I call them callback approaches. So paper trail audit audited mongoid history and gems like that. What they do, they basically add callbacks. And when you use ORM, they would write history of these changes in a separate table.

Evgeny Li [00:08:18]:
There are some trigger based approaches. So in a database, you can write code, which is dangerous, could be dangerous, especially if you're trying to implement complex business logic. But for simple triggers, it works. So we can try to implement triggers when, for example, a record is being updated, inserted, or deleted. Please do something. And there is one gem called, It's very hard to pronounce when you see it how it's spelled. But this gem, yeah, implements this approach, with triggers. And I I think that for majority of the companies, especially small ones, these approaches are great.

Evgeny Li [00:08:55]:
There is, like, low overhead. You just add them. Maybe add them even, like, just in case. Maybe you would do something dangerous and delete something by accident or your logic was not tested well and it performed something crazy, and you would want to roll back. That's kind of like good insurance, I would say. At Demi, we take slightly different approach. I can go in more into details, but at a high level, we use replication lock locks, as a solution. There is a pattern called change data capture.

Evgeny Li [00:09:29]:
Very weird, like, wording, but it's a data pattern, used by data engineers very often. What it does, it basically yeah. Capturing data changes from the database and moving these data somewhere. The most popular use case is probably ETL or solutions like that when the you would use some kind of, like, solution that connects to your database, for example, MySQL, Postgres, or whatever, and it tracks all changes and moves these, changes to warehouse or something like that. So we use this similar approach and change data capture. CDC, in short, can be implemented in many different ways. There is, like, query based when you basically query, by timestamp, for example. Give me all records that were updated from the last, like, hour when we checked.

Evgeny Li [00:10:20]:
It could be implemented by using triggers again. And with Demi, we use lock based solution. So Postgres, for example, has Vowel right ahead log. MySQL has bin log. MongoDB, for example, it has Uplog. Like, SQLite has Vowel as well, right, the headlock. So all these logs, they, again, briefly describe, like, why they exist in the world. So, these right ahead logs, they're designed to be able to quickly, record changes in the database in append only log before writing all the changes to the disk with data indexes and things like that that are heavier operations.

Evgeny Li [00:11:06]:
And these locks, they are crucial for data recovery. For example, if your instance crashes, it would try to recover from the last state, replace certain, let's say, changes from the wall and get to the latest state. These replication logs can also be used for, like, cluster replication. For example, if you have, like, primary and secondary, secondary would connect to the write ahead log or replication log in the primary and try to read it and, get and sync the state. We use basically the same solution at Demi. We are sort of connecting like a read replica, but we also decode it. It's called logical decoding. We decode all all these changes so it's you can understand, like, it's not on a binary, like, file blob format.

Evgeny Li [00:11:54]:
It comes from, like, decoded on a a role level, and you receive all the changes, and we simply stream and store them as a kind of like, yeah, in a separate table, and people can use it and read it. And people, sometimes yeah. As I described, there are many different approaches, like or use cases why you would use it. But at a high level, what we do, we simply write a log of all changes that people want to track. They can specify, like, what table to subscribe to, and there are certain technologies in on the Postgres level that were recently released that may help with performance as well. Plus, we have some nice UIs that people can use to see, like, what happened to this record and things like that.

Valentino Stoll [00:12:42]:
This is really interesting. I love the article that you have that kind of breaks down all of this on I've never heard of change data capture before. It's it's kind of wild. There's a lot of lot of pieces to the, process here, which I think just comes honestly with, a lot of the event driven kind of, architectures. So I just wanna take a step back. Like, let let's let's walk through kind of, like, you know, why would somebody, you know, aside from like, let's say, like, what what are the primary use cases, I guess, for having your for for having a a need for that level of, like, you know, orchestration and, like, granular rollbacks. Right? Like, what you're talking about is having, like, I guess, a massive amount of data that you to granularize what you're auditing, and be able to recover from those in case anything should happen. Is that your primary focus, like, focusing on that recovery when you're trying to do the audit trail? Or, like, where do you see a lot of the, you know, the most common use cases of having the audit trail and where where, like, just having paper trail, like, starts to fail?

Evgeny Li [00:13:55]:
Yeah. Great question. We are very, like we are noticing that it's a very horizontal solution and people use it for different purposes. That is why, like, it's great when you give a solution for developers and they build something on top of it. We have maybe 3 main use cases. One use case is just for troubleshooting purposes. We have, for example, there there is one logistics company, that has 30 or 40 customer support reps. And they're in the logistics business, so there's lots of real time data from analog world, lots of unstructured data, and they leverage AI and things like that to help with it.

Evgeny Li [00:14:39]:
But they need to see what's happening, maybe help customers, their customers, and, like, cargo companies and things like that. Help them understand what happened to the shipment, for example. And they have lots of different inputs. It could be, like, unstructured, just an email. It could be a doc signed or something that changed the status of the shipment, and they would need to troubleshoot and understand what's happening. So, yeah, I previously like, as developers, one solution is just going to a rails console or something like that. And if you use any of these solutions, you would just try to see, like, query a record and get all recent changes that happened to it when you need to understand, like, why is it in this weird state, why it's not in, like, different state, what happened there, like, what API endpoint, for example, changed it and things like that, some additional metadata. The second use case is related to just, yeah, rollbacks.

Evgeny Li [00:15:44]:
For example, we have, there is, like, one company related to carbon removal tech, for example, and they use Postgres, which is like an extension for geolocation data on top of Postgres. So they do lots of changes, in batches. And in certain cases, when there is an error or something, they want to roll back to the previous state, and it's possible to just track certain selected tables, and work with any, like, extensions and things like that. And the third one is for many, it's like table stakes, whether it's because of, yeah, regulation or, for example, there are in health care, it's very popular. There are, like, clinical trial management systems, for example. They need to track what happened to patient's data. And many of these solutions, like, one fun story, like, at one company, we use paper trail, and we had so many locks accumulated in the versions table that we ran out of integers. And we had to like, the solution was simple.

Evgeny Li [00:16:50]:
We basically, okay, said start, recording all changes in, versions v two table. So we would keep the old one. It will be like cold archival, but all new changes was recorded in a separate table because we don't have any integers there and for IDs. So that is why it's good to use UIDs when possible. But one issue with, for example, paper trail and solutions like that is reliability. Maybe, like, people maybe you heard that, it's possible to even with rails to execute certain queries that bypass the callbacks. For example, if you execute, like, delete all or update all or update column or something, it won't trigger the callback. Thus it won't record the history, so you will not have the complete history.

Evgeny Li [00:17:42]:
So that is where, like, log based solutions are critical and could be very useful. Because even if someone goes directly to a database, executes an SQL query, it will be recorded because it happens on their, like, database level. And that's one of the kind of, like, main reasons in terms of plus, there are some additional benefits like no performance penalty. So you're in run time. You don't insert any additional records. It happens asynchronously. But that's at a high level. Yeah.

Evgeny Li [00:18:14]:
You made me

Valentino Stoll [00:18:15]:
think about, because a lot of a lot of what we've been talking about has been surrounding, like, just auditing very specific processes around the data. Right? Like, we get a, you know, going back to the logistics approach where I, like, somebody submitted an order and they're tracking the process of that, and the order, you know, I imagine has a table of orders and various aspects of the order that are being tracked with it, and that gets updated in the database kind of the same way and it gets tracked, you know, as it progresses through whatever business logic that we have. What I feel like is kind of missing from what we're talking about here is, like, that, that overarching activity. Like, almost I it makes me think of, OpenTelemetry, right, where we have, like, distributed systems, right, where they don't have to be necessarily, you know, technologically. Right? But they're distributed events that happen throughout the business, that fire off and some activity happens and that has effect on the database. What do we do for, like, cases like that? Like, are there, like, tools available to audit that process and how that's affecting the the individual audit trails that are happening? And and, like, where how do you approach that aspect of

Evgeny Li [00:19:34]:
it? Yeah. You're definitely right. There are some, like, tools, for example, Datadog and things like that that help with some visibility, observability, and tracing. Many of these tools, they either implement a like, just login. So they would log what happens. And for example, if you use, like, Postgres, there is an extension popular one called PG audit. What it does, it basically logs all queries that were executed. So if for security reasons, someone, for example, gained access, you'll be able to, like, during, like, cyber forensics operations or something, you'll be able to detect what happened and who executed what.

Evgeny Li [00:20:15]:
So there are some solutions that do just, logging of activity that happens, and you can do it manually as well. There are some gems like public activity or Ahoy gems that, maybe use, like, Google Analytics on the front end or something like that or in the browser. It's similar, but on the back end when you can manually trigger certain events, that you want to track. And they will be stored in structured format, not in logs, but in a database, for example. There are also solutions that help with just performance, observability, open tracing, and things like that. But there are not a lot of solutions that do something on that data level. So and it's a little bit weird, and funny because, yeah, most of the products, they use databases, and that's the source of old data, and yet we don't track it. Like, developers use Git for tracking and versioning their code, but there is nothing for, like, databases.

Evgeny Li [00:21:23]:
And to be fair, there are some solutions that people sometimes use. For example, there are maybe 2 technologies that may help you solve this problem. 1 is temporal tables, and another one is event sourcing, utterance that some some companies and people use. So temporal tables, I'll try to summarize it in the simple words. Basically, it's within the SQL standard since 2011. What this feature does, it basically what I described, it kind of, like, has a version of all records. So you have your record, plus you have a range and a time stamp, basically, a range of time stamps during which period this record was there. So for example, if you're updating a record 2 times, there will be 2, records in the temporal table.

Evgeny Li [00:22:20]:
And the first record will have, like, an end time range when the second update happened. And second update will have, like, infinite time range starting with the time when this update happened. So with this functionality, and that's very powerful, it's usually helpful when you do, for example, reporting. I don't know. You need to generate some files or what happened last year, tax or something like that. That's very, like, powerful feature. Although many databases, including Postgres, unfortunately, don't have it implemented. So you would need to install some, extensions or build it yourself.

Evgeny Li [00:23:01]:
So that's one option. And another one for just auditing changes is rethinking how you make data changes is event sourcing. Event sourcing in simple words, it's kind of like, maybe the best analogy is Git. So when we use Git, we basically use an event source system. So git records all events, commits, and all the changes that happened. And then and that's the source of truth. So so commits become the main kind of, like, unit of data. And from these commits, the powerful feature is that you can travel in time.

Evgeny Li [00:23:43]:
So for example, if you want to roll back or see what was the state of, like, 3 commits back, you can do that by traveling and basically undoing the changes of each commit. I would say in from my perspective, like, in software engineering, event sourcing is like a unicorn idea. Everyone is talking how cool that is, but very rarely, people implement it or do it successfully because it's a very challenging problem. Because you deal with mostly with immutability of these events. So every change basically needs to be mutable. But what if you need to change the structure? Then you need to implement some kind of, like, versioning or replay the events and append some additional data or, like, change the things. Plus, then you would need to, somehow get to the actual state. So you would record all events, for example, shopping cart.

Evgeny Li [00:24:39]:
You would record an event that somebody added an item to a shopping cart. Somebody clicked on purchase, somebody purchased, and it was paid or something like that. All these events would be recorded. But at at the end of the day, you would need to understand what was purchased, for example, when showing the whole, like, order history. To do that, people use, projections, CQRS, like, for, like, separating reason rights and lots of lots of other patterns from the book called domain driven design, DDD, and that becomes very complex. I would say that event sourcing makes sense in certain regulated industries. And there are lots of great patterns or, like, how people do it. For example, double entry bookkeeping system in banks or, like, yeah, accounting and things like that.

Evgeny Li [00:25:33]:
This is standard where you would probably try to use this approach from the get go.

Valentino Stoll [00:25:38]:
Yeah. You make a good point. I mean, all of my, event driven design architecture related stuff has always you know, the complexity raises as you start to spread things out and you try and, like, capture what you know, collect all of the things and and merge, you know, the state of things. You know, which pieces do you pick? Right? Like, anybody that's had a merge conflict and get, you know, can understand the issue here. Right? Like, well, which which changes do you choose? Right? And that definitely becomes more of a problem as you start to spread things out with, you know, your event driven architecture.

Evgeny Li [00:26:16]:
Yeah. Exactly. Yeah. It's much simpler to use. For example, like, CRUD operations. For example, we just update the record, and that's it. It's transactional. Right?

Valentino Stoll [00:26:24]:
Right. Right. Do it or not. But it does make me kinda think, like, okay. Well, you know, as you know, it makes me wonder, like I guess if you have a need for, right, like, the things happening at different times, but, really, they're all the result of, like, one thing happening. It having a simple audit trail, like, it it makes sense to just throw a gem in and then just have it handle all of the things. But when is, like, when is something like, you know, a time series database or something like are are we just solving the wrong problem? Right? Like, why not just have something, like, in a time series database that that keeps track of the data as it happens, versus spreading it out? Like, what's the where do you see the benefit of having it be distributed like that over having it be more transactional and having things like, okay. Well, it happened in this series.

Valentino Stoll [00:27:33]:
Like, you know, is it worth it, what I'm saying, you know, to Yeah. To have that distributed nature versus just, like, taking the time and doing things in series?

Evgeny Li [00:27:43]:
Yeah. That's a good question. And I think the best answer is favorite, developer's answer is it all depends. But, I think, if you need strong guarantees and you have scale, then it totally makes sense to use this system. The question is then whether you buy or do it yourself. If you do it yourself, then you would need to start thinking about all these, like, technologies, cool technologies like Kafka and maybe storing it in, like, yeah, time series databases and columnar databases and things like that or warehouses and use it somehow. If you need to use it from the app, then you need to somehow connect to, I don't know, Soulflake to read it if you need it for, like, transactional or, like, operations, basically. I think, like, back to these, like, anecdote with large number of, paper trail records, Like, anecdotally, like, from talking to other companies, I saw that for many of them, just managing the database becomes much more expensive when you record everything in the same database.

Evgeny Li [00:29:02]:
You would notice that, like, this versions table becomes your the most biggest that, table in your entire database, and that affects everything. It affects, like, backups, the storage, how fast your database would run, how difficult it is to upgrade it at scale, how fast it will start, how, like, much traffic is there, plus activity, and things like that. And at some like, that's probably, like, totally a valid approach, at certain scale. But at some point, you would probably notice, like, why am I storing it in my database that is being killed, by this performance penalty? Because you're essentially doubling the throughput, on every for example, insert, you would insert an additional record, or it could be even worse. If you are inserting 100 records in one query, it will be 1 to 100 plus extra 100, queries for each record that was inserted. It will be 100 additional scale statements. And, yeah, at some point, it makes sense to somehow extract this data, and it could be still in, I don't know, the same type of database that you use. And there are some benefits to it.

Evgeny Li [00:30:21]:
For example, if your main transactional database is Postgres, for example, what we do, we store these changes and we call it destination database where the changes are stored from the source database. We store it still in Postgres. The reason why is because one is easier to consume. You just use the same adapters, same ORM. For example, in Rails, you can specify couple of connections. For example, for these models, go to this database. But for this model, go to the other database. So you on the application layer, if you want to consume those, you don't need to change anything.

Evgeny Li [00:30:55]:
And 2, especially in Postgres, there are lots of cool tech technologies that help you with scalability. One is maybe, yeah, time series we mentioned, there's, like, time scale extension or superextension because it dramatically changes the way, Postgres works, and it allows you to scale, compress data, and, record events in chronological order that it is designed for that with all these charting, partitioning, and things like that.

Valentino Stoll [00:31:25]:
Yeah. That's really interesting. You know, that's definitely something I've seen, you know, scaling. As things go, you you, you know, you wanna set it and forget it, but then all of a sudden, you're like, well, why is the database crashing when when we wanna, like, you know, audit the audit trail and and

Evgeny Li [00:31:43]:
find do any

Valentino Stoll [00:31:44]:
kind of reporting.

Evgeny Li [00:31:45]:
Yeah. And it affects not just, yeah, database, even, like, your runtime. So your application is slower as the result because it waits for database, but also tests. For example, if you don't disable, for example, versioning for your models in tests, imagine, like, seeding a large, like, database or lots of things, testing things. It will just, yeah, kill your test performance as well.

Valentino Stoll [00:32:08]:
That's a great point. So, I mean, what what can we do to, like, avoid these pitfalls? Right? Like, aside from, like like you said, use like, using some of these, other extensions, like, are there, like, best practices just, like, out of the gate that you can do to, like, maybe avoid some of, you know, the performance issues upfront without too much complexity?

Evgeny Li [00:32:32]:
Yeah. I think there is no, like, silver bullet, unfortunately. At smaller scale, for example, if you're a startup or something and you don't have lots of people or, like, customers or users, then it's probably not a problem, And you can just buy yourself time in terms of velocity by building features and not thinking about optimizing things. But, yeah, as you scale, unfortunately, companies hit this kind of, like, moment when they realize they need something, and they already build a usually, like, code base becomes giant. There are already lots of, like, team members and engineers on your team. And now you have to migrate everything. And that becomes very difficult to do if you want to, like, change things dramatically. And these migrations yeah.

Evgeny Li [00:33:25]:
I talked to some, like, people who build similar systems at, I don't know, Airbnb, Gusto, and other companies. It's all, like, very difficult when you need reliability performance and things like that. Yeah. I would say, yeah, I I think it depends on your business. If you're starting and you think you would need all these kind of, like, tracking functionality and it's business critical, It makes your business better. You can help, it can help you win customers and things like that. Then I would think from day 1, like, how to potentially design this system. Either use, like, event sourcing or use these CDC change data capture or, like, some products that out there, something like that.

Evgeny Li [00:34:10]:
But if it's just, like, secondary, then you can either buy at later stages if you're successful some solutions or just it's not like a main problem and it's probably will be deprioritized for many, many years.

Valentino Stoll [00:34:27]:
You know, I part of me always wish that there was, like, a log rotate version for database tables or or or things like that. Right? Where, okay. After so many so much time has passed, you know, just take whatever's there and, like, stash it somewhere else. Yeah. Yeah. Almost like an auto database charter.

Evgeny Li [00:34:48]:
Yeah. Right? Yeah.

Valentino Stoll [00:34:49]:
It unfortunately, doesn't exist. But

Evgeny Li [00:34:52]:
Yeah. You would need to build it yourself. Yeah. And we we we plan to do that. We we call it, like, retention period. So how long people need to retain their data and changes and audit. Because storing everything in the hot, like, database that can be queried at any time is expensive. That is why people sometimes move it to, like, some other cold storages and archival or something like that.

Evgeny Li [00:35:15]:
Yeah. It's 3 buckets. And

Valentino Stoll [00:35:16]:
That's awesome. I mean, it's definitely needed. I wish so many times I had just, like, an option to to buy that. Right?

Evgeny Li [00:35:24]:
Yeah.

Valentino Stoll [00:35:24]:
Yeah. Rather than have to build it and manage it and worry about performance of everything. Yeah. For sure. So so where are you taking this, from here? Like, what what are the next problems for you to solve? I mean, we've talked about quite a lot of different so, I mean, we we haven't dove in quite, to maybe some of the specifics, like, you know, doing various layers of logging. Right? Like, we've you've talked about, like, HTTP logging versus database specific, auditing. Right? And console logging, which we haven't talked about. But really interesting stuff from, like, 37 signals, right, with console 84 and audits, for doing that in, the rails console.

Evgeny Li [00:36:12]:
Yeah. Yeah. There are many layers. Yeah. Where There's so many layers. Yeah. Many layers are covered, and some are just they there are huge gaps. So for example, you mentioned, yeah, HTTP login, it's all done.

Evgeny Li [00:36:25]:
For example, you can application execution, it's done. There are services. Data, maybe it's solvable, and now we're trying to solve in the dot on the database level. Console access, it's also, I would say, a little bit specific to the Ruby community. For example, in the JavaScript community, there is there are not usually good, like, repl consoles or that come with frameworks and things like that. And it's good and bad. I'm noticing that people it's good that people don't go to this console to do crazy things there. But sometimes they go to SQL console directly or connect to an SQL and perform changes on the database level.

Evgeny Li [00:37:08]:
I don't I don't know which one is better. Yeah. With Ruby, yeah, it's luckily, we have great, like, console experience. We have, like, gems, like, pride that allow you to, I don't know, go and jump into, like, classes and expect objects and things like that. And, yeah, there there are couple of gems. One is console 1984 and another one, Audits 1984. These are interesting gems. I personally haven't used them, but they seem to solve, like, very popular problem, especially among b to b companies probably.

Evgeny Li [00:37:43]:
When you have customer data or customer request, you need to do something manually, but you want to audit, what changes an engineer did, for example. What these gems do, they simply yeah. You install them in the rails, project. For example, when you load your rails console, they'll prompt you and ask, like, why, are you here? Like, what do you want to do? And engineers can specify, like, a ticket number and specify some context. I'm solving this customer problem blah blah blah. And you can type all the commands, and the commands may be logged in a separate database table, and the outage gem 1984 allows you to inspect all these things. You can implement some additional approvals or, like, retroactive approvals. But these solutions are, like, and I don't know whether there is any better solution, but these solutions are not bulletproof in terms of reliability accuracy because because it's Ruby.

Evgeny Li [00:38:45]:
You can do whatever you want in a console. I can override everything, and then after that, do crazy things that no one would see and notice. Of course, you you're hoping that your colleagues or you won't do that, breaking all the rules, but it requires discipline. And, yeah, it's funny. I tried to implement something similar once at another company when we had a little bit more complex set up with SSOs to AWS accounts, these accounts would run on Kubernetes certain process with the rails console with some predefined end variables, and these kind of, like, sets, certain context for, for example, paper trail or tools like that so you can see who is making the changes. And, yeah, I haven't shipped this feature or it was not, like, released, but it's still I I would think that it's a problem for many. So if you're listening to this podcast, yeah, that's one opportunity to improve how to improve login and auditing of their consult sessions because many companies just leave there. Like, developers leave there for troubleshooting purposes.

Evgeny Li [00:39:57]:
Yeah. And things like that.

Valentino Stoll [00:39:59]:
Yep. Yeah. I remember first seeing console 1984, and the audits 1984. Kind of funny naming. But, yeah. Really cool. I I've used them, in a couple projects and, yeah, it does have, like you have to use, like what is it? The the active record, encryption stuff natively. So if you use maybe lockbox or something else, you're kind of like SOL.

Valentino Stoll [00:40:28]:
Yeah. But it's it's really interesting. Like, the interface is definitely pretty smooth, and I do like it. But you're right. Like, it doesn't really prevent anything. It's more just like, alright. Well, as long as you, you know and, you know, the goal is to hire people that you trust anyway. So as long as you have, like, at at least a process in place to audit it, it definitely makes things much easier to track that.

Evgeny Li [00:40:51]:
And from your experience, what did you do in these consoles usually? Like, from my experience, it's usually either checking the data, what's there, because you don't have, like, full UI for everything or something like that, or sometimes editing some stuff there as well. So I'm curious whether it's a common Yeah. I physical

Valentino Stoll [00:41:12]:
I hate I hate going in production console. So Yeah. For me, it's mostly querying, if if anything. Right, like, you have a bunch of related objects, and it's, like, almost not impossible, but, like, much worse to, like, go into the, you know, a SQL dump or something like that and sift through the the queries for it. But but sometimes, like, you know, you you need to, you know, just update some time stamp somewhere. And it's just like you know, rather than having to generate a rig test that goes and runs, right, I if you just have, like, one record that you're trying to get to, that's problematic. That's like, you know, failing a ton of, repetitive, workers or something like that, and you just wanna stop it from doing that. You know? But you wanna track that is that that you that you you were the one that, like, updated it.

Valentino Stoll [00:42:08]:
Right? And it's, like, very rare that you need to do that, but there are those cases.

Evgeny Li [00:42:15]:
Yeah. Oh, in these cases, when people need to update data yeah. I think the there are many also approaches how you do that. Like, one one off. Yeah. You can write a break task. Some people run it in migrations. Some people execute stuff in a rails console or things like that.

Evgeny Li [00:42:34]:
They think there is also, like, opportunity to improve things because writing like, it's good when it goes through the code, but the cycle can be and feedback cycle can be still, like, long, and you don't know, like, whether you wrote everything perfectly or not. Yeah. You would need to test it and things like that.

Valentino Stoll [00:42:52]:
And Yeah. I mean

Evgeny Li [00:42:52]:
the changes

Valentino Stoll [00:42:53]:
I would say, in general, like, probably don't update things Yeah. In a production box in general. Like, no writing. I feel like that's a good rule of thumb.

Evgeny Li [00:43:04]:
Yeah. Mainly just ban, yeah, access. Like, only, for example, 1 person or, like, certain, like, DevOps team can access or something for troubleshooting or, like, certain emergencies.

Valentino Stoll [00:43:14]:
I know I know some companies will do, like, you know, only, you know, the console access only gets read replicas. So then it just prevents it in general. Right? So then you just you can only read from the database in general anyway if you're connected to the console, which is really interesting and definitely, like, prevents it. Right?

Evgeny Li [00:43:34]:
Yeah. Yeah.

Valentino Stoll [00:43:35]:
But definitely not as flexible.

Evgeny Li [00:43:37]:
Yeah. Ideally, you would need to tightly isolate everything. Like, there are certain other distributed systems like Rediscash, memcache. People can also use and drop things by accident. Ideally, we would also want to isolate those, maybe certain HTTP requests. And Yeah. We also, like, log in HTTP requests that are going from your application because I can use all environment variables and send, I don't know, Stripe API request to send myself, like, 1,000,000 or something. Ideally, it should be also traced, like, who made this HTTP request and or maybe ideally prevented from doing it in the console.

Valentino Stoll [00:44:16]:
So, I mean, you bring up a good point. Like, I feel like we haven't touched much on, like, the recovery aspect of all of this. Right? Like, the whole point of auto logging is in case you need to recover something. Right? Like, or you need like, next to reporting. Right? Like, so reporting, you can gather. Okay. Well, we're doing a, you know, audit specifically of everything that changed this particular thing. But like, most of the times, I feel like the reason for that is there's a recovery aspect following that up.

Valentino Stoll [00:44:45]:
Right? Like, so Yep. Where where is that in this whole process? I know, like, some of these gems we've talked about have some of this, but and I feel like it's still kind of missing. Like, what is what is your, like, approach there, and, like, how how are you working to, like, resolve this kind of, like, stuff?

Evgeny Li [00:45:05]:
Yeah. Recovery is very tricky problem, in general. One issue, for example, is what if you somehow backed up some data or certain rows, you have the record history historical version. But now your schema has changed if you're using SQL, like, databases. Like, how do you resolve this conflict? What if, you added the column, you prepopulated it, but in your history, you don't have it? Now you have to prepopulate it because the column may not, like, contain, like, shouldn't contain null values or something like that. So that becomes a problem. In general, we have gems like paper trail. They allow you to query the version, find the right one.

Evgeny Li [00:45:47]:
There is, like, helper method called refi. You call it, and it basically instantiates a active record record with the version. And then you can call safe again, and it will override and store the historical, changes. As I mentioned, yeah, there are some, schema changes. That's the kind of, like, up to people to decide how to resolve these conflicts. There are some other different approaches that on the database layer people solve. So the one is very straightforward. You snapshot or take backups daily, nightly.

Evgeny Li [00:46:26]:
Ideally, you maybe test your disaster recovery procedures and test whether you indeed can recover. Because very often, no one tests those and just thinks that they're safe. And there there are also new popular techniques. Maybe you heard about point in time recovery. So that's popular feature that many database hosting providers advertise. What they do at a like, in terms of implementation, how it works. Okay. How would you use it? You would use it by simply selecting any time you want to recover data, and it will either create a new database instance with the state as of this time stamp or you to it will replace your database state.

Evgeny Li [00:47:14]:
The way it works, it uses these write ahead logs that I mentioned previously, because write ahead logs are, like, append only logs. These solutions, they continuously archive them. So some companies just take these balls and wild records, and they move them to s 3. And then if you need to roll back, they give you, like, either the timestamp when, for example, a certain blob was archived, or they may also do some binary search within certain file to find the exact timestamp you want to and try to recover to this data point. And there are some other, like, cool tag built on top of this idea. If you have, like, file and it's kind of, like, stored somewhere reliably, then you can do branching. So there are some serverless lots of serverless hosting providers. What they allow you to do is to use, basically copy on right copies.

Evgeny Li [00:48:16]:
For for example, your branch when you want to test something on staging. What they allow you to do is basically create a database that will simply point to the state from a shared well kind of like log. So instead of copying all data again and paying for all extra resources, it will copy from the main, let's say, history. But any additional changes that you'll make, it will branch off, and it will just incrementally record some additional stuff. So some super cool technologies that out there that help with, yeah, branching as well as, yeah, data recovery.

Valentino Stoll [00:48:54]:
Yeah. I've definitely seen the branching. It's pretty interesting, like because then it, you know, it can just either wipe it off or, like, integrate it as it as it merges down if you want it to, which is pretty interesting. Like but in reality, though, like, what what have you seen kind of, like, things evolve to, like, you know, what are people actually using? Are are they using copyright? Like, it seems like a very, like, complicated process, to integrate with. Like, what what are the where are the trends going?

Evgeny Li [00:49:24]:
Yeah. I think, in general, people prefer cloud managed databases usually for these reasons because they can manage backups reliably. So if you run something on RDS or, like, I don't know, render, stopabase, or whatever, you can trust that, the backups they take, they're probably fine, and you'll be able to roll back if necessary. That's one first line of defense. It's just relying on your hosting provider. And any additional ones, yeah, related to using additional tools, like the way I think about what we do, for example, at BEMI, we provide selective point in time recovery. What it means instead of, like it's very hard and dangerous to roll back the whole database state rather than, like, rolling back certain data records that were affected. And with tools like that, when you can filter out changes and select the ones that you want and just apply those, that's very useful because usually, yeah, I rarely saw when people would recover from, like, full database, backup because it means you would lose your history, what happened after that if certain other clients used it and things like that.

Evgeny Li [00:50:48]:
So it becomes very problematic and not very practical.

Valentino Stoll [00:50:52]:
So I have to ask this, mostly because it's my interest. But, like, do you see, like, AI fitting anywhere in this process, or is it kind of just like, you know, there's this is too sensitive of a topic to integrate with?

Evgeny Li [00:51:07]:
Yeah. I think there are some I and I would share one idea that, we the problem we try to solve with AI, maybe we'll solve it, but it's okay. If you have, desire to do it, go for it, because it will be great if you can solve it. But one maybe obvious solution is, like, AI for SQL. There are many nontechnical people who want to use, like, inquiry data in a, like, business intelligent tool, database, or something else, but they don't have, like, the knowledge how to stitch things together for complex scenarios. That's one idea. Another idea is we mentioned the audit trail and log. And for example, we have some customers, even nontechnical people use our UI for troubleshooting.

Evgeny Li [00:51:59]:
But what could additionally help is data mapping. So, usually, database, structure, it doesn't very well translate to real world language. You'd have some different IDs. Maybe you especially in Mongo, I remember, like, all days in Mongo, it's like document based, like, database, and it's, doesn't have a structure. So the column names, you may have different columns, sort of columns. It was recommended for performance reasons to shorten your column names to, like, 1 or 2 letters or something like that because it would take less space. And imagine seeing, like, an auto trail with weird, like, data changes and with m is equal this time stamp, b is equals weird string. People would not be able to understand it.

Evgeny Li [00:52:48]:
So where AI could feed it here is basically taking the structure and humanize it, understand the data structure, understand maybe the some business logic from the code base, and map it so anyone, even nontechnical, like, customer support, ops, like, the CEO could understand what's happening if if they don't have, like, knowledge about how data is structured. And that's in general, yeah, related to these ubiquitous language concepts. Ideally, everything should be the same, but that rarely happens. That's one idea. Another one is, like, some companies, they they maybe heard about techniques like RAC, for providing some additional context, with custom data to your, like, AI system. So it can use internal data. Maybe you you use perplexity. Complexity uses other resources that may be even, like, updated in real time just an hour ago.

Evgeny Li [00:53:51]:
Plus they have all these, citation links, pointing to the source. So reg techniques, can help you to do that. But companies in general, when they deal with internal data, audit can also help. So you want to, like, have clean data, especially if you operate in this space when there are lots of documents and structured things, maybe files, maybe emails. So I'm not sure, like, it's for me. I don't know, like, how to productize it or what's the exact, like, solution. But the problem is that, yeah, people want to take this data, but they also want to understand, like, what happened, for example, why it was like that, why this AI systems said it like that. Maybe we can fine tune it, and you will see that with, like, yeah, certain config changes, your response rate and accuracy has improved and things like that.

Evgeny Li [00:54:50]:
So it's important to audit the results of the AI queries as well.

Valentino Stoll [00:54:56]:
Yeah. Super interesting, takeaways there. Personally, I'm looking forward to it. Like, you know, just just getting some observation aspects of it out. So, you know, like, kind of just automate automatically getting, like, the, you know, bullet jam or something like that. That just tells you, hey. Like, these things are not performing well, but without having to actually capture them. You're just looking at it, sideways.

Valentino Stoll [00:55:22]:
Right?

Evgeny Li [00:55:24]:
Yeah. I hope, yeah, things will improve and, things will be implemented. Yeah. I use GitHub Copilot, and, yeah, it's some people hate it. Some people love it. I like it. When, it changes your mind, you have to know and understand how to use it as well to better leverage it. Sometimes I would write a comment explaining what I'm trying to do, and it would then suggest me some code that more or less matches, but then I would still need to check sometimes and things like that.

Evgeny Li [00:55:53]:
But, yeah, there are lots of opportunities for AI.

Valentino Stoll [00:55:57]:
Very cool. Well, we've talked about so much here. Is there any other, aspects you wanted to cover, before we move into PIX here?

Evgeny Li [00:56:07]:
No. I think we covered a lot of ground. I think in general yeah. That's, like, tracking changes, auditing is a big problem. There are tons of tons of ways how to solve this problem. Yeah. And I would recommend people try to evaluate them all instead of trying to just going with 1 by default because it's the hottest one or something like that. See what works best for you.

Valentino Stoll [00:56:35]:
Yeah. Well, for sure people should check out Vemmy, and your blog has a lot of great, content around audit logging and tracking various activities and database, things you can add. I I'll definitely recommend people check that out for sure because, I'm already gonna type in quite a few articles on there. You got a lot of great stuff there. Alright. Well, let's dive into pics. Do you have anything you, you wanna share? Or I can go first if you want some time to think about it.

Evgeny Li [00:57:08]:
I have 2 picks. One is nontechnical, and one is technical. So the nontechnical one is maybe if you're old enough, you remember there was a technology called RSS. And now I'm trying to use it on a day to day basis. The way I'm using it so there are, for example, lots of great technical, like, channels or content out there. And instead of, these systems being pushed to me or me checking, I don't know, subreddits or, like, hacker news and things like that, I try to aggregate all my content as a kind of, like, funnel into an RSS feed. Some of them, I use feeder, but there are many others, and they're they're similar. Some of them, they have help you, get email address so you can also subscribe to Substack without polluting your inbox.

Evgeny Li [00:58:06]:
So I'm just organizing it in different subfolders. I don't know. Text related. I don't know. Some engineering related and other topics. And I aggregate everything, like stop stacks, top subreddits within a day or something with some URL arguments. You can get to these feeds as well. Tech blogs from different companies, top Hacker News Within A Day or something, Digest.

Evgeny Li [00:58:28]:
So lots of lots of resources that can be accumulated there. And I have just a queue that I'm going through and reading when I have time on mobile or on the desktop. And my second technical peak, is, mix. I'm not sure if people heard about this system, but it's basically package manager and system configuration tool. It's sort of like R band, RVM, SDF, and all these things, but it's more powerful on the almost, like, operating system level. So it's not, like, just limited to programming languages, for example. But the way we use it, for example, we have lots of projects in different languages. There is a full disclaimer.

Evgeny Li [00:59:15]:
Nix has, from my experience, has horrible. It's very hard to use, very hard to understand what it does. So there are some wrappers around it. For example, DevBox, but there are many others. We use Dev Box. So, essentially, what it allows you to do is to bring any tech stack without Docker to your local machine, and it's isolated to your project. Meaning, it will create a separate folder for your Postgres, for your Redis, whatever, for your, like, JavaScript engine or not. Js or Ruby and things like that.

Evgeny Li [00:59:49]:
You can specify all these dependencies. So when you go to your project, you can keep using your, I don't know, favorites at the stage or fish or whatever shell, and it just automatically enables all these additional dependencies within this project. Yeah. That these are my picks for today.

Valentino Stoll [01:00:13]:
That's super cool. I'm gonna have to play around with that. Yeah. Nyx is definitely a terrible user experience. So thank you for sharing Dev Box because that's pretty cool. So, I only have one pick today. I my awesome coworker, Anna, she, wrote this incredible article on the Doximity blog, recently on, evaluating GPT outputs, and how we do it for medical professionals, and using golden datasets and, programmatic evals and some human evals as well, to just make sure that, you know, the output you get from large language models is accurate, at least following, you know, certain set of standards. So I recommend checking that out.

Valentino Stoll [01:01:11]:
It's really awesome and just, like, goes through a ton of stuff that, we've been working out, which is really fun.

Evgeny Li [01:01:18]:
Is that that's great. Yeah. I I hope that all health care providers don't just blindly use AI without the ability to

Valentino Stoll [01:01:25]:
I really hope so too. Yeah. It's been great talking to you. I love to see all the the work that you're doing, to to help make auditing easier, for a lot of businesses. And, you know, thank you for all the documentation and coming on the show today. It was, really great, you know, chat chatting with you today. If people wanna get in contact with you, how how can they reach out to you or find you on the web?

Evgeny Li [01:01:55]:
Yes. I mostly and rarely use Twitter or x, how it's called nowadays. I have, like, medium, GitHub everywhere. My handle is Exaspark. You can find me there. And, yeah, thanks for having having me here, Valentina, today. It was nice chatting with you.

Valentino Stoll [01:02:16]:
Yeah. Totally. Yeah. I mean, until next time, everybody, I hope you, learned a lot and can make use of a lot of this stuff, because I know I will. Alright. Bye now.

Evgeny Li [01:02:27]:
Bye. Bye.

Navigating SQL Data Changes: Tools and Techniques for Data Recovery - RUBY 645

0:00

01:02:32

Playback Speed: