Checking safety in Exactly-once, TLA+ and Stronger message processing - .NET 159
Tomek Masternak is an Engineer at Particular Software. He joins the show to talk about his video, "Checking safety in Exactly-once ". He begins by explaining what TLA+ is, the advantages and benefits of using this, the problems it solves, and many more. Additionally, he dives into "Ensuring exactly-once message processing with model checking" and talks about Stronger message processing.
Show Notes
Tomek Masternak is an Engineer at Particular Software. He joins the show to talk about his video, "Checking safety in Exactly-once ". He begins by explaining what TLA+ is, the advantages and benefits of using this, the problems it solves, and many more. Additionally, he dives into "Ensuring exactly-once message processing with model checking" and talks about Stronger message processing.
Sponsors
- Chuck's Resume Template
- Raygun - Application Monitoring For Web & Mobile Apps
- Become a Top 1% Dev with a Top End Devs Membership
Links
- Checking safety in Exactly-once - Tomek Masternak, Szymon Pobiega - YouTube
- The TLA+ Home Page
- Learn TLA+
Picks
- Adam - Virtual Box
- Mark - The Recruit
- Tomek - Asianometry
- Shawn - Ahsoka
Transcript
Shawn Clabough (00:01.651)
Hello and welcome to another episode of Adventures in.NET. I'm Sean Claibault, your host and with me today, co-host Mark Miller. Hey Mark, how are you doing? I'm pretty good. You know, the weather is changing, but getting a little colder, but I actually, I sleep better at night when it's cooler. So that's...
Mark Miller (00:08.438)
Hey there, Sean!
I'm doing great. How are you doing?
Mark Miller (00:21.482)
Yeah, I think that's true. I've heard not that about you, but I've heard that about humans in general. That lower temperature, cooler temperatures. Adam's like shaking his head like maybe I have heard about you, Sean, and what temperature you sleep at.
Shawn Clabough (00:25.921)
Yeah
Shawn Clabough (00:35.683)
Yeah. Yeah, that's our other.
Adam Furmanek (00:38.703)
I just thought about full moon being like these days and some people can't sleep at night, they are werewolves or whatever, but maybe that doesn't apply to Sean.
Mark Miller (00:47.467)
We don't know.
Okay.
Adam Furmanek (00:52.287)
That was a good one.
Shawn Clabough (00:55.482)
Yeah. And it's that other cohost you're hearing there, that's Adam for Monic. Yeah. Yeah, I think I can say his last name, but he can't say mine. So that's why I do the intros.
Adam Furmanek (01:01.067)
Hello folks, very nice to meet you again.
Adam Furmanek (01:06.187)
That goes both ways. That was Sean Claybo.
Shawn Clabough (01:11.915)
See you can, you got it, you got it. All right, why don't you introduce our guest today, Adam.
Adam Furmanek (01:19.995)
Yeah, sure thing. So today with us is Tomasz Masternak, a.NET engineer, which I presumably heard started with in 2006. Since he mentioned our favorite question is, what.NET framework was the first one you had a chance to use? So Tomasz, if you could introduce yourself and also share that fun fact.
Tomek Masternak (01:43.834)
Exactly, I can't prepare it. So listening to previous podcasts, I knew that does the intro question. So I actually had to check and it turns out that it was.NET 2.0. So that was the first version that I started with. I won't mention how long ago that was, but long enough, I guess. And since the...
Shawn Clabough (02:04.071)
Yeah, long enough. Yeah, at least you didn't have to go through the pain of not having master pages.
Tomek Masternak (02:09.726)
Yes, exactly. That was the good part. But I stayed with the platform since then and the technology, I have to say that now and then I look into different parts of the IT ecosystem, I guess but.NET has always been the major interest for me. And...
Since the beginning, I was pretty interested with distributed systems. So everything that was connected with messaging, sending information between the processes and figuring out how to build reliable systems that are built of different parts talking together was pretty interesting. I remember when WCF came out and when I was looking at it, I was pretty amazed by how well it was.
thought through all the channels, the ways in which we could configure that. That was pretty impressive. And I think that was the main interest for me for quite a while. And more or less eight years ago, I had the opportunity to start working on an open source library, which is called nServiceBus, which actually has a company and a group of people behind that.
But since then, I stayed with distributed systems for longer and specifically with distributed systems that are based on messaging technologies. So things like Azure Service Bus, RabbitMQ, MSMQ in the old days, but still being strong. So that's the area of my interest and my daily work.
Adam Furmanek (04:01.463)
Cool, that sounds really cool. So today I believe our topic for this episode is something called TLA+. And Tomasz shared his YouTube video presentation of his workshops and his blog called Exactly Once. So Tomasz, if you could like tell us a little bit more what this TLA+, it's a little bit cryptic when we first hear about that.
Tomek Masternak (04:31.374)
I agree. And that was my impression when I heard it the first time as well. Sounds pretty exotic. So, so let me try convincing you that being in the.NET space, it actually makes sense to, to look into that direction and at least consider that. So I was, that's one of the questions, which is why should I even look at TLA+, what it is. And before me going into the technical bits, let me give you a, a history from my personal
professional work. I remembered that on a couple of occasions when I was writing my first multi-threaded code. The way it would go was that I would start writing a solution for whatever problem. I would spend the whole day. And usually the way in which I would build it would be, I would identify a couple of scenarios and go through that scenarios in my head, looking at the code and figuring out whether it will work correctly or not.
And by the end of the day, I would be exhausted and I would say, oh yes, but finally, I actually solved all the problems. It's good, I'm done. I would go home, get night's sleep, wake up in the morning and I would be, oh, I think that there is this one more scenario in which it's not going to work. And then, and then the cycle would repeat over and over again for a couple of days until I would be like, I don't know, like, I guess it works, it's good enough.
let's think about something else. And I have to say that was the approach that I've seen around me. So at least I wasn't the only person there. I think that that's a shared experience for a lot of us, especially when it comes to multi-threaded code, to distributed systems, when you have messages flying around and processes talking to each other. So some time ago, I think it was...
maybe five years ago, I came by this TLA+. And I saw it first because that's a language and other tooling that I will mention in a second that was introduced by Leslie Lamport. So Leslie Lamport is a person who invented, discovered whatever you believe, Paxos algorithm, the two general problems, that's his invention.
Tomek Masternak (06:52.738)
the logical clocks, consistent snapshot. He stands behind a lot of things that were done in distributed systems to make them reliable. And one of the things that he also discovered or created was TLA+. And the acronym state stands for Temporal Logic of Actions. And it doesn't ring a bell or it didn't ring a bell for me when I first heard it. But basically, I guess,
In a nutshell, the way in which you can think about it is that you can describe your system using TLA plus as a language. So it's a language which looks comparable to C sharp. It has a syntactic sugar on top of it so that it can look like Pascal or C language. But you describe your system in that language.
And when you describe your system in that language, you specify two things. You specify which parts of your system are done in an atomic way, in an atomic steps, so that nothing else can interfere in that step. And secondly, you can model parts of your system failing. And then what do you say is, this is how my system starts off. This is the description of my system in time
slot zero, and those are the rules in which my system can evolve. So this is what can happen in possible futures. You give that to something which is called a model checker, and that model checker based on your description is generating all possible histories that can happen in your system, all executions, and it can also check for you whether sum of
guarantees or conditions are always met. And I know that it's pretty cryptic, but so let me give you an example. So for instance, what you can say is you can describe a trivial system, which is let's say a counter which can be incremented by two threads. And you can write a code for those two threads in TLA+, saying the one thread is going to update the counter, the second one is going to update the counter.
Tomek Masternak (09:10.134)
There are some rules that you put in and then you can say the counter can never go above 10 or above 20. You give it to the model checker and the model checker is going to say, okay, for whatever I can say, what you're claiming is actually true or otherwise it can say, no, actually there is this history, this scenario that I can show you how the things can happen in your system that lead
the situation which actually breaks your condition. So it can basically prove by an example that some of the things that you think your system cannot do or cannot break can actually happen. So this is the rough description or at least my mental model of how I think about TLA+.
Adam Furmanek (10:00.075)
So this is kinda like, I believe it seems to me like the mental exercise we do when you mentioned concurrent or parallel programming, then we just start waving our hands and saying, if this thread does that and then it's stopped in the middle, then some other thread kicks in and does that. Nah, that's not gonna happen, so let's rewind. And the story goes on and on.
Tomek Masternak (10:26.778)
That's exactly what is happening. And from my experience, I can say that I can do, I don't know, three, four executions at one time. This is my brain capacity. And what you can do with the model checker is that you can check all of them, assuming that the system that you're describing is not that big. And one thing just to note here is that to be realistic,
What you can do with TLA+, is you cannot take your whole code base and try to translate it into TLA+, and then to prove something or to check some conditions in that code. You can actually do that for a fairly small piece of description of the TLA+. So one part of the exercise is to actually identify parts of your system that are...
good candidates to be model checked. So basically figure out what is the ROI. And you are not going to model check the UI probably. You are not going to model check your SQL. You are going to model check the parts of your system that you cannot mentally think about in a way in which model checker can check them. And the way that I actually started with TLA plus was a very concrete problem.
And that problem was a problem in message-based processing systems. So an example of a problem that is there is more or less this. When you have some endpoints or some processes that are communicating via sending messages, let's say Azure Service Bus. Let's imagine that you have a system which consists of four or five microservices or services,
star services. And what they are doing is they are sitting on top of Azure Service Bus or RabbitMQ or SQS, whatever, but persistent messaging. And what happens is that they are publishing events, they are sending commands to each other, but the routine in which they operate is more or less the following. Like there is some input queue, there is a thread that picks up a message from the input queue, looks into the message and says, oh, I need to update something in the database, generates a
Tomek Masternak (12:54.642)
and then potentially publishes additional messages to some other microservices. And the problem that's there, especially with the modern messaging infrastructure is that there is no transactional guarantees between operations that you do with messages and database. So we can have a very interesting failure scenarios. And one of the quotes that I really like is that, like when systems are correct, they all...
all correct systems are pretty much the same, but those that fail always fail in their unique ways. So it's like, there are very interesting ways in which you can get into trouble with that scenario. So for instance, what can happen is that you can pick up a message or the thread can pick up a message. And if you acknowledge that message, so basically you tell to the broker or messaging infrastructure, I'm done at the beginning, your thread can die.
for whatever reason and that messages disappears. So probably a good idea is to acknowledge the incoming message only after everything else happens. So then you have a message and you want to make some operations in the database. So you do the operations in the database and you can commit the database transaction. What if you fail after that? The message that you picked up gets back into the queue and with Azure service bars after some lease time elapses,
it will reappear and you pick it up one more time. What happens with that duplicated message delivery? Probably you want to figure out how to de-duplicate that message to turn processing that second attempt to be a no-op effectively. What can also happen is that after committing a message, you can die before you send out the outgoing messages if there are some. So...
getting a duplicate might be a no-op in the database space, but you actually want to push the messages to make sure that they get delivered. So that problem of consistent messaging is, or what I call consistent messaging or exactly once processing, is something which basically pushed me into that direction of TLA+. Because that's one of those examples in which you go through various scenarios and it's actually hard to figure out.
Tomek Masternak (15:16.746)
whether you solve the problem, because there are so many things when your system can fail. And just one historical comment here that might be interesting is that in the past, that wasn't that big of a problem because we had distributed transactions. We had DTC or enterprise transactions in Java space. And what happened was that
the messaging infrastructure and database could actually be combined into a single transaction. And for various reasons, especially in the cloud environment, this is no longer the case. Like there is no way that your PostgreSQL or MSSQL is going to get into a single transaction with your messaging infrastructure.
Adam Furmanek (16:03.471)
Okay, so getting your problem, what I'm hearing is I need to pick up a message from the queue, do some processing in some completely independent system. So I cannot have transactions spanning both of those. I mean the queue and this other system. And I can die in any time, any time, right? So now comes the question, what should I do first? Commit here or commit there? But...
seems like we kind of have solutions to that. I mean we have transactional outbox pattern, we know there is something we call item potency, you mentioned the duplication. Like why would I even consider TAL plus instead of just you know taking those well-known and
Tomek Masternak (16:51.446)
Mm-hmm, that's a great question. So let me ask would be a simpler part of that. So let me start with item potency. Item potency is very often given as a solution to the problem of duplicate deliveries. So whenever you have a problem that you can get a request more than once, let's just make the operation item potent and that's it. And...
That's my hypothesis, but I think that solution came from a very specific context that I think matters in this discussion. So I think that the context in which idempotency solution came and is useful is in HTTP RCP based communication. So what happens in that situation is that you have a one process and a second process, and the first process is sending HTTP operation. If it fails,
is going to repeat that. If it fails, it's going to repeat it one more time until it is successful. And usually what happens is there are two additional behaviors here. So the sender is not sending, let's say, 10 concurrent requests to the same microservice, but it's actually sequencing the operations as it sends it. And I think that that's a very important behavior. And if that behavior is there,
then idempotency can be a solution because the receiver knows that it can get more than one request, but those duplications will come one after the other. When we are talking about messaging space like Azure service bus, unfortunately, this is not the case because we can get reordering. So I can get duplicates, but in between the duplicates I can get another messages. And then that idempotency is not.
potential solution. So let me give you a trivial example that's in one of the blog posts that we wrote with Szymon Powiega. So an example would be, let's say that one of the processes stores a state for a shooting range. And basically what you have is that there is someone who sets the position of the shooting range and there is another service that can guess the position. So let's say that I set the position of the shooting range to be 10, whatever like on the x-axis. And
Tomek Masternak (19:15.282)
I'm sending commands and those commands could be hit at position whatever. And the second command is move the shooting range, like move it somewhere to a random place. If I do hit at position 10, what I will get is I will get the answer. The hit is, is there, you get a point. But if the scenario
the failure scenario or problematic scenario happens. That is, I'm hitting at position 10, then the range moves, and then I have a duplicate of that hit at position 10. What happens is that the receiving endpoint gets a duplicate which says, I am aiming at position 10, but the shooting range is not there anymore, and it gets inconsistent response. So I think that usually the iPhone
idempotency assumes that I can trivially detect the duplicates and return the same response, but with the messaging system, it's not that easy because the state can change between the duplicates as well. So that's why I would say idempotency usually does not happen or idempotency is not a bulletproof solution for...
for messaging systems. When it comes to the outbox, I think that one problem that we have in the software industry, or one of the problems is naming, it's just hard. And sometimes outbox or a given term means different things in different contexts. So the way in which I think about the outbox
the outbox only is a very concrete implementation. And I think that in one of the places that I've seen it is microservices.io, which tries to catalog the patterns for building microservices. And the way in which the outbox pattern is described there is that when you get a request, prepare your business transaction, and then...
Tomek Masternak (21:35.038)
grab any outgoing messages and store them in the same transaction. And then have a background thread that looks into the outbox, which is basically a table that has an outgoing messages and push them asynchronously. And that's all fine. But once again, the problem is that when the duplicate comes in, you still need to be able to figure out whether it's a duplicate or not. So...
Usually you need to tweak the Outbox implementation a bit to make sure that when you get duplicates, you are safe as well. Apart from that, exactly once is basically, I would say trying to figure out what is essential part, what are the essential parts of the Outbox pattern and what are some of the extensions in which you can extend that. And the TLA plus that I was using
to figure out whether it's correct or not, is based on a flavor of the Outbox pattern. So I think that getting back to your question, Adam, I would say using Outbox pattern is a good idea, but writing the Outbox or implementing Outbox in a correct way is way harder than it sounds. And I think that by analogy, sometimes we can think that
we have object oriented programming patterns, which you can implement after reasonably, after you reasonably well understood the pattern, which is probably after a week, maybe two weeks. But with Outbooks, I think that there are some details that you need to get right, because otherwise it's just not going to work in the way that you expected.
Adam Furmanek (23:25.495)
I think you tapped on a very right topic, like idempotency, and especially the duplication. People often just yell, do the duplicate your messages without necessarily specifying how to do that. And sometimes it can be very hard. Like I was actually recently solving that and blogging even about that, that I had to do the transactional outbox pattern, but the database was...
distributed across data centers and there was no synchronous commit between these data centers. So how do I do that and implement this seemingly simple pattern in that scenario?
Shawn Clabough (24:08.783)
I want to ask one thing though. We mentioned idempotency quite a bit and a lot of our listeners are probably new developers and things like that and never heard that term before. So can you kind of define what that is for those listeners? Yeah.
Tomek Masternak (24:22.174)
Okay, I will do my best. So maybe I will answer how I think about the item potency and how I understand it. So the term is sounds way too complicated than it actually is, I think. And one way to think about it is if you have an API, if you have a REST service, if you have whatever, which gets messages and sends back responses,
The idea is that when you provide a message twice, you will get exactly the same response. So idempotency means is that your operation is not going, if you duplicate your operation, that operation is not going to change the state or have additional effect.
for the second, third and fourth time when you execute it. So one of the examples that are often given is from the HTTP space. So sometimes it is said that get operation or get verb should be always idempotently handled on the server, which means that if I go to the browser and I type in some address, no matter how many times I click F5,
it should not make any changes on the server. So it's like, it's only that one time which actually matters and not the other executions.
Adam Furmanek (26:04.051)
I think... Go ahead Mark.
Mark Miller (26:06.98)
Is it fair to say, like in the example of the range, that item potency would be an example of that would be move the range to a particular absolute position?
and something that would not be item potent would be move the range of relative value. Is that is that a simple example? Would that count or would you still get into problems with moving to an exact position if I had duplicates and something in between?
Tomek Masternak (26:38.486)
So I think that you touched on the cracks of the problem. So I think that if you are talking about HTTP communication and when the operations are sent one after the other and duplicates can be only like in batches, then go to a relative position versus go to the absolute position solves the, like it's item potent. Because what it means is that I can have a situation in which I have an operation which says, move to position five.
and then move to position five, move to position five. And those are my duplicated messages. And then move to position 10 and we are good. Now, the problem with the persistent messaging is that you can have duplicates, duplicates which are interleaved with some other messages. And what it means is that I can have a sender saying, move to position five.
and then sending a message move to position 10, but the receiver can get five, 10, five. And from the receiver perspective, the content of message itself, it's not enough to figure out whether it's a duplicate or not and whether we should apply the idempotency behavior on top of that. So what usually happens with messages and that's like, I guess the de facto way to do it is that your message
needs to have an additional identifier, which is a logical identifier on the message. And then you can see that those two fives are actually the same logical fives. And like if the identifier is different, those are actually not duplicates. And based on that, you can provide that idempotency behavior.
Adam Furmanek (28:32.671)
One question I might have to that when you say idempotency and first time I learned about that I was thinking that it kind of means stateless operation or stateful which results in the same way. Like you mentioned this get request that it shouldn't kind of change that resource behind the scenes but it's not necessarily the case isn't it right? I mean for instance caches can change.
And it's not that it must be stateless, that the GET operation and cannot change state at all. It may change state behind the scenes, but from the point of view of the observer, the state is equivalent, not necessarily identical, but equivalent.
Tomek Masternak (29:18.478)
Yes, that's true. So I think that there is a piece that I would call a magic trick in all of that. So the library that me and Szymon Pobiega created is called Exactly Once Processing. But, and just bear with me, Adam, I will get back to what you said, and maybe it's a long derail, but just bear with me.
but you actually cannot have an exactly once processing or exactly once execution. So you cannot write a distributed system in such a way that you can guarantee that only a single thread only once will execute this path in your program. What you can guarantee is that observable state that is accessible to the outside observer is going to be as if...
only one execution happened. So getting back to your example, Adam, definitely the situation might be so that I get a get and I don't have something in my cache. So I need to pull that cache and get the result. But as long as no matter what happens inside, the result that the client sees is always the same, then it's idempotent. Like because I cannot tell the difference from the scenario in which it did nothing.
versus it actually grabbed it and computed and whatnot. And that's actually an interesting question as well, because there is another way to implement item potency as well. So the way in which it can be implemented as well is that you can store the state of your system as a historical snapshot, basically, historical snapshots.
And for instance, what you can say is that when I get a command that I need to execute, I get the current state, I compute the next state, I store that. And when I get a duplicate of that message, I get to the state of my whatever, as it was when I first time executed that message, I run the business logic and get the results. And as long as my business logic is deterministic and depends only on that snapshot,
Tomek Masternak (31:42.794)
the outside results will be the same as well. So it's definitely the case that there are some magic tricks that you can put inside, as long as you can make it so that from the outside, it looks as if it's the same, then it's idempotent. So I guess the idempotency is the behavior that you see from the outside. It can be implemented in various ways. And that snapshotting approach,
It's something that can be built, for instance, pretty easily when you have even sourcing. When you store the state of your business data as a stream of changes or events or whatever you call it, but basically deltas that change the thing that you store in your system.
Adam Furmanek (32:32.407)
Okay, so we got a lot. So we had atomicity initially, we had some threads that were running in between, then we diverged into messaging, item potency, the magic trick as you called it. So now how do you make it simpler with TLA+, because honestly I have no idea.
Tomek Masternak (32:55.97)
Okay, so I think that you can make it as simple as it can be, I guess. It's like, and to be fair, I struggle with all of that all the time. And the reason that I went into the TLA class direction is just because I couldn't think about it. Like, I had that continuous sense of insecurity, like, is it going to work or not? I really need to figure out the way that will be like a safety net for me.
And what you can do with TLA+, is that you can extract the key parts of your implementation. Let's say you're implementing the Algebrax pattern. You can have it implemented, let's say in five classes, but you identify the crucial bits and then you get those crucial bits, you rewrite them in TLA+, and then you say things like that. You cannot have two different answers.
to the same request that comes to your system. Or you cannot drop any message. So you cannot respond with nothing. You give it to the machine, to the machine that checks it. And after going over those scenarios, doing some crunching, that machine is going to tell you whether it makes sense or not, whether what you're saying holds or not.
And I think that when I first heard about model checking, it sounded very exotic to me. And my initial idea was that seems very academic. Usually when I heard about this like magic, something that is going to prove whatever, my initial reaction was, yeah, but it's not going to be very useful in practice. But I think that...
TLA plus actually is close to that sweet spot between the amount of effort that you need to put and the results that you're getting. So just to give you some examples of where in other places in the industry TLA plus is used. So the team that implements Cosmos DB, they were using and are using for what I can tell TLA plus very extensively. People at AWS are using TLA plus pretty extensively.
Tomek Masternak (35:19.674)
People implementing MongoDB are using TLA+. I think Cassandra team is using that. So whenever you have a part of your system that you really need to figure out and make sure that work correctly, that's a good investment. And... at particular, when we are building a service bus, recently, I think it was...
probably a year ago, we are implementing a variation of Outbox pattern and that messaging logic. And we were doing that in a group with my friends, with Daniel, Leila and Shmon. And they never had any experience with TLA+. And they are great developers, great professionals, but I think that what I'm saying will extend to all developers out there.
I think that after five days, they were able to understand what TLA plus is, figure out the bits and pieces, and I'm pretty sure that they would be operational in that. So the effort to actually get into that space, especially now, is not that big. I think that the resources are there. There are a lot of great tutorials that basically handhold you into creating your specification.
So whenever you get into that situation when like, I just don't know whether it's going to work or not, I think that that's a good moment that which you can think about maybe TLA pluses is the way to go.
Mark Miller (36:59.126)
Tomek is it fair to say that TLA plus can essentially find a problem. You can prove that you have a problem but it can't prove that you don't have problems. Is that true? Okay.
Tomek Masternak (37:12.41)
Yes, yes, exactly. That's true. And so that's exactly what you said. There is a part which is called proof theorem prover, which is a bit of a different beast. But I think that at least my experience is that to actually get that thing right, it's way harder. And I think that even Leslie Lampert is not.
encouraging developers to go into that direction. So it's 100% correct. It can prove you wrong. It cannot prove that you're right. Now, there is in the community of TLA+, there is something which is called a small model hypothesis. And that small model hypothesis is a hypothesis, but it's something which is based on experiences of people using TLA+. And it basically says that
you don't have to specify or maybe a bit differently. You don't have to check your system as if there are a lot of moving parts in it. So to be more concrete, just not to get into the weeds. Let's say that you can, the model checking for messaging could be, I have a single endpoint that has two threads. It starts off with
three messages in it and two records in the database. And an argument could be made, but this is a toy example. Like in your normal system, you will have hundreds of threads, millions of rows, gazillion of messages, and I'm exaggerating, but like what I gave is a trivial example. But the small model hypothesis says is that if you have problem in your system, the model checker will find it in a very small model. So,
Basically, the experience in the community is that there is yet to be a situation in which your small model passes, and then when you extend it, it actually fails and the model checker finds a problem. So it's not a silver bullet, but the experiences in the community show that it's actually useful and helpful.
Tomek Masternak (39:31.486)
And to give you, I think a pretty interesting anecdote. I don't know if that is a practice still in the Cosmos DB team, but from what I could tell, the way that they operated was that they would always have a specification for the key parts or crucial parts of their system. And whenever they would have a bug, they would first fix it in the specification. So basically they would fix it in the algorithm.
and only after that they would apply it into the implementation.
Tomek Masternak (40:07.11)
I think it could be very stressful, especially if it's like a critical and you need to start with the TLA plus first. But once again, from what I could tell based on the presentations and informations that are available out there, it was very useful for Cosmos DB team.
Adam Furmanek (40:30.187)
So when you take this TLA+, you mentioned that you need to implement your problem in the TLA plus language. How do I actually know whether I re-implemented it correctly in TLA+, or whether I introduced a bug when translating my C-sharp to TLA+.
Shawn Clabough (40:49.299)
Yeah, related question to that is I was thinking where do you do you edit TLA plus and Visual Studio VS code or does it have some sort of other editor then I assume it has its own compiler right?
Tomek Masternak (40:49.73)
Uh...
Tomek Masternak (41:00.674)
Correct, so let me start with the tooling. So Visual Studio Code is the way to go. There is an extension and it does the job. That being said, I think that there is still some work that needs to be done in the developer friendliness. So you don't get IntelliSense, for instance, or the compiler errors might not be very, very friendly. And sometimes you need to spend some time to figure out what's wrong.
But basically the workflow, at least for me, is that I would have a Visual Studio opened on one screen and I would be looking at crucial parts. I would have Visual Studio code on my second screen. And then I would be translating what I have in C-sharp to TLA+. And then compiling TLA+, giving it the conditions to check.
and running the model checker. And the model checker would tell me whether this holds or not. When it comes to how do I know if the translation is correct? The answer is you don't know, or maybe the other way around. There is no safety net or there is no, like you cannot extract model from the code. And I think that the main reason is that usually that specification
describes different things or more things than are just visible in the code. So when you look at the C sharp class, the state that you see expressed in the code are variables that are there, variables available in the scope in which your code operates. When you look at the TLA class specification, you're describing the state of the whole system.
So that specification is going to model what is there in your database, which is not visible in C-sharp code. What is there in your message queue, which is not visible there. How many threads are executing, which is not visible in your C-sharp code. So that specification for distributed system space by definition talks and describes more than you can see in the code itself. And the other thing,
Tomek Masternak (43:26.75)
to bearing in mind is that if you want to create a model in TLA plus a specification, which is useful, you need to know your system in and out. So you need to know when an exception can be thrown. You need to know what are the transactional guarantees of the messaging system and your database. You need to know what concurrent bits of your program are actually can interleave.
like what is the possible concurrency in your code? So it might seem that the model is simpler, but to get to that simplicity, it actually takes an experience and good understanding of your system. One of the, I guess, ways in which I think about it is more or less this, that we as software developers, we are doing modeling. And...
The way that I understand modeling is just removing parts of the reality, which are not important for the problem that we are solving. And what is the problem that we are solving when we are modeling our system in TLA+. What we want to express is the behavior which is important from the what can go wrong. And this is basically concurrency that happens in your system and independent failures that can happen in your system.
So you need to know what is the behavior of your system from the perspective of concurrency and what can go wrong. And only with that information, you can actually specify your TLA plus model in such a way that it's useful, meaning that it will really reveal some of the problems that might be in your system.
Mark Miller (45:16.738)
So it's possible when I'm modeling something in TLA+, I'm imagining it's like a class, and it's got properties. And one of the properties might be, these are the conditions on where, or these are the kinds of failures this particular class could incur. Is that kind of true? Is that what it's?
Tomek Masternak (45:37.727)
Yes, so if you use the, let's say the C syntax, what you can do is you can say when the system reaches this point and executes, let's say the operation that picks up a message from the queue, it can either do that successfully or it can fail and then it goes into that state. So basically your code need to,
that model or that code expresses alternatives that can happen at various points in execution. And I know that what I'm saying is very hand wavy and very abstract, but in the GitHub repository for exactly once, we have a very short workshop, which I think consists of six or seven exercises or steps that goes through how do you...
create a specification for outbox and how you model check that. So if anyone is interested, I think it should take roughly an hour, an hour and a half to complete. Just to get the taste of what it takes to write that specification, it gets you on the journey of, this is the problem that we want to solve. And those are the steps in which you would basically start checking your system. And the way in which it is structured is that
It starts with one idea and then the model checker says, oh, there's a problem there. So we modify the idea and it says, okay, now that's fine, but there is another problem here. And there are iterations in which we go. And at the end we end up with something which doesn't fail at least obviously.
Adam Furmanek (47:19.787)
That sounds really cool, so I think we can recommend our listeners to take a look at this workshop and see how to get it working. Any other final thoughts before we wrap up?
Adam Furmanek (47:38.859)
Guess.
Shawn Clabough (47:39.163)
Is there is there good learning resources out there? Somebody wants to pick this up?
Tomek Masternak (47:42.918)
Yes, so I would recommend Learn TLA+, by Hillel Weini. And I hope that I'm pronouncing his name correctly. If he's going to listen, I apologize if I butchered his name. But so that's a very good intro. And he actually wrote a book, which is a very good introduction to TLA+. It's called Learning TLA+.
Shawn Clabough (47:56.383)
Yeah.
Tomek Masternak (48:09.282)
There is book by Leslie Lampert, which is called Specifying Systems. I would not recommend it as the first reference material after doing some work with other materials. That's a good step. And Leslie also has an online lectures, which basically go through the basics and then to the more advanced bits. And I think that those are the main.
the main learning materials that I would recommend.
Shawn Clabough (48:41.963)
Okay, great. Let's move on to picks. Adam, what's your pick this week?
Tomek Masternak (48:46.035)
Okay.
Adam Furmanek (48:49.067)
Okay, so my pick for this week is actually something that we probably all know. It is called VirtualBox, but a kind of different way of using that. I was actually recently playing with how to run a virtual machine inside a virtual machine inside a virtual machine and I was very happy to learn that well you can do that. The only trick to do that is you need to take with like Windows
guest environment and you need to take an old enough virtual box that will not yell about hyper v or lack of hardware supported virtualization. So if you take versions let's say 5.2.24 which is pretty old now then you can actually run virtual machine inside a virtual machine inside a virtual machine.
And I do have at least one business case. I won't bore you with that, but I guess our listeners may have some others. So VirtualBox is the way to go. And to be a little bit more specific, VMware player couldn't do that, nor could Hyper-V. So that's why I call VirtualBox specifically.
Shawn Clabough (50:03.763)
Very inception-y, very inception-y, yeah.
Mark Miller (50:06.611)
Sean, is Adam like hacking into governments? What is he doing with this? This setup.
Shawn Clabough (50:08.967)
Hehehehehehe
Shawn Clabough (50:14.091)
It reminds me of a house I grew up with. In the bathroom we had two giant mirrors, one in front of you and one behind you, and you would just see this big tunnel of reflections as each one reflected into each other. So, wow. All right, Mark, what's your pick?
Adam Furmanek (50:30.027)
That was exactly it.
Mark Miller (50:34.282)
Netflix The Recruit. It's not bad. I'm I think three episodes into this and I'm liking it. You know, usually pilots, the first episodes are hard, you know, they've got some rough edges but this thing I thought was really well done. There's some hints that the main character is maybe a little crazy and it's great.
By the way, I watched Night Agent on your recommendation, Sean. And I actually think so far the recruit is in the same style, but I think it's better than Night Agent.
Shawn Clabough (51:12.947)
Well, okay. I'll have to check that out. Yeah, my pick this week is also another show. It's Ahsoka.
So spin off of Mandalorian. So I wouldn't say it's as good as the Mandalorian, but it's still, it's got its own story. It's very interesting and things like that. So if you want to watch something that deals with Jedis and Star Wars and even has a little bit of Darth Vader in it, I'll give it a little bit of a giveaway there. So check out Ahsoka on Discovery+. All right, Tomek, do you have a pick for us?
Tomek Masternak (51:50.778)
Yes, so my pig is a YouTube channel which is called EGianometry. So that's a person who is based in Taiwan, I believe, but has a very cool episodes on the transistor industry and how the microprocessors are made with modern technology. And also on the history of computing. So the rise and fall of some micro systems, how the first PC was made.
very interesting documentaries, very well made technically and for what I can tell very accurate technically. So every episode is roughly 25 minutes, so it's a perfect pick to when you have a pause and just want to decompress to hear something interesting. I would recommend that.
Shawn Clabough (52:49.035)
Great. Well, thanks to Mike for coming on the show. That was a very, very interesting and you did a great job. Yeah. If our listeners wanna get to the show, they wanna read out to us to have questions or have feedback, we'd love to hear from you. You can find us on X slash Twitter. I am.NET Superhero. I am also on threads if you wanna reach out to me there. Is there a way that they can get in touch with you to make if they have questions?
Tomek Masternak (52:55.638)
Thank you for the invitation. I really enjoyed that.
Tomek Masternak (53:17.414)
I think that at masternug on X, not Twitter, I always get it wrong, does the go to place. So anyone that wants to talk about the other class, don't hesitate, just reach out to me there.
Shawn Clabough (53:31.547)
Okay, great. And we'll catch everybody on the next episode of AdventuresIn.net.
Checking safety in Exactly-once, TLA+ and Stronger message processing - .NET 159
0:00
Playback Speed: