#24: Understanding Distributed Systems with Roberto Vitillo

Software World with Candost

Candost Dagdeviren

Oct 26, 2021

On this episode of the Software World, I welcomed Roberto Vitillo, Principal Software Engineer at Microsoft. We talked about distributed systems, the CAP theorem, writing a book, and growing a career in software engineering. Listen to the episode to learn more.

Listen to the episode to learn how you can get a chance to win a copy of Understanding Distributed Systems.

All show notes, transcription, and links mentioned in the episode are available on https://mediations.candost.blog/s/podcast.

Don't forget to follow & mention @ravitillo on Twitter.

Transcript

[00:00:00] Candost Dagdeviren: Hello friends. Welcome back to Software World with Candost. Today my guest is Roberto Vitillo. Roberto is a principal software engineer at Microsoft. He has more than 10 years of experience in the tech industry in various roles, such as software engineer, tech lead, and manager.

Prior to joining Microsoft, Roberto got his master's degree in Pisa and worked on scientific computing applications at Berkeley Lab.

The software he developed is still in use in the ATLAS experiment at the Large Hadron Collider. After that, Roberto joined Mozilla and worked in a variety of engineering roles. He worked on setting the direction of the data platform from its early days and built both the team and the platform. While at Mozilla, Roberto also graduated from Harvard Extension School with a data science certificate.

Roberto currently works at Microsoft and leads the Aria pipeline, which is one of the largest data telemetry pipelines in the world. It processes millions of events per second from billions of devices worldwide and supports many products such as Windows, Xbox, Office, Bing, and Skype.

With this extensive experience, Roberto wrote a book called Understanding Distributed Systems, which is also the topic of today's episode. After I read the book, I approached Roberto to talk about distributed systems. During the episode, you will hear about many things, from the definition of distributed systems to the CAP theorem, and from memory leaks to writing a book and growing a career in software engineering. At the end of the episode, you will hear how to get Roberto's book as a gift from me. So stay tuned until the end. Now it's time to listen, learn, and enjoy.

Welcome, Roberto. I'm really glad that you're here, because after I read your book I realized I had learned a lot of things I had no idea about. There are many things I still have no idea about, but that's on me, not on you.

This is the book that helped me understand the big picture of distributed systems. Now I can comfortably go into more detail and read other books to get into the specifics. But this was a perfect start for me as a beginner, and this is the reason why I wanted to invite you to talk a little bit about distributed systems.

Most probably I will learn more about distributed systems and other things from you.

Welcome.

[00:03:05] Roberto Vitillo: Yeah, thanks for having me. I'm very glad to hear the book was actually useful to you. It sounds like it really hit the nail on the head for what I was trying to achieve: giving an overview of the whole field to someone who maybe doesn't have all the know-how yet. So yeah, very happy to hear that you like it.

[00:03:24] Candost Dagdeviren: Yeah, thanks a lot. It's a great book. We will talk about the details of the book later on, but I want to start by asking you my dumbass questions, beginning with one that looks simple but that I know is sometimes difficult to define: what is a distributed system, and what conditions must a system meet to be called distributed?

[00:03:46] Roberto Vitillo: Yeah, that's a good question. And I'm actually not going to answer it directly straight away. I think we need to go back a little bit. Why do we even need such systems? Let me start with the why.

So, I think in the late nineties, probably only 2% of the worldwide population was online. Then of course, as you know, over time more and more people joined the internet, and I think now it's for sure over half the worldwide population. What that means is, if you have an online service, if you have an online business, then the total addressable market that you have keeps growing, right?

And potentially you have more and more users. And this is where distributed systems come in. Eventually you will hit the point where your little Apache server won't be enough anymore to serve all your users, and you will need to somehow find a way around it. For a long time, the answer was: oh, let's just buy a bigger machine. Which makes sense. If you're a small company or one person, that's the first thing you will do, right? You're not going to go build some crazy distributed system just for the sake of it. But eventually, even there, you won't find bigger hardware anymore.

Things won't scale anymore. And this is where distributed systems come in. The idea is to take off-the-shelf desktops or computers, put them together, and build a computer that can serve more traffic than any single one ever could. That, I think, is the initial need for distributed systems as I know them. The thing that happens next is this: say you have one box, and that machine has a probability of failing once a year. Now, if you have a hundred machines, then on any given day the likelihood of any one of them failing is higher than when you had just a single box. So the more components you add, the more total failures you will see. So the next thing that happens is, inevitably, you have to deal with failure.

It will just be a constant in your life as an engineer working on such systems. And that's actually where my expertise lies, and where I, and probably most people in the field, spend most of the time: how do we make systems resilient to failures? That's the next challenge, dealing with them.

And then finally, the last one I will mention is the operational side. Once you have all these components running, how do you release a new change, confident that you won't break anything? How do you change configuration after the system is already running, without redeploying everything? All of these things become very important once the scale increases.

So yeah, that's kind of my answer to you about what a distributed system is. There are many answers, of course. I guess if you work in a different field, you might have a different idea. I've spent my career on online applications, that sort of thing, so that's probably where my answer comes from.
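The point about failures becoming more likely as machines are added can be checked with a little arithmetic. The sketch below is only an illustration: it assumes each machine fails on a given day with probability 1/365 and that failures are independent, neither of which comes from the conversation. The function name `p_any_failure` is made up for this example.

```python
def p_any_failure(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p_single) ** n_machines


# Assumption for illustration: a machine that fails about once a year
# fails on any given day with probability 1/365, independently of the others.
P_DAY = 1 / 365

for n in (1, 100, 1000):
    chance = p_any_failure(n, P_DAY)
    print(f"{n:>5} machines: {chance:.1%} chance of at least one failure today")
```

Under these assumptions, a fleet of a hundred machines sees a failure on roughly one day in four, which is why failure handling stops being an edge case at scale.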

[00:06:41] Candost Dagdeviren: I like the way you started, actually, with 1999 or so, the beginnings of those things. As far as I remember, the whole internet thing started in the United States, maybe as a military project.

And now, when I think about it, the web itself is a distributed system.

[00:07:02] Roberto Vitillo: It is.

[00:07:04] Candost Dagdeviren: And when I think about it, doesn't that make every application using this system also a distributed system?

[00:07:12] Roberto Vitillo: Yes, it does. That's a very good point. Let's say you're a JavaScript developer. Even as a JavaScript developer, you're actually dealing with distributed systems. You're making remote calls to some remote server, which might or might not be there, and what happens when it's not there? So yeah, we are all kind of distributed systems engineers in this day and age.

[00:07:34] Candost Dagdeviren: Yeah. So we have to learn the fundamentals. I think that's where I'm trying to go. There is no other way, because I've been working with software that uses the web for my whole life. And now I sometimes blame myself: why didn't I learn all the details until now? Because those are the things that help shape your mindset a little bit. I'm not talking about knowing every single detail, of course, but just knowing the fundamentals and being aware of them changes your mindset a lot.

[00:08:08] Roberto Vitillo: It does. Yeah, I feel like sometimes, even if you read fundamental stuff without using that knowledge right away, it sort of stays in your brain somewhere. And then eventually, when you need to design or build something else, you somehow have these crazy intuitions. You don't really know where they come from, but I think your mind absorbs everything and then puts everything together, sometimes in the most unexpected ways. So yeah, fundamental knowledge is extremely important.

[00:08:40] Candost Dagdeviren: Yeah, these are usually helpful when we are faced with challenges. And now I'm curious: what are those challenges in distributed systems? For example, when I develop a mobile application, I know that the network can be very slow and I need to react to it, et cetera, but I have the perspective from the outside.

I know it from developing a mobile application or a web application, but I don't have the intuition from the backend side. So what are the big challenges of developing these distributed systems?

[00:09:10] Roberto Vitillo: Yep, that's a good question. I would say the first one is scaling an application, or scalability. I want to start with this one because it's probably the most obvious one. That's what everyone thinks: oh, I need to scale, distributed applications are all about scale. Even the system design interviews that popularized large-scale system design always start with scale.

The thing is, scale is just the beginning of the journey. As I was mentioning earlier, the more components you add, the more the number of failures will increase, and that brings new challenges with it. So one of the biggest challenges is dealing with faults. How do you stop a fault that happened on a single machine or two from propagating to other machines and other services, and eventually bringing down the entire system with it? There are many reasons why things fail. Of the most common root causes of failure, the one that is top of mind is definitely networking, or communication. As you just said, even when you build a mobile app, you have to think about the network, and in a distributed system even more so, where your means of communication with other components is the network. And the network is just unreliable. So the first thing you learn as you try to build a scalable application is that your remote call might never return.

The service you depend on might just not be there anymore. Your DNS name might not resolve to anything. That's why the first part of my book is all about the network. Initially I wasn't really sure: should I even put this in a distributed systems book or not? It's kind of orthogonal knowledge. But it felt like engineers sometimes lack a deep understanding of how the network works, and I wanted to put in a short summary to appreciate how this whole thing works, also because the network itself is a distributed system, and there's a lot to learn by understanding how TCP works and, yeah, the core protocols of the Internet.
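One common way to cope with a remote call that "might never return" is to put a timeout on it and retry a bounded number of times with exponential backoff. The sketch below is an assumption-laden illustration, not something described in the episode: `RemoteCallError`, `call_with_retries`, and `flaky_call` are hypothetical names, and the "remote call" is simulated locally so the example runs without a network.

```python
import random
import time


class RemoteCallError(Exception):
    """Raised when the (hypothetical) remote call fails or times out."""


def call_with_retries(call, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Invoke call(); on failure, back off exponentially with jitter and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except RemoteCallError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # The delay doubles on each attempt; random jitter keeps many
            # clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def flaky_call(failures):
    """Build a stand-in remote call that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def call():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RemoteCallError("timed out")
        return "ok"

    return call


print(call_with_retries(flaky_call(2)))
```

Retries only make sense when the call is safe to repeat, and in a real client the call itself would also need its own timeout so that a hung connection turns into an error in the first place.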

[00:11:22] Candost Dagdeviren: When I read that chapter, I said, wow, on every page there are like five things I didn't know. And when I went on to the second chapter, the third chapter, scaling, resilience, et cetera, I realized: okay, now I know what you're talking about. Those were the fundamentals. One thing you mentioned there is resilience. We are recording this on October 9th, 2021, and we just had a really recent, big outage at Facebook. I'm not sure if you know the details, and they're not important right now, but there were also outages at Slack, Cloudflare, and AWS this year. And these are all distributed systems at scale. I think it's closely related to what you're saying, because from my perspective, I expect those services at scale to be more resilient, not to have these big, big outages. Facebook just disappeared from the web, right?

It was not even reachable. That's the point you're making. So why do we keep having these outages? What do these outages tell us?

[00:12:35] Roberto Vitillo: That's a good point. To answer your question, I feel like it's all about correlated failures. You would think that by adding more machines to your application, you make it more resilient. I have ten machines; if one of them fails, I still have nine left, right?

That's what you would think in theory, but the problem is that this argument applies only if the failure affects that single machine. If you have a faulty RAM module, then you'll probably just restart the machine. Although, in some cases, if you bought all the RAM modules at the same time from the same brand, they might even fail around the same time, which is also kind of crazy. But think about how DNS works: it doesn't matter how many machines you have in the backend, or how resilient your whole infrastructure and architecture are, if outside users can't resolve your DNS name to an IP address. They won't be able to connect to your system at all.

So a failure in DNS brings down the entire system with it, and the reason is that it has such a high blast radius. And I'm sure the people working at Facebook have thought about this, and those things still happen. It's inevitable. We try our best to make them not happen, but sometimes some weird sequence of events creates the conditions for a failure to have a crazy blast radius and bring down a large chunk of the system with it. So usually what happens is, as the system grows, you have incidents like those, and you have a post-mortem. In the post-mortem there is generally a very long discussion about learning how to avoid such mistakes in the future but, most of all, about how to reduce the blast radius. So a very good question to ask is: okay, we had this failure; how do we make sure that the next time this happens, rather than bringing down a hundred percent of the fleet, it only brings down half of the fleet, just to give you an example?

How do we reduce the blast radius? Over time you will see more and more failures, you will come up with better ways to be resilient against them, and the application becomes more and more robust. But the probability of failing never goes down to zero, and these things can happen.

[00:14:56] Candost Dagdeviren: So it's not a matter of if it's going to happen or not, but it's a matter of when it's going to happen.

[00:15:01] Roberto Vitillo: Yes. There is always some weird thing. When you have a large system with so many components, it kind of has a life of its own. It's very hard to think about all the possible states that a system can get itself into. Humans try their best, but the reality is that there are so many possible edge cases that, when combined, can trigger some weird outcome you've never thought of.

[00:15:27] Candost Dagdeviren: Yeah, there are drills, right? When I read Facebook's explanation of why it happened, they also mentioned that they do a lot of drills, taking down systems to test whether they are really resilient, et cetera. These are a couple of strategies. But for software engineers like me, maybe a mobile engineer, a web engineer, or anyone who is not directly developing distributed systems on the backend side, how can we approach being more resilient? What's the mindset here? What strategies should be built into system design and the development process for these distributed systems?

[00:16:05] Roberto Vitillo: I guess the mindset is, in general, thinking about what happens if this component fails. For a mobile app, I don't think you will think about what happens if the phone fails. Well, if the phone fails, tough luck. But you will think more about: what happens if the server fails? What happens if the auth provider fails? You start to think about the things that can go wrong and how your system should react to them. What I usually do at design time is try to list all the failures I can think of, and I try to rank them by risk. One way to describe risk is the probability of the failure happening multiplied by the impact the failure will have if it happens. That helps me understand where I should spend my time architecting for resiliency.

For example, if a failure has a very low probability of happening and the impact is minimal, I won't even care. On the other hand, if the impact is high, even if the probability is low, then maybe it's a different thing. Thinking of typical components that have high impact, I would say single points of failure are top of mind: any component of which there is only one instance, so that if it goes down, the whole application is no longer available. DNS is a single point of failure, because if DNS is down, users can't connect to your system. Of course, DNS itself is also made resilient: you have multiple DNS servers, not just one, and you have DNS services in multiple data centers, so that even if one data center is down, the other ones can take over, and so on and so forth. But yeah, that's kind of the mindset I have.
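The ranking described above, risk as probability multiplied by impact, is easy to sketch. Everything in the example below is hypothetical: the `Failure` class, the failure names, and especially the numbers, which are invented purely to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    name: str
    probability: float  # estimated chance of happening in some period, 0..1
    impact: float       # estimated cost if it happens, in arbitrary units

    @property
    def risk(self) -> float:
        # Risk as described above: probability multiplied by impact.
        return self.probability * self.impact


def rank_by_risk(failures):
    """Return the failures sorted from highest to lowest risk."""
    return sorted(failures, key=lambda f: f.risk, reverse=True)


# All numbers are made up for illustration.
failures = [
    Failure("auth provider outage", probability=0.05, impact=80),
    Failure("DNS resolution failure", probability=0.01, impact=100),
    Failure("single server crash", probability=0.30, impact=5),
    Failure("cosmetic rendering glitch", probability=0.02, impact=1),
]

for failure in rank_by_risk(failures):
    print(f"{failure.name:<28} risk={failure.risk:.2f}")
```

The output order is only as good as the estimates fed in, so a list like this is mainly a tool for deciding where design time goes first.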

[00:17:49] Candost Dagdeviren: That's a good mindset. Now I want to pull back a little and come to the part that is more related to coordinating systems, because we are talking about multiple machines. Whenever I take a look at distributed systems, I see one theorem, the CAP theorem, which is always there. Everyone talks about it, and everyone says it's not solvable. From what I read, the CAP theorem is about picking two things out of three, which are consistency, availability, and partition tolerance, and sacrificing the third one. But when I was looking at the table of contents of your book before reading it, I saw one title, solving the CAP theorem, and everyone had told me this is not solvable.

That was the first thing I read, by the way. And I'm curious: again from my beginner's perspective on distributed systems, can you explain the approach to solving this to the people who are listening to us?

[00:18:55] Roberto Vitillo: Yeah. So I picked the catchy title on purpose. The CAP theorem applies to a specific type of consistency, which is strong consistency, but it doesn't say anything about other types of consistency. Strong consistency means that if you perform an operation, let's say you make a request to your service to do something, then when you get back a response, you can assume that everyone else will see the side effects of that request. That's one way to describe strong consistency. But there are other types of consistency; for sure you're aware of eventual consistency. In that case, the CAP theorem doesn't really apply as stated. So when I used the title solving the CAP theorem, I really meant working around it. That was it. In fact, the whole section talks about how to leverage other types of consistency to build systems that are available, consistent (though not strongly), and partition tolerant.
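The difference between the two kinds of reads can be shown with a toy model. The sketch below is a deliberate simplification and rests on assumptions not made in the episode: a single primary, one replica that is updated asynchronously, and no failures at all. The `Store` class and its method names are invented for this example; real systems need far more machinery to provide either guarantee.

```python
class Store:
    """Toy key-value store: one primary and one asynchronously updated replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes acknowledged but not yet copied to the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_strong(self, key):
        # Reads from the primary always observe the latest acknowledged write.
        return self.primary.get(key)

    def read_eventual(self, key):
        # Reads from the replica may be stale until replication catches up.
        return self.replica.get(key)

    def replicate(self):
        # Apply the queued writes to the replica in order.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = Store()
store.write("likes", 1)
print(store.read_strong("likes"), store.read_eventual("likes"))  # replica is stale
store.replicate()
print(store.read_strong("likes"), store.read_eventual("likes"))  # replica caught up
```

The stale read in the first `print` is the price of letting the replica answer without coordinating with the primary, which is also what lets it keep answering when the two cannot talk to each other.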

[00:19:52] Candost Dagdeviren: Yeah, I still remember that part, and I think I have a lot of notes on it as well. One thing that I realized there is that, as we said a moment ago, the network will fail, so we cannot avoid that one. It's usually about choosing between consistency and availability.

Anyway, this is one thing that surprised me when I was reading the book. One other thing you said in the beginning is that distributed systems require big operations and operations management behind them. We have these multiple nodes around the world, and we need new ways to manage and operate those systems. So what are the best practices? And what are the must-haves among those best practices to manage all of this?

[00:20:39] Roberto Vitillo: I would say the first thing that comes to mind is strong observability, or monitoring if you prefer that word. The first thing is that you need to know what's going on. You need to monitor your system. How many failures are users seeing? How many machines are down? How many requests are you processing? How many exceptions has a specific service encountered?

You pretty much want to know everything there is. Even if you think it might not be useful, maybe during an incident that information will be useful. So logs and metrics are very fundamental techniques that help you understand the system when it's running. And what you can do with those metrics is build alerts on top of them, so that you can be paged when something that shouldn't happen is happening. Those are very simple techniques, but I've talked to many people who worked in places where even the most basic metrics weren't there, or no alerts were there. And so what happens is that the entire system fails and they find out that their service is down because a customer emailed them. That's definitely not something you want to happen. So yeah, those are the bare minimum you should have. It's very simple: there are many tools these days to get metrics and logs out of the box, and you can instrument your own application. Once you have that bare minimum, you also need to think about how you release changes.

Because once you have a large application in production, releasing any change is very scary, right? You can bring down everything with it. So how do you release a change in a safe way? That's where deployment pipelines come in, with staged deployments where you first deploy to a very small fraction of your fleet.

You let it bake there for a long time. You use metrics, logs, and alerts to make sure that nothing bad is going on, and then you slowly continue your deployment. That's something you need to be aware of when operationalizing a big distributed system. The other one, which I would maybe call manageability, is how easy it is to change the system without any code changes while it's running. You might have to make some changes at run time because there is some kind of failure going on, so you have to reconfigure your system. Is it possible to do it without waiting a day for a deployment? So you need to have some system in place where you can make configuration changes in a safe way. They also need to be rolled out slowly, but you can make those changes without redeploying the entire application. So that's another concern as well.
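The staged deployment idea, deploy to a small fraction, check health, then continue or roll back, can be sketched as a small control loop. This is an illustrative sketch only: `staged_rollout` and its callbacks are hypothetical, the stage fractions are arbitrary, and the health check is a stand-in for waiting out a bake period and then consulting real metrics and alerts.

```python
def staged_rollout(stages, deploy, healthy, rollback):
    """Roll a change out in growing stages, stopping at the first unhealthy one.

    stages   -- increasing fractions of the fleet, e.g. [0.01, 0.1, 0.5, 1.0]
    deploy   -- callable(fraction): expose the change to that share of the fleet
    healthy  -- callable() -> bool: in practice, wait ("bake") and then check
                metrics, logs, and alerts
    rollback -- callable(): revert the change everywhere
    Returns True if the rollout completed, False if it was rolled back.
    """
    for fraction in stages:
        deploy(fraction)
        if not healthy():
            rollback()
            return False
    return True


# Illustrative run with stand-in callbacks: the health check fails at stage two.
deployed = []
checks = iter([True, False])
ok = staged_rollout(
    stages=[0.01, 0.10, 0.50, 1.00],
    deploy=deployed.append,
    healthy=lambda: next(checks),
    rollback=lambda: deployed.append("rollback"),
)
print(ok, deployed)
```

Because the bad change was caught at the second stage, at most a tenth of the fleet ever ran it, which is the blast-radius reduction the staged approach is after. The same loop could gate configuration changes as well as code.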

[00:23:10] Candost Dagdeviren: What I always saw was a restart.

I've never seen such systems, or maybe we have them and I just have no idea about them. But whenever something was not working, it was: okay, right, let's redeploy. And it takes 30 minutes.

[00:23:24] Roberto Vitillo: You're saying something there. The restart is an interesting thing, right? Many times it works. Some of our services actually have what we call the defibrillator, which is like a watchdog process that checks whether the process actually handling requests is doing something.

It might be stuck for whatever reason. Maybe there was a memory leak and the process slowed down so much that it barely processes anything anymore. Or maybe there's a socket leak, it can't open connections anymore, and the process basically stops. So this defibrillator is very stupid: it just checks whether the process is doing anything.

And if it feels like something is off, it just restarts it. Very dumb, very helpful. We use C# at work, and we used Java before, Scala, whatever. Even with managed languages that have garbage collection, you can introduce memory leaks. So these things will happen.

I cannot predict all the possible memory leaks that will happen. I'm sure somehow they will get in in the future. So this dumb restart mechanism helps you buy some time to actually go and fix the root cause, rather than having the entire service just stop working at some point because every process ran out of memory. So these things are silly, but they sometimes work quite well.
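A watchdog of the kind described here can be very small. The sketch below is not the implementation discussed in the episode, whose details are not given; it is a minimal illustration that assumes the worker reports progress through a `heartbeat()` call and that the watchdog's `check()` is invoked periodically. The `Watchdog` class, the 30-second threshold, and the injected `clock` are all choices made for this example.

```python
import time


class Watchdog:
    """Restart a worker when it has made no progress for too long."""

    def __init__(self, restart, stall_seconds=30.0, clock=time.monotonic):
        self.restart = restart            # callable that restarts the worker
        self.stall_seconds = stall_seconds
        self.clock = clock                # injectable so the logic is testable
        self.last_progress = clock()

    def heartbeat(self):
        """Called by the worker each time it completes a unit of work."""
        self.last_progress = self.clock()

    def check(self):
        """Restart the worker if it looks stuck. Returns True if it restarted."""
        if self.clock() - self.last_progress > self.stall_seconds:
            self.restart()
            # Give the freshly restarted worker a full window before judging it.
            self.last_progress = self.clock()
            return True
        return False


# Illustrative run with a fake clock instead of real waiting.
now = [0.0]
restarts = []
dog = Watchdog(restart=lambda: restarts.append(now[0]),
               stall_seconds=30.0, clock=lambda: now[0])

now[0] = 10.0
print(dog.check())   # the worker has been quiet for 10 s: within the window
now[0] = 45.0
print(dog.check())   # quiet for 45 s: considered stuck, so it is restarted
```

The watchdog knows nothing about why the worker stalled, which is exactly what makes it cheap; it buys time rather than fixing the leak.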

[00:24:44] Candost Dagdeviren: Yeah, that's what I mostly use: I restart the service. There are some automated restarts with direct watching, as you said, but sometimes that doesn't work. And then instantly, without thinking about it, I just think: okay, let's restart. I remember this from the TV series The IT Crowd; there was an episode about it.

They were calling IT, and it's like: did you restart your computer? Oh, no, that's a good idea. And it works.

[00:25:11] Roberto Vitillo: Yeah. The thing about the restart is that your application has a steady state. Let's say an HTTP service has a steady state. It's the most common state, the one where everything is working fine, and the application spends most of its lifetime in it. The mental model your human brain has of your system is also the steady state.

When some weird thing happens, your application starts behaving in a different way. We call it a modality change. These modality changes are very dangerous, because you architected your application with the steady state in mind. So whenever you enter a new mode of operation, it's very dangerous. The restart kind of resets things. It brings you back to the steady state many times, but not always, and that helps. An example where this won't help: let's say you have a large-scale front end that is processing many requests from around the globe, and you have an entire region going dark.

Maybe there was an electrical storm or something that brought the region down, but for whatever reason, the region is down. So now all the machines in the other regions are taking over the traffic of the region that went down, and they're processing way more than they used to. Depending on how the system is architected, they can degrade, because they're running at maximum CPU and can barely handle the traffic. In that case, a restart won't help. It will probably make things worse, because during the restart you lose even more machines. So yeah, the restart sometimes works and sometimes doesn't. It depends on the situation, but it's uncanny how powerful this very simple mechanism is.
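The region-failure example comes down to capacity arithmetic. The sketch below uses invented numbers (four regions, 100 units of capacity each, 240 units of total load, a quarter of the machines restarting at once) and assumes load spreads evenly over whatever capacity is up; none of these figures come from the episode, and `utilization` is a name made up for the example.

```python
def utilization(total_load, regions_up, capacity_per_region, fraction_restarting=0.0):
    """Load divided by the capacity that is actually available right now."""
    available = regions_up * capacity_per_region * (1.0 - fraction_restarting)
    return total_load / available


# All numbers are assumptions chosen for illustration.
TOTAL_LOAD = 240
CAPACITY = 100

steady = utilization(TOTAL_LOAD, regions_up=4, capacity_per_region=CAPACITY)
one_region_down = utilization(TOTAL_LOAD, regions_up=3, capacity_per_region=CAPACITY)
down_and_restarting = utilization(TOTAL_LOAD, regions_up=3,
                                  capacity_per_region=CAPACITY,
                                  fraction_restarting=0.25)

print(f"steady state:           {steady:.0%}")
print(f"one region down:        {one_region_down:.0%}")
print(f"down + 25% restarting:  {down_and_restarting:.0%}")
```

With these numbers the surviving regions can still absorb the extra traffic, but taking a quarter of their machines out for a restart pushes utilization past 100%, which is the "restart makes it worse" case described above.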

[00:26:56] Candost Dagdeviren: In the end, we have to have monitoring to be able to understand what's going on. I think most companies or projects I've seen lack monitoring a lot, because you usually don't go and monitor things when everything is going fine. But whenever something goes wrong, that's the part that is going to help you identify it. There are so many other things, of course, that we didn't talk about. You put all that information and way more into the book. The book's name is Understanding Distributed Systems, and I now understand a little bit more than I did before. I was totally ignorant of distributed systems.

Now I know a little bit. But why did you write this book? I know it's a big challenge. I write articles on my blog, and even writing articles is a challenge for me. Writing a book, I can't even imagine. So what was the motivation to tackle this big challenge?

[00:28:00] Roberto Vitillo: Yeah, that's a good question. I actually didn't want to write a book initially. What I wanted to do was build a little product in my free time, a product through which I could learn a bit more about marketing and sales, and do very little coding. Actually, I didn't want to do any coding at all.

It was very deliberate. I was thinking: okay, what is the best way to learn all those other skills that I don't have, without spending too much time on the engineering side? So I thought: okay, I already have some knowledge. I get paid for that knowledge. Maybe other people will find it useful.

So then I said: okay, writing a book about something I know well will help me spend more time on these other things I don't know how to do well. That was the initial thinking. And initially, to be honest, it was supposed to be a system design interview class. That's how it started.

And I did all the things you should not do if you just want to make a product. The market for system design interviews is huge. There's a lot of demand, and people are willing to pay for it. When I started, it seemed like a good idea. But then I realized that in order to pass the system design interview, in many cases you don't actually need to be an expert in building such systems. Often what happens is that companies ask you these questions just because Google, Facebook, or Microsoft ask them, but they themselves don't have the expertise to evaluate you on these problems. So I realized I was doing a disservice there. I was going to teach people how to pass interviews, but I wasn't really going to teach them how to build the systems.

So I felt like: okay, I want to focus more on what the job actually entails. Maybe you don't need all this stuff for an interview, but if you're ever actually going to build such a system, then my book will be useful to you. Of course, the crowd of people who are interested in learning the fundamentals is very different from the crowd that just wants to pass an interview. It's much smaller, and that's fine. I didn't do it for the money. I just wanted to have the experience of building a product that people like. That's why I pivoted to this book, and it slowly became a monster of its own. It still is, because I keep updating it, and it's just a time drain. But hearing from people like you that they learned one thing they didn't know before makes me really happy. It means it's useful to someone.

[00:30:19] Candost Dagdeviren: Yeah, when I got this book, I must be honest, I was saying, okay, let's see the system design interviews, let's tackle this. And I just said, okay, to be able to pass this interview, or maybe to hold one myself to test others in the interview, I need to learn. And this was my starting point.

And I just saw this book. I don't know where, maybe on Gergely's website, I'm not sure. And I said, okay, this looks good. It's short enough, because other books, I don't know, some books I saw were 500 pages. It's just frightening. You know, it was like 500 pages to just pass the interview. I won't read it. I need the information, like the simple one, because I already have some knowledge about developing software systems, et cetera. I just need to be able to think in that way. That's why I bought this book. I must be honest, I passed two interviews thanks to this book. I did not change my job, but that was the challenge I put myself, like, okay, I'm going to try. And I just read this book, and I just watched some other videos on YouTube. That's all I did. And it helped. That's what I'm trying to say. And also, one thing is the language of the book is so fluent and smooth, and even though there are many technical terms that I didn't know, I understood them easily. How did you translate this technical language to a language which almost all software engineers can understand?

[00:31:41] Roberto Vitillo: I guess when you write a book, there are many reasons to write a book. Sometimes you just want to say, hey, look how awesome I am, and all this stuff. And you write these super technical things that very few people can understand. That's more, I guess, in academia. Maybe it happens more often there.

In my case, I tried to put myself in the shoes of a beginner or someone that is a very strong technical person, but maybe has not yet had the chance to work on such systems. Every time I write a sentence, I reread it a hundred times, I think. Okay, if I had to explain this to a new coworker that we just hired without previous experience, will that make sense to him? So I really strive to do that. It's not perfect, far from it. Every time I reread the thing, I realize something is not right and I'm changing and fiddling. So that's why it's also a very time-consuming process. The thing I dislike about writing is that when you write code, you have unit tests and you can refactor things easily. In a book, instead, if you change your mind about a term, man, you have to go back and change everything by hand. It's so time-consuming. It's just crazy how much time goes into this sort of simple thing.

[00:32:50] Candost Dagdeviren: This is why I like writing, to be honest. Like, yes, in software we get used to so many of these automations and unit tests around it. Okay, I mean, if I make a mistake, they will catch it, you know. It's like, if I change this and I break things, they will catch it. But in writing, there's no such thing.

If someone comes to you and says, then, you know, like, there is no monitoring as well. If you think about distributed systems, it's out of the things that you got used to. But one thing I realized is that once you want to grow in your career, you have to learn how to explain those technical things to non-technical people, to beginners. That's a must-have. Otherwise you really cannot learn and grow in your career.

[00:33:31] Roberto Vitillo: Yeah. As an engineer, at some point, it doesn't matter if you want to go into management or you want to stay an individual contributor, you need to be able to communicate really well and get buy-in from your peers, from coworkers, from people above you, and writing is an incredibly useful skill there. And I don't just mean being verbose. No, it's just being very clear and very direct, getting to the point, trying to make good arguments. So eventually, I think, to grow as an IC, that becomes a very, very important skill to have. I will say definitely much more than coding. I think there's some point where most people just plateau on the programming skills, because, you know, you are a good enough coder, I will say, that you can solve most problems. And there's not much you can improve there. But it doesn't matter how good in coding you are if you cannot convince other people of something or make good arguments for it, or even change your own mind with other ideas. So yeah, writing is very important.

The other thing about writing is it helps you put your thoughts together. I feel like only when you have to write something down, that's where you have to truly understand what you want to do or are trying to say. So it's a thinking tool as well.

[00:34:52] Candost Dagdeviren: Yeah. I like to say writing shapes your thoughts. And yeah, to be able to understand what you're thinking, you need to write it down. Otherwise they're just something in your head. You have no idea what thought is connected to the other one. It's just a cloud. And when you write it down, everything gets a bit more clear.

[00:35:10] Roberto Vitillo: This is something that, you write something down, and you realize you actually don't understand it.

[00:35:14] Candost Dagdeviren: Yeah. Thanks a lot. I really enjoyed our conversation, and it was such a pleasure to have you here. And for the people who are listening to us, here's a gift from me: if you send a tweet on Twitter mentioning me and Roberto, then you will get into a lottery, let's call it this way. And I will gift one ebook of Roberto's book to our listeners.

And thanks a lot, Roberto, for writing this book. I'm just holding the printed version of the book in my hand. So I'm kind of regretting that I didn't get the ebook version, because it has the updates, and I just have this one. But most probably I will buy the new versions as an ebook. Thanks a lot for writing this book. It helped me a lot to understand many things around distributed systems. And thanks a lot for joining and taking one more step, teaching me more about distributed systems and writing in this podcast.

Thanks a lot.

[00:36:07] Roberto Vitillo: Yeah, thanks for having me. You are a great host. It was a pleasure being here.

[00:36:11] Candost Dagdeviren: Before you go, it's just one second. As I said a minute before, I'm giving Roberto's book as a gift to one of the followers of the show. If you share the show on Twitter, mentioning me and Roberto, you will have a chance to win the ebook version of Roberto's book.

But if you don't have a Twitter, don't worry.

Just send me an email and write about what you liked the most in the show, and you will also get a chance. You can find my and Roberto's Twitter handles in the description or on candost.blog/podcast. Or you can send me an email at podcast[at]candostdagdeviren[dot]com, which you can also find in the description.

Additionally, don't forget to share the episode with one of your friends, colleagues or work friends. Until next time, take care.


Software World with Candost

Software World is a podcast for software engineers hosted by Candost. Every second Tuesday, Candost uncovers the journeys of people and software systems. I interview the experts or talk alone about software architecture, system design, feedback, software engineering leadership, careers, team management, processes, product and customer-centricity, and more.

Follow my blog at candost.blog, for articles and a podcast, and subscribe to my newsletter.
