Why do you think Lambda functions are fragile? You get multi-AZ redundancy (3 AZs) out of the box, better security (no OS-level attacks), and auto-scaling. That’s a pretty good baseline resilience!
Also, invocations can last for 15 mins.
To stop the invocation from failing and cancelling out all subsequent recursions, it’s as easy as putting a try
and catch
block around individual units of work you’re doing. Of course, more complex work would require more thought-out error handling strategy, but that’s not specific to Lambda.
And, of course, write tests to validate your error handling behaviour!
These are not specific to Lambda, if you’re writing a long-running process that is going to run inside a container/VM, you’ll have to do the same things there too.