As public clouds and Content Delivery Networks (CDNs) race to enable compute capabilities at the edge of their networks, software developers are no longer deploying just static files (video, images) at the edge, but significant application code as well. This represents a dramatic departure from current development practices in two respects. First, with this new breed of applications on the horizon, the security, performance, and feature requirements of the underlying execution platform grow commensurately. Second, the harsh reality of unexpected network failures at the edge forces every mainstream developer to become a distributed systems engineer. Distributed systems engineering multiplies complexity and difficulty manyfold, because partial failure (the unavailability of one or more services) can affect the system in unknown or hard-to-predict ways and can even lead to catastrophic cascading failures. This research proposes an approach to full-stack “resilience engineering” to enable secure, effective, and performant edge computation in NextG systems. Our work builds on WebAssembly, which is emerging as the common language-agnostic execution platform in new edge computing environments. We propose a robust, performant, and secure experimental runtime engine with support for instrumentation and rapid prototyping that will facilitate fault injection and program repair. On top of this foundation, this work proposes a set of programming-language-agnostic tools for fault injection, testing, and repair of resilience errors (errors that arise from the distributed nature of the system) in distributed and networked systems. These tools would allow software developers to quickly and effectively test and fix resilience defects before application code is deployed to users. The current state of the art, by contrast, is to deploy and hope for the best, without any indication of how the application behaves in the face of partial failure.
If widely adopted, the results of this work would, in turn, reduce the occurrence of catastrophic cascading failures that cause widespread outages in critical networked services.
Essentially all software that we use as a society is now networked, meaning that apps connect to multiple servers that coordinate to complete a task (such as online banking, searching for nearby restaurants, or streaming media). This project directly supports the development of dependable networked services. Today, apps are typically released to users without being tested for failures at the network level between services. This work develops new ways to test this brave new world of networked software, making it possible for software developers to automatically discover and repair bugs before that software is released to users. This research will produce new tools that, if broadly used, can result in fewer outages of the critical networked services that society depends on.
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.