Regardless of which mechanism you pick, it does create a weird situation: conceptually, unpriv user objects (processes, files, cgroups, …) are owned by UIDs that are under user control, but without this UID-UID ownership being directly known to the kernel. IOW: the only way to interact with those objects without privs is for the user to go through namespacing similar to how they were originally created. If user code tries to interact with them without going through userns, these objects will appear as foreign UID owned.
(Well, handling of processes is slightly less complex than files/cgroups here, since during their runtime they retain attachment to the userns they were created with, and that userns remains owned by the user's UID, which gives it magic powers. But files and cgroups don't work that way: file system objects "at rest" retain no binding to the userns, and thus no such magic powers from the original user's UID remain.)
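To make the "foreign UID" effect concrete, here is a minimal sketch of the translation a userns applies, using the three-column format of /proc/<pid>/uid_map (inside start, outside start, length). The mapping values and the helper names are hypothetical, picked only for illustration.

```python
from typing import Optional

# One uid_map entry: (first UID inside the userns, first UID outside, length)
UidMap = list[tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse the three-column format used by /proc/<pid>/uid_map."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_inside_uid(host_uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a host UID into the userns, or None if it is unmapped."""
    for inside, outside, length in uid_map:
        if outside <= host_uid < outside + length:
            return inside + (host_uid - outside)
    return None


# Hypothetical mapping: container UIDs 0..65535 live at host UIDs 524288..589823.
example_map = parse_uid_map("0 524288 65536\n")

# A cgroup or file created by the container's root is stored as host UID 524288.
# Inside the userns that translates to UID 0; outside it is just some UID that
# is not the user's own (say, 1000), and nothing on the object records the link.
container_root_on_host = 524288
user_uid = 1000
```

The point of the sketch: the mapping lives in the userns, not on the object, so code running as the plain user UID has nothing to tie the stored owner back to itself.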
For systemd this all creates various challenges. One specifically is what this episode is about: there's a per-user service manager for each user, and it manages cgroups. When invoking a userns-based container, it makes sense to delegate a cgroup to it, so that the container has all it needs to boot a full-blown systemd inside. Delegation means assigning ownership of the cgroup to the UID range used for the userns. But this then means that the per-user service manager will lack the privs to clean up the cgroup delegated to the container, since after all it just runs under the user's UID, but it has no knowledge of the userns or its mappings created by the container manager.
And this is a problem for robustness: it means that the container executor has to carefully clean up after itself, and never leave cgroups around, because unlike almost all other resources, the service manager managing that container executor is unable to clean up after it.
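As a rough illustration of what "cleaning up" involves, here is a sketch of a bottom-up cgroup removal. It is not systemd's implementation, and the function name is made up. A cgroup can only be removed once its child cgroups are gone, and removing a child needs write access to its parent directory, which is exactly what is missing when the parent belongs to the container's UID range.

```python
import os


def remove_cgroup_tree(root: str) -> list[str]:
    """Remove root and every cgroup below it, deepest first.

    Returns the directories that could not be removed.
    """
    failed = []
    # topdown=False yields children before their parents, which is the order
    # rmdir() needs: a cgroup that still has child cgroups cannot be removed.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        try:
            os.rmdir(dirpath)
        except OSError:
            # E.g. EACCES when the parent directory is owned by a foreign UID,
            # or EBUSY when processes or child cgroups are still around.
            failed.append(dirpath)
    return failed
```

When run as the plain user UID against a subtree delegated to a userns, the expectation is that the inner cgroups end up in the returned list, and their ancestors then fail as well because they are not empty.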
I ran into this problem quite frequently while hacking on nspawn and other userns related code: when my unpriv code died due to some bug I ended up with cgroups in the user's cgroup hierarchy that the per-user service manager couldn't clean up anymore, thus creating something of a DoS scenario.
With systemd v258 this changes a bit. The per-system service manager gained an IPC call that the per-user service manager can call, requesting it to clean up such cgroups for it. The per-system service manager runs privileged after all, and thus can do this.
Of course, the per-system service manager carefully validates the caller's credentials, and verifies that it delegated the cgroup to the caller in the first place. If that checks out, it will remove any subgroup requested, regardless of which UID owns it.
All of this is mostly transparent to services btw: if your code delegates a cgroup to other UIDs, and your service dies, it will now be cleaned up no matter what.
That said, the D-Bus method call RemoveSubgroupFromUnit() that is behind this is actually available to clients too, which even allows just removing parts of a delegated subtree, instead of the whole thing.
Moreover, there's a related call KillUnitSubgroup() which allows killing processes in a delegated cgroup subtree, too, for similar reasons and usecases.
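For clients that want to try the two calls, here is a small sketch that only builds the busctl command lines, without running them. The method names come from the text above. Everything else is an assumption on my side and should be checked with `busctl introspect` first: the argument signatures ("sst" for unit, subgroup, flags and "ssi" for unit, subgroup, signal), the subgroup path being relative to the unit's cgroup, and the unit and subgroup names used in the usage note.

```python
import signal

SYSTEMD_DEST = "org.freedesktop.systemd1"
SYSTEMD_PATH = "/org/freedesktop/systemd1"
MANAGER_IFACE = "org.freedesktop.systemd1.Manager"


def busctl_call(method, signature, *args, user=True):
    """Build (not run) a busctl argv for a call on the service manager."""
    bus = "--user" if user else "--system"
    return ["busctl", bus, "call", SYSTEMD_DEST, SYSTEMD_PATH, MANAGER_IFACE,
            method, signature, *map(str, args)]


def remove_subgroup_cmd(unit, subgroup, user=True):
    # ASSUMPTION: arguments are (unit name, subgroup path, flags), flags = 0.
    return busctl_call("RemoveSubgroupFromUnit", "sst", unit, subgroup, 0,
                       user=user)


def kill_subgroup_cmd(unit, subgroup, signo=signal.SIGTERM, user=True):
    # ASSUMPTION: arguments are (unit name, subgroup path, signal number).
    return busctl_call("KillUnitSubgroup", "ssi", unit, subgroup, int(signo),
                       user=user)
```

With a hypothetical unit `mycontainer.service` and subgroup `/payload`, `remove_subgroup_cmd("mycontainer.service", "/payload")` gives an argv you can hand to `subprocess.run()` once you have confirmed the signature on your system.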