[Nix-dev] hotswappable self managing services in nix

stewart mackenzie setori88 at gmail.com
Mon Nov 28 10:59:04 CET 2016


zimbatm, I appreciate you sharing your insight and experience; my main
exposure to this is the Erlang Way (tm). Many good lessons can be drawn
from that level of engineering.

I'm not too fussed about the network level (inter-machine) at this
moment. Inter-service we're using a component that wraps nanomsg.

After a bit of thought, we're redesigning the rustfbp scheduler/vm to
be more like the way nix operates.

i.e. as declarative as possible. We cannot have silly imperative state
fiddling at all: you cannot log into a fractalide node and start
connecting/disconnecting components by hand; that would be like making
/nix/store/ writable and messing with it.

The new design is this: you load a nix-compiled hierarchy of components
into the fractalide virtual machine (fvm). If you make a mistake, or
simply want to change something, you issue $ fvm reload
${new_subnet_hierarchy} and fvm will kill/start/hotswap the rust
components so that the running graph matches your new description. Just
as nixos-rebuild makes your system reflect your configuration.nix file,
so fvm will reconfigure the component graph in the running process to
reflect what nix compiled.
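
To make the nixos-rebuild analogy concrete, here's a rough sketch of what
the reload step amounts to. The Graph type, the reload function and the
printlns are made up for illustration; this isn't fvm's actual API:

    use std::collections::HashMap;

    /// Hypothetical description of a component graph: component name ->
    /// /nix/store path of its compiled shared object.
    type Graph = HashMap<String, String>;

    /// Reconcile the running graph with the newly nix-compiled one:
    /// stop what disappeared, start what is new, hotswap what changed.
    fn reload(running: &mut Graph, desired: &Graph) {
        // Components no longer present in the desired graph get stopped.
        let removed: Vec<String> = running
            .keys()
            .filter(|name| !desired.contains_key(*name))
            .cloned()
            .collect();
        for name in removed {
            println!("stopping {}", name);
            running.remove(&name);
        }

        // New or changed components get (re)started from their store path.
        for (name, store_path) in desired {
            match running.get(name) {
                // Same store path: component unchanged, leave it running.
                Some(current) if current == store_path => {}
                Some(_) => {
                    println!("hotswapping {} -> {}", name, store_path);
                    running.insert(name.clone(), store_path.clone());
                }
                None => {
                    println!("starting {} from {}", name, store_path);
                    running.insert(name.clone(), store_path.clone());
                }
            }
        }
    }

The important property is that the running graph only ever converges
toward what nix compiled; nothing gets mutated by hand.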

Once we have this layer, we can "surf" with nix and just let nix/nixops
manage the rest. We can implement blue/green styles of deployment in the
nixops *.nix files. That's another layer of problems still to come.

Phew...


On Mon, Nov 28, 2016 at 5:33 PM, zimbatm <zimbatm at zimbatm.com> wrote:
> Hi Stewart,
>
> In an HA setup availability is generally achieved at the network level
> rather than the system level. Typically you would have two hotswappable
> load-balancers that distribute traffic to multiple instances of your
> service boxes. In that context it doesn't matter how processes are
> restarted, because the load-balancer will automatically detect
> unresponsive machines and route traffic accordingly. It's also handy
> because it lets you restart machines when the kernel needs an upgrade. In
> that setup I suppose you can think of each machine as one Erlang OTP
> "process" and the network as the "message-passing".
>
> One responsibility of the service in that setup is to shut down properly
> to avoid unnecessary disruption of service. Mainly, when the process gets
> the SIGTERM signal it should close the listening socket (so the
> load-balancer can route new incoming connections to a different machine)
> and then drain the existing client connections gracefully. It shouldn't
> stop them all at once but let the clients disconnect when they are done
> with their sessions (and optionally signal them to go away if the
> protocol supports it).
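>
> A minimal sketch of that shutdown sequence (assuming the signal-hook
> crate's flag API and a plain TCP service; the address and handler are
> placeholders, not any particular component):
>
>     use std::net::{TcpListener, TcpStream};
>     use std::sync::atomic::{AtomicBool, Ordering};
>     use std::sync::Arc;
>     use std::thread;
>     use std::time::Duration;
>
>     fn main() -> std::io::Result<()> {
>         // Flip a flag when SIGTERM arrives instead of dying immediately.
>         let term = Arc::new(AtomicBool::new(false));
>         signal_hook::flag::register(signal_hook::consts::SIGTERM, Arc::clone(&term))
>             .expect("failed to register SIGTERM handler");
>
>         let listener = TcpListener::bind("127.0.0.1:8080")?;
>         listener.set_nonblocking(true)?;
>         let mut workers = Vec::new();
>
>         while !term.load(Ordering::Relaxed) {
>             match listener.accept() {
>                 Ok((stream, _)) => workers.push(thread::spawn(move || serve(stream))),
>                 Err(ref e) if e.kind() == std::io::ErrorKind::WouldBlock => {
>                     thread::sleep(Duration::from_millis(50));
>                 }
>                 Err(e) => return Err(e),
>             }
>         }
>
>         // SIGTERM received: stop accepting so the load-balancer routes new
>         // connections elsewhere, then drain the sessions we already hold.
>         drop(listener);
>         for w in workers {
>             let _ = w.join();
>         }
>         Ok(())
>     }
>
>     fn serve(stream: TcpStream) {
>         // Serve the existing session until the client is done, then return.
>         let _ = stream;
>     }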
>
> A last thing regarding this approach: generally you need a way to control
> the deploys; if all the service boxes are upgraded at the same time, the
> load-balancer has nowhere to route the traffic. Having that control is
> also desirable for blue/green deployments.
>
> I need to stop there for now but I also have a similar design answer on the
> system level where processes get replaced gracefully.
>
> Cheers,
> z
>
> On Sun, 27 Nov 2016 at 04:33 stewart mackenzie <setori88 at gmail.com> wrote:
>>
>> Nine 9s are not unheard of in these circles; Google uptimes are a joke
>> not worthy of mention.
>>
>> There are systems that have been running for some 40-odd years in
>> production while absorbing changes to legal banking regulations,
>> hardware, business logic, etc. Erlang powers one such system, the
>> Ericsson AXD301, which has achieved this kind of time frame.
>>
>> Just because NixOS hasn't been around that long doesn't mean it can't
>> have the primitives to allow for such feats. It's these primitives I'm
>> enquiring about.
>>
>> So let's use a new, less controversial figure of 5 9s and keep on topic.
>>
>> The thing is, we're designing this system so that it's governed by nix
>> and doesn't have to depend heavily on the runtime - I really don't want
>> to go down the imperative route by introducing imperative language
>> concepts into our declarative language, which is itself managed by
>> another declarative language (nix). Besides, just bringing in a single
>> component with an OS dependency demands we manage that change from the
>> nix level.
>>
>> We currently have a hack in place that resolves dependencies and gives
>> us a path to a correctly compiled shared object to load into memory:
>> https://github.com/fractalide/fractalide/blob/master/components/nucleus/find/component/src/lib.rs#L43
>> Nasty and cringeworthy, I know.
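>>
>> (The shape of it is roughly this - a made-up sketch using the libloading
>> crate with a hypothetical create_agent symbol, not the actual code behind
>> that link:)
>>
>>     use libloading::{Library, Symbol};
>>
>>     /// Load a nix-built shared object and call its (hypothetical)
>>     /// constructor symbol. The Library must outlive whatever it creates.
>>     fn load_component(so_path: &str) -> Result<Library, Box<dyn std::error::Error>> {
>>         let lib = unsafe { Library::new(so_path)? };
>>         let create: Symbol<unsafe extern "C" fn()> =
>>             unsafe { lib.get(b"create_agent")? };
>>         unsafe { create() };
>>         Ok(lib)
>>     }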
>>
>> Thanks for your pointer, I'll take a look at these activation scripts.
>>
>> Maybe this hack is the answer, and we confine the dynamism to an ssh
>> login, à la Erlang...
>>
>> _______________________________________________
>> nix-dev mailing list
>> nix-dev at lists.science.uu.nl
>> http://lists.science.uu.nl/mailman/listinfo/nix-dev

