This is done by performing one synchronous query for carrier state at
the start of the program; after that, we just monitor the nlsocket for
carrier state changes and update the cached state accordingly.
The benefit is that ifchd needs to do a lot less work and this should
reduce the CPU cycle consumption; prior to this commit, the CPU time
ends up being a few CPU-minutes per month.
The existing code tracked the fingerprinting completion state with
a bool, which is insufficient; what is required is a tristate
(none|inprogress|done) where the inprogress state reverts to
none on carrier down, but done state stays done on carrier down.
There's really no advantage to using signalfd in ndhc, particularly
since the normal POSIX signal API is now used for handling SIGCHLD in
ndhc-master. So just use the tried and true volatile sig_atomic_t set
and check approach.
The only intended behavior change is in the dhcp RELEASE state --
before there would be a spurious attempt at renewing a nonexistent
lease when the RENEW signal was received.
If we get no response to three renews (unicast), switch to sending
rebinds (broadcast). Servers are supposed to always reply with
a DHCPACK or DHCPNAK even if the server doesn't update its internal
lease duration database, so this behavior should be RFC compliant.
'(c)' may not be a valid substitute for 'Copyright' in some legal
domains/interpretations. So be safe, since I obviously am asserting
copyright on my legal work.
If carrier is lost before network fingerprinting is complete, we
have a few problems; first, we don't know whether the network has
changed underneath us. Second, we've not yet configured the
interface properties, and it is not unlikely that doing so will
fail as the underlying network device may have been destroyed
and recreated during this time (eg, if ethtool has been run at
start-up time).
Thus, the safest reaction is to terminate and force a supervisor
respawn. It is best to do this once carrier recovers, not when
the carrier is lost, as it is more likely to minimize delays.
We instead check carrier status as needed. This approach is more
robust. For a simple example, imagine link state changes that happen
while the machine is suspended.
There was no way to disable writing pidfiles before.
pidfiles are an unreliable method of tracking processes, anyway; process
supervisors are strongly recommended. If a pidfile is really needed, it
can be explicitly specified.
Thus signed/unsigned conversion issues can't be a problem.
The extra range provided by ull isn't useful, either, since
we're dealing with uint32_t time_t seconds converted to ns,
which doesn't come close to exhausting the amount of post-epoch
ns that will be consumed until after 2100AD.
This change makes it much easier to reason about ndhc's behavior
and properly handle errors.
It is a very large changeset, but there is no way to make this
sort of change incrementally. Lease acquisition is tested to
work.
It is highly likely that some bugs were both introduced and
squashed here. Some obvious code cleanups will quickly follow.
This change paves the way for allowing ifch to notify the core ndhc
about failures. It would be far too difficult to reason about the
state machine if the requests to ifch were asynchronous.
Currently ndhc assumes that ifch requests never fail, but this
is not always true because of eg, rfkill.
In order for this to work, the correct rfkill index must be specified
with the rfkill-idx option.
It might be possible to auto-detect the corresponding rfkill-idx option,
but I'm not sure if there's a guaranteed mapping between rfkill name and
interface name, as it seems that rfkills should represent phy devices
and not wlan devices.
The rfkill indexes can be found by checking
/sys/class/rfkill/rfkill<IDX>.
Mostly reverts the previous commit and instead teaches ndhc to properly
handle the case when it is communicating with a DHCP relay agent on
its local segment rather than directly with a DHCP server.
different segment.
The network fingerprinting would never complete if the DHCP server was
on a different segment before this change, since it would be impossible
for the ARP messages sent by ndhc to ever reach the DHCP server
(and vice-versa).
Now just give up trying to find the hardware address after two tries
and assume that the DHCP server cannot be reached by ARP.
An alternative would be to fingerprint the relay agent instead, but
to do so would require a lot more work as the giaddr field is only
meaningful in the client->server message path, not in the
server->client path. Thus it would require gathering the source IP
for DHCP replies sent by unicast or broadcast and ferrying along
this information to the ARP checking code where it would be used
in place of the DHCP server address.
This is entirely possible to do, but is quite a bit more work.
AF_UNIX SOCK_STREAM sockets between the master processes and each subprocess,
and poll for the HUP event.
At the same time, be specific about the events that are checked in epoll
when dispatching on an event.