This requires execute_buffer() and its callers to distinguish between
fatal and non-fatal errors. The -99 return value was already used for
non-recoverable errors that should force the daemon to restart, but the
execute_buffer() callers treated any non-success return as a fatal
error.
There a judgement call here on how to handle various error types. I
choose to assume that failures to set the IP address, netmask, broadcast
address, or default router are fatal errors. ndhc should be run from
process supervision, and this will trigger a daemon restart, which will
allow the machine to recover as soon as the problem (probably on the
dhcp server or local kernel state outside of ndhc's control) is
corrected.
This change corrects errors such as:
Discovering DHCP servers...
(process_client_socket) received invalid commands: 'carrier:;'
(send_dhcp_raw) carrier down; sendto would fail
Failed to send a discover request packet.
which happened if ndhc is started on a machine where the network
interface is down. After this change, ndhc should function as intended
by going to sleep until the carrier returns rather than terminating
itself to be restarted by the process supervisor until carrier returns.
Thus signed/unsigned conversion issues can't be a problem.
The extra range provided by ull isn't useful, either, since
we're dealing with uint32_t time_t seconds converted to ns,
which doesn't come close to exhausting the amount of post-epoch
ns that will be consumed until after 2100AD.
Lease times and arp timeouts are all calculated using long long,
but epoll takes its timeout argument as an int. Guard against
timeouts > INT_MAX but < UINT_MAX wrapping and causing spuriously
short timeouts when converted to a signed int.
This problem has been observed in the wild. Thanks to thypon
for a detailed strace that pointed me towards this issue.
Somewhere along the line it quit being set at the start of discovery
and was always 0. This is clearly not desired behavior.
Found by manual examination of packets while fuzzing the options
parser.
This corrects a bug where stale dhcp packets would get reprocessed,
causing very bad behavior; an issue that was introduced in the
coroutine conversion.
Some networks have multiple DHCP servers that don't respect the
serverid that is specified in DHCPREQUESTs.
Instead we simply check to see that the yiaddr matches.
While we're at it, ignore DHCP NAK during REQUESTING state. It
doesn't really make sense. Instead we should just wait for
timeout.
This change actually has no effect because incoming dhcp packets
that differ from our xid are dropped, so the xid is always changed
to the same value that it already has.
If ndhc were a high-performance program that handled lots of events,
this change would harm performance. But it is not, and it implicitly
believes that events come in one at a time. Processing batches
would make it harder to assure correctness while also never allocating
memory at runtime.
The previous structure was fine when everything was handled
immediately by callbacks, but it isn't now.
This change makes it much easier to reason about ndhc's behavior
and properly handle errors.
It is a very large changeset, but there is no way to make this
sort of change incrementally. Lease acquisition is tested to
work.
It is highly likely that some bugs were both introduced and
squashed here. Some obvious code cleanups will quickly follow.
This change paves the way for allowing ifch to notify the core ndhc
about failures. It would be far too difficult to reason about the
state machine if the requests to ifch were asynchronous.
Currently ndhc assumes that ifch requests never fail, but this
is not always true because of eg, rfkill.
If a packet send failed because the carrier went down without a
netlink notification, then assume the hardware carrier was lost while
the machine was suspended (eg, ethernet cable pulled during suspend).
Simulate a netlink carrier down event and freeze the dhcp state
machine until a netlink carrier up event is received.
The ARP code is not yet handling this issue everywhere, but the
window of opportunity for it to happen there is much shorter.
Linux will quietly proceed as if the data were sent even if the carrier
is down and nothing actually happened. There is still a tiny race
condition where the carrier could drop between the check and the actual
write, but we really can't do anything about that and it is a very
small race.