If ndhc were a high-performance program that handled lots of events,
this change would harm performance. But it is not, and it implicitly
believes that events come in one at a time. Processing batches
would make it harder to assure correctness while also never allocating
memory at runtime.
The previous structure was fine when everything was handled
immediately by callbacks, but it isn't now.
This change makes it much easier to reason about ndhc's behavior
and properly handle errors.
It is a very large changeset, but there is no way to make this
sort of change incrementally. Lease acquisition is tested to
work.
It is highly likely that some bugs were both introduced and
squashed here. Some obvious code cleanups will quickly follow.
This change paves the way for allowing ifch to notify the core ndhc
about failures. It would be far too difficult to reason about the
state machine if the requests to ifch were asynchronous.
Currently ndhc assumes that ifch requests never fail, but this
is not always true because of eg, rfkill.
If a packet send failed because the carrier went down without a
netlink notification, then assume the hardware carrier was lost while
the machine was suspended (eg, ethernet cable pulled during suspend).
Simulate a netlink carrier down event and freeze the dhcp state
machine until a netlink carrier up event is received.
The ARP code is not yet handling this issue everywhere, but the
window of opportunity for it to happen there is much shorter.
Linux will quietly proceed as if the data were sent even if the carrier
is down and nothing actually happened. There is still a tiny race
condition where the carrier could drop between the check and the actual
write, but we really can't do anything about that and it is a very
small race.
In order for this to work, the correct rfkill index must be specified
with the rfkill-idx option.
It might be possible to auto-detect the corresponding rfkill-idx option,
but I'm not sure if there's a guaranteed mapping between rfkill name and
interface name, as it seems that rfkills should represent phy devices
and not wlan devices.
The rfkill indexes can be found by checking
/sys/class/rfkill/rfkill<IDX>.
Mostly reverts the previous commit and instead teaches ndhc to properly
handle the case when it is communicating with a DHCP relay agent on
its local segment rather than directly with a DHCP server.
different segment.
The network fingerprinting would never complete if the DHCP server was
on a different segment before this change, since it would be impossible
for the ARP messages sent by ndhc to ever reach the DHCP server
(and vice-versa).
Now just give up trying to find the hardware address after two tries
and assume that the DHCP server cannot be reached by ARP.
An alternative would be to fingerprint the relay agent instead, but
to do so would require a lot more work as the giaddr field is only
meaningful in the client->server message path, not in the
server->client path. Thus it would require gathering the source IP
for DHCP replies sent by unicast or broadcast and ferrying along
this information to the ARP checking code where it would be used
in place of the DHCP server address.
This is entirely possible to do, but is quite a bit more work.
Without this change, it is possible for malicious UDP packets to
make the function read past the end of a buffer.
If this was ever a possibility in ndhc, the previous commit fixed
that issue, but there is no reason for udp_checksum() to have
such a subtle precondition to proper use. This change also makes
it easier to audit correctness.
This facility was added to Linux in early 2013. If it is not available,
the BPF will still be installed, but redundant checks will be performed
to guard against the BPF possibly being removed by an attacker.