Commit Graph

589 Commits

Author SHA1 Message Date
Nicholas J. Kain
7f08d4b6fb arp: Handle initial gateway query asynchronously and retry failures.
The gateway/router MAC fingerprinting could perhaps be done more
robustly in the face of suspend or carrier loss, but the time window
in which things could get confused is very small and I would rather
just rely on supervisor respawn in that case.

Even this case I don't think I've ever seen.
2017-02-24 07:39:14 -05:00
Nicholas J. Kain
7080850f38 Silence a spurious ARP defense message from previous commit. 2017-02-24 06:57:10 -05:00
Nicholas J. Kain
34a8cd7ad9 arp: Handle initial announcement asynchronously and retry failures.
We need to send two ARP announcements, so these are now done via a
timeout callback so that failures can be handled properly.
2017-02-24 06:57:10 -05:00
Nicholas J. Kain
4d87d5075a Handle carrier interruptions in arp_collision_timeout() better.
Still not ideal; we need to note and retry these errors, but these
changes are preparatory and do not introduce regressions.
2017-02-24 05:36:05 -05:00
Nicholas J. Kain
09d5e9ad3c arp: If first ARP announce fails, rely on the second announce.
The previous approach would desynchronize the state machine if the
carrier is paused after receiving the lease but before sending the
announce, since we have received a lease already.

This change is an improvement but is still not ideal.
2017-02-24 04:57:33 -05:00
Nicholas J. Kain
fa1c5d3a0c Add a comment describing link_set_flags() return values. 2017-01-19 05:22:22 -05:00
Nicholas J. Kain
1c2a39c544 Reorder client_config_t members to reduce structure padding. 2017-01-19 05:20:29 -05:00
Nicholas J. Kain
4fdde404aa Remove unused client_config_t foreground variable. 2017-01-19 05:18:04 -05:00
Nicholas J. Kain
c38fd2be9b Convert logical booleans in client_config_t to bool type. 2017-01-19 05:13:30 -05:00
Nicholas J. Kain
571b22c4b2 Rename client_state_t init variable to program_init.
Easier to grep.  No functional change.
2017-01-19 05:05:35 -05:00
Nicholas J. Kain
931530786b Convert logically boolean client_state_t variables from uint8_t to bool. 2017-01-19 05:01:23 -05:00
Nicholas J. Kain
b8ee0bd5c2 Update copyright dates to 2017. 2017-01-13 20:15:27 -05:00
Nicholas J. Kain
29498f5341 Remove ifsPrevState and set non-infinite timeout on a send error.
We instead check carrier status as needed.  This approach is more
robust.  For a simple example, imagine link state changes that happen
while the machine is suspended.
2017-01-13 20:15:27 -05:00
Nicholas J. Kain
c47630ffca Rename check_carrier() to carrier_isup() and use bool return. 2017-01-12 05:25:15 -05:00
Nicholas J. Kain
1fc7bd3144 Eliminate fopen() in cfg.rl. Use unbuffered i/o instead.
This is fairly tricky, but fopen() almost surely interally calls malloc
when it creates the FILE* that it returns.  I did promise that
ndhc doesn't call malloc after initialization, besides what libc may
do internally, but it feels a bit dishonest given that fopen() is
basically sure to do so on any general-purpose libc.

Thus, eliminate it and just use direct POSIX i/o.

With the previous pidfile changes, ndhc doesn't use C fopen() at all.

In practice, this change won't really be noticeable as most libcs,
particularly with dynamic linking, will end up calling malloc themselves
during program initialization before main() is invoked.
2016-05-06 18:21:13 -04:00
Nicholas J. Kain
04ec7c8f4b Update to latest write_pid semantics and don't write pidfile by default.
There was no way to disable writing pidfiles before.

pidfiles are an unreliable method of tracking processes, anyway; process
supervisors are strongly recommended.  If a pidfile is really needed, it
can be explicitly specified.
2016-05-06 15:00:31 -04:00
Nicholas J. Kain
a47a2feea1 Support resolv_conf.head and resolv_conf.tail.
These will be used if they exist at initialization.
Operations are all done by acting on pre-opened fds.
2016-05-06 06:45:16 -04:00
Nicholas J. Kain
313da14518 Update nl.[ch] to latest. No functional change for ndhc. 2016-05-06 06:36:01 -04:00
Nicholas J. Kain
a4ea869a54 README: Trivial style edits. 2015-12-27 01:00:18 -05:00
Nicholas J. Kain
5ab36719f1 If a fd closes unexpectedly in epoll, print error and exit with failure.
Before the exit code would be success and no error message would print,
and it required a bit of control flow tracing to determine what would
actually happen.

No direct functional change (unless the supervising process cares about
the return code of ndhc on exit).
2015-12-25 02:44:54 -05:00
Nicholas J. Kain
ae16e26d00 arp: Fix case where changing interface properties consistently fails.
If changing interface properties fails after getting a lease, it is
possible under some strange conditions for the failure to be
persistent.  This seems to happen if the carrier cycles off and on
several times during ndhc initialization.

Since this issue is very hard to replicate, the most conservative
thing to do here is to simply have ndhc suicide itself so it can
be respawned by a process supervisor.

Logs of the issue in practice:

(carrier is down while the daemon is started here, it seems)

16:57:09.638979845 ndhc-ifch seccomp filter installed.  Please disable seccomp if you
16:57:09.638989136 Discovering DHCP servers...
16:57:09.638991371 (send_dhcp_raw) carrier down; sendto would fail
16:57:09.638993318 Failed to send a discover request packet.
 ...
16:57:13.636519925 Discovering DHCP servers...
16:57:13.651462476 Received IP offer: X from server Y via Z
 ...
16:57:13.912592571  wan0: Gateway router set to: A
16:57:13.912607463  wan0: arp: Searching for dhcp server and gw addresses...
16:57:14.635532676  wan0: Carrier down.
17:04:32.983897760  wan0: arp: Still looking for gateway hardware address...
17:04:32.984158226  wan0: arp: Still looking for DHCP agent hardware address...
17:04:32.984781255  wan0: Interface is back.  Revalidating lease...
17:04:32.985585501  wan0: arp: Gateway hardware address B
17:04:32.985590436  wan0: arp: DHCP agent hardware address C
17:04:38.234857403  wan0: arp: Still waiting for gateway to reply to arp ping...
17:04:38.235109016  wan0: arp: Still waiting for DHCP agent to reply to arp ping...
16:57:24.165620224  wan0: arp: Still waiting for gateway to reply to arp ping...
16:57:29.169621070  wan0: arp: DHCP agent and gateway didn't reply.  Getting new lease.
16:57:29.217710616  wan0: Discovering DHCP servers...
16:57:29.249645130  wan0: Received IP offer: X from server Y via Z
16:57:29.249657203  wan0: Sending a selection request for X...
16:57:29.285632973  wan0: Received ACK: X from server Y via Z
16:57:29.297717159  wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.360249458  wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.435114526  wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.500473345  wan0: Lease of X obtained.  Lease time is D seconds.
16:57:29.500485894  wan0: Failed to set the interface IP address and properties!
 ...
And the final two errors repeat.  Restarting ndhc by hand instantly
fixes the issue.

So there's a lot going on -- bizzare clock skew, and carrier flickering
on and off.
2015-10-28 20:20:21 -04:00
Nicholas J. Kain
e0b5ff8eaf perform_carrier() should not cause ifchd to terminate on failure.
This requires execute_buffer() and its callers to distinguish between
fatal and non-fatal errors.  The -99 return value was already used for
non-recoverable errors that should force the daemon to restart, but the
execute_buffer() callers treated any non-success return as a fatal
error.

There a judgement call here on how to handle various error types.  I
choose to assume that failures to set the IP address, netmask, broadcast
address, or default router are fatal errors.  ndhc should be run from
process supervision, and this will trigger a daemon restart, which will
allow the machine to recover as soon as the problem (probably on the
dhcp server or local kernel state outside of ndhc's control) is
corrected.

This change corrects errors such as:

Discovering DHCP servers...
(process_client_socket) received invalid commands: 'carrier:;'
(send_dhcp_raw) carrier down; sendto would fail
Failed to send a discover request packet.

which happened if ndhc is started on a machine where the network
interface is down.  After this change, ndhc should function as intended
by going to sleep until the carrier returns rather than terminating
itself to be restarted by the process supervisor until carrier returns.
2015-07-03 00:49:13 -04:00
Nicholas J. Kain
12d8af4c67 Convert remaining ull time-types to ll.
Thus signed/unsigned conversion issues can't be a problem.

The extra range provided by ull isn't useful, either, since
we're dealing with uint32_t time_t seconds converted to ns,
which doesn't come close to exhausting the amount of post-epoch
ns that will be consumed until after 2100AD.
2015-05-27 23:40:28 -04:00
Nicholas J. Kain
277f0f67c5 When converting timeout from ll to int, also guard against underflow. 2015-05-27 15:00:02 -04:00
Nicholas J. Kain
e02d30dcc5 renewTime, rebindTime should be unsigned long long.
While we're at it, rearrange the client state struct to
save some space lost to padding.
2015-05-27 13:11:23 -04:00
Nicholas J. Kain
abb1b54bfe Fix an overflow that can cause spuriously short epoll timeouts.
Lease times and arp timeouts are all calculated using long long,
but epoll takes its timeout argument as an int.  Guard against
timeouts > INT_MAX but < UINT_MAX wrapping and causing spuriously
short timeouts when converted to a signed int.

This problem has been observed in the wild.  Thanks to thypon
for a detailed strace that pointed me towards this issue.
2015-05-27 12:58:42 -04:00
Nicholas J. Kain
ba875d4b2e Failsafe should only trigger is the new timeout is also zero.
This is what I get for rushing!
2015-05-27 12:35:16 -04:00
Nicholas J. Kain
f061a78a18 Fix dumb mistake in patch before last; epoll timeout is in ms, not s. 2015-05-27 12:29:46 -04:00
Nicholas J. Kain
2057d7cd18 Improve compatibility with clang. 2015-05-27 12:23:50 -04:00
Nicholas J. Kain
8273b383ab Add a failsafe to prevent epoll busy-spin. 2015-05-27 12:23:16 -04:00
Nicholas J. Kain
6c136c3f85 Make add_option_(vendor|hostname)() not use ndhc internals.
This change makes it easier to fuzz test, but should have no
functional effect on ndhc's behavior.
2015-02-20 03:58:25 -05:00
Nicholas J. Kain
646931a2bf xid should be regenerated whenever we start a new discover cycle.
Somewhere along the line it quit being set at the start of discovery
and was always 0.  This is clearly not desired behavior.

Found by manual examination of packets while fuzzing the options
parser.
2015-02-20 03:48:13 -05:00
Nicholas J. Kain
b8df1b33f5 Fix the timeouts for rebinding and renewing.
Now renew/rebind requests will be sent every 60+/-10s.
2015-02-18 11:03:58 -05:00
Nicholas J. Kain
b3bd13d45f Fix the return values of dhcp_packet_get and arp_packet_get.
This corrects a bug where stale dhcp packets would get reprocessed,
causing very bad behavior; an issue that was introduced in the
coroutine conversion.
2015-02-18 11:02:13 -05:00
Nicholas J. Kain
a627e9dc9c Minor changes to log messages and comments. 2015-02-18 09:56:38 -05:00
Nicholas J. Kain
3ede5fbe33 Handle the release and renew signals again. 2015-02-18 07:31:19 -05:00
Nicholas J. Kain
731bd14f0a reinit_selecting() can't fail; convert it to void return. 2015-02-18 07:12:35 -05:00
Nicholas J. Kain
89d77313c6 Deconfigure the interface if we've failed a fingerprint check.
We've moved to a new network, so keeping the old information
is counterproductive.
2015-02-18 07:12:22 -05:00
Nicholas J. Kain
29da5d21a4 Tolerate DHCP ACKs in REQUESTING state that don't match the serverid.
Some networks have multiple DHCP servers that don't respect the
serverid that is specified in DHCPREQUESTs.

Instead we simply check to see that the yiaddr matches.

While we're at it, ignore DHCP NAK during REQUESTING state.  It
doesn't really make sense.  Instead we should just wait for
timeout.
2015-02-18 06:52:59 -05:00
Nicholas J. Kain
f28495a58b selecting_packet() shouldn't take the xid from the packet.
This change actually has no effect because incoming dhcp packets
that differ from our xid are dropped, so the xid is always changed
to the same value that it already has.
2015-02-18 06:01:43 -05:00
Nicholas J. Kain
da7d34a300 Bump version to 2.0.
When the coroutine restructuring is fully tested and debugged, it
will definitely be a significant enough change for a full version
bump.
2015-02-18 05:41:31 -05:00
Nicholas J. Kain
69cf41f1b1 Only process one epoll event at a time.
If ndhc were a high-performance program that handled lots of events,
this change would harm performance.  But it is not, and it implicitly
believes that events come in one at a time.  Processing batches
would make it harder to assure correctness while also never allocating
memory at runtime.

The previous structure was fine when everything was handled
immediately by callbacks, but it isn't now.
2015-02-18 05:36:13 -05:00
Nicholas J. Kain
99ce918a31 Use a coroutine instead of several callback state machines.
This change makes it much easier to reason about ndhc's behavior
and properly handle errors.

It is a very large changeset, but there is no way to make this
sort of change incrementally.  Lease acquisition is tested to
work.

It is highly likely that some bugs were both introduced and
squashed here.  Some obvious code cleanups will quickly follow.
2015-02-18 05:31:13 -05:00
Nicholas J. Kain
37aa866ae4 Move action dispatch out of main epoll loop. 2015-02-15 06:48:49 -05:00
Nicholas J. Kain
61387408d0 Separate event state gathering from action dispatch in main epoll loop.
This is the first step towards using coroutines.
2015-02-15 06:38:03 -05:00
Nicholas J. Kain
658d2954b8 Reduce log spam. 2015-02-15 02:58:51 -05:00
Nicholas J. Kain
e874373dcd Check link carrier via ifch and netlink instead of ioctl.
Thus, ioctl can once again be removed from the ndhc seccomp whitelist.
2015-02-15 02:50:29 -05:00
Nicholas J. Kain
6c9ca9eecd If ifchd commands fail, propagate the failure back to ndhc. 2015-02-15 02:29:37 -05:00
Nicholas J. Kain
33aa5b13de Recommend using a process supervisor for high reliability in README. 2015-02-14 20:50:46 -05:00
Nicholas J. Kain
5b82be8b00 If ifchd interactions fail, terminate.
Ideally we would pause and resume state, but for now just bail out.
If ndhc is process-supervised, it will recover to the proper state
quickly.
2015-02-14 20:47:14 -05:00