The NetBSD Project

CVS log for src/sys/netinet/ip_input.c

[BACK] Up to [] / src / sys / netinet

Request diff between arbitrary revisions

Default branch: MAIN

Revision 1.355.2.3 / (download) - annotate - [select for diffs], Tue Jan 2 10:20:34 2018 UTC (2 weeks ago) by snj
Branch: netbsd-8
Changes since 1.355.2.2: +6 -16 lines
Diff to previous 1.355.2.2 (colored) to branchpoint 1.355 (colored) next main 1.356 (colored) to selected 1.183 (colored)

Pull up following revision(s) (requested by ozaki-r in ticket #456):
	sys/arch/arm/sunxi/sunxi_emac.c: 1.9
	sys/dev/ic/dwc_gmac.c: 1.43-1.44
	sys/dev/pci/if_iwm.c: 1.75
	sys/dev/pci/if_wm.c: 1.543
	sys/dev/pci/ixgbe/ixgbe.c: 1.112
	sys/dev/pci/ixgbe/ixv.c: 1.74
	sys/kern/sys_socket.c: 1.75
	sys/net/agr/if_agr.c: 1.43
	sys/net/bpf.c: 1.219
	sys/net/if.c: 1.397, 1.399, 1.401-1.403, 1.406-1.410, 1.412-1.416
	sys/net/if.h: 1.242-1.247, 1.250, 1.252-1.257
	sys/net/if_bridge.c: 1.140 via patch, 1.142-1.146
	sys/net/if_etherip.c: 1.40
	sys/net/if_ethersubr.c: 1.243, 1.246
	sys/net/if_faith.c: 1.57
	sys/net/if_gif.c: 1.132
	sys/net/if_l2tp.c: 1.15, 1.17
	sys/net/if_loop.c: 1.98-1.101
	sys/net/if_media.c: 1.35
	sys/net/if_pppoe.c: 1.131-1.132
	sys/net/if_spppsubr.c: 1.176-1.177
	sys/net/if_tun.c: 1.142
	sys/net/if_vlan.c: 1.107, 1.109, 1.114-1.121
	sys/net/npf/npf_ifaddr.c: 1.3
	sys/net/npf/npf_os.c: 1.8-1.9
	sys/net/rtsock.c: 1.230
	sys/netcan/if_canloop.c: 1.3-1.5
	sys/netinet/if_arp.c: 1.255
	sys/netinet/igmp.c: 1.65
	sys/netinet/in.c: 1.210-1.211
	sys/netinet/in_pcb.c: 1.180
	sys/netinet/ip_carp.c: 1.92, 1.94
	sys/netinet/ip_flow.c: 1.81
	sys/netinet/ip_input.c: 1.362
	sys/netinet/ip_mroute.c: 1.147
	sys/netinet/ip_output.c: 1.283, 1.285, 1.287
	sys/netinet6/frag6.c: 1.61
	sys/netinet6/in6.c: 1.251, 1.255
	sys/netinet6/in6_pcb.c: 1.162
	sys/netinet6/ip6_flow.c: 1.35
	sys/netinet6/ip6_input.c: 1.183
	sys/netinet6/ip6_output.c: 1.196
	sys/netinet6/mld6.c: 1.90
	sys/netinet6/nd6.c: 1.239-1.240
	sys/netinet6/nd6_nbr.c: 1.139
	sys/netinet6/nd6_rtr.c: 1.136
	sys/netipsec/ipsec_output.c: 1.65
	sys/rump/net/lib/libnetinet/netinet_component.c: 1.9-1.10
kmem_intr_free kmem_intr_[z]alloced memory
the underlying pools are the same but api-wise those should match
There are already two flags for if_output and if_start, however, it seems such
MPSAFE flags are eventually needed for all if_XXX operations. Having discrete
flags for each operation is wasteful of if_extflags bits. So let's unify
the flags into one: IFEF_MPSAFE.
Fortunately IFEF_*_MPSAFE flags have never been included in any releases, so
we can change them without breaking backward compatibility of the releases
(though the kernel version of -current should be bumped).
Note that if an interface have both MP-safe and non-MP-safe operations at a
time, we have to set the IFEF_MPSAFE flag and let callees of non-MP-safe
opeartions take the kernel lock.
Proposed on tech-kern@ and tech-net@
Provide macros for softnet_lock and KERNEL_LOCK hiding NET_MPSAFE switch
It reduces C&P codes such as "#ifndef NET_MPSAFE KERNEL_LOCK(1, NULL); ..."
scattered all over the source code and makes it easy to identify remaining
KERNEL_LOCK and/or softnet_lock that are held even if NET_MPSAFE.
No functional change
Hold KERNEL_LOCK on if_ioctl selectively based on IFEF_MPSAFE
If IFEF_MPSAFE is set, hold the lock and otherwise don't hold.
This change requires additions of KERNEL_LOCK to subsequence functions from
if_ioctl such as ifmedia_ioctl and ifioctl_common to protect non-MP-safe
Proposed on tech-kern@ and tech-net@
Ensure to hold if_ioctl_lock when calling if_flags_set
Fix locking against myself on ifpromisc
vlan_unconfig_locked could be called with holding if_ioctl_lock.
Ensure to not turn on IFF_RUNNING of an interface until its initialization completes
And ensure to turn off it before destruction as per IFF_RUNNING's description
"resource allocated". (The description is a bit doubtful though, I believe the
change is still proper.)
Ensure to hold if_ioctl_lock on if_up and if_down
One exception for if_down is if_detach; in the case the lock isn't needed
because it's guaranteed that no other one can access ifp at that point.
Make if_link_queue MP-safe if IFEF_MPSAFE
if_link_queue is a queue to store events of link state changes, which is
used to pass events from (typically) an interrupt handler to
if_link_state_change softint. The queue was protected by KERNEL_LOCK so far,
but if IFEF_MPSAFE is enabled, it becomes unsafe because (perhaps) an interrupt
handler of an interface with IFEF_MPSAFE doesn't take KERNEL_LOCK. Protect it
by a spin mutex.
Additionally with this change KERNEL_LOCK of if_link_state_change softint is
omitted if NET_MPSAFE is enabled.
Note that the spin mutex is now ifp->if_snd.ifq_lock as well as the case of
if_timer (see the comment).
At that point no other one modifies the list so IFADDR_READER_FOREACH
is unnecessary. Use of IFADDR_READER_FOREACH is harmless in general though,
if we try to detect contract violations of pserialize, using it violates
the contract. So avoid using it makes life easy.
Ensure to call if_addr_init with holding if_ioctl_lock
Get rid of outdated comments
Fix build of kernels without ether
By throwing out if_enable_vlan_mtu and if_disable_vlan_mtu that
created a unnecessary dependency from if.c to if_ethersubr.c.
PR kern/52790
IFNET_LOCK will be used in another lock, if_ioctl_lock (might be renamed then).
Wrap if_ioctl_lock with IFNET_* macros (NFC)
Also if_ioctl_lock perhaps needs to be renamed to something because it's now
not just for ioctl...
Reorder some destruction routines in if_detach
- Destroy if_ioctl_lock at the end of the if_detach because it's used in various
  destruction routines
- Move psref_target_destroy after pr_purgeif because we want to use psref in
  pr_purgeif (otherwise destruction procedures can be tricky)
Ensure to call if_mcast_op with holding IFNET_LOCK
Note that CARP doesn't deal with IFNET_LOCK yet.
Remove IFNET_GLOBAL_LOCK where it's unnecessary because IFNET_LOCK is held
Describe which lock is used to protect each member variable of struct ifnet
Requested by skrll@
Write a guideline for converting an interface to IFEF_MPSAFE
Requested by skrll@
Note that IFNET_LOCK must not be held in softint
Don't set IFEF_MPSAFE unless NET_MPSAFE at this point
Because recent investigations show that interfaces with IFEF_MPSAFE need to
follow additional restrictions to work with the flag safely. We should enable it
on an interface by default only if the interface surely satisfies the
restrictions, which are described in if.h.
Note that enabling IFEF_MPSAFE solely gains a few benefit on performance because
the network stack is still serialized by the big kernel locks by default.

Revision 1.364 / (download) - annotate - [select for diffs], Mon Jan 1 00:51:36 2018 UTC (2 weeks, 1 day ago) by christos
Branch: MAIN
Changes since 1.363: +2 -11 lines
Diff to previous 1.363 (colored) to selected 1.183 (colored)

1) "#define ipi_spec_dst ipi_addr" in <netinet/in.h>
2) Change the IP_RECVPKTINFO option to control the generation of
   IP_PKTINFO control messages, the way it's done in Solaris.
3) Remove the superfluous IP_RECVPKTINFO control message.
4) Change the IP_PKTINFO option to do different things depending on
   the parameter it's supplied with:
   - If it's sizeof(int), assume it's being used as in Linux:
     - If it's non-zero, turn on the IP_RECVPKTINFO option.
     - If it's zero, turn off the IP_RECVPKTINFO option.
   - If it's sizeof(struct in_pktinfo), assume it's being used as in
     Solaris, to set a default for the source interface and/or
     source address for outgoing packets on the socket.
5) Return what Linux or Solaris compatible code expects, depending
   on data size, and just added a fallback to a Linux (and current NetBSD)
   compatible value if the size is unknown (as it is now), or,
   in the future, if the calling application specifies a receiving
   buffer that doesn't match either data item.

From: Tom Ivar Helbekkmo

Revision 1.355.2.2 / (download) - annotate - [select for diffs], Sun Dec 10 09:24:30 2017 UTC (5 weeks, 2 days ago) by snj
Branch: netbsd-8
Changes since 1.355.2.1: +13 -7 lines
Diff to previous 1.355.2.1 (colored) to branchpoint 1.355 (colored) to selected 1.183 (colored)

Pull up following revision(s) (requested by roy in ticket #390):
	sys/netinet/ip_input.c: 1.363
	sys/netinet6/ip6_input.c: 1.184-1.185
	sys/netinet6/ip6_output.c: 1.194-1.195
	sys/netinet6/in6_src.c: 1.83-1.84
Allow local communication over DETACHED addresses.
Allow binding to DETACHED or TENTATIVE addresses as we deny
sending upstream from them anyway.
Prefer non DETACHED or TENTATIVE addresses.
Attempt to restore v6 networking.   Not 100% certain that these
changes are all that is needed, but they're certainly a big part of it
(especially the ip6_input.c change.)
Treat unvalidated addresses as deprecated in rule 3.

Revision 1.302.2.4 / (download) - annotate - [select for diffs], Sun Dec 3 11:39:04 2017 UTC (6 weeks, 2 days ago) by jdolecek
Branch: tls-maxphys
Changes since 1.302.2.3: +419 -264 lines
Diff to previous 1.302.2.3 (colored) to branchpoint 1.302 (colored) next main 1.303 (colored) to selected 1.183 (colored)

update from HEAD

Revision 1.363 / (download) - annotate - [select for diffs], Fri Nov 24 14:03:25 2017 UTC (7 weeks, 4 days ago) by roy
Branch: MAIN
CVS Tags: tls-maxphys-base-20171202
Changes since 1.362: +13 -7 lines
Diff to previous 1.362 (colored) to selected 1.183 (colored)

Allow local communication over DETACHED addresses.
Allow binding to DETACHED or TENTATIVE addresses as we deny
sending upstream from them anyway.
Prefer non DETACHED or TENTATIVE addresses.

Revision 1.362 / (download) - annotate - [select for diffs], Fri Nov 17 07:37:12 2017 UTC (8 weeks, 4 days ago) by ozaki-r
Branch: MAIN
Changes since 1.361: +6 -16 lines
Diff to previous 1.361 (colored) to selected 1.183 (colored)

Provide macros for softnet_lock and KERNEL_LOCK hiding NET_MPSAFE switch

It reduces C&P codes such as "#ifndef NET_MPSAFE KERNEL_LOCK(1, NULL); ..."
scattered all over the source code and makes it easy to identify remaining
KERNEL_LOCK and/or softnet_lock that are held even if NET_MPSAFE.

No functional change

Revision 1.355.2.1 / (download) - annotate - [select for diffs], Sat Oct 21 19:43:54 2017 UTC (2 months, 3 weeks ago) by snj
Branch: netbsd-8
CVS Tags: matt-nb8-mediatek-base, matt-nb8-mediatek
Changes since 1.355: +3 -11 lines
Diff to previous 1.355 (colored) to selected 1.183 (colored)

Pull up following revision(s) (requested by ozaki-r in ticket #300):
	crypto/dist/ipsec-tools/src/setkey/parse.y: 1.19
	crypto/dist/ipsec-tools/src/setkey/token.l: 1.20
	distrib/sets/lists/tests/mi: 1.754, 1.757, 1.759
	doc/TODO.smpnet: 1.12-1.13
	sys/net/pfkeyv2.h: 1.32
	sys/net/raw_cb.c: 1.23-1.24, 1.28
	sys/net/raw_cb.h: 1.28
	sys/net/raw_usrreq.c: 1.57-1.58
	sys/net/rtsock.c: 1.228-1.229
	sys/netinet/in_proto.c: 1.125
	sys/netinet/ip_input.c: 1.359-1.361
	sys/netinet/tcp_input.c: 1.359-1.360
	sys/netinet/tcp_output.c: 1.197
	sys/netinet/tcp_var.h: 1.178
	sys/netinet6/icmp6.c: 1.213
	sys/netinet6/in6_proto.c: 1.119
	sys/netinet6/ip6_forward.c: 1.88
	sys/netinet6/ip6_input.c: 1.181-1.182
	sys/netinet6/ip6_output.c: 1.193
	sys/netinet6/ip6protosw.h: 1.26
	sys/netipsec/ipsec.c: 1.100-1.122
	sys/netipsec/ipsec.h: 1.51-1.61
	sys/netipsec/ipsec6.h: 1.18-1.20
	sys/netipsec/ipsec_input.c: 1.44-1.51
	sys/netipsec/ipsec_netbsd.c: 1.41-1.45
	sys/netipsec/ipsec_output.c: 1.49-1.64
	sys/netipsec/ipsec_private.h: 1.5
	sys/netipsec/key.c: 1.164-1.234
	sys/netipsec/key.h: 1.20-1.32
	sys/netipsec/key_debug.c: 1.18-1.21
	sys/netipsec/key_debug.h: 1.9
	sys/netipsec/keydb.h: 1.16-1.20
	sys/netipsec/keysock.c: 1.59-1.62
	sys/netipsec/keysock.h: 1.10
	sys/netipsec/xform.h: 1.9-1.12
	sys/netipsec/xform_ah.c: 1.55-1.74
	sys/netipsec/xform_esp.c: 1.56-1.72
	sys/netipsec/xform_ipcomp.c: 1.39-1.53
	sys/netipsec/xform_ipip.c: 1.50-1.54
	sys/netipsec/xform_tcp.c: 1.12-1.16
	sys/rump/librump/rumpkern/Makefile.rumpkern: 1.170
	sys/rump/librump/rumpnet/net_stub.c: 1.27
	sys/sys/protosw.h: 1.67-1.68
	tests/net/carp/ 1.7
	tests/net/if_gif/ 1.11
	tests/net/if_l2tp/ 1.3
	tests/net/ipsec/Makefile: 1.7-1.9
	tests/net/ipsec/ 1.5
	tests/net/ipsec/ 1.4-1.6
	tests/net/ipsec/ 1.2
	tests/net/ipsec/ 1.2
	tests/net/ipsec/ 1.6-1.7
	tests/net/ipsec/ 1.6-1.7
	tests/net/ipsec/ 1.8-1.18
	tests/net/ipsec/ 1.1-1.2
	tests/net/ipsec/ 1.1-1.2
	tests/net/ipsec/ 1.5-1.6
	tests/net/ipsec/ 1.9
	tests/net/ipsec/ 1.1-1.2
	tests/net/ipsec/ 1.3
	tests/net/mcast/ 1.6
	tests/net/net/ 1.11
	tests/net/ 1.20
	tests/net/npf/ 1.3
	tests/net/route/ 1.20
	tests/net/route/ 1.16
	usr.bin/netstat/fast_ipsec.c: 1.22
Do m_pullup before mtod

It may fix panicks of some tests on anita/sparc and anita/GuruPlug.
Enable DEBUG for babylon5
Apply C99-style struct initialization to xformsw
Tweak outputs of netstat -s for IPsec

- Get rid of "Fast"
- Use ipsec and ipsec6 for titles to clarify protocol
- Indent outputs of sub protocols

Original outputs were organized like this:

(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:
(Fast) IPsec:
IPsec ah:
IPsec esp:
IPsec ipip:
IPsec ipcomp:

New outputs are organized like this:

Add test cases for IPComp
Simplify IPSEC_OSTAT macro (NFC)
KNF; replace leading whitespaces with hard tabs
Introduce and use SADB_SASTATE_USABLE_P
Add update command for testing

Updating an SA (SADB_UPDATE) requires that a process issuing
SADB_UPDATE is the same as a process issued SADB_ADD (or SADB_GETSPI).
This means that update command must be used with add command in a
configuration of setkey. This usage is normally meaningless but
useful for testing (and debugging) purposes.
Add test cases for updating SA/SP

The tests require newly-added udpate command of setkey.
PR/52346: Frank Kardel: Fix checksumming for NAT-T
See XXX for improvements.

It seems that PACKET_TAG_IPSEC_IN_CRYPTO_DONE is for network adapters
that have IPsec accelerators; a driver sets the mtag to a packet
when its device has already encrypted the packet.

Unfortunately no driver implements such offload features for long
years and seems unlikely to implement them soon. (Note that neither
FreeBSD nor Linux doesn't have such drivers.) Let's remove related
(unused) codes and simplify the IPsec code.
Fix usages of sadb_msg_errno
Avoid updating sav directly

On SADB_UPDATE a target sav was updated directly, which was unsafe.
Instead allocate another sav, copy variables of the old sav to
the new one and replace the old one with the new one.
Simplify; we can assume sav->tdb_xform cannot be NULL while it's valid
Rename key_alloc* functions (NFC)

We shouldn't use the term "alloc" for functions that just look up
data and actually don't allocate memory.
Use explicit_memset to surely zero-clear key_auth and key_enc
Make sure to clear keys on error paths of key_setsaval
Add missing KEY_FREESAV
Make sure a sav is inserted to a sah list after its initialization completes
Remove unnecessary zero-clearing codes from key_setsaval

key_setsaval is now used only for a newly-allocated sav. (It was
used to reset variables of an existing sav.)
Correct wrong assumption of sav->refcnt in key_delsah

A sav in a list is basically not to be sav->refcnt == 0. And also
KEY_FREESAV assumes sav->refcnt > 0.
Let key_getsavbyspi take a reference of a returning sav
Use time_mono_to_wall (NFC)
Separate sending message routine (NFC)
Simplify; remove unnecessary zero-clears

key_freesaval is used only when a target sav is being destroyed.
Omit NULL checks for sav->lft_c

sav->lft_c can be NULL only when initializing or destroying sav.
Omit unnecessary NULL checks for sav->sah
Omit unnecessary check of sav->state

key_allocsa_policy picks a sav of either MATURE or DYING so we
don't need to check its state again.
Simplify; omit unnecessary saidx passing

- ipsec_nextisr returns a saidx but no caller uses it
- key_checkrequest is passed a saidx but it can be gotton by
  another argument (isr)
Fix splx isn't called on some error paths
Fix header size calculation of esp where sav is NULL
Fix header size calculation of ah in the case sav is NULL

This fix was also needed for esp.
Pass sav directly to opencrypto callback

In a callback, use a passed sav as-is by default and look up a sav
only if the passed sav is dead.
Avoid examining freshness of sav on packet processing

If a sav list is sorted (by lft_c->sadb_lifetime_addtime) in advance,
we don't need to examine each sav and also don't need to delete one
on the fly and send up a message. Fortunately every sav lists are sorted
as we need.

Added key_validate_savlist validates that each sav list is surely sorted
(run only if DEBUG because it's not cheap).
Add test cases for SAs with different SPIs
Prepare to stop using isr->sav

isr is a shared resource and using isr->sav as a temporal storage
for each packet processing is racy. And also having a reference from
isr to sav makes the lifetime of sav non-deterministic; such a reference
is removed when a packet is processed and isr->sav is overwritten by
new one. Let's have a sav locally for each packet processing instead of
using shared isr->sav.

However this change doesn't stop using isr->sav yet because there are
some users of isr->sav. isr->sav will be removed after the users find
a way to not use isr->sav.
Fix wrong argument handling
fix printf format.
Don't validate sav lists of LARVAL or DEAD states

We don't sort the lists so the validation will always fail.

Fix PR kern/52405
Make sure to sort the list when changing the state by key_sa_chgstate
Rename key_allocsa_policy to key_lookup_sa_bysaidx
Separate test files
Calculate ah_max_authsize on initialization as well as esp_max_ivlen
Remove m_tag_find(PACKET_TAG_IPSEC_PENDING_TDB) because nobody sets the tag
Restore a comment removed in previous

The comment is valid for the below code.
Make tests more stable

sleep command seems to wait longer than expected on anita so
use polling to wait for a state change.
Add tests that explicitly delete SAs instead of waiting for expirations
Remove invalid M_AUTHIPDGM check on ESP isr->sav

M_AUTHIPDGM flag is set to a mbuf in ah_input_cb. An sav of ESP can
have AH authentication as sav->tdb_authalgxform. However, in that
case esp_input and esp_input_cb are used to do ESP decryption and
AH authentication and M_AUTHIPDGM never be set to a mbuf. So
checking M_AUTHIPDGM of a mbuf on isr->sav of ESP is meaningless.
Look up sav instead of relying on unstable sp->req->sav

This code is executed only in an error path so an additional lookup
doesn't matter.
Correct a comment
Don't release sav if calling crypto_dispatch again
Remove extra KEY_FREESAV from ipsec_process_done

It should be done by the caller.
Don't bother the case of crp->crp_buf == NULL in callbacks
Hold a reference to an SP during opencrypto processing

An SP has a list of isr (ipsecrequest) that represents a sequence
of IPsec encryption/authentication processing. One isr corresponds
to one opencrypto processing. The lifetime of an isr follows its SP.

We pass an isr to a callback function of opencrypto to continue
to a next encryption/authentication processing. However nobody
guaranteed that the isr wasn't freed, i.e., its SP wasn't destroyed.

In order to avoid such unexpected destruction of isr, hold a reference
to its SP during opencrypto processing.
Don't make SAs expired on tests that delete SAs explicitly
Fix a debug message
Dedup error paths (NFC)
Use pool to allocate tdb_crypto

For ESP and AH, we need to allocate an extra variable space in addition
to struct tdb_crypto. The fixed size of pool items may be larger than
an actual requisite size of a buffer, but still the performance
improvement by replacing malloc with pool wins.
Don't use unstable isr->sav for header size calculations

We may need to optimize to not look up sav here for users that
don't need to know an exact size of headers (e.g., TCP segmemt size
Don't use sp->req->sav when handling NAT-T ESP fragmentation

In order to do this we need to look up a sav however an additional
look-up degrades performance. A sav is later looked up in
ipsec4_process_packet so delay the fragmentation check until then
to avoid an extra look-up.
Don't use key_lookup_sp that depends on unstable sp->req->sav

It provided a fast look-up of SP. We will provide an alternative
method in the future (after basic MP-ification finishes).
Stop setting isr->sav on looking up sav in key_checkrequest
Remove ipsecrequest#sav
Stop setting mtag of PACKET_TAG_IPSEC_IN_DONE because there is no users anymore
Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

Probably due to PR 43997
Add localcount to rump kernels
Remove unused macro
Fix key_getcomb_setlifetime

The fix adjusts a soft limit to be 80% of a corresponding hard limit.

I'm not sure the fix is really correct though, at least the original
code is wrong. A passed comb is zero-cleared before calling
key_getcomb_setlifetime, so
  comb->sadb_comb_soft_addtime = comb->sadb_comb_soft_addtime * 80 / 100;
is meaningless.
Provide and apply key_sp_refcnt (NFC)

It simplifies further changes.
Fix indentation

Pointed out by knakahara@
Use pslist(9) for sptree
Don't acquire global locks for IPsec if NET_MPSAFE

Note that the change is just to make testing easy and IPsec isn't MP-safe yet.
Let PF_KEY socks hold their own lock instead of softnet_lock

Operations on SAD and SPD are executed via PF_KEY socks. The operations
include deletions of SAs and SPs that will use synchronization mechanisms
such as pserialize_perform to wait for references to SAs and SPs to be
released. It is known that using such mechanisms with holding softnet_lock
causes a dead lock. We should avoid the situation.
Make IPsec SPD MP-safe

We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
entries MP-safe because SPs need to be referenced over opencrypto
processing that executes a callback in a different context.

SPs on sockets aren't managed by the sptree and can be destroyed in softint.
localcount_drain cannot be used in softint so we delay the destruction of
such SPs to a thread context. To do so, a list to manage such SPs is added
(key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

For more details please read the locking notes in key.c.

Proposed on tech-kern@ and tech-net@
Fix updating ipsec_used

- key_update_used wasn't called in key_api_spddelete2 and key_api_spdflush
- key_update_used wasn't called if an SP had been added/deleted but
  a reply to userland failed
Fix updating ipsec_used; turn on when SPs on sockets are added
Add missing IPsec policy checks to icmp6_rip6_input

icmp6_rip6_input is quite similar to rip6_input and the same checks exist
in rip6_input.
Add test cases for setsockopt(IP_IPSEC_POLICY)
Don't use KEY_NEWSP for dummy SP entries

By the change KEY_NEWSP is now not called from softint anymore
and we can use kmem_zalloc with KM_SLEEP for KEY_NEWSP.
Comment out unused functions
Add test cases that there are SPs but no relevant SAs
Don't allow sav->lft_c to be NULL

lft_c of an sav that was created by SADB_GETSPI could be NULL.
Clean up clunky eval strings

- Remove unnecessary \ at EOL
  - This allows to omit ; too
- Remove unnecessary quotes for arguments of atf_set
- Don't expand $DEBUG in eval
  - We expect it's expanded on execution

Suggested by kre@
Remove unnecessary KEY_FREESAV in an error path

sav should be freed (unreferenced) by the caller.
Use pslist(9) for sahtree
Use pslist(9) for sah->savtree
Rename local variable newsah to sah

It may not be new.
MP-ify SAD slightly

- Introduce key_sa_mtx and use it for some list operations
- Use pserialize for some list iterations
Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

KEY_SA_UNREF is still key_freesav so no functional change for now.

This change reduces diff of further changes.
Remove out-of-date log output

Pointed out by riastradh@
Use KDASSERT instead of KASSERT for mutex_ownable

Because mutex_ownable is too heavy to run in a fast path

Suggested by riastradh@
Assemble global lists and related locks into cache lines (NFCI)

Also rename variable names from *tree to *list because they are
just lists, not trees.

Suggested by riastradh@
Move locking notes
Update the locking notes

- Add locking order
- Add locking notes for misc lists such as reglist
- Mention pserialize, key_sp_ref and key_sp_unref on SP operations

Requested by riastradh@
Describe constraints of key_sp_ref and key_sp_unref

Requested by riastradh@
Add __read_mostly to key_psz

Suggested by riastradh@
Tweak wording (pserialize critical section => pserialize read section)

Suggested by riastradh@
Add missing mutex_exit
Fix setkey -D -P outputs

The outputs were tweaked (by me), but I forgot updating libipsec
in my local ATF environment...
MP-ify SAD (key_sad.sahlist and sah entries)

localcount(9) is used to protect key_sad.sahlist and sah entries
as well as SPD (and will be used for SAD sav).

Please read the locking notes of SAD for more details.
Introduce key_sa_refcnt and replace sav->refcnt with it (NFC)
Destroy sav only in the loop for DEAD sav
Fix KASSERT(solocked(sb->sb_so)) failure in sbappendaddr that is called eventually from key_sendup_mbuf

If key_sendup_mbuf isn't passed a socket, the assertion fails.
Originally in this case sb->sb_so was softnet_lock and callers
held softnet_lock so the assertion was magically satisfied.
Now sb->sb_so is key_so_mtx and also softnet_lock isn't always
held by callers so the assertion can fail.

Fix it by holding key_so_mtx if key_sendup_mbuf isn't passed a socket.

Reported by knakahara@
Tested by knakahara@ and ozaki-r@
Fix locking notes of SAD
Fix deadlock between key_sendup_mbuf called from key_acquire and localcount_drain

If we call key_sendup_mbuf from key_acquire that is called on packet
processing, a deadlock can happen like this:
- At key_acquire, a reference to an SP (and an SA) is held
- key_sendup_mbuf will try to take key_so_mtx
- Some other thread may try to localcount_drain to the SP with
  holding key_so_mtx in say key_api_spdflush
- In this case localcount_drain never return because key_sendup_mbuf
  that has stuck on key_so_mtx never release a reference to the SP

Fix the deadlock by deferring key_sendup_mbuf to the timer
Fix that prev isn't cleared on retry
Limit the number of mbufs queued for deferred key_sendup_mbuf

It's easy to be queued hundreds of mbufs on the list under heavy
network load.
MP-ify SAD (savlist)

localcount(9) is used to protect savlist of sah. The basic design is
similar to MP-ifications of SPD and SAD sahlist. Please read the
locking notes of SAD for more details.
Simplify ipsec_reinject_ipstack (NFC)
Add per-CPU rtcache to ipsec_reinject_ipstack

It reduces route lookups and also reduces rtcache lock contentions
when NET_MPSAFE is enabled.
Use pool_cache(9) instead of pool(9) for tdb_crypto objects

The change improves network throughput especially on multi-core systems.

ipsec(4), opencrypto(9) and vlan(4) are now MP-safe.
Write known issues on scalability
Share a global dummy SP between PCBs

It's never be changed so it can be pre-allocated and shared safely between PCBs.
Fix race condition on the rawcb list shared by rtsock and keysock

keysock now protects itself by its own mutex, which means that
the rawcb list is protected by two different mutexes (keysock's one
and softnet_lock for rtsock), of course it's useless.

Fix the situation by having a discrete rawcb list for each.
Use a dedicated mutex for rt_rawcb instead of softnet_lock if NET_MPSAFE
fix localcount leak in sav. fixed by ozaki-r@n.o.

I commit on behalf of him.
remove unnecessary comment.
Fix deadlock between pserialize_perform and localcount_drain

A typical ussage of localcount_drain looks like this:

  item = remove_from_list();
  localcount_drain(&item->localcount, &cv, &mtx);

This sequence can cause a deadlock which happens for example on the following

- Thread A calls localcount_drain which calls xc_broadcast after releasing
  a specified mutex
- Thread B enters the sequence and calls pserialize_perform with holding
  the mutex while pserialize_perform also calls xc_broadcast
- Thread C (xc_thread) that calls an xcall callback of localcount_drain tries
  to hold the mutex

xc_broadcast of thread B doesn't start until xc_broadcast of thread A
finishes, which is a feature of xcall(9). This means that pserialize_perform
never complete until xc_broadcast of thread A finishes. On the other hand,
thread C that is a callee of xc_broadcast of thread A sticks on the mutex.
Finally the threads block each other (A blocks B, B blocks C and C blocks A).

A possible fix is to serialize executions of the above sequence by another
mutex, but adding another mutex makes the code complex, so fix the deadlock
by another way; the fix is to release the mutex before pserialize_perform
and instead use a condvar to prevent pserialize_perform from being called

Note that the deadlock has happened only if NET_MPSAFE is enabled.
Add missing ifdef NET_MPSAFE
Take softnet_lock on pr_input properly if NET_MPSAFE

Currently softnet_lock is taken unnecessarily in some cases, e.g.,
icmp_input and encap4_input from ip_input, or not taken even if needed,
e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

NFC if NET_MPSAFE is disabled (default).
- sanitize key debugging so that we don't print extra newlines or unassociated
  debugging messages.
- remove unused functions and make internal ones static
- print information in one line per message
humanize printing of ip addresses
cast reduction, NFC.
Fix typo in comment
Pull out ipsec_fill_saidx_bymbuf (NFC)
Don't abuse key_checkrequest just for looking up sav

It does more than expected for example key_acquire.
Fix SP is broken on transport mode

isr->saidx was modified accidentally in ipsec_nextisr.

Reported by christos@
Helped investigations by christos@ and knakahara@
Constify isr at many places (NFC)
Include socketvar.h for softnet_lock
Fix buffer length for ipsec_logsastr

Revision 1.361 / (download) - annotate - [select for diffs], Wed Sep 27 10:05:04 2017 UTC (3 months, 2 weeks ago) by ozaki-r
Branch: MAIN
Changes since 1.360: +2 -4 lines
Diff to previous 1.360 (colored) to selected 1.183 (colored)

Take softnet_lock on pr_input properly if NET_MPSAFE

Currently softnet_lock is taken unnecessarily in some cases, e.g.,
icmp_input and encap4_input from ip_input, or not taken even if needed,
e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

NFC if NET_MPSAFE is disabled (default).

Revision 1.319.4.10 / (download) - annotate - [select for diffs], Mon Aug 28 17:53:12 2017 UTC (4 months, 2 weeks ago) by skrll
Branch: nick-nhusb
Changes since 1.319.4.9: +108 -102 lines
Diff to previous 1.319.4.9 (colored) to branchpoint 1.319 (colored) next main 1.320 (colored) to selected 1.183 (colored)

Sync with HEAD

Revision 1.360 / (download) - annotate - [select for diffs], Thu Jul 27 06:59:28 2017 UTC (5 months, 3 weeks ago) by ozaki-r
Branch: MAIN
CVS Tags: nick-nhusb-base-20170825
Changes since 1.359: +2 -8 lines
Diff to previous 1.359 (colored) to selected 1.183 (colored)

Don't acquire global locks for IPsec if NET_MPSAFE

Note that the change is just to make testing easy and IPsec isn't MP-safe yet.

Revision 1.359 / (download) - annotate - [select for diffs], Wed Jul 19 07:24:46 2017 UTC (5 months, 4 weeks ago) by ozaki-r
Branch: MAIN
Changes since 1.358: +3 -3 lines
Diff to previous 1.358 (colored) to selected 1.183 (colored)

Correct a comment

Revision 1.358 / (download) - annotate - [select for diffs], Sat Jul 8 22:56:15 2017 UTC (6 months, 1 week ago) by christos
Branch: MAIN
CVS Tags: perseant-stdc-iso10646-base, perseant-stdc-iso10646
Changes since 1.357: +19 -14 lines
Diff to previous 1.357 (colored) to selected 1.183 (colored)

Reorder the controls to the ones that need an interface and the ones that
don't; process the ones that don't first. Add a DIAGNOSTIC if there is no
interface; really this should be a KASSERT/panic because it is a bug if the
interface is not set at this point.

Revision 1.357 / (download) - annotate - [select for diffs], Thu Jul 6 17:12:34 2017 UTC (6 months, 1 week ago) by christos
Branch: MAIN
Changes since 1.356: +6 -6 lines
Diff to previous 1.356 (colored) to selected 1.183 (colored)

remove unnecessary casts (no functional change)

Revision 1.356 / (download) - annotate - [select for diffs], Thu Jul 6 17:08:57 2017 UTC (6 months, 1 week ago) by christos
Branch: MAIN
Changes since 1.355: +5 -29 lines
Diff to previous 1.355 (colored) to selected 1.183 (colored)

Merge the two copies SO_TIMESTAMP/SO_OTIMESTAMP processing to a single
function, and add a SOOPT_TIMESTAMP define reducing compat pollution from
5 places to 1.

Revision 1.355 / (download) - annotate - [select for diffs], Thu Jun 1 02:45:14 2017 UTC (7 months, 2 weeks ago) by chs
Branch: MAIN
CVS Tags: netbsd-8-base
Branch point for: netbsd-8
Changes since 1.354: +2 -6 lines
Diff to previous 1.354 (colored) to selected 1.183 (colored)

remove checks for failure after memory allocation calls that cannot fail:

  kmem_alloc() with KM_SLEEP
  kmem_zalloc() with KM_SLEEP

all of these paths include an assertion that the allocation has not failed,
so callers should not assert that again.

Revision 1.337.2.5 / (download) - annotate - [select for diffs], Wed Apr 26 02:53:29 2017 UTC (8 months, 3 weeks ago) by pgoyette
Branch: pgoyette-localcount
Changes since 1.337.2.4: +69 -40 lines
Diff to previous 1.337.2.4 (colored) to branchpoint 1.337 (colored) next main 1.338 (colored) to selected 1.183 (colored)

Sync with HEAD

Revision 1.347.2.1 / (download) - annotate - [select for diffs], Fri Apr 21 16:54:06 2017 UTC (8 months, 3 weeks ago) by bouyer
Branch: bouyer-socketcan
Changes since 1.347: +91 -51 lines
Diff to previous 1.347 (colored) next main 1.348 (colored) to selected 1.183 (colored)

Sync with HEAD

Revision 1.354 / (