[BACK]Return to 3.t CVS log [TXT][DIR] Up to [cvs.NetBSD.org] / src / share / doc / papers / sysperf

File: [cvs.NetBSD.org] / src / share / doc / papers / sysperf / 3.t (download)

Revision 1.4, Fri Feb 13 11:36:08 2004 UTC (17 years, 2 months ago) by wiz
Branch: MAIN
CVS Tags: yamt-pf42-baseX, yamt-pf42-base4, yamt-pf42-base3, yamt-pf42-base2, yamt-pf42-base, yamt-pf42, yamt-pagecache-tag8, yamt-pagecache-base9, yamt-pagecache-base8, yamt-pagecache-base7, yamt-pagecache-base6, yamt-pagecache-base5, yamt-pagecache-base4, yamt-pagecache-base3, yamt-pagecache-base2, yamt-pagecache-base, yamt-pagecache, wrstuden-revivesa-base-3, wrstuden-revivesa-base-2, wrstuden-revivesa-base-1, wrstuden-revivesa-base, wrstuden-revivesa, wrstuden-fixsa-newbase, wrstuden-fixsa-base-1, wrstuden-fixsa-base, wrstuden-fixsa, tls-maxphys-base, tls-maxphys, tls-earlyentropy-base, tls-earlyentropy, riastradh-xf86-video-intel-2-7-1-pre-2-21-15, riastradh-drm2-base3, riastradh-drm2-base2, riastradh-drm2-base1, riastradh-drm2-base, riastradh-drm2, prg-localcount2-base3, prg-localcount2-base2, prg-localcount2-base1, prg-localcount2-base, prg-localcount2, phil-wifi-base, phil-wifi-20200421, phil-wifi-20200411, phil-wifi-20200406, phil-wifi-20191119, phil-wifi-20190609, phil-wifi, pgoyette-localcount-base, pgoyette-localcount-20170426, pgoyette-localcount-20170320, pgoyette-localcount-20170107, pgoyette-localcount-20161104, pgoyette-localcount-20160806, pgoyette-localcount-20160726, pgoyette-localcount, pgoyette-compat-merge-20190127, pgoyette-compat-base, pgoyette-compat-20190127, pgoyette-compat-20190118, pgoyette-compat-1226, pgoyette-compat-1126, pgoyette-compat-1020, pgoyette-compat-0930, pgoyette-compat-0906, pgoyette-compat-0728, pgoyette-compat-0625, pgoyette-compat-0521, pgoyette-compat-0502, pgoyette-compat-0422, pgoyette-compat-0415, pgoyette-compat-0407, pgoyette-compat-0330, pgoyette-compat-0322, pgoyette-compat-0315, pgoyette-compat, perseant-stdc-iso10646-base, perseant-stdc-iso10646, netbsd-9-base, netbsd-9-1-RELEASE, netbsd-9-0-RELEASE, netbsd-9-0-RC2, netbsd-9-0-RC1, netbsd-9, netbsd-8-base, netbsd-8-2-RELEASE, netbsd-8-1-RELEASE, netbsd-8-1-RC1, netbsd-8-0-RELEASE, netbsd-8-0-RC2, netbsd-8-0-RC1, netbsd-8, netbsd-7-nhusb-base-20170116, netbsd-7-nhusb-base, netbsd-7-nhusb, netbsd-7-base, netbsd-7-2-RELEASE, netbsd-7-1-RELEASE, netbsd-7-1-RC2, netbsd-7-1-RC1, netbsd-7-1-2-RELEASE, netbsd-7-1-1-RELEASE, netbsd-7-1, netbsd-7-0-RELEASE, netbsd-7-0-RC3, netbsd-7-0-RC2, netbsd-7-0-RC1, netbsd-7-0-2-RELEASE, netbsd-7-0-1-RELEASE, netbsd-7-0, netbsd-7, netbsd-6-base, netbsd-6-1-RELEASE, netbsd-6-1-RC4, netbsd-6-1-RC3, netbsd-6-1-RC2, netbsd-6-1-RC1, netbsd-6-1-5-RELEASE, netbsd-6-1-4-RELEASE, netbsd-6-1-3-RELEASE, netbsd-6-1-2-RELEASE, netbsd-6-1-1-RELEASE, netbsd-6-1, netbsd-6-0-RELEASE, netbsd-6-0-RC2, netbsd-6-0-RC1, netbsd-6-0-6-RELEASE, netbsd-6-0-5-RELEASE, netbsd-6-0-4-RELEASE, netbsd-6-0-3-RELEASE, netbsd-6-0-2-RELEASE, netbsd-6-0-1-RELEASE, netbsd-6-0, netbsd-6, netbsd-5-base, netbsd-5-2-RELEASE, netbsd-5-2-RC1, netbsd-5-2-3-RELEASE, netbsd-5-2-2-RELEASE, netbsd-5-2-1-RELEASE, netbsd-5-2, netbsd-5-1-RELEASE, netbsd-5-1-RC4, netbsd-5-1-RC3, netbsd-5-1-RC2, netbsd-5-1-RC1, netbsd-5-1-5-RELEASE, netbsd-5-1-4-RELEASE, netbsd-5-1-3-RELEASE, netbsd-5-1-2-RELEASE, netbsd-5-1-1-RELEASE, netbsd-5-1, netbsd-5-0-RELEASE, netbsd-5-0-RC4, netbsd-5-0-RC3, netbsd-5-0-RC2, netbsd-5-0-RC1, netbsd-5-0-2-RELEASE, netbsd-5-0-1-RELEASE, netbsd-5-0, netbsd-5, netbsd-4-base, netbsd-4-0-RELEASE, netbsd-4-0-RC5, netbsd-4-0-RC4, netbsd-4-0-RC3, netbsd-4-0-RC2, netbsd-4-0-RC1, netbsd-4-0-1-RELEASE, netbsd-4-0, netbsd-4, netbsd-3-base, netbsd-3-1-RELEASE, netbsd-3-1-RC4, netbsd-3-1-RC3, netbsd-3-1-RC2, netbsd-3-1-RC1, netbsd-3-1-1-RELEASE, netbsd-3-1, netbsd-3-0-RELEASE, netbsd-3-0-RC6, netbsd-3-0-RC5, netbsd-3-0-RC4, netbsd-3-0-RC3, netbsd-3-0-RC2, netbsd-3-0-RC1, netbsd-3-0-3-RELEASE, netbsd-3-0-2-RELEASE, netbsd-3-0-1-RELEASE, netbsd-3-0, netbsd-3, netbsd-2-base, netbsd-2-1-RELEASE, netbsd-2-1-RC6, netbsd-2-1-RC5, netbsd-2-1-RC4, netbsd-2-1-RC3, netbsd-2-1-RC2, netbsd-2-1-RC1, netbsd-2-1, netbsd-2-0-base, netbsd-2-0-RELEASE, netbsd-2-0-RC5, netbsd-2-0-RC4, netbsd-2-0-RC3, netbsd-2-0-RC2, netbsd-2-0-RC1, netbsd-2-0-3-RELEASE, netbsd-2-0-2-RELEASE, netbsd-2-0-1-RELEASE, netbsd-2-0, netbsd-2, mjf-devfs2-base, mjf-devfs2, matt-premerge-20091211, matt-nb8-mediatek-base, matt-nb8-mediatek, matt-nb6-plus-nbase, matt-nb6-plus-base, matt-nb6-plus, matt-nb5-pq3-base, matt-nb5-pq3, matt-nb5-mips64-u2-k2-k4-k7-k8-k9, matt-nb5-mips64-u1-k1-k5, matt-nb5-mips64-premerge-20101231, matt-nb5-mips64-premerge-20091211, matt-nb5-mips64-k15, matt-nb5-mips64, matt-nb4-mips64-k7-u2a-k9b, matt-mips64-premerge-20101231, matt-mips64-base2, matt-mips64-base, matt-mips64, matt-armv6-prevmlocking, matt-armv6-nbase, matt-armv6-base, matt-armv6, localcount-20160914, keiichi-mipv6-nbase, keiichi-mipv6-base, keiichi-mipv6, jym-xensuspend-nbase, jym-xensuspend-base, jym-xensuspend, is-mlppp-base, is-mlppp, hpcarm-cleanup-nbase, hpcarm-cleanup-base, hpcarm-cleanup, cube-autoconf-base, cube-autoconf, cherry-xenmp-base, cherry-xenmp, bouyer-socketcan-base1, bouyer-socketcan-base, bouyer-socketcan, bouyer-quota2-nbase, bouyer-quota2-base, bouyer-quota2, agc-symver-base, agc-symver, abandoned-netbsd-4-base, abandoned-netbsd-4, HEAD
Changes since 1.3: +3 -3 lines

Uppercase CPU, plural is CPUs.

.\"	$NetBSD: 3.t,v 1.4 2004/02/13 11:36:08 wiz Exp $
.\" Copyright (c) 1985 The Regents of the University of California.
.\" All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"	@(#)3.t	5.1 (Berkeley) 4/17/91
.ds RH Results of our observations
Results of our observations
When 4.2BSD was first installed on several large timesharing systems
the degradation in performance was significant.
Informal measurements showed 4.2BSD providing 80% of the throughput
of 4.1BSD (based on load averages observed under a normal timesharing load).
Many of the initial problems found were because of programs that were
not part of 4.1BSD.  Using the techniques described in the previous
section and standard process profiling several problems were identified.
Later work concentrated on the operation of the kernel itself.
In this section we discuss the problems uncovered;  in the next
section we describe the changes made to the system.
.NH 2
User programs
.NH 3
Mail system		
The mail system was the first culprit identified as a major
contributor to the degradation in system performance.
At Lucasfilm the mail system is heavily used
on one machine, a VAX-11/780 with eight megabytes of memory.\**
\** During part of these observations the machine had only four
megabytes of memory.
traffic is usually between users on the same machine and ranges from
person-to-person telephone messages to per-organization distribution
lists.  After conversion to 4.2BSD, it was
immediately noticed that mail to distribution lists of 20 or more people
caused the system load to jump by anywhere from 3 to 6 points.
The number of processes spawned by the \fIsendmail\fP program and
the messages sent from \fIsendmail\fP to the system logging
process, \fIsyslog\fP, generated significant load both from their
execution and their interference with basic system operation.  The
number of context switches and disk transfers often doubled while
\fIsendmail\fP operated; the system call rate jumped dramatically.
System accounting information consistently
showed \fIsendmail\fP as the top CPU user on the system.
.NH 3
Network servers
The network services provided in 4.2BSD add new capabilities to the system,
but are not without cost.  The system uses one daemon process to accept
requests for each network service provided.  The presence of many
such daemons increases the numbers of active processes and files,
and requires a larger configuration to support the same number of users.
The overhead of the routing and status updates can consume
several percent of the CPU.
Remote logins and shells incur more overhead
than their local equivalents.
For example, a remote login uses three processes and a
pseudo-terminal handler in addition to the local hardware terminal
handler.  When using a screen editor, sending and echoing a single
character involves four processes on two machines.
The additional processes, context switching, network traffic, and
terminal handler overhead can roughly triple the load presented by one
local terminal user.
.NH 2
System overhead
To measure the costs of various functions in the kernel,
a profiling system was run for a 17 hour
period on one of our general timesharing machines.
While this is not as reproducible as a synthetic workload,
it certainly represents a realistic test.
This test was run on several occasions over a three month period.
Despite the long period of time that elapsed
between the test runs the shape of the profiles,
as measured by the number of times each system call
entry point was called, were remarkably similar.
These profiles turned up several bottlenecks that are
discussed in the next section.
Several of these were new to 4.2BSD,
but most were caused by overloading of mechanisms
which worked acceptably well in previous BSD systems.
The general conclusion from our measurements was that
the ratio of user to system time had increased from
45% system / 55% user in 4.1BSD to 57% system / 43% user
in 4.2BSD.
.NH 3
Micro-operation benchmarks
To compare certain basic system operations
between 4.1BSD and 4.2BSD a suite of benchmark
programs was constructed and run on a VAX-11/750 with 4.5 megabytes
of physical memory and two disks on a MASSBUS controller.
Tests were run with the machine operating in single user mode
under both 4.1BSD and 4.2BSD.   Paging was localized to the drive
where the root file system was located.
The benchmark programs were modeled after the Kashtan benchmarks,
[Kashtan80], with identical sources compiled under each system.
The programs and their intended purpose are described briefly
before the presentation of the results.  The benchmark scripts
were run twice with the results shown as the average of
the two runs.
The source code for each program and the shell scripts used during
the benchmarks are included in the Appendix.
The set of tests shown in Table 1 was concerned with
system operations other than paging.  The intent of most
benchmarks is clear.  The result of running \fIsignocsw\fP is
deducted from the \fIcsw\fP benchmark to calculate the context
switch overhead.  The \fIexec\fP tests use two different jobs to gauge
the cost of overlaying a larger program with a smaller one
and vice versa.  The
``null job'' and ``big job'' differ solely in the size of their data
segments, 1 kilobyte versus 256 kilobytes.  In both cases the
text segment of the parent is larger than that of the child.\**
\** These tests should also have measured the cost of expanding the
text segment; unfortunately time did not permit running additional tests.
All programs were compiled into the default load format that causes
the text segment to be demand paged out of the file system and shared
between processes.
center box;
l | l.
Test	Description
syscall	perform 100,000 \fIgetpid\fP system calls
csw	perform 10,000 context switches using signals
signocsw	send 10,000 signals to yourself
pipeself4	send 10,000 4-byte messages to yourself
pipeself512	send 10,000 512-byte messages to yourself
pipediscard4	send 10,000 4-byte messages to child who discards
pipediscard512	send 10,000 512-byte messages to child who discards
pipeback4	exchange 10,000 4-byte messages with child
pipeback512	exchange 10,000 512-byte messages with child
forks0	fork-exit-wait 1,000 times
forks1k	sbrk(1024), fault page, fork-exit-wait 1,000 times
forks100k	sbrk(102400), fault pages, fork-exit-wait 1,000 times
vforks0	vfork-exit-wait 1,000 times
vforks1k	sbrk(1024), fault page, vfork-exit-wait 1,000 times
vforks100k	sbrk(102400), fault pages, vfork-exit-wait 1,000 times
execs0null	fork-exec ``null job''-exit-wait 1,000 times
execs0null (1K env)	execs0null above, with 1K environment added
execs1knull	sbrk(1024), fault page, fork-exec ``null job''-exit-wait 1,000 times
execs1knull (1K env)	execs1knull above, with 1K environment added
execs100knull	sbrk(102400), fault pages, fork-exec ``null job''-exit-wait 1,000 times
vexecs0null	vfork-exec ``null job''-exit-wait 1,000 times
vexecs1knull	sbrk(1024), fault page, vfork-exec ``null job''-exit-wait 1,000 times
vexecs100knull	sbrk(102400), fault pages, vfork-exec ``null job''-exit-wait 1,000 times
execs0big	fork-exec ``big job''-exit-wait 1,000 times
execs1kbig	sbrk(1024), fault page, fork-exec ``big job''-exit-wait 1,000 times
execs100kbig	sbrk(102400), fault pages, fork-exec ``big job''-exit-wait 1,000 times
vexecs0big	vfork-exec ``big job''-exit-wait 1,000 times
vexecs1kbig	sbrk(1024), fault pages, vfork-exec ``big job''-exit-wait 1,000 times
vexecs100kbig	sbrk(102400), fault pages, vfork-exec ``big job''-exit-wait 1,000 times
Table 1. Kernel Benchmark programs.
The results of these tests are shown in Table 2.  If the 4.1BSD results
are scaled to reflect their being run on a VAX-11/750, they
correspond closely to those found in [Joy80].\**
\** We assume that a VAX-11/750 runs at 60% of the speed of a VAX-11/780
(not considering floating point operations).
center box;
c s s s s s s s s s
c || c s s || c s s || c s s
c || c s s || c s s || c s s
c || c | c | c || c | c | c || c | c | c
l || n | n | n || n | n | n || n | n | n.
Berkeley Software Distribution UNIX Systems
Test	Elapsed Time	User Time	System Time
\^	_	_	_
\^	4.1	4.2	4.3	4.1	4.2	4.3	4.1	4.2	4.3
syscall	28.0	29.0	23.0	4.5	5.3	3.5	23.9	23.7	20.4
csw	45.0	60.0	45.0	3.5	4.3	3.3	19.5	25.4	19.0
signocsw	16.5	23.0	16.0	1.9	3.0	1.1	14.6	20.1	15.2
pipeself4	21.5	29.0	26.0	1.1	1.1	0.8	20.1	28.0	25.6
pipeself512	47.5	59.0	55.0	1.2	1.2	1.0	46.1	58.3	54.2
pipediscard4	32.0	42.0	36.0	3.2	3.7	3.0	15.5	18.8	15.6
pipediscard512	61.0	76.0	69.0	3.1	2.1	2.0	29.7	36.4	33.2
pipeback4	57.0	75.0	66.0	2.9	3.2	3.3	25.1	34.2	29.7
pipeback512	110.0	138.0	125.0	3.1	3.4	2.2	52.2	65.7	57.7
forks0	37.5	41.0	22.0	0.5	0.3	0.3	34.5	37.6	21.5
forks1k	40.0	43.0	22.0	0.4	0.3	0.3	36.0	38.8	21.6
forks100k	217.5	223.0	176.0	0.7	0.6	0.4	214.3	218.4	175.2
vforks0	34.5	37.0	22.0	0.5	0.6	0.5	27.3	28.5	17.9
vforks1k	35.0	37.0	22.0	0.6	0.8	0.5	27.2	28.6	17.9
vforks100k	35.0	37.0	22.0	0.6	0.8	0.6	27.6	28.9	17.9
execs0null	97.5	92.0	66.0	3.8	2.4	0.6	68.7	82.5	48.6
execs0null (1K env)	197.0	229.0	75.0	4.1	2.6	0.9	167.8	212.3	62.6
execs1knull	99.0	100.0	66.0	4.1	1.9	0.6	70.5	86.8	48.7
execs1knull (1K env)	199.0	230.0	75.0	4.2	2.6	0.7	170.4	214.9	62.7
execs100knull	283.5	278.0	216.0	4.8	2.8	1.1	251.9	269.3	202.0
vexecs0null	100.0	92.0	66.0	5.1	2.7	1.1	63.7	76.8	45.1
vexecs1knull	100.0	91.0	66.0	5.2	2.8	1.1	63.2	77.1	45.1
vexecs100knull	100.0	92.0	66.0	5.1	3.0	1.1	64.0	77.7	45.6
execs0big	129.0	201.0	101.0	4.0	3.0	1.0	102.6	153.5	92.7
execs1kbig	130.0	202.0	101.0	3.7	3.0	1.0	104.7	155.5	93.0
execs100kbig	318.0	385.0	263.0	4.8	3.1	1.1	286.6	339.1	247.9
vexecs0big	128.0	200.0	101.0	4.6	3.5	1.6	98.5	149.6	90.4
vexecs1kbig	125.0	200.0	101.0	4.7	3.5	1.3	98.9	149.3	88.6
vexecs100kbig	126.0	200.0	101.0	4.2	3.4	1.3	99.5	151.0	89.0
Table 2. Kernel Benchmark results (all times in seconds).
In studying the measurements we found that the basic system call
and context switch overhead did not change significantly
between 4.1BSD and 4.2BSD.  The \fIsignocsw\fP results were caused by
the changes to the \fIsignal\fP interface, resulting
in an additional subroutine invocation for each call, not
to mention additional complexity in the system's implementation.
The times for the use of pipes are significantly higher under
4.2BSD because of their implementation on top of the interprocess
communication facilities.  Under 4.1BSD pipes were implemented
without the complexity of the socket data structures and with
simpler code.  Further, while not obviously a factor here,
4.2BSD pipes have less system buffer space provided them than
4.1BSD pipes.
The \fIexec\fP tests shown in Table 2 were performed with 34 bytes of
environment information under 4.1BSD and 40 bytes under 4.2BSD.
To figure the cost of passing data through the environment,
the execs0null and execs1knull tests were rerun with
1065 additional bytes of data.  The results are show in Table 3.
center box;
c || c s || c s || c s
c || c s || c s || c s
c || c | c || c | c || c | c
l || n | n || n | n || n | n.
Test	Real	User	System
\^	_	_	_
\^	4.1	4.2	4.1	4.2	4.1	4.2
execs0null	197.0	229.0	4.1	2.6	167.8	212.3
execs1knull	199.0	230.0	4.2	2.6	170.4	214.9
Table 3. Benchmark results with ``large'' environment (all times in seconds).
These results show that passing argument data is significantly
slower than under 4.1BSD: 121 ms/byte versus 93 ms/byte.  Even using
this factor to adjust the basic overhead of an \fIexec\fP system
call, this facility is more costly under 4.2BSD than under 4.1BSD.
.NH 3
Path name translation
The single most expensive function performed by the kernel
is path name translation.
This has been true in almost every UNIX kernel [Mosher80];
we find that our general time sharing systems do about
500,000 name translations per day.
Name translations became more expensive in 4.2BSD for several reasons.
The single most expensive addition was the symbolic link.
Symbolic links
have the effect of increasing the average number of components
in path names to be translated.
As an insidious example,
consider the system manager that decides to change /tmp
to be a symbolic link to /usr/tmp.
A name such as /tmp/tmp1234 that previously required two component
now requires four component translations plus the cost of reading 
the contents of the symbolic link.
The new directory format also changes the characteristics of
name translation.
The more complex format requires more computation to determine
where to place new entries in a directory.
Conversely the additional information allows the system to only
look at active entries when searching,
hence searches of directories that had once grown large
but currently have few active entries are checked quickly.
The new format also stores the length of each name so that
costly string comparisons are only done on names that are the
same length as the name being sought.
The net effect of the changes is that the average time to
translate a path name in 4.2BSD is 24.2 milliseconds,
representing 40% of the time processing system calls,
that is 19% of the total cycles in the kernel,
or 11% of all cycles executed on the machine.
The times are shown in Table 4.  We have no comparable times
for \fInamei\fP under 4.1 though they are certain to
be significantly less.
center box;
l r r.
part	time	% of kernel
self	14.3 ms/call	11.3%
child	9.9 ms/call	7.9%
total	24.2 ms/call	19.2%
Table 4. Call times for \fInamei\fP in 4.2BSD.
.NH 3
Clock processing
Nearly 25% of the time spent in the kernel is spent in the clock
processing routines.
(This is a clear indication that to avoid sampling bias when profiling the
kernel with our tools
we need to drive them from an independent clock.)
These routines are responsible for implementing timeouts,
scheduling the processor,
maintaining kernel statistics,
and tending various hardware operations such as
draining the terminal input silos.
Only minimal work is done in the hardware clock interrupt
routine (at high priority), the rest is performed (at a lower priority)
in a software interrupt handler scheduled by the hardware interrupt
In the worst case, with a clock rate of 100 Hz
and with every hardware interrupt scheduling a software
interrupt, the processor must field 200 interrupts per second.
The overhead of simply trapping and returning
is 3% of the machine cycles,
figuring out that there is nothing to do
requires an additional 2%.
.NH 3
Terminal multiplexors
The terminal multiplexors supported by 4.2BSD have programmable receiver
silos that may be used in two ways.
With the silo disabled, each character received causes an interrupt
to the processor.
Enabling the receiver silo allows the silo to fill before
generating an interrupt, allowing multiple characters to be read
for each interrupt.
At low rates of input, received characters will not be processed
for some time unless the silo is emptied periodically.
The 4.2BSD kernel uses the input silos of each terminal multiplexor,
and empties each silo on each clock interrupt.
This allows high input rates without the cost of per-character interrupts
while assuring low latency.
However, as character input rates on most machines are usually
low (about 25 characters per second),
this can result in excessive overhead.
At the current clock rate of 100 Hz, a machine with 5 terminal multiplexors
configured makes 500 calls to the receiver interrupt routines per second.
In addition, to achieve acceptable input latency
for flow control, each clock interrupt must schedule
a software interrupt to run the silo draining routines.\**
\** It is not possible to check the input silos at
the time of the actual clock interrupt without modifying the terminal
line disciplines, as the input queues may not be in a consistent state \**.
\** This implies that the worst case estimate for clock processing
is the basic overhead for clock processing.
.NH 3
Process table management
In 4.2BSD there are numerous places in the kernel where a linear search
of the process table is performed: 
.IP \(bu 3
in \fIexit\fP to locate and wakeup a process's parent;
.IP \(bu 3
in \fIwait\fP when searching for \fB\s-2ZOMBIE\s+2\fP and
\fB\s-2STOPPED\s+2\fP processes;
.IP \(bu 3
in \fIfork\fP when allocating a new process table slot and
counting the number of processes already created by a user;
.IP \(bu 3
in \fInewproc\fP, to verify
that a process id assigned to a new process is not currently
in use;
.IP \(bu 3
in \fIkill\fP and \fIgsignal\fP to locate all processes to
which a signal should be delivered;
.IP \(bu 3
in \fIschedcpu\fP when adjusting the process priorities every
second; and
.IP \(bu 3
in \fIsched\fP when locating a process to swap out and/or swap
These linear searches can incur significant overhead.  The rule
for calculating the size of the process table is:
nproc = 20 + 8 * maxusers
that means a 48 user system will have a 404 slot process table.
With the addition of network services in 4.2BSD, as many as a dozen
server processes may be maintained simply to await incoming requests.
These servers are normally created at boot time which causes them
to be allocated slots near the beginning of the process table.  This
means that process table searches under 4.2BSD are likely to take
significantly longer than under 4.1BSD.  System profiling shows
that as much as 20% of the time spent in the kernel on a loaded
system (a VAX-11/780) can be spent in \fIschedcpu\fP and, on average,
5-10% of the kernel time is spent in \fIschedcpu\fP.
The other searches of the proc table are similarly affected.
This shows the system can no longer tolerate using linear searches of
the process table.
.NH 3
File system buffer cache
The trace facilities described in section 2.3 were used
to gather statistics on the performance of the buffer cache.
We were interested in measuring the effectiveness of the
cache and the read-ahead policies.
With the file system block size in 4.2BSD four to
eight times that of a 4.1BSD file system, we were concerned
that large amounts of read-ahead might be performed without
being used.  Also, we were interested in seeing if the
rules used to size the buffer cache at boot time were severely
affecting the overall cache operation.
The tracing package was run over a three hour period during
a peak mid-afternoon period on a VAX 11/780 with four megabytes
of physical memory.
This resulted in a buffer cache containing 400 kilobytes of memory
spread among 50 to 200 buffers
(the actual number of buffers depends on the size mix of
disk blocks being read at any given time).
The pertinent configuration information is shown in Table 5.
center box;
l l l l.
Controller	Drive	Device	File System
DEC MASSBUS	DEC RP06	hp0d	/usr
		hp0b	swap
Emulex SC780	Fujitsu Eagle	hp1a	/usr/spool/news
		hp1b	swap
		hp1e	/usr/src
		hp1d	/u0 (users)
	Fujitsu Eagle	hp2a	/tmp
		hp2b	swap
		hp2d	/u1 (users)
	Fujitsu Eagle	hp3a	/
Table 5. Active file systems during buffer cache tests.
During the test period the load average ranged from 2 to 13
with an average of 5.
The system had no idle time, 43% user time, and 57% system time.
The system averaged 90 interrupts per second 
(excluding the system clock interrupts),
220 system calls per second,
and 50 context switches per second (40 voluntary, 10 involuntary).
The active virtual memory (the sum of the address space sizes of
all jobs that have run in the previous twenty seconds)
over the period ranged from 2 to 6 megabytes with an average
of 3.5 megabytes.
There was no swapping, though the page daemon was inspecting
about 25 pages per second.
On average 250 requests to read disk blocks were initiated
per second.
These include read requests for file blocks made by user 
programs as well as requests initiated by the system.
System reads include requests for indexing information to determine
where a file's next data block resides,
file system layout maps to allocate new data blocks,
and requests for directory contents needed to do path name translations.
On average, an 85% cache hit rate was observed for read requests.
Thus only 37 disk reads were initiated per second.
In addition, 5 read-ahead requests were made each second
filling about 20% of the buffer pool.
Despite the policies to rapidly reuse read-ahead buffers
that remain unclaimed, more than 90% of the read-ahead
buffers were used.
These measurements showed that the buffer cache was working
effectively.  Independent tests have also showed that the size
of the buffer cache may be reduced significantly on memory-poor
system without severe effects;
we have not yet tested this hypothesis [Shannon83].
.NH 3
Network subsystem
The overhead associated with the 
network facilities found in 4.2BSD is often
difficult to gauge without profiling the system.
This is because most input processing is performed
in modules scheduled with software interrupts.
As a result, the system time spent performing protocol
processing is rarely attributed to the processes that
really receive the data.  Since the protocols supported
by 4.2BSD can involve significant overhead this was a serious
concern.  Results from a profiled kernel show an average
of 5% of the system time is spent
performing network input and timer processing in our environment
(a 3Mb/s Ethernet with most traffic using TCP).
This figure can vary significantly depending on
the network hardware used, the average message
size, and whether packet reassembly is required at the network
layer.  On one machine we profiled over a 17 hour
period (our gateway to the ARPANET)
206,000 input messages accounted for 2.4% of the system time,
while another 0.6% of the system time was spent performing
protocol timer processing.
This machine was configured with an ACC LH/DH IMP interface
and a DMA 3Mb/s Ethernet controller.
The performance of TCP over slower long-haul networks
was degraded substantially by two problems.
The first problem was a bug that prevented round-trip timing measurements
from being made, thus increasing retransmissions unnecessarily.
The second was a problem with the maximum segment size chosen by TCP,
that was well-tuned for Ethernet, but was poorly chosen for
the ARPANET, where it causes packet fragmentation.  (The maximum
segment size was actually negotiated upwards to a value that
resulted in excessive fragmentation.)
When benchmarked in Ethernet environments the main memory buffer management
of the network subsystem presented some performance anomalies.
The overhead of processing small ``mbufs'' severely affected throughput for a
substantial range of message sizes.
In spite of the fact that most system ustilities made use of the throughput
optimal 1024 byte size, user processes faced large degradations for some
arbitrary sizes. This was specially true for TCP/IP transmissions [Cabrera84,
.NH 3
Virtual memory subsystem
We ran a set of tests intended to exercise the virtual
memory system under both 4.1BSD and 4.2BSD.
The tests are described in Table 6.
The test programs dynamically allocated
a 7.3 Megabyte array (using \fIsbrk\fP\|(2)) then referenced
pages in the array either: sequentially, in a purely random
fashion, or such that the distance between
successive pages accessed was randomly selected from a Gaussian
distribution.  In the last case, successive runs were made with
increasing standard deviations.
center box;
l | l.
Test	Description
seqpage	sequentially touch pages, 10 iterations
seqpage-v	as above, but first make \fIvadvise\fP\|(2) call
randpage	touch random page 30,000 times
randpage-v	as above, but first make \fIvadvise\fP call
gausspage.1	30,000 Gaussian accesses, standard deviation of 1
gausspage.10	as above, standard deviation of 10
gausspage.30	as above, standard deviation of 30
gausspage.40	as above, standard deviation of 40
gausspage.50	as above, standard deviation of 50
gausspage.60	as above, standard deviation of 60
gausspage.80	as above, standard deviation of 80
gausspage.inf	as above, standard deviation of 10,000
Table 6. Paging benchmark programs.
The results in Table 7 show how the additional
memory requirements
of 4.2BSD can generate more work for the paging system.
Under 4.1BSD,
the system used 0.5 of the 4.5 megabytes of physical memory
on the test machine;
under 4.2BSD it used nearly 1 megabyte of physical memory.\**
\** The 4.1BSD system used for testing was really a 4.1a 
system configured
with networking facilities and code to support
remote file access.  The
4.2BSD system also included the remote file access code.
Since both
systems would be larger than similarly configured ``vanilla''
4.1BSD or 4.2BSD system, we consider out conclusions to still be valid.
This resulted in more page faults and, hence, more system time.
To establish a common ground on which to compare the paging
routines of each system, we check instead the average page fault
service times for those test runs that had a statistically significant
number of random page faults.  These figures, shown in Table 8, show
no significant difference between the two systems in
the area of page fault servicing.  We currently have
no explanation for the results of the sequential
paging tests.
center box;
l || c s || c s || c s || c s
l || c s || c s || c s || c s
l || c | c || c | c || c | c || c | c
l || n | n || n | n || n | n || n | n.
Test	Real	User	System	Page Faults
\^	_	_	_	_
\^	4.1	4.2	4.1	4.2	4.1	4.2	4.1	4.2
seqpage	959	1126	16.7	12.8	197.0	213.0	17132	17113
seqpage-v	579	812	3.8	5.3	216.0	237.7	8394	8351
randpage	571	569	6.7	7.6	64.0	77.2	8085	9776
randpage-v	572	562	6.1	7.3	62.2	77.5	8126	9852
gausspage.1	25	24	23.6	23.8	0.8	0.8	8	8
gausspage.10	26	26	22.7	23.0	3.2	3.6	2	2
gausspage.30	34	33	25.0	24.8	8.6	8.9	2	2
gausspage.40	42	81	23.9	25.0	11.5	13.6	3	260
gausspage.50	113	175	24.2	26.2	19.6	26.3	784	1851
gausspage.60	191	234	27.6	26.7	27.4	36.0	2067	3177
gausspage.80	312	329	28.0	27.9	41.5	52.0	3933	5105
gausspage.inf	619	621	82.9	85.6	68.3	81.5	8046	9650
Table 7. Paging benchmark results (all times in seconds).
center box;
c || c s || c s
c || c s || c s
c || c | c || c | c
l || n | n || n | n.
Test	Page Faults	PFST
\^	_	_
\^	4.1	4.2	4.1	4.2
randpage	8085	9776	791	789
randpage-v	8126	9852	765	786
gausspage.inf	8046	9650	848	844
Table 8. Page fault service times (all times in microseconds).