the-linux-programming-interface-a-linux-and-unix-system-programming-handbook


The Linux Programming Interface
A Linux and UNIX System Programming Handbook
Michael Kerrisk
The Linux Programming Interface is the definitive guide to the Linux and UNIX programming interface: the interface employed by nearly every application that runs on a Linux or UNIX system.

In this authoritative work, Linux programming expert Michael Kerrisk provides detailed descriptions of the system calls and library functions that you need in order to master the craft of system programming, and accompanies his explanations with clear, complete example programs.

You'll find descriptions of system calls and library functions, along with example programs, tables, and diagrams. You'll learn how to:
Read and write files efficiently
Use signals, clocks, and timers
Create processes and execute programs
Write secure programs
Write multithreaded programs using POSIX threads
Build and use shared libraries
Perform interprocess communication using pipes, message queues, shared memory, and semaphores
Write network applications with the sockets API

While The Linux Programming Interface covers a wealth of Linux-specific features, including epoll, inotify, and the /proc file system, its emphasis on UNIX standards (POSIX.1-2001/SUSv3 and POSIX.1-2008/SUSv4) makes it equally valuable to programmers working on other UNIX platforms.

The Linux Programming Interface is the most comprehensive single-volume work on the Linux and UNIX programming interface, and a book that's destined to become a new classic.

ABOUT THE AUTHOR
Michael Kerrisk (http://man7.org/) has been using and programming UNIX systems for many years, and has taught many week-long courses on UNIX system programming. He has maintained the man-pages project, which produces the manual pages describing the Linux kernel and glibc programming APIs. He has written or cowritten many of the manual pages and is actively involved in the testing and design review of new Linux kernel-userspace interfaces. Michael lives with his family in Munich, Germany.
The definitive guide to Linux and UNIX system programming
Covers current UNIX standards (POSIX.1-2001/SUSv3 and POSIX.1-2008/SUSv4)
ISBN: 978-1-59327-220-3

UPDATES
See http://www.nostarch.com/linuxprogramming.htm for updates, errata, and other information.

The Linux Programming Interface is set in New Baskerville, TheSansMono Condensed, Futura, and Dogma. The book was printed and bound by RRDonnelley in Crawfordsville, Indiana.
INDEX
WIFSTOPPED()
, 546
example of use
, 547
wildcard address (IP), 1187
Wilhelm, S., 1442
Williams (2002), 20, 1445
Williams, S., 1445
winsize structure, 1319, 1385, 1394–1395
definition
, 1319
example of use
, 1320, 1386, 1392
wireshark
command, 1277
WNOHANG
constant, 544, 551
example of use
, 557
WNOWAIT
constant, 551
Woodhull, A.S., 1444
working directory, current, 29, 225,
363–365, 604, 613
Wright (1995), 1235, 1272, 1445
Wright, E.A., xl
Wright, G.R., 1445
write permission, 29, 282, 294, 297
write()
, 70, 80, 286, 395, 426, 673,
800, 1138
example of use
, 71, 85
FIFO, 918
interrupted by signal handler, 443
pipe, 918
prototype
, 80
RLIMIT_FSIZE
resource limit and, 761
terminal output
by background job, 394
by orphaned process group, 730
write_bytes.c
, 236, 242, 250
writen()
, 1254
code of implementation
, 1255
prototype
, 1254
writev()
, 99–100, 102, 286, 673
interrupted by signal handler, 443
prototype
, 99
Wronski, M., xxxix
WSTOPPED
constant, 550
WSTOPSIG()
, 546
example of use
, 547
WTERMSIG()
, 546
example of use
, 547
wtmp
file, 817
example of use
, 828
updating, 825
WTMP_FILE
constant, 818
WUNTRACED
constant, 544, 545, 552
example of use
, 549
X_OK
constant, 299
XATTR_CREATE
constant, 315
XATTR_REPLACE
constant, 315
xattr_view.c
, 317
XBD, 14
XCASE
constant, 1303
XCU, 14
XCURSES, 14
XDR (External Data Representation), 1200
Xen, 789
XENIX, 5
XFS
file system, 261
i-node flag, 304–308
Xie, Q., 1444
xinetd
daemon, 1248
VEOL
constant, 1296, 1309
VEOL2
constant, 1296
VERASE
constant, 1296
version script (ld), 868–872
vfork()
, 16, 523–525, 530, 609
example of use
, 524
prototype
, 523
RLIMIT_NPROC
resource limit and, 763
scheduling of parent and child
after, 523
speed, 610
vfork_fd_test.c
, 1430
VFS (virtual file system), 259
diagram
, 259
vhangup()
, 801
Viega (2002), 795, 1445
Viega, J., 1445
view_lastlog.c
, 831
view_symlink.c
, 369
VINTR
constant, 1296
Viro (2006), 267, 1445
Viro, A., 1445
virtual address space, 120
diagram
, 120
virtual device, 252
virtual file switch, 259
virtual file system. See VFS
diagram
, 259
virtual memory
resource limit on, 760
unified, 1032
virtual memory management, 22,
118–121, 1440
virtual server, 789
virtual time, 206
virtualization, 608, 789
VKILL
constant, 1296
VLNEXT
constant, 1296
VMIN
constant, 1307, 1309
example of use
, 1311
vmsplice()
, 1262
VQUIT
constant, 1296
VREPRINT
constant, 1296
VSTART
constant, 1296
VSTOP
constant, 1296
VSUSP
constant, 1296
vsyslog()
, 777
VT0
constant, 1302
VT1
constant, 1302
VTDLY
constant, 1302
VTIME
constant, 1307, 1309
example of use
, 1311
VWERASE
constant, 1296
W_OK
constant, 299
Wagner, D., 1438, 1444
wait status, 545–547, 580
wait()
, 32, 426, 514, 541–542, 673, 690
diagram
, 515
example of use
, 543, 901
interrupted by signal handler, 443
prototype
, 542
wait3()
, 552–553, 609, 754
interrupted by signal handler, 443
prototype
, 552
wait4()
, 552–553, 609, 754
interrupted by signal handler, 443
prototype
, 552
waitid()
, 550–552, 610, 673
interrupted by signal handler, 443
prototype
, 550
waitpid()
, 426, 544–545, 609, 673
example of use
, 549, 583, 587, 602
interrupted by signal handler, 443
prototype
, 544
wall clock time, 185
wall command, 169
Wallach, D.S., 1438
watch descriptor (
inotify
), 376, 377
Watson (2000), 798, 1445
Watson, R.N.M., 1445
WCONTINUED
constant, 544, 545, 550
WCOREDUMP()
, 546
example of use
, 547
wcrtomb()
, 656
wcsrtombs()
, 656
wcstombs()
, 657
wctomb()
, 657
weakly specified (in standard
description), 15
Weinberger, P.J., 1437
well-known address, 909
WERASE terminal special character,
1296, 1299, 1305, 1307
WEXITED
constant, 550
WEXITSTATUS()
, 546
example of use
, 547
Wheeler, D., 795, 857
who command, 817
WIFCONTINUED()
, 546
example of use
, 547
WIFEXITED()
, 546
example of use
, 547
WIFSIGNALED()
, 546
example of use
, 547
UNIX 95, 13, 17
UNIX 98, 13, 17
UNIX International, 13
UNIX System Laboratories, 8
tmpfile()
, 109, 346
prototype
, 109
tmpfs file system, 274–275, 1009,
1090, 1108
tmpnam()
, 109, 656
tms structure, 206–207
definition
, 206
thread group, 225, 604, 610
diagram
, 605
thread group ID, 604
thread group leader, 605
diagram
, 605
thread ID (kernel), 605
thread ID (Pthreads), 623, 624
comparing IDs, 624
thread of execution, 422
thread_cancel.c
, 674
thread_cleanup.c
, 678
thread_incr.c
, 632
thread_incr_mutex.c
, 636
thread_incr_psem.c
, 1101
thread_multijoin.c
, 649
thread-local storage, 668–669
thread-safe function, 655
thread-specific data, 659–668
implementation, 662–663
three-way handshake, TCP, 1270
diagram
, 1272
TID (thread ID, kernel), 605
Tilk, K., xl
time command, 206
time slice, 733
TIME terminal setting, 1307
time()
, 187, 426
diagram
, 188
example of use
, 192
prototype
, 187
time_t data type, 65, 186, 187, 188, 189,
190, 280, 283, 287, 290, 471, 480,
488, 498, 747, 830, 948, 972,
1012, 1333
converting to and from broken-down
time, 189–190
converting to printable form, 188–189
TIME_WAIT state (TCP), 1269,
1274–1275
assassination, 1275
timed_read.c
, 486
timeout on blocking system call, 486–487
timer
high-resolution, 485
POSIX. See POSIX timer
profiling, 392, 480
real, 390, 480
virtual, 395, 480
timer overrun, 495, 503–504, 505
TIMER_ABSTIME
constant, 494, 498
timer_create()
, 495–497
example of use
, 501, 507
prototype
, 495
TCP/IP, 1179–1195, 1438, 1440, 1441,
1443, 1444, 1445
TCSADRAIN
constant, 1293
TCSAFLUSH
constant, 1293
example of use
, 1301, 1311, 1313,
1314, 1315
TCSANOW
constant, 1293
example of use
, 1306, 1387
tc
, 426, 718, 727, 1293,
13161318
prototype
, 1318
file descriptor, 1321
output queue, 1291
flushing, 1318
poll() on, 1342
resuming output, 1296, 1319
t_execlp.c
, 570
t_execve.c
, 566
t_flock.c
, 1121
t_fork.c
, 517
t_fpathconf.c
, 218
t_ftok.c
, 1433
t_gethostbyname.c
, 1233
t_getopt.c
, 1408
t_getservbyname.c
, 1235
t_kill.c
, 405
t_mmap.c
, 1028
t_mount.c
, 268
t_mprotect.c
, 1046
t_nanosleep.c
, 490
t_readv.c
, 101
restarting, 442–445
strtok_r()
, 658
strxfrm()
, 202
stty
command, 1294–1295
su command, 169
subnet, 1179
subnet broadcast address, 1187
subnet mask, 1187
diagram
, 1187
stack_t
data type, 65, 434, 435
example of use
, 436
Stallman, R.M., 5, 6, 11, 20, 1445
standard error, 30
standard input, 30
standard output, 30
START terminal special character, 1296,
1298, 1319
stat
structure, 279, 280–283
definition
, 280
example of use
, 284
stat()
, 106, 279–283, 325, 345, 426,
907, 1428
example of use
, 285, 303
prototype
, 279
stat64
structure, 105
stat64()
, 105
statfs()
, 277, 345
static
(used to control symbol
visibility), 867
static library, 35, 834–836
use in preference to a shared library, 856
static linking, 840
statically allocated variable, 116
function reentrancy and, 423
STATUS terminal special character, 1299
statvfs
structure, 276–277
definition
, 276
statvfs()
, 276–277, 345
prototype
, 276
stderr
variable, 30, 70
STDERR_FILENO
constant, 70
stdin
variable, 30, 70
STDIN_FILENO
constant, 70
stdio
buffers, 237–239
diagram
, 244
fork() and, 537–538
stdio
library, 30
mixing use with I/O system calls, 248
stdout
variable, 30, 70
STDOUT_FILENO
constant, 70
Steele, G.L., 1440
Stevens (1992), 1322, 1421, 1443, 1444
Stevens (1994), 1190, 1210, 1235, 1256,
1267, 1268, 1272, 1443
Stevens (1996), 1282, 1444
Stevens (1998), 1443
Stevens (1999), 20, 975, 1087, 1105, 1108,
1143, 1146, 1421, 1443
Stevens (2004), 1151, 1162, 1184, 1188,
1203, 1210, 1213, 1246, 1254,
1270, 1272, 1275, 1278, 1279,
1282, 1283, 1285, 1286, 1328,
1330, 1374, 1421, 1444
Stevens (2005), 20, 30, 222, 487, 527,
561, 731, 821, 1118, 1146, 1383,
1421, 1444
Stevens, D.L., 1438
Stevens, W.R., xl, 1194, 1421, 1443,
1444, 1445
Stewart (2001), 1286, 1444
Stewart, R.R., 1444
sticky permission bit, 294, 295, 300, 800
acting as restricted
sockaddr_un
structure, 1151, 1165–1166
definition
, 1165
example of use
, 1168, 1176
sockatmark()
, 426
SIGSTKFLT
signal, 393, 396
SIGSTKSZ
constant, 435
example of use
, 437
SIGSTOP
signal, 393, 396, 411, 445, 450,
716, 717, 790
diagram
, 717
disposition can't be changed, 450
sigsuspend()
, 426, 465, 673
example of use
, 467
prototype
, 465
SIGSYS
signal, 393, 396
SIGTERM
signal, 393, 396, 772
sigtimedwait()
, 471, 673
interrupted by stop signal, 445
prototype
, 471
SIGTRAP
signal, 394, 396, 442
SIGTSTP
signal, 394, 396, 445, 450, 451,
700, 715, 717, 720, 725, 790,
1296, 1299, 1312
diagram
, 717
example of use
, 724, 1313, 1315
handling within applications, 722
orphaned process group and, 730
SIGTTIN
signal, 394, 396, 445, 450, 451,
717, 718, 725
diagram
, 717
orphaned process group and, 730
SIGTTOU
signal, 394, 396, 445, 450, 451,
717, 718, 725, 1293, 1303
diagram
, 717
orphaned process group and, 730
SIGUNUSED
SIGURG
signal, 394, 396, 397, 1283
SIGUSR1
signal, 394, 396
used by LinuxThreads, 690
SIGUSR2
signal, 395, 396
used by LinuxThreads, 690
sigval
union, 459, 496, 1078
sigval_t
data type, 459
sigvec
structure, 476
definition
, 476
sigvec()
, 476
prototype
, 476
SIGVTALRM
signal, 395, 396, 480
sigwait()
, 685–686, 673
prototype
, 685
sigwaitinfo()
, 468, 673
example of use
, 470
interrupted by stop signal, 445
prototype
, 468
SIGWINCH
signal, 395, 396, 1319, 1320, 1395
example of use
, 1320
SIGXCPU
signal, 395, 396, 746, 761, 764
SIGXFSZ
signal, 395, 396, 761
simple_pipe.c
, 896
simple_system.c
, 582
simple_thread.c
, 626
single directory hierarchy,
diagram
, 27
Single UNIX Specification (SUS), 13
version 2 (SUSv2), 13, 17
version 3 (SUSv3), 13–15, 17, 1440
Technical Corrigenda, 14
version 4 (SUSv4), 15–17
SIOCGPGRP
constant, 1350
SIOCSPGRP
constant, 1350
size
command, 116
size_t
data type, 65, 66, 79, 80, 98, 99, 141,
148, 149, 150, 179, 193, 237, 238,
314, 315, 316, 350, 363, 435, 749,
750, 941, 943, 998, 1012, 1020,
1023, 1031, 1037, 1041, 1046,
1049, 1051, 1054, 1073, 1075,
1077, 1161, 1200, 1206, 1214,
1218, 1254, 1259, 1261
sleep()
, 426, 487–488, 673
interrupted by signal handler, 444
prototype
, 488
sleeping, 487–494
high-resolution, 488–491, 493–494
sliding window (TCP), 1192
slow-start algorithm (TCP), 1193, 1194
Smith, M., xli
Snader (2000), 1235, 1275, 1443
Snader, J.C., xl, 1443
SO_RCVBUF
constant, 1192
SO_REUSEADDR
constant, 1220, 1279–1281
example of use
, 1222, 1229, 1281
SO_SNDBUF
constant, 1171
SO_TYPE
constant, 1279
SOCK_CLOEXEC
constant, 1153, 1158, 1175
SOCK_DGRAM
constant, 1152
example of use
, 1172, 1208
SOCK_NONBLOCK
constant, 1153, 1158, 1175
SOCK_RAW
constant, 1153, 1184
SOCK_SEQPACKET
constant, 1285
SOCK_STREAM
constant, 1151
example of use
, 1168, 1169, 1173, 1209,
1221, 1224
sockaddr
structure, 1153, 1154–1155, 1157,
1158, 1161
definition
, 1154
sockaddr_in
structure, 1151, 1202
definition
, 1202
sockaddr_in6
structure, 1151, 1202–1203
definition
, 1203
example of use
, 1208, 1209
sockaddr_storage
structure, 1204
definition
, 1204
example of use
, 1221, 1241
signal, continued
list of all signals, 390–397
mask, 38, 388, 410, 578, 613, 683
pending, 38, 388, 389, 411–415, 578,
613, 683
permission required for sending,
402–403, 800
diagram
, 403
queuing, 412–414, 422, 456, 457
reading via a file descriptor, 471–474
realtime. See realtime signal
reliable, 390, 455
semantics in multithreaded process,
682–683
sending, 401–405
synchronous generation, 453
System V API, 475–476
timing and order of delivery,
453–454, 464
unreliable, 454
used for IPC, 474
used for synchronization, 527–529
waiting for, 418, 464–471
signal handler, 38, 389, 398–401, 421–446
design, 422–428
diagram
, 399, 454
printf() in example programs, 427
invocation in multithreaded process, 683
terminating process from, 549–550
use of errno within, 427
use of global variables within, 428
use of nonlocal goto within, 429–433
signal mask, 38, 388, 410, 578, 613, 683
example of use
, 728
sent to foreground process group
when controlling process
terminates, 707, 712–714
containing stopped processes,
533, 727
sigdelset()
, 407, 426
example of use
, 463
prototype
, 407
shm_unlink()
, 1058, 1114
example of use
, 1114
prototype
, 1114
SHM_UNLOCK
constant, 800, 1012
SHM_W
constant, 923
SHMALL
limit, 1014, 1015
shmat()
, 922, 999–1001, 1013, 1014
example of use
, 1004, 1005
prototype
, 999
RLIMIT_AS
resource limit and, 760
shmatt_t
data type, 65, 1012, 1014
shmctl()
, 922, 1011–1012
example of use
, 1004
prototype
, 1011
RLIMIT_MEMLOCK
resource limit and, 761
shmdt()
, 922, 1000–1001, 1013, 1014
example of use
, 1004, 1005
prototype
, 1001
setreuid()
, 175–176, 181, 786, 801
prototype
, 175
setrlimit()
, 755–757, 801
example of use
, 759
prototype
, 756
setrlimit64()
, 105
setsid()
, 426, 691, 693, 705, 768, 1377
example of use
, 706, 770, 1387
prototype
, 705
setsockopt()
, 426, 1278–1279
example of use
, 1222
prototype
, 1278
setspent()
, 161
prototype
, 161
settimeofday()
, 204–205, 801
diagram
, 188
prototype
, 204
semzcnt
value, 972, 974, 985
send()
, 426, 673, 1259–1260
interrupted by signal handler, 444
prototype
, 1259
sendfile()
, 286, 1260–1263
diagram
, 1261
prototype
, 1261
sendfile.c
, 1435
sending TCP, 1191
sendip
command, 1184
sendmmsg()
, 1284
sendmsg()
, 426, 673, 1284
interrupted by signal handler, 444
sendto()
, 426, 673, 1160–1161
diagram
, 1160
example of use
, 1172, 1173, 1208,
1209, 1241
interrupted by signal handler, 444
prototype
, 1161
servent
structure, 1234
definition
, 1234
server, 40
affinity, 1247
design, 1239–1252
farm, 1247
load balancing, 1247
pool, 1246
service name, 1204, 1212
session, 39, 700, 704706
diagram
, 701
leader, 39, 700, 705
session ID, 39, 613, 700, 705, 819
set_mempolicy()
, 615
set_thread_area()
, 607, 692
SETALL
constant, 971, 972, 973, 987
example of use
, 975
setbuf()
, 238, 532
prototype
, 238
setbuffer()
, 238
prototype
, 238
setcontext()
, 442
S_ISFIFO()
, 282
S_ISGID
constant, 295, 351
S_ISLNK()
, 282
S_ISREG()
, 282
S_ISSOCK()
, 282
S_ISUID
constant, 295, 351
S_ISVTX
constant, 295, 300, 351
S_IWGRP
constant, 295
S_IWOTH
constant, 295
S_IWUSR
constant, 295
S_IXGRP
constant, 295
S_IXOTH
constant, 295
S_IXUSR
constant, 295
sa command, 591
sa_family_t
data type, 65, 1154, 1165,
1202, 1203, 1204
SA_NOCLDSTOP
constant, 417
SA_NOCLDWAIT
constant, 417, 560
SA_NODEFER
constant, 417, 427, 455
example of use
, 455
SA_NOMASK
constant, 417
SA_ONESHOT
constant, 417
SA_ONSTACK
constant, 417, 578
example of use
, 437
SA_RESETHAND
constant, 417, 454
example of use
, 455
SA_RESTART
constant, 417, 443, 486, 941, 944
example of use
, 455, 486
SA_SIGINFO
constant, 417, 437–442, 458,
1352, 1353
example of use
, 463, 501
Salus (1994), 3, 20, 1443
Salus (2008), 20, 1443
Salus, P.H., 1443
Salzman, P.J., 1442
Santos, J., 1441
Sarolahti (2002), 1236, 1443
Sarolahti, P., 1443
Sastry, N., 1438
RFC 4219, 1188
RFC 4291, 1203
RFC 4336, 1286
RFC 4340, 1286
RFC 4960, 1286
RFC Editor, 1193
Richarte, G., 1437
Ritchie (1974), 3, 1443
Ritchie (1984), 20, 1442
Ritchie, D.M., 2, 4, 1440, 1442, 1443
RLIM_INFINITY
constant, 736, 756
RLIM_SAVED_CUR
constant, 759
RLIM_SAVED_MAX
constant, 759
rlim_t data type, 65, 756, 759–760
casting in
printf()
calls, 757
rlimit structure, 756
definition
, 756
example of use
, 758
RLIMIT_AS
resource limit, 757, 760, 1039
RLIMIT_CORE
resource limit, 448, 757,
760, 789
RLIMIT_CPU
resource limit, 395, 746, 757, 761
RLIMIT_DATA
resource limit, 140, 757, 761
RLIMIT_FSIZE
resource limit, 80, 395, 448,
757, 760, 761
RLIMIT_MEMLOCK
resource limit, 757, 761,
1012, 1048–1049, 1051
RLIMIT_MSGQUEUE
resource limit, 757,
761, 1086
RLIMIT_NICE
resource limit, 736, 757, 762
RLIMIT_NOFILE
resource limit, 78, 217,
757, 762
RLIMIT_NPROC
resource limit, 217, 516, 757,
763, 801
example of use
, 759
rlimit_nproc.c
, 758
RLIMIT_RSS
resource limit, 757, 763
RLIMIT_RTPRIO
resource limit, 743, 757, 764
RLIMIT_RTTIME
resource limit, 746, 757, 764
RLIMIT_SIGPENDING
resource limit, 458,
757, 764
RLIMIT_STACK
resource limit, 124, 217, 434,
436, 682, 757, 764, 793, 1006
rmdir()
, 286, 300, 345, 351, 426, 800
prototype
, 351
Robbins (2003), 630, 1327, 1443
Robbins, K.A., 1443
Robbins, S., 1443
Robins, A.V., xxxix, xl
Rochkind (1985), xxxv, 1421, 1443
Rochkind (2004), xxxv, 837, 1421, 1443
Rochkind, M.J., 1443
Romanow, J., 1194
root directory, 27, 340
of a process, 225, 367–368, 604, 613
root name server, 1211
root user, 26
Rosen (2005), 6, 1443
Rosen, L., 1443
Rothwell, S., xxxix
round-robin time-sharing, 733
router, 1180
RST control bit (TCP), 1267
rt_tgsigqueueinfo()
, 685
RTLD_DEEPBIND
constant, 862
RTLD_DEFAULT
constant, 864
RTLD_GLOBAL
constant, 861, 862, 864
RTLD_LAZY
constant, 861
RTLD_LOCAL
constant, 861
RTLD_NEXT
constant, 864
RTLD_NODELETE
constant, 861
RTLD_NOLOAD
constant, 862
RTLD_NOW
constant, 861
realtime scheduling,
continued
priority, 614, 738, 740
changing, 741–744
relinquishing CPU, 747
resource limit for CPU time, 764
resource limit for priority, 764
round-robin policy (SCHED_RR)
round-robin time slice, 747
realtime signal, 214, 221, 388, 456–463
handling, 460–463
limit on number queued, 457, 764
sending, 458–460
used by LinuxThreads, 690
used by NPTL, 693
reboot()
, 801
receiving TCP, 1191
record lock. See file lock
recursive bind mount, 273–274
recursive resolution, DNS, 1211
recv()
, 426, 673, 1259–1260
interrupted by signal handler, 444
prototype
, 1259
recvfrom()
, 426, 673, 1160–1161
diagram
, 1160
example of use
, 1172, 1173, 1208,
1209, 1241
interrupted by signal handler, 444
prototype
, 1161
recvmmsg()
, 1284
recvmsg()
, 426, 673, 1284
interrupted by signal handler, 444
reentrancy, 556
reentrant function,
422425, 622, 657
region_locking.c
, 1134
regionIsLocked()
, 1134
code of implementation
, 1134–1135
regular file, 27, 282
poll() on, 1342
select() on, 1342
Reiserfs file system, 260
i-node flag, 304–308
tail packing, 260, 307
relative pathname, 29, 363
releaseSem()
, 989–991
code of implementation
, 991
example of use
, 1004, 1005
reliable signal, 390
relocation (of symbols), 837
remap_file_pages()
, 1041–1043
prototype
, 1041
remove()
, 286, 345, 352
prototype
, 352
removexattr()
, 286, 316, 345
prototype
, 316
rename()
, 286, 300, 345, 348–349, 426, 800
prototype
, 348
renameat()
, 365, 426
renice command, 735
REPRINT terminal special character,
1296, 1298, 1305, 1307
Request for Comments (RFC), 1193–1194.
See also individual RFC entries
reserved blocks (file system), 277, 801
reserved port, 1189
reserveSem()
, 989–990
code of implementation
, 990
example of use
, 1004, 1005
ptyFork()
, 1385–1386
code of implementation
, 1386–1388
example of use
, 1392
prototype
, 1385
ptyMasterOpen()
, 1383, 1396
code of implementation
, 1384, 1396–1397
example of use
, 1387
prototype
, 1383
putc_unlocked()
, 657
putchar_unlocked()
, 657
putenv()
, 128, 130, 657
example of use
, 131
prototype
, 128
putmsg()
, 673
putpmsg()
, 673
pututxline()
, 657, 826
example of use
, 829
prototype
, 826
pwrite()
, 98–99, 286, 673
prototype
, 98
pwritev()
, 102, 286
prototype
, 102
quantum, 733
Quartermann (1993), 20, 1442
Quartermann, J.S., 1442
quit
character, 1296, 1298
QUIT terminal special character, 1296,
1298, 1303, 1305
quot
, 345, 801
R_OK
constant, 299
race condition, 90–92, 465, 525–527, 897,
975, 1118, 1368
time-of-check, time-of-use, 790
Rago, S.A., 1421, 1444
raise()
, 404, 426, 441, 458
example of use
, 720, 724
prototype
, 404
Ramakrishnan, K., 1194
Ramey, C., 25
rand()
, 657
rand_r()
, 658
Randow, D., xl
raw I/O, 246–248
raw mode (terminal I/O), 1309–1316
pthread_cond_destroy()
, 652
prototype
, 652
pthread_cond_init()
, 651–652
prototype
, 651
PTHREAD_COND_INITIALIZER
constant, 643
pthread_cond_signal()
, 643–644
example of use
, 645, 650
prototype
, 644
pthread_cond_t
data type, 620, 643, 644,
645, 651, 652
pthread_cond_timedwait()
, 644–645, 673
interrupted by signal handler, 444
prototype
, 645
pthread_cond_wait()
, 643–644, 673, 683
example of use
, 647, 651
interrupted by signal handler, 444
prototype
, 644
pthread_condattr_t
data type, 620, 651
pthread_create()
, 622–623
example of use
, 627, 628, 650, 675, 679
prototype
, 622
pthread_detach()
, 627–628
example of use
, 627
prototype
, 627
pthread_equal()
, 624–625, 1431
prototype
, 624
pthread_exit()
, 623–624
prototype
, 623
changing capabilities of all
processes in, 815
changing membership of, 702
creating, 702
diagram
, 701
foreground. See foreground process
group
leader, 39, 699, 702, 705
POSIX_MADV_SEQUENTIAL
constant, 1055
POSIX_MADV_WILLNEED
constant, 1055
posix_madvise()
, 1055
posix_memalign()
, 149–150
example of use
, 150
prototype
, 149
posix_openpt()
, 1380–1381
example of use
, 1384
prototype
, 1380
posix_spawn()
, 514
posix_trace_event()
, 426
POSIXLY_CORRECT
environment variable, 1410
Postel, J., 1193, 1194
Potthoff, K.J., xxxix
PPID (parent process ID), 32, 114–115,
608, 613
ppoll()
, 1370
interrupted by signal handler, 444
PR_CAPBSET_DROP
constant, 806
PR_CAPBSET_READ
constant, 806
POSIX 1003.1-2001, 13
POSIX asynchronous I/O, 613, 1327, 1347
POSIX clock, 491–493
obtaining clock ID for process or
thread, 493
personality()
, 1334
PGID (process group ID), 39, 613,
699, 705
Phillips, M., xxxix
physical block, 253
PID (process ID), 32, 114, 604, 608,
613, 705
pid_t
data type, 65, 114, 115, 402, 405,
438, 458, 493, 496, 516, 523, 542,
544, 552, 599, 605, 699, 700, 701,
702, 704, 705, 708, 741, 742, 744,
747, 749, 750, 819, 948, 1012,
1125, 1354, 1385
Piggin, N., xxxix
pipe, 3, 214, 282, 392, 882, 883, 886,
889–906
atomicity of
write()
, 891
bidirectional, 890
capacity, 891
closing unused file descriptors, 894
connecting filters with, 899–902
creating, 892
diagram
, 879, 890
poll() on, 1342
read() semantics, 917–918
select() on, 1342
to a shell command, 902–906
stdio
buffering and, 906
used for process synchronization,
897–899
write()
semantics, 918
pipe()
, 286, 426, 801, 892, 1175
diagram
, 892
example of use
, 896, 898, 900
prototype
, 892
RLIMIT_NOFILE
resource limit and, 762
PIPE_BUF
constant, 214, 891, 918,
1343, 1351
pipe_ls_wc.c
, 900
pipe_sync.c
, 897
pipe2()
, 894
pivot_root()
, 345, 801
Plauger (1992), 30, 1442
PMMU (paged memory management
unit), 120
pmsg_create.c
, 1069
pmsg_getattr.c
, 1071
pmsg_receive.c
, 1076
pmsg_send.c
, 1074
pmsg_unlink.c
, 1066
Podolsky, M., 1194
poll, 1326
poll()
, 426, 673, 1337–1339, 1389, 1439
comparison with
select()
, 1344–1345
example of use
, 1341
interrupted by signal handler, 444
interrupted by stop signal, 445
performance, 1365
problems with, 1346
prototype
, 1337
POLL_ERR
constant, 441, 1353
POLL_HUP
constant, 441, 1343, 1353
POLL_IN
constant, 440, 441, 1353
POLL_MSG
constant, 440, 441, 1353
POLL_OUT
constant, 440, 441, 1353
poll_pipes.c
, 1340
POLL_PRI
constant, 441, 1353
Pollard, J., xl
POLLERR
constant, 1337, 1338, 1342,
1343, 1353
pollfd
structure, 1337–1338
definition
, 1337
POLLHUP
constant, 1337, 1338, 1342,
1343, 1353
POLLIN
constant, 1337, 1338, 1342,
1343, 1353
POLLMSG
constant, 1337, 1338, 1353
POLLNVAL
constant, 1337, 1338, 1339
Pollock, W., xli
POLLOUT
constant, 1337, 1338, 1342,
1343, 1353
POLLPRI
constant, 1337, 1338, 1343,
1353, 1389
POLLRDBAND
constant, 1337, 1338
POLLRDHUP
constant, 1337, 1338, 1339, 1343
POLLRDNORM
constant, 1337, 1338, 1353
POLLWRBAND
constant, 1337, 1338, 1353
POLLWRNORM
constant, 1337, 1338, 1353
popen()
, 902–903, 919
avoid in privileged programs, 788
diagram
, 902
example of use
, 905
prototype
, 902
popen_glob.c
, 904
port number, 64, 1188–1189
ephemeral, 1189, 1224, 1263
privileged, 800, 1189
registered, 1189
well-known, 1189
portability, xxxiv, 10, 61–68, 211, 1420
source code vs. binary, 19
Portable Application Standards
Committee (PASC), 11
open64()
, 105
openat()
, 15, 365–366, 426, 674
prototype
, 365
OpenBSD, 7
opendir()
, 345, 352, 355
example of use
, 356
prototype
, 352
openlog()
, 777–779
example of use
, 780
prototype
, 777
openpty()
, 1386
operating system, 21, 1438, 1444
oplock (opportunistic lock), 1142
OPOST
constant, 1296, 1298, 1302, 1305
example of use
, 1311
opportunistic lock, 1142
optarg
variable, 1406
opterr
variable, 1406
optind
variable, 1406
optopt
variable, 1406
O'Reilly, T., 1444
ORIGIN
(in
rpath
list), 853
Orlov block-allocation strategy, 307, 1438
orphan.c
, 1430
orphaned process, 553
orphaned process group, 533, 725–730
diagram
, 726
and, 730
and, 730
orphaned_pgrp_SIGHUP.c
, 728
OSDL (Open Source Development
Laboratory), 18
OSF (Open Software
Foundation), 13
OSF/1, 4
ouch.c
, 399
out-of-band data, 394, 1259, 1260, 1283,
1288, 1331, 1343
P_ALL
constant, 550
P_PGID
constant, 550
P_PID
constant, 550
nonreentrant.c
, 424
Nordmark, E., 1194
NPTL (Native POSIX Threads Library),
457, 592, 600, 603, 606, 607, 609,
668, 682, 687, 688, 689, 692694,
696, 987
Pthreads nonconformances, 693
NSIG
constant, 408
ntohl()
, 1199
prototype
, 1199
ntohs()
, 1199
prototype
, 1199
mtrace
, 146
MTU (Maximum Transmission Unit),
1182, 1185
Mui, L., 1444
multi_descriptors.c
, 1426
multi_SIGCHLD.c
, 557
multi_wait.c
, 543
MULTICS, 2
multihomed host, 1180, 1187, 1220
multiplexed I/O, 1327, 1330–1346
munlock()
, 1049–1050
prototype
, 1049
munlockall()
, 1050–1051
prototype
, 1050
munmap()
, 1022, 1023–1024, 1058
example of use
, 1036
prototype
, 1023
muntrace()
, 146
mutex, 614, 631–642, 881
attributes, 640
deadlock, 639
diagram
, 639
initializing, 639–640
locking, 636, 637–638
performance, 638
statically allocated, 635
type, 640–642
unlocking, 636
used with a condition variable, 646
mutual exclusion, 634
N_TTY
constant, 1292, 1294
NAME_MAX
constant, 214, 215
named daemon, 1210
named semaphore. See POSIX semaphore, named
nanosleep()
, 488–489, 673
example of use
, 490
interrupted by signal handler, 444
interrupted by stop signal, 445
prototype
, 488
Native POSIX Thread Library (NPTL).
See NPTL
NCCS
constant, 1292
necho.c
, 123
mq_open()
, 1058, 1064–1065
example of use
, 1070
prototype
, 1064
RLIMIT_MSGQUEUE
resource limit and, 762
MQ_OPEN_MAX
constant, 1085
MQ_PRIO_MAX
constant, 1073, 1085
mq_receive()
, 673, 1058, 1064, 1074–1075
example of use
, 1077
interrupted by signal handler, 444
prototype
, 1075
mq_send()
, 673, 1058, 1064, 1073
example of use
, 1074
interrupted by signal handler, 444
prototype
, 1073
memory residence, 1051–1054
memory usage (advising patterns of),
1054–1055
memory-based semaphore.
See semaphore, unnamed
memory-mapped file.
memory-mapped I/O, 1019, 1026–1027
message queue descriptor.
See message queue, descriptor
message queue.
See POSIX message queue; System V message queue
MADV_SOFT_OFFLINE
constant, 1055
MADV_UNMERGEABLE
constant, 1055
MADV_WILLNEED
constant, 764, 1055
madvise()
, 1054–1055
prototype
, 1054
RLIMIT_RSS
resource limit and, 764
madvise_dontneed.c
, 1434
magic SysRq key, 1300
Mahdavi, J., 1194
main thread, 622
major()
, 281
example of use
, 284
make program, 1442
make_zombie.c
, 554, 562
makecontext()
, 442
mallinfo()
, 147
malloc
debugging library, 147
malloc()
, 140–142, 423, 1035
debugging, 146–147
example of use
, 143
implementation, 144–146
diagram
, 145
prototype
, 141
MALLOC_CHECK_
environment variable, 146
MALLOC_TRACE
environment variable, 146
mallopt()
, 147, 1035
mandatory file lock, 265, 293, 1119, 1138
Mane-Wheoki, J., xl
Mann (2003), 1250, 1442
Mann, S., 1442
manual pages, 1419–1421
MAP_ANON
constant, 1034
MAP_ANONYMOUS
constant, 1033, 1034
example of use
, 1036
MAP_FAILED
constant, 1020, 1037
MAP_FIXED
constant, 1033, 1040–1041, 1049
localization, 200
localtime()
, 189, 198, 657
diagram
, 188
example of use
, 192, 195, 199
prototype
, 189
localtime_r()
, 189, 658
lock (file). See file lock
LOCK_EX
constant, 1120
example of use
, 1121
LOCK_NB
constant, 1119, 1120
example of use
, 1121
LOCK_SH
constant, 1120
example of use
, 1121
LOCK_UN
constant, 1120
lockf()
, 673, 1127
interrupted by signal handler, 444
lockRegion()
, 1133
code of implementation
, 1134
lockRegionWait()
, 1133
code of implementation
, 1134
LOG_ALERT
constant, 779
LOG_AUTH
constant, 778, 779
LOG_AUTHPRIV
constant, 778, 779
LOG_CONS
constant, 777
LOG_CRIT
constant, 779
LOG_CRON
constant, 779
LOG_DAEMON
constant, 779
LOG_DEBUG
constant, 779
LOG_EMERG
constant, 779
LOG_ERR
constant, 779
LOG_FTP
constant, 778, 779
LOG_INFO
constant, 779
LOG_KERN
constant, 779
LOG_LOCAL*
constants, 779
LOG_LPR
constant, 779
LOG_MAIL
constant, 779
LOG_MASK()
, 781
LOG_NDELAY
constant, 778
LOG_NEWS
constant, 779
LOG_NOTICE
constant, 779
LOG_NOWAIT
constant, 778
LOG_ODELAY
constant, 778
LOG_PERROR
constant, 778
LOG_PID
constant, 778
LOG_SYSLOG
constant, 778, 779
LOG_UPTO()
, 781
LOG_USER
constant, 779
LOG_UUCP
constant, 779
LOG_WARNING
constant, 779
logger
command, 780
logical block, 255
login accounting, 817–832
login name, 26, 153, 154
LC_MESSAGES
locale category, 202
LC_MONETARY
environment variable, 203
LC_MONETARY
locale category, 202
LC_NAME
locale category, 202
LC_NUMERIC
environment variable, 203
LC_NUMERIC
locale category, 202
LC_PAPER
locale category, 202
LC_TELEPHONE
locale category, 202
LC_TIME
environment variable, 203
LC_TIME
locale category, 202
lchown()
, 286, 292–293, 345
prototype
, 292
ld command, 833
LD_ASSUME_KERNEL
environment variable, 695
LD_BIND_NOW
environment variable, 861
LD_DEBUG
environment variable, 874–875
LD_DEBUG_OUTPUT
environment variable, 875
LD_LIBRARY_PATH
environment variable,
840, 853, 854
LD_PRELOAD
environment variable, 873
LD_RUN_PATH
environment variable, 851
ldconfig
command, 848–849
ldd
command, 843
lease, file, 615, 800, 1135, 1142
Leffler, S.J., 1442
LEGACY (SUSv3 specification), 15
Lemon (2001), 1328, 1441
Lemon (2002), 1185, 1441
Lemon, J., 1441
Leroy, X., 689
level-triggered notification, 1329–1330
Levine (2000), 857, 1441
Levine, J., 1441
Lewine (1991), 222, 1441
Lewine, D., 1441
Lezcano, D., 1437
LFS (Large File Summit), 104
lgamma()
, 657
lgammaf()
, 657
lgammal()
, 657
libtool program, 857
limit
C shell command, 448
Lindner, F., 1437
link, 27, 257, 339–342
creating, 344–346
diagram
, 343
removing, 346–348
link count (file), 281, 341
link editing, 840
link()
, 286, 344–346, 426, 1145
prototype
, 344
linkat()
, 365, 426
linker, 833, 1441
linking, 840
Linux
distributions, 10
hardware ports, 10
history, 510, 1443
kernel, 67
mailing list, 1423
version numbering, 89
programming-related newsgroups,1423
standards conformance, 18
Linux Documentation Project, 1422
Linux Foundation, 18
Linux Standard Base (LSB)
LinuxThreads, 457, 592, 603, 604, 609,
687, 688, 689–692, 695
Pthreads nonconformances, 690
Linux/x86-32, 5
Lions (1996), 24, 1441
Lions, J., 1441
list_files.c
, 356, 373
list_files_readdir_r.c
, 1429
LISTEN state (TCP), 1269
listen()
, 426, 1152, 1156–1157
diagram
, 1156
example of use
, 1168, 1222, 1229
prototype
, 1156
listxattr()
, 316, 345
example of use
, 318
prototype
, 316
little-endian byte order, 1198
diagram
, 1198
Liu, C., 1437
LKML (Linux kernel mailing list), 1423
llistxattr()
, 316, 345
prototype
, 316
command, 341
LNEXT terminal special character, 1296,
1298, 1305, 1307
locale, 200–204, 615
specifying to a program, 203–204
locale
command, 203
localeconv()
, 202, 657
localhost, 1187
locality of reference, 118
job_mon.c
, 719
job-control signal, 717
handling within applications, 722–725
jobs
shell command, 715
Johnson (2005), 1440
Johnson, M.K., 1440
Johnson, R., 1438
Johnson, S., 4
Jolitz, L.G., 7
Jolitz, W.F., 7
Jones, R., xxxviii
Josey (2004), 20, 222, 1440
Josey, A., xxxix, 1440
journaling file system, 260–261
Joy, W.N., 4, 25, 1442
jumbogram, 1185
Kahabka, T., xl
Kara, J., xl
Kegel, D., 1374
Kegel, K., xxxix
Kent (1987), 1186, 1440
Kent, A., 1440
configuration, 1417
source code, 1424
kernel mode, 23, 44
kernel scheduling entity (KSE), 603, 687
kernel space, 23
kernel stack, 44, 122
kernel thread, 241, 608, 768
Kernighan (1988), xxxii, 10, 11, 30, 1440
Kernighan, B.W., 1437, 1440
, 801
data type, 64, 925, 927, 938, 969, 998
KEYCTL_CHOWN
constant, 801
KEYCTL_SETPERM
constant, 801
KILL terminal special character, 1296,
1298, 1303, 1304, 1305, 1307
kill()
, 401–403, 426, 439, 441, 458, 800
example of use
, 405, 413
prototype
, 402
killable sleep state, 451
killpg()
, 405, 458
prototype
, 405
Kirkwood, M., 1439
Kleen, A., xxxviii
Kleikamp, D., xl
, 776
daemon, 776
diagram
, 775
Kopparapu (2002), 1247, 1440
Kopparapu, C., 1440
Korn shell (
Korn, D., 25
Korrel, K., xl
Kozierok (2005), 1235, 1441
Kozierok, C.M., 1441
kqueue API (BSD), 375, 1328, 1441
Kroah-Hartman (2003), 253, 1441
Kroah-Hartman, G., 1438, 1441
KSE (kernel scheduling entity), 603, 687
(Korn shell), 25
Kumar (2008), 307, 1441
Kumar, A., 1441
kernel thread, 241, 1032
IP (Internet protocol),
continued
diagram
, 1181
fragmentation, 1185, 1440
header, 1185
checksum, 1185
minimum reassembly buffer size, 1185
unreliability, 1185
IPC.
See
interprocess communication
ip
, 922
IPC_CREAT
constant, 924, 925, 932, 938,
969, 998
example of use
, 939
IPC_EXCL
constant, 924, 925, 928, 932, 938,
969, 999
example of use
, 940
IPC_INFO
constant, 936, 951, 992, 1015
IPC_NOWAIT
constant, 941, 943, 979
example of use
, 942, 946, 983
ipc_perm
structure, 927–928, 948, 972, 1012
definition
, 927
IPC_PRIVATE
constant, 925, 928
example of use
, 939, 960
IPC_RMID
constant, 801, 924, 929, 947,
971, 1011
example of use
, 948
initgroups()
, 179–180, 800
prototype
, 179
initial thread, 622
initialized data segment, 116, 117, 118,
1019, 1025
initSemAvailable()
, 989–990
code of implementation
, 990
example of use
, 1004
initSemInUse()
, 989–990
code of implementation
, 990
example of use
, 1004
INLCR
constant, 1296, 1302
example of use
, 1311
ino_t
data type, 64, 280, 353
i-node, 95, 256–259
diagram
, 95, 258, 340
i-node flag, 304–308
i-node number, 64, 256, 281, 341
i-node table, 256, 340
inotify
(file system event notification)
handler,444
interrupted by stop signal, 445
inotify
(notification of file-system events),
375–385
inotify_add_watch()
, 376, 377
example of use
, 383
prototype
, 377
inotify_event
structure, 379–381
definition
, 379
diagram
, 380
example of use
, 382
inotify_init()
, 376–377
example of use
, 383
prototype
, 376
inotify_init1()
, 377
inotify_rm_watch()
, 376, 378
prototype
, 378
INPCK
constant, 1302, 1305
example of use
, 1311
Institute of Electrical and Electronic
Engineers (IEEE), 11
int32_t
data type, 472, 593, 819
International Standards Organization
(ISO), 11
internationalization, 200
id_echo.h
, 1240
id_echo_cl.c
, 1242
id_echo_sv.c
, 1241
id_t
data type, 64, 550, 735
idshow.c
, 182
IEEE (Institute of Electrical and
Electronic Engineers), 11
GNU project, 5–6, 1422
GNU/Linux, 6
getopt()
, 657, 1405–1411
example of use
, 1409
prototype
, 1406
getopt_long()
, 1411
getpagesize()
, 215
FUSE (Filesystem in Userspace), 255, 267
fuser
command, 342
futex (fast user space mutex), 605, 607,
638, 1438, 1439
futex()
, 638, 1090
interrupted by signal, 444
interrupted by stop signal, 445
FUTEX_WAIT
constant, 444, 445
futimens()
, 15, 286, 426
futimes()
, 15, 286, 288–289, 426
prototype
, 289
, 1217–1218
prototype
, 1218
Gallmeister (1995), 222, 512, 751, 1087,
1327, 1439
Gallmeister, B.O., 1439
Gammo (2004), 1374, 1439
Gammo, L., 1439
Gancarz (2003), 1422, 1439
Gancarz, M., 1439
Garfinkel (2003), 20, 795, 1439
Garfinkel, S., 1439
gather output, 102
gcore (gdb) command, 448, 1430
, 656, 657
gdb
program, 1442
General Public License (GPL)
fork(),
continued
example of use
, 516, 517, 519, 526, 543,
554, 582, 587, 770, 900, 1387
file descriptors and, 96, 517–520
glibc
clone()
, 609
memory semantics, 520–521
prototype
, 516
RLIMIT_NPROC
resource limit and, 763
scheduling of parent and child after, 525
speed, 610
stdio
buffers and, 537–538
threads and, 686
fork_file_sharing.c
, 518
fork_sig_sync.c
, 528
fork_stdio_buf.c
, 537
fork_whos_on_first.c
, 526
format-string attack, 780
Fox, B., 25
fpathconf()
, 217–218, 425, 426
example of use
, 218
prototype
, 217
FPE_FLTDIV
constant, 441
FPE_FLTINV
constant, 441
FPE_FLTOVF
constant, 441
FPE_FLTRES
constant, 441
FPE_FLTUND
constant, 441
FPE_INTDIV
constant, 441
FPE_INTOVF
constant, 441
FPE_SUB
constant, 441
FQDN (fully qualified domain name), 1210
fragmentation of free disk space, 257
Franke (2002), 638, 1439
Franke, H., 1439
Free Software Foundation, 5
free()
, 140–142, 144, 423
example of use
, 143
implementation, 144–146
diagram
, 145
prototype
, 141
free_and_sbrk.c
, 142
freeaddrinfo()
, 1217
example of use
, 1222
prototype
, 1217
FreeBSD, 7, 1442
fremovexattr()
, 286, 316
prototype
, 316
Frisch (2002), 616, 818, 1439
Frisch, A., 1439
FS_APPEND_FL
constant, 305, 306
FS_COMPR_FL
constant, 305, 306
FS_DIRSYNC_FL
constant, 265, 305, 306, 307
FS_IMMUTABLE_FL
constant, 305, 306, 307
FS_IOC_GETFLAGS
constant, 308
FS_IOC_SETFLAGS
constant, 308
FS_JOURNAL_DATA_FL
constant, 305
FS_JOURNAL_FL
constant, 306
FS_NOATIME_FL
constant, 77, 265, 305, 306
FS_NODUMP_FL
constant, 305, 307
FS_NOTAIL_FL
constant, 305, 307
FS_SECRM_FL
constant, 305, 307
FS_SYNC_FL
constant, 305, 307
FS_TOPDIR_FL
constant, 305, 307
FS_UNRM_FL
constant, 305, 307
fsblkcnt_t
data type, 64, 276
fsck
command, 260, 263
constant, 363
FTW_SL
constant, 359, 360
FTW_SLN
constant, 360
FTW_STOP
constant, 363
fully qualified domain name (FQDN), 1210
LinuxThreads nonconformance, 691
mandatory, 265, 293, 1119, 1137–1140
priority of queued lock requests, 1137
speed, 1135–1136
starvation, 1137
file mapping, 35, 882, 886, 1017,
1024–1031
diagram
, 1025
private, 1018, 1024–1025
shared, 1019, 1025–1029
file mode creation mask (umask), 301–303,
328, 351, 604, 613, 790, 907, 923,
1060, 1065, 1091, 1110, 1174
feenableexcept()
, 391
Fellinger, P., xxxix, xl
Fenner, B., 1421, 1444
fexecve()
, 15, 426, 571
prototype
, 571
FF0
constant, 1302
FF1
constant, 1302
FFDLY
constant, 1302
fflush()
, 239, 244, 538
prototype
, 239
shell command, 715
diagram
, 717
security
, 312, 801
system
, 312, 321, 327
trusted
, 312, 316, 801
user
, 312
extended file attribute (i-node flag),
304–308
EPOLLIN
constant, 1359
EPOLLONESHOT
constant, 1359, 1360
EPOLLOUT
constant, 1359
EPOLLPRI
constant, 1359
EPOLLRDHUP
constant, 1359
ERANGE
error, 315, 363, 991
Eranian, S., 1442
ERASE terminal special character, 1296,
1297, 1303, 1304, 1305, 1307
Erickson (2008), 792, 795, 1439
Erickson, J.M., 1439
EROFS
error, 78
, 52–53
code of implementation
, 56
prototype
, 52
, 52
code of implementation
, 55
prototype
, 52
, 52–53
code of implementation
, 56
prototype
, 52
errMsg
code of implementation
, 55
prototype
, 52
variable, 45, 49, 53, 620, 780
in threaded programs, 621
use inside signal handler, 427, 556
error number, 49
error_functions.c
error_functions.h
error-diagnostic functions, 51–58
ESPIPE
error, 83
ESRCH
error, 158, 402, 403, 702
ESTABLISHED state (TCP), 1269
EIDRM
error, 933, 947, 971, 979
EINTR
error, 418, 442, 443, 486, 489, 941,
944, 979, 1095, 1334, 1339
EINVAL
error, 179, 216, 246, 247, 349, 381,
750, 762, 933, 950, 952, 969, 991,
999, 1000, 1014
EIO
error, 709, 718, 727, 730, 764, 1382,
1388, 1389, 1396
EISDIR
error, 78, 346, 349
elapsed time, 185
Electric Fence (malloc debugger), 147
ELF (Executable and Linking Format),
113, 565, 837, 1441
dmalloc (malloc debugger), 147
dnotify (directory change notification),
386, 615
DNS (Domain Name System),
1209–1212, 1437
anonymous root, 1210
domain name, 1210
iterative resolution, 1211
name server, 1210
recursive resolution, 1211
root name server, 1211
round-robin load sharing, 1247
top-level domain, 1212
Domaign, L., xxxvii
domain name, 1210
Domain Name System. See DNS
command, 230
Dring, G., xxxvii
dotted-decimal notation (IPv4 address),
1186
DragonFly BSD, 8
drand48()
, 657
Drepper (2004a), 638, 1438
Drepper (2004b), 857, 868, 1439
Drepper (2007), 748, 1439
Drepper (2009), 795, 1439
Drepper, U., 47, 689, 1438, 1439
DST, 187
DSUSP terminal special character, 1299
DT_DIR
constant, 353
DT_FIFO
constant, 353
DT_LNK
constant, 353
DT_NEEDED
tag (ELF)
DT_REG
constant, 353
DT_RPATH
tag (ELF), 853, 854
DT_RUNPATH
tag (ELF), 853, 854
DT_SONAME
tag (ELF)
dumb terminal, 714
dump_utmpx.c
Dunchak, M., xli
dup()
, 97, 426, 1425
prototype
, 97
RLIMIT_NOFILE
resource limit and, 762
dup2()
, 97, 426, 899, 900, 1426
example of use
, 771, 901
prototype
, 97
RLIMIT_NOFILE
resource limit and, 762
dup3()
prototype
, 98
Dupont, K., xxxix
dynamic linker, 36, 839
dynamic linking, 839, 840
dynamically allocated storage, 116
dynamically loaded library, 859–867
dynload.c
, 865
E2BIG
error, 565, 943, 991
EACCES
error, 77, 312, 564, 702, 928, 952,
1031, 1127
eaccess()
, 300
EADDRINUSE
error, 1166, 1279
EAGAIN
error, 57, 103, 270, 379, 460, 471,
473, 509, 761, 763, 764, 917, 918,
941, 979, 980, 1065, 1073, 1075,
1095, 1127, 1139, 1259, 1260,
1330, 1347, 1367
EAI_ADDRFAMILY
constant, 1217
EAI_AGAIN
constant, 1217
EAI_BADFLAGS
constant, 1217
EAI_FAIL
constant, 1217
EAI_FAMILY
constant, 1217
EAI_MEMORY
constant, 1217
EAI_NODATA
constant, 1217
EAI_NONAME
constant, 1217, 1219
EAI_OVERFLOW
constant, 1217
EAI_SERVICE
constant, 1217
EAI_SOCKTYPE
constant, 1217
EAI_SYSTEM
constant, 1217
EBADF
error, 97, 762, 1126, 1334,
1344, 1345
EBUSY
error, 270, 637, 1078, 1396
ECHILD
error, 542, 556, 903
ECHO
constant, 1303, 1304
example of use
, 1306, 1310, 1311
ECHOCTL
constant, 1303, 1304
ECHOE
constant, 1303, 1304
ECHOK
constant, 1303, 1304
ECHOKE
constant, 1303, 1304
ECHONL
constant, 1296, 1303
ECHOPRT
constant, 1303
ecvt()
, 656, 657
edata
variable, 118
diagram
, 119
EDEADLK
error, 636, 1129, 1139, 1431
edge-triggered notification, 1329–1330,
1366–1367
preventing file-descriptor starvation, 1367
EEXIST
error, 76, 315, 345, 349, 350,
924, 932, 938, 969, 999, 1059,
1109, 1357
EF_DUMPCORE
environment variable, 52
EFAULT
error, 187, 465
EFBIG
error, 761
effective group ID, 33, 168, 172, 173, 175,
177, 613
effective user ID, 33, 168, 172, 174, 175,
177, 613
effect on process capabilities, 806
da Silva, D., 1444
data segment, 116
resource limit on size of, 761
Datagram Congestion Control Protocol (DCCP), 1286
data-link layer, 1182
diagram
, 1181
DATEMSK
environment variable, 196
Davidson, F., xxxix
Davidson, S., xxxix
daylight saving time, 187
daylight
variable, 198
, 657
dbm_close()
, 657
CONFIG_UNIX98_PTYS
kernel option, 1381
confstr()
, 48, 588, 694
congestion control (TCP), 1192, 1194,
1236, 1443
connect()
, 426, 673, 1152, 1158
diagram
, 1156
example of use
, 1169, 1224, 1228
interrupted by signal handler, 444
prototype
, 1158
448, 1430
resource limit on size of, 760
CAP_LINUX_IMMUTABLE
capability, 306,
800, 807
CAP_MAC_ADMIN
capability, 800
CAP_MAC_OVERRIDE
capability, 800, 807
CAP_MKNOD
capability, 252, 368, 800, 807
Borman, D., 1194
Bostic, K., 1442
Bound, J., 1194
Bourne again shell (
Bourne, S., 25
Bourne shell (
), 3, 25, 154
file system, 261
buffer cache, 233, 234
using direct I/O to bypass, 246–247
buffer overrun, 792
buffering of file I/O, 233–250
diagram
, 244
effect of buffer size on performance,
234–236
in the kernel, 233–236, 239–243
overview, 243–244
in the stdio library, 237–239, 249
BUFSIZ
constant, 238
Build_ename.sh
, 57
built-in command (shell), 576
bus error (error message). See SIGBUS
BUS_ADRALN
constant, 441
BUS_ADRERR
constant, 441
BUS_MCEERR_AO
constant, 441
BUS_MCEERR_AR
constant, 441
BUS_OBJERR
constant, 441
busy file system, 270
Butenhof (1996), 630, 639, 647, 659, 687,
696, 751, 1105, 1422, 1438
Butenhof, D.R., xxxvi, 1438
byte stream, 879, 890
separating messages in, 910–911
diagram
, 911
bzero()
, 1166
C library, 47–48, 1442
C programming language, 2, 1440, 1444
ANSI 1989 standard, 11
C89 standard, 11, 17
C99 standard, 11, 17
ISO 1990 standard, 11
standards, 10–11
C shell (
), 4, 25
C89, 11, 17
C99, 11, 17
cache line, 748
calendar time, 185–187
changing, 204–205
calendar_time.c
, 191
calloc()
, 147–148
example of use
, 148
prototype
, 148
canceling a thread. See thread cancellation
cancellation point, thread cancellation,
673–674
canonical mode, terminal I/O, 1290,
1305, 1307
Cao, M., 1441
CAP_AUDIT_CONTROL
capability, 800
CAP_AUDIT_WRITE
capability, 800
CAP_CHOWN
capability, 292, 800, 807
CAP_DAC_OVERRIDE
capability, 287, 299,
800, 807
CAP_DAC_READ_SEARCH
capability, 299,
800, 807
CAP_FOWNER
capability, 76, 168, 287, 288,
300, 303, 308, 800, 807
cap_free()
, 808
example of use
, 809
ANSI (American National Standards
Institute), 11
ANSI C, 11
Anzinger, G., xxxix
application binary interface, 118, 867
command, 834
archive, 834
ARG_MAX
constant, 214
argument to
main()
, 31, 123
, 31, 118, 123,
124, 214, 564, 567
diagram
, 123
example of use
, 123
ARP (Address Resolution Protocol), 1181
ARPA (Advanced Research Projects
Agency), 1180
_SC_THREAD_KEYS_MAX
constant, 668
_SC_THREAD_STACK_MIN
constant, 682
_SC_THREADS
constant, 221
_SC_XOPEN_UNIX
constant, 221
_SEM_SEMUN_UNDEFINED
constant, 970
/dev/stdout
, 108
/dev/tty
device, 707, 708, 1321.
See also
controlling terminal
/dev/tty
devices, 1289
/dev/tty
devices, 1395
/dev/zero
device, 1034
used with
, 1034
example of use
, 1036
/etc
directory, 774
/etc/fstab
file, 263
/etc/group
file, 26, 155–156
/etc/gshadow
file, 156
/etc/hosts
file, 1210
The following conventions are used in the index:
Library function and system call prototypes are indexed with a subentry
labeled prototype. Normally, you'll find the main discussion of a function
or system call in the same location as the prototype.
Definitions of C structures are indexed with subentries labeled
definition. This is where you'll normally find the main discussion of a
structure.
Implementations of functions developed in the text are indexed with a
subentry labeled code of implementation.
Instructional or otherwise interesting examples of the use of functions,
variables, signals, structures, macros, constants, and files in example
programs are indexed with subentries labeled example of use. Not all
instances of the use of each interface are indexed; instead, just a few
examples that provide useful guidance are indexed.
Diagrams are indexed with subentries labeled diagram.
The names of example programs are indexed to make it easy to find an
explanation of a program that is provided in the source code
distribution for this book.
Citations referring to the publications listed in the bibliography are
indexed using the name of the first author and the year of publication,
for example, Rochkind (1985).
Bibliography
1445
Viega, J., and McGraw, G. 2002.
Building Secure Software
. Addison-Wesley,
Reading, Massachusetts.
Viro, A. and Pai, R. 2006. "Shared-Subtree Concept, Implementation, and
Applications in Linux," Proceedings of the Ottawa Linux Symposium 2006.
http://www.kernel.org/doc/ols/2006/ols2006v2-pages-209-222.pdf
Watson, R.N.M. 2000. "Introducing Supporting Infrastructure for Trusted
Operating System Support in FreeBSD," Proceedings of BSDCon 2000.
http://www.trustedbsd.org/trustedbsd-bsdcon-2000.pdf
Williams, S. 2002.
Free as in Freedom: Richard Stallman's Crusade for Free Software.
O'Reilly, Sebastopol, California.
Wright, G.R., and Stevens, W.R. 1995.
TCP/IP Illustrated, Volume 2:
The Implementation
. Addison-Wesley, Reading, Massachusetts.
Stevens, W.R. 1996.
TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP,
and the UNIX Domain Protocols
. Addison-Wesley, Reading, Massachusetts.
Stevens, W.R., Fenner, B., and Rudoff, A.M. 2004.
. New Riders, Indianapolis, Indiana.
http://sources.redhat.com/autobook/
Ritchie, D.M., and Thompson, K.L. 1974. "The Unix Time-Sharing System,"
Communications of the ACM, 17 (July 1974), pages 365–375.
Robbins, K.A., and Robbins, S. 2003.
UNIX Systems Programming: Communication, Concurrency, and Threads
(2nd edition). Prentice Hall, Upper Saddle River,
Rochkind, M.J. 1985.
Advanced UNIX Programming
. Prentice Hall, Englewood Cliffs,
Rochkind, M.J. 2004.
Advanced UNIX Programming (2nd edition). Addison-Wesley,
Reading, Massachusetts.
Rosen, L. 2005.
Open Source Licensing: Software Freedom and Intellectual Property Law.
Prentice Hall, Upper Saddle River, New Jersey.
St. Laurent, A.M. 2004.
Understanding Open Source and Free Software Licensing.
O'Reilly, Sebastopol, California.
Salus, P.H. 1994.
A Quarter Century of UNIX
. Addison-Wesley, Reading, Massachusetts.
Salus, P.H. 2008.
The Daemon, the Gnu, and the Penguin
. Addison-Wesley,
Reading, Massachusetts.
Mann, S., and Mitchell, E.L. 2003.
Linux System (2nd edition)
. Prentice Hall,
Englewood Cliffs, New Jersey.
Matloff, N. and Salzman, P.J. 2008.
The Art of Debugging with GDB, DDD, and Eclipse.
No Starch Press, San Francisco, California.
Maxwell, S. 1999.
Linux Core Kernel Commentary. Coriolis, Scottsdale, Arizona.
This book provides an annotated listing of selected parts of the
Linux 2.2.5 kernel.
McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S. 1984.
"A fast file system for UNIX," ACM Transactions on Computer Systems,
Vol. 2 (August)
Kozierok, C.M. 2005.
The TCP/IP Guide. No Starch Press, San Francisco, California.
http://www.tcpipguide.com/
Kroah-Hartman, G. 2003. "udev – A Userspace Implementation of devfs,"
Proceedings of the 2003 Linux Symposium
m/linux/talks/ols_2003_udev_paper/Reprint-Kroah-Hartman-OLS2003.pdf
Kumar, A., Cao, M., Santos, J., and Dilger, A. 2008. "Ext4 block and inode
allocator improvements," Proceedings of the 2008 Linux Symposium,
Ottawa, Canada.
http://ols.fedoraproject.org/OLS/Reprints-2008/kumar-reprint.pdf
Lemon, J. 2001. "Kqueue: A generic and scalable event notification
facility," Proceedings of USENIX 2001/Freenix Track.
http://people.freebsd.org/~jlemon/papers/kqueue_freenix.pdf
Lemon, J. 2002. "Resisting SYN flood DoS attacks with a SYN cache,"
Proceedings of USENIX BSDCon 2002.
http://people.freebsd.org/~jlemon/papers/syncache.pdf
Levine, J. 2000.
Linkers and Loaders
. Morgan Kaufmann, San Francisco, California.
http://www.iecc.com/linker/
Lewine, D. 1991.
POSIX Programmer's Guide. O'Reilly, Sebastopol, California.
Liang, S. 1999.
The Java Native Interface: Programmer's Guide and Specification.
Addison-Wesley, Reading, Massachusetts.
Libes, D., and Ressler, S. 1989.
Life with UNIX: A Guide for Everyone
. Prentice Hall,
Englewood Cliffs, New Jersey.
Lions, J. 1996.
Lions' Commentary on UNIX 6th Edition with Source Code. Peer-to-Peer
Communications, San Jose, California.
[Lions, 1996] was originally produced by an Australian academic, the late
John Lions, in 1977 for use in an operating systems class that he taught.
At that time, it could not be formally published because of licensing
restrictions. Nevertheless, pirated photocopies became widely distributed
within the UNIX community, and, in Dennis Ritchie's words, "educated a
generation" of UNIX programmers.
Love, R. 2010.
Linux Kernel Development (3rd edition)
. Addison-Wesley, Reading,
Massachusetts.
Lu, H.J. 1995. "ELF: From the Programmer's Perspective."
http://www.trunix.org/programlama/os/elf-hl/Documentation/elf/elf.html
Goodheart, B., and Cox, J. 1994.
The Magic Garden Explained: The Internals of UNIX SVR4.
Prentice Hall, Englewood Cliffs, New Jersey.
Goralski, W. 2009.
Drepper, U. 2004(b). "How to Write Shared Libraries."
http://people.redhat.com/drepper/dsohowto.pdf
Drepper, U. 2007. "What Every Programmer Should Know About Memory."
http://people.redhat.com/drepper/cpumemory.pdf
Drepper, U. 2009. "Defensive Programming for Red Hat Enterprise Linux."
http://people.redhat.com/drepper/defprogramming.pdf
Erickson, J.M. 2008.
Hacking: The Art of Exploitation (2nd edition). No Starch Press,
San Francisco, California.
Floyd, S. 1994. TCP and Explicit Congestion Notification,
ACM Computer
, Vol. 24, No. 5, October 1994, pages 10–23.
d/papers/tcp_ecn.4.pdf
Franke, H., Russell, R., and Kirkwood, M. 2002. "Fuss, Futexes and
Furwocks: Fast Userlevel Locking in Linux,"
Proceedings of the Ottawa Linux Symposium 2002.
http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
Frisch, A. 2002.
Essential System Administration (3rd edition). O'Reilly, Sebastopol,
Gallmeister, B.O. 1995.
POSIX.4: Programming for the Real World
Sebastopol, California.
Gammo, L., Brecht, T., Shukla, A., and Pariag, D. 2004. "Comparing and
Evaluating epoll, select, and poll Event Mechanisms,"
Linux Symposium 2002
http://www.kernel.org/doc/ols/2004/ols2004v1-pages-215-226.pdf
Gancarz, M. 2003.
Linux and the Unix Philosophy
. Digital Press.
., and Schwartz, A. 2003.
Practical Unix and
Borisov, N., Johnson, R., Sastry, N., and Wagner, D. 2005. "Fixing Races
for Fun and Profit: How to abuse atime,"
Proceedings of the 14th USENIX Security Symposium.
http://www.cs.berkeley.edu/~nks/papers/races-usenix05.pdf
Aho, A.V., Kernighan, B.W., and Weinberger, P. J. 1988.
The AWK Programming
Language
. Addison-Wesley, Reading, Massachusetts.
Albitz, P., and Liu, C. 2006.
DNS and BIND (5th edition). O'Reilly, Sebastopol,
Anley, C., Heasman, J., Lindner, F., and Richarte, G. 2007.
Handbook: Discovering and Exploiting Security Holes
Bach, M. 1986.
The Design of the UNIX Operating System
. Prentice Hall, Englewood Cliffs,
New Jersey.
Bhattiprolu, S., Biederman, E.W., Hallyn, S., and Lezcano, D. 2008.
"Virtual Servers and Checkpoint/Restart in Mainstream Linux,"
ACM SIGOPS
, Vol. 42, Issue 5, July 2008, pages 104–113.
http://www.mnis.fr/fr/services/virtualisation/pdf/cr.pdf
Bishop, M. 2003.
Computer Security: Art and Science
. Addison-Wesley, Reading,
Massachusetts.
Bishop, M. 2005.
. Addison-Wesley, Reading,
Massachusetts.
1436
Appendix F
Chapter 62
62-1.
Solutions to Selected Exercises
1435
Chapter 55
55-1.
The following hold for flock() on Linux:
a) A series of shared locks can starve a process waiting to place an
exclusive lock.
b) There are no rules regarding which process is granted the lock.
Essentially, the lock is granted to the process that is next scheduled.
If that process happens to be one that obtains a shared lock, then all
other processes requesting shared locks will also be able to have their
requests granted simultaneously.
55-2.
flock()
Chapter 47
47-5.
svsem/event_flags.c
for this book.
47-6.
A reserve operation can be implemented by reading a byte from the FIFO.
Conversely, a release operation can be implemented by writing a byte to
the FIFO. A conditional reserve operation can be implemented as a
nonblocking read of a byte from the FIFO.
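The three operations just described can be sketched in C as follows. This is only an illustrative sketch, not the book's solution: the function names (`fifo_sem_open()` and so on) are invented here, and opening the FIFO with `O_RDWR` relies on Linux behavior that SUSv3 leaves unspecified.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create the FIFO if necessary and open it for reading and writing.
   With O_RDWR the open does not block on Linux, and the FIFO always has
   a reader and a writer, so bytes stay queued between operations.
   Returns a file descriptor, or -1 on error. (Hypothetical helper.) */
int fifo_sem_open(const char *path)
{
    assert(path != NULL);
    if (mkfifo(path, S_IRUSR | S_IWUSR) == -1 && errno != EEXIST)
        return -1;
    return open(path, O_RDWR);
}

/* Release: add one unit by writing a single byte. */
int fifo_sem_release(int fd)
{
    return (write(fd, "x", 1) == 1) ? 0 : -1;
}

/* Reserve: take one unit by reading a single byte; blocks while the
   FIFO is empty. */
int fifo_sem_reserve(int fd)
{
    char c;
    ssize_t n;

    do {
        n = read(fd, &c, 1);
    } while (n == -1 && errno == EINTR);
    return (n == 1) ? 0 : -1;
}

/* Conditional reserve: a nonblocking read of one byte. Returns 0 if a
   unit was taken, or -1 with errno set to EAGAIN if none was available. */
int fifo_sem_try_reserve(int fd)
{
    char c;
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ssize_t n = read(fd, &c, 1);
    int savedErrno = errno;
    fcntl(fd, F_SETFL, flags);          /* restore blocking mode */
    errno = savedErrno;
    return (n == 1) ? 0 : -1;
}
```

Because `O_NONBLOCK` is a property of the open file description, cooperating processes would each open the FIFO themselves rather than share one descriptor inherited across fork().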
Chapter 48
48-2.
Since access to the
shmp->cnt
value in the
loop increment step is no longer
protected by the semaphore, there is a race condition between the writer next
writing (this will succeed without blocking), and then writes data to the
FIFO after the server has closed the reading descriptor. At this point, the
client will receive a SIGPIPE signal, since no process has the FIFO open
for reading. Alternatively, the client might be able to both open the FIFO
and write data to it before the server closes the reading descriptor. In
this case, the client's data would be lost, and it wouldn't receive a
response from the server. As a further exercise, you could try to
demonstrate these behaviors by making the suggested modification to the
server and creating a special-purpose client that repeatedly opens the
server's FIFO, sends a message to the server, closes the server's FIFO,
and reads the server's response (if any).
44-6.
One possible solution would
The problem is that
will be part of the same process group as
ourprog
therefore the
call will also terminate the
process. This is probably not
desired, and is likely to confuse users. The solution is to use
orphaned, the grandchild is adopted by init (process ID of 1). The program
doesn't need to perform a second wait() call, since init automatically
reaps the zombie when the grandchild terminates. Therein lies a possible
use for this code sequence: if we need to create a child for which we
can't later wait, then this sequence can be used to ensure that no zombie
results. One example of such a requirement is where the parent execs some
program that is not guaranteed to perform a wait (and we don't want to
rely on setting the disposition of SIGCHLD to SIG_IGN, since the
disposition of an ignored SIGCHLD after an exec() is left unspecified by
SUSv3).
27-5.
The string given to printf() doesn't include a newline character, and
therefore the output is not flushed before the exec() call. The exec()
overwrites the existing program's data segments (as well as the heap and
stack), which contain the stdio buffers, and thus the unflushed output is
lost.
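The effect can be observed with a small experiment, sketched below under some assumptions: the helper name `bytes_surviving_exec()` is invented here, the child's standard output is redirected to a pipe so that the parent can count what arrives, and the program execed is `true`, chosen only because it produces no output of its own.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that printf()s "Hello" (no newline) to a pipe, optionally
   calls fflush(), and then execs another program. Returns the number of
   bytes that reached the pipe, or -1 on error. */
long bytes_surviving_exec(int flushFirst)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    fflush(NULL);       /* Don't let the child inherit pending output */

    pid_t child = fork();
    if (child == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (child == 0) {
        close(pfd[0]);
        if (dup2(pfd[1], STDOUT_FILENO) == -1)
            _exit(126);
        close(pfd[1]);
        printf("Hello");        /* No newline: stays in the stdio buffer */
        if (flushFirst)
            fflush(stdout);
        execlp("true", "true", (char *) NULL);
        _exit(127);             /* exec failed; _exit() doesn't flush stdio */
    }

    close(pfd[1]);
    long total = 0;
    char buf[64];
    ssize_t n;
    while ((n = read(pfd[0], buf, sizeof(buf))) > 0)
        total += n;             /* Read until every writer has exited */
    close(pfd[0]);
    waitpid(child, NULL, 0);
    return total;
}
```

Calling the function with `flushFirst` set to 0 reports that nothing reached the pipe, while a call with 1 reports the five bytes of "Hello".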
27-6.
SIGCHLD
is delivered to the parent. If the
SIGCHLD
handler attempts to do a
wait()
the call returns an error (
ECHILD
) indicating that there were no children whose status
processes. All four processes carry on to execute the next
fork()
, each creating a
further child. Consequently, a total
of seven new processes are created.
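The count can be checked empirically. In the sketch below (the function name `count_new_processes()` is hypothetical), every process created by a run of successive fork() calls writes one byte to a shared pipe, and the original process counts the bytes. For n calls the expected result is 2^n - 1.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Execute fork() nforks times in succession, as in the exercise, and
   return the number of new processes created, or -1 on error. */
long count_new_processes(int nforks)
{
    assert(nforks >= 0 && nforks <= 8);     /* Keep the process tree small */

    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    fflush(NULL);               /* Avoid duplicated stdio output */
    pid_t original = getpid();

    for (int j = 0; j < nforks; j++)
        if (fork() == -1)       /* Every existing process forks again */
            break;

    if (getpid() != original) { /* A descendant: report in and exit */
        if (write(pfd[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    close(pfd[1]);
    long count = 0;
    char c;
    while (read(pfd[0], &c, 1) == 1)    /* EOF once all descendants exit */
        count++;
    close(pfd[0]);
    while (wait(NULL) > 0)              /* Reap our direct children */
        continue;
    return count;
}
```

With three calls, one process becomes two, then four, then eight, so seven of them are new.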
24-2.
A solution is provided in the file
procexec/vfork_fd_test.c
in the source code
distribution for this book.
24-3.
If we call fork(), and then have the child call raise() to send itself a
signal such as SIGABRT, this will yield a core dump file that closely
mirrors the state of the parent at the time of the fork(). The gcore
command allows us to perform a similar task for a program, without needing
to change the source code.
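A sketch of the fork-and-abort technique is shown below; the function name `snapshot_by_core_dump()` is hypothetical. Whether a core file is actually written, and where, depends on settings outside the program, such as the RLIMIT_CORE resource limit and the system's core-file naming configuration, so the function only reports what the kernel says happened.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child that immediately kills itself with SIGABRT, so that any
   core dump reflects the caller's state at the time of the fork().
   Returns the number of the signal that killed the child (SIGABRT if all
   went as planned), or -1 on error. If coreDumped is not NULL, it is set
   to 1 if the kernel reports that a core dump was produced, else 0. */
int snapshot_by_core_dump(int *coreDumped)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        signal(SIGABRT, SIG_DFL);   /* Ensure the default action applies */
        raise(SIGABRT);
        _exit(1);                   /* Reached only if SIGABRT is blocked */
    }

    int status;
    if (waitpid(child, &status, 0) == -1 || !WIFSIGNALED(status))
        return -1;

    if (coreDumped != NULL) {
#ifdef WCOREDUMP                    /* Not in SUSv3, but widely available */
        *coreDumped = WCOREDUMP(status) ? 1 : 0;
#else
        *coreDumped = 0;
#endif
    }
    return WTERMSIG(status);
}
```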
24-5.
Add a converse kill() call in the parent:
if (kill(childPid, SIGUSR1) == -1)
    errExit("kill");
And in the child, add a converse
sigsuspend
call:
sigsuspend(&origMask); /* Unblock SIGUSR1, wait for signal */
Chapter 25
25-1.
Assuming a two's complement architecture, where -1 is represented by a bit
pattern with all bits on, then the parent will see an exit status of 255
(all bits on in the least significant 8 bits, which are all that is passed
back to the parent when it calls wait()). (The presence of the call
exit(-1) in a program is usually a programmer error
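The truncation to 8 bits can be confirmed with a short experiment. The helper below is a sketch, and its name, `exit_status_seen_by_parent()`, is invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates with the given status and return the exit
   status that the parent obtains through waitpid(), or -1 on error. */
int exit_status_seen_by_parent(int status)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0)
        _exit(status);          /* Only the low 8 bits reach the parent */

    int wstatus;
    if (waitpid(child, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

Passing -1 yields 255, and passing 256 yields 0, since in both cases only the least significant 8 bits survive.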
since it is interpreted relative
to the location of the link
file, and thus refers to a
nonexistent file in the parent directory. Consequently,
fails with the error
ENOENT
(No such file or directory). (A
Chapter 12
12-1.
A solution is provided in the file
sysinfo/procfs_user_exe.c
in the source code
distribution for this book.
Chapter 13
13-3.
This sequence of statements ensures that the data written to a stdio
buffer is flushed to disk. The fflush() call flushes the stdio buffer for
fp to the kernel buffer cache. The argument given to the subsequent
fsync() is the file descriptor underlying fp; thus, the call flushes the
(recently filled) kernel buffer for that file descriptor to disk.
When standard output is sent to a terminal, it is line-buffered, so that
the output of the printf() call appears immediately, and is followed by
the output of write(). When standard output is sent to a disk file, it is
block-buffered. Consequently, the output of printf() is held in a stdio
buffer and is flushed only when the program exits (i.e., after the write()
call). (A complete program containing the code of this exercise is
available in the file filebuff/mix23_linebuff.c in the source code
distribution for this book.)
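Both orderings can be reproduced without a terminal by selecting the buffering mode explicitly. The sketch below is an approximation of the exercise, not the book's program: `capture_mixed_output()` is a hypothetical helper, the child's output is captured through a pipe, and setvbuf() stands in for the choice that stdio would otherwise make automatically (`_IOLBF` as for a terminal, `_IOFBF` as for a disk file). It also assumes that changing the mode with setvbuf() in the child takes effect, as it does with glibc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run printf() followed by write() in a child whose standard output uses
   the given buffering mode, and capture the combined output in out.
   Returns 0 on success, or -1 on error. */
int capture_mixed_output(int mode, char *out, size_t outSize)
{
    assert(out != NULL && outSize > 0);
    assert(mode == _IOLBF || mode == _IOFBF);

    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    fflush(NULL);               /* Don't let the child inherit pending output */

    pid_t child = fork();
    if (child == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (child == 0) {
        close(pfd[0]);
        if (dup2(pfd[1], STDOUT_FILENO) == -1)
            _exit(126);
        close(pfd[1]);
        setvbuf(stdout, NULL, mode, BUFSIZ);
        printf("printf\n");                     /* Goes through stdio */
        if (write(STDOUT_FILENO, "write\n", 6) != 6)    /* Bypasses stdio */
            _exit(1);
        exit(0);                /* exit() flushes whatever stdio still holds */
    }

    close(pfd[1]);
    size_t used = 0;
    ssize_t n;
    while (used < outSize - 1 &&
           (n = read(pfd[0], out + used, outSize - 1 - used)) > 0)
        used += (size_t) n;
    out[used] = '\0';
    close(pfd[0]);
    waitpid(child, NULL, 0);
    return 0;
}
```

With `_IOLBF` the captured text has the printf() line first; with `_IOFBF` the write() line comes first, because the printf() output waits in the stdio buffer until exit().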
Chapter 15
15-2.
system call doesn't change any file ti
9-1.
In considering the following, remember that changes to the effective user
ID always also change the file-system user ID.
real=2000, effective=2000, saved=2000, file-system=2000
real=1000, effective=2000, saved=2000, file-system=2000
real=1000, effective=2000, saved=0, file-system=2000
real=1000, effective=0, saved=0, file-system=2000
real=1000, effective=2000, saved=3000, file-system=2000
9-2.
Strictly speaking, such a process is unprivileged, since its effective user ID is
nonzero. However, an unprivileged process can use the
A call to
if (oldfd == newfd) { /* oldfd == newfd is a special case */
SELECTED EXERCISES
Chapter 5
A solution is provided in the file
fileio/atomic_append.c
in the source code
distribution for this book.
Here is an example of the results that we see when we
run this program as suggested:
ls -l f1 f2
-rw------- 1 mtk users 2000000 Jan 9 11:14 f1
-rw------- 1 mtk users 1999962 Jan 9 11:14 f2
Because the combination of
plus
is not atomic, one instance of the
1424
Appendix E
http://www.kernelnewbies.org/, Kernel Newbies, is a starting point for
programmers who want to learn about and modify the Linux kernel.
http://lxr.linux.no/linux/, Linux Cross-reference, provides browser access
to various versions of the Linux kernel source code. Each identifier in a
source file is hyperlinked to make it easy to find the definition and uses
of that identifier.
The kernel source code
If none of the preceding sources answer our questions, or if we want to
confirm that documented information is true, then we can read the kernel
source code. Although parts of the source code can be difficult to
understand, reading the code of a particular system call in the Linux
kernel source (or a library function in the GNU C library source) can
often prove to be a surprisingly quick way to find the answer to a
question.
If the Linux kernel source code has been installed on the system, it can
usually be found in the directory /usr/src/linux. Table E-1 provides
summary information about some of the subdirectories.
Table E-1: Subdirectories in the Linux kernel source tree

Directory      Contents
Documentation  Documentation of various aspects of the kernel
arch           Architecture-specific code, organized into subdirectories,
               for example, alpha, ia64, sparc, and x86
drivers        Code for device drivers
fs             Code for file systems, organized into subdirectories, for
               example, btrfs, ext4, proc (the /proc file system), and vfat
include        Header files needed by kernel code
init           Initialization code for the kernel
ipc            Code for System V IPC and POSIX message queues
kernel         Code related to processes, program execution, kernel
               modules, signals, time, and timers
lib            General-purpose functions used by various parts of the
               kernel
Further Sources of Information
1423
The POSIX threading API is thoroughly described in Programming with POSIX
Threads ([Butenhof, 1996]).
Linux and the Unix Philosophy ([Gancarz, 2003]) is a brief introduction to
the philosophy of application design on Linux and UNIX systems.
Various books provide an introduction to reading and modifying the Linux
kernel sources, including
Linux Kernel Development
([Love, 2010]) and
the Linux Kernel
command, and then examine the
source code, which is typically
/usr/src
On systems using the Debian package manager, the process is similar. We can
The manual pages describing the kernel and glibc APIs are available online at http://www.kernel.org/doc/man-pages/.
info documents
Rather than using the traditional manual page format, the GNU project documents much of its software using info documents, which are hyperlinked documents that can be browsed using the info command. A tutorial on the use of info can be
obtained using the command
Although in many cases the information in manual pages and corresponding
Overview, conventions, protocols, and miscellany: overviews of various topics, and
FURTHER SOURCES OF INFORMATION
Aside from the material in this book, many other sources of information about Linux system programming are available. This appendix provides a short introduc-
Manual pages
Manual pages are accessible via the man command. (The command man man describes how to use man to read manual pages.) The manual pages are divided into numbered sections that categorize information as follows:
Programs and shell commands: commands executed by users at the shell prompt.
System calls: Linux system calls.
Library functions: standard C library functions (as well as many other library functions).
Special files: special files, such as device files.
File formats: formats of files such as the system password (/etc/passwd) and
group (
Appendix D
Throughout this book, when we describe kernel options, we won't describe precisely where in the menuconfig or xconfig menu the option can be found. There are a few reasons for this:
KERNEL CONFIGURATION
Many features of the Linux kernel are components that can be optionally configured. Before compiling the kernel, these components can be disabled, enabled, or, in many cases, enabled as loadable kernel modules. One reason to disable an unneeded component is to reduce the size of the kernel binary, and thus save memory, if the component is not required. Enabling a component as a loadable module means that it will be loaded into memory only if it is required at run time.
n save memory.
Kernel configuration is done by executing one of a few different make commands in the root directory of the kernel source tree, for example, make menuconfig, which provides a curses-style configuration menu, or, more comfortably, make xconfig, which provides a graphical configuration menu. These commands produce a .config file in the root directory of the kernel source tree that is then used during kernel compilation. This file contains
Casting the NULL Pointer
Casting NULL in the manner of the last call above is generally required, even on implementations where NULL is defined as (void *) 0. This is because, although the C standards require that null pointers of different types should test true for comparisons on equality, they don't require that pointers of different types have the same internal representation (although on most implementations they do). And, as before, in a variadic function, the compiler can't cast (void *) 0 to a null pointer of the appropriate type.
The C standards make one exception to the rule that pointers of different types need not have the same representation: pointers of the types char * and void * are required to have the same internal representation. This means that passing (void *) 0 instead of (char *) 0 would not be a problem in the example case of execl(), but, in the general case, a cast is needed.
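The role of the cast can be illustrated with a small variadic function of our own. The following is a minimal sketch, not code from this book: countStrings() is a hypothetical helper that, like execl(), walks its arguments until it meets a null pointer. Because the compiler has no prototype information for the variable arguments, the caller must supply the terminator with the pointer type that va_arg() will later fetch.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): counts the string
   arguments that precede the terminating null pointer, in the style
   of the argument list accepted by execl(). */
int
countStrings(const char *first, ...)
{
    va_list ap;
    const char *s;
    int count = 0;

    va_start(ap, first);
    /* Each argument is fetched as a char pointer, so the caller must
       pass the terminator as a char pointer as well */
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        count++;
    va_end(ap);

    return count;
}
```

A call such as countStrings("ls", "-l", (char *) NULL) mirrors the execl() call shown in this appendix; writing a bare 0 as the final argument would pass an int rather than a pointer.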
Appendix C
CASTING THE NULL POINTER
Consider the following call to the variadic function execl():
execl("ls", "ls", "-l", (char *) NULL);
A variadic function is one that takes a variable number of arguments or arguments of varying types.
Parsing Command-Line Options
Although we have used the example of globbing above, similar scenarios can also occur as a result of other shell processing (e.g.,
Appendix B
int
main(int argc, char *argv[])
{
    int opt, xfnd;
    char *pstr;

    xfnd = 0;
    pstr = NULL;

    while ((opt = getopt(argc, argv, ":p:x")) != -1) {
        printf("opt =%3d (%c); optind = %d", opt, printable(opt), optind);
        if (opt == '?' || opt == ':')
            printf("; optopt =%3d (%c)", optopt, printable(optopt));
        printf("\n");

        switch (opt) {
        case 'p': pstr = optarg;        break;
        case 'x': xfnd++;               break;
        case ':': usageError(argv[0], "Missing argument", optopt);
        case '?': usageError(argv[0], "Unrecognized option", optopt);
        default:  fatal("Unexpected case in switch()");
        }
    }

    if (xfnd != 0)
        printf("-x was specified (count=%d)\n", xfnd);
    if (pstr != NULL)
        printf("-p was specified with the value \"%s\"\n", pstr);
    if (optind < argc)
        printf("First nonoption argument is \"%s\" at argv[%d]\n",
                argv[optind], optind);
    exit(EXIT_SUCCESS);
}
Example program
Listing B-1 demonstrates the use of
PARSING COMMAND-LINE OPTIONS
A typical UNIX command line has the following form:
command [ options ] arguments
An option takes the form of a hyphen (-) followed by a unique character identifying the option and a possible argument for the option. An option that takes an argument may optionally be separated from that argument by white space. Multiple options can be grouped after a single hyphen, and the last option in the group may be one that takes an argument. According to these rules, the following commands are all equivalent:
grep -l -i -f patterns *.c
grep -lif patterns *.c
grep -lifpatterns *.c
In the above commands, the -l and -i options don't have an argument, while the -f option takes the string patterns as its argument.
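The equivalence of the three grep command lines can be checked with the standard getopt() library function. The following is a minimal sketch under stated assumptions: struct grepOpts and parseGrepOpts() are hypothetical names invented for this illustration, and the option string ":lif:" simply encodes that -l and -i take no argument while -f requires one.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <unistd.h>

/* Hypothetical record of the options seen on a grep-like command line */
struct grepOpts {
    int lflag;                  /* Nonzero if -l was given */
    int iflag;                  /* Nonzero if -i was given */
    const char *farg;           /* Argument of -f, or NULL */
    int firstArg;               /* Index of first nonoption argument */
};

/* Returns 0 on success, or -1 on an unrecognized option or a missing
   option argument */
int
parseGrepOpts(int argc, char *argv[], struct grepOpts *res)
{
    int opt;

    memset(res, 0, sizeof(*res));
    res->farg = NULL;
    optind = 1;                 /* Restart scanning, so repeated calls work */

    while ((opt = getopt(argc, argv, ":lif:")) != -1) {
        switch (opt) {
        case 'l': res->lflag = 1;       break;
        case 'i': res->iflag = 1;       break;
        case 'f': res->farg = optarg;   break;
        default:  return -1;    /* ':' (missing argument) or '?' */
        }
    }

    res->firstArg = optind;
    return 0;
}
```

Passing each of the three argument vectors to parseGrepOpts() should yield the same flags and the same -f argument; only the index of the first nonoption argument differs.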
Since many programs (including some of the example programs in this book) need to parse options in the above format, the facility to do so is encapsulated in a standard library function,
Tracing System Calls
The -f option causes children of this process also to be traced. If we are sending trace output to a file (-o filename), then the alternative -ff option causes each process to write its trace output to a file named filename.PID.
The strace command is Linux-specific, but most UNIX implementations provide their own equivalents (e.g., truss on Solaris and ktrace on the BSDs).
The ltrace command performs an analogous task to strace, but for library functions. See the
Appendix A
Each system call is displayed in the form of a function call, with both input and output arguments shown in parentheses. As can be seen from the above examples, arguments are printed in symbolic form:
corresponding symbolic constants.
Strings are printed in text form (up to a limit of 32 characters, but the -s strsize option can be used to change this limit).
Structure fields are individually displayed (by default, only an abbreviated sub-
TRACING SYSTEM CALLS
The strace command allows us to trace the system calls made by a program. This is useful for debugging, or simply to find out what a program is doing. In its simplest form, we use strace as follows:
strace command arg...
This runs command, with the given command-line arguments, producing a trace of the system calls it makes. By default, strace writes its output to stderr, but we can change this using the -o filename option.
Examples of the type of output produced by strace include the following (taken from the output of the command strace date):
execve("/bin/date", ["date"], [/* 114 vars */]) = 0
Pseudoterminals
One problem with the above scenario is that, by default, the stdio package flushes standard output only when the stdio buffer is filled. This means that the output from the longrunner program will appear in bursts separated by long intervals of time. One way to circumvent this problem is to write a program that does the following:
a) Create a pseudoterminal.
b) Exec the program named in its command-line arguments with the standard file descriptors connected to the pseudoterminal slave.
c) Read output from the pseudoterminal master and write it immediately to standard output (STDOUT_FILENO, file descriptor 1), and, at the same time, read input from the terminal and write it to the pseudoterminal master, so that it can be read by the execed program.
Such a program, which we'll call unbuffer, would be used as follows:
./unbuffer longrunner | grep str
Write the unbuffer program. (Much of the code for this program will be similar to that of Listing 64-3.)
64-8.
Write a program that implements a scripting language that can be used to drive
in a noninteractive mode. Since
expects to be run from a terminal, the program
will need to employ a pseudoterminal.
Chapter 64
64.10 Exercises
64-1. In what order do the script parent process and the child shell process terminate when the user types the end-of-file character (usually Control-D) while running the program in Listing 64-3? Why?
64-2. Make the following modifications to the program in Listing 64-3 (script.c):
a) The standard script(1) program adds lines to the beginning and the end of the output file showing the time the script started and finished. Add this feature.
b) Add code to handle changes to the terminal window size as described in Section 64.7. You may find the program in Listing 62-5 (
for testing this feature.
64-3.
Modify the program in Listing 64-3 (
script.c
) to replace the use of
by a pair
of processes: one to handle
data transfer from the terminal to the pseudoterminal
master, and the other to handle data
transfer in the opposite direction.
64-4. Modify the program in Listing 64-3 (script.c) to add a time-stamped recording feature. Each time the program writes a string to the typescript file, it should also write a time-stamped string to a second file (say, typescript.timed). Records written to this second file might have the following general form:
<timestamp> <space> <string> <newline>
The timestamp should be recorded in text form as the number of milliseconds since the start of the script session. Recording the timestamp in text form has the advantage that the resulting file
is human-readable. Within
, real newline characters
will need to be escaped. One possibility is to record a newline as the 2-character
sequence
and a backslash as
Write a second program, script_replay.c, that reads the time-stamped script file and displays its contents on standard output at the same rate at which they were originally written. Together, these two programs provide a simple recording and playback feature for shell session logs.
64-5.
Implement client and server programs to provide a simple
To find an unused pseudoterminal pair, we execute a loop that attempts to open each master device in turn, until one of them is opened successfully. While executing this loop, there are two errors that we may encounter when calling open():
If a given master device name doesn't exist, open() fails with the error
Typically, this means we've run throug
responding slave by substituting tty in the name of the master. We can then open the slave using open().
With BSD pseudoterminals, there is no equivalent of grantpt() to change the ownership and permissions of the pseudoterminal slave. If we need to do this, then we must make explicit calls to chown() (only possible in a privileged program) and chmod().
understanding of the terminal window size differs from the actual size of the terminal. We can solve this problem as follows:
1. Install a handler for SIGWINCH in the parent process, so that it is signaled when the size of the terminal window changes.
2. When the parent receives a SIGWINCH
signal, it uses an
TIOCGWINSZ
by calling
posi
/dev/ptmx
, the pseudoterminal master clone
device. We then obtain the name of the
corresponding pseudoterminal slave using
. By contrast, with BSD pseudoterminals, the master and slave device pairs are precreated entries in the
directory. Each master device has a name of the
/dev/pty
, where
We then start an instance of our script program, which invokes a subshell. Once more, we display the name of the terminal on which the shell is running and the process ID of the shell:
./script
/dev/pts/24          Pseudoterminal slave opened by script
echo $$
29825                PID of subshell process started by script
Now we use ps to display information about the two shells and the process running script, and then terminate the shell started by script:
ps -p 7979 -p 29825 -C script -o "pid ppid sid tty cmd"
PID PPID SID TT CMD
7979 7972 7979 pts/1 /bin/bash
29824 7979 7979 pts/1 ./script
29825 29824 29825 pts/24 /bin/bash
The output of ps shows the parent-child relationships between the login shell, the process running script, and the subshell started by
Listing 64-3: A simple implementation of script(1)
pty/script.c
#include <sys/stat.h>
#include <fcntl.h>
#include <libgen.h>
#include <termios.h>
#include <sys/select.h>
#include "pty_fork.h" /* Declaration of ptyFork() */
#include "tty_functions.h" /* Declaration of ttySe */
#include "tlpi_hdr.h"
#define BUF_SIZE 256
#define MAX_SNAME 1000
struct termios ttyOrig;
Our implementation of script is shown in Listing 64-3. This program performs the following steps:
64.6 Implementing script(1)
We are now ready to implement a simple version of the standard script(1) program. This program starts a new shell session, and records all input and output from the session to a file. Most of the shell sessions shown in this book were recorded using script.
In a normal login session, the shell is connected directly to the user's terminal. When we run script, it places itself between the user's terminal and the shell, and uses a pseudoterminal pair to create a co
[Figure labels: terminal, (parent) process, master, slave, (child) process, (script file), kernel space; stdin, stdout, stderr. Note in figure: if echoing is enabled, slave input is copied to output.]
A write() to the master device succeeds, unless the input queue of the slave device is full, in which case the write() blocks. If the slave device is subsequently reopened, these bytes can be read.
UNIX implementations vary widely in their behavior for the last case. On some UNIX implementations, write()
fails with the error
. On other implementations, write() succeeds, but the output bytes are discarded (i.e., they can't be read if the slave is reopened). In general, these variations don't present a problem. Normally, the
/* Duplicate pty slave to be child's stdin, stdout, and stderr */
if (dup2(slaveFd, STDIN_FILENO) != STDIN_FILENO)
err_exit("ptyFork:dup2-STDIN_FILENO");
if (dup2(slaveFd, STDOUT_FILENO) != STDOUT_FILENO)
err_exit("ptyFork:dup2-STDOUT_FILENO");
if (dup2(slaveFd, STDERR_FILENO) != STDERR_FILENO)
err_exit("ptyFork:dup2-STDERR_FILENO");
mfd = ptyMasterOpen(slname, MAX_SNAME);
if (mfd == -1)
this argument
. Use of this argument is a convenience for certain interactive programs (e.g.,
) that use a pseudotermin
64.4 Connecting Processes
ptyFor
We are now ready to implement a function that
Listing 64-1:
pty/pty_master_open.c
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <fcntl.h>
#include "pty_master_open.h" /* Declares ptyMasterOpen() */
#include "tlpi_hdr.h"
int
ptyMasterOpen(char *slaveName, size_t snLen)
int masterFd, savedErrno;
char *p;
masterFd = posix_openpt(O_RDWR | O_NOCTTY); /* Open pty master */
if (masterFd == -1)
The GNU C library provides a reentrant analog of ptsname() in the form of ptsname_r(mfd, strbuf, buflen). However, this function is nonstandard and is available on few other UNIX implementations. The _GNU_SOURCE feature test macro must be defined in order to obtain the declaration of ptsname_r() from <stdlib.h>.
Once we have unlocked the slave device with unlockpt(), we can open it using the traditional open() system call.
On System V derivatives that employ STREAMS, it may be necessary to perform some further steps (pushing STREAMS modules onto the slave device after opening it). An example of how to perform these steps can be found in [Stevens & Rago, 2005].
64.3 Opening a Master: ptyMasterOpen()
We now present a function, ptyMasterOpen(), that employs the functions described in the previous sections to open a pseudoterminal master and obtain the name of the corresponding pseudoterminal slave. Our reasons for providing such a function are twofold:
Most programs perform these steps in exactly the same way, so it is convenient to encapsulate them in a single function.
Our
ptyMasterOpen()
change the permissions on the slave so that the owner has read and write permissions, and group has write permission.
The reason for changing the group of the terminal to
and enabling group write
permission is that the
and
pt
#define _XOPEN_SOURCE 500
#include <stdlib.h>
char *ptsname(int mfd);
The posix_openpt() function is new in SUSv3, and was an invention of the POSIX committee. In the original System V pseudoterminal implementation, obtaining an available pseudoterminal master was accomplished by opening the pseudoterminal master clone device /dev/ptmx. Opening this virtual device automatically locates and opens the next unused pseudoterminal master, and returns a file descriptor for it. This device is provided on Linux, where posix_openpt() is implemented as follows:
int
posix_openpt(int flags)
64.2 UNIX 98 Pseudoterminals
Bit by bit, we'll work toward the development of a function, ptyFork(), that does most of the work to create
In some cases, multiple processes may be connected to the slave side of the pseudoterminal. Our ssh example illustrates this point. The session leader for the slave is a shell, which creates process groups to execute the commands entered by the remote user. All of these processes have the pseudoterminal slave as their controlling terminal. As with a conventional terminal, one of these process groups can be the foreground process group for the pseudoterminal slave, and only this process group is allowed to read from the slave and (if the TOSTOP bit has been set) write to it.
Applications of pseudoterminals
Pseudoterminals are also used in many applications other than network services. Examples include the following:
The expect(1) program uses a pseudoterminal to allow an interactive terminal-oriented program to be driven from a script file.
Terminal emulators such as xterm employ pseudoterminals to provide the terminal-related functionality that goes with a terminal window.
The screen(1) program uses pseudoterminals to multiplex a single physical terminal
Pseudoterminals can also be used to connect an arbitrary pair of processes (i.e., not necessarily a parent and child). All that is required is that the process that opens the pseudoterminal master informs the other process of the name of the corresponding slave device, perhaps by writing that name to a file or by transmitting it using some other IPC mechanism. (When we use fork() in the manner described above, the child automatically inherits sufficient informa-
and we employ this abbreviation in various diagrams and function names in this chapter.) The standard input, output, and error of the terminal-oriented program are connected to the pseudoterminal slave, which is also the controlling terminal for the program. On the other side of the pseudoterminal, a driver program acts as a proxy for the user, supplying input to the terminal-oriented program and reading that program's output.
Figure 64-2: Two programs communicating via a pseudoterminal
Typically, the driver program is simultaneously reading from and writing to another I/O channel. It is acting as a relay, passing data in both directions between the pseudoterminal and another program. In order to do this, the driver program must simultaneously monitor input arriving from either direction. Typically, this is done using I/O multiplexing (select() or poll()), or using a pair of processes or threads to perform data transfer in each direction.
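The relay role of the driver program can be sketched with poll(). The following is a minimal illustration, not code from this book: relayOnce() is a hypothetical helper that waits for input on either of two descriptors and copies whatever arrives to the corresponding output descriptor. In a pseudoterminal driver, one pair would typically be standard input and the pseudoterminal master, and the other pair would be the master and standard output.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

#define RELAY_BUF_SIZE 256

/* Hypothetical helper: waits up to 'timeoutMs' milliseconds for input
   on 'inA' or 'inB'. Data read from 'inA' is written to 'outA', and
   data read from 'inB' is written to 'outB'. Returns the number of
   bytes relayed, 0 on timeout or end-of-file, or -1 on error. */
ssize_t
relayOnce(int inA, int outA, int inB, int outB, int timeoutMs)
{
    struct pollfd pfd[2];
    char buf[RELAY_BUF_SIZE];
    ssize_t numRead, total = 0;
    int j, ready;

    pfd[0].fd = inA;
    pfd[0].events = POLLIN;
    pfd[1].fd = inB;
    pfd[1].events = POLLIN;

    ready = poll(pfd, 2, timeoutMs);
    if (ready <= 0)
        return ready;                   /* 0 on timeout, -1 on error */

    for (j = 0; j < 2; j++) {
        if (pfd[j].revents & (POLLIN | POLLHUP)) {
            numRead = read(pfd[j].fd, buf, RELAY_BUF_SIZE);
            if (numRead == -1)
                return -1;
            if (numRead > 0 &&
                    write(j == 0 ? outA : outB, buf, numRead) != numRead)
                return -1;              /* Error or partial write */
            total += numRead;
        }
    }

    return total;
}
```

A real driver would call such a function in a loop and stop when a read returns end-of-file; the sketch treats a partial write as an error to stay short.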
An application that uses a pseudoterminal typically does so as follows:
1. The driver program opens the pseudoterminal master device.
2. The driver program calls fork() to create a child process. The child performs the following steps:
a) Call
[Figure 64-2 labels: driver program, terminal-oriented program, master, slave, kernel space; stdin, stdout, stderr. Note in figure: the slave is the controlling terminal for the terminal-oriented program.]
Furthermore, a terminal-oriented program expects a terminal driver to perform certain kinds of processing of its input and output. For example, in canonical mode, when the terminal driver sees the end-of-file character (normally Control-D) at the start of a line
read
ptor for the controlling terminal by opening /dev/tty, and also makes it possible to generate job-control and terminal-related signals (e.g., SIGTSTP, SIGTTIN, and SIGINT) for the program.
From this description, it should be clear that the definition of a terminal-oriented program is quite broad. It encompasses a wide range of programs that we would normally run in an interactive terminal session.
Figure 64-1: [figure labels: local host, remote host, terminal, program, terminal-oriented program, socket, protocols, kernel space]
PSEUDOTERMINALS
A pseudoterminal is a virtual device that provides an IPC channel. On one end of the channel is a program that expects to be connected to a terminal device. On the other end is a program that drives the terminal-oriented program by using the channel to send it input and read its output.
This chapter describes the use of pseudoterminals, showing how they are employed in applications such as terminal emulators, the script(1) program, and
Chapter 63
Further information
Alternative I/O Models
63.6 Summary
In this chapter, we explored various alternatives to the standard model for performing I/O: I/O multiplexing (select() and poll()), signal-driven I/O, and the Linux-specific epoll API. All of these mechanisms allow us to monitor multiple file descriptors to see if I/O is possible on any of them. None of these mechanisms actually
/* ... Initialize 'timeout', 'readfds', and 'nfds' for select() */
if (pipe(pfd) == -1)
errExit("pipe");
The signal handler is installed after creating the pipe, in order to prevent the race condition that would occur if a signal was delivered before the pipe was created.
It is safe to use write() inside the signal handler, because it is one of the async-signal-safe functions listed in Table 21-1, on page 426.
4. Place the select() call in a loop, so that it is restarted if interrupted by a signal handler. (Restarting in this fashion is not strictly necessary; it merely means that we can check for the arrival of a signal by inspecting readfds, rather than checking for an EINTR
Chapter 63
Listing 63-8:
Using
The ppoll() and epoll_pwait() system calls
Linux 2.6.16 also added a new, nonstandard system call, ppoll(), whose relationship to poll() is analogous to the relationship of pselect() to select(). Similarly, starting with kernel 2.6.19, Linux also includes epoll_pwait(), providing an analogous extension to epoll_wait(). See the
file descriptor that is monitored (along with other file descriptors) using select(), poll(), or epoll_wait().
63.5.1 The pselect() system call performs a similar task to select().
ence is an additional argument,
63.5 Waiting on Signals
volatile sig_atomic_t gotSig = 0;

void
handler(int sig)
{
    gotSig = 1;
}

int
main(int argc, char *argv[])
{
    struct sigaction sa;
    ...

    sa.sa_handler = handler;    /* SA_SIGINFO is not set, so use sa_handler */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        errExit("sigaction");

    /* What if the signal is delivered now? */

    ready = select(nfds, &readfds, NULL, NULL, NULL);
    if (ready > 0) {
        printf("%d file descriptors ready\n", ready);
    } else if (ready == -1 && errno == EINTR) {
        if (gotSig)
            printf("Got signal\n");
    } else {
        /* Some other error */
    }
    ...
}
The problem with this code is that if the signal (SIGUSR1 in this example) arrives after establishing the handler but before select() is called, then the select() call will nevertheless block. (This is a form of race condition.) We now look at some solutions to this problem.
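One way to close this window is to keep the signal blocked until the wait begins and let pselect() unblock it atomically. The following is a minimal sketch, not code from this book: waitForInputOrSignal() is a hypothetical helper, and the return codes are an arbitrary convention chosen for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/select.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>

static volatile sig_atomic_t gotSig = 0;

static void
handler(int sig)
{
    (void) sig;
    gotSig = 1;
}

/* Hypothetical helper: waits for input on 'fd' or delivery of SIGUSR1.
   Returns 1 if 'fd' is readable, 2 if the signal was caught,
   0 on timeout, or -1 on error. */
int
waitForInputOrSignal(int fd, long timeoutSecs)
{
    struct sigaction sa;
    sigset_t blockSet, origMask, waitMask;
    struct timespec ts;
    fd_set readfds;
    int ready, savedErrno;

    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1, so that it stays pending if it arrives now */
    sigemptyset(&blockSet);
    sigaddset(&blockSet, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &blockSet, &origMask) == -1)
        return -1;

    /* Mask to be in force only while pselect() is waiting */
    waitMask = origMask;
    sigdelset(&waitMask, SIGUSR1);

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    ts.tv_sec = timeoutSecs;
    ts.tv_nsec = 0;

    ready = pselect(fd + 1, &readfds, NULL, NULL, &ts, &waitMask);
    savedErrno = errno;

    sigprocmask(SIG_SETMASK, &origMask, NULL);  /* Restore caller's mask */

    if (ready > 0)
        return 1;
    if (ready == 0)
        return 0;
    if (savedErrno == EINTR && gotSig) {
        gotSig = 0;
        return 2;
    }
    return -1;
}
```

Because the signal can be delivered only while pselect() is executing, a signal that arrives between sigaction() and the wait is not lost; it is delivered as soon as pselect() installs waitMask.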
Since version 2.6.27, Linux provides a further technique that can be used to simultaneously wait on signals and file descriptors: the signalfd mechanism described in Section 22.11. Using this mechanism, we can receive signals via a
As we noted in Section 63.1.1, edge-triggered notification is usually employed in conjunction with nonblocking file descriptors. Thus, the general framework for using edge-triggered epoll notification is as follows:
1. Make all file descriptors that are to be monitored nonblocking.
2. Build the epoll interest list using epoll_ctl().
3. Handle I/O events using the following loop:
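The three steps of this framework can be sketched as follows. This is a minimal illustration under stated assumptions, not the loop from this book: addEdgeTriggered(), drainInput(), and waitAndDrain() are hypothetical helper names, and the sketch handles one event per call and simply discards the data that it reads.

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Steps 1 and 2: make 'fd' nonblocking and add it to the interest list
   of 'epfd' with edge-triggered notification. Returns 0 or -1. */
int
addEdgeTriggered(int epfd, int fd)
{
    struct epoll_event ev;
    int flags;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLET;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Reads from nonblocking 'fd' until read() fails with EAGAIN (no more
   data for now) or returns 0 (end-of-file). Returns the number of
   bytes consumed, or -1 on error. */
long
drainInput(int fd)
{
    char buf[512];
    long total = 0;
    ssize_t n;

    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0)
            total += n;
        else if (n == 0 || errno == EAGAIN)
            break;
        else if (errno != EINTR)
            return -1;
    }
    return total;
}

/* Step 3 (one iteration): wait for an event, then perform as much I/O
   as possible on the ready descriptor. Returns bytes consumed,
   0 on timeout, or -1 on error. */
long
waitAndDrain(int epfd, int timeoutMs)
{
    struct epoll_event ev;
    int ready;

    ready = epoll_wait(epfd, &ev, 1, timeoutMs);
    if (ready <= 0)
        return ready;
    return drainInput(ev.data.fd);
}
```

Reading until EAGAIN matters because an edge-triggered descriptor is reported only when new input arrives; data left unread after one notification would not trigger another.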
insignificant compared to the time required for the system call to monitor N descriptors. Table 63-9 doesn't include the times for the inspection step.
Very roughly, we can say that for large values of N (the number of file descriptors being monitored), the performance of select() and poll() scales linearly with N. We
start to see this behavior for the
and
N = 1000
cases in Table 63-9. By the
time we reach N = 10000, the scaling has actually become worse than linear.
By contrast, epoll scales (linearly) according to the number of I/O events that occur. The epoll API is thus particularly efficient in a scenario that is common in servers that handle many simultaneous clients: of the many file descriptors being monitored, most are idle; only a few descriptors are ready.
63.4.6 Edge-Triggered Notification
By default, the epoll mechanism provides level-triggered notification. By this, we mean that
63.4.5 Performance of epoll Versus I/O Multiplexing
Table 63-9 shows the results (on Linux 2.6.25) when we monitor
contiguous file
descriptors in the range
to
using
se
. (The test was arranged such that during each monitoring operation, exactly one randomly selected file descriptor is ready.) From this table, we see that as the number of file descriptors to be monitored grows large, select() and poll() perform poorly. By contrast, the performance of epoll hardly declines as N grows large. (The small decline in performance as N increases is possibly a result of reaching CPU caching limits on the test system.)
For the purposes of this test, FD_SETSIZE was changed to 16,384 in the header files to allow the test program to monitor large numbers of file descriptors using select().
In Section 63.2.5, we saw why select() and poll() perform poorly when monitoring large numbers of file descriptors. We now look at the reasons why epoll
argument in a call to
epoll_ct
int epfd, fd1, fd2;
struct epoll_event ev;
struct epoll_event evlist[MAX_EVENTS];

/* Omitted: code to open 'fd1' and create epoll file descriptor 'epfd' ... */

ev.data.fd = fd1;
ev.events = EPOLLIN;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd1, &ev) == -1)
    errExit("epoll_ctl");

/* Suppose that 'fd1' now happens to become ready for input */

fd2 = dup(fd1);
close(fd1);
ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
if (ready == -1)
    errExit("epoll_wait");
printf("Ready: %d\n", ready);
Listing 63-5: Using the epoll API
altio/epoll_input.c
#include <sys/epoll.h>
#include <fcntl.h>
#include "tlpi_hdr.h"
described in Section 44.7.) In the other window, we run instances of cat that write data to these FIFOs.
Terminal window 1
Terminal window 2
$ mkfifo p q
$ ./epoll_input p q
$ cat > p
Opened "p" on fd 4
Type Control-Z to suspend cat
[1]+ Stopped cat > p
$ cat > q
Opened "q" on fd 5
About to epoll_wait()
Type Control-Z to suspend the epoll_input program
[1]+ Stopped ./epoll_input p q
Above, we suspended our monitoring program so that we can now generate input on both FIFOs, and close the write end of one of them:
qqq
Type Control-D to terminate "cat > q"
$
cat > p
ppp
Now we resume our monitoring program by bringing it into the foreground, at which point
The EPOLLONESHOT flag
By default, once a file descriptor is added to an epoll interest list using the epoll_ctl() EPOLL_CTL_ADD operation, it remains active (i.e., subsequent calls to epoll_wait() will inform us whenever the file descriptor is ready) until we explicitly remove it from the list using the epoll_ctl() EPOLL_CTL_DEL operation. If we want to be notified only once about a particular file descriptor, then we can specify the EPOLLONESHOT flag (available since Linux 2.6.2) in the ev.events value passed in epoll_ctl(). If this flag is specified, then, after the next epoll_wait() call that informs us that the corresponding file descriptor is ready, the file descriptor is marked inactive in the interest list, and we won't be informed about its state by future epoll_wait() calls. If desired, we can subsequently reenable monitoring of this file descriptor using the epoll_ctl() EPOLL_CTL_MOD operation. (We can't use the EPOLL_CTL_ADD operation for this purpose, because the inactive file descriptor is still part of the epoll interest list.)
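The life cycle described above can be made concrete with three small helpers. This is a minimal sketch, not code from this book: addOneShot(), rearmOneShot(), and pollReady() are hypothetical names chosen for the illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Adds 'fd' to the interest list of 'epfd' for a single notification.
   Returns 0 on success, or -1 on error. */
int
addOneShot(int epfd, int fd)
{
    struct epoll_event ev;

    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLONESHOT;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Reenables a descriptor that was marked inactive after notification.
   EPOLL_CTL_MOD is used because 'fd' is still in the interest list. */
int
rearmOneShot(int epfd, int fd)
{
    struct epoll_event ev;

    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLONESHOT;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Nonblocking check: returns the number of ready descriptors
   (at most 8 in this sketch), or -1 on error. */
int
pollReady(int epfd)
{
    struct epoll_event evlist[8];

    return epoll_wait(epfd, evlist, 8, 0);
}
```

With unread input pending on the descriptor, a first pollReady() call reports it, a second call reports nothing, and the descriptor is reported again only after rearmOneShot(); a second addOneShot() on the same descriptor is expected to fail.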
Example program
Listing 63-5 demonstrates the use of the epoll API. As command-line arguments, this program expects the pathnames of one or more terminals or FIFOs. The program performs the following steps:
instance
Open each of the files named on the command line for input and add the resulting file descriptor to the interest list of the epoll instance, specifying the
descriptor associated with this event. Thus, when we make the epoll_ctl() call that places a file descriptor in the interest list, we should either set ev.data.fd to the file descriptor number (as shown in Listing 63-4) or set ev.data.ptr to point to a structure that contains the file descriptor number.
The timeout argument determines the blocking behavior of epoll_wait(), as follows:
If timeout equals -1, block until an event occurs for one of the file descriptors in the interest list for epfd or until a signal is caught.
If timeout equals 0, perform a nonblocking check to see which events are currently available on the file descriptors in the interest list for epfd.
If timeout is greater than 0, block for up to timeout milliseconds, until an event occurs on one of the file descriptors in the interest list for epfd, or until a signal is caught.
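The three timeout cases can be tried out with a pair of small helpers. The following is a minimal sketch under stated assumptions, not code from this book: monitorForInput() and firstReadyFd() are hypothetical names, and the negative return values are a convention invented for this illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 5

/* Creates an epoll instance that monitors 'fd' for input.
   Returns the epoll file descriptor, or -1 on error. */
int
monitorForInput(int fd)
{
    struct epoll_event ev;
    int epfd;

    epfd = epoll_create(5);
    if (epfd == -1)
        return -1;

    ev.data.fd = fd;                    /* Returned to us by epoll_wait() */
    ev.events = EPOLLIN;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* 'timeoutMs' is passed straight to epoll_wait(): -1 blocks until an
   event or signal, 0 performs a nonblocking check, and a positive
   value waits for up to that many milliseconds. Returns the descriptor
   of the first ready event, -1 if nothing was ready, or -2 on error
   (including interruption by a signal handler). */
int
firstReadyFd(int epfd, int timeoutMs)
{
    struct epoll_event evlist[MAX_EVENTS];
    int ready;

    ready = epoll_wait(epfd, evlist, MAX_EVENTS, timeoutMs);
    if (ready == -1)
        return -2;
    if (ready == 0)
        return -1;
    return evlist[0].data.fd;
}
```

The value stored in ev.data.fd when the descriptor was registered is what lets the caller tell which descriptor an event refers to.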
On success,
Listing 63-4 shows an example of the use of epoll_create() and epoll_ctl().
Listing 63-4: Using epoll_create() and epoll_ctl()
int epfd;
struct epoll_event ev;

epfd = epoll_create(5);
if (epfd == -1)
    errExit("epoll_create");

ev.data.fd = fd;
ev.events = EPOLLIN;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");
The max_user_watches limit
Because each file descriptor registered in an epoll interest list requires a small amount of nonswappable kernel memory, the kernel provides an interface that defines a limit on the total number of file descriptors that each user can register in all epoll interest lists. The value of this limit can be viewed and modified via max_user_watches, a Linux-specific file in the /proc/sys/fs/epoll directory. The default value of this limit is calculated based on available system memory (see the
63.4.3 Waiting for Events:
argument identifies which of the file descriptors in the interest list is to have
its settings modified. This argument can
be a file descriptor for a pipe, FIFO,
The epoll_ctl() system call manipulates the interest list associated with an epoll instance. Using epoll_ctl(), we can add a new file descriptor to the list, remove an existing descriptor from the list
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *ev);
Returns 0 on success, or -1 on error
F_OWNER_TID
field specifies the ID of a thread th
sigtimedwait() (Section 22.10). These system calls return a siginfo_t structure that contains the same information as is passed to a signal handler established with SA_SIGINFO. Accepting signals in this manner returns us to a synchronous model of event processing, but with the advantage that we are much more efficiently notified about the file descriptors on which I/O events
sele
Handling signal-queue overflow
We saw in Section 22.8 that there is a limit on the number of realtime signals that may be queued. If this limit is reached, the kernel reverts to delivering the default SIGIO signal for "I/O possible" notifications. This informs the process that a signal-queue overflow occurred. When this happens, we lose information about which file descriptors have I/O events, because SIGIO is not queued. (Furthermore, the SIGIO
argument, which means that the signal handler
(In order to obtain the definitions of the
Signal-driven I/O works for stream sock
Establish the signal handler before enabling signal-driven I/O
Because the default action of SIGIO is to terminate the process, we should
enable the handler for SIGIO before enabling signal-driven I/O on a file
descriptor. If we enable signal-driven I/O before establishing the SIGIO
handler, then there is a time window during which, if I/O becomes possible,
delivery of SIGIO will terminate the process.
On some UNIX implementations, SIGIO is ignored by default.
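The ordering described above can be sketched as follows. This is an illustrative sketch rather than the book's listing: the helper name enableSigio() and the flag variable gotSigio are hypothetical, and the code assumes Linux with glibc, where O_ASYNC is the flag that enables signal-driven I/O.

```c
#define _GNU_SOURCE             /* Assumption: exposes O_ASYNC on glibc */
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t gotSigio = 0;

static void
sigioHandler(int sig)
{
    gotSigio = 1;               /* Just set a flag; the main program does the I/O */
}

/* Enable signal-driven I/O on fd for the calling process.
   Returns 0 on success, or -1 on error. */
int
enableSigio(int fd)
{
    struct sigaction sa;
    int flags;

    /* First establish the handler, so that SIGIO can never arrive
       while its default (terminating) disposition is still in place */

    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sa.sa_handler = sigioHandler;
    if (sigaction(SIGIO, &sa, NULL) == -1)
        return -1;

    /* Only then name the owner and switch on signal generation */

    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) == -1)
        return -1;
    return 0;
}
```

Swapping the sigaction() call with the two fcntl() calls would reopen the time window in which an early SIGIO terminates the process.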
/* Establish handler for "I/O possible" signal */

sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_RESTART;
sa.sa_handler = sigioHandler;
if (sigaction(SIGIO, &sa, NULL) == -1)
    errExit("sigaction");
Example program
Listing 63-3 provides a simple example of the use of signal-driven I/O. This
program performs the steps described above for enabling signal-driven I/O on
standard input, and then places the terminal in cbreak mode (Section 62.6.3),
so that input is available a character at a time. The program then enters an
infinite loop, performing the "work" of incrementing a variable while waiting
for input to become available. Whenever input becomes available, the
the signal is delivered to the process. To use signal-driven I/O, a program
performs the following steps:
1. Establish a handler for the signal delivered by the signal-driven I/O
mechanism. By default, this notification signal is
63.2.5 Problems with select() and poll()
The select() and poll() system calls are the portable, long-standing, and
widely used
precision afforded by select() (microseconds) is greater than that afforded by
poll() (milliseconds). (The accuracy of the timeouts of both of these system
calls is nevertheless limited by the software clock granularity.)
If one of the file descriptors being monitored was closed, then poll() informs
us exactly which one, via the POLLNVAL bit in the corresponding revents field.
By contrast, select()
of the connection. The use of this flag allows an application that uses the
edge-triggered interface to employ simpler code to recognize a remote shutdown.
(The alternative is for the application to note that the POLLIN
On some other UNIX implementations, if
the read end of a pipe is closed,
Regular files
File descriptors that refer to regular files are always marked as readable and
writable by
    srandom((int) time(NULL));
    for (j = 0; j < numWrites; j++) {
        randPipe = random() % numPipes;
        printf("Writing to fd: %3d (read fd: %3d)\n",
                pfds[randPipe][1], pfds[randPipe][0]);
        if (write(pfds[randPipe][1], "a", 1) == -1)
            errExit("write %d", pfds[randPipe][1]);
    }
/* Build the file descriptor list to be supplied to poll(). This list
The following shell session shows an example of what we see when running this
program. The command-line arguments to the program specify that ten pipes
should be created, and writes should be made to three randomly selected pipes.
$ ./poll_pipes 10 3
Writing to fd: 4 (read fd: 3)
Writing to fd: 14 (read fd: 13)
Writing to fd: 14 (read fd: 13)
POLLRDHUP is a Linux-specific flag available since kernel 2.6.17. In order to
obtain this definition from <poll.h>, the _GNU_SOURCE feature test macro must
be defined.
POLLNVAL is returned if the specified file descriptor was closed at the time of
the poll() call.
Summarizing the above points, the poll() flags of real interest are POLLIN,
POLLOUT, POLLPRI, POLLRDHUP, POLLHUP, and POLLERR. We consider the meanings of
these flags in
It is permissible to specify events as 0 if we are not interested in events on
a particular file descriptor. Furthermore, specifying a negative value for the
fd field (e.g., negating its value if nonzero) causes the corresponding events
field to be ignored and the revents field always to be returned as 0. Either of
these techniques can be used to (perhaps temporarily) disable monitoring of a
single file descriptor, without needing to rebuild the entire fds list.
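The negation technique can be illustrated with a short sketch. The helper name pollIgnoringOne() is hypothetical, and the code relies on the caveat noted in the text: file descriptor 0 cannot be disabled this way, because negating 0 leaves it unchanged.

```c
#include <poll.h>

/* Poll the n entries in fds without blocking, ignoring entry 'skip' for
   this one call by temporarily negating its fd field. The array is left
   as we found it. Returns the result of poll().
   Caveat: this cannot be used to ignore file descriptor 0. */
int
pollIgnoringOne(struct pollfd *fds, nfds_t n, nfds_t skip)
{
    int ready;

    fds[skip].fd = -fds[skip].fd;   /* Negative fd: events is ignored */
    ready = poll(fds, n, 0);        /* Timeout of 0: return immediately */
    fds[skip].fd = -fds[skip].fd;   /* Restore, so monitoring resumes */
    return ready;
}
```

After the call, the revents field of the skipped entry is 0 even if the descriptor was ready, and no other entry of the array had to be moved or rebuilt.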
Note the following further points regarding the Linux implementation of poll():
POLLIN and POLLRDNORM are synonymous.
POLLOUT and POLLWRNORM are synonymous.
POLLRDBAND is generally unused; that is, it is ignored in the events field
definitions of the constants POLLRDNORM, POLLRDBAND, POLLWRNORM, and POLLWRBAND
from <poll.h>
Table 63-2: Bit-mask values for events and revents fields of the pollfd structure
BitInput in
events
63.2.2 The poll() System Call
The poll() system call performs a similar task to select(). The major difference
for (fd = 0; fd < nfds; fd++)
int
main(int argc, char *argv[])
If we use the Linux-specific personality() system call to set a personality
that includes the STICKY_TIMEOUTS personality bit, then select() doesn't modify
the structure pointed to by
The timeout argument controls the blocking behavior of select(). It can be
specified either as NULL, in which case select() blocks indefinitely, or as a
pointer to a timeval structure:

struct timeval {
    time_t      tv_sec;         /* Seconds */
    suseconds_t tv_usec;        /* Microseconds (long int) */
};

If both fields of timeout are 0, then select() doesn't block; it simply polls
the specified file descriptors to see which ones are
These macros operate as follows:
FD_ZERO()
We can use select() and poll() to monitor file descriptors for regular files,
terminals, pseudoterminals, pipes, FIFOs, sock
starve other file descriptors of attention if we perform a large amount of I/O
on one file descriptor. We consider this
63.1.1 Level-Triggered and Edge-Triggered Notification
Before discussing the various alternative I/
Which technique?
consider the reasons we may choose one of these techniques rather than
another. In the meantime, we summarize a few points:
The select() and poll() system calls are long-standing interfaces that have
been present on UNIX systems for many years. Compared to the other techniques,
their primary advantage is portability. Their main disadvantage is that they
don't scale well when monitoring large numbers (hundreds or thousands) of
file descriptors.
The key advantage of the epoll API is that it allows an application to
efficiently monitor large numbers of file descriptors. Its primary disadvantage
is that it is a Linux-specific API.
Some other UNIX implementations provide (nonstandard) mechanisms similar to
epoll. For example, Solaris provides the special /dev/poll file (described
in the Solaris poll(7d) manual page), and some of the BSDs provide the kqueue
API (which provides a more general-purpose monitoring facility than epoll
Because of the limitations of both nonblocking I/O and the use of multiple
threads or processes, one of the following alternatives is often preferable:
I/O multiplexing allows a process to simultaneously monitor multiple file
descrip-
Disk files are a special case. As described in Chapter 13, the kernel employs
the buffer cache to speed disk I/O requests. Thus, a
This chapter discusses three alternatives to the conventional file I/O model
that we have employed in most programs shown in this book:
I/O multiplexing (the select() and poll() system calls);
signal-driven I/O; and
63.1 Overview
Most of the programs that we have presented so far in this book employ an I/O
model under which a process performs I/O on just one file descriptor at a time,
and each I/O system call blocks until the data is transferred. For example,
when reading from a pipe, a read() call normally blocks if no data is currently
present in the pipe, and a write() call blocks if there is insufficient space
in the pipe to hold the data to be written. Similar behavior occurs when
performing I/O on various other types of files, including FIFOs and sockets.
Terminals
62.12 Exercises
62-1.
Implement
. (You may find it useful
to read the description of
e controlling terminal by opening
/dev/tty
62-4.
Write a program that displays inform
Chapter 62
62.11 Summary
On early UNIX systems, terminals were real hardware devices connected to a
computer via serial lines. Early terminals were not standardized, meaning that
different escape sequences were required to program the terminals produced by
different vendors. On modern workstations, such terminals have been superseded
by bit-mapped monitors running the X Window System. However, the ability to
program terminals is still required when dealing with virtual devices, such as
virtual consoles and terminal emulators (which employ pseudoterminals), and
real devices connected via serial lines.
Note, however, that these events on their own are insufficient to change the
actual dimensions of the displayed window, which are controlled by software
outside the kernel (such as a window manager or a terminal emulator program).
Although not standardized in SUSv3, most UNIX implementations provide
access to the terminal window size using the ioctl() operations described in
this section.
62.10 Terminal Identification
In Section 34.4, we described the
Listing 62-5: Monitoring changes in the terminal window size
tty/demo_SIGWINCH.c
#include <signal.h>
#include <termios.h>
#include <sys/ioctl.h>
#include "tlpi_hdr.h"

static void
sigwinchHandler(int sig)
{
}

int
main(int argc, char *argv[])
{
    struct winsize ws;
    struct sigaction sa;

    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = sigwinchHandler;
    if (sigaction(SIGWINCH, &sa, NULL) == -1)
        errExit("sigaction");

    for (;;) {
        pause();                        /* Wait for SIGWINCH signal */
        if (ioctl(STDIN_FILENO, TIOCGWINSZ, &ws) == -1)
            errExit("ioctl");
        printf("Caught SIGWINCH, new window size: "
                "%d rows * %d columns\n", ws.ws_row, ws.ws_col);
    }
}
tty/demo_SIGWINCH.c
It is also possible to change the terminal driver's notion of the window size
by passing an initialized winsize structure in an ioctl() TIOCSWINSZ operation:

    ws.ws_row = 40;
    ws.ws_col = 100;
    if (ioctl(fd, TIOCSWINSZ, &ws) == -1)
        errExit("ioctl");

If the new values in the winsize structure differ from the terminal driver's
current notion of the terminal window size, two things happen:
The terminal driver data structures are updated using the values supplied in
the ws argument.
A SIGWINCH signal is sent to the foreground process group of the terminal.
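The two ioctl() operations discussed in this section can be wrapped in a pair of helpers. This is a minimal sketch under stated assumptions: the names getWindowSize() and setWindowSize() are hypothetical, and the code assumes Linux, where <sys/ioctl.h> supplies both struct winsize and the TIOC*WINSZ constants.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Fetch the terminal driver's notion of the window size for fd.
   Returns 0 on success, or -1 on error (e.g., fd is not a terminal). */
int
getWindowSize(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;

    if (ioctl(fd, TIOCGWINSZ, &ws) == -1)
        return -1;
    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}

/* Change the rows and columns recorded by the terminal driver, leaving the
   other winsize fields as they were. Returns 0 on success, or -1 on error. */
int
setWindowSize(int fd, unsigned short rows, unsigned short cols)
{
    struct winsize ws;

    if (ioctl(fd, TIOCGWINSZ, &ws) == -1)   /* Start from current values */
        return -1;
    ws.ws_row = rows;
    ws.ws_col = cols;
    return ioctl(fd, TIOCSWINSZ, &ws);
}
```

Calling setWindowSize() only updates the driver's record and triggers SIGWINCH; as noted elsewhere in this section, it does not resize the window actually displayed.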
62.9 Terminal Window Size
In a windowing environment, a screen-handling application needs to be able to
monitor the size of a terminal window, so that the screen can be redrawn
appropriately if the user modifies the window size. The kernel provides two
pieces of support to allow this to happen:
A SIGWINCH signal is sent to the foreground process group after a change in the
terminal window size. By default, this signal is ignored.
At any time (usually following the receipt of a SIGWINCH signal) a process can
use the ioctl() TIOCGWINSZ operation to find out the current size of the
terminal window.
The ioctl() TIOCGWINSZ operation is used as follows:

    if (ioctl(fd, TIOCGWINSZ, &ws) == -1)
        errExit("ioctl");

The fd argument is a file descriptor referring to a terminal window. The final
argument to ioctl() is a pointer to a winsize structure (defined in
<sys/ioctl.h>), used to
In each function, fd is a file descriptor that refers to a terminal or other
remote device on a serial line.
The tcsendbreak() function generates a BREAK condition, by transmitting a
continuous stream of 0 bits. The duration argument specifies the length of the
transmission. If duration is 0, 0 bits are transmitted for 0.25 seconds. (SUSv3
specifies at least 0.25 and not more than 0.5 seconds.) If duration is greater
than 0, 0 bits are transmitted for duration milliseconds. SUSv3 leaves this
case unspecified; the handling of a nonzero duration varies widely on other
UNIX implementations (the
For example, to find out the current terminal output line speed, we would do
the following:

    struct termios tp;
    speed_t rate;

    if (tcgetattr(fd, &tp) == -1)
        errExit("tcgetattr");
    rate = cfgetospeed(&tp);
In the last line of the preceding shell session, we see that the shell printed
its prompt on the same line as the character that caused the program to
terminate.
The following shows an example run using cbreak mode:
$ ./test_tty_functions x
XYZ
[1]+ Stopped ./test_tty_functions x
$ stty                          Verify that terminal mode was restored
speed 38400 baud; line = 0;
$ fg                            Resume in foreground
./test_tty_functions x
***                             Type 123 and Control-J
$                               Type Control-C to terminate program
        if (sigaction(SIGTSTP, NULL, &prev) == -1)
            errExit("sigaction");
        if (prev.sa_handler != SIG_IGN)
            if (sigaction(SIGTSTP, &sa, NULL) == -1)
                errExit("sigaction");
    } else {                    /* Use raw mode */
    sigemptyset(&sa.sa_mask);           /* Reestablish handler */
    sa.sa_flags = SA_RESTART;
    sa.sa_handler = tstpHandler;
    if (sigaction(SIGTSTP, &sa, NULL) == -1)
        errExit("sigaction");
Listing 62-4: Demonstrating cbreak and raw modes
tty/test_tty_functions.c
#include <termios.h>
#include <signal.h>
#include <ctype.h>
the program is not prematurely terminated. (Job-control signals can still be
generated from the keyboard in cbreak mode.)
An example of how to do this is provided in Listing 62-4. This program
performs the following steps:
t.c_cc[VMIN] = 1; /* Character-at-a-time input */
t.c_cc[VTIME] = 0; /* with blocking */
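The two assignments above select blocking, character-at-a-time input in noncanonical mode. The following sketch shows how they might sit alongside the flag changes that characterize a cbreak-style mode. It is not the book's library function: the name makeCbreak() is hypothetical, and the choice of flags (clear ICANON and ECHO, keep ISIG) is a simplified assumption.

```c
#include <termios.h>

/* Adjust *t, as previously filled in by tcgetattr(), for cbreak-style
   input: noncanonical, no echo, signal-generating characters still
   active, and read() returning as soon as one byte is available. */
void
makeCbreak(struct termios *t)
{
    t->c_lflag &= ~(ICANON | ECHO);     /* Noncanonical mode, echo off */
    t->c_lflag |= ISIG;                 /* INTR, QUIT, SUSP still work */

    t->c_cc[VMIN] = 1;                  /* Character-at-a-time input */
    t->c_cc[VTIME] = 0;                 /* with blocking */
}
```

A caller would typically fetch the current settings with tcgetattr(), keep a copy for later restoration, apply makeCbreak() to a working copy, and install the result with tcsetattr(fd, TCSAFLUSH, &t).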
characters individually. This can be done by performing a read() with a small
interbyte timeout, say 0.2 seconds. Such a technique is used in the command
mode of some versions of vi. (Depending on the length of the timeout, we may be
able to simulate a left-arrow key press by quickly typing the aforementioned
3-character sequence.)
Portably modifying and restoring MIN and TIME
For historical compatibility with some UNIX implementations, SUSv3 allows the
values of the VMIN and VTIME constants to be the same as VEOF and VEOL,
respectively, which means that these elements of the c_cc array may coincide.
(On Linux, the values of these constants are distinct.) This is possible
because VEOF and VEOL are unused in noncanonical mode. The fact that VMIN and
VEOF may have the same value means that caution is needed in a prog
The precise operation and interaction of the MIN and TIME parameters
62.6 Terminal I/O Modes
We have already noted that the terminal driver is capable of handling input in
either canonical or noncanonical mo
passed to the reading process instead. If none of
IGNPAR
PARMRK
, or
ICANON
BRKINT
If the
BRKINT
yielding the value 63, the ASCII code for
, so that DEL is echoed as
ECHOE
Several of the flags listed in Table 62-2 were provided for historical
terminals with limited capabilities, and these flags have little use on modern
systems. For example, the IUCLC, OLCUC, and XCASE flags were used with
terminals that were capable of dis-
1302
Chapter 62
Many shells that provide command-line editing facilities perform their own
manipulations of the flags listed in Table 62-2. This means that if we try
using stty(1)
Listing 62-1: Changing the terminal
tty/new_intr.c
#include <termios.h>
#include <ctype.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
    struct termios tp;
    int intrChar;

    if (argc > 1 && strcmp(argv[1], "--help") == 0)
        usageErr("%s [intr-char]\n", argv[0]);
information. (Linux provides a vaguely similar feature in the form of the magic
SysRq key; see the kernel source file Documentation/sysrq.txt for details.)
System V derivatives provide the SWTCH character. This character is used to
switch shells under shell layers, a System V predecessor to job control.
Example program
Listing 62-1 shows the use of
input queue is full, then the terminal driver automatically sends a STOP
character to throttle the input.
Typing the START character causes terminal output to resume after previously
being stopped by the STOP character. The START character itself is not passed
to the reading process. If the IXOFF (enable start/stop input control) flag is
set (this flag is disabled by default) and the terminal driver had previously
sent a STOP character because the input queue was full, the terminal driver
automatically generates a START character when space once more becomes
available in the input queue.
If the
foreground process group (Section 34.2). The INTR character itself is not
passed to the reading process.
KILL is the erase line (also known as kill line) character. In canonical mode,
typing this character causes the current line of input to be discarded (i.e.,
the characters typed so far, as well as the KILL character itself, are not
passed to the reading process).
LNEXT is the literal next character. In some circumstances, we may wish to
treat one of the terminal special characters as though it were a normal
character for input to a reading process. Typing the literal next character
(usually Control-V) causes the next character to be treated
CR is the
62.4 Terminal Special Characters
Table 62-1 lists the special characters recognized by the Linux terminal
driver. The first two columns show the name of the character and the
corresponding constant that can be used as a subscript in the c_cc array. (As
can be seen, these constants
If we employ the final opti
62.3 The stty Command
The stty command is the command-line analog of the
When changing terminal attributes with
The fd argument is a file descriptor that refers to a terminal. (If fd doesn't
refer to a terminal, these functions fail with the error ENOTTY.)
The termios_p argument is a pointer to a termios structure, which records
terminal attributes:

struct termios {
    tcflag_t c_iflag;           /* Input flags */
    tcflag_t c_oflag;           /* Output flags */
    tcflag_t c_cflag;           /* Control flags */
    tcflag_t c_lflag;           /* Local modes */
    cc_t     c_line;            /* Line discipline (nonstandard) */
    cc_t     c_cc[NCCS];        /* Terminal special characters */
    speed_t  c_ispeed;          /* Input speed (nonstandard; unused) */
    speed_t  c_ospeed;          /* Output speed (nonstandard; unused) */
};

The first four fields of the termios structure are bit masks (the tcflag_t data
type is an integer type of suitable size) containing flags that control various
aspects of the terminal driver's operation:
c_iflag contains flags controlling terminal input;
c_oflag contains flags controlling terminal output;
c_cflag contains flags relating to hardware control of the terminal line; and
c_lflag contains flags controlling the user interface for terminal input.
All of the flags used in the above fields are listed in Table 62-2 (on page 1302).
The c_line field specifies the line discipline for this terminal. For the
purposes of programming terminal emulators, the
is enabled, then the terminal driver automatically appends a copy of any input
character to the end of the output queue, so that input characters are also
output on the terminal.
SUSv3 specifies the limit MAX_INPUT, which an implementation can use to
indicate the maximum length of the terminal's input queue. A related limit,
MAX_CANON, defines the maximum number of bytes in a line of input in canonical
mode. On Linux, sysconf(_SC_MAX_INPUT) and sysconf(_SC_MAX_CANON)
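[Figure: the terminal driver sits between the terminal and the process
performing I/O. Characters typed at the terminal pass through the input queue
and are read by the process; characters written by the process pass through the
output queue and are displayed at the terminal; if echoing is enabled, input
characters are also copied to the output queue.]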
#include <termios.h>

int tcgetattr(int fd, struct termios *termios_p);
int tcsetattr(int fd, int optional_actions, const struct termios *termios_p);
how to perform various screen-control op
Historically, users accessed a UNIX system using a terminal connected via a
serial line (an RS-232 connection). Terminals were cathode ray tubes (CRTs)
capable of displaying characters and, in some cases, primitive graphics.
Typically, CRTs provided a monochrome display of 24 lines by 80 columns. By
today's standards, these CRTs were small and expensive. In even earlier times,
terminals were sometimes hard-copy teletype devices. Serial lines were also
used to connect other devices, such as printers and modems, to a computer or to
connect one computer to
On early UNIX systems, the terminal lines connected to the system were
represented by character devices with names of the form /dev/tty. (On Linux,
the /dev/tty devices are the virtual consoles on the system.) It is common to
see the abbreviation tty (derived from teletype) as a shorthand for
Especially during the early years of UNIX, terminal devices were not
standardized, which meant that different character sequences were required to
perform operations such as moving the cursor to the beginning of the line or
the top of the screen. (Eventually, some vendor implementations of such escape
sequences (for example, Digital's VT-100) became de facto and, ultimately, ANSI
standards, but a
Chapter 61
61-6.
Section 61.13.1 noted that an alternative to
out-of-band data would be to create two
address during the
of the normal connection.)
Implement some type of security mechanism to prevent a rogue process from
trying to connect to the client's listening socket. To do this, the client
could send a cookie (i.e., some type of unique message) to the server using the
nor-
SCTP is described in [Stewart & Xie, 2001], [Stevens et al., 2004], and in
RFCs 4960, 3257, and 3286.
SCTP is available on Linux since kernel 2.6. Further information about this
implementation can be found at
61.13.2 The sendmsg() and recvmsg() System Calls
and
recvmsg()
from a call to
open()
or
ability of out-of-band data without needing to read all of the intervening data
in the stream. This feature is used in programs such as telnet, rlogin, and ftp
to make it possible to abort previously transmitted commands.
Out-of-band data is sent and received using the MSG_OOB flag in calls to send()
and recv()
61.12 TCP Versus UDP
Given that TCP provides reliable delivery of data, while UDP does not, an
obvious question is, "Why use UDP at all?" The answer to this question is
covered at some
int sockfd, optval;
61.11 Inheritance of Flags and Options Across
In both of these scenarios, the outstanding TCP endpoint is unable to accept
new connections. In both cases, by default, most TCP implementations prevent a
new listening socket from being bound to the server's well-known port.
The EADDRINUSE error doesn't usually occur with clients, since they typically
use an ephemeral port that won't be one of those ports currently in the
TIME_WAIT state. However, if a client binds to a specific port number, then it
also can encounter this error.
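The usual way for a server to avoid this bind failure is to enable the SO_REUSEADDR socket option before calling bind(). The following is a minimal sketch for an IPv4 TCP server, not code taken from the book: the function name reusableListener() and the backlog of 5 are assumptions made for illustration.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create an IPv4 TCP listening socket on 'port' (host byte order) with
   SO_REUSEADDR enabled. Returns the socket descriptor, or -1 on error. */
int
reusableListener(unsigned short port)
{
    struct sockaddr_in addr;
    int sockfd, optval;

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd == -1)
        return -1;

    optval = 1;                         /* The option must be set before bind() */
    if (setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                sizeof(optval)) == -1) {
        close(sockfd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sockfd, (struct sockaddr *) &addr, sizeof(addr)) == -1 ||
            listen(sockfd, 5) == -1) {
        close(sockfd);
        return -1;
    }
    return sockfd;
}
```

Passing a port of 0 asks the kernel to choose an ephemeral port, which is convenient for trying the function out; a real server would pass its well-known port.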
To understand the operation of the
SO_REUSEADDR
These three segments are the SYN, SYN/ACK, and ACK segments exchanged for
the three-way handshake (see Figure 61-5).
In the following output, the client sends the server two messages, containing
16 and 32 bytes, respectively, and the server responds in each case with a
4-byte message:
IP pukaki.60391 > tekapo.55555: P 1:17(16) ack 1 win 5840
IP tekapo.55555 > pukaki.60391: . ack 17 win 1448
IP tekapo.55555 > pukaki.60391: P 1:5(4) ack 17 win 1448
IP pukaki.60391 > tekapo.55555: . ack 5 win 5840
IP pukaki.60391 > tekapo.55555: P 17:49(32) ack 5 win 5840
IP tekapo.55555 > pukaki.60391: . ack 49 win 1448
IP tekapo.55555 > pukaki.60391: P 5:9(4) ack 49 win 1448
IP pukaki.60391 > tekapo.55555: . ack 9 win 5840
For each of the data segments, we see an ACK sent in the opposite direction.
Lastly, we show the segments exchanged during connection termination (first,
the client closes its end of the connection, and then the server closes the
other end):
IP pukaki.60391 > tekapo.55555: F 49:49(0) ack 9 win 5840
IP tekapo.55555 > pukaki.60391: . ack 50 win 1448
IP tekapo.55555 > pukaki.60391: F 9:9(0) ack 50 win 1448
IP pukaki.60391 > tekapo.55555: . ack 10 win 5840
The above output shows the four segments exchanged during connection
termination (see Figure 61-6).
dst
flags
window
urg
These fields have the following meanings:
src: This is the source IP address and port.
dst: This is the destination IP address and port.
flags: This field contains zero or more of
Here is an abridged example of the output that we see when using
61.6.7 The TIME_WAIT State
The TCP TIME_WAIT state is a frequent source of confusion in network
programming. Looking at Figure 61-4, we can see that a TCP performing an active
close goes through this state. The TIME_WAIT state exists to serve two purposes:
to implement reliable connection termination; and
to allow expiration of old duplicate segm
[Figure: TCP connection termination. Side performing the active close:
ESTABLISHED, FIN_WAIT1, FIN_WAIT2, TIME_WAIT. Server (passive close):
ESTABLISHED, CLOSE_WAIT, LAST_ACK, CLOSED.]
The SYN segments exchanged in the first two steps of the three-way handshake
may contain information in the options field of the TCP header that is used to
[Figure 61-4: TCP state transition diagram. Recoverable state labels: LISTEN,
ESTABLISHED (data-transfer state), FIN_WAIT1, FIN_WAIT2, CLOSING, TIME_WAIT,
CLOSE_WAIT, LAST_ACK. Key: action by local application; recv: segment from peer
that caused transition; send: segment sent to peer during transition; usual
path for client; usual path for server.]
LAST_ACK: The application performed a passive close, and the TCP, formerly
in the CLOSE_WAIT state, sent a FIN to the peer TCP and is waiting for it to
be acknowledged. When this ACK is received, the connection is closed, and
the associated kernel resources are freed.
To the above states, RFC 793 adds one further, fictional state, CLOSED,
representing the state when there is no connection (i.e., no kernel resources
are allocated to describe a TCP connection).
In the above list we use the spellings for the TCP states as defined in the
Linux source code. These differ slightly from the spellings in RFC 793.
Figure 61-4 shows the state transition diagram for TCP. (This figure is based
on diagrams
allow the receiving TCP to double-check that an incoming segment has arrived
at the correct destination (i.e., that IP has not wrongly accepted a datagram
that was addressed to another host or passed TCP a packet that should have
[Figure: a sender and a receiver exchanging data segments (byte counts and
sequence number ranges) and the ACK of each segment (acknowledgement numbers).]
61.6 A Closer Look at TCP
[Figure: format of a TCP segment. Header (20 bytes): source port number,
destination port number, sequence number, acknowledgement number, header
length, reserved bits, control bits, window size, TCP checksum, urgent pointer.
These are followed by options (if present, 0 to 40 bytes) and data (if present,
0+ bytes).]
call is performed; ho
wever, if the server
was execed by the program that did the
(e.g.,
The count argument specifies the number of bytes to be transferred. If
end-of-file is encountered before count bytes are transferred, only the
available bytes are transferred. On success, sendfile()
could be used to transfer bytes between two regular files. On Linux 2.4 and
earlier, out_fd could refer to a regular file. Some reworking of the underlying
implementation meant that this possibility disappeared in the 2.6 kernel.
However, this feature may be reinstated in a future kernel version.
If
[Figure: transferring file contents via read() into a user-space buffer,
compared with sendfile() working from the kernel buffer cache.]
All of the above flags are specified in SUSv3, except for MSG_DONTWAIT, which
is nevertheless available on some other UNIX implementations. The MSG_WAITALL
flag was a
behavior by using fcntl() to set nonblocking mode (O_NONBLOCK
Listing 61-2:
service
61.2 The shutdown() System Call
Calling close() on a socket closes both halves of the bidirectional
communication channel. Sometimes, it is useful to close one half of the
connection, so that data can
61.1 Partial Reads and Writes
Chapter 60
60.6 Summary
An iterative server handles one client at a time, processing that client's
request(s)
In the example lines shown
in Listing 60-5 for the
# echo stream tcp nowait root internal
# echo dgram udp wait root internal
ftp stream tcp nowait root /usr/sbin/tcpd in.ftpd
The first two lines of Listing 60-5 are commented out by the initial #
character; we show them now since we'll refer to the echo service shortly.
Each line of
Listing 60-4: A concurrent server that implements the TCP echo service
Listing 60-3:
A client for the UDP
Concurrent servers are suitable when a significant amount of processing time
is required to handle each request, or where the client and server engage in an
extended conversation, passing messages back and forth. In this chapter, we
mainly focus on the traditional (and simplest) method of designing a concurrent
server: creating a new child process for each new client. Each server child
performs all tasks necessary to service a single client and then terminates.
Since each of these processes can operate independently, multiple clients can
be handled simultaneously. The principal task of the main server process (the
parent) is to create a new child process for each new client. (A variation on
this approach is to create a new thread for each client.)
In the following sections, we look at examples of an iterative and a concurrent
server using Internet domain sockets. These two servers implement the echo
service (RFC 862), a rudimentary servic
Chapter 59
printf(" address type: %s\n",
ng corresponding to the error
value specified in
err
inet
from
freeaddrinfo(result);
Listing 59-9:
necting client.
Listing 59-8:
Header file for
In the shell session log, we see that the kernel cycles sequentially through
the ephemeral port numbers. (Other implementations exhibit similar behavior.)
On Linux, this behavior is the result of an optimization to minimize hash
lookups in the kernel's table of local socket bindings. When the upper limit
for these numbers is reached, the kernel recommences allocating an available
number starting at the low end of the range (defined by the Linux-specific
we can try connecting to */
        addrlen = sizeof(struct sockaddr_storage);
        cfd = accept(lfd, (struct sockaddr *) &claddr, &addrlen);
        if (cfd == -1) {
            errMsg("accept");
            continue;
        }
we can try binding to */
Ignore the SIGPIPE signal. This prevents the server from receiving the SIGPIPE
the client. The server displays the client's address (IP address plus port
number) on standard output.
Read the client's message, which consists of a newline-terminated string
specifying how many sequence numbers the client wants. The server converts
this string to an integer and stores it in the variable reqLen.
Send the current value of the sequence number (seqNum) back to the client,
encoding it as a newline-terminated string. The client can assume that it has
been allocated all of the sequence numbers in the range seqNum to
(seqNum + reqLen - 1).
Update the value of the server's sequence number by adding reqLen to seqNum.
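The read, reply, and update steps above amount to a small piece of bookkeeping that can be sketched independently of the socket code. The function name allocateRange() is hypothetical, and rejecting non-positive or malformed requests with -1 is an assumption of this sketch rather than a documented behavior of the server in the book.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a newline-terminated request for reqLen sequence numbers and
   allocate the range *seqNum .. (*seqNum + reqLen - 1).
   Returns the first number of the range (the value to send back to the
   client), or -1 if the request is not a positive decimal integer.
   (Overflow of *seqNum is not checked in this sketch.) */
long
allocateRange(long *seqNum, const char *request)
{
    char *end;
    long reqLen, first;

    errno = 0;
    reqLen = strtol(request, &end, 10);
    if (errno != 0 || end == request || reqLen <= 0)
        return -1;
    if (*end != '\n' && *end != '\0')
        return -1;

    first = *seqNum;            /* Current value goes back to the client */
    *seqNum += reqLen;          /* Next client starts after this range */
    return first;
}
```

For example, starting from a sequence number of 0, a request of "3\n" yields 0 and leaves the sequence number at 3, so the client owns the numbers 0 through 2.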
Listing 59-5: Header file used by is_seqnum_sv.c and is_seqnum_cl.c
sockets/is_seqnum.h
acce
recvfrom
NI_MAXHOST
and
NI_MAXSERV
If we are not interested in obtaining the hostname, we can specify host as NULL
and hostlen as 0. Similarly, if we don't need the service name, we can specify
service as NULL and servlen as 0. However, at least one of host and service
must be non-NULL (and the corresponding length argument must be nonzero).
The final argument, flags, is a bit mask that controls the behavior of
Table 59-1:
hints.ai_flags
field is a bit mask that modifies the behavior of
result
addrinfo
structures
hostname
As input,
Top-level domains
The nodes immediately below the anonymous root form the so-called top-level domains (TLDs). (Below these are the second-level domains, and so on.) TLDs fall into
Historically, there were seven generic TLDs, most of which can be considered international. We have shown four of the original generic TLDs in Figure 59-2. The
other three are
, and
; the latter two are reserved for the United States. In
more recent times, a number of new
generic TLDs have
been added (e.g.,
name
museum
Each nation has a corresponding
(or
) TLD (standardized as
ISO 3166-1), with a 2-character name. In Figure 59-2, we have shown a few of these:
(Germany,
Deutschland
(a supra-national geographical TLD for the European
New Zealand
(United States of America). Several countries
or from
the web page at
http://www.root-servers.org/
.) Given the name www.otago.ac.nz, the root name server refers the local DNS server to one of the nz DNS servers. The local DNS server then queries the nz server with the name www.otago.ac.nz, and receives a response referring it to the ac.nz server. The local DNS server then queries the ac.nz server with the name www.otago.ac.nz, and is referred to the otago.ac.nz server. Finally, the local DNS server queries the otago.ac.nz server with the name www.otago.ac.nz and obtains the required IP address.
Before the advent of DNS, mappings between hostnames and IP addresses were defined in a manually maintained local file, /etc/hosts, containing records of the following form:
# IP-address canonical hostname [aliases]
127.0.0.1 localhost
by searching this file, looking for a match on either the canonical hostname (i.e., the official or primary name of the host) or one of the (optional, space-delimited) aliases.
However, the
domain name,
e information. Occasionally, this resolution process may take a noticeable amount of time, and DNS servers employ caching techniques to avoid unnecessary communication for frequently queried domain names.
sponding to a hostname, and
Listing 59-3:
IPv6 case-conversion serv
recvfrom
call) to printable
The client program shown in Listing 59-4 contains two notable modifications from the earlier UNIX domain version (Listing 57-7, on page 1173). The first difference is that the client interprets its initial command-line argument as the IPv6 address of the server. (The remaining command-line arguments are passed as separate datagrams to the server.) The client converts the server address to binary form using inet_pton(). The other difference is that the client
that maps binary IP addresses to hostnames and vice versa. The existence of a system such as DNS is essential to the operation
Unlike their IPv4 counterparts, the IPv6 constant and variable initializers are in
If the number of bytes read before a newline is encountered is greater than or equal to (n - 1), then the readLine() function discards the excess bytes (including the newline). If a newline was read within the first (n - 1) bytes, then it is included
employ different rules for aligning the fields of a structure to address boundaries on the host system, leaving different number
[Figure: byte order for 2-byte and 4-byte integers, shown by address. MSB = Most Significant Byte, LSB = Least Significant Byte]
particular hostname and the port number that corresponds to a particular service name. Our discussion of hostnames includes a description of the Domain Name System (DNS), which implements a distributed database that maps hostnames to IP addresses and vice versa.
1194
Chapter 58
RFC 793, Transmission Control Protocol. J. Postel (ed.).
RFC 768, User Datagram Protocol. J. Postel (ed.). 1980.
RFC 1122,
The acknowledgement message passed from the receiver back to the sender can use the sequence number to identify which TCP segment was received.
The receiver can use the sequence number to eliminate duplicate segments. Such duplicates may occur either because of the duplication of IP datagrams or because of TCP's own retransmission al
The checksums used by both UDP and TCP are just 16 bits long, and are simple add-up checksums that can
[Figure: two sockets, each referred to by a descriptor (sockfd) and each with kernel-maintained buffers (including a receive buffer) and state info]
IPv6 addresses
The principles of IPv6 addresses are similar to IPv4 addresses. The key difference is that IPv6 addresses consist of 128 bits, and the first few bits of the address are a format prefix, indicating the address type. (We wo
all zeros
IPv4 address
16 bits32 bits
Host ID
32 bits
(Each IP fragment is itself an IP datagram
Host ID
32 bits
58.4 The Network Layer: IP
Above the data-link layer is the
[Figure: layered protocols between two hosts. Application-defined protocol (transfers application data); TCP protocol (transfers TCP segments); IP protocol (transfers IP datagrams); data link protocol (transfers data frames). Header labels include source + destination port #, sequence #, acknowledgement #.]
always strictly hold true; occasionally, an application does need to know some of
[Figure: protocol layers, with labels SOCK_DGRAM, SOCK_STREAM, transport layer, data-link layer, kernel space, and hardware]
protocols that were formerly common on local and wide area networks. The term
[Figure: hosts wakatipu, wanaka, pukaki, and rotoiti attached to Network 1 and Network 2]
1176
Chapter 57
To create an abstract binding, we specify the first byte of the sun_path field as a null byte (\0). This distinguishes abstract so
The following shell session log demonstrates the use of the server and client programs:
$ ./ud_ucase_sv &
[1] 20113
$ ./ud_ucase_cl hello world              Send 2 messages to server
Server received 5 bytes from /tmp/ud_ucase_cl.20150
Response 1: HELLO
Server received 5 bytes from /tmp/ud_ucase_cl.20150
Response 2: WORLD
$ ./ud_ucase_cl 'long message'           Send 1 longer message to server
Server received 10 bytes from /tmp/ud_ucase_cl.20151
Response 1: LONG MESSA
$ kill %1                                Terminate server
The second invocation of the client program was designed to show that when a recvfrom() call specifies a length (BUF_SIZE, defined in Listing 57-5 with the value 10) that is shorter than the message size, the message is silently truncated. We can see that this truncation occurred, because the server prints a message saying it received just 10 bytes, while the message sent by the client consisted of 12 bytes.
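The truncation can be reproduced without a separate server and client. The following is a minimal sketch, assuming a connected pair of UNIX domain datagram sockets created with socketpair(); the function name recvTruncated() is invented for this illustration, and BUF_SIZE is set to 10 only to mirror the session above.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define BUF_SIZE 10             /* Deliberately small receive buffer */

/* Hypothetical helper: send 'len' bytes of 'msg' as one datagram and return
   the number of bytes that recvfrom() places in a BUF_SIZE buffer, or -1 on
   error. Bytes beyond BUF_SIZE are discarded by the kernel. */
ssize_t
recvTruncated(const char *msg, size_t len)
{
    int sv[2];
    char buf[BUF_SIZE];
    ssize_t numRead = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    if (send(sv[0], msg, len, 0) == (ssize_t) len)
        numRead = recvfrom(sv[1], buf, BUF_SIZE, 0, NULL, NULL);

    close(sv[0]);
    close(sv[1]);
    return numRead;
}
```

Calling recvTruncated("long message", 12) returns 10 on Linux, matching the "Server received 10 bytes" line in the log; the remaining two bytes of the datagram are lost.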
Listing 57-6:
A simple UNIX domain datagram server
recvfrom
The client programListing 57-7Listing 57-7create
Construct an address structure for the se
Listing 57-3:
A simple UNIX domain
In order to bind a UNIX domain socket to an address, we initialize a sockaddr_un structure, and then pass a (cast) pointer to this structure as the addr argument to bind(), and specify addrlen as the size of the structure, as shown in Listing 57-1:
    const char *SOCKNAME = "/tmp/mysock";
    int sfd;
    struct sockaddr_un addr;
The use of the
1162
Chapter 56
56.6.2Using
2.In order to allow another
address of the sender, we can send a reply if desired. (This is useful if the sender's socket is bound to an address that is not well known, which is typical
Server
(Possibly multiple) data
transfers in either direction
recvfr
recvfr
sockfd
Kernel
sockfd
buffer
buffer
The key point to understand about accept() is that it creates a new socket, and it is
Passive socket
(server)
Active socket
may block, depending on
number of backlogged
connection requests
lis
In most applications that employ stream
Passive socket
(server)
blocks until
resumes
(Possibly multiple) data
transfers in either direction
Active socket
r
lis
r
The sockfd argument is a file descriptor obtained from a previous call to
Sun RPC solves this problem using its
server.) Of course, the
Chapter 60 discusses the design
1148
Chapter 55
55-7. If you have access to other UNIX implementations, use the program in Listing 55-2 (i_fcntl_locking.c) to see if you can establish any rules for record locking regarding starvation of writers and regarding the order in which multiple queued lock requests are granted.
55-8. Use the program in Listing 55-2 (i_fcntl_locking.c) to demonstrate that the kernel detects circular deadlocks involving three (or more) processes locking the same file.
55-9. Write a pair of programs (or a single program that uses a child process) to bring about the deadlock situation with mandatory locks described in Section 55.4.
55-10. Read the manual page of the lockfile(1) utility that is supplied with
simple version of this program.
File Locking
1147
If the call succeeds, we have obtained the lock. If it fails (EEXIST), then another process has the lock and we must try again later. This technique suffers the same limitations as the open(file, O_CREAT | O_EXCL,...) technique described above.
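For reference, the open(file, O_CREAT | O_EXCL,...) plus unlink(file) technique mentioned here can be sketched as follows. The function names acquireLockFile() and releaseLockFile() and the three-way return value are assumptions made for this illustration; the sketch shares the limitations of the technique itself (polling is needed to wait for the lock, and a stale file is left behind if the holder terminates without unlinking it).

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper: try to acquire a lock represented by the existence
   of 'path'. Returns 0 if we now hold the lock, 1 if another process holds
   it (EEXIST), or -1 on any other error. */
int
acquireLockFile(const char *path)
{
    int fd;

    /* O_CREAT | O_EXCL makes the existence check and creation atomic */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return (errno == EEXIST) ? 1 : -1;

    close(fd);                  /* The file's existence is the lock */
    return 0;
}

/* Hypothetical helper: release the lock by removing the file. */
int
releaseLockFile(const char *path)
{
    return unlink(path);
}
```

A caller that receives 1 must sleep and retry, since there is no way to block until the file disappears.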
open(file, O_CREAT | O_TRUNC | O_WRONLY, 0) plus unlink(file)
The fact that calling open() on an existing file fails if O_TRUNC is specified and write permission is denied on the file can be used as the basis of a locking technique. To obtain a lock, we use the following code (which omits error checking) to create a new file:
    fd = open(file, O_CREAT | O_TRUNC | O_WRONLY, (mode_t) 0);
    close(fd);
For an explanation of why we use the (mode_t) cast in the open() call above, see Appendix C.
If the open() call succeeds (i.e., the file didn't previously exist), we have the lock. If it fails with EACCES (i.e., the file exists and has no permissions for anyone), then another process has the lock, and we must try again later. This technique suffers the same limitations as the previous techniques, with the added caveat that we can't employ it in a program with superuser privileges, since the open() call will always succeed, regardless of the permis
open(file, O_CREAT | O_EXCL,...) plus unlink(file)
SUSv3 requires that an open() call with the flags O_CREAT and O_EXCL perform the steps of checking for the existence of a file and creating it atomically (Section 5.1). This means that if two processes attempt to create a file specifying these flags, it is guaranteed that only one of them will succeed. (The other process will receive the error EEXIST from open().) Used in conjunction with the unlink() system call, this provides the basis for a locking mechanism. Acquiring the lock is performed by successfully opening the file with the O_CREAT and O_EXCL flags, followed by an immediate close(). Releasing the lock is performed using unlink(). Although workable, this technique has several limitations:
If the open() fails, indicating that some other process has the lock, then we must
contrast, the kernel provides deadlock
int
createPidFile(const char *progName, const char *pidFile, int flags)
{
    int fd;
    char buf[BUF_SIZE];

    fd = open(pidFile, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        errExit("Could not open PID file %s", pidFile);

    if (flags & CPF_CLOEXEC) {
process ID of the daemon. It also allows an extra sanity check: we can verify
is employing a technique to ensure that only one instance of the daemon is running at a time. We describe this technique in Section 55.6.
We can also use /proc/locks to obtain information a
as demonstrated in the following output:
$ cat /proc/locks
1: POSIX  ADVISORY  WRITE 11073 03:07:436283 100 109
1: -> POSIX  ADVISORY  WRITE 11152 03:07:436283 100 109
2: POSIX  MANDATORY WRITE 11014 03:07:436283 0 9
2: -> POSIX  MANDATORY WRITE 11024 03:07:436283 0 9
2: -> POSIX  MANDATORY READ  11122 03:07:436283 0 19
3: FLOCK  ADVISORY  WRITE 10802 03:07:134447 0 EOF
3: -> FLOCK  ADVISORY  WRITE 10840 03:07:134447 0 EOF
Lines shown with the characters -> immediately after a lock number represent lock requests blocked by the corresponding lock number. Thus, we see one request blocked on lock 1 (an advisory lock created with fcntl()), two requests blocked on lock 2 (a mandatory lock created with fcntl()), and one request blocked on lock 3 (a lock created with flock()).
The /proc/locks file also displays information about any file leases that are held by processes on the system. File leases are a Linux-specific mechanism available in Linux 2.4 and later. If a process takes out a lease on a file, then it is notified (by delivery of a signal) if another process tries to open() or truncate() that file. (The inclusion of truncate() is necessary because it is the only system call that can be used to change the contents of a file without first opening it.) File leases are provided in order to allow Samba to support the oplocks functionality of the Microsoft SMB protocol and to allow NFS version 4 to support delegations (which are similar to SMB oplocks). Further details about file leases can be found under the description of the F_SETLEASE operation in the
55.6 Running Just One Instance of a Program
Some programs, in particular many daemons, need to ensure that only one instance of the program is running on the system at a time. A common method of doing this is to have the daemon create a file in a standard directory and place a write lock on it. The daemon holds the file lock for the duration of its execution
4. The type of lock, either READ or WRITE (corresponding to shared and exclusive
5. The process ID of the process holding the lock.
6. Three colon-separated numbers that identify the file on which the lock is held. These numbers are the major and minor device numbers of the file system on which the file resides, followed by the i-node number of the file.
7. The starting byte of the lock. This is always 0 for flock() locks.
8. The ending byte of the lock. Here, EOF indicates that the lock runs to the end of the file (i.e., l_len was specified as 0 for a lock created by fcntl()). For flock() locks, this column is always EOF.
In Linux 2.4 and earlier, each line of /proc/locks includes five additional hexadecimal values. These are pointer addresses used by the kernel to record locks in various lists. These values are not useful in application programs.
Using the information in /proc/locks, we can find out which process is holding a lock, and on what file. The following shell session shows how to do this for lock number 3 in the list above. This lock is held by process ID 312, on the i-node 133853 on the device with major ID 3 and minor ID 7. We begin by using ps to list information about the process with process ID 312:
$ ps -p 312
  PID TTY          TIME CMD
  312 ?        00:00:00 atd
The above output shows that the program holding the lock is atd, the daemon that
In order to find the locked file, we first search the files in the /dev directory,
Mandatory locking caveats
Mandatory locks do less for us than we might at first expect, and have some potential shortcomings and problems:
Holding a mandatory lock on a file doesn't prevent another process from deleting it, since all that is required to unlink a file is suitable permissions on the parent directory.
Careful consideration should be applied before enabling mandatory locks on a publicly accessible file, since not even privileged processes can override a mandatory lock. A malicious user could continuously hold a lock on the file in order to create a denial-of-service attack. (While in most cases, we could make the file accessible once more by turning
in blocking mode, the system call blocks. If the file was opened with the O_NONBLOCK flag, the system call immediately fails with the error EAGAIN. Similar rules apply for truncate() and ftruncate(), if the bytes they are attempting to add or remove from the file overlap a region currently locked (for reading or writing) by another process.
If we have opened a file in blocking mode (i.e., O_NONBLOCK is not specified in the open() call), then I/O system calls can be involved in deadlock situations. Consider the example shown in Figure 55-7, involving two processes that open the same file for blocking I/O, obtain write locks on different parts of the file, and then each attempt to write to the region locked by the other process. The kernel resolves this situation
Linux, like many other UNIX implementations, also allows fcntl() record locks to be mandatory. This means that every file I/O operation is checked to see whether it is compatible with any locks held by other processes on the region of the file on which I/O is being performed.
The semantics of fcntl() lock inheritance and release are an architectural blemish. For example, they make the use of record locks from library packages problematic, since a library function can't prevent the possibility that its caller will close a file descriptor referring to a locked file and thus remove a lock obtained by the library code. An alternative implementation scheme would have been to associate a lock with a file descriptor rather than with an i-node. However, the current semantics are the historical and now standardized behavior of record locks. Unfortunately, these semantics greatly limit the utility of fcntl() locking.
With flock(), a lock is associated only with an open file description, and remains in effect until either any process holding a reference to the lock explicitly releases the lock or all file descriptors referring to the open file description are closed.
55.3.6 Lock Starvation and Priority of Queued Lock Requests
When multiple processes must wait in order to place a lock on a currently locked region, a couple of questions arise.
Can a process waiting to place a write lock be starved by a series of processes placing read locks on the same region? On Linux (as on many other UNIX implementations), a series of read locks can indeed starve a blocked write lock, possibly indefinitely.
When two or more processes are waiting to place a lock, are there any rules
lock when it becomes available? For example, are lock requests satisfied in FIFO order? And do the rules depend on the types of locks being requested by each process (i.e., does a process requesting a read lock have priority over one requesting a write lock or vice versa, or neither)? On Linux, the rules are as follows:
Whenever a new lock is added to this data structure, the kernel must check for conflicts with any existing lock on the file. This search is carried out sequentially, starting at the head of the list.
Assuming a large number of locks distributed randomly among many processes, we can say that the time required to add or remove a lock increases roughly linearly with the number of locks already held on the file.
55.3.5 Semantics of Lock Inheritance and Release
The semantics of fcntl() record lock inheritance and release differ substantially from those for locks created using flock(). Note the following points:
Record locks are not inherited across a fork() by a child process. This contrasts with flock(), where the child inherits a reference to the lock and can release this lock, with the consequence that the parent also loses the lock.
Record locks are preserved across an exec(). (However, note the effect of the close-on-exec flag, described below.)
struct flock fl;
fl.l_type = F_WRLCK;
an open file descriptor. And, as well as performing an explicit close(), a descriptor can be closed by an exec()
l_start
Key
regionIsLock
As we see from the last line of output, this allowed process A's blocked lock request
It is important to realize that even though process B's deadlocked request was canceled, it still held its other lock, and so process A's queued lock request remained blocked. Process A's lock request is granted only when process B removes its other lock, bringing about the situation shown in part of Figure 55-5.
Figure 55-5: State of granted and queued lock requests while running i_fcntl_locking.c
[Figure: parts a) to d) show granted and queued locks on byte ranges of the 100-byte file (offsets 0, 39, 70, and 99 are marked). One request is labeled "PID=800, type=WRITE (canceled because of deadlock)".]
55.3.3 Example: A Library of Locking Functions
We start a first instance (process A) of the program in Listing 55-2, placing a read lock on bytes 0 to 39 of the file:
Terminal window 1
$ ls -l tfile
-rw-r--r-- 1 mtk users 100 Apr 18 12:19 tfile
$ ./i_fcntl_locking tfile
Enter ? for help
PID=790> s r 0 40
[PID=790] got lock
Then we start a second instance of the program (process B), placing a read lock on bytes 70 through to the end of the file:
Terminal window 2
$ ./i_fcntl_locking tfile
Enter ? for help
PID=800> s r -30 0 e
[PID=800] got lock
At this point, things appear as shown in part of Figure 55-5, where process A (process ID 790) and process B (process ID 800) hold locks on differen
Now we return to process A, where we try to place a write lock on the entire file. We first employ F_GETLK
numRead = sscanf(line, "%c %c %lld %lld %c", &cmdCh, &lock,
                &st, &len, &whence);
fl.l_start = st;
fl.l_len = len;

if (numRead < 4 || strchr("gsw", cmdCh) == NULL ||
        strchr("rwu", lock) == NULL || strchr("sce", whence) == NULL) {
    printf("Invalid command!\n");
    continue;
}
Listing 55-2: Experimenting with record locking
filelock/i_fcntl_locking.c
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

#define MAX_LINE 100

static void
displayCmdFmt(void)
{
    printf("\n Format: cmd lock start length [whence]\n\n");
of the blocked processes and causes its fcntl() call to unblock and fail with the error EDEADLK. (On Linux, the process making the most recent fcntl() call is selected, but this is not required by SUSv3, and may not hold true on future versions of Linux or on other UNIX implementations. Any process using F_SETLKW must be prepared to handle an EDEADLK error.)
Figure 55-4: Deadlock when two processes deny each other's lock requests
whence
cmd
, we can specify
to perform an
[Figure: locked byte ranges at two stages: 1. after placing a read lock; 2. after placing a write lock (l_start = 20, l_len = 10), which leaves a read lock, a write lock, and a read lock on adjacent ranges.]
process on any part of the region to be locked, fcntl() fails with the error EAGAIN. On some UNIX implementations, fcntl() fails with the error EACCES in this case. SUSv3 permits either possibility, and a portable application should test for both values.
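A portable nonblocking lock attempt might therefore look like the following sketch. The function name tryWriteLock() and its three-way return value are choices made for this illustration, not part of any standard API.

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper: attempt to place a write lock on 'len' bytes
   starting at offset 'start', without blocking. Returns 1 if the lock was
   acquired, 0 if another process holds an incompatible lock, or -1 on any
   other error. */
int
tryWriteLock(int fd, off_t start, off_t len)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = start,
        .l_len = len
    };

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 1;

    /* SUSv3 permits either error when the region is already locked */
    if (errno == EAGAIN || errno == EACCES)
        return 0;

    return -1;
}
```

Folding both error values into one return code keeps callers from depending on which of the two a given implementation happens to use.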
In order to place a read lock on a file, the file must be open for reading. Similarly, to place a write lock, the file must be open for writing. To place both types of locks, we open the file read-write (O_RDWR). Attempting to place a lock that is incompatible with the file access mode results in the error EBADF.
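The relationship between access mode and lock type can be checked with a small experiment. In this sketch, lockWithMode() is a hypothetical helper that opens an existing file with the given flags, tries to lock the whole file, and reports the resulting errno value (or 0 on success).

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper: open 'path' using 'openFlags', then try to place a
   lock of type 'lockType' (F_RDLCK or F_WRLCK) on the entire file.
   Returns 0 on success, or the errno value of the failing call. */
int
lockWithMode(const char *path, int openFlags, short lockType)
{
    struct flock fl = {
        .l_type = lockType,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0              /* 0 means "until EOF" */
    };
    int fd, err;

    fd = open(path, openFlags);
    if (fd == -1)
        return errno;

    err = (fcntl(fd, F_SETLK, &fl) == -1) ? errno : 0;

    close(fd);                  /* Also releases any lock we obtained */
    return err;
}
```

For an existing file, lockWithMode(path, O_RDONLY, F_WRLCK) and lockWithMode(path, O_WRONLY, F_RDLCK) both yield EBADF on Linux, while O_RDWR permits either lock type.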
l_whence
off_t l_start; /* Offset where the lock begins */
off_t l_len; /* Number of bytes to lock; 0 means "until EOF" */
[Figure: timeline of Process A and Process B synchronizing access to bytes 0 to 99 of a file. Steps shown: request write lock on bytes 0 to 99; update bytes 0 to 99; convert lock to read lock; request read lock (blocks, then unblocks); read bytes 0 to 99 (both processes); convert lock to write lock (blocks); unlock bytes 0 to 99 (other process unblocks); update bytes 0 to 99; unlock bytes 0 to 99.]
Historically, the Linux NFS server did not support flock() locks. Since kernel 2.6.12, the Linux NFS server supports flock() locks by implementing them as an fcntl() lock on the entire file. This can cause some strange effects when mixing BSD locks on the server and BSD locks on the client: the clients usually won't see the server's locks, and vice versa.
55.3 Record Locking with fcntl()
Using fcntl() (Section 5.2), we can place a lock on any part of a file, ranging from a single byte to the entire file. This form of file locking is usually called record locking. However, this term is a misnomer, because files on the UNIX system are byte sequences, with no concept of record boundaries. Any notion of records within a file is defined purely within an application.
fcntl() is used to lock byte ranges corresponding to the application-defined record boundaries within the file; hence the origin of the term record locking. The terms byte range, file region, and file segment are less commonly used, but more accurate, descriptions of this type of lock. (Because this is the only kind of locking specified in the original POSIX.1 standard and in SUSv3, it is sometimes also called POSIX file locking.)
SUSv3 requires record locking to be supported for regular files, and permits it to be supported for other file types. Although it generally makes sense to apply record locks only to regular files (since, for most other file types, it isn't meaningful to talk about byte ranges for the data contained in the file), on Linux, it is possible to apply a record lock to any type of file descriptor.
Figure 55-2 shows how record locking might be used to synchronize access by two processes to the same region of a file. (In this diagram, we assume that all lock requests are blocking, so that they will wait if a lock is held by another process.)
The general form of the fcntl() call used to create or remove a file lock is as follows:
    struct flock flockstr;
However, if we use open() to obtain a second file descriptor (and associated open file description) referring to the same file, this second descriptor is treated independently by flock(). For example, a process executing the following code will block on the second flock() call:
    fd1 = open("a.txt", O_RDWR);
    fd2 = open("a.txt", O_RDWR);
    flock(fd1, LOCK_EX);
    flock(fd2, LOCK_EX);            /* Locked out by lock on 'fd1' */
Thus, a process can lock itself out of a file using flock(). As we'll see later, this can't happen with record locks obtained by fcntl().
When we create a child process using fork(), that child obtains duplicates of its parent's file descriptors, and, as with descriptors duplicated via dup() and so on, these descriptors refer to the same open file descriptions and thus to the same locks. For example, the following code causes a child to remove a parent's lock:
    flock(fd, LOCK_EX);             /* Parent obtains lock */
    if (fork() == 0)                /* If child... */
        flock(fd, LOCK_UN);         /* Release lock shared with parent */
fcntl()
Locks created by
are preserved across an
(unless the close-on-exec
Using the program in Listing 55-1, we can conduct a number of experiments to explore the behavior of flock(). Some examples are shown in the following shell session. We begin by creating a file, and then start an instance of our program that sits in the background and holds a shared lock for 60 seconds:
$ ./t_flock tfile s 60 &
[1] 9777
PID 9777: requesting LOCK_SH at 21:19:37
PID 9777: granted LOCK_SH at 21:19:37
Next, we start another instance of the program that successfully requests a shared lock and then releases it:
$ ./t_flock tfile s 2
PID 9778: requesting LOCK_SH at 21:19:49
PID 9778: granted LOCK_SH at 21:19:49
PID 9778: releasing LOCK_SH at 21:19:51
However, when we start another instance of the program that makes a nonblocking request for an exclusive lock, the request immediately fails:
$ ./t_flock tfile xn
PID 9779: requesting LOCK_EX at 21:20:03
PID 9779: already locked - bye!
When we start another instance of the program that makes a blocking request for an exclusive lock, the program blocks. When the background process that was holding a shared lock for 60 seconds releases its lock, the blocked request is granted:
$ ./t_flock tfile x
PID 9780: requesting LOCK_EX at 21:20:21
PID 9777: releasing LOCK_SH at 21:20:37
PID 9780: granted LOCK_EX at 21:20:37
PID 9780: releasing LOCK_EX at 21:20:47
55.2.1 Semantics of Lock Inheritance and Release
As shown in Table 55-1, we can release a file lock via an flock() call that specifies operation as LOCK_UN. In addition, locks are automatically released when the corresponding file descriptor is closed. However, the story is more complicated than this. A file lock obtained via flock() is associated with the open file description (Section 5.4), rather than the file descriptor or the file (i-node) itself. This means that when a file descriptor is duplicated (via dup(), dup2(), or an fcntl() F_DUPFD operation), the new file descriptor refers to the same file lock. For example, if we have obtained a lock on the file referred to by fd, then the following code (which omits error checking) releases that lock:
    flock(fd, LOCK_EX);             /* Gain lock via 'fd' */
    newfd = dup(fd);                /* 'newfd' refers to same lock as 'fd' */
    flock(newfd, LOCK_UN);          /* Frees lock acquired via 'fd' */
If we have acquired a lock via a particular file descriptor, and we create one or more duplicates of that descriptor, then, if we don't explicitly perform an unlock operation, the lock is released only when all of the duplicate descriptors have been closed.
Listing 55-1: Using
filelock/t_flock.c
#include <sys/file.h>
#include <fcntl.h>
#include "curr_time.h"                  /* Declaration of currTime() */
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd, lock;
    const char *lname;

    if (argc < 3 || strcmp(argv[1], "--help") == 0 ||
            strchr("sx", argv[2][0]) == NULL)
        usageErr("%s file lock [sleep-time]\n"
                 " 'lock' is 's' (shared) or 'x' (exclusive)\n"
                 " optionally followed by 'n' (nonblocking)\n"
                 " 'secs' specifies time to hold lock\n", argv[0]);

    lock = (argv[2][0] == 's') ? LOCK_SH : LOCK_EX;
    if (argv[2][1] == 'n')
        lock |= LOCK_NB;

    fd = open(argv[1], O_RDONLY);       /* Open file to be locked */
    if (fd == -1)
        errExit("open");

    lname = (lock & LOCK_SH) ? "LOCK_SH" : "LOCK_EX";
Any number of processes may simultaneously hold a shared lock on a file. However, only one process at a time can hold an exclusive lock on a file. (In other words, exclusive locks deny both exclusive and shared locks by other processes.) Table 55-2 summarizes the compatibility rules for flock() locks. Here, we assume that process A is the first to place the lock, and the table indica
Although file locking is normally used in conjunction with file I/O, we can also use it as a more general synchronization technique. Cooperating processes can follow a convention that locking all or part of a file indicates access by a process to some shared resource other than the file itself (e.g., a shared memory region).
functions
Because of the user-space buffering performed by the stdio library, we should be cautious when using stdio functions with the locking techniques described in this chapter. The problem is that an input buffer might be filled before a lock is placed,
lock is removed. There are a few ways to avoid these problems:
Perform file I/O using read() and write() (and related system calls) instead of the stdio library.
Flush the stdio stream immediately after placing a lock on the file, and flush it once more immediately before releasing the lock.
Perhaps at the cost of some efficiency, disable
Figure 55-1: Two processes updating a file at the same time without synchronization
The problem is clear: at the end of these steps, the file contains the value 1001, when it should contain the value 1002. (This is an example of a race condition.) To prevent such possibilities, we need some form of interprocess synchronization.
Although we could use (say) semaphores to perform the required synchronization, using file locks is usually preferable, because the kernel automatically associates
[Stevens & Rago, 2005] dates the first UNIX file locking implementation to 1980, and notes that fcntl() locking, upon which we primarily focus in this chapter, appeared in System V Release 2 in 1984.
In this chapter, we describe two different APIs for placing file locks: flock(), which places locks on entire files; and fcntl(), which places locks on regions of a file. The flock() system call originated on BSD; fcntl() originated on System V.
[Figure content: timeline of Process A and Process B. Each process reads the sequence number (obtains 1000), then increments it (to 1001) and writes it back to the file; the points where each time slice begins, ends, or expires are marked. Key: executing; waiting for CPU.]
FILE LOCKING
Previous chapters have covered various techniques that processes can use to synchronize their actions, including signals (Chapters 20 to 22) and semaphores (Chapters 47 and 53). In this chapter, we look at further synchronization techniques designed specifically for use with files.
55.1 Overview
A frequent application requirement is to read data from a file, make some change
to that data, and then write it back to the file. As long as just one process at a time
ever uses a file in this way, then there are no problems. However, problems can
arise if multiple processes are simultaneously updating a file. Suppose, for example,
that each process performs the following steps to update a file containing a
sequence number:
1. Read the sequence number from the file.
2. Use the sequence number for some application-defined purpose.
3. Increment the sequence number and write it back to the file.
The problem here is that, in the absence of any synchronization technique, two
processes could perform the above steps at the same time with (for example) the
consequences shown in Figure 55-1 (here, we assume that the initial value of the
sequence number is 1000).
Chapter 54
Historically, System V shared memory was more widely available than mmap()
and POSIX shared memory, although most UNIX implementations now provide
all of these techniques.
With the exception of the final point regarding portability, the differences listed
above are advantages in favor of shared
file mappings and POSIX shared memory
objects. Thus, in new applications, one of
these interfaces may be preferable to
System V shared memory. Which one we choose depends on whether or not we
require a persistent backing store. Shared file mappings provide such a store;
POSIX shared memory objects allow us to
avoid the overhead of using a disk file
when a backing store is not required.
54.6 Summary
A POSIX shared memory object is used to
POSIX Shared Memory
From the output, we can see that the program resized the shared memory object so
that it is large enough to hold the specified string.
Finally, we use the program in Listing 54-3 to display the string in the shared
memory object:
./pshm_read /demo_shm
hello
Applications must typically use some synchronization technique to allow processes
to coordinate their access
to shared memory. In the example shell session shown
here, the coordination was provided by
the user running the programs one after
the other. Typically, applications would instead use a synchronization primitive
(e.g., semaphores) to coordinate access to a shared memory object.
54.4 Removing Shared Memory Objects
SUSv3 requires that POSIX shared memory objects have at least kernel persistence;
that is, they continue to exist until they are explicitly removed or the system is
rebooted. When a shared memory object is no longer required, it should be
removed using shm_unlink().
The shm_unlink() function removes the shared memory object specified by name.
Removing a shared memory object doesn't affect existing mappings of the object
(which will remain in effect until the corresponding processes call munmap() or
terminate), but prevents further shm_open() calls from opening the object. Once all
processes have unmapped the object, the object is removed, and its contents are lost.
The program in Listing 54-4 uses shm_unlink() to remove the shared memory
object specified in the program's command-line argument.
Listing 54-4: Using shm_unlink() to unlink a POSIX shared memory object
pshm/pshm_unlink.c
#include <fcntl.h>
#include <sys/mman.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s shm-name\n", argv[0]);

    if (shm_unlink(argv[1]) == -1)
        errExit("shm_unlink");
    exit(EXIT_SUCCESS);
}
pshm/pshm_unlink.c
#include <sys/mman.h>
int shm_unlink(const char *name);
Returns 0 on success, or -1 on error
Listing 54-3: Copying data from a POSIX shared memory object
pshm/pshm_read.c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    char *addr;
    struct stat sb;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s shm-name\n", argv[0]);

    fd = shm_open(argv[1], O_RDONLY, 0);    /* Open existing object */
    if (fd == -1)
        errExit("shm_open");

    /* Use shared memory object size as length argument for mmap()
       and as number of bytes to write() */

    if (fstat(fd, &sb) == -1)
        errExit("fstat");

    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    if (close(fd) == -1)                    /* 'fd' is no longer needed */
        errExit("close");

    write(STDOUT_FILENO, addr, sb.st_size);
    printf("\n");
    exit(EXIT_SUCCESS);
}
pshm/pshm_read.c
The following shell session demonstrates the use of the programs in Listing 54-2
and Listing 54-3. We first create a zero-length shared memory object using the
program in Listing 54-1.
$ ./pshm_create -c /demo_shm 0
$ ls -l /dev/shm                        Check the size of object
total 4
-rw------- 1 mtk users 0 Jun 21 13:33 demo_shm
We then use the program in Listing 54-2 to copy a string into the shared memory object:
$ ./pshm_write /demo_shm 'hello'
$ ls -l /dev/shm                        Check that object has changed in size
total 4
-rw------- 1 mtk users 5 Jun 21 13:33 demo_shm
54.3 Using Shared Memory Objects
Listing 54-2 and Listing 54-3 demonstrate the use of a shared memory object to
transfer data from one process to another. The program in Listing 54-2 copies the
string contained in its second command-line argument into the existing shared
memory object named in its first command-line argument. Before mapping the
object and performing the copy, the program uses ftruncate() to resize the shared
memory object to be the same length as the string that is to be copied.
Listing 54-2: Copying data into a POSIX shared memory object
pshm/pshm_write.c
#include <fcntl.h>
#include <sys/mman.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    size_t len;                 /* Size of shared memory object */
    char *addr;

    if (argc != 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s shm-name string\n", argv[0]);

    fd = shm_open(argv[1], O_RDWR, 0);      /* Open existing object */
    if (fd == -1)
        errExit("shm_open");

    len = strlen(argv[2]);
    if (ftruncate(fd, len) == -1)           /* Resize object to hold string */
        errExit("ftruncate");
    printf("Resized to %ld bytes\n", (long) len);

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    if (close(fd) == -1)
        errExit("close");                   /* 'fd' is no longer needed */

    printf("copying %ld bytes\n", (long) len);
    memcpy(addr, argv[2], len);             /* Copy string to shared memory */
    exit(EXIT_SUCCESS);
}
pshm/pshm_write.c
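The program in Listing 54-3 displays the string in the existing shared memory
object named in its command-line argument on standard output. After calling
shm_open(), the program uses fstat() to determine the size of the shared memory
object, and then uses that size both in the mmap() call that maps the object and in
the write() call that prints the string.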
static void
usageError(const char *progName)
{
    fprintf(stderr, "Usage: %s [-cx] name size [octal-perms]\n", progName);
    fprintf(stderr, "    -c   Create shared memory (O_CREAT)\n");
    fprintf(stderr, "    -x   Create exclusively (O_EXCL)\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt, fd;
    mode_t perms;
    size_t size;
    void *addr;

    flags = O_RDWR;
    while ((opt = getopt(argc, argv, "cx")) != -1) {
        switch (opt) {
        case 'c':   flags |= O_CREAT;   break;
        case 'x':   flags |= O_EXCL;    break;
        default:    usageError(argv[0]);
        }
    }

    if (optind + 1 >= argc)
        usageError(argv[0]);
against the process umask (Section 15.4.6). Unlike open(), the mode argument is
always required for a call to shm_open(); if we are not creating a new object, this
argument should be specified as 0.
The close-on-exec flag (FD_CLOEXEC, Section 27.4) is set on the file descriptor
returned by shm_open().
The fstat() system call returns a stat structure whose fields contain information about
the shared memory object, including its size (st_size), permissions (st_mode), owner
(st_uid), and group (st_gid). (These are the only fields that SUSv3 requires fstat()
to set in the stat structure.)
54.2 Creating Shared Memory Objects
The shm_open() function creates and opens a new shared memory object or opens
an existing object. The arguments to shm_open() are analogous to those for open().
The name argument identifies the shared memory object to be created or opened. The
oflag argument is a mask of bits that modify the behavior of the call. The values that
can be included in this mask are summarized in Table 54-1.
One of the purposes of the oflag argument is to determine whether we are opening
an existing shared memory object or creating and opening a new object. If oflag
doesn't include O_CREAT, we are opening an existing object. If O_CREAT is specified,
then the object is created if it doesn't already exist. Specifying O_EXCL in conjunction
with O_CREAT is a request to ensure that the caller is the creator of the object; if the
object already exists, an error results (EEXIST).
The oflag argument also indicates the kind of access that the calling process will
make to the shared memory object, by specifying exactly one of the values
O_RDONLY or O_RDWR.
The remaining flag value, O_TRUNC, causes a successful open of an existing
shared memory object to truncate the object to a length of zero.
On Linux, truncation occurs even on a read-only open. However, SUSv3 says
that the result of using O_TRUNC with a read-only open is undefined, so we can't
portably rely on a specific behavior in this case.
When a new shared memory object is created, its ownership and group ownership
are taken from the effective user and group IDs of the process calling shm_open().
54.1 Overview
POSIX shared memory allows us to
in the previous step in a call to mmap() that specifies MAP_SHARED in the flags
argument. This maps the shared memory object into the process's virtual address
space. As with other uses of mmap(), once we have mapped the object, we can close
the file descriptor without affecting the mapping. However, we may need to keep
the file descriptor open for subsequent use in calls to ftruncate() (see Section 54.2).
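The open, size, and map steps can be sketched in a single helper. This is a minimal illustration rather than one of the book's listings: the function name mapSharedObject() and the fixed rw------- permissions are assumptions chosen for the example. On older glibc versions, programs that call shm_open() may need to be linked with -lrt.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create (if necessary) the shared memory object
   'name', set its size to 'len' bytes, and map it into the caller's
   address space. Returns the mapping address, or MAP_FAILED on error. */
void *
mapSharedObject(const char *name, size_t len)
{
    void *addr;

    int fd = shm_open(name, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return MAP_FAILED;

    if (ftruncate(fd, len) == -1) {     /* New objects have length 0 */
        close(fd);
        return MAP_FAILED;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* Mapping survives the close */
    return addr;
}
```

Two processes (or two calls in one process) that pass the same name receive mappings of the same object, so data stored through one mapping is visible through the other. The object persists until it is removed with shm_unlink().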
The relationship between shm_open() and mmap() for POSIX shared memory is
POSIX SHARED MEMORY
In previous chapters, we looked at two techniques that allow unrelated processes
to share memory regions in order to perform IPC: System V shared memory
(Chapter 48) and shared file mappings (Section 49.4.2). Both of these techniques
have potential drawbacks:
- The System V shared memory model, which uses keys and identifiers, is not
  consistent with the standard UNIX I/O model, which uses filenames and
  descriptors. This difference means that
POSIX Semaphores
SEM_VALUE_MAX
This is the maximum value that a POSIX semaphore may reach. Semaphores
may assume any value from 0 up to this limit. SUSv3 requires this limit to be
at least 32,767; the Linux implementation allows values up to INT_MAX
(2,147,483,647 on Linux/x86-32).
53.7 Summary
POSIX semaphores allow processes or threads to synchronize their actions. POSIX
semaphores come in two types: named and unnamed. A named semaphore is
identified by a name, and can be shared by any processes that have permission to open
the semaphore. An unnamed semaphore has no name, but processes or threads
can share the same semaphore by placing it in a region of memory that they share
(e.g., in a POSIX shared memory object for process sharing, or in a global variable
for thread sharing).
The POSIX semaphore interface is simpler than the System V semaphore interface.
Semaphores are allocated and operated on individually, and the wait and post
operations adjust a semaphore's value by one.
POSIX semaphores have a number of advantages over System V semaphores,
but they are somewhat less portable. For synchronization within multithreaded
applications, mutexes are generally preferred over semaphores.
Further information
[Stevens, 1999] provides an alternative presentation of POSIX semaphores and
shows user-space implementations using various other IPC mechanisms (FIFOs,
memory-mapped files, and System V semaphores). [Butenhof, 1996] describes the
use of POSIX semaphores in multithreaded applications.
53.8 Exercises
53-1. Rewrite the programs in Listing 48-2 and Listing 48-3 (Section 48.4) as a threaded
application, with the two threads passing data to each other via a global buffer, and
using POSIX semaphores for synchronization.
53-2. Modify the program in Listing 53-3 (psem_wait.c) to use sem_timedwait() instead of
sem_wait(). The program should take an additional command-line argument that
specifies a (relative) number of seconds to be used as the timeout for the
sem_timedwait() call.
53-3. Devise an implementation of POSIX semaphores using System V semaphores.
53-4. In Section 53.5, we noted that POSI
Chapter 53
proceed without blocking), then POSIX semaphores perform considerably better
than System V semaphores. (On the systems tested by the author, the difference
in performance is more than an order of magnitude; see Exercise 53-4.)
POSIX semaphores perform so much better in this case because the way in
which they are implemented only requires a system call when contention
occurs, whereas System V semaphore operations always require a system call,
regardless of contention.
However, POSIX semaphores also have the following disadvantages compared to
System V semaphores:
- POSIX semaphores are somewhat less portable. (On Linux, named semaphores
  have been supported only since kernel 2.6.)
- POSIX semaphores don't provide an equivalent of the System V semaphore
  undo feature. (However, as we noted in Section 47.8, this feature may not be
  useful in some circumstances.)
POSIX semaphores versus Pthreads mutexes
POSIX semaphores and Pthreads mutexes can both be used to synchronize the
actions of threads within the same process, and their performance is similar.
However, mutexes are usually preferable, because the ownership property of mutexes
enforces good structuring of code (only the thread that locks a mutex can unlock
it). By contrast, one thread can increment a semaphore that was decremented by
another thread. This flexibility can lead to poorly structured synchronization designs.
After an unnamed semaphore has been destroyed with sem_destroy(), it can
be reinitialized with sem_init().
An unnamed semaphore should be destroyed before its underlying memory is
deallocated. For example, if the semaphore is an automatically allocated variable, it
should be destroyed before its host function returns. If the semaphore resides in a
POSIX shared memory region, then it should be destroyed after all processes have
ceased using the semaphore and before the shared memory object is unlinked with
shm_unlink().
On some implementations, omitting calls to sem_destroy() doesn't cause problems.
However, on other implementations, failing to call sem_destroy() can result in
resource leaks. Portable applications should call sem_destroy() to avoid such problems.
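The life cycle of an unnamed, thread-shared semaphore (initialize, use, destroy) can be sketched as follows. This is an illustrative example under stated assumptions, not one of the book's listings: the names runCounter() and threadFunc() are invented here, and the semaphore is simply used as a lock around a shared counter. On older glibc versions, the program may need to be built with -pthread.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t sem;               /* Unnamed, thread-shared semaphore */
static long glob = 0;           /* Shared counter protected by 'sem' */

static void *
threadFunc(void *arg)
{
    long loops = *(long *) arg;

    for (long j = 0; j < loops; j++) {
        if (sem_wait(&sem) == -1)       /* Decrement: enter critical section */
            return NULL;
        glob++;
        if (sem_post(&sem) == -1)       /* Increment: leave critical section */
            return NULL;
    }
    return NULL;
}

/* Hypothetical driver: run two threads that each increment 'glob' 'loops'
   times. Returns the final value of 'glob', or -1 on error. */
long
runCounter(long loops)
{
    pthread_t t1, t2;

    glob = 0;
    if (sem_init(&sem, 0, 1) == -1)     /* pshared = 0, initial value = 1 */
        return -1;

    if (pthread_create(&t1, NULL, threadFunc, &loops) != 0) {
        sem_destroy(&sem);
        return -1;
    }
    if (pthread_create(&t2, NULL, threadFunc, &loops) != 0) {
        pthread_join(t1, NULL);
        sem_destroy(&sem);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Destroy the semaphore only after no thread can still be using it */
    if (sem_destroy(&sem) == -1)
        return -1;
    return glob;
}
```

Calling sem_destroy() only after both threads have been joined follows the rule above that a semaphore should be destroyed once nothing is using it and before its memory goes away.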
53.5 Comparisons with Other Synchronization Techniques
In this section, we compare POSIX semaphores with two other synchronization
techniques: System V semaphores and mutexes.
POSIX semaphores versus System V semaphores
POSIX semaphores and System V semaphores can both be used to synchronize the
actions of processes. Section 51.2 listed various advantages of POSIX IPC over
System V IPC: the POSIX IPC interface is simpler and more consistent with the
traditional UNIX file model, and POSIX IPC objects are reference counted, which
        loc = glob;
        loc++;
        glob = loc;

        if (sem_post(&sem) == -1)
            errExit("sem_post");
    }
implementation ignores pshared, since no special action is required for either
type of sharing. Nevertheless, portable and future-proof applications should
specify an appropriate value for pshared.
The SUSv3 specification for sem_init() defines a failure return of -1, but makes
If we are building a dynamic data structure (e.g., a binary tree), each of whose
items requires an associated semaphore, then the simplest approach is to allocate
an unnamed semaphore within each item. Opening a named semaphore
for each item would require us to design a convention for generating a
(unique) semaphore name for each item and to manage those names (e.g.,
unlinking them when they are no longer required).
53.4.1 Initializing an Unnamed Semaphore
The sem_init() function initializes the unnamed semaphore pointed to by sem to the
value specified by value.
We then execute a command that increments the semaphore. This causes the
blocked sem_wait() in the background program to complete:
$ ./psem_post /demo
$ 31208 sem_wait() succeeded
(The last line of output above shows the shell prompt mixed with the output of the
background job.)
We press Enter to see the next shell prompt, which also causes the shell to
report on the terminated background job, and then perform further operations on
the semaphore:
Press Enter
[1]- Done ./psem_wait /demo
$ ./psem_post /demo                     Increment semaphore
The program in Listing 53-5 uses
Listing 53-4: Using sem_post() to increment a POSIX semaphore
psem/psem_post.c
#include <semaphore.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    sem_t *sem;

    if (argc != 2)
        usageErr("%s sem-name\n", argv[0]);

    sem = sem_open(argv[1], 0);
    if (sem == SEM_FAILED)
        errExit("sem_open");

    if (sem_post(sem) == -1)
        errExit("sem_post");
    exit(EXIT_SUCCESS);
}
psem/psem_post.c
If a sem_timedwait() call times out without being able to decrement the semaphore,
then the call fails with the error ETIMEDOUT.
#include <semaphore.h>
int sem_post(sem_t *sem);
Returns 0 on success, or -1 on error
If the semaphore currently has a value greater than 0,
se
Listing 53-2: Using sem_unlink() to unlink a POSIX named semaphore
psem/psem_unlink.c
#include <semaphore.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s sem-name\n", argv[0]);

    if (sem_unlink(argv[1]) == -1)
        errExit("sem_unlink");
    exit(EXIT_SUCCESS);
}
psem/psem_unlink.c
53.3 Semaphore Operations
As with a System V semaphore, a POSIX semaphore is an integer that the system
never allows to go below 0. However, POSIX semaphore operations differ from
their System V counterparts in the following respects:
- The functions for changing a semaphore's value, sem_post() and sem_wait(),
  operate on just one semaphore at a time. By contrast, the System V semop()
  system call can operate on multiple semaphores in a set.
- The sem_post() and sem_wait() functions increment and decrement a semaphore's
  value by exactly one. By contrast, semop() can add and subtract arbitrary values.
- There is no equivalent of the wait-for-zero operation provided by System V
  semaphores (a semop() call where the sem_op field is specified as 0).
From this list, it may seem that POSIX semaphores are less powerful than System V
semaphores. However, this is not the case: anything that we can do with System V
semaphores can also be done with POSIX semaphores. In a few cases, a bit more
programming effort may be required, but, for typical scenarios, using POSIX
semaphores actually requires less programming effort. (The System V semaphore API is
rather more complicated than is required for most applications.)
53.3.1 Waiting on a Semaphore
The sem_wait() function decrements (decreases by 1) the value of the semaphore
referred to by sem.
#include <semaphore.h>
int sem_wait(sem_t *sem);
Returns 0 on success, or -1 on error
    if (optind >= argc)
        usageError(argv[0]);

    /* Default permissions are rw-------; default semaphore initialization
       value is 0 */

    perms = (argc <= optind + 1) ? (S_IRUSR | S_IWUSR) :
#include <semaphore.h>
int sem_unlink(const char *name);
Returns 0 on success, or -1 on error
The following shell session log demonstrates the use of this program. We first
use the umask command to deny all permissions to users in the class other. We then
exclusively create a semaphore and examine the contents of the Linux-specific virtual
directory that contains named semaphores.
$ ./psem_create -cx /demo 666           666 means read+write for all users
$ ls -l /dev/shm/sem.*
-rw-rw---- 1 mtk users 16 Jul 6 12:09 /dev/shm/sem.demo
The output of the ls command shows that the process umask overrode the specified
permissions of read plus write for the user class other.
If we try once more to exclusively create a semaphore with the same name, the
operation fails, because the name already exists.
$ ./psem_create -cx /demo 666
ERROR [EEXIST File exists] sem_open     Failed because of O_EXCL
Listing 53-1: Using sem_open() to open or create a POSIX named semaphore
psem/psem_create.c
#include <semaphore.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

static void
usageError(const char *progName)
{
    fprintf(stderr, "Usage: %s [-cx] name [octal-perms [value]]\n", progName);
    fprintf(stderr, "    -c   Create semaphore (O_CREAT)\n");
    fprintf(stderr, "    -x   Create exclusively (O_EXCL)\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;
    mode_t perms;
    unsigned int value;
    sem_t *sem;

    flags = 0;
    while ((opt = getopt(argc, argv, "cx")) != -1) {
        switch (opt) {
        case 'c':   flags |= O_CREAT;   break;
        case 'x':   flags |= O_EXCL;    break;
        default:    usageError(argv[0]);
        }
    }
The name argument identifies the semaphore. It is specified according to the rules
given in Section 51.1.
The
POSIX semaphores operate in a manner similar to System V semaphores; that is,
a POSIX semaphore is an integer whose value is not permitted to fall below 0. If a
process attempts to decrease the value of a semaphore below 0, then, depending
on the function used, the call either blocks or fails with an error indicating that the
operation was not currently possible.
Some systems don't provide a full implementation of POSIX semaphores. A
typical restriction is that only unnamed thread-shared semaphores are supported.
That was the situation on Linux 2.4; with Linux 2.6 and a glibc that provides NPTL,
a full implementation of POSIX semaphores is available.
On Linux 2.6 with NPTL, semaphore operations (increment and decrement)
are implemented using the futex(2) system call.
53.2 Named Semaphores
To work with a named semaphore, we employ the following functions:
- The sem_open() function opens or creates a semaphore, initializes the
  semaphore if it is created by the call, and
This chapter describes POSIX semaphores, which allow processes and threads to
synchronize access to shared resources. In Chapter 47, we described System V
semaphores, and we'll assume that the reader is familiar with the general semaphore
concepts and rationale for using semaphores that were presented at the start
of that chapter. During the course of this chapter, we
Chapter 52
52-6. Replace the use of a signal handler in Listing 52-6 (mq_notify_sig.c) with the use of
sigwaitinfo(). Upon return from
POSIX Message Queues
However, POSIX message queues also have some disadvantages compared to
System V message queues:
- POSIX message queues are less portable. This problem applies even across Linux
  systems, since message queue support is available only since kernel 2.6.6.
- The facility to select System V messages by type provides slightly greater
  flexibility than the strict priority ordering of POSIX messages.
There is a wide variation in the manner in which POSIX message queues are
implemented on UNIX systems. Some systems provide implementations in
user space, and on at least one such implementation (Solaris 10), the mq_open()
manual page explicitly notes that the implementation can't be considered
secure. On Linux, one of the motives for selecting a kernel implementation of
message queues was that it was not deemed possible to provide a secure
user-space implementation.
52.10 Summary
POSIX message queues allow processes to exchange data in the form of messages.
Each message has an associated integer priority, and messages are queued (and
thus received) in order of priority.
POSIX message queues have some advantages over System V message queues,
notably that they are reference counted and that a process can be asynchronously
notified of the arrival of a message on an empty queue. However, POSIX message
queues are less portable than System V message queues.
Further information
[Stevens, 1999] provides an alternative presentation of POSIX message queues and
shows a user-space implementation using memory-mapped files. POSIX message
queues are also described in some
As well as the above SUSv3-specified limits, Linux provides a number of /proc files
for viewing and (with privilege) changing limits that control the use of POSIX
message queues. The following three files reside in the directory /proc/sys/fs/mqueue:
msg_max
This limit specifies a ceiling for the mq_maxmsg attribute of new message
queues (i.e., a ceiling for attr.mq_maxmsg when creating a queue with
mq_open()). The default value for this limit is 10. The minimum value is 1
(10 in kernels before Linux 2.6.28). The maximum value is defined by the
kernel constant HARD_MSGMAX. The value for this constant is calculated as
(131,072 / sizeof(void *)), which evaluates to 32,768 on Linux/x86-32.
When a privileged process (CAP_SYS_RESOURCE) calls mq_open(), the
msg_max limit is ignored, but HARD_MSGMAX still acts as a ceiling for
attr.mq_maxmsg.
msgsize_max
This limit specifies a ceiling for the mq_msgsize attribute of new message
queues created by unprivileged processes (i.e., a ceiling for attr.mq_msgsize
when creating a queue with mq_open()). The default value for this limit is 8192.
The minimum value is 128 (8192 in kernels before Linux 2.6.28). The
maximum value is 1,048,576 (INT_MAX in kernels before 2.6.28). This limit is
ignored when a privileged process (CAP_SYS_RESOURCE) calls mq_open().
queues_max
This is a system-wide limit on the number of message queues that may be
created. Once this limit is reached, only a privileged process (CAP_SYS_RESOURCE)
can create new queues. The default value for this limit is 256. It can be
changed to any value in the range 0 to INT_MAX.
Linux also provides the RLIMIT_MSGQUEUE resource limit, which can be used to place a
ceiling on the amount of space that can be consumed by all of the message queues
belonging to the real user ID of the calling process.
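As a small illustration, the current RLIMIT_MSGQUEUE settings can be inspected with getrlimit(). The function names showMsgQueueLimit() and printLimit() below are assumptions made for this sketch; RLIMIT_MSGQUEUE itself is Linux-specific.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, showing RLIM_INFINITY as "unlimited" */
static void
printLimit(const char *label, rlim_t lim)
{
    if (lim == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu bytes\n", label, (unsigned long long) lim);
}

/* Hypothetical helper: display the soft and hard RLIMIT_MSGQUEUE limits
   of the calling process. Returns 0 on success, or -1 on error. */
int
showMsgQueueLimit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MSGQUEUE, &rl) == -1)
        return -1;

    printLimit("soft limit", rl.rlim_cur);
    printLimit("hard limit", rl.rlim_max);
    return 0;
}
```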
field is a count of the total number of bytes of data in the queue. The
remaining fields relate to message notification. If NOTIFY_PID is nonzero, then the
process with the specified process ID has registered for message notification from this
queue, and the remaining fields provide information about the kind of notification:
NOTIFY is a value corresponding to one of the sigev_notify constants: 0 for
SIGEV_SIGNAL, 1 for SIGEV_NONE, or 2 for SIGEV_THREAD.
    sev.sigev_notify = SIGEV_THREAD;            /* Notify via thread */
    sev.sigev_notify_function = threadFunc;
    sev.sigev_notify_attributes = NULL;
            /* Could be pointer to pthread_attr_t structure */
    sev.sigev_value.sival_ptr = mqdp;           /* Argument to threadFunc() */

    if (mq_notify(*mqdp, &sev) == -1)
        errExit("mq_notify");
}

int
main(int argc, char *argv[])
{
    mqd_t mqd;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s mq-name\n", argv[0]);

    mqd = mq_open(argv[1], O_RDONLY | O_NONBLOCK);
    if (mqd == (mqd_t) -1)
        errExit("mq_open");
52.6.2 Receiving Notification via a Thread
Listing 52-7 provides an example of message notification using threads. This
program shares a number of design features with the program in Listing 52-6:
- When message notification occurs, the program reenables notification before
  draining the queue.
- Nonblocking mode is employed so that, after receiving a notification, we can
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = handler;
    if (sigaction(NOTIFY_SIG, &sa, NULL) == -1)
        errExit("sigaction");

    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = NOTIFY_SIG;
    if (mq_notify(mqd, &sev) == -1)
        errExit("mq_notify");

    sigemptyset(&emptyMask);

    for (;;) {
        sigsuspend(&emptyMask);         /* Wait for notification signal */

        if (mq_notify(mqd, &sev) == -1)
            errExit("mq_notify");

        while ((numRead = mq_receive(mqd, buffer, attr.mq_msgsize, NULL)) >= 0)
            printf("Read %ld bytes\n", (long) numRead);

        if (errno != EAGAIN)            /* Unexpected error */
            errExit("mq_receive");
    }
}
pmsg/mq_notify_sig.c
Various aspects of the program in Listing 52-6 merit further comment:
- We block the notification signal and use sigsuspend() to wait for it, rather than
  pause(), to prevent the possibility of missing a signal that is delivered while the
  program is executing elsewhere (i.e., is not blocked waiting for signals) in the
  for loop. If this occurred, and we were using pause() to wait for signals, then the
  next call to pause() would block, even though a signal had already been delivered.
- We open the queue in nonblocking mode, and, whenever a notification occurs,
  we use a while loop to read all messages from the queue. Emptying the queue in
  this way ensures that a further notification is generated when a new message
  arrives. Employing nonblocking mode means that the while loop will terminate
  (mq_receive() will fail with the error EAGAIN) when we have emptied the queue.
  (This approach is analogous to the use of nonblocking I/O with edge-triggered
  I/O notification, which we describe in Section 63.1.1, and is employed for
  similar reasons.)
- Within the for loop, it is important that we reregister for message notification
  before reading all messages from the queue. If we reversed these steps, the
  following sequence could occur: all messages are read from the queue, and the
  while loop terminates; another message is placed on the queue; mq_notify() is
  called to reregister for message notification. At this point, no further
  notification signal would be generated, because the queue is already nonempty.
  Consequently, the program would remain permanently blocked in its next call to
  sigsuspend().
t, the process will have been deregis-
tered for message notification.
b) Call mq_notify() to reregister this process to receive message notification.
c) Execute a while loop that drains the queue by reading as many messages as
Listing 52-6: Receiving message notification via a signal
pmsg/mq_notify_sig.c
#include <signal.h>
#include <mqueue.h>
#include <fcntl.h>              /* For definition of O_NONBLOCK */
#include "tlpi_hdr.h"

#define NOTIFY_SIG SIGUSR1

static void
handler(int sig)
{
    /* Just interrupt sigsuspend() */
}

int
main(int argc, char *argv[])
{
    struct sigevent sev;
    mqd_t mqd;
    struct mq_attr attr;
    void *buffer;
    ssize_t numRead;
struct sigevent {
or marking the message queue descriptor nonblocking and performing periodic
polls on the queue, a process can request a notification of message arrival and then
perform other tasks until it is notified. A process can choose to be notified either
via a signal or via invocation of a function in a separate thread.
The notification feature of POSIX message queues is similar to the notification
facility that we described for POSIX timers in Section 23.6. (Both of these APIs
originated in POSIX.1b.)
The mq_notify() function registers the calling process to receive a notification when
a message arrives on the empty queue referred to by the descriptor mqdes.
The notification argument specifies the mechanism by which the process is to be
    numRead = mq_receive(mqd, buffer, attr.mq_msgsize, &prio);
    if (numRead == -1)
        errExit("mq_receive");

    printf("Read %ld bytes; priority = %u\n", (long) numRead, prio);
    if (write(STDOUT_FILENO, buffer, numRead) == -1)
        errExit("write");
    write(STDOUT_FILENO, "\n", 1);

    exit(EXIT_SUCCESS);
}
pmsg/pmsg_receive.c
52.5.3 Sending and Receiving Messages with a Timeout
The mq_timedsend() and mq_timedreceive() functions are exactly like mq_send() and
mq_receive(), except that if the operation can't be performed immediately, and the
O_NONBLOCK flag is not in effect for the message queue description, then the
abs_timeout argument specifies a limit on the time for which the call will block.
The abs_timeout argument is a timespec structure (Section 23.4.2) that specifies the
timeout as an absolute value in seconds and nanoseconds since the Epoch. To
perform a relative timeout, we ca
On the other hand, if we perform a nonblocking receive, the call returns
immediately with a failure status:
$ ./pmsg_receive -n /mq
ERROR [EAGAIN/EWOULDBLOCK Resource temporarily unavailable] mq_receive
Listing 52-5: Reading a message from a POSIX message queue
pmsg/pmsg_receive.c
#include <mqueue.h>
#include <fcntl.h>              /* For definition of O_NONBLOCK */
#include "tlpi_hdr.h"

static void
usageError(const char *progName)
{
    fprintf(stderr, "Usage: %s [-n] name\n", progName);
    fprintf(stderr, "    -n   Use O_NONBLOCK flag\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;
    mqd_t mqd;
    unsigned int prio;
    void *buffer;
    struct mq_attr attr;
    ssize_t numRead;

    flags = O_RDONLY;
    while ((opt = getopt(argc, argv, "n")) != -1) {
        switch (opt) {
        case 'n':   flags |= O_NONBLOCK;    break;
        default:    usageError(argv[0]);
        }
    }

    if (optind >= argc)
        usageError(argv[0]);

    mqd = mq_open(argv[optind], flags);
    if (mqd == (mqd_t) -1)
        errExit("mq_open");
The msg_len argument is used by the caller to specify the number of bytes of space
available in the buffer pointed to by msg_ptr.
Regardless of the actual size of the message, msg_len (and thus the size of the buffer
pointed to by msg_ptr) must be greater than or equal to the mq_msgsize attribute of the
queue; otherwise, mq_receive() fails with the error EMSGSIZE. If we don't know the value
of the mq_msgsize attribute of a queue, we can obtain it using mq_getattr().
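The buffer-size rule can be sketched as follows: the receive buffer is allocated from the mq_msgsize value that the queue reports, rather than from the length of any particular message. This is an illustrative sketch only. The helper name mq_roundtrip() and the choice to create, use, and unlink a private queue in one function are assumptions made for the example, not something the text prescribes.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create a new queue called 'name', send 'msg' at priority 'prioIn',
   and read it back into a buffer sized from the queue's mq_msgsize.
   Returns the number of bytes received, or -1 on error.
   (Hypothetical helper, for illustration only.) */
ssize_t
mq_roundtrip(const char *name, const char *msg, unsigned int prioIn,
             unsigned int *prioOut)
{
    struct mq_attr attr;
    ssize_t numRead = -1;
    char *buffer = NULL;
    mqd_t mqd;

    /* NULL attr: queue gets implementation-defined default attributes */
    mqd = mq_open(name, O_CREAT | O_EXCL | O_RDWR | O_NONBLOCK,
                  S_IRUSR | S_IWUSR, NULL);
    if (mqd == (mqd_t) -1)
        return -1;

    if (mq_getattr(mqd, &attr) == -1)       /* Discover mq_msgsize */
        goto done;

    buffer = malloc(attr.mq_msgsize);       /* Large enough for any message */
    if (buffer == NULL)
        goto done;

    if (mq_send(mqd, msg, strlen(msg), prioIn) == -1)
        goto done;

    numRead = mq_receive(mqd, buffer, attr.mq_msgsize, prioOut);

done:
    free(buffer);
    mq_close(mqd);
    mq_unlink(name);
    return numRead;
}
```

With glibc versions before 2.34, programs that use these functions are linked with the -lrt option.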
Listing 52-4:
Writing a message to a POSIX message queue
pmsg/pmsg_send.c
#include <mqueue.h>
#include <fcntl.h>              /* For definition of O_NONBLOCK */
#include "tlpi_hdr.h"

static void
usageError(const char *progName)
{
    fprintf(stderr, "Usage: %s [-n] name msg [prio]\n", progName);
    fprintf(stderr, "    -n          Use O_NONBLOCK flag\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;
    mqd_t mqd;
    unsigned int prio;

    flags = O_WRONLY;
    while ((opt = getopt(argc, argv, "n")) != -1) {
        switch (opt) {
        case 'n':   flags |= O_NONBLOCK;        break;
        default:    usageError(argv[0]);
        }
    }

    if (optind + 1 >= argc)
        usageError(argv[0]);

    mqd = mq_open(argv[optind], flags);
    if (mqd == (mqd_t) -1)
        errExit("mq_open");

    /* The priority argument is optional and defaults to 0 */
    prio = (argc > optind + 2) ? atoi(argv[optind + 2]) : 0;

    if (mq_send(mqd, argv[optind + 1], strlen(argv[optind + 1]), prio) == -1)
        errExit("mq_send");
    exit(EXIT_SUCCESS);
}
pmsg/pmsg_send.c
52.5.2 Receiving Messages
The mq_receive() function removes the oldest message with the highest priority
from the message queue referred to by mqdes
52.5 Exchanging Messages
In this section, we look at the functions that are used to send messages to and receive
messages from a queue.
52.5.1 Sending Messages
The mq_send() function adds the message in the buffer pointed to by msg_ptr to the
message queue referred to by the descriptor mqdes.
The msg_len argument specifies the length of the message pointed to by msg_ptr. This value
must be less than or equal to the mq_msgsize attribute of the queue; otherwise, mq_send()
fails with the error EMSGSIZE. Zero-length messages are permitted.
Each message has a nonnegative integer priority, specified by the msg_prio argument.
Messages are ordered within the queue in descending order of priority (i.e., 0 is the
lowest priority). When a new message is added to the queue, it is placed after any other
messages of the same priority. If an application doesn't need to use message priorities,
it is sufficient to always specify msg_prio as 0.
As noted at the beginning of this chapter, the type attribute of System V messages
provides different functionality. System V messages are always queued in FIFO order, but
msgrcv() allows us to select messages in various ways: in FIFO order, by exact type, or by
highest type less than or equal to some value.
SUSv3 allows an implementation to advertise its upper limit for message priorities,
either by defining the constant MQ_PRIO_MAX
./pmsg_create -cx /mq
In addition to the mq_maxmsg and mq_msgsize fields, which we have already described, the
following fields are returned in the structure pointed to by attr:
mq_flags
These are flags for the open message queue description associated with the descriptor
mqdes. Only one such flag is specified: O_NONBLOCK. This flag is initialized from the oflag
argument of mq_open(), and can be changed using
    /* Parse command-line options */

    while ((opt = getopt(argc, argv, "cm:s:x")) != -1) {
        switch (opt) {
        case 'c':
            flags |= O_CREAT;
            break;
        case 'm':
            attr.mq_maxmsg = atoi(optarg);
            attrp = &attr;
            break;
        case 's':
            attr.mq_msgsize = atoi(optarg);
            attrp = &attr;
            break;
        case 'x':
            flags |= O_EXCL;
            break;
        default:
            usageError(argv[0]);
        }
    }

    if (optind >= argc)
        usageError(argv[0]);

    perms = (argc <= optind + 1) ? (S_IRUSR | S_IWUSR) :
The
mq_msgsize
field defines the upper limit on the size of each message that may
be placed on the queue. This value must be greater than 0.
Figure 52-1 helps clarify a number of details of the use of message queue descriptors
(all of which are analogous to the use of file descriptors):
An open message queue description has an
if (argc != 2 || strcmp(argv[1], "--help") == 0)
usageErr("%s mq-name\n", argv[0]);
if (mq_unlink(argv[1]) == -1)
errExit("mq_unlink");
exit(EXIT_SUCCESS);
pmsg/pmsg_unlink.c
[Figure 52-1 labels: Process A and Process B each have a message queue descriptor table;
entries point to the table of open message queue descriptions (system-wide; flags, ptr to
MQ, other info), whose entries point to the message queue table (system-wide; per-queue
info: MQ attributes, UID & GID, notification).]
(We explain message queue descriptions in Section 52.3.) The child doesn't inherit any of
its parent's message notification registrations.
When a process performs an exec() or terminates, all of its open message queue descriptors
are closed. As a consequence of closing its message queue descriptors, all of the
process's message notification registrations on the corresponding queues are removed.
The mq_close() function closes the message queue descriptor mqdes.
If the calling process has registered via mq_notify() for message notification from the
queue (Section 52.6), then the notification registration is automatically removed, and
another process can subsequently register for message notification from the queue.
A message queue descriptor is automatically closed when a process terminates or calls
exec(). As with file descriptors, we should explicitly close message queue descriptors
that are no longer required, in order to prevent the process from running out of message
queue descriptors.
As for files, closing a message queue doesn't delete it. For that purpose, we need
mq_unlink(), which is the message queue analog of unlink().
Removing a message queue
The mq_unlink() function removes the message queue identified by name, and marks the queue
to be destroyed once all processes cease using it (this may mean immediately, if all
processes that had the queue open have already closed it).
Listing 52-1 demonstrates the use of mq_unlink().
Listing 52-1: Using mq_unlink() to unlink a POSIX message queue
pmsg/pmsg_unlink.c
#include <mqueue.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
#include <mqueue.h>

int mq_close(mqd_t mqdes);
    Returns 0 on success, or -1 on error
#include <mqueue.h>

int mq_unlink(const char *name);
    Returns 0 on success, or -1 on error
One of the purposes of the oflag argument is to determine whether we are opening an
existing queue or creating and opening a new queue. If oflag doesn't include O_CREAT, we
are opening an existing queue. If oflag includes O_CREAT, a new, empty queue is created if
one with the given name doesn't already exist. If oflag specifies both O_CREAT and O_EXCL,
and a queue with the given name already exists, then mq_open() fails.
The oflag argument also indicates the kind of access that the calling process will make to
the message queue, by specifying exactly one of the values O_RDONLY, O_WRONLY, or O_RDWR.
The remaining flag value, O_NONBLOCK, causes the queue to be opened in nonblocking mode.
If a subsequent call to mq_receive() or mq_send() can't be performed without blocking, the
call will fail immediately with the error EAGAIN.
If mq_open() is being used to open an existing message queue, the call requires only two
arguments. However, if O_CREAT is specified in oflag, two further arguments are required:
mode and attr. (If the queue specified by name already exists, these two arguments are
ignored.) These arguments are used as follows:
The mode argument is a bit mask that specifies the permissions to be placed on the new
message queue. The bit values that may be specified are the same as for files (Table 15-4,
on page 295), and, as with open(), the value in mode is masked against the process umask
(Section 15.4.6). To read from a queue (mq_receive()), read permission must be granted to
the corresponding class of user; to write to a queue (mq_send()), write permission is
required.
The attr argument is an mq_attr structure that specifies attributes for the new message
queue. If attr is NULL, the queue is created with implementation-defined default
attributes. We describe the mq_attr structure in Section 52.4.
This chapter describes POSIX message queues, which allow processes to exchange data in the
form of messages. POSIX message queues are similar to their System V counterparts, in that
data is exchanged in units of whole messages. However, there are also some notable
differences:
POSIX message queues are reference counted. A queue that is marked for deletion is removed
only after it is closed by all processes that are currently using it.
Each System V message has an integer type, and messages can be selected in a variety of
ways using msgrcv(). By contrast, POSIX messages have an associated priority, and messages
are always strictly queued (and thus received) in priority order.
POSIX message queues provide a feature that allows a process to be asynchronously notified
when a message is available on a queue.
POSIX message queues are a relatively recent addition to Linux. The required
implementation support was added in kernel 2.6.6 (in addition, glibc 2.3.4 or later is
required).
POSIX message queue support is an optional kernel component that is configured via the
CONFIG_POSIX_MQUEUE option.
52.1 Overview
The main functions in the POSIX message queue API are the following:
The mq_open() function creates a new message queue or opens an existing
Chapter 51
Despite the SUSv3 specification for POSIX IPC object names, the various implementations
follow different conventions for naming IPC objects. These differences require us to do a
little extra work to write portable applications.
Various details of POSIX IPC are not specified in SUSv3. In particular, no commands are
specified for displaying and de
Introduction to POSIX IPC
As with System V IPC, POSIX IPC objects have kernel persistence. Once created, an object
continues to exist until it is unlinked or the system is shut down. This allows a process
to create an object, modify its state, and then exit, leaving the object to be accessed by
some process that is started at a later time.
Listing and removing POSIX IPC objects via the command line
System V IPC provides two commands, ipcs and ipcrm
Depending on the type of object, oflag may also include one of the values O_RDONLY,
O_WRONLY, or O_RDWR, with meanings similar to open(). Additional flags are allowed for
some IPC mechanisms.
The remaining argument, mode, is a bit mask specifying the permissions to be placed on a
new object, if one is created by the call (i.e., O_CREAT was specified and the object did
not already exist). The values that may be specified for mode are the same as for files
(Table 15-4, on page 295). As with the open() system call, the permissions mask in mode is
masked against the process umask (Section 15.4.6). The ownership and group ownership of a
new IPC object are taken from the effective user and group IDs of the process making the
IPC open call. (To be strictly accurate, on Linux, the ownership of a new POSIX IPC object
is determined by the process's file-system IDs, which normally have the same value as the
corresponding effective IDs. Refer to Section 9.5.)
On systems where IPC objects appear in the standard file system, SUSv3 permits an
implementation to set the group ID of a new IPC object to the group ID of the parent
directory.
For POSIX message queues and semaphores, there is an IPC close call that indicates that
the calling process has finished using the object and the system may deallocate any
resources that were associated with the object for this process. A POSIX shared memory
object is closed by unmapping it with munmap().
IPC objects are automatically closed if the process terminates or performs an exec().
IPC object permissions
IPC objects have a permissions mask that is the same as for files. Permissions for
accessing an IPC object are similar to those for accessing files (Section 15.4.3), except
that execute permission has no meaning for POSIX IPC objects.
Since kernel 2.6.19, Linux supports the use of access control lists (ACLs) for setting the
permissions on POSIX shared memory objects and named semaphores. Currently, ACLs are not
supported for POSIX message queues.
On Linux, names for POSIX shared memory and message queue objects are limited to NAME_MAX
(255) characters. For semaphores, the limit is 4 characters less, since the implementation
prepends the string sem. to the semaphore name.
SUSv3 doesn't prohibit names of a form other than /myobject, but says that the semantics of
such names are implementation-defined. The rules for creating IPC object names on some
systems are different. For example, on Tru64 5.1, IPC object names are created as names
within the standard file system, and the name is
Shared memory enables multiple processes to share the same region of memory. As with
System V shared memory, POSIX shared memory provides fast IPC. Once one process has updated
the shared memory, the change is immediately visible to other processes sharing the same
region.
This chapter provides an overview of the POSIX IPC facilities, focusing on their common
features.
51.1 API Overview
The three POSIX IPC mechanisms have a number of common features. Table 51-1 summarizes
their APIs, and we go into th
INTRODUCTION TO POSIX IPC
The POSIX.1b realtime extens
values from multiple semaphores
in a System V semaphore set).
Chapter 50
50.5 Summary
In this chapter, we considered various operations that can be performed on a process's
virtual memory:
The mprotect() system call changes the protection on a region of virtual memory.
The mlock() and mlockall() system calls lock part or all of a process's virtual address
space, respectively, into physical memory.
The mincore() system call reports which pages in a virtual memory region are currently
resident in physical memory.
The madvise() system call and the posix_madvise() function allow a process to advise the
kernel about the process's expected patterns of memory use.
50.6 Exercises
50-1. Verify the effect of the RLIMIT_MEMLOCK resource limit by writing a program that sets
a value for this limit and then attempts to lock more memory than the limit.
50-2. Write a program to verify the operation of the madvise() MADV_DONTNEED operation for
a writable MAP_PRIVATE
Virtual Memory Operations
MADV_SEQUENTIAL
Pages in this range will be accessed once, sequentially. Thus, the kernel can aggressively
read ahead, and pages can be quickly freed after they have been accessed.
MADV_WILLNEED
Read pages in this region ahead, in preparation for future access. The MADV_WILLNEED
operation has an effect similar to the Linux-specific readahead() system call and the
posix_fadvise() POSIX_FADV_WILLNEED operation.
MADV_DONTNEED
The calling process no longer requires the pages in this region to be memory-resident. The
precise effect of this flag varies across UNIX implementations. We first note the behavior
on Linux. For a MAP_PRIVATE region, the mapped pages are explicitly discarded, which means
that modifications to the pages are lost. The virtual memory address range remains
accessible, but the next access of each page will result in a page fault reinitializing
the page, either with the contents of the file from which it is mapped or with zeros in
the case of an anonymous mapping. This can be used as a means of explicitly reinitializing
the contents of a MAP_PRIVATE region. For a MAP_SHARED region, the kernel may discard
modified pages in some circumstances, depending on the architecture (this behavior doesn't
occur on x86). Some other UNIX implementations also behave in the same way as Linux.
However, on some UNIX implementations, MADV_DONTNEED simply informs the kernel that the
specified pages can be swapped out if necessary. Portable applications should not rely on
Linux's destructive semantics for MADV_DONTNEED.
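The Linux behavior of MADV_DONTNEED on a private anonymous mapping can be observed with a short experiment: fill a page, advise the kernel, and read the page again. The sketch below assumes Linux semantics as described above; on other UNIX implementations the data may survive. The function name dontneed_zeroes_private_page() is hypothetical.

```c
#define _DEFAULT_SOURCE         /* Expose madvise() and MAP_ANONYMOUS in glibc */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a private anonymous page reads back as zeros after
   MADV_DONTNEED, 0 if the old contents survived, or -1 on error.
   (Hypothetical helper, for illustration only.) */
int
dontneed_zeroes_private_page(void)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    char *addr;
    int result;

    if (pageSize == -1)
        return -1;

    addr = mmap(NULL, pageSize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    memset(addr, 'x', pageSize);            /* Modify the page */

    if (madvise(addr, pageSize, MADV_DONTNEED) == -1) {
        munmap(addr, pageSize);
        return -1;
    }

    /* The next access faults in a fresh page; on Linux it is zero-filled */
    result = (addr[0] == 0 && addr[pageSize - 1] == 0) ? 1 : 0;

    munmap(addr, pageSize);
    return result;
}
```

A return value of 0 would indicate an implementation that treats the advice as a hint only, which is the portability concern raised above.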
Linux 2.6.16 added three new nonstandard advice values: MADV_DONTFORK, MADV_DOFORK, and
MADV_REMOVE. Linux 2.6.32 and 2.6.33 added another four nonstandard advice values:
MADV_HWPOISON, MADV_SOFT_OFFLINE, MADV_MERGEABLE, and MADV_UNMERGEABLE. These values are
used in special circumstances and are described in the madvise(2) manual page.
Most UNIX implementations provide a version of madvise(), typically allowing at least the
advice constants described above. However, SUSv3 standardizes this API under a different
name, posix_madvise(), and prefixes the corresponding advice constants with the string
POSIX_. Thus, the constants are POSIX_MADV_NORMAL, POSIX_MADV_RANDOM,
POSIX_MADV_SEQUENTIAL, POSIX_MADV_WILLNEED, and POSIX_MADV_DONTNEED. This alternative
interface is implemented in glibc (version 2.2 and later) by calls to madvise(), but it is
not available on all UNIX implementations.
SUSv3 says that posix_madvise() should not affect the semantics of a program. However, in
glibc versions before 2.7, the POSIX_MADV_DONTNEED operation is implemented using
MADV_DONTNEED, which does affect the semantics of a program, as described earlier. Since
glibc 2.7, the posix_madvise() wrapper implements POSIX_MADV_DONTNEED to do nothing, so
that it does not affect the semantics of a program.
The following shell session shows a sample run of the program in Listing 50-2. In this
example, we allocate 32 pages, and in each group of 8 pages, we lock 3 consecutive pages:
su
Password:
./memlock 32 8 3
Allocated 131072 (0x20000) bytes starting at 0x4014a000
Before mlock:
0x4014a000: ................................
After mlock:
0x4014a000: ***.....***.....***.....***.....
In the program output, dots represent pages that are not resident in memory, and asterisks
represent pages that are resident in memory. As we can see from the final line of output,
3 out of each group of 8 pages are memory-resident.
In this example, we assumed superuser privilege so that the program can use mlock(). This
is not necessary in Linux 2.6.9 and later if the amount of memory to be locked falls
within the RLIMIT_MEMLOCK soft resource limit.
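The residency display in the session above is derived from the vector that mincore() fills in, one byte per page, with the least significant bit set for a resident page. As a minimal sketch of the same idea, the following function counts resident pages instead of printing them. The name count_resident_pages() is hypothetical, and the caller is assumed to pass a page-aligned address.

```c
#define _DEFAULT_SOURCE         /* Expose mincore() in glibc */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages in [addr, addr + length) that are resident in memory.
   'addr' must be page-aligned. Returns the count, or -1 on error.
   (Hypothetical helper, for illustration only.) */
long
count_resident_pages(void *addr, size_t length)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    size_t numPages, j;
    unsigned char *vec;
    long resident = 0;

    if (pageSize == -1)
        return -1;
    if (length == 0)
        return 0;

    /* mincore() reports on whole pages, so round the length up */
    numPages = (length + (size_t) pageSize - 1) / (size_t) pageSize;

    vec = malloc(numPages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, length, vec) == -1) {
        free(vec);
        return -1;
    }

    for (j = 0; j < numPages; j++)
        if (vec[j] & 1)                     /* Low bit set: page is resident */
            resident++;

    free(vec);
    return resident;
}
```

Applied to a freshly created anonymous mapping, the count normally rises as pages are touched or locked.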
50.4 Advising Future Memory Usage Patterns: madvise()
The madvise() system call is used to improve the performance of an application by
informing the kernel about the calling process's likely usage of the pages in the range
starting at addr and continuing for length bytes. The kernel may use this information to
improve the efficiency of I/O performed on the file mapping that underlies the pages. (See
Section 49.4 for a discussion of file mappings.) On Linux, madvise() has been available
since kernel 2.4.
The value specified in addr must be page-aligned, and length is effectively rounded up to
the next multiple of the system page size. The advice argument is one of the following:
MADV_NORMAL
This is the default behavior. Pages are transferred in clusters (a small multiple of the
system page size). This results in some read-ahead and read-behind.
MADV_RANDOM
Pages in this region will be accessed randomly, so read-ahead will yield no benefit. Thus,
the kernel should fetch the minimum amount of data on
#define _BSD_SOURCE
#include <sys/mman.h>

int madvise(void *addr, size_t length, int advice);
    Returns 0 on success, or -1 on error
    if (mincore(addr, length, vec) == -1)
        errExit("mincore");

    /* Print one character per page: '*' if resident, '.' if not */
    for (j = 0; j < numPages; j++) {
        if (j % 64 == 0)
            printf("%s%10p: ", (j == 0) ? "" : "\n", addr + (j * pageSize));
        printf("%c", (vec[j] & 1) ? '*' : '.');
    }
    printf("\n");

    free(vec);
}

int
main(int argc, char *argv[])
{
    char *addr;
    size_t len, lockLen;
    long pageSize, stepSize, j;

    if (argc != 4 || strcmp(argv[1], "--help") == 0)
        usageErr("%s num-pages lock-page-step lock-page-len\n", argv[0]);

    pageSize = sysconf(_SC_PAGESIZE);
    if (pageSize == -1)
        errExit("sysconf(_SC_PAGESIZE)");
The mincore() system call returns memory-residence information about pages in the virtual
address range starting at addr and running for length bytes. The address supplied in addr
must be page-aligned, and, since
The mlockall() system call locks all of the currently mapped pages in a process's virtual
address space, all of the pages mapped in the future, or both, according to the flags bit
mask, which is specified by ORing
The reason that RLIMIT_MEMLOCK has different semantics for System V shared memory is that
a shared memory segment can continue to exist even when it is not attached by any process.
(It is removed only after an explicit IPC_RMID operation, and then only after all
processes have detached it from their address space.)
Locking and unlocking memory regions
A process can use mlock() and munlock() to lock and unlock regions of memory.
The mlock() system call locks all of the pages of the calling process's virtual address
range starting at addr and continuing for length bytes. Unlike the corresponding argument
passed to several other memory-related system calls, addr does not need to be
page-aligned: the kernel locks pages starting at the next page boundary below addr.
However, SUSv3 optionally allows an implementation to require that addr be a multiple of
the system page size, and portable applications should ensure that this is so when calling
mlock() and munlock().
Because locking is done in units of whole pages, the end of the locked region is the next
page boundary greater than length plus addr. For example, on a system where the page size
is 4096 bytes, the call mlock(2000, 4000) will lock bytes 0 through to 8191.
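The page rounding in this example is plain integer arithmetic and can be checked without calling mlock() at all. The following sketch computes the first and last byte covered for a given address, length, and page size. The structure locked_range and the function mlock_page_span() are hypothetical names introduced for the illustration, and a nonzero length and page size are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* First and last byte addresses covered when a range is locked in
   units of whole pages. (Hypothetical type, for illustration only.) */
struct locked_range {
    uintptr_t first;
    uintptr_t last;
};

/* Compute the page-rounded span for a request of 'length' bytes at
   'addr'. Assumes length > 0 and pageSize > 0. Performs arithmetic
   only; no memory is actually locked. */
struct locked_range
mlock_page_span(uintptr_t addr, size_t length, size_t pageSize)
{
    struct locked_range r;
    uintptr_t end = addr + length;          /* One past last requested byte */

    r.first = addr - (addr % pageSize);     /* Round start down to a page */
    r.last = ((end + pageSize - 1) / pageSize) * pageSize - 1;
                                            /* Round end up, then step back */
    return r;
}
```

For the values in the text, mlock_page_span(2000, 4000, 4096) yields a first byte of 0 and a last byte of 8191.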
We can find out how much memory a process currently has locked by inspecting the VmLck
entry of the Linux-specific /proc/PID/status file.
After a successful mlock() call, all of the pages in the specified range are guaranteed to
be locked and resident in physical memory. The mlock() system call fails if there is
insufficient physical memory to lock all of the requested pages or if the request violates
the RLIMIT_MEMLOCK soft resource limit.
We show an example of the use of mlock() in Listing 50-2.
The munlock() system call performs the converse of mlock(), removing a memory lock
previously established by the calling process. The addr and length arguments
In this section, we look at the system calls used for locking and unlocking part or all of
a process's virtual memory. However, before doing this, we first look at a resource limit
that governs memory locking.
The RLIMIT_MEMLOCK resource limit
In Section 36.3, we briefly described the RLIMIT_MEMLOCK limit, which defines a limit on
the number of bytes that a process can lock into memory. We now consider this
int
main(int argc, char *argv[])
{
    char cmd[CMD_SIZE];
    char *addr;

    /* Create an anonymous mapping with all access denied */

    addr = mmap(NULL, LEN, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    /* Display line from /proc/self/maps corresponding to mapping */

    printf("Before mprotect()\n");
The value given in
ation by directly parsing /proc/self/maps, but we used the call to system() because it
results in a shorter program.) When we run this program, we see the following:
./t_mprotect
Before mprotec
VIRTUAL MEMORY OPERATIONS
This chapter looks at various system calls that perform operations on a process's virtual
address space:
The mprotect() system call changes the protection
The mlock() and mlockall() system calls lock a region of virtual memory into physical
memory, thus preventing it from being swapped out.
mincore()
Chapter 49
Swap space overcommitting allows the system to allocate more memory to processes than is
actually available in RAM and swap space. Overcommitting is possible because, typically,
each process does not make full use of its allocation. Overcommitting can be controlled on
a per-mmap() basis using the MAP_NORESERVE flag, and on a system-wide basis using /proc
files.
The mremap() system call allows an existing mapping to be resized. The remap_file_pages()
system call allows the creation of nonlinear file mappings.
Further information
Information about the implementation of mmap()
the size of the input file, which can then be used to size the required memory mappings,
and use ftruncate()
Memory Mappings
individual pages. It was intended that remap_file_pages() would allow permissions on
individual pages within a VMA to be changed, but this facility has not so far been
implemented.
The flags argument is currently unused.
As currently implemented, remap_file_pages() can be applied only to shared (MAP_SHARED)
mappings.
The remap_file_pages() system call is Linux-specific; it is not specified in SUSv3 and is
not available on other UNIX implementations.
49.12 Summary
The mmap() system call creates a new memory mapping in the calling process's virtual
address space. The munmap() system call performs the converse operation, removing a mapping
from a process's address space.
A mapping may be of two types: file-based or anonymous. A file mapping maps the contents
of a file region into the process's virtual address space. An anonymous mapping (created
by using the MAP_ANONYMOUS flag or by mapping /dev/zero) doesn't have a corresponding file
region; the bytes of the mapping are initialized to 0.
Mappings can be either private (MAP_PRIVATE) or shared (MAP_SHARED). This distinction
The pgoff and size arguments identify a file region whose position in memory is to be
changed. The pgoff argument specifies the start of the file region in units of the
addr = mmap(0, 3 * ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
The following calls would then create the nonlinear mapping shown in Figure 49-5:
remap_file_pages(addr, ps, 0, 2, 0);
                        /* Maps page 2 of file into page 0 of region */
remap_file_pages(addr + 2 * ps, ps, 0, 0, 0);
                        /* Maps page 0 of file into page 2 of region */
Figure 49-5: A nonlinear file mapping
[Figure: a mapped file (12288 bytes; page 0, page 1, and page 2 of file) and a memory
region (12288 bytes) drawn in order of increasing virtual addresses. Page 0 of the mapping
maps page 2 of the file, page 1 of the mapping maps page 1 of the file, and page 2 of the
mapping maps page 0 of the file.]
There are two other arguments to remap_file_pages()
above, a portable application should avoid trying to create a new mapping at a fixed
address. The first step avoids the portability problem, because we let the kernel select a
contiguous address range, and then create new mappings within that address range.
From Linux 2.6 onward, the remap_file_pages() system call, which we describe in the next
section, can also be used to achieve the same effect. However, the use of MAP_FIXED is
more portable than remap_file_pages(), which is Linux-specific.
49.11 Nonlinear Mappings: remap_file_pages()
File mappings created with mmap() are linear: there is a sequential, one-to-one
correspondence between the pages of the mapped file and the pages of the memory region.
For most applications, a linear mapping suffices. However, some applications need to
create large numbers of nonlinear mappings: mappings where the pages of the file appear in
a different order within contiguous memory. We show an example of a nonlinear mapping in
Figure 49-5.
We described one way of creating nonlinear mappings in the previous section: using
multiple calls to mmap() with the MAP_FIXED flag. However, this approach doesn't scale
well. The problem is that each of these mmap() calls creates a separate kernel virtual
memory area (VMA) data st
killer. Other factors that increase a process's likelihood of selection are forking to
create many child processes and having a low-priority nice value (i.e., one that is
greater than 0). The kernel disfavors killing the following:
processes that are privileged, since they are probably performing important tasks;
processes that are performing raw device access, since killing them may leave the device
in an unusable state; and
processes that have been running for a long time or have consumed a lot of CPU, since
killing them would result in a lot of lost work.
the OOM killer delivers a SIGKILL signal.
The Linux-specific /proc/PID/oom_score file, available since kernel 2.6.11, shows the
weighting that the kernel gives to a process if it is necessary to invoke the OOM killer.
The greater the value in this file, the more likely the process is to be selected, if
necessary, by the OOM killer. The Linux-specific file, also available since kernel 2.6.11,
can be used to influence the oom_score of a process. This file can
Denying obvious overcommits means that new mappings whose size doesn't exceed the amount
of currently available free memory are permitted. Existing allocations may be
overcommitted (since they may not be using all of the pages that they mapped).
Since Linux 2.6, a value of 1 has the same meaning as a positive value in earlier kernels,
but the value 2 (or greater) causes strict overcommitting to be employed. In this case,
the kernel performs strict accounting on all mmap() allocations and limits the system-wide
total of all such allocations to be less than or equal to:
[swap size] + [RAM size] * overcommit_ratio / 100
The overcommit_ratio value is an integer, expressing a percentage, contained in the
Linux-specific /proc/sys/vm/overcommit_ratio file. The default value contained in this
file is 50, meaning that the kernel can overallocate up to 50% of the size of the system's
RAM, and this will be successful, as long as not all processes try to use their full
allocation.
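The strict overcommit limit is easy to evaluate by hand. As a worked illustration, the sketch below applies the formula to sizes expressed in bytes; the function name strict_commit_limit() is hypothetical, and the sample figures in the usage note are made up for the example rather than read from a real system.

```c
#include <stdint.h>

/* Evaluate [swap size] + [RAM size] * overcommit_ratio / 100 for sizes
   given in bytes. Assumes ramBytes * overcommitRatio fits in 64 bits.
   (Hypothetical helper, for illustration only.) */
uint64_t
strict_commit_limit(uint64_t swapBytes, uint64_t ramBytes,
                    unsigned int overcommitRatio)
{
    return swapBytes + ramBytes * overcommitRatio / 100;
}
```

For instance, with 2 GiB of swap, 8 GiB of RAM, and the default ratio of 50, the limit works out to 2 GiB + 4 GiB = 6 GiB.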
Note that overcommit monitoring comes into play only for the following types of mappings:
private writable mappings (both file and anonymous mappings), for which the swap cost of
the mapping is equal to the size of the mapping for each process that employs the mapping;
and
shared anonymous mappings, for which the swap cost of the mapping is the size of the
mapping (since all processes share that mapping).
Reserving swap space for a read-only private mapping is unnecessary: since the contents of
the mapping can't be modified, there is no need to employ swap space. Swap space is also
not required for shared file mappings, because the mapped file itself acts as the swap
space for the mapping.
When a child process inherits a mapping across a fork(), it inherits the MAP_NORESERVE
On Linux, the realloc() function uses mremap() to efficiently reallocate large blocks of
memory that malloc() previously allocated using mmap() MAP_ANONYMOUS. (We mentioned this
feature of the glibc malloc() implementation in Section 49.7.) Using mremap() for this
task makes it possible to avoid copying of bytes during the reallocation.
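The effect of resizing a mapping with mremap() can be seen in a small Linux-only experiment: an anonymous mapping is grown, and its existing contents remain available at the (possibly new) address. This is only a sketch of the idea and not the glibc realloc() code; the function name mremap_preserves_contents() is hypothetical, and the MREMAP_MAYMOVE flag is used so that the kernel is allowed to relocate the mapping.

```c
#define _GNU_SOURCE             /* Expose mremap() and MREMAP_MAYMOVE */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Grow a one-page anonymous mapping to four pages and check that the
   original contents are still present. Returns 1 if they are, 0 if
   not, or -1 on error. (Hypothetical helper, for illustration only.) */
int
mremap_preserves_contents(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    char *addr, *bigger;
    int ok;

    if (ps == -1)
        return -1;

    addr = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    strcpy(addr, "still here");

    /* The kernel may move the mapping; the old address is then invalid */
    bigger = mremap(addr, ps, 4 * ps, MREMAP_MAYMOVE);
    if (bigger == MAP_FAILED) {
        munmap(addr, ps);
        return -1;
    }

    bigger[4 * ps - 1] = 'z';               /* The new tail is usable */
    ok = (strcmp(bigger, "still here") == 0) ? 1 : 0;

    munmap(bigger, 4 * ps);
    return ok;
}
```

Because the kernel adjusts page-table entries rather than copying data, the cost of such a resize does not depend on how many bytes the mapping already holds.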
and Swap Space Overcommitting
Some applications create large (usually private anonymous) mappings, but use only a small part of the mapped region. For example, certain types of scientific applications allocate a very large array, but operate on only a few widely separated elements of the array (a so-called sparse array).
If the kernel always allocated (or reserved) enough swap space for the whole of such mappings, then a lot of swap space would potentially be wasted. Instead, the kernel can reserve swap space for the pages of a mapping only as they are actually required (i.e., when the application accesses a page). This approach is called lazy swap reservation, and has the advantage that the total virtual memory used by applications can exceed the total size of RAM plus swap space.
To put things another way, lazy swap reservation allows swap space to be overcommitted. This works fine, as long as processes don't all attempt to access the entire range of their mappings. However, if all applications attempt to access the full range of their mappings, RAM and swap space will be exhausted. In this situation, the kernel reduces memory pressure by killing one or more of the processes on the system. Ideally, the kernel attempts to select the process causing the memory problems (see the discussion of the OOM killer below), but this isn't guaranteed. For this reason, we may choose to prevent lazy swap reservation, instead forcing the system to allocate all of the necessary swap space when the mapping is created.
How the kernel handles reservation of swap space is controlled by the use of the MAP_NORESERVE flag when calling mmap(), and via /proc interfaces that affect the system-wide operation of swap space overcommitting. These factors are summarized in Table 49-4.
The Linux-specific /proc/sys/vm/overcommit_memory file contains an integer value that controls the kernel's handling of swap space overcommits. Linux versions before 2.6 differentiated only two values in this file: 0, meaning deny obvious overcommits (subject to the use of the MAP_NORESERVE flag), and greater than 0, meaning that overcommits should be permitted in all cases.
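As an illustration, a program can inspect the current policy by reading this file. The sketch below assumes a Linux system with /proc mounted; the function name read_overcommit_mode() is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: return the value stored in
   /proc/sys/vm/overcommit_memory, or -1 if the file can't be
   opened or parsed (e.g., on a non-Linux system). */
int
read_overcommit_mode(void)
{
    FILE *fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}
```

Changing the value requires privilege; reading it normally does not.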
Table 49-4: Handling of swap space reservation during mmap()
overcommit_memory      MAP_NORESERVE specified in mmap() call?
value                  No                          Yes
0                      Deny obvious overcommits    Allow overcommits
1                      Allow overcommits           Allow overcommits
2 (since Linux 2.6)    Strict overcommitting       Strict overcommitting
Memory Mappings
1037
        if (munmap(addr, sizeof(int)) == -1)
            errExit("munmap");
        exit(EXIT_SUCCESS);
    }
}
mmap/anon_mmap.c
49.8 Remapping a Mapped Region: mremap()
On most UNIX implementations, once a mapping has been created, its location and size can't be changed. However, Linux provides the (nonportable) mremap() system call, which permits such changes.
The old_address and old_size arguments specify the location and size of an existing mapping that we wish to expand or shrink. The address specified in old_address must be page-aligned, and is normally
Listing 49-3: Sharing an anonymous mapping between parent and child processes
mmap/anon_mmap.c
#ifdef USE_MAP_ANON
#define _BSD_SOURCE             /* Get MAP_ANONYMOUS definition */
#endif
#include <sys/wait.h>
#include <sys/mman.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int *addr;                  /* Pointer to shared memory region */

#ifdef USE_MAP_ANON             /* Use MAP_ANONYMOUS */
    addr = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");
#else                           /* Map /dev/zero */
    int fd;

    fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        errExit("open");

    addr = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    if (close(fd) == -1)        /* No longer needed */
        errExit("close");
#endif

    *addr = 1;                  /* Initialize integer in mapped region */

    switch (fork()) {           /* Parent and child share mapping */
    case -1:
        errExit("fork");

    case 0:                     /* Child: increment shared integer and exit */
        printf("Child started, value = %d\n", *addr);
        (*addr)++;
        if (munmap(addr, sizeof(int)) == -1)
            errExit("munmap");
        exit(EXIT_SUCCESS);

    default:                    /* Parent: wait for child to terminate */
        if (wait(NULL) == -1)
            errExit("wait");
        printf("In parent, value = %d\n", *addr);
MAP_PRIVATE anonymous mappings
MAP_PRIVATE anonymous mappings are used to allocate blocks of process-private memory initialized to 0. We can use the /dev/zero technique to create a MAP_PRIVATE anonymous mapping as follows:
    fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        errExit("open");
    addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");
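For comparison, the same kind of mapping can be created with MAP_ANONYMOUS, with no file descriptor to open or close. This is a sketch only: the function name alloc_zeroed() is hypothetical, and we assume a glibc recent enough that _DEFAULT_SOURCE (the successor to _BSD_SOURCE and _SVID_SOURCE since glibc 2.19) exposes MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE         /* Assumption: exposes MAP_ANONYMOUS on
                                   current glibc versions */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate 'length' bytes of zero-filled,
   process-private memory. Returns NULL on failure. */
void *
alloc_zeroed(size_t length)
{
    /* fd is given as -1 for portability, even though Linux ignores it */
    void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (addr == MAP_FAILED) ? NULL : addr;
}
```

The caller releases the memory with munmap(addr, length), using the same length that was passed to alloc_zeroed().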
The glibc implementation of malloc() uses MAP_PRIVATE anonymous mappings to allocate blocks of memory larger than MMAP_THRESHOLD bytes. This makes it possible to efficiently deallocate such blocks (via munmap()) if they are later given to free(). (It also reduces the possibility of memory fragmentation when repeatedly allocating and deallocating large blocks of memory.) MMAP_THRESHOLD is 128 kB by default.
MAP_UNINITIALIZED (since Linux 2.6.33)
    Specifying this flag prevents the pages of an anonymous mapping from being zeroed. It provides a performance benefit, but carries a security risk, because the allocated pages may contain sensitive information left by a previous process. This flag is thus only intended for use on embedded systems, where performance may be critical, and the entire system is under the control of the embedded application(s). This flag is only honored if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option.
49.7 Anonymous Mappings
An anonymous mapping is one that doesn't have a corresponding file. In this section, we show how to create anonymous mappings, and look at the purposes served by private and shared anonymous mappings.
MAP_ANONYMOUS and /dev/zero
On Linux, there are two different, equivalent methods of creating an anonymous mapping with mmap():
Specify MAP_ANONYMOUS in flags and specify fd as -1. (On Linux, the value of fd is ignored when MAP_ANONYMOUS is specified. However, some UNIX implementations require fd to be -1 when employing MAP_ANONYMOUS, and portable applications should ensure that they do this.)
We must define either the _BSD_SOURCE or the _SVID_SOURCE feature test macros
49.6 Additional mmap() Flags
In addition to MAP_PRIVATE and MAP_SHARED, Linux allows a number of other values to be included (ORed) in the mmap() flags argument. Table 49-3 summarizes these values. Other than MAP_PRIVATE and MAP_SHARED, only the MAP_FIXED flag is specified in SUSv3.
The following list provides further details on the flags values listed in Table 49-3 (other than MAP_PRIVATE and MAP_SHARED, which have already been discussed):
MAP_ANONYMOUS
    Create an anonymous mapping; that is, a mapping that is not backed by a file. We describe this flag further in Section 49.7.
MAP_FIXED
    We describe this flag in Section 49.10.
MAP_HUGETLB (since Linux 2.6.32)
    This flag serves the same purpose for mmap() as the
Possible values for the flags argument include one of the following:
MS_SYNC
    Perform a synchronous file write. The call blocks until all modified pages of the memory region have been written to the disk.
MS_ASYNC
    Perform an asynchronous file write. The modified pages of the memory region are written to the disk at some later point and are immediately made visible to other processes performing a read() on the corresponding file region.
Another way of distinguishing these two values is to say that after an MS_SYNC operation, the memory region is synchronized with the disk, while after an MS_ASYNC operation, the memory region is merely synchronized with the kernel buffer cache.
If we take no further action after an MS_ASYNC operation, then the modified pages in the memory region will eventually be flushed as part of the automatic buffer flushing performed by the pdflush kernel thread (kupdated in Linux 2.4 and earlier).
On a system with a unified virtual memory system, such as Linux, the views of a file obtained via a mapping and via I/O system calls (read(), write(), and so on) are always consistent, and the only use of msync() is to force the contents of a mapped region to be flushed to disk.
However, a unified virtual memory system is not required by SUSv3 and is not employed on all UNIX implementations. On such systems, a call to msync() is required to make changes to the contents of a mapping visible to other processes that read() the file, and the MS_INVALIDATE flag is required to perform the converse action of making writes to the file by another process visible in the mapped region. Multiprocess applications that employ both mmap() and I/O system calls to operate on the same file should be designed to make appropriate use of msync() if they are to be portable to systems that don't have a unified virtual memory system.
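The portable pattern described above can be sketched as follows: modify the shared mapping, call msync() with MS_SYNC, and only then rely on the file contents as seen through I/O system calls. The function name msync_roundtrip() and the scratch-file path are hypothetical, chosen for illustration.

```c
#define _DEFAULT_SOURCE         /* For mkstemp(), pread(), ftruncate() */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demonstration: copy 'msg' into a shared mapping of a
   scratch file, flush it with msync(MS_SYNC), then read the file back
   with pread(). Returns 0 if the file contains 'msg', otherwise -1. */
int
msync_roundtrip(const char *msg)
{
    char path[] = "/tmp/msync_demo_XXXXXX";
    size_t len = strlen(msg) + 1;
    char buf[128];
    char *map;
    int fd, ok = -1;

    if (len > sizeof(buf))
        return -1;
    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* File persists until 'fd' is closed */

    if (ftruncate(fd, (off_t) len) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);      /* Modify the file via the mapping */
    if (msync(map, len, MS_SYNC) == 0 &&
            pread(fd, buf, len, 0) == (ssize_t) len &&
            memcmp(buf, msg, len) == 0)
        ok = 0;

    munmap(map, len);
    close(fd);
    return ok;
}
```

On Linux the msync() call is not needed for the pread() to see the new data, but including it costs little and keeps the code correct on systems without a unified virtual memory system.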
However, the situation is complicated by the limited granularity of memory protections provided by some hardware architectures (Section 49.2). For such architectures, we make the following observations:
All combinations of memory protection are compatible with opening the file with the O_RDWR flag.
No combination of memory protections, not even just PROT_WRITE, is compatible with a file opened O_WRONLY (the error EACCES results). This is consistent with the fact that some hardware architectures don't allow us write-only access to a page. As noted in Section 49.2, PROT_WRITE implies PROT_READ on those architectures, which means that if the page can be written, then it can also be read. A read operation is incompatible with O_WRONLY, which must not reveal the original contents of the file.
The results when a file is opened with the O_RDONLY
Since the size of the mapping is not a multiple of the system page size, it is rounded up to the next multiple of the system page size. Because the file is larger than this rounded-up size, the corresponding bytes of the file are mapped as shown in Figure 49-3.
Attempts to access bytes beyond the end of the mapping result in the generation of a SIGSEGV signal (assuming that there is no other mapping at that location). The default action for this signal is to terminate the process with a core dump.
When the mapping extends beyond the end of the underlying file (see Figure 49-4), the situation is more complex. As before, because the size of the mapping is not a multiple of the system page size, it is rounded up. However, in this case, while the bytes in the rounded-up region (i.e., bytes 2200 to 4095 in the diagram) are accessible, they are not mapped to the underlying file (since no corresponding bytes exist in the file). Instead, they are initialized to 0 (SUSv3 requires this). These bytes will nevertheless be shared with other processes mapping the file, if they specify a sufficiently large length argument. Changes to these bytes are not written to the file.
If the mapping includes pages beyond the rounded-up region (i.e., bytes 4096 and beyond in Figure 49-4), then attempts to access addresses in these pages result in the generation of a SIGBUS signal, which warns the process that there is no region of the file corresponding to these addresses. As before, attempts to access addresses beyond the end of the mapping result in the generation of a SIGSEGV signal.
From the above description, it may appear pointless to create a mapping whose size exceeds that of the underlying file. However, by extending the size of the file (e.g., using ftruncate() or write()), we can render previously inaccessible parts of such a mapping usable.
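The following sketch illustrates this last point under stated assumptions: the function name grow_file_under_mapping() and the scratch-file path are hypothetical, and we assume a file system that permits mapping an empty file (as common Linux file systems do). The mapping is created first, while every page lies beyond the end of the file, and becomes usable only after ftruncate() extends the file.

```c
#define _DEFAULT_SOURCE         /* For mkstemp(), pread(), ftruncate() */
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demonstration: map two pages of an empty scratch file,
   extend the file to cover the mapping, and then write through the
   mapping. Returns 0 if the byte written is found in the file. */
int
grow_file_under_mapping(void)
{
    char path[] = "/tmp/mmap_grow_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    size_t map_len;
    char *map, byte = 0;
    int fd, ok = -1;

    if (page <= 0)
        return -1;
    map_len = 2 * (size_t) page;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* File persists until 'fd' is closed */

    /* The file is empty, so all pages of this mapping lie beyond the
       end of the file; touching them now would raise SIGBUS */
    map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    if (ftruncate(fd, (off_t) map_len) == 0) {  /* Extend the file */
        map[map_len - 1] = 'Z';                 /* Now a valid access */
        if (msync(map, map_len, MS_SYNC) == 0 &&
                pread(fd, &byte, 1, (off_t) map_len - 1) == 1 &&
                byte == 'Z')
            ok = 0;
    }

    munmap(map, map_len);
    close(fd);
    return ok;
}
```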
Figure 49-4: Memory mapping extending beyond end of mapped file
49.4.4 Memory Protection and File Access Mode Interactions
One point that we have not so far explained in detail is the interaction
    addr = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    if (close(fd) == -1)                /* No longer need 'fd' */
        errExit("close");

    printf("Current string=%.*s\n", MEM_SIZE, addr);
                        /* Secure practice: output at most MEM_SIZE bytes */

    if (argc > 2) {                     /* Update contents of region */
        if (strlen(argv[2]) >= MEM_SIZE)
            cmdLineErr("'new-value' too large\n");
[Figure: mapped file (9500 bytes); requested size of mapping compared with actual mapped region of file; remainder of page is accessible and mapped to file]
We then use our program to map the file and copy a string into the mapped region:
$ ./t_mmap s.txt hello
Current string=
Copied "hello" to shared memory
The program displayed nothing for the current string because the initial value of the mapped file began with a null byte (i.e., zero-length string).
Next, we use our program to again map the file and copy a new string into the mapped region:
$ ./t_mmap s.txt goodbye
Current string=hello
Copied "goodbye" to shared memory
Finally, we dump the contents of the file, 8 characters per line, to verify its contents:
$ od -c -w8 s.txt
0000000 g o o d b y e nul
0000010 nul nul nul nul nul nul nul nul
*
0002000
Our trivial program doesn't use any mechanism to synchronize access by multiple processes to the mapped file. However, real-world applications typically need to synchronize access to shared mappings.
corresponding file blocks into memory. For output, the user process merely needs to modify the contents of the memory, and can then rely on the kernel memory manager to automatically update the underlying file.
In addition to saving a transfer between kernel space and user space, mmap() can also improve performance by lowering memory requirements. When using read() or write(), the data is maintained in two buffers: one in user space and the other in kernel space. When using mmap(), a single buffer is shared between the kernel space and user space. Furthermore, if multiple processes are performing I/O on the same file, then, using mmap(), they can all share the same kernel buffer, resulting in an additional memory saving.
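As a small illustration of memory-mapped input, the sketch below scans a file for newline characters directly in the mapping, with no user-space read buffer. Both function names are hypothetical; count_newlines_demo() exists only to exercise count_newlines_fd() on a scratch file.

```c
#define _DEFAULT_SOURCE         /* For mkstemp() */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count the newlines in the file referred to by
   'fd' by scanning a private read-only mapping of the whole file.
   Returns the count, or -1 on error. */
long
count_newlines_fd(int fd)
{
    struct stat sb;
    const char *map, *p, *end;
    long n = 0;

    if (fstat(fd, &sb) == -1)
        return -1;
    if (sb.st_size == 0)
        return 0;               /* mmap() rejects a zero-length mapping */

    map = mmap(NULL, (size_t) sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    end = map + sb.st_size;
    for (p = map; (p = memchr(p, '\n', (size_t) (end - p))) != NULL; p++)
        n++;

    munmap((void *) map, (size_t) sb.st_size);
    return n;
}

/* Hypothetical demonstration: write three lines to a scratch file and
   count them through the mapping. */
long
count_newlines_demo(void)
{
    char path[] = "/tmp/mmap_count_XXXXXX";
    const char text[] = "one\ntwo\nthree\n";
    long n = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);               /* File persists until 'fd' is closed */
    if (write(fd, text, sizeof(text) - 1) == (ssize_t) (sizeof(text) - 1))
        n = count_newlines_fd(fd);
    close(fd);
    return n;
}
```

Because this is a sequential scan, it is unlikely to run faster than a read() loop with a reasonably large buffer; its attraction here is the simpler input logic.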
Performance benefits from memory-mapped I/O are most likely to be realized when performing repeated random accesses in a large file. If we are performing sequential access of a file, then mmap() will probably provide little or no gain over read() and write(), assuming that we perform I/O using buffer sizes big enough to avoid making a large number of I/O system calls. The reason that there is little performance benefit is that, regardless of which technique we use, the entire contents of the
Figure 49-2: Two processes with a shared mapping of the same region of a file
Memory-mapped I/O
Since the contents of the shared file mapping are initialized from the file, and any modifications to the contents of the mapping are automatically carried through to the file, we can perform file I/O simply by accessing bytes of memory, relying on the kernel to ensure that the changes to memory are propagated to the mapped file. (Typically, a program would define a structured data type that corresponds to the contents of the disk file, and then use that data type to cast the contents of the mapping.) This technique is referred to as memory-mapped I/O, and is an alternative to using read() and write() to access the contents of a file.
Memory-mapped I/O has two potential advantages:
By replacing read() and write() system calls with memory accesses, it can simplify the logic of some applications.
It can, in some circumstances, provide
Although the executable text segment is normally protected to allow only read and execute access (PROT_READ | PROT_EXEC), it is mapped using MAP_PRIVATE rather than MAP_SHARED, because a debugger or a self-modifying program can modify the program text (after first changing the protection on the memory), and such changes should not be carried through to the underlying file or affect other processes.
To map the initialized data segment of an executable or shared library. Such mappings are made private so that modifications to the contents of the mapped data segment are not carried through to the underlying file.
Both of these uses of mmap() are normally invisible to a program, because these mappings are created by the program loader and dynamic linker. Examples of both kinds of mappings can be seen in the /proc/PID/maps output shown in Section 48.5.
One other, less frequent, use of a private file mapping is to simplify the file-input logic of a program. This is similar to the use of shared file mappings for memory-mapped I/O (described in the next section), but allows only for file input.
Figure 49-1: Overview of memory-mapped file
49.4.2 Shared File Mappings
When multiple processes create shared mappings of the same file region, they all share the same physical pages of memory. In addition, modifications to the contents of the mapping are carried through to the file. In effect, the file is being treated as the paging store for this region of memory, as shown in Figure 49-2. (We simplify things in this diagram by omitting to show that the mapped pages are typically not contiguous in physical memory.)
Shared file mappings serve two purposes: memory-mapped I/O and IPC. We consider each of these uses below.
If there are no mappings in the address range specified by addr and length, then munmap() has no effect and returns 0 (success).
1. Obtain a descriptor for the file, typically via a call to open().
2. Pass that file descriptor as the fd argument in a call to mmap().
As a result of these steps, mmap() maps the contents of the open file into the address space of the calling process. Once mmap() has been called, we can close the file descriptor without affecting the mapping. However, in some cases it may be useful to keep this file descriptor open; see, for example, Listing 49-1 and also Chapter 54.
As well as normal disk files, it is possible to use mmap() to map the contents of various real and virtual devices, such as hard disks, optical disks, and /dev/mem.
The file referred to by the descriptor fd must have been opened with permissions appropriate for the values specified in prot and flags. In particular, the file must always be opened for reading, and, if PROT_WRITE is specified in prot and MAP_SHARED is specified in flags, then the file must be opened for both reading and writing.
The
    /* Obtain the size of the file and use it to specify the size of
       the mapping and the size of the buffer to be written */

    if (fstat(fd, &sb) == -1)
        errExit("fstat");

    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    if (write(STDOUT_FILENO, addr, sb.st_size) != sb.st_size)
        fatal("partial/failed write");
    exit(EXIT_SUCCESS);
}
mmap/mmcat.c
49.3 Unmapping a Mapped Region: munmap()
The munmap() system call performs the converse of mmap(), removing a mapping from the calling process's virtual address space.
The addr argument is the starting address of the address range to be unmapped. It must be aligned to a page boundary. (SUSv3 specified that addr must be page-aligned. SUSv4 says that an implementation may require this argument to be page-aligned.)
The length argument is a nonnegative integer specifying the size (in bytes) of the region to be unmapped. The address range up to the next multiple of the system page size will be unmapped.
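The rounding described here can be expressed as a short calculation. The sketch below is illustrative only; the function name round_up_to_page() is hypothetical, and the page size is obtained with sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: return 'length' rounded up to the next multiple
   of the system page size, which is the extent of the address range
   that munmap() actually removes for a given length argument. */
size_t
round_up_to_page(size_t length)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);

    return (length + page - 1) / page * page;
}
```

For example, on a system with 4096-byte pages, a length of 2200 covers one page (4096 bytes) and a length of 4097 covers two pages (8192 bytes).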
Commonly, we unmap an entire mapping. Thus, we specify addr as the address
Modern x86-32 architectures provide hardware support for marking page tables as NX (no execute), and, since kernel 2.6.8, Linux makes use of this feature to properly separate PROT_READ and PROT_EXEC permissions on Linux/x86-32.
Alignment restrictions specified in standards for
The flags argument is a bit mask of options controlling various aspects of the mapping operation. Exactly one of the following values must be included in this mask:
MAP_PRIVATE
    Create a private mapping. Modifications to the contents of the region are not visible to other processes employing the same mapping, and, in the case of a file mapping, are not carried through to the underlying file.
MAP_SHARED
    Create a shared mapping. Modifications to the contents of the region are visible to other processes mapping the same region with the MAP_SHARED attribute and, in the case of a file mapping, are carried through to the underlying file. Updates to the file are not guaranteed to be immediate; see the discussion of the msync() system call in Section 49.5.
Aside from MAP_PRIVATE and MAP_SHARED, other flag values can optionally be ORed in flags. We discuss these flags in Sections 49.6 and 49.10.
The remaining arguments, fd and offset,
49.2 Creating a Mapping: mmap()
The mmap() system call creates a new mapping in the calling process's virtual address space.
The addr argument indicates the virtual address at which the mapping is to be located. If we specify addr as NULL, the kernel chooses a suitable address for the mapping. This is the preferred way of creating a mapping. Alternatively, we can specify a non-NULL value in addr, which the kernel takes as a hint about the address at which the mapping should be placed. In practice, the kernel will at the very least round the address to a nearby page boundary. In either case, the kernel will choose an address that doesn't conflict with any existing mapping. (If the value MAP_FIXED is included in flags, then addr must be page-aligned. We describe this flag in Section 49.10.)
On success, mmap()
main use of this type of mapping is to initialize a region of memory from the contents of a file. Some common examples are initializing a process's text and initialized data segments from the corresponding parts of a binary executable file or a shared library file.
Private anonymous mapping: Each call to mmap() to create a private anonymous mapping yields a new mapping that is distinct from (i.e., does not share physical pages with) other anonymous mappings created by the same (or a different) process. Although a child process inherits its parent's mappings, copy-on-write semantics ensure that, after the fork(), the parent and child don't see changes made to the mapping by the other process. The primary purpose of private anonymous mappings is to allocate new (zero-filled) memory for a process (e.g., malloc() employs mmap() for this purpose when allocating large blocks of memory).
Shared file mapping: All processes mapping the same region of a file share the same physical pages of memory, which are initialized from a file region. Modifications to the contents of the mapping are carried through to the file. This type of mapping serves two purposes. First, it permits memory-mapped I/O. By this, we mean that a file is loaded into a region of the process's virtual memory, and modifications to that memory are automatically written to the file. Thus, memory-mapped I/O provides an alternative to using read() and write() for performing file I/O. A second purpose of this type of mapping is to allow unrelated processes to share a region of memory in order to perform (fast) IPC in a manner similar to System V shared memory segments (Chapter 48).
Shared anonymous mapping: As with a private anonymous mapping, each call to mmap() to create a shared anonymous mapping creates a new, distinct mapping that doesn't share pages with any other mapping. The difference is that the pages of the mapping are not copied-on-write. This means that when a child inherits the mapping after a fork(), the parent and child share the same pages of RAM, and changes made to the contents of the mapping by one process are visible to the other process. Shared anonymous mappings allow IPC in
a manner similar to System V shared
The memory in one process's mapping may be shared with mappings in other processes (i.e., the page-table entries of each process point to the same pages of RAM). This can occur in two ways:
When two processes map the same region of a file, they share the same pages of physical memory.
A child process created by fork() inherits copies of its parent's mappings, and these mappings refer to the same pages of physical memory as the corresponding mappings in the parent.
When two or more processes share the same pages, each process can potentially see the changes to the page contents made by other processes, depending on
MEMORY MAPPINGS
This chapter discusses the use of the mmap() system call to create memory mappings. Memory mappings can be used for IPC, as well as a range of other purposes. We begin with an overview of some fundamental concepts before considering mmap() in depth.
49.1 Overview
The mmap() system call creates a new memory mapping in the calling process's virtual address space. A mapping can be of two types:
File mapping: A file mapping maps a region of a file directly into the calling process's virtual memory. Once a file is mapped, its contents can be accessed by operations on the bytes in the corresponding memory region. The pages of the mapping are (automatically) loaded from the file as required. This type of mapping is also known as a file-based mapping or memory-mapped file.
Anonymous mapping: An anonymous mapping doesn't have a corresponding file. Instead, the pages of the mapping are initialized to 0.
Another way of thinking of an anonymous mapping (and one that is close to the truth) is that it is a mapping of a virtual file whose contents are always initialized with zeros.
1016
Chapter 48
48.11 Exercises
48-1. Replace the use of binary semaphores in Listing 48-2 (svshm_xfr_writer.c) and Listing 48-3 (svshm_xfr_reader.c) with the use of event flags (Exercise 47-5).
48-2. Explain why the program in Listing 48-3 incorrectly reports the number of bytes transferred if the for loop is modified as follows:
    for (xfrs = 0, bytes = 0; shmp->cnt != 0; xfrs++, bytes += shmp->cnt) {
        reserveSem(semid, READ_SEM);            /* Wait for our turn */
        if (write(STDOUT_FILENO, shmp->buf, shmp->cnt) != shmp->cnt)
            fatal("write");
        releaseSem(semid, WRITE_SEM);           /* Give writer a turn */
    }
48-3. Try compiling the programs in Listing 48-2 (svshm_xfr_writer.c) and Listing 48-3 (svshm_xfr_reader.c) with a range of different sizes (defined by the constant BUF_SIZE) for the buffer used to exchange data
System V Shared Memory
1015
$ cat shmmax
33554432
$ cat shmall
2097152
The Linux-specific IPC_INFO
This field counts the number of processes that currently have the segment attached. It is initialized to 0 when the segment is created, and then incremented by each successful shmat() and decremented by each successful shmdt(). The shmatt_t data type used to define this field is an unsigned integer type that SUSv3 requires to be at least the size of unsigned short. (On Linux, this type is defined as unsigned long.)
48.9 Shared Memory Limits
Most UNIX implementations impose various limits on System V shared memory. Below is a list of the Linux shared memory limits. The system call affected by the limit and the error that results if the limit is reached are noted in parentheses.
SHMMNI
    This is a system-wide limit on the number of shared memory identifiers (in other words, shared memory segments) that can be created.
SUSv3 requires all of the fields shown here. Some other UNIX implementations include additional nonstandard fields in the shmid_ds structure.
The fields of the shmid_ds structure are implicitly updated by various shared memory system calls, and certain subfields of the shm_perm field can be explicitly
Locking and unlocking shared memory
A shared memory segment can be locked into RAM, so that it is never swapped out. This provides a performance benefit, since, once each page of the segment is memory-resident, an application is guaranteed never to be delayed by a page fault when it accesses the page. There are two locking operations:
The SHM_LOCK operation locks a shared memory segment into memory.
The SHM_UNLOCK operation unlocks the shared memory segment, allowing it to be swapped out.
These operations are not specified by SUSv3, and they are not provided on all UNIX implementations.
In versions of Linux before 2.6.10, only privileged (CAP_IPC_LOCK) processes can lock a shared memory segment into memory. Since Linux 2.6.10, an unprivileged process can lock and unlock a shared memory segment if its effective user ID matches either the owner or the creator user ID of the segment and (in the case of SHM_LOCK) the process has a sufficiently high RLIMIT_MEMLOCK
resource limit. See Sec-
48.7 Shared Memory Control Operations
The shmctl() system call performs a range of control operations on the shared memory segment identified by shmid.
The cmd argument specifies the control operation to be performed. The buf argument is required by the IPC_STAT and
48.6 Storing Pointers in Shared Memory
Each process may employ different shared libraries and memory mappings, and
In the output from /proc/PID/maps shown in Listing 48-4, we can see the following:
Three lines for the main program, svshm_attach. These correspond to the text and data segments of the program. The second of these lines is for a read-only page holding the string constants used by the program.
Two lines for the attached System V shared memory segments.
Lines corresponding to the segments for two shared libraries. One of these is the standard C library (libc-version.so). The other is the dynamic linker (ld-version.so), which we describe in Section 41.4.3.
A line labeled [stack]. This corresponds to the process stack.
A line containing the tag [vdso]. This is an entry for the linux-gate virtual dynamic shared object (DSO). This entry appears only in kernels since 2.6.12. See http://www.trilithium.com/johan/2005/08/linux-gate/ for further information about this entry.
The following columns are shown in each line of /proc/PID/maps, in order from left to right:
1. A pair of hyphen-separated numbers indicating the virtual address range (in hexadecimal) at which the memory segment is mapped. The second of these numbers is the address of the next byte after the end of the segment.
2. Protection and flags for this memory
We begin the shell session by creating two shared memory segments (100 kB and 3200 kB in size):
$ ./svshm_create -p 102400
9633796
$ ./svshm_create -p 3276800
9666565
$ ./svshm_create -p 102400
1015817
$ ./svshm_create -p 3276800
1048586
We then start a program that attaches these two segments at addresses chosen by the kernel:
$ ./svshm_attach 9633796:0 9666565:0
SHMLBA = 4096 (0x1000), PID = 9903
1: 9633796:0 ==> 0xb7f0d000
2: 9666565:0 ==> 0xb7bed000
Sleeping 5 seconds
The output above shows the addresses at which the segments were attached.
$ cat /proc/9903/maps
08048000-0804a000 r-xp 00000000 08:05 5526989 /home/mtk/svshm_attach
0804a000-0804b000 r--p 00001000 08:05 5526989 /home/mtk/svshm_attach
0804b000-0804c000 rw-p 00002000 08:05 5526989 /home/mtk/svshm_attach
b7bed000-b7f0d000 rw-s 00000000 00:09 9666565 /SYSV00000000 (deleted)
Figure 48-2: Locations of shared memory, memory mappings, and shared libraries (x86-32)
In the shell session below, we employ three programs that are not shown in this chapter, but are provided in the svshm subdirectory in the source code distribution for this book. These programs perform the following tasks:
The svshm_create.c program creates a shared memory segment. This program takes the same command-line options as the corresponding programs that we provide for message queues (Listing 46-1, on page 938) and semaphores, but includes an additional argument that specifies the size of the segment.
The svshm_attach.c program attaches the shared memory segments identified by its command-line arguments. Each of these arguments is a colon-separated pair of numbers consisting of a shared memory identifier and an attach address. Specifying 0 for the attach address means that the system should choose the address. The program displays the address at which the memory is actually attached. The program also displays the value of the SHMLBA constant and the process ID of the process running the program.
svshm_rm.c
The following shell session demonstrates the use of the programs in Listing 48-2 and Listing 48-3. We invoke the writer, using the file /etc/services as input, and then invoke the reader, directing its output to another file:
Listing 48-3: Transfer blocks of data from a System V shared memory segment to stdout
svshm/svshm_xfr_reader.c
#include "svshm_xfr.h"

int
main(int argc, char *argv[])
{
    int semid, shmid, xfrs, bytes;
    struct shmseg *shmp;
    if (initSemAvailable(semid, WRITE_SEM) == -1)
        errExit("initSemAvailable");
    if (initSemInUse(semid, READ_SEM) == -1)
        errExit("initSemInUse");
Create the shared memory segment and attach it to the writer's virtual address space at an address chosen by the system.
Enter a loop that transfers data from standard input to the shared memory segment. The following steps are performed in each loop iteration:
    Reserve (decrement) the writer semaphore.
    Read data from standard input into the shared memory segment.
    Release (increment) the reader semaphore.
The loop terminates when no further data is available from standard input. On the last pass through the loop, the writer indicates to the reader that there is no more data by passing a block of data of length 0 (shmp->cnt is 0).
Upon exiting the loop, the writer once more reserves its semaphore, so that it
Figure 48-1:
Using semaphores to ensure exclusive,
alternating access to shared memory
Listing 48-1: Header file for svshm_xfr_writer.c and svshm_xfr_reader.c
svshm/svshm_xfr.h
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "binary_sems.h"        /* Declares our binary semaphore functions */
#include "tlpi_hdr.h"

#define SHM_KEY 0x1234          /* Key for shared memory segment */
#define SEM_KEY 0x5678          /* Key for semaphore set */

#define OBJ_PERMS (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP)
                                /* Permissions for our IPC objects */

#define WRITE_SEM 0             /* Writer has access to shared memory */
#define READ_SEM 1              /* Reader has access to shared memory */

#ifndef BUF_SIZE                /* Allow "cc -D" to override definition */
#define BUF_SIZE 1024           /* Size of transfer buffer */
#endif

struct shmseg {                 /* Defines structure of shared memory segment */
    int cnt;                    /* Number of bytes used in 'buf' */
    char buf[BUF_SIZE];         /* Data being transferred */
};
svshm/svshm_xfr.h
Listing 48-2 is the writer program. This program performs the following steps:
[Figure 48-1 diagram: a writer process and a reader process alternate access to the shared memory using WRITE_SEM and READ_SEM. The writer copies a data block from stdin to shared memory; the reader copies the data block from shared memory to stdout.]
multiple of the system page size. Attaching a segment at an address that is a multiple of SHMLBA is necessary on some architectures in order to improve CPU cache performance and to prevent the possibility that different attaches of the same segment have inconsistent views within the CPU cache.
On the x86 architectures, SHMLBA is the same as the system page size, reflecting the fact that such caching inconsistencies can't arise on those architectures.
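As a small illustration of what "a multiple of SHMLBA" means in practice, the sketch below rounds an address down to the nearest boundary and checks alignment. The helper names roundDownTo() and isMultipleOf() are hypothetical and are not part of any library; only the SHMLBA constant comes from <sys/shm.h>.

```c
#include <stdint.h>
#include <sys/shm.h>            /* Defines SHMLBA */

/* Round 'addr' down to the nearest multiple of 'boundary' (boundary > 0).
   Hypothetical helper for illustration only. */
static uintptr_t
roundDownTo(uintptr_t addr, uintptr_t boundary)
{
    return addr - (addr % boundary);
}

/* Return 1 if 'addr' is a multiple of 'boundary', or 0 otherwise */
static int
isMultipleOf(uintptr_t addr, uintptr_t boundary)
{
    return addr % boundary == 0;
}
```

A caller could check a candidate attach address with isMultipleOf((uintptr_t) addr, (uintptr_t) SHMLBA) before passing it to shmat().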
Specifying a non-NULL value for shmaddr (i.e., either the second or third option listed above) is not recommended, for the following reasons:
- It reduces the portability of an application. An address valid on one UNIX implementation may be invalid on another.
- An attempt to attach a shared memory segment at a particular address will fail if that address is already in use. This could happen if, for example, the application (perhaps inside a library function) had already attached another segment or created a memory mapping at that address.
As its function result, shmat()
IPC_EXCL
    If IPC_CREAT was also specified, and a segment with the specified key already exists, fail with the error EEXIST.
The above flags are described in more detail in Section 45.1. In addition, Linux permits the following nonstandard flags:
SHM_HUGETLB (since Linux 2.6)
    A privileged (
    ) process can use this flag to create a shared memory segment that uses huge pages. Huge pages are a feature provided by many modern hardware architectures to manage memory using very large page sizes. (For example, x86-32 allows 4-MB pages as an alternative to 4-kB pages.) On systems that have large amounts of memory, and where applications require large blocks of memory, using huge pages reduces the number of entries required in the hardware memory management unit's translation look-aside buffer (TLB). This is beneficial because entries in the TLB are usually a scarce resource. See the kernel source file Documentation/
48.1 Overview
In order to use a shared memory segment, we typically perform the following steps:
an existing segment (i.e., one created by another process). This call returns a shared memory identifier for use in later calls.
to
the shared memory segment; that is, make the segment part of the virtual memory of the calling process.
At this point, the shared memory segment can be treated just like any other memory available to the program. In order to refer to the shared memory, the program uses the
existing segment, then size has no effect on the segment, but it must be less than or equal to the size of the segment.
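The life cycle of a segment can be sketched in a single process as follows. This is a minimal illustration under stated assumptions, not one of the book's listings: the function name shmRoundTrip() is hypothetical, the segment is created with the IPC_PRIVATE key so that no key needs to be agreed upon, and the data is copied in and back out by the same process purely to show the sequence of calls (shmget(), shmat(), shmdt(), and shmctl() with IPC_RMID).

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>

/* Copy 'msg' through a private System V shared memory segment into 'out'.
   Returns 0 on success, or -1 on error. Hypothetical helper. */
static int
shmRoundTrip(const char *msg, char *out, size_t outSize)
{
    size_t len = strlen(msg) + 1;       /* Include terminating null byte */
    int shmid, status = 0;
    char *addr;

    if (len > outSize)
        return -1;

    /* Create a new segment; the kernel rounds the size up to whole pages */
    shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | S_IRUSR | S_IWUSR);
    if (shmid == -1)
        return -1;

    /* Attach at an address chosen by the system (shmaddr is NULL) */
    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    memcpy(addr, msg, len);             /* Ordinary memory access from here */
    memcpy(out, addr, len);

    if (shmdt(addr) == -1)              /* Detach from our address space */
        status = -1;
    if (shmctl(shmid, IPC_RMID, NULL) == -1)    /* Mark segment for deletion */
        status = -1;
    return status;
}
```

In a real application, a second process would attach the same segment, and the two would need some form of synchronization, such as the semaphores used later in this chapter.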
The shmflg argument performs the same task as for the other IPC
SYSTEM V SHARED MEMORY
This chapter describes System V shared memory. Shared memory allows two or more processes to share the same region (usually referred to as a segment) of physical memory. Since a shared memory segment becomes part of a process's user-space memory, no kernel intervention is required for IPC. All that is required is that one process copies data into the shared memory; that data is immediately available to all other processes sharing the same segment. This provides fast IPC by comparison with techniques such as pipes or message queues, where the sending process copies data from a buffer in user space into kernel memory and the receiving process copies in the reverse direction. (Each process also incurs the overhead of a system call to perform the copy operation.)
On the other hand, the fact that IPC using shared memory is not mediated by the kernel means that, typically, some method of synchronization is required so that processes don't simultaneously access the shared memory (e.g., two processes performing simultaneous updates, or on
System V Semaphores
995
47-5.
tion of event flags using System V semaphores. This implementation will require two arguments for each of the functions above: a semaphore identifier and a semaphore number. (Consideration of the
operation will lead you to realize that the values chosen for
clear
994
Chapter 47
to equal 0. The last two of these operations may cause the caller to block.
A semaphore implementation is not required to initialize the members of a
A related Linux-specific operation,
SEM_INFO
At system startup, the semaphore limits ar
int                     /* Release semaphore - increment it by 1 */
releaseSem(int semId, int semNum)
{
    struct sembuf sops;

    sops.sem_num = semNum;
    sops.sem_op = 1;
    sops.sem_flg = bsUseSemUndo ? SEM_UNDO : 0;

    return semop(semId, &sops, 1);
}
Listing 47-10 shows the implementation of the binary semaphore functions. Each function in this implementation takes two arguments, which identify a semaphore
Release: Free a currently reserved semaphore, so that it can be reserved by another process.
In academic computer science, these two operations often go by the names P and V, the first letters of the Dutch terms for these operations. This nomenclature was coined by the late Dutch computer scientist Edsger Dijkstra, who produced much of the early theoretical work on semaphores. The terms down (decrement the semaphore) and up (increment the semaphore) are also used. POSIX terms the two operations wait and post.
A third operation is also sometimes defined:
Reserve conditionally: Make a nonblocking attempt to reserve this semaphore for exclusive use. If the semaphore is already reserved, then immediately
Limitations of SEM_UNDO
We conclude by noting that the SEM_UNDO flag is less useful than it first appears, for two reasons. One is that because modifying a semaphore typically corresponds to acquiring or releasing some shared resource, the use of SEM_UNDO on its own may be insufficient to allow a multiprocess application to recover in the event that a process unexpectedly terminates. Unless process te
The kernel doesn't need to keep a record of all operations performed using SEM_UNDO. It suffices to record the sum of the semaphore adjustments performed using SEM_UNDO in a per-semaphore, per-process integer total called the semadj (semaphore adjustment) value. When the process terminates, all that is necessary is to subtract this total from the semaphore's current value.
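The bookkeeping described above can be modeled with a few lines of ordinary C. The following is a user-space simulation only, not kernel code: the names simSem, simSemop(), and simExit() and the MAX_PROCS limit are hypothetical, and the clamping of the value at 0 on exit is an assumption of this sketch.

```c
#define MAX_PROCS 4             /* Hypothetical limit for this simulation */

struct simSem {
    int value;                  /* Current semaphore value */
    int semadj[MAX_PROCS];      /* Per-process total of SEM_UNDO adjustments */
};

/* Apply 'op' to the semaphore on behalf of process 'proc'. If 'undo' is
   nonzero, add the adjustment to that process's semadj total. Returns -1
   if the operation would take the value below 0 (a real semop() call
   would block or fail instead), or 0 on success. */
static int
simSemop(struct simSem *s, int proc, int op, int undo)
{
    if (s->value + op < 0)
        return -1;
    s->value += op;
    if (undo)
        s->semadj[proc] += op;
    return 0;
}

/* Simulate termination of 'proc': subtract its semadj total from the
   semaphore's current value. Clamping at 0 is an assumption of this
   sketch. */
static void
simExit(struct simSem *s, int proc)
{
    s->value -= s->semadj[proc];
    if (s->value < 0)
        s->value = 0;
    s->semadj[proc] = 0;
}
```

For example, if process 0 decrements a semaphore from 1 to 0 with the undo flag and then terminates, simExit() restores the value to 1, so that another waiter could proceed.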
Since Linux 2.6, processes (threads) created using clone() share semadj values if the CLONE_SYSVSEM flag is employed. Such sharing is required for a conforming implementation of POSIX threads. The NPTL threading implementation employs CLONE_SYSVSEM for the implementation of pthread_create().
47.7 Handling of Multiple Bl
If multiple processes are blocked trying to decrease the value of a semaphore by the same amount, then it is in
2. Process B makes a request to subtract 1 from semaphore 0 (
3. Process C adds 1 to semaphore 0.
At this point, process B unblocks and completes its request, even though it placed its request later than process A. Again, it is possible to devise scenarios in which process A is starved while other processes adjust and block on the values of the
47.8 Semaphore Undo Values
Suppose that, having adjusted the value of a semaphore (e.g., decreased the semaphore value so that it is now 0), a process then terminates, either deliberately or accidentally. By default, the semaphore's value is left unchanged. This may constitute a problem for other processes using the semaphore, since they may be blocked waiting on that semaphore, that is, waiting for the now-terminated process to undo the change it made.
To avoid such problems, we can employ the SEM_UNDO flag when changing the value of a semaphore via semop(). When this flag is specified, the kernel records the effect of the semaphore operation, and then undoes the operation if the process terminates. The undo happens regardless of whether the process terminates normally or abnormally.
When a semaphore set is created, the
field of the associated semid_ds data structure is initialized to 0. A calendar time value of 0 corresponds to the Epoch (Section 10.1), and
displays this as 1 AM, 1 January 1970, since the local timezone is Central Europe, one hour ahead of UTC.
Examining the output further, we can see that, for semaphore 0, the
value is 1 because operation 1 is waiting to decrease the semaphore value, and
is 1 because operation 3 is waiting for this semaphore to equal 0. For semaphore 1, the
value of 2 reflects the fact that operation 1 and operation 2 are waiting to decrease the semaphore value.
Next, we try a nonblocking operation on the semaphore set. This operation waits for semaphore 0 to equal 0. Since this operation can't be immediately performed, semop() fails with the error EAGAIN:
$ ./svsem_op 32769 0=0n
3673, 16:03:13: about to semop() [0=0n]
ERROR [EAGAIN/EWOULDBLOCK Resource temporarily unavailable] semop (PID=3673)
Now we add 1 to semaphore 1. This causes two of the earlier blocked operations (1 and 3) to unblock:
$ ./svsem_op 32769 1+1
3674, 16:03:29: about to semop() [1+1]
Using the program in Listing 47-8, along with various others shown in this chapter, we can study the operation of System V semaphores, as demonstrated in the following shell session. We begin by using a program that creates a semaphore set containing two semaphores, which we initialize to 1 and 0:
$ ./svsem_create -p 2
32769                           ID of semaphore set
        if (*sign == '-')                       /* Reverse sign of operation */
            sops[numOps].sem_op = - sops[numOps].sem_op;
        else if (*sign == '=')                  /* Should be '=0' */
            if (sops[numOps].sem_op != 0)
                cmdLineErr("Expected \"=0\" in \"%s\"\n", arg);

        sops[numOps].sem_flg = 0;
        for (;; flags++) {
            if (*flags == 'n')
                sops[numOps].sem_flg |= IPC_NOWAIT;
            else if (*flags == 'u')
                sops[numOps].sem_flg |= SEM_UNDO;
            else
                break;
        }

        if (*flags != ',' && *flags != '\0')
            cmdLineErr("Bad trailing character (%c) in \"%s\"\n", *flags, arg);

        comma = strchr(remaining, ',');
        if (comma == NULL)
            break;                              /* No comma --> no more ops */
        else
            remaining = comma + 1;
    }
Listing 47-8: Performing System V semaphore operations with semop()
svsem/svsem_op.c
#include <sys/types.h>
#include <sys/sem.h>
#include <ctype.h>
#include "curr_time.h"          /* Declaration of currTime() */
#include "tlpi_hdr.h"

#define MAX_SEMOPS 1000         /* Maximum operations that we permit for
                                   a single semop() */

static void
usageError(const char *progName)
{
    fprintf(stderr, "Usage: %s semid op[,op...] ...\n\n", progName);
    fprintf(stderr, "'op' is either: <sem#>{+|-}<value>[n][u]\n");
    fprintf(stderr, "            or: <sem#>=0[n]\n");
    fprintf(stderr, "       \"n\" means include IPC_NOWAIT in 'op'\n");
    fprintf(stderr, "       \"u\" means include SEM_UNDO in 'op'\n\n");
    fprintf(stderr, "The operations in each argument are "
                    "performed in a single semop() call\n\n");
    fprintf(stderr, "e.g.: %s 12345 0+1,1-2un\n", progName);
    fprintf(stderr, "      %s 12345 0=0n 1+1,2-1u 1=0\n", progName);
    exit(EXIT_FAILURE);
}
Listing 47-7: Using semop() to perform operations on multiple System V semaphores
    struct sembuf sops[3];

    sops[0].sem_num = 0;                /* Subtract 1 from semaphore 0 */
    sops[0].sem_op = -1;
    sops[0].sem_flg = 0;

    sops[1].sem_num = 1;                /* Add 2 to semaphore 1 */
    sops[1].sem_op = 2;
    sops[1].sem_flg = 0;

    sops[2].sem_num = 2;                /* Wait for semaphore 2 to equal 0 */
    sops[2].sem_op = 0;
    sops[2].sem_flg = IPC_NOWAIT;       /* But don't block if operation
                                           can't be performed immediately */

    if (semop(semid, sops, 3) == -1) {
        if (errno == EAGAIN)            /* Semaphore 2 would have blocked */
            printf("Operation would have blocked\n");
        else
            errExit("semop");           /* Some other error */
    }
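The fragment in Listing 47-7 can be exercised with a self-contained sketch such as the following. It is an illustration under stated assumptions rather than one of the book's programs: the function name applyThreeOps() is hypothetical, the set is created with IPC_PRIVATE and removed before returning, union semun is defined locally (as is required on Linux), and IPC_NOWAIT is set on all three operations (not only the third) so that the function never blocks.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>

union semun {                   /* Must be defined by the caller on Linux */
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private set of three semaphores initialized from 'init', apply
   the three operations of Listing 47-7 in one semop() call, and store the
   resulting values in 'result'. Returns 0 if the operations were performed,
   1 if semop() failed with EAGAIN, or -1 on any other error. */
static int
applyThreeOps(unsigned short init[3], unsigned short result[3])
{
    struct sembuf sops[3];
    union semun arg;
    int semid, status = -1;

    semid = semget(IPC_PRIVATE, 3, S_IRUSR | S_IWUSR);
    if (semid == -1)
        return -1;

    arg.array = init;
    if (semctl(semid, 0, SETALL, arg) != -1) {
        sops[0].sem_num = 0;            /* Subtract 1 from semaphore 0 */
        sops[0].sem_op = -1;
        sops[0].sem_flg = IPC_NOWAIT;

        sops[1].sem_num = 1;            /* Add 2 to semaphore 1 */
        sops[1].sem_op = 2;
        sops[1].sem_flg = IPC_NOWAIT;

        sops[2].sem_num = 2;            /* Wait for semaphore 2 to equal 0 */
        sops[2].sem_op = 0;
        sops[2].sem_flg = IPC_NOWAIT;

        if (semop(semid, sops, 3) == 0)
            status = 0;
        else if (errno == EAGAIN)
            status = 1;                 /* Some operation would have blocked */

        arg.array = result;             /* Read back all three values */
        if (semctl(semid, 0, GETALL, arg) == -1)
            status = -1;
    }

    semctl(semid, 0, IPC_RMID);         /* Remove the set */
    return status;
}
```

Because the operations in a single semop() call are performed atomically, a failure of the third operation leaves the first two semaphores unchanged as well.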
Example program
The program in Listing 47-8 provides a command-line interface to the semop() system call. The first argument to this program is
from semaphore
: test semaphore
to see if it equals 0.
At the end of each operation,
we can optionally include an
, or both. The letter
means include
IPC_NOWAIT
in the
semaphore 0, and subtracts 2 from semaphore 1. For the operation on semaphore 0, sem_flg is 0; for the operation on semaphore 1, sem_flg is IPC_NOWAIT.
While it is usual to operate on a single semaphore at a time, it is possible to make a semop() call that performs operations on
The sops argument is a pointer to an array that contains the operations to be performed, and nsops gives the size of this array (which must contain at least one element). The operations are performed atomically and in array order. The elements of the array are structures of the following form:
    struct sembuf {
        unsigned short sem_num;     /* Semaphore number */
        short          sem_op;      /* Operation to be performed */
        short          sem_flg;     /* Operation flags (IPC_NOWAIT and SEM_UNDO) */
    };
field identifies the semaphore wi
This rather complex solution to the race problem is not required in all applications. We don't need it if one process is guaranteed to be able to create and initialize the semaphore before any other processes attempt to use it. This would be the case, for example, if a parent creates and initializes the semaphore before creating child processes with which it shares the semaphore. In such cases, it is sufficient for the first process to follow its
[Figure residue: timeline for Process A and Process B showing a time slice expiring, one process initializing the semaphore, and each process executing semop(). Key: Executing; Waiting for CPU.]
#include <sys/types.h>          /* For portability */
#include <sys/sem.h>

int semop(int semid, struct sembuf *sops, unsigned int nsops);
                        Returns 0 on success, or -1 on error
Listing 47-6: Initializing a System V semaphore
from svsem/svsem_good_init.c
Listing 47-5: Incorrectly initializing a System V semaphore
from svsem/svsem_bad_init.c
    if (argc < 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s semid val...\n", argv[0]);

    semid = getInt(argv[1], 0, "semid");
if (argc != 2 || strcmp(argv[1], "--help") == 0)
usageErr("%s semid\n", argv[0]);
semid = getInt(argv[1], 0, "semid");
arg.buf = &ds;
if (semctl(semid, 0, IPC_STAT, arg) == -1)
errExit("semctl");
printf("Semaphore changed: %s", ctime(&ds.sem_ctime));
printf("Last sem %s", ctime(&ds.sem_otime));
/* Display per-semaphore information */
arg.array = calloc(ds.sem_nsems, sizeof(arg.array[0]));
if (arg.array == NULL)
errExit("calloc");
The fields of the
structure are implicitly updated by various semaphore
system calls, and certain subfields of the
field can be explicitly updated
Changing the value of a semaphore with
Generic control operations
The following operations are the same ones
that can be applied to other types of
System V IPC objects. In each case, the
argument is ignored. Further
field of the
semid_ds
data structure
argument is the identifier of the
#include <sys/types.h>          /* For portability */
#include <sys/sem.h>

int semctl(int semid, int semnum, int cmd, ... /* union semun arg */);
Listing 47-1: Creating and operating on System V semaphores
svsem/svsem_demo.c
#include <sys/types.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include "curr_time.h"          /* Declaration of currTime() */
#include "semun.h"              /* Definition of semun union */
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int semid;

    if (argc < 2 || argc > 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s init-value\n"
                 "   or: %s semid operation\n", argv[0], argv[0]);

    if (argc == 2) {            /* Create and initialize semaphore */
        union semun arg;
        sop.sem_flg = 0;        /* No special options for operation */
semaphores in a set is specified
the use of a semaphore to synchronize the actions of two processes that alternately move the semaphore value between 0 and 1.
Figure 47-1:
Using a semaphore to synchronize two processes
In terms of controlling the actions of a process, a semaphore has no meaning in
[Figure 47-1 diagram: timeline for Process A and Process B. Labeled events: Create semaphore; Initialize semaphore to 0; Add 1 to semaphore; Subtract 1 from semaphore (blocks, then resumes); Subtract 1 from semaphore (blocks, then resumes); Add 1 to semaphore.]
SYSTEM V SEMAPHORES
This chapter describes System V semaphores. Unlike the IPC mechanisms described in previous chapters, System V semaphores are not used to transfer data between processes. Instead, they allow processes to synchronize their actions. One common use of a semaphore is to synchronize access to a block of shared memory, in order to prevent one process from accessing the shared memory at the same time as another process is updating it.
A semaphore is a kernel-maintained integer whose value is restricted to being greater than or equal to 0. Various operations (i.e., system calls) can be performed on a semaphore, including the following:
waiting for the semaphore value to be equal to 0.
The last two of these operations may cause the calling process to block. When lowering a semaphore value, the kernel blocks any attempt to decrease the value below 0. Similarly, waiting for a semaphore to equal 0 blocks the calling process if the semaphore value is not currently 0. In both cases, the calling process remains blocked until some other process alters the semaphore to a value that allows the operation to proceed, at which point the kernel wakes the blocked process. Figure 47-1 shows
964
Chapter 46
46-5. The client shown in Listing 46-9 (svmsg_file_client.c) doesn't handle various possibilities for failure in the server. In particular, if the server message queue fills up (perhaps because the server terminated and the queue was filled by other clients), then the msgsnd() call will block indefinitely. Similarly, if the server fails to send a response to the client, then the msgrcv() call will block indefinitely. Add code
System V Message Queues
963
Various factors led us to conclude that other IPC mechanisms are usually preferable to System V message queues. One major difficulty is that message queues are not referred to using file descriptors. This means that we can't employ various
with message queues; in particular, it is complex to simultaneously monitor both message queues and file descriptors to see if I/O is possible. Furthermore, the fact that message queues are connectionless (i.e., not reference counted) makes it difficult for an application to know when a queue may be
a priority-queue strategy so that higher-priority messages (i.e., those with lower message type values) are read first.
However, System V message queues have a number of disadvantages:
Message queues are referred to by identifiers, rather than the file descriptors
used by most other UNIX I/O mechanis
Listing 46-9: Client for file server using System V message queues
svmsg/svmsg_file_client.c
#include "svmsg_file.h"

static int clientId;

static void
removeQueue(void)
{
    if (msgctl(clientId, IPC_RMID, NULL) == -1)
        errExit("msgctl");
}

int
main(int argc, char *argv[])
{
    struct requestMsg req;
    struct responseMsg resp;
    int serverId, numMsgs;
    ssize_t msgLen, totBytes;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s pathname\n", argv[0]);

    if (strlen(argv[1]) > sizeof(req.pathname) - 1)
        cmdLineErr("pathname too long (max: %ld bytes)\n",
                (long) sizeof(req.pathname) - 1);
    /* Read requests, handle each in a separate child process */

    for (;;) {
        msgLen = msgrcv(serverId, &req, REQ_MSG_SIZE, 0, 0);
        if (msgLen == -1) {
            if (errno == EINTR)         /* Interrupted by SIGCHLD handler? */
                continue;               /* ... then restart msgrcv() */
            errMsg("msgrcv");           /* Some other error */
            break;                      /* ... so terminate loop */
        }

        pid = fork();                   /* Create child process */
        if (pid == -1) {
            errMsg("fork");
            break;
        }

        if (pid == 0) {                 /* Child handles request */
            serveRequest(&req);
            _exit(EXIT_SUCCESS);
        }

        /* Parent loops to receive next client request */
    }

    /* If msgrcv() or fork() fails, remove server MQ and exit */

    if (msgctl(serverId, IPC_RMID, NULL) == -1)
        errExit("msgctl");
    exit(EXIT_SUCCESS);
}
svmsg/svmsg_file_server.c
Listing 46-9 is the client program for the application. Note the following:
The client creates a message queue with the
IPC_PRIVATE
key
and uses
to establish an exit handler
static void             /* Executed in child process: serve a single client */
serveRequest(const struct requestMsg *req)
{
    int fd;
    ssize_t numRead;
    struct responseMsg resp;

    fd = open(req->pathname, O_RDONLY);
    if (fd == -1) {                     /* Open failed: send error text */
        resp.mtype = RESP_MT_FAILURE;
        snprintf(resp.data, sizeof(resp.data), "%s", "Couldn't open");
        msgsnd(req->clientId, &resp, strlen(resp.data) + 1, 0);
        exit(EXIT_FAILURE);             /* and terminate */
    }

    /* Transmit file contents in messages with type RESP_MT_DATA. We don't
       diagnose read() and msgsnd() errors since we can't notify client. */

    resp.mtype = RESP_MT_DATA;
    while ((numRead = read(fd, resp.data, RESP_MSG_SIZE)) > 0)
        if (msgsnd(req->clientId, &resp, numRead, 0) == -1)
            break;

    /* Send a message of type RESP_MT_END to signify end-of-file */

    resp.mtype = RESP_MT_END;
    msgsnd(req->clientId, &resp, 0, 0);         /* Zero-length mtext */
}

int
main(int argc, char *argv[])
{
    struct requestMsg req;
    pid_t pid;
    ssize_t msgLen;
    int serverId;
    struct sigaction sa;

    /* Create server message queue */

    serverId = msgget(SERVER_KEY, IPC_CREAT | IPC_EXCL |
                            S_IRUSR | S_IWUSR | S_IWGRP);
    if (serverId == -1)
Server program
Listing 46-8 is the server program for the application. Note the following points:
- The server is designed to handle requests concurrently. A concurrent server design is preferable to the iterative design employed in Listing 44-7 (page 912), since we want to avoid the possibility that a client request for a large file would cause all other client requests to wait.
- Each client request is handled by creating a child process that serves the request. In the meantime, the main server process waits upon further client requests. Note the following points about the server child:
    - Since the child produced via fork() inherits a copy of the parent's stack, it thus obtains a copy of the request message read by the main server process.
    - The server child terminates after handling its associated client request.
- In order to avoid the creation of zombie processes (Section 26.2), the server establishes a handler for SIGCHLD and calls
  within this handler.
- The msgrcv() call in the parent server process may block, and consequently be interrupted by the SIGCHLD handler. To handle this possibility, a loop is used to restart the call if it fails with the error EINTR.
- The server child executes the serveRequest() function, which sends three message types back to the client. A request with an mtype of RESP_MT_FAILURE indicates that the server could not open the requested file; RESP_MT_DATA is used for a series of messages containing file data; RESP_MT_END (with a zero-length data field) is used to indicate that tr
SERVER_KEY
), and defines the formats of the mess
problems listed above when using a single message queue. Note the following points regarding this approach:
- Each client must create its own message queue (typically using the IPC_PRIVATE key) and inform the server of the queue's identifier, usually by transmitting the identifier as part of the client's message(s) to the server.
- There is a system-wide limit (MSGMNI) on the number of message queues, and the default value for this limit is quite low on some systems. If we expect to have a large number of simultaneous clients, we may need to raise this limit.
- The server should allow for the possibility that the client's message queue no longer exists (perhaps because the client prematurely deleted it).
We say more about using one message queue per client in the next section.
46.8 A File-Server Application
In this section, we describe a client-server application that uses one message queue per client. The application is a simple file server. The client sends a request message to the server's message queue asking for the contents of a named file. The server
ssible to the server. A more sophisticated server would require some type of authentication from the client before serving the
Figure 46-3:
Client-server IPC using one message queue per client
Common header file
Listing 46-7 is the header file included by both the server and the client. This header defines the well-known key to be used for the server's message queue
[Figure 46-3 diagram: the client sends a request to the server MQ (including the ID of the client queue); the server reads the request and creates a child to handle it; the server child sends the response, which the client reads.]
Which approach we choose depends on the requirements of our application. We next consider some of the factors that may influence our choice.
Using a single message queue for server and clients
Using a single message queue may be suitable when the messages exchanged between servers and clients are small. However, note the following points:
- Since multiple processes may attempt to read messages at the same time, we must use the message type (mtype) field to allow each process to select only those messages intended for it. One way to accomplish this is to use the client's process ID as the message type for messages sent from the server to the client. The client can send its process ID as part of its message(s) to the server. Furthermore, messages to the server must also be distinguished by a unique message type. For this purpose, we can use the number 1, which, being the process ID of the permanently running init process, can never be the process ID of a client process. (An alternative would be to use the server's process ID as the message type; however, it is difficult for the clients to obtain this information.) This numbering scheme is shown in Figure 46-2.
- Message queues have a limited capacity. This has the potential to cause a couple of problems. One of these is that multiple simultaneous clients could fill the message queue, resulting in a deadlock situation, where no new client requests can be submitted and the server is blocked from writing any responses. The other problem is that a poorly behaved or intentionally malicious client may fail to read responses from the server. This can lead to the queue becoming clogged with unread messages, preventing any communication between clients and server. (Using two queues, one for messages from clients to the server and the other for messages from the server to the clients, would solve the first of these problems, but not the second.)
Figure 46-2:
Using a single message queue for client-server IPC
Using one message queue per client
Using one message queue per client (as well as one for the server) is preferable
where large messages need to be exchanged, or where there is potential for the
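[Figure 46-2 diagram: a single message queue shared by clients and server. The client sends a request; the server reads the request (selecting by message type); the server sends a response (message type = PID of client); the client reads the response (selecting message type = own PID).]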
Listing 46-6: Displaying all System V message queues on the system
svmsg/svmsg_ls.c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/msg.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int maxind, ind, msqid;
    struct msqid_ds ds;
    struct msginfo msginfo;

    /* Obtain size of kernel 'entries' array */

    maxind = msgctl(0, MSG_INFO, (struct msqid_ds *) &msginfo);
    if (maxind == -1)
        errExit("msgctl-MSG_INFO");

    printf("maxind: %d\n\n", maxind);
    printf("index     id       key      messages\n");
) operations. (The
program employs these operations.) These oper-
, and
: The
operation serves two purposes. First, it
definitions of these three constants
from the corresponding System V IPC header files.
To list all message queues on the system, we can do the following:
1. Use a MSG_INFO operation to find out the maximum index (maxind) of the entries array for message queues.
2. Perform a loop for all values from 0 up to and including maxind, employing a MSG_STAT operation for each value. During this loop, we ignore the errors that may occur if an item of the entries array is empty (EINVAL) or if we don't have permissions on the object to which it refers (EACCES).
Listing 46-6 provides an implementation of the above steps for message queues. The following shell session log demonstrates the use of this program:
$ ./svmsg_ls
maxind: 4
index     ID       key      messages
   2     98306  0x00000000      0
   4    163844  0x000004d2      2
$ ipcs -q                       Check above against output of ipcs
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x00000000 98306      mtk        600        0            0
0x000004d2 163844     mtk        600        12           2
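The two steps listed above can be condensed into a small helper that simply counts the queues it can see. This is a minimal sketch that assumes Linux, where MSG_INFO and MSG_STAT are available; the function name countMessageQueues() is hypothetical and is not part of Listing 46-6.

```c
#define _GNU_SOURCE             /* Exposes MSG_INFO and MSG_STAT */
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Count the message queues visible to the caller. Returns the count,
   or -1 on an unexpected error. Hypothetical helper. */
static int
countMessageQueues(void)
{
    struct msginfo info;
    struct msqid_ds ds;
    int maxind, ind, count = 0;

    /* Step 1: obtain the maximum index of the kernel 'entries' array */
    maxind = msgctl(0, MSG_INFO, (struct msqid_ds *) &info);
    if (maxind == -1)
        return -1;

    /* Step 2: probe each index with MSG_STAT */
    for (ind = 0; ind <= maxind; ind++) {
        if (msgctl(ind, MSG_STAT, &ds) == -1) {
            if (errno != EINVAL && errno != EACCES)
                return -1;      /* Unexpected error */
            continue;           /* Empty slot or no permission: skip */
        }
        count++;
    }
    return count;
}
```

A successful MSG_STAT call returns the identifier of the queue at that index, which a fuller program could print alongside the fields of the returned msqid_ds structure.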
the above limits, it does limit the number
of messages on an individual queue to the value specified by the queues
msg_qbytes
if (argc != 3 || strcmp(argv[1], "--help") == 0)
usageErr("%s msqid max-bytes\n", argv[0]);
    if (argc > 1 && strcmp(argv[1], "--help") == 0)
        usageErr("%s [msqid...]\n", argv[0]);

    for (j = 1; j < argc; j++)
46.3 Message Queue Control Operations
The msgctl() system call performs control operations on the message queue identified by msqid.
argument specifies the operation to be performed on the queue. It can be one of the following:
IPC_RMID
    Immediately remove the message queue object and its associated data structure. All messages remaining in the queue are lost, and any blocked reader or writer processes are immediately awakened, with msgsnd() or msgrcv() failing with the error EIDRM. The third argument to msgctl() is ignored for this operation.
IPC_STAT
    Place a copy of the msqid_ds data structure associated with this message queue in the buffer pointed to by buf. We describe the msqid_ds structure in Section 46.4.
#ifdef MSG_EXCEPT
fprintf(stderr, " -x Use MSG_EXCEPT flag\n");
#endif
exit(EXIT_FAILURE);
int
main(int argc, char *argv[])
int msqid, flags, type;
ssize_t msgLen;
size_t maxBytes;
struct mbuf msg; /* Message buffer for msgrcv() */
The following shell session demonstrates the use of the programs in Listing 46-1, Listing 46-2, and Listing 46-3. We begin by creating a message queue using the IPC_PRIVATE key, and then write three messages with different types to the queue:
$ ./svmsg_create -p
32769                           ID of message queue
$ ./svmsg_send 32769 20 "I hear and I forget."
$ ./svmsg_send 32769 10 "I see and I remember."
$ ./svmsg_send 32769 30 "I do and I understand."
We then use the program in Listing 46-3 to read messages with a type less than or equal to 20 from the queue:
$ ./svmsg_receive -t -20 32769
Received: type=10; length=22; body=I see and I remember.
$ ./svmsg_receive -t -20 32769
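The selection by type shown in this session can also be reproduced programmatically. The sketch below is an illustration only: the structure demoMsg, the constant MAX_MTEXT, and the functions sendTyped() and receiveType() are hypothetical names, and both calls use IPC_NOWAIT so that they never block. A negative msgtyp selects the message with the lowest type that is less than or equal to the absolute value of msgtyp.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MAX_MTEXT 64            /* Hypothetical body size for this sketch */

struct demoMsg {
    long mtype;                 /* Message type (must be greater than 0) */
    char mtext[MAX_MTEXT];      /* Message body */
};

/* Send 'text' with the given type; returns 0 on success, or -1 on error */
static int
sendTyped(int msqid, long type, const char *text)
{
    struct demoMsg msg;
    size_t len = strlen(text) + 1;      /* Include terminating null byte */

    if (len > MAX_MTEXT)
        return -1;
    msg.mtype = type;
    memcpy(msg.mtext, text, len);
    return msgsnd(msqid, &msg, len, IPC_NOWAIT);
}

/* Nonblocking receive using 'msgtyp' as the selector; returns the type of
   the message received, or -1 if no message matches or an error occurs */
static long
receiveType(int msqid, long msgtyp)
{
    struct demoMsg msg;

    if (msgrcv(msqid, &msg, MAX_MTEXT, msgtyp, IPC_NOWAIT) == -1)
        return -1;
    return msg.mtype;
}
```

After sending messages of types 20, 10, and 30 in that order, a call such as receiveType(msqid, -20) returns the type-10 message first, matching the output of the first svmsg_receive command above.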
with the error ENOMSG. (The error EAGAIN would be more consistent, as occurs
on a nonblocking msgsnd() or a nonblocking read from a FIFO. However,
failing with ENOMSG is historical behavior, and required by SUSv3.)
MSG_EXCEPT
This flag has an effect only if msgtyp is greater than 0, in which case it forces
the complement of the usual operation; that is, the first message from the
queue whose mtype is not equal to msgtyp is removed from the queue and
[Table: columns Message type and Message body; entries not recoverable from the extraction]
46.2.2 Receiving Messages
The msgrcv() system call reads (and removes) a message from a message queue, and
copies its contents into the buffer pointed to by msgp.
The maximum space available in the mtext field of the msgp buffer is specified by
the argument maxmsgsz. If the body of the message to be removed from the queue
exceeds maxmsgsz bytes, then no message is removed from the queue, and msgrcv()
fails with the error E2BIG. (This default behavior can be changed using the
MSG_NOERROR flag described shortly.)
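To make the size check concrete, the following sketch sends a 10-byte message body and then tries to receive it into a 4-byte buffer, first without and then with MSG_NOERROR. The function name and both structure layouts are hypothetical, chosen only for this illustration; IPC_NOWAIT is used so that the calls never block.

```c
#define _XOPEN_SOURCE 700   /* Expose the System V IPC declarations */
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/stat.h>

/* Hypothetical layouts: a sender with room for 16 bytes of body and a
   receiver whose buffer holds only 4 bytes */
struct big_msg   { long mtype; char mtext[16]; };
struct small_msg { long mtype; char mtext[4]; };

/* Returns the number of bytes delivered once MSG_NOERROR is used, or -1
   if any step behaves differently from what the text describes. */
long
receive_oversized(void)
{
    struct big_msg out = { .mtype = 1 };
    struct small_msg in;
    ssize_t n;
    long result = -1;

    int msqid = msgget(IPC_PRIVATE, S_IRUSR | S_IWUSR);
    if (msqid == -1)
        return -1;

    memcpy(out.mtext, "0123456789", 10);
    if (msgsnd(msqid, &out, 10, 0) == -1)
        goto done;

    /* The body (10 bytes) exceeds the space offered (4 bytes): the call
       should fail with E2BIG and leave the message on the queue */
    n = msgrcv(msqid, &in, sizeof(in.mtext), 0, IPC_NOWAIT);
    if (n != -1 || errno != E2BIG)
        goto done;

    /* With MSG_NOERROR the message is removed and its body truncated */
    n = msgrcv(msqid, &in, sizeof(in.mtext), 0, IPC_NOWAIT | MSG_NOERROR);
    if (n != -1)
        result = (long) n;

done:
    msgctl(msqid, IPC_RMID, NULL);
    return result;
}
```

The first msgrcv() leaves the message in place, which is why the second call can still retrieve (a truncated copy of) it.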
Messages need not be read in the order in which they were sent. Instead, we
can select messages according to the value in the mtype field. This selection is
controlled by the msgtyp argument, as follows:
If msgtyp equals 0, the first message from the queue is removed and returned to
the calling process.
If msgtyp is greater than 0, the first message in the queue whose mtype equals
static void /* Print (optional) message, then usage description */
usageError(const char *progName, const char *msg)
if (msg != NULL)
fprintf(stderr, "%s", msg);
fprintf(stderr, "Usage: %s [-n] msqid msg-type [msg-text]\n", progName);
fprintf(stderr, " -n Use IPC_NOWAIT flag\n");
exit(EXIT_FAILURE);
int
main(int argc, char *argv[])
int msqid, flags, msgLen;
struct mbuf msg; /* Message buffer for msgsnd() */
int opt; /* Option character from getopt() */
/* Parse command-line options and arguments */
flags = 0;
while ((opt = getopt(argc, argv, "n")) != -1) {
if (opt == 'n')
flags |= IPC_NOWAIT;
else
usageError(argv[0], NULL);
}
if (argc < optind + 2 || argc > optind + 3)
usageError(argv[0], "Wrong number of arguments\n");
msqid = getInt(argv[optind], 0, "msqid");
To send a message with
case 'x':
flags |= IPC_EXCL;
break;
default:
usageError(argv[0], "Bad option\n");
}
}
if (numKeyFlags != 1)
usageError(argv[0], "Exactly one of the options -f, -k, "
"or -p must be supplied\n");
perms = (optind == argc) ? (S_IRUSR | S_IWUSR) :
static void /* Print usage info, then exit */
usageError(const char *progName, const char *msg)
if (msg != NULL)
fprintf(stderr, "%s", msg);
fprintf(stderr, "Usage: %s [-cx] {-f pathname | -k key | -p} "
"[octal-perms]\n", progName);
fprintf(stderr, " -c Use IPC_CREAT flag\n");
fprintf(stderr, " -x Use IPC_EXCL flag\n");
fprintf(stderr, " -f pathname Generate key using ftok()\n");
fprintf(stderr, " -k key Use 'key' as key\n");
fprintf(stderr, " -p Use IPC_PRIVATE key\n");
exit(EXIT_FAILURE);
int
main(int argc, char *argv[])
int numKeyFlags; /* Counts -f, -k, and -p options */
int flags, msqid, opt;
unsigned int perms;
long lkey;
key_t key;
/* Parse command-line options and arguments */
numKeyFlags = 0;
flags = 0;
while ((opt = getopt(argc, argv, "cf:k:px")) != -1) {
switch (opt) {
case 'c':
flags |= IPC_CREAT;
break;
case 'f': /* -f pathname */
key = ftok(optarg, 1);
if (key == -1)
errExit("ftok");
numKeyFlags++;
break;
case 'k': /* -k key (octal, decimal or hexadecimal) */
if (sscanf(optarg, "%li", &lkey) != 1)
cmdLineErr("-k option requires a numeric argument\n");
key = lkey;
numKeyFlags++;
break;
case 'p':
key = IPC_PRIVATE;
numKeyFlags++;
break;
Consequently, there are various existing applications that employ message queues,
and this fact forms one of the primary motivations for describing them.
46.1 Creating or Opening a Message Queue
SYSTEM V MESSAGE QUEUES
This chapter describes System V message queues. Message queues allow processes
to exchange data in the form of messages. Although message queues are similar to
pipes and FIFOs in some respects, they also differ in important ways:
eue is the identifier returned by a call
Chapter 45
On Linux, the ipcs -l command can be used to list the limits on each of the IPC
mechanisms. Programs can
employ the Linux-specific
IPC_INFO
operation to
about all of the System V
IPC objects on the system.
Introduction to System V IPC
45.7 Obtaining a List of All IPC Objects
Linux provides two nonstandard methods of obtaining a list of all IPC objects on
the system:
files within the /proc/sysvipc directory that list all IPC objects; and
the use of Linux-specific ctl calls.
We describe the files in the /proc/sysvipc directory here, and defer discussion of the
ctl calls until Section 46.6, where we provide an example program that lists all
System V message queues on the system.
Some other UNIX implementations have their own nonstandard methods of obtaining
a list of all IPC identifiers; for example, Solaris provides the msgids(), semids(), and
shmids() system calls for this purpose.
Three read-only files in the /proc/sysvipc directory provide the same information as
can be obtained via ipcs:
/proc/sysvipc/msg lists all message queues and their attributes.
/proc/sysvipc/sem
of all IPC objects of a given type is to parse the output of ipcs(1)
45.8 IPC Limits
Since System V IPC objects consume system resources, the kernel places various
limits on each class of IPC object in order to prevent resources from being
45.6 The ipcs and ipcrm Commands
The ipcs and ipcrm commands are the System V IPC analogs of the ls and rm file
commands. Using ipcs, we can obtain information about IPC objects on the system.
By default, ipcs displays all objects, as in the following example:
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x6d0731db 262147 mtk 600 8192 2
------ Semaphore Arrays --------
key semid owner perms nsems
0x6107c0b8 0 cecilia 660 6
0x6107c0b6 32769 britta 660 1
------ Message Queues --------
key msqid owner perms used-bytes messages
0x71075958 229376 cecilia 620 12 2
On Linux, ipcs(1) displays information only about IPC objects for which we have read
3. The identifier for the IPC object is calculated using the following formula:
identifier = index + xxx_perm.__seq * SEQ_MULTIPLIER
In the formula used to calculate the IPC identifier, index is the index of this object
instance within the entries array, and SEQ_MULTIPLIER is a constant defined with the
value 32,768 (IPCMNI in the kernel source file include/linux/ipc.h). For example,
in Figure 45-1, the identifier generated for the semaphore with the key value
0x4b079002 would be (2 + 5 * 32,768) = 163,842.
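The arithmetic in this formula can be checked with a few lines of C. This is only an illustration of the calculation, using the multiplier value quoted in the text; the function names are hypothetical, and the code is not the kernel's implementation. Because the index is always smaller than SEQ_MULTIPLIER, the index can be recovered from an identifier with the modulo operator.

```c
/* Value quoted in the text; treated here as a fixed assumption */
#define SEQ_MULTIPLIER 32768L

/* identifier = index + seq * SEQ_MULTIPLIER */
long
ipc_identifier(long index, long seq)
{
    return index + seq * SEQ_MULTIPLIER;
}

/* Inverse step: recover the array index from an identifier. This works
   because index is always less than SEQ_MULTIPLIER. */
long
ipc_index(long identifier)
{
    return identifier % SEQ_MULTIPLIER;
}
```

For the example in the text, ipc_identifier(2, 5) yields 163842, and ipc_index(163842) yields 2.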
Note the following points about
the kernel maintains an associated ipc_ids structure that records various global
information about all instances of that IPC mechanism. This information includes
a dynamically sized array of pointers, entries, to the associated data structure for
each object instance (semid_ds structures in the case of semaphores). The current
size of the entries array is recorded in the size field, with the max_id field holding the
index of the highest currently in-use element.
Figure 45-1: Kernel data structures used to represent System V IPC (semaphore) objects
[Figure labels recoverable from the extraction: entries, sem_perm.__key, sem_perm.__seq, associated data structures]
When an IPC
Suppose a client engages in an extended dialogue with a server, with multiple
IPC operations being performed by each process (e.g., multiple messages
exchanged, a sequence of semaphore operations, or multiple updates to shared
memory). What happens if the server process crashes or is deliberately halted and
then restarted? At this point, it would make no sense to blindly reuse the existing
IPC object created by the previous server process, since the new server process has
no knowledge of the historical information associated with the current state of the
IPC object. (For example, there may be a secondary request within a message
queue that was sent by a client in response to an earlier message from the old
server process.)
In such a scenario, the only option for the server may be to abandon all existing
An attempt by the second user to obtain an identifier for this message queue using
the following call would fail, since the user is not permitted write access to the
message queue:
To retrieve the associated data structure for an IPC object (the IPC_STAT operation)
requires read permission.
To remove an IPC object (the IPC_RMID operation) or change its associated
data structure (the
in the
entifier of an existing IPC object,
an initial permission check is made to ascertain whether the permissions specified
in the
flags
argument are compatible with those
on the existing object. If not, then
ta structure using the appropriate
system call,
by specifying an operation type of
IPC_STAT
. Conversely, some parts of the data
structure can be modified using the
This key value is generated from the supplied pathname and proj value using an
implementation-defined algorithm. SUSv3 makes the following requirements:
Only the least significant 8 bits of proj are employed by the algorithm.
The application must ensure that the pathname refers to an existing file to
which stat() can be applied (otherwise,
ftok()
calling
The glibc ftok() algorithm is similar to that employed on other UNIX implementations,
and suffers a similar limitation: there is a (very small) possibility that two
different files could yield the same key value. This can occur because there is a chance
that the least significant bits of an i-node number could be the same for two files on
different file systems, coupled with the possibility that two different disk devices
(on a system with multiple disk controllers) could have the same minor device number.
However, in practice, the possibility of colliding key values for different applications
is small enough that the use of ftok() for key generation is a viable technique.
A typical usage of ftok() is the following:
key_t key;
int id;
key = ftok("/mydir/myfile", 'x');
if (key == -1)
errExit("ftok");
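The requirement that only the least significant 8 bits of the second ftok() argument are used can be observed directly. In this sketch, the function name is hypothetical, and the test pathname "/" is an assumption made only because that directory exists on any UNIX system.

```c
#define _XOPEN_SOURCE 700   /* Expose ftok() and key_t */
#include <sys/types.h>
#include <sys/ipc.h>

/* Returns 1 if two project values that share the same low 8 bits give
   the same key for 'path', 0 if the keys differ, or -1 if ftok() fails
   (for example, because 'path' does not refer to an existing file). */
int
proj_low_bits_only(const char *path)
{
    key_t k1 = ftok(path, 'x');
    key_t k2 = ftok(path, 'x' + 256);   /* Same low 8 bits as 'x' */

    if (k1 == (key_t) -1 || k2 == (key_t) -1)
        return -1;
    return k1 == k2;
}
```

On an implementation that follows the SUSv3 requirement, proj_low_bits_only("/") returns 1, while a nonexistent pathname yields -1.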
45.2 IPC Keys
System V IPC keys are integer values represented using the data type key_t. The IPC
ct. (Internally, the kernel maintains data structures mapping keys to identifiers
for each IPC mechanism, as described in Section 45.5.)
So, how do we provide a unique key, one that guarantees that we won't accidentally
obtain the identifier of an existing IPC object used by some other application?
There are three possibilities:
Randomly choose some integer key value, which is typically placed in a header
file included by all programs using the IPC object. The difficulty with this
approach is that we may accidentally choose a value used by another application.
Specify the IPC_PRIVATE constant as the key value to the
If no IPC object corresponding to the given key currently exists, and IPC_CREAT
(analogous to the open() O_CREAT flag) was specified as part of the flags argument,
Creating and opening
a System V IPC object
Each System V IPC mechanism has an associated
identifier for that object. We consider
how to choose a key for an application in Section 45.2.
INTRODUCTION TO SYSTEM V IPC
System V IPC is the label used to refer to three different mechanisms for
interprocess communication:
Message queues can be used to pass messages between processes. Message
queues are somewhat like pipes, but differ in two important respects. First,
message boundaries are preserved, so that readers and writers communicate in
units of messages, rather than via an undelimited byte stream. Second, each
message includes an integer type field, and it is possible to select messages by
type, rather than reading them in the order in which they were written.
Semaphores permit multiple processes to synchronize their actions. A semaphore
is a kernel-maintained integer value that is visible to all processes that have the
necessary permissions. A process indicates to its peers that it is performing some
action by making an appropriate modification to the value of the semaphore.
Shared memory enables multiple processes to share the same region (called a
segment) of memory (i.e., the same page frames are mapped into the virtual
memory of multiple processes). Since access to user-space memory is a fast
operation, shared memory is one of the quickest methods of IPC: once one
process has updated the shared memory, the change is immediately visible to
other processes sharing the same segment.
Although these three IPC mechanisms are quite diverse in function, there are good
reasons for discussing them
Chapter 44
array.) Obtaining the correct process ID from this structure will allow pclose() to
select the child upon which to wait. This structure will also assist with the SUSv3
requirement that any still-open file streams created by earlier calls to popen() must
be closed in the new child process.
44-3.
The server in Listing 44-7 (fifo_seqnum_server.c) always starts assigning sequence
numbers from 0 each time it is started. Modify the program to use a backup file
that is updated each time a sequence number is assigned. (The O_SYNC flag,
described in Section 4.3.1, may be useful.) At startup, the program should check
for the existence of this file, and if it is present, use the value it contains to initialize
the sequence number. If the backup file can't be found on startup, the program
should create a new file and start assigning sequence numbers beginning at 0. (An
alternative to this technique would be to use memory-mapped files, described in
Chapter 49.)
44-4.
Add code to the server in Listing 44-7 (fifo_seqnum_server.c) so that if the program
receives the SIGINT or SIGTERM signals, it removes the server FIFO and terminates.
44-5.
The server in Listing 44-7 (fifo_seqnum_server.c) performs a second O_WRONLY open of
the FIFO so that it never sees end-of-file when reading from the reading descriptor
(serverFd) of the FIFO. Instead of doing this, an alternative approach could be tried:
whenever the server sees end-of-file on the reading descriptor, it closes the
descriptor, and then once more opens the FIFO for reading. (This open would
block until the next client opened the FIFO for writing.) What is wrong with this
approach?
44-6.
The server in Listing 44-7 (fifo_seqnum_server.c) assumes that the client process is
well behaved. If a misbehaving client created a client FIFO and sent a request to the
server, but did not open its FIFO, then the server's attempt to open the client FIFO
would block, and other clients' requests would be indefinitely delayed. (If done
maliciously, this would constitute a denial-of-service attack.) Devise a scheme to deal
with this problem. Extend the server (and possibly the client in Listing 44-8)
44-7.
Write programs to verify the operation of nonblocking opens and nonblocking I/O
on FIFOs (see Section 44.9).
Pipes and FIFOs
When using pipes, we must be careful
to close unused descriptors in order to
The impact of the O_NONBLOCK flag when writing to a pipe or FIFO is made complex
by interactions with the PIPE_BUF limit. The write() behavior is summarized in
Table 44-3.
The O_NONBLOCK flag causes a write() on a pipe or FIFO to fail (with the error EAGAIN)
in any case where data can't be transferred immediately. This means that if we are
writing up to PIPE_BUF bytes, then the write() will fail if there is not sufficient space in
the pipe or FIFO, because the kernel ca
Figure 44-8:
open()
for example,
one of the three standard descriptors
that are automatically opened for each
new program run by the shell or
Process X
1. Open FIFO A for reading (blocks)
2. Open FIFO B for writing
Process Y
1. Open FIFO B for reading (blocks)
2. Open FIFO A for writing
If the other end of the FIFO is already open, then the O_NONBLOCK flag has no effect
on the open() call: it successfully opens the FIFO immediately, as usual. The
O_NONBLOCK flag changes things only if the othe
if (argc > 1 && strcmp(argv[1], "--help") == 0)
usageErr("%s [seq-len...]\n", argv[0]);
/* Create our FIFO (before sending request, to avoid a race) */
umask(0); /* So we get the permissions we want */
snprintf(clientFifo, CLIENT_FIFO_NAME_LEN, CLIENT_FIFO_TEMPLATE,
Listing 44-8 is the code for the client. The client performs the following steps:
Create a FIFO to be used for receiving a response from the server. This is
done before sending the request, in order to ensure that the FIFO exists by the
time the server attempts to open it and send a response message.
Construct a message for the server containing the client's process ID and a
number (taken from an optional command-line argument) specifying the
length of the sequence that the client wishes the server to assign to it. (If no
command-line argument is supplied, the default sequence length is 1.)
Open the server FIFO and send the message to the server.
Open the client FIFO, and read and print the server's response.
/* Create well-known FIFO, and open it for reading */
struct response { /* Response (server --> client) */
int seqNum; /* Start of sequence */
pipes/fifo_seqnum.h
Server program
Listing 44-7 is the code for the server. The server performs the following steps:
Create the server's well-known FIFO and open the FIFO for reading. The
server must be run before any clients, so that the server FIFO exists by the time
a client attempts to open it. The server's open() blocks until the first client
opens the other end of the server FIFO for writing.
Open the server's FIFO once more, this time for writing. This will never
block, since the FIFO has already been opened for reading. This second open
is a convenience to ensure that the server doesn't see end-of-file if all clients
close the write end of the FIFO.
Ignore the SIGPIPE signal, so that if the server attempts to write to a client
FIFO that doesn't have a reader, then, rather than being sent a SIGPIPE signal
(which kills a process by default), it receives an EPIPE error from the write()
system call.
Enter a loop that reads and responds to each incoming client request. To
send the response, the server constructs the name of the client FIFO and
then opens that FIFO.
If the server encounters an error in opening the client FIFO, it abandons that
client's request.
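The effect of ignoring SIGPIPE, described in the steps above, can be seen with an ordinary pipe, which behaves like a FIFO in this respect. The following is a small sketch rather than part of Listing 44-7; the function name is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* Expose pipe() and related declarations */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* With SIGPIPE ignored, write to a pipe that has no reader and report
   the resulting errno value. Returns 0 if the write unexpectedly
   succeeded, or -1 if the setup failed. */
int
write_without_reader(void)
{
    int pfd[2];
    int err = 0;

    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    if (pipe(pfd) == -1)
        return -1;

    close(pfd[0]);                      /* No reader remains */

    if (write(pfd[1], "x", 1) == -1)
        err = errno;                    /* Expected to be EPIPE */

    close(pfd[1]);
    return err;
}
```

Without the signal() call, the same write() would deliver SIGPIPE and, by default, terminate the process before any error could be examined.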
This is an example of an iterative server, in which the server reads and handles each
client request before going on to handle the next client. An iterative server design
is suitable when each client request can be quickly processed and responded to, so
that other client requests are not delayed. An alternative design is a concurrent server,
in which the main server process employs a separate child process (or thread) to handle
each client request. We discuss server design further in Chapter 60.
Listing 44-7:
An iterative server using FIFOs
pipes/fifo_seqnum_server.c
#include <signal.h>
#include "fifo_seqnum.h"
int
main(int argc, char *argv[])
int serverFd, dummyFd, clientFd;
char clientFifo[CLIENT_FIFO_NAME_LEN];
struct request req;
struct response resp;
int seqNum = 0; /* This is our "service" */
In the three techniques described in the main text, a single channel (FIFO) is
used for all messages from all clients. An alternative is to use a single connection
for each message. The sender opens the communication channel, sends its message,
and then closes the channel. The reading process knows that the message is
complete when it encounters end-of-file. If multiple writers hold a FIFO open,
then this approach is not feasible, because the reader won't see end-of-file when
one of the writers closes the FIFO. This approach is, however, feasible when
using stream sockets, where a server process creates a unique communication
channel for each incoming client connection.
Figure 44-7:
Separating messages in a byte stream
In our example application, we use the third of the techniques described above,
with each client sending messages of a fixed size to the server. This message is
defined by the request structure defined in Listing 44-6. Each request to the server
includes the client's process ID, which enables the server to construct the name of
the FIFO used by the client to receive a response. The request also contains a field
(seqLen) specifying how many sequence numbers should be allocated to this client.
The response message sent from server to client consists of a single field, seqNum,
which is the starting value of the range of sequence numbers allocated to this client.
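The fixed-length technique can be illustrated without a server: if every message is exactly one structure long, the reader recovers message boundaries simply by reading that many bytes at a time. The sketch below uses a pipe in place of a FIFO and a local copy of the request structure; the function name and the sample PID and length values are illustrative assumptions.

```c
#define _XOPEN_SOURCE 700   /* Expose pipe() and related declarations */
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the fixed-size request message */
struct request {
    pid_t pid;      /* PID of client */
    int seqLen;     /* Length of desired sequence */
};

/* Write two requests back to back into a pipe, then read them out one
   fixed-size message at a time. Returns the seqLen of the second
   message, or -1 on error. */
int
second_request_seq_len(void)
{
    struct request out[2] = { { 6514, 3 }, { 6523, 7 } };
    struct request in;
    int pfd[2];
    int result = -1;
    int j;

    if (pipe(pfd) == -1)
        return -1;

    for (j = 0; j < 2; j++)
        if (write(pfd[1], &out[j], sizeof(struct request)) !=
                (ssize_t) sizeof(struct request))
            goto done;

    /* Each read of exactly sizeof(struct request) bytes yields one
       whole message, even though the pipe is a plain byte stream */
    for (j = 0; j < 2; j++)
        if (read(pfd[0], &in, sizeof(struct request)) !=
                (ssize_t) sizeof(struct request))
            goto done;

    result = in.seqLen;

done:
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```

Both writes here are far smaller than PIPE_BUF, so each message is transferred atomically and the two requests cannot be interleaved.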
Listing 44-6:
Header file for
fifo_seqnum_server.c
and
fifo_seqnum_client.c
pipes/fifo_seqnum.h
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"
#define SERVER_FIFO "/tmp/seqnum_sv"
/* Well-known name for server's FIFO */
#define CLIENT_FIFO_TEMPLATE "/tmp/seqnum_cl.%ld"
/* Template for building client FIFO name */
#define CLIENT_FIFO_NAME_LEN (sizeof(CLIENT_FIFO_TEMPLATE) + 20)
/* Space required for client FIFO pathname
(+20 as a generous allowance for the PID) */
struct request { /* Request (client --> server) */
pid_t pid; /* PID of client */
int seqLen; /* Length of desired sequence */
[Figure 44-7 panels: 1) delimiter character; 2) header with length field; 3) fixed-length messages]
Figure 44-6: Using FIFOs in a single-server, multiple-client application
[Figure: each client (PID=6514, PID=6523) sends a request (PID + length) to the server FIFO and receives a response]
44.8 A Client-Server Application Using FIFOs
In this section, we present a simple client-server application that employs FIFOs for
IPC. The server provides the (trivial) service of assigning unique sequential numbers
to each client that requests them. In the course of discussing this application, we
introduce a few concepts and te
Application overview
In the example application, all clients send their requests to the server using a
single server FIFO. The header file (Listing 44-6) defines the well-known name
(/tmp/seqnum_sv) that the server uses for its FIFO. This name is fixed, so that all
clients know how to contact the server. (In this example application, we create the
FIFOs in the /tmp directory, since this allows us to conveniently run the programs
without change on most systems. However, as noted in Section 38.7, creating files in
publicly writable directories such as /tmp can lead to various security vulnerabilities
and should be avoided in real-world applications.)
In client-server applications, we'll repeatedly encounter the concept of a well-known
address or name used by a server to make its service visible to clients.
Using a well-known address is one solution to the problem of how clients can
know where to contact a server. Another possible solution is to provide some
kind of name server with which servers can register the names of their services.
Each client then contacts the name server to obtain the location of the
service it desires. This solution allows the location of servers to be flexible, at
the cost of some extra programming effort. Of course, clients and servers
then need to know where to contact the name server; typically, it resides at a
well-known address.
It is not, however, possible to use a single FIFO to send responses to all clients,
since multiple clients would race to read from the FIFO, and possibly read each
other's response messages rather than their own. Therefore, each client creates a
unique FIFO that the server uses for delivering the response for that client, and the
server needs to know how to find each client's FIFO. One possible way to do this is
for the client to generate its FIFO pathname, and then pass the pathname as part of
its request message. Alternatively, the client and server can agree on a convention
for constructing a client FIFO pathname, and, as part of its request, the client can
pass the server the information required to construct the pathname specific to this
client. This latter solution is used in our example. Each client's FIFO name is built
from a template (CLIENT_FIFO_TEMPLATE) consisting of a pathname containing the client's
process ID. The inclusion of the process ID provides an easy way of generating a
name unique to this client.
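The naming convention can be sketched as a small helper that both client and server could call. The two macros are copied from Listing 44-6; the helper function client_fifo_name() is hypothetical and is not part of the book's listings.

```c
#include <stdio.h>
#include <sys/types.h>

/* Copied from the header file in Listing 44-6 */
#define CLIENT_FIFO_TEMPLATE "/tmp/seqnum_cl.%ld"
#define CLIENT_FIFO_NAME_LEN (sizeof(CLIENT_FIFO_TEMPLATE) + 20)

/* Build the FIFO pathname for the client with the given process ID.
   'buf' must have room for CLIENT_FIFO_NAME_LEN bytes. Returns the
   length of the resulting pathname, as reported by snprintf(). */
int
client_fifo_name(char *buf, pid_t pid)
{
    /* pid_t is cast to long to match the %ld conversion in the template */
    return snprintf(buf, CLIENT_FIFO_NAME_LEN, CLIENT_FIFO_TEMPLATE,
                    (long) pid);
}
```

For a client whose process ID is 6514, the helper produces /tmp/seqnum_cl.6514. Because the server applies the same template to the PID carried in the request, the pathname itself never needs to be transmitted.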
Figure 44-6 shows how this application
uses FIFOs for commu
can be used for reading and writing on the FIFO. Doing this rather subverts the I/O
model for FIFOs, and SUSv3 explicitly notes that opening a FIFO with the O_RDWR
flag is unspecified; therefore, for portability reasons, this technique should be
avoided. In circumstances where we need to prevent blocking when opening a
FIFO, the open() O_NONBLOCK flag provides a standardized method for doing so (refer
to Section 44.9).
Avoiding the use of the O_RDWR flag when opening a FIFO can be desirable for
another reason. After such an open(), the calling process will never see end-of-file
when reading from the resulting file descriptor, because there will always
be at least one descriptor open for writing to the FIFO: the same descriptor
from which the process is reading.
Using FIFOs and tee(1) to create a dual pipeline
One of the characteristics of shell pipelines is that they are linear; each process in
the pipeline reads data produced by its predecessor and sends data to its successor.
Using FIFOs, it is possible to create a fork in a pipeline, so that a duplicate copy of
the output of a process is sent to another process in addition to its successor in the
pipeline. In order to do this, we need to use the tee command, which writes two
copies of what it reads from its standard input: one to standard output and the
other to the file named in its command-line argument.
Making the file argument to tee a FIFO allows us to have two processes simultaneously
reading the duplicate output produced by tee. We demonstrate this in the
following shell session, which creates a FIFO named myfifo, starts a background wc
command that opens the FIFO for reading (this will block until the FIFO is opened
for writing), and then executes a pipeline that sends the output of ls to tee, which
both passes the output further down the pipeline to sort and sends it to the myfifo
FIFO. (The -k5n option to sort causes the output of ls to be sorted in increasing
numerical order on the fifth space-delimited field.)
$ mkfifo myfifo
$ wc -l < myfifo &
$ ls -l | tee myfifo | sort -k5n
(Resulting output not shown)
Diagrammatically, the above commands create the situation shown in Figure 44-5.
The tee program is so named because of its shape. We can consider tee as functioning
similarly to a pipe, but with an additional branch that sends duplicate output.
Diagrammatically, this has the shape of a capital letter T (see Figure 44-5). In addition
to the purpose described here, tee is also useful for debugging pipelines and
for saving the results produced at some intervening point in a complex pipeline.
Figure 44-5: Using a FIFO and tee(1) to create a dual pipeline
Once a FIFO has been opened, we use the same I/O system calls as are used
with pipes and other files (i.e., read(), write(), and close()). Just as with pipes, a FIFO
has a write end and a read end, and data is read from the pipe in the same order as
it is written. This fact gives FIFOs their name: first in, first out. FIFOs are also
sometimes known as named pipes.
As with pipes, when all descriptors referring to a FIFO have been closed, any
outstanding data is discarded.
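The following sketch shows these calls working on a FIFO: a child process opens the FIFO for writing and sends a few bytes, while the parent opens it for reading and reads until end-of-file. The function name, the temporary directory name, and the message text are all hypothetical choices made for this illustration.

```c
#define _XOPEN_SOURCE 700   /* Expose mkdtemp(), mkfifo(), and kill() */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a FIFO in a fresh temporary directory, let a child write
   "hello" to it, and read the data back in the parent. Returns the
   number of bytes read into 'buf', or -1 on error. */
ssize_t
fifo_round_trip(char *buf, size_t size)
{
    char dir[] = "/tmp/fifo_demo.XXXXXX";   /* Hypothetical location */
    char path[64];
    ssize_t total = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/myfifo", dir);

    if (mkfifo(path, S_IRUSR | S_IWUSR) == 0) {
        pid_t child = fork();
        if (child == 0) {                       /* Child: the writer */
            int wfd = open(path, O_WRONLY);     /* Blocks until a reader opens */
            if (wfd == -1 || write(wfd, "hello", 5) != 5)
                _exit(EXIT_FAILURE);
            close(wfd);
            _exit(EXIT_SUCCESS);
        } else if (child > 0) {                 /* Parent: the reader */
            int rfd = open(path, O_RDONLY);     /* Blocks until a writer opens */
            if (rfd == -1) {
                kill(child, SIGKILL);           /* Don't leave the writer blocked */
            } else {
                ssize_t n;
                total = 0;
                /* read() returns 0 (EOF) once the writer has closed */
                while ((size_t) total < size &&
                       (n = read(rfd, buf + total, size - (size_t) total)) > 0)
                    total += n;
                close(rfd);
            }
            waitpid(child, NULL, 0);
        }
        unlink(path);
    }
    rmdir(dir);
    return total;
}
```

Each open() blocks until the other end has been opened, so neither process proceeds until both are ready.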
We can create a FIFO from the shell using the mkfifo command:
$ mkfifo [ -m mode ] pathname
The pathname is the name of the FIFO to be created, and the -m option is used to
specify a permission mode in the same way as for the chmod command.
When applied to a FIFO (or pipe),
and
Note also the input checking performed in Listing 44-5. This is done to prevent
invalid input causing popen() to execute an unexpected shell command. Suppose
that these checks were omitted, and the user entered the following input:
pattern: ; rm *
The program would then pass the following command to popen(), with disastrous
results:
/bin/ls -d ; rm * 2> /dev/null
Such checking of input is always required in programs that use popen() (or system())
to execute a shell command built from user input. (An alternative would be for the
application to quote any characters other than those being checked for, so that
those characters don't undergo special processing by the shell.)
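A check of this kind can be written as a small whitelist function. The sketch below is an assumption about what such a check might look like, not a copy of the code in Listing 44-5: the function name and the exact set of permitted punctuation characters are chosen for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if 'pat' is nonempty and consists only of letters, digits,
   and a small set of globbing characters; returns 0 otherwise. The
   permitted set is an assumption made for this sketch. */
int
pattern_is_safe(const char *pat)
{
    const char *p;

    if (*pat == '\0')
        return 0;

    for (p = pat; *p != '\0'; p++)
        if (!isalnum((unsigned char) *p) && strchr("_*?[^-].", *p) == NULL)
            return 0;       /* Reject spaces, ';', '|', '$', quotes, ... */

    return 1;
}
```

With this check in place, the input ; rm * is rejected because of the semicolon and the spaces, while a pattern such as *.c is accepted. A whitelist is used because listing the permitted characters is easier to get right than trying to enumerate every character that is special to the shell.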
44.6 Pipes and stdio Buffering
stream corresponding to the write end of the pipe.
If the process calling popen() is reading from the pipe (i.e., mode is r), things
may not be so straightforward. In this case, if the child process is using the stdio
library, then, unless it includes explicit calls to fflush() or
fp = popen(popenCmd, "r");
if (fp == NULL) {
printf("popen() failed\n");
continue;
}
/* Read resulting list of pathnames until EOF */
fileCnt = 0;
Listing 44-5: Globbing filename patterns with popen()
pipes/popen_glob.c
#include <ctype.h>
#include <limits.h>
#include "print_wait_status.h" /* For printWaitStatus() */
#include "tlpi_hdr.h"
#define POPEN_FMT "/bin/ls -d %s 2> /dev/null"
#define PAT_SIZE 50
#define PCMD_BUF_SIZE (sizeof(POPEN_FMT) + PAT_SIZE)
int
main(int argc, char *argv[])
char pat[PAT_SIZE]; /* Pattern for globbing */
char popenCmd[PCMD_BUF_SIZE];
the write end of the pipe; when writing to the pipe, it is sent a
SIGPIPE
signal, and
how this may occur shortly.
When performing a wait to obtain the status of the child shell, SUSv3 requires
that pclose(), like system(), should automatically restart the internal call that it makes
to waitpid() if that call is interrupted by a signal handler.
In general, we can make the same statements for popen() as were made in Section
27.6 for system(). Using popen() offers convenience. It builds the pipe, performs
descriptor duplication, closes unused desc
from passing this pattern to the ls command. (Techniques similar to this were
used on older UNIX implementations to perform filename generation, also known
as globbing, prior to the existence of the glob() library function.)
When we run the program in Listing 44-4, we see the following:
$ ./pipe_ls_wc
24
$ ls | wc -l                            Verify the results using shell commands
24
44.5 Talking to a Shell Command: popen()
A common use for pipes is to execute a shell command and either read its output
or send it some input. The popen() and pclose() functions are provided to simplify
this task.
The popen() function creates a pipe, and then forks a child process that execs a
shell, which in turn creates a child process to execute the string given in command.
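As a minimal sketch of reading a command's output through popen(), the helper below runs a command and returns the first line it prints. The function name is hypothetical, and the sketch assumes that a shell is available and that the command produces only a small amount of output.

```c
#define _XOPEN_SOURCE 700   /* Expose popen() and pclose() */
#include <stdio.h>
#include <string.h>

/* Run 'command' via the shell and copy the first line of its output
   (without the trailing newline) into 'buf', which must have room for
   at least one byte. Returns the status from pclose(), or -1 if
   popen() fails. */
int
first_line_of(const char *command, char *buf, size_t size)
{
    FILE *fp = popen(command, "r");     /* "r": we read the command's stdout */
    if (fp == NULL)
        return -1;

    if (fgets(buf, (int) size, fp) == NULL)
        buf[0] = '\0';                  /* Command produced no output */
    else
        buf[strcspn(buf, "\n")] = '\0'; /* Strip the newline, if present */

    return pclose(fp);                  /* Waits for the child shell */
}
```

For example, first_line_of("echo hello", buf, sizeof(buf)) places hello in buf and returns 0, since the status returned by pclose() is 0 for a command that exits successfully.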
[Figure: process relationships and pipe usage for popen(), for the cases where mode is r and mode is w]
/* Duplicate stdout on write end of pipe; close duplicated descriptor */
if (pfd[1] != STDOUT_FILENO) { /* Defensive check */
if (dup2(pfd[1], STDOUT_FILENO) == -1)
errExit("dup2 1");
if (close(pfd[1]) == -1)
errExit("close 2");
}
execlp("ls", "ls", (char *) NULL); /* Writes to pipe */
errExit("execlp ls");
default: /* Parent falls through to create next child */
break;
}
switch (fork()) {
case -1:
errExit("fork");
case 0: /* Second child: exec 'wc' to read from pipe */
if (close(pfd[1]) == -1) /* Write end is unused */
errExit("close 3");
/* Duplicate stdin on read end of pipe; close duplicated descriptor */
if (pfd[0] != STDIN_FILENO) { /* Defensive check */
if (dup2(pfd[0], STDIN_FILENO) == -1)
errExit("dup2 2");
if (close(pfd[0]) == -1)
errExit("close 4");
}
execlp("wc", "wc", "-l", (char *) NULL); /* Reads from pipe */
errExit("execlp wc");
default: /* Parent falls through */
break;
}
/* Parent closes unused file descriptors for pipe, and waits for children */
if (close(pfd[0]) == -1)
errExit("close 5");
if (close(pfd[1]) == -1)
errExit("close 6");
if (wait(NULL) == -1)
errExit("wait 1");
if (wait(NULL) == -1)
errExit("wait 2");
exit(EXIT_SUCCESS);
pipes/pipe_ls_wc.c
, we now have two file descriptors referring to the write end
of the pipe: descriptor 1 and pfd[1]. Since unused pipe file descriptors should be
closed, after the dup2() call, we close the superfluous descriptor:
close(pfd[1]);
The code we have shown so far relies on standard output having been previously
open. Suppose that, prior to the pipe() call, standard input and standard output had
both been closed. In this case, pipe() would have allocated these two descriptors to
the pipe, perhaps with pfd[0] having the value 0 and pfd[1] having the value 1.
Consequently, the preceding dup2() and close() calls would be equivalent to the following:
dup2(1, 1); /* Does nothing */
close(1); /* Closes sole descriptor for write end of pipe */
Therefore, it is good defensive programming practice to bracket these calls with an
if statement of the following form:
if (pfd[1] != STDOUT_FILENO) {
dup2(pfd[1], STDOUT_FILENO);
close(pfd[1]);
}
Example program
The program in Listing 44-4 uses the techniques described in this section to bring
Synchronization using pipes has an advantage over the earlier example of synchronization using signals: it can be used to coordinate the actions of one process with multiple other (related) processes. The fact that multiple (standard) signals can't be queued makes signals unsuitable in this case. (Conversely, signals have the advantage that they can be broadcast by one process to all of the members of a process group.)
Other synchronization topologies are possible (e.g., using multiple pipes). Furthermore, this technique could be extended so that, instead of closing the pipe, each child writes a message to the pipe containing its process ID and some status information. Alternatively, each child might write a single byte to the pipe. The parent process could then count and analyze these messages. This approach guards against the possibility of the child accidentally terminating, rather than explicitly closing the pipe.
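The basic form of this synchronization technique can be sketched as follows. This is a minimal illustration rather than the book's listing: the function name sync_with_children() is hypothetical, the children do no real work, and error handling is abbreviated.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork 'nchildren' children that each inherit the write end of one pipe.
   Each child closes that descriptor when its (placeholder) work is done.
   The parent's read() returns 0 (end-of-file) only after every child has
   closed its copy. Returns 0 on success, -1 on error. */
int sync_with_children(int nchildren)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    for (int j = 0; j < nchildren; j++) {
        switch (fork()) {
        case -1:
            return -1;
        case 0:                         /* Child */
            close(pfd[0]);              /* Read end is unused */
            /* ... child would do its work here ... */
            close(pfd[1]);              /* Tells parent that we are done */
            _exit(0);
        default:
            break;
        }
    }

    close(pfd[1]);      /* Parent must close write end, or it never sees EOF */

    char dummy;
    ssize_t n = read(pfd[0], &dummy, 1);    /* Blocks until all children close */
    close(pfd[0]);

    while (wait(NULL) > 0)              /* Reap the children */
        continue;

    return (n == 0) ? 0 : -1;
}
```

Note that a child that terminates abnormally also closes its descriptor, so this form can't distinguish a crash from a deliberate close; that is the weakness the message-writing variant addresses.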
44.4 Using Pipes to Connect Filters
When a pipe is created, the file descriptors used for the two ends of the pipe are the next lowest-numbered descriptors available. Since, in normal circumstances, descriptors 0, 1, and 2 are already in use for a process, some higher-numbered descriptors will be allocated for the pipe. So how do we bring about the situation shown in Figure 44-1, where two filters (i.e., programs that read from stdin and write to stdout) are connected using a pipe, such that the standard output of one program is directed into the pipe and the standard input of the other is taken from the pipe? And in particular, how can we do this without modifying the code of the filters themselves?
The answer is to use the techniques described in Section 5.5 for duplicating file descriptors. Traditionally, the following series of calls was used to accomplish the desired result:
    int pfd[2];

    pipe(pfd);              /* Allocates (say) file descriptors 3 and 4 for pipe */
    /* Other steps here, e.g., fork() */
    close(STDOUT_FILENO);   /* Free file descriptor 1 */
    dup(pfd[1]);            /* Duplication uses lowest free file
                               descriptor, i.e., fd 1 */
The end result of the above steps is that the process's standard output is bound to the write end of the pipe. A corresponding
    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s sleep-time...\n", argv[0]);
    if (write(pfd[1], argv[1], strlen(argv[1])) != strlen(argv[1]))
        fatal("parent - partial/failed write");
    if (close(pfd[1]) == -1)                /* Child will see EOF */
        errExit("close");
    wait(NULL);                             /* Wait for child to finish */
    exit(EXIT_SUCCESS);
}
pipes/simple_pipe.c
Here's an example of what we might see when running the program in Listing 44-2:
./simple_pipe 'It was a bright cold day in April, '\
'and the clocks were striking thirteen.'
It was a bright cold day in April, and the clocks were striking thirteen.
Listing 44-2:
it has read all data from the pipe. Instead, a read() would block waiting for data, because the kernel knows that there is still at least one write descriptor open for the pipe. That this descriptor is held open by the reading process itself is irrelevant; in theory, that process could still write to the pipe, even if it is blocked trying to read. For example, the read() might be interrupted by a signal handler that writes data to the pipe. (This is a realistic scenario, as we'll see in Section 63.5.2.)
The writing process closes its read descriptor for the pipe for a different reason. When a process tries to write to a pipe for which no process has an open read descriptor, the kernel sends the SIGPIPE signal to the writing process. By default, this signal kills a process. A process can instead arrange to catch or ignore this signal, in which case the write() on the pipe fails with the error EPIPE (broken pipe).
the SIGPIPE
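The EPIPE behavior described above can be observed with a short experiment. The following is a sketch under our own assumptions: the helper name write_without_reader() is hypothetical, and the function permanently sets the disposition of SIGPIPE to ignored in the calling process.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Write to a pipe that has no open read descriptor, with SIGPIPE ignored.
   Returns the errno value from the failed write(), 0 if the write
   unexpectedly succeeded, or -1 if the setup steps failed. */
int write_without_reader(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    /* Without this, the default action of SIGPIPE would kill the process */
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;

    close(pfd[0]);                      /* Now no process can read the pipe */

    int err = 0;
    if (write(pfd[1], "x", 1) == -1)
        err = errno;                    /* Expected: EPIPE */

    close(pfd[1]);
    return err;
}
```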
be sure which process will be the first to succeed; the two processes race for data. Preventing such races would require the use of some synchronization mechanism. However, if we require bidirectional communication, there is a simpler way: just create two pipes, one for sending data in each direction between the two processes. (If employing this technique, then we need to be wary of deadlocks that may occur if both processes block while trying to read from empty pipes or while trying to write to pipes that are already full.)
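The two-pipe arrangement might look like the following sketch, in which a parent sends a string to a child and receives an uppercased copy in return. The function and variable names are hypothetical. To sidestep the deadlock risk mentioned above, the sketch assumes that the message is short enough to fit comfortably within the pipe capacity.

```c
#include <ctype.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send 'msg' to a child over one pipe; the child uppercases each byte and
   sends it back over a second pipe. 'reply' must have room for
   strlen(msg) + 1 bytes. Returns 0 on success, -1 on error. */
int upcase_via_child(const char *msg, char *reply)
{
    int to_child[2], to_parent[2];
    size_t len = strlen(msg);

    if (pipe(to_child) == -1 || pipe(to_parent) == -1)
        return -1;

    switch (fork()) {
    case -1:
        return -1;
    case 0: {                           /* Child: read request, write reply */
        char c;
        close(to_child[1]);             /* Close the ends the child won't use */
        close(to_parent[0]);
        while (read(to_child[0], &c, 1) == 1) {
            c = (char) toupper((unsigned char) c);
            if (write(to_parent[1], &c, 1) != 1)
                _exit(1);
        }
        _exit(0);
    }
    default:
        break;
    }

    close(to_child[0]);                 /* Close the ends the parent won't use */
    close(to_parent[1]);

    if (write(to_child[1], msg, len) != (ssize_t) len)
        return -1;
    close(to_child[1]);                 /* Child sees EOF after the request */

    size_t got = 0;
    ssize_t n;
    while ((n = read(to_parent[0], reply + got, len - got)) > 0)
        got += (size_t) n;
    reply[got] = '\0';

    close(to_parent[0]);
    wait(NULL);
    return (got == len) ? 0 : -1;
}
```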
While it is possible to have multiple processes writing to a pipe, it is typical to have only a single writer. (We show one example of where it is useful to have multiple writers to a pipe in Section 44.3.) By contrast, there are situations where it can be useful to have multiple writers on a FIFO, and we see an example of this in Section 44.8.
Starting with kernel 2.6.27, Linux supports a new, nonstandard system call, pipe2(). This system call performs the same task as pipe(), but supports an additional argument, flags, that can be used to modify the behavior of the system call. Two flags are supported. The O_CLOEXEC flag causes the kernel to enable the close-on-exec flag (FD_CLOEXEC) for the two new file descriptors. This flag is useful for the same reasons as the open() O_CLOEXEC flag described in Section 4.3.1. The O_NONBLOCK flag causes the kernel to mark both underlying open file descriptions as nonblocking, so that future I/O operations will be nonblocking. This saves additional calls to fcntl() to achieve the same result.
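A brief sketch of pipe2() in use follows. The two helper names are hypothetical; the second helper uses fcntl() merely to confirm that both properties are in effect on a descriptor.

```c
#define _GNU_SOURCE             /* Exposes pipe2() in <unistd.h> on glibc */
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe whose descriptors are close-on-exec and nonblocking in a
   single call. Returns 0 on success, or -1 on error. */
int make_cloexec_nonblock_pipe(int pfd[2])
{
    return pipe2(pfd, O_CLOEXEC | O_NONBLOCK);
}

/* Return 1 if 'fd' has both properties, otherwise 0. FD_CLOEXEC is a file
   descriptor flag (F_GETFD); O_NONBLOCK is an open file status flag
   (F_GETFL). */
int has_cloexec_and_nonblock(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);
    int flflags = fcntl(fd, F_GETFL);

    if (fdflags == -1 || flflags == -1)
        return 0;
    return (fdflags & FD_CLOEXEC) != 0 && (flflags & O_NONBLOCK) != 0;
}
```

Besides saving system calls, setting close-on-exec atomically at creation time avoids a window in which another thread could fork() and exec() before a separate fcntl() call had been made.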
Figure 44-3:
    int filedes[2];

    if (pipe(filedes) == -1)                    /* Create the pipe */
        errExit("pipe");

    switch (fork()) {                           /* Create a child process */
    case -1:
        errExit("fork");

    case 0:             /* Child */
        if (close(filedes[1]) == -1)            /* Close unused write end */
            errExit("close");

        /* Child now reads from pipe */
        break;

    default:            /* Parent */
        if (close(filedes[0]) == -1)            /* Close unused read end */
            errExit("close");

        /* Parent now writes to pipe */
        break;
    }
One reason that it is not usual to have both the parent and child reading from a single pipe is that if two processes try to simultaneously read from a pipe, we can't
[Figure: a) after fork(), the parent process and the child process each hold filedes[0] and filedes[1]; b) after closing unused descriptors, the parent process holds filedes[1] and the child process holds filedes[0]]
change the pipe capacity to any value in the range from the system page size up to the value in /proc/sys/fs/pipe-max-size. The default value for pipe-max-size is 1,048,576 bytes. A privileged (CAP_SYS_RESOURCE) process can override this limit. When allocating space for the pipe, the kernel may round size up to some value convenient for the implementation. The
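On Linux, the pipe capacity is changed and queried with the fcntl() F_SETPIPE_SZ and F_GETPIPE_SZ operations; we assume here that these are the operations the passage above refers to, since the surrounding text is incomplete. The following sketch, with a hypothetical helper name, requests a capacity and reports what the kernel actually provided.

```c
#define _GNU_SOURCE             /* Exposes F_SETPIPE_SZ and F_GETPIPE_SZ */
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to set the capacity of the pipe referred to by 'fd' to
   at least 'size' bytes. Returns the capacity actually in effect, which
   may be larger than 'size' because of rounding, or -1 on error. */
long resize_pipe(int fd, int size)
{
    if (fcntl(fd, F_SETPIPE_SZ, size) == -1)
        return -1;
    return fcntl(fd, F_GETPIPE_SZ);
}
```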
[Figure: calling process, filedes[0], filedes[1], direction of data flow]
Writes of up to PIPE_BUF bytes are guaranteed to be atomic
If multiple processes are writing to a single pipe, then it is guaranteed that their data won't be intermingled if they write no more than PIPE_BUF bytes at a time.
SUSv3 requires that PIPE_BUF be at least _POSIX_PIPE_BUF (512). An implementation should define PIPE_BUF (in <limits.h>) and/or allow the call fpathconf(fd, _PC_PIPE_BUF)
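The fpathconf() call mentioned above can be tried out as follows. This is a small sketch with a hypothetical helper name; it creates a throwaway pipe purely so that there is a descriptor to query.

```c
#include <unistd.h>

/* Return the PIPE_BUF limit that applies to a newly created pipe, or -1
   if the pipe can't be created or the limit is indeterminate. */
long pipe_buf_for_new_pipe(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    long limit = fpathconf(pfd[0], _PC_PIPE_BUF);

    close(pfd[0]);
    close(pfd[1]);
    return limit;
}
```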
Figure 44-1: Using a pipe to connect two processes
[Figure labels: stdout, stdin, read end, byte stream; unidirectional]
One point to note in Figure 44-1 is that the two processes are connected to the pipe so that the writing process (ls) has its standard output (file descriptor 1) joined to the write end of the pipe, while the reading process (wc) has its standard input (file descriptor 0) joined to the read end of the pipe. In effect, these two processes are unaware of the existence of the pipe; they just read from and write to the standard file descriptors. The shell must do some wo
This chapter describes pipes and FIFOs. Pi
Chapter 43
43-2.
Repeat the preceding exercise for System V message queues, POSIX message
Interprocess Communication Overview
In some circumstances, different IPC facilities may show notable differences in performance. However, in later chapters, we generally refrain from making performance comparisons, for the following reasons:
The performance of an IPC facility may not be a significant factor in the overall performance of an application, and it ma
Persistence
persistence
interfaces more complicated to use. The corresponding POSIX IPC facilities were designed to address these problems. The following points are of particular note:
The System V IPC facilities are connectionless. These facilities provide no notion of a handle (like a file descriptor) referring to an open IPC object. In
refer to the object. The kernel does not record the process as having opened the object (unlike other types of IPC objects). This means that the kernel can't maintain a reference count of the number of processes that are currently using an object. Consequently, it can require additional programming effort for an application to be able to know when an object can safely be deleted.
The programming interfaces for the System V IPC facilities are inconsistent with the traditional UNIX I/O model (they use integer key values and IPC identifiers instead of pathnames and file descriptors). The programming interfaces are also overly complex. This last point applies particularly to System V semaphores (refer to Sections 47.11 and 53.5).
By contrast, the kernel counts open references for POSIX IPC objects. This simplifies decisions about when an object can
UNIX domain sockets provide a feature that allows a file descriptor to be passed from one process to another. This allows one process to open a file and make it available to another process that otherwise might not be able to access the file. We briefly describe this feature in Section 61.13.3.
Functionality
There are functional differences between the various IPC facilities that can be rele-
used for synchronization, with the synchronization operation taking the form of exchanging messages via the facility.
Since kernel 2.6.22, Linux provides an additional, nonstandard synchronization mechanism via the eventfd() system call. This system call creates an eventfd object that has an associated 8-byte unsigned integer maintained by the kernel. The system call returns a file descriptor that refers to the object. Writing an integer to this file descriptor adds that integer to the object's value. A read() from the file descriptor blocks if the object's value is 0. If the object has a nonzero value, a
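The counter semantics of an eventfd object can be illustrated with a small sketch. The helper name eventfd_sum() is hypothetical. The sketch relies on the documented default (non-semaphore) behavior in which a read() returns the accumulated counter value and resets the counter to 0.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Add 'a' and then 'b' to a new eventfd counter, and read the counter back.
   Returns the value read, or 0 on any error. At least one of 'a' and 'b'
   should be nonzero, since a read() of a zero counter would block. */
uint64_t eventfd_sum(uint64_t a, uint64_t b)
{
    int efd = eventfd(0, 0);            /* Initial value 0, no flags */
    if (efd == -1)
        return 0;

    uint64_t total = 0;
    if (write(efd, &a, sizeof(a)) != (ssize_t) sizeof(a) ||
        write(efd, &b, sizeof(b)) != (ssize_t) sizeof(b) ||
        read(efd, &total, sizeof(total)) != (ssize_t) sizeof(total))
        total = 0;

    close(efd);
    return total;
}
```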
UNIX systems provide the following synchronization facilities:
Semaphores: A semaphore is a kernel-maintained integer whose value is never permitted to fall below 0. A process can decrease or increase the value of a semaphore. If an attempt is made to decrease the value of the semaphore below 0, then the kernel blocks the operation until the semaphore's value increases to a level that permits the operation to be performed. (Alternatively, the process can request a nonblocking operation; then, instead of blocking, the
: The data exchanged via System V message queues, POSIX message
In some cases, facilities that are groupe
r
[Figure labels: Process A, Process B, Kernel, buffer]
Although some of these facilities are concerned with synchronization, the general term interprocess communication (IPC) is often used to describe them all.
Figure 43-1: A taxonomy of UNIX IPC facilities
As Figure 43-1 illustrates, often several facilities provide similar IPC functionality. There are a couple of reasons for this:
Similar facilities evolved on different UNIX variants, and later came to be ported to other UNIX systems. For example, FIFOs were developed on System V, while (stream) sockets were developed on BSD.
New facilities have been developed to address design deficiencies in similar earlier facilities. For example, the POSIX IPC facilities (message queues, semaphores, and shared memory) were designed as an improvement on the older System V IPC facilities.
[Figure labels: data transfer, message, byte stream, shared memory, synchronization, semaphore, file lock, pseudoterminal, anonymous]
INTERPROCESS COMMUNICATION OVERVIEW
This chapter presents a brief overview of the facilities that processes and threads can use to communicate with one another and to synchronize their actions. The following chapters provide more
Chapter 42
atically linked against the library.) This technique provides an alternative to the traditional library versioning approach of using major and minor version numbers in the shared library real name.
Defining initialization and finalization functions within a shared library allows us to automatically execute code when the library is loaded and unloaded.
The LD_PRELOAD environment variable allows us to preload shared libraries. Using this mechanism, we can selectively override functions and other symbols that the dynamic linker would normally find in other shared libraries.
We can assign various values to the LD_DEBUG environment variable in order to monitor the operation of the dynamic linker.
Further information
Refer to the sources of further information listed in Section 41.14.
42.8 Exercises
42-1. Write a program to verify that if a library is closed with dlclose(), it is not unloaded if any of its symbols are used by another library.
42-2. Add a dladdr() call to the program in Listing 42-1 (dynload.c
Advanced Features of Shared Libraries
The following example shows an abridged version of the output provided when we
request tracing of information about library searches:
LD_DEBUG=libs date
10687: find library=librt.so.1 [0]; searching
10687: search cache=/etc/ld.so.cache
10687: trying file=/lib/librt.so.1
10687: find library=libc.so.6 [0]; searching
10687: search cache=/etc/ld.so.cache
10687: trying file=/lib/libc.so.6
10687: find library=libpthread.so.0 [0]; searching
10687: search cache=/etc/ld.so.cache
10687: trying file=/lib/libpthread.so.0
10687: calling init: /lib/libpthread.so.0
10687: calling init: /lib/libc.so.6
10687: calling init: /lib/librt.so.1
10687: initialize program: date
10687: transferring control: date
Tue Dec 28 17:26:56 CEST 2010
10687: calling fini: date [0]
10687: calling fini: /lib/librt.so.1 [0]
10687: calling fini: /lib/libpthread.so.0 [0]
10687: calling fini: /lib/libc.so.6 [0]
The value 10687 displayed at the start of each line is the process ID of the process being traced. This is useful if we are monitoring several processes (e.g., parent and
By default, LD_DEBUG output is written to standard error, but we can direct it elsewhere by assigning a pathname to the LD_DEBUG_OUTPUT environment variable.
If desired, we can assign multiple options to LD_DEBUG by separating them with commas (no spaces should appear). The output of the symbols option (which traces symbol resolution by the dynamic linker) is particularly voluminous.
LD_DEBUG is effective both for libraries implicitly loaded by the dynamic linker and for libraries dynamically loaded by dlopen().
For security reasons, LD_DEBUG is (since glibc
suppose that we have a program that calls functions x1() and x2(), defined in our libdemo library. When we run this program, we see the following output:
    Called mod1-x1 DEMO
    Called mod2-x2 DEMO
(In this example, we assume that the shared library is in one of the standard directories, and thus we don't need to use the LD_LIBRARY_PATH environment variable.)
We could selectively override the function x1() by creating another shared library, libalt.so, which contains a different definition of x1(). Preloading this library when running the program would result in the following:
    LD_PRELOAD=libalt.so ./prog
    Called mod1-x1 ALT
    Called mod2-x2 DEMO
Here, we see that the version of x1() defined in libalt.so is invoked, but that the call to x2(), for which no definition is provided in libalt.so, results in the invocation of the x2() function defined in libdemo.so.
The LD_PRELOAD environment variable controls preloading on a per-process basis. Alternatively, the file
functions are executed regardless of whether the library is loaded automatically or loaded explicitly using the dlopen interface (Section 42.1).
Initialization and finalization functions are defined using the gcc constructor and destructor attributes. Each function that is to be executed when the library is loaded should be defined as follows:
    void __attribute__ ((constructor)) some_name_load(void)
    {
        /* Initialization code */
    }
Unload functions are similarly defined:
    void __attribute__ ((destructor)) some_name_unload(void)
    {
        /* Finalization code */
    }
The function names some_name_load() and some_name_unload() can be replaced by any other desired names.
It is also possible to use the gcc constructor and destructor attributes to create initialization and finalization functions in a main program.
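The main-program case can be demonstrated with a few lines. In this sketch, all of the names (init_ran, demo_load(), demo_unload(), and demo_initialized()) are hypothetical; the point is only that the constructor runs before main() is entered.

```c
#include <stdio.h>

static int init_ran = 0;        /* Set by the constructor below */

/* Executed automatically before main() because of the constructor attribute */
static void __attribute__ ((constructor)) demo_load(void)
{
    init_ran = 1;
}

/* Executed automatically after main() returns or exit() is called */
static void __attribute__ ((destructor)) demo_unload(void)
{
    fputs("demo_unload: finalization ran\n", stderr);
}

/* Lets callers observe whether the constructor has already run */
int demo_initialized(void)
{
    return init_ran;
}
```

A main() that calls demo_initialized() as its first statement should see the value 1, since the constructor has already executed by then.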
The _init() and _fini() functions
An older technique for shared library initialization and finalization is to create two functions, _init() and _fini(), as part of the library. The void _init(void) function contains code that is to be executed when the library is first loaded by a process. The void _fini(void) function contains code that is to be executed when the library is unloaded.
If we create _init() and _fini() functions, then we must specify the gcc -nostartfiles option when building the shared library, in order to prevent the linker from including default versions of these functions. (Using the -Wl,-init and -Wl,-fini linker options, we can choose alternative names for these two functions if desired.)
Use of _init() and _fini() is now considered obsolete in favor of the gcc constructor and destructor attributes, which, among other advantages, allow us to define multiple initialization and finalization functions.
42.5 Preloading Shared Libraries
For testing purposes, it can sometimes be useful to selectively override functions (and other symbols) that would normally be found by the dynamic linker using the rules described in Section 41.11. To do this, we can define the environment variable LD_PRELOAD as a string consisting of space-separated or colon-separated names of shared libraries that should be loaded before any other shared libraries. Since these libraries are loaded first, any functions they define will automatically be used if required by the executable, thus overriding any other functions of the same name that the dynamic linker would otherwise have searched for. For example,
Version tag dependencies indicate the relationships between successive library versions. Semantically, the only effect of version tag dependencies on Linux is that a version node inherits global and local specifications from the version node upon which it depends.
Dependencies can be chained, so that we could have another version node VER_3, which depended on VER_2, and so on.
The version tag names have no meanings in themselves. Their relationship
Further information about symbol versioning can be found using the com-
mand
and at
http://people.redhat.com/drepper/symbol-versioning
42.4 Initialization and Finalization Functions
It is possible to define one or more functions that are executed automatically when a shared library is loaded and unloaded. This allows us to perform initialization and finalization actions when working with shared libraries. Initialization and finalization
When we run this program, we see the expected result:
    LD_LIBRARY_PATH=. ./p1
    v1 xyz
Now, suppose that we want to modify the definition of xyz() within our library, while still ensuring that program p1 continues to use the old version of this function. To do this, we must define two versions of xyz() within our library:
    cat sv_lib_v2.c
    #include <stdio.h>

    __asm__(".symver xyz_old,xyz@VER_1");
    __asm__(".symver xyz_new,xyz@@VER_2");

    void xyz_old(void) { printf("v1 xyz\n"); }

    void xyz_new(void) { printf("v2 xyz\n"); }

    void pqr(void) { printf("v2 pqr\n"); }
Our two versions of xyz() are provided by the functions xyz_old() and xyz_new(). The xyz_old() function corresponds to our original definition of xyz(), which is the one that should continue to be used by program p1. The xyz_new() function provides the definition of xyz() to be used by programs linking against the new version of the library.
The two .symver assembler directives are the glue that ties these two functions to different version tags in the modified version script (shown in a moment) that we use to create the new version of the shared library. The first of these directives says that xyz_old() is the implementation of xyz() to be used for applications linked against version tag VER_1 (i.e., program p1 in our example); the second says that xyz_new() is the implementation of xyz() to be used by applications linked against version tag VER_2.
The use of @@ rather than @ in the second .symver directive indicates that this is the default definition of xyz() to which applications should bind when statically linked against this shared library. Exactly one of the .symver directives for a symbol should be marked using @@.
The corresponding version script for our modified library is as follows:
    cat sv_v2.map
    VER_1 {
        global: xyz;
        local: *;       # Hide all other symbols
    };

    VER_2 {
        global: pqr;
    } VER_1;
This version script provides a new version tag, VER_2, which depends on the tag VER_1. This dependency is indicated by the following line:
readelf once more shows that vis_comm() is no longer externally visible:
readelf --syms --use-dynamic vis.so | grep vis_
25 0: 00000730 73 FUNC GLOBAL DEFAULT 11 vis_f2
29 16: 000006f0 59 FUNC GLOBAL DEFAULT 11 vis_f1
42.3.2 Symbol Versioning
Symbol versioning allows a single shared library to provide multiple versions of the same function. Each program uses the version of the function that was current when the program was (statically) linked against the shared library. As a result, we can make an incompatible change to a shared library without needing to increase the library's major version number. Carried to an extreme, symbol versioning can replace the traditional shared library major and minor versioning scheme. Symbol versioning is used in this manner in glibc 2.1 and later, so that all versions of glibc from 2.0 onward are supported within a single major library version (libc.so.6).
We demonstrate the use of symbol versioning with a simple example. We begin by creating the first version of a shared library using a version script:
    cat sv_lib_v1.c
    #include <stdio.h>

    void xyz(void) { printf("v1 xyz\n"); }
    cat sv_v1.map
    VER_1 {
        global: xyz;
        local: *;       # Hide all other symbols
    };
    gcc -g -c -fPIC -Wall sv_lib_v1.c
    gcc -g -shared -o libsv.so sv_lib_v1.o -Wl,--version-script,sv_v1.map
Within a version script, the hash character (#) starts a comment.
(To keep the example simple, we avoid the use of explicit library sonames and library major version numbers.)
At this stage, our version script, sv_v1.map, serves only to control the visibility of the shared library's symbols; xyz() is exported, but all other symbols (of which there are none in this small example) are hidden. Next, we create a program, p1, which makes use of this library:
    cat sv_prog.c
    #include <stdlib.h>

    int
    main(int argc, char *argv[])
    {
        void xyz(void);

        xyz();

        exit(EXIT_SUCCESS);
    }
    gcc -g -o p1 sv_prog.c libsv.so
the three source files vis_comm.c, vis_f1.c, and vis_f2.c, which respectively define the functions vis_comm(), vis_f1(), and vis_f2(). The vis_comm() function is called by vis_f1() and vis_f2(), but is not intended for direct use by applications linked against the library. Suppose we build the shared library in the usual way:
    gcc -g -c -fPIC -Wall vis_comm.c vis_f1.c vis_f2.c
    gcc -g -shared -o vis.so vis_comm.o vis_f1.o vis_f2.o
If we use the following readelf command to list the dynamic symbols exported by the library, we see the following:
    readelf --syms --use-dynamic vis.so | grep vis_
    30 12: 00000790 59 FUNC GLOBAL DEFAULT 10 vis_f1
    25 13: 000007d0 73 FUNC GLOBAL DEFAULT 10 vis_f2
    27 16: 00000770 20 FUNC GLOBAL DEFAULT 10 vis_comm
This shared library exported three symbols: vis_comm(), vis_f1(), and vis_f2(). However, we would like to ensure that only the symbols vis_f1() and vis_f2() are exported by the library. We can achieve this result using the following version script:
    cat vis.map
    VER_1 {
        global:
            vis_f1;
            vis_f2;
        local:
            *;
    };
The identifier VER_1 is an example of a version tag. As we'll see in the discussion of symbol versioning in Section 42.3.2, a version script may contain multiple version nodes, each grouped within braces ({}) and prefixed with a unique version tag. If we are using a version script only for the purpose of controlling symbol visibility, then the version tag is redundant; nevertheless, older versions of ld required it. Modern versions of ld allow the version tag to be omitted; in this case, the version node is said to have an anonymous version tag, and no other version nodes may be present in the script.
Within the version node, the global keyword begins a semicolon-separated list of symbols that are made visible outside the library. The local keyword begins a list of symbols that are to be hidden from the outside world. The asterisk (*) here illustrates the fact that we can use wildcard patterns in these symbol specifications. The wildcard characters are the same as those used for shell filename matching; for example, * and ?. (See the glob(7) manual page for further details.) In this example, using an asterisk for the local specification says that everything that wasn't explicitly declared global is hidden. If we did not say this, then vis_comm() would still be visible, since the default is to make C global symbols visible outside the shared library.
We can then build our shared library using the version script as follows:
    gcc -g -c -fPIC -Wall vis_comm.c vis_f1.c vis_f2.c
    gcc -g -shared -o vis.so vis_comm.o vis_f1.o vis_f2.o \
            -Wl,--version-script,vis.map
As well as making a symbol private to a source-code module, the static keyword also has a converse effect. If a symbol is marked as static, then all references to the symbol in the same source file will be bound to that definition of the symbol. Consequently, these references won't be subject to run-time interposition by definitions from other shared libraries (in the manner described in Section 41.12). This effect of the static keyword is similar to the -Bsymbolic linker option described in Section 41.12, with the difference that the static keyword affects a single symbol within a single source file.
The GNU C compiler, gcc, provides a compiler-specific attribute declaration that performs a similar task to the static keyword:
    void
    __attribute__ ((visibility("hidden")))
    func(void) {
        /* Code */
    }
Whereas the static keyword limits the visibility of a symbol to a single source code file, the hidden attribute makes the symbol av
As with the static keyword, the hidden attribute also has the converse effect of preventing symbol interposition at run time.
Version scripts (Section 42.3) can be used to precisely control symbol visibility and to select the version of a symbol to which a reference is bound.
When dynamically loading a shared library (Section 42.1.1), the dlopen() RTLD_GLOBAL flag can be used to specify that the symbols defined by the library should be made available for binding by subsequently loaded libraries, and the --export-dynamic linker option (Section 42.1.6) can be used to make the global symbols of the main program available to dynamically loaded libraries.
42.1.6 Accessing Symbols in the Main Program
Suppose that we use dlopen() to dynamically load a shared library, use dlsym() to obtain the address of a function x() from that library, and then call x(). If x() in turn calls a function y(), then y() would normally be sought in one of the shared libraries loaded by the program.
Sometimes, it is desirable instead to have x() invoke an implementation of y() in the main program. (This is similar to a callback mechanism.) In order to do this, we must make the (global-scope) symbols in the main program available to the dynamic linker, by linking the program using the --export-dynamic linker option:
    gcc -Wl,--export-dynamic main.c     (plus further options and arguments)
Equivalently, we can write the following:
    gcc -export-dynamic main.c
Using either of these options allows a dynamically loaded library to access global symbols in the main program.
The gcc -rdynamic option and the gcc -Wl,-E option are further synonyms for -Wl,--export-dynamic.
42.2 Controlling Symbol Visibility
A well-designed shared library should make visible only those symbols (functions and variables) that form part of its specified application binary interface (ABI). The reasons for this are as follows:
If the shared library designer accidentally exports unspecified interfaces, then authors of applications that use the library may choose to employ these interfaces. This creates a compatibility problem for future upgrades of the shared library. The library developer expects to be able to change or remove any interfaces other than those in the documented ABI, while the library user expects to continue using the same interfaces (with the same semantics) that they currently employ.
During run-time symbol resolution, any symbols that are exported by a shared library might interpose definitions that are provided in other shared libraries (Section 41.12).
Exporting unnecessary symbols increases the size of the dynamic symbol table that must be loaded at run time.
All of these problems can be minimized
42.1.4 Closing a Shared Library: dlclose()
The dlclose() function closes a library.
The dlclose() function decrements the system's counter of open references to the library referred to by handle. If this reference count falls to 0, and no symbols in the library are required by other libraries, then the library is unloaded. This procedure is also (recursively) performed for the libraries in this library's dependency tree. An implicit dlclose() of all libraries is performed on process termination.
From glibc 2.2.3 onward, a function within a shared library can use atexit() (or on_exit()) to establish a function that is called automatically when the library is
42.1.5 Obtaining Information About Loaded Symbols: dladdr()
Given an address in addr (typically, one obtained by an earlier call to
    #define _GNU_SOURCE
    #include <dlfcn.h>

    int dladdr(const void *addr, Dl_info *
LD_LIBRARY_PATH=. ./dynload libdemo.so.1 x1
Called mod1-x1
In the first of the above commands,
notes that the library path includes a
Instead of the *(void **) syntax shown above, one might consider using the following seemingly equivalent code when assigning the return value of dlsym():
    (void *) funcp = dlsym(handle, symbol);
However, for this code, gcc -pedantic warns that ANSI C forbids the use of cast expressions as lvalues. The *(void **) syntax doesn't incur this warning because we are assigning to an address pointed to by the assignment's lvalue.
On many UNIX implementations, we can use casts such as the following to eliminate warnings from the C compiler:
    funcp = (int (*) (int)) dlsym(handle, symbol);
However, the specification of dlsym() in SUSv3 Technical Corrigendum Number 1 notes that the C99 standard nevertheless requires compilers to generate a warning for such a conversion, and proposes the *(void **) syntax shown above.
SUSv3 TC1 noted that because of the need for the *(void **) syntax, a future version of the standard may define separate dlsym()-like APIs for handling data and function pointers. However, SUSv4 contains no changes with respect to this point.
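To make the *(void **) idiom concrete, here is a small sketch that looks up getpid() by name and calls it through a function pointer. Several details are our own assumptions rather than part of the text above: the helper name getpid_via_dlsym() is hypothetical, and we pass NULL to dlopen() to obtain a handle for the main program and the libraries loaded with it. On glibc versions before 2.34, the program must also be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>

/* Look up getpid() by name in the running program's global symbol scope
   and call it through a function pointer. Returns the process ID, or -1
   on error. */
pid_t getpid_via_dlsym(void)
{
    pid_t (*funcp)(void);

    void *handle = dlopen(NULL, RTLD_LAZY);     /* NULL: the main program */
    if (handle == NULL)
        return -1;

    (void) dlerror();                           /* Clear any earlier error */
    *(void **) (&funcp) = dlsym(handle, "getpid");
    if (dlerror() != NULL || funcp == NULL) {
        dlclose(handle);
        return -1;
    }

    pid_t pid = (*funcp)();                     /* Call via the pointer */
    dlclose(handle);
    return pid;
}
```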
Using library pseudohandles with dlsym()
If
is found,
dlfcn l7.;ۼn;.6.;&#xh000;.h
the value of the variable by dereferenc-
ing the pointer:
int *ip;
ip = (int *) dlsym(symbol, "myvar");
if (ip != NULL)
printf("Value is %d\n", *ip);
If symbol is the name of a function, the pointer returned by dlsym() can be used to call the function using the usual C syntax for dereferencing function pointers:
res = (*funcp)(somearg);
#include <dlfcn.h>
void *dlsym(void *handle, const char *symbol);
RTLD_NOLOAD (since glibc 2.2)
Don't load the library. This serves two purposes. First, we can use this flag to check if a particular library is currently loaded as part of the process's address space. If it is, dlopen
of the error.
The dlsym() function searches for the named symbol (a function or variable) in the library referred to by handle and in the libraries in that library's dependency tree.
#include <dlfcn.h>
const char *dlerror(void);
The flags argument is a bit mask that must include exactly one of the constants RTLD_LAZY or RTLD_NOW, with the following meanings:
RTLD_LAZY
Undefined function symbols in the library should be resolved only as the code is executed. If a piece of code requiring a particular symbol is not executed, that symbol is never resolved. Lazy resolution is performed only for function references; references to variables are always resolved immediately. Specifying the RTLD_LAZY flag provides behavior that corresponds to the normal operation of the dynamic linker when loading the shared libraries identified in an executable's dynamic dependency list.
RTLD_NOW
All undefined symbols in the library should be immediately resolved
before
The dlopen API enables a program to open a shared library at run time, search for a function by name in that library, and then call the function. A shared library loaded at run time in this way is commonly referred to as a dynamically loaded library, and is created in the same way as any other shared library.
The dlopen API consists of the following functions (all of which are specified in SUSv3):
The dlopen() function opens a shared library,
ADVANCED FEATURES OF SHARED LIBRARIES
The previous chapter covered the fundamentals of shared libraries. This chapter describes a number of advanced features of shared libraries, including the following:
dynamically loading shared libraries;
controlling the visibility of symbols defined by a shared library;
using linker scripts to create versioned symbols;
using initialization and finalization functions to automatically execute code when a library is loaded and unloaded;
shared library preloading; and
using LD_DEBUG to monitor the operation of the dynamic linker.
42.1 Dynamically Loaded Libraries
When an executable starts, the dynamic linker loads all of the shared libraries in the program's dynamic dependency list. So
Fundamentals of Shared Libraries
857
required at run time. When the file is executed, the dynamic linker uses this information to load the required shared libraries. At run time, all programs using the same shared library share a single copy of that library in memory. Since shared libraries are not copied into executable files, and a single memory-resident copy of the shared library is employed by all programs at run time, shared libraries reduce the amount of disk space and memory required by the system.
The shared library soname provides a level of indirection in resolving shared library references at run time. If a shared library has a soname, then this name, rather than the library's real name, is recorded in the resulting executable produced by the static linker. A versioning scheme, whereby a shared library is given a real name of the form libname.so.major-id.minor-id, while the soname has the form libname.so.major-id, allows for the creation of programs that automatically employ the latest minor version of the shared library (without requiring the programs to be relinked), while also allowing for the creation of new, incompatible major versions of the library.
In order to find a shared library at run time, the dynamic linker follows a standard
, a tool that shields
the programmer from the
856
Chapter 41
The -Bsymbolic linker option specifies that references to global symbols within a shared library should be preferentially bound to definitions (if they exist) within that library. (Note that, regardless of this option, calling xyz() from the main program would always invoke the version of xyz() defined in the main program.)
41.13 Using a Static Library Instead of a Shared Library
Although it is almost always preferable to use shared libraries, there are occasional situations where static libraries may be appropriate. In particular, the fact that a statically linked application contains all of the code that it requires at run time can be advantageous. For example, static linking is useful if the user can't, or doesn't wish to, install a shared library on the system where the program is to be used, or if the program is to be run in an environment (perhaps a chroot jail, for example) where shared libraries are unavailable. In addition, even a compatible shared library upgrade may unintentionally introduce a bug that breaks an application. By linking an application statically, we can ensure that it is immune to changes in the shared libraries on a system and that it has all of the code it requires to run (at the expense of a larger program size, and consequent increased disk and memory requirements).
By default, where the linker has a choice of a shared and a static library of the same name (e.g., we link using -Lsomedir -ldemo, and both libdemo.so and libdemo.a exist), the shared version of the library is used. To force usage of the static version of the library, we may do one of the following:
Specify the pathname of the static library (including the .a extension) on the command line.
Specify the -static option to gcc.
Use the gcc options -Wl,-Bstatic and -Wl,-Bdynamic to explicitly toggle the linker's choice between static and shared libraries. These options can be intermingled with -l options on the gcc command line. The linker processes the options in the order in which they are specified.
41.14 Summary
An object library is an aggregation of compiled object modules that can be employed by programs that are linked against the library. Like other UNIX implementations, Linux provides two types of object libraries: static libraries, which were the only type of library available under early UNIX systems, and the more modern shared libraries.
Because they provide several advantages over static libraries, shared libraries are the predominant type of library in use on contemporary UNIX systems. The advantages of shared libraries spring primarily from the fact that when a program is linked against the library, copies of the object modules required by the program are not included in the resulting executable. Instead, the (static) linker merely includes information in the executable file about the shared libraries that are
Figure 41-5: Resolving a global symbol reference
When we build the shared library and the executable program, and then run the
program, this is what we see:
gcc -g -c -fPIC -Wall -c foo.c
gcc -g -shared -o libfoo.so foo.o
gcc -g -o prog prog.c libfoo.so
LD_LIBRARY_PATH=. ./prog
main-xyz
From the last line of output, we can see that the definition of xyz() in the main program overrides (interposes) the one in the shared library.
Although this may at first appear surprising, there is a good historical reason why things are done this way. The first shared library implementations were designed so that the default semantics for symbol resolution exactly mirrored those of applications linked against static equivalents of the same libraries. This means that the following semantics apply:
A definition of a global symbol in the main program overrides a definition in a library.
If a global symbol is defined in multiple libraries, then a reference to that symbol is bound to the first definition found by scanning libraries in the left-to-right order in which they were listed on the static link command line.
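The two default resolution rules above can be modeled with a small sketch. This is only an illustrative model of the search order, not how the dynamic linker is implemented; the struct unit type and the resolveSymbol() function are hypothetical names introduced for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one link unit: a name plus a NULL-terminated
   list of the global symbols that the unit defines. */
struct unit {
    const char *name;
    const char *const *symbols;
};

/* Return 1 if unit 'u' defines 'symbol', otherwise 0 */
static int
defines(const struct unit *u, const char *symbol)
{
    size_t j;

    for (j = 0; u->symbols[j] != NULL; j++)
        if (strcmp(u->symbols[j], symbol) == 0)
            return 1;
    return 0;
}

/* Return the name of the unit whose definition of 'symbol' is used:
   the main program is consulted first, then the libraries in the
   left-to-right order of the link command line. Returns NULL if no
   unit defines the symbol. */
const char *
resolveSymbol(const struct unit *mainProg, const struct unit *libs,
              size_t numLibs, const char *symbol)
{
    size_t j;

    if (defines(mainProg, symbol))
        return mainProg->name;
    for (j = 0; j < numLibs; j++)
        if (defines(&libs[j], symbol))
            return libs[j].name;
    return NULL;
}
```

In this model, a symbol defined both in the main program and in a library always resolves to the main program, which mirrors the interposition seen in the example output above.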
Although these semantics make the transition from static to shared libraries relatively straightforward, they can cause some problems. The most significant problem is that these semantics conflict with the model of a shared library as implementing a self-contained subsystem. By default, a shared library can't guarantee that a reference to one of its own global symbols will actually be bound to the library's definition of that symbol. Consequently, the properties of a shared library can change when it is aggregated into a larger unit. This can lead to applications breaking in unexpected ways, and also makes it difficult to perform divide-and-conquer debugging (i.e., trying to reproduce a problem using fewer or different shared libraries).
In the above scenario, if we wanted to ensure that the invocation of xyz() in the shared library actually called the version of the function defined within the library, then we could use the -Bsymbolic linker option when building the shared library:
gcc -g -c -fPIC -Wall -c foo.c
gcc -g -shared -Wl,-Bsymbolic -o libfoo.so foo.o
gcc -g -o prog prog.c libfoo.so
LD_LIBRARY_PATH=. ./prog
foo-xyz
We can also view the rpath lists by grepping the output of the readelf --dynamic (or, equivalently, readelf -d) command.
We can use the ldd
Using the -rpath linker option when building a shared library
The -rpath linker option can also be useful when building a shared library. Suppose we have one shared library, libx1.so, that depends on another, libx2.so, as shown in Figure 41-4. Suppose also that these libraries reside in the nonstandard directories d1 and d2, respectively. We now go through the steps required to build these libraries and the program that uses them.
Figure 41-4: A shared library that depends on another shared library
First, we build libx2.so, in the directory pdir/d2. (To keep the example simple, we dispense with library version numbering and explicit sonames.)
cd /home/mtk/pdir/d2
gcc -g -c -fPIC -Wall modx2.c
gcc -g -shared -o libx2.so modx2.o
Next, we build libx1.so, in the directory pdir/d1. Since libx1.so depends on libx2.so, which is not in a standard directory, we specify the latter's run-time location with the -rpath linker option. This could be different from the link-time location of the library (specified by the -L option), although in this case the two locations are the same.
cd /home/mtk/pdir/d1
gcc -g -c -Wall -fPIC modx1.c
gcc -g -shared -o libx1.so modx1.o -Wl,-rpath,/home/mtk/pdir/d2 \
        -L/home/mtk/pdir/d2 -lx2
Finally, we build the main program, in the pdir directory. Since the main program makes use of libx1.so, and this library resides in a nonstandard directory, we again employ the -rpath linker option:
cd /home/mtk/pdir
gcc -g -Wall -o prog prog.c -Wl,-rpath,/home/mtk/pdir/d1 \
-L/home/mtk/pdir/d1 -lx1
Note that we did not need to mention libx2.so when linking the main program. Since the linker is capable of analyzing the rpath list in libx1.so, it can find libx2.so, and thus is able to satisfy the requirement that all symbols can be resolved at static link time.
We can use the following commands to examine prog and libx1.so in order to see the contents of their rpath lists:
objdump -p prog | grep PATH
  RPATH       /home/mtk/pdir/d1        libx1.so will be sought here at run time
objdump -p d1/libx1.so | grep PATH
  RPATH       /home/mtk/pdir/d2        libx2.so will be sought here at run time

[Figure 41-4 diagram: prog (built from prog.c) uses d1/libx1.so (built from modx1.c), which in turn uses d2/libx2.so (built from modx2.c).]
Assuming the linker name was already correct
41.8 Compatible Versus Incompatible Libraries
Over time, we may need to make changes to the code of a shared library. Such changes result in a new version of the library that is either compatible with previous version(s), meaning that we need to change only the minor version identifier of the library's real name, or incompatible, meaning that we must define a new major version of the library.
A change to a library is compatible with an existing library version if all of the following conditions hold true:
The semantics of each public function and variable in the library remain
unchanged. In other words, each function keeps the same argument list, and
on global variables and returned argu-
ldconfig -v | grep libdemo
        libdemo.so.1 -> libdemo.so.1.0.1 (changed)
        libdemo.so.2 -> libdemo.so.2.0.0 (changed)
Above, we filter the output of ldconfig, so that we see just the information relating to libraries named libdemo.
Next, we list the files named libdemo in /usr/lib
cd /usr/lib
ln -s libdemo.so.1.0.1 libdemo.so.1
ln -s libdemo.so.1 libdemo.so
The last two lines in this shell session create the soname and linker name symbolic links.
ldconfig
The ldconfig(8) program addresses two potential problems with shared libraries:
Shared libraries can reside in a variety of directories. If the dynamic linker needed to search all of these directories in order to find a library, then loading libraries could be very slow.
As new versions of libraries are installed or old versions are removed, the soname symbolic links may become out of date.
The ldconfig program solves these problems by performing two tasks:
gcc -g -shared -Wl,-soname,libdemo.so.1 -o libdemo.so.1.0.1 \
mod1.o mod2.o mod3.o
Next, we create appropriate symbolic links for the soname and linker name:
ln -s libdemo.so.1.0.1 libdemo.so.1
ln -s libdemo.so.1 libdemo.so
We can employ
Typically, the linker name is created in the same directory as the file to which it refers. It can be linked either to the real name or to the soname of the most recent major version of the library. Usually, a link to the soname is preferable, so that changes to the soname are automatically reflected in the linker name. (In Section 41.7, we'll see that the ldconfig program automates the task of keeping sonames up to date, and thus implicitly maintains linker names if we use the convention just described.)
If we want to link a program against an older major version of a shared library, we can't use the linker name. Instead, as part of the link command, we would need to indicate the required (major) version by specifying a particular real name or soname.
The following are some examples of linker names:
libdemo.so -> libdemo.so.2
libreadline.so -> libreadline.so.5
Table 41-1 summarizes information about the shared library real name, soname, and linker name, and Figure 41-3 portrays the relationship between these names.
Figure 41-3: Conventional arrangement of shared library names
[Figure 41-3 diagram: the linker name and the soname (libname.so.maj) lead to the real name libname.so.maj.min, a regular file containing the object code for the library modules.]
Creating a shared library using standard conventions
Real names, sonames, and linker names
Each incompatible version of a shared library is distinguished by a unique major version identifier, which forms part of its real name. By convention, the major version identifier takes the form of a number that is sequentially incremented with each incompatible release of the library. In addition to the major version identifier, the real name also includes a minor version identifier, which distinguishes compatible minor versions within the library major version. The real name employs the format libname.so.major-id.minor-id.
Like the major version identifier, the minor version identifier can be any string, but, by convention, it is either a number, or two numbers separated by a dot, with the first number identifying the minor version, and the second number indicating a patch level or revision number within the minor version. Some examples of real names of shared libraries are the following:
libdemo.so.1.0.1
libdemo.so.1.0.2        Minor version, compatible with version 1.0.1
libdemo.so.2.0.0        New major version, incompatible with version 1.*
libreadline.so.5.0
The soname of the shared library includes the same major version identifier as its corresponding real library name, but excludes the minor version identifier. Thus, the soname has the form libname.so.major-id.
Usually, the soname is created as a relative symbolic link in the directory that contains the real name. The following are some examples of sonames, along with the real names to which they might be symbolically linked:
libdemo.so.1 -> libdemo.so.1.0.2
libdemo.so.2 -> libdemo.so.2.0.0
libreadline.so.5 -> libreadline.so.5.0
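The relationship between a real name and its soname can be expressed as a small string operation. The following is a minimal sketch under the naming convention just described; the function name sonameFromRealName() is hypothetical and is not part of any library. It keeps everything up to and including the major version identifier and discards the minor version identifier.

```c
#include <string.h>

/* Hypothetical helper: given a real name such as "libdemo.so.1.0.2",
   write the conventional soname ("libdemo.so.1") into 'buf'.
   Returns 0 on success, or -1 if the name contains no ".so.major-id"
   component or the result doesn't fit in 'buf'. */
int
sonameFromRealName(const char *realName, char *buf, size_t bufSize)
{
    const char *so, *major, *end;
    size_t len;

    so = strstr(realName, ".so.");
    if (so == NULL)
        return -1;

    major = so + 4;                     /* Start of major-id */
    if (*major == '\0' || *major == '.')
        return -1;

    end = strchr(major, '.');           /* Dot preceding minor-id, if any */
    len = (end != NULL) ? (size_t) (end - realName) : strlen(realName);

    if (len + 1 > bufSize)
        return -1;
    memcpy(buf, realName, len);
    buf[len] = '\0';
    return 0;
}
```

Applied to the real names listed above, this sketch yields the sonames shown in the symbolic link examples.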
For a particular major version of a shared library, there may be several library files distinguished by different minor version identifiers. Normally, the soname corresponding to each major library version points to the most recent minor version within the major version (as shown in the above examples for
The ldd command resolves each library reference (employing the same search conventions as the dynamic linker) and displays the results in the following form:
library-name => resolves-to-path
For most ELF executables, ldd will list entries for at least ld-linux.so.2, the dynamic linker, and libc.so.6, the standard C library.
The name of the C library is different on some architectures. For example, this library is named libc.so.6.1
The objdump and readelf commands
The objdump command can be used to obtain various information (including disassembled binary machine code) from an executable file, compiled object, or shared library. It can also be used to display information from the headers of the various ELF sections of these files; in this usage, it resembles readelf, which displays similar information, but in a different format. Sources of further information about objdump and readelf are listed at the end of this chapter.
command
The
Figure 41-2: Execution of a program that loads a shared library
41.5 Useful Tools for Working with Shared Libraries
In this section, we briefly describe a few tools that are useful for analyzing shared libraries, executable files, and compiled object (.o) files.
The ldd command
The ldd(1) (list dynamic dependencies) command displays the shared libraries that a program (or a shared library) requires to run. Here's an example:
ldd prog
        libdemo.so.1 => /usr/lib/libdemo.so.1 (0x40019000)
        libc.so.6 => /lib/tls/libc.so.6 (0x4017b000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
[Figure 41-2 diagram: the program header of prog lists libbar.so among its shared object dependencies; when the process is created, the dynamic linker examines these dependencies, finds libbar.so (a link to libfoo.so) in the current directory of the file system, and loads the library into the process's virtual memory.]
Figure 41-1 shows the compilation and linking steps involved in producing a shared library with an embedded soname, linking a program against that shared library, and creating the soname symbolic link needed to run the program.
Figure 41-1: Creating a shared library and linking a program against it
Figure 41-2 shows the steps that occur when the program created in Figure 41-1 is loaded into memory in preparation for execution.
To find out which shared libraries a process is currently using, we can list the contents of the corresponding Linux-specific /proc/PID/maps file (Section 48.5).
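As a rough illustration of the preceding point, the following sketch scans a maps-style file and prints the lines that name a shared object. It assumes that the file in question is /proc/PID/maps and that the pathname appears as the last field of each line; the function names lineNamesSharedObject() and listSharedObjects() are hypothetical. A process could inspect itself by passing "/proc/self/maps".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if a line from a maps-style file names
   a shared object (a pathname containing ".so." or ending in ".so"),
   otherwise 0. */
int
lineNamesSharedObject(const char *line)
{
    const char *path = strchr(line, '/');   /* Pathname field, if present */
    size_t len;

    if (path == NULL)
        return 0;
    if (strstr(path, ".so.") != NULL)
        return 1;

    len = strcspn(path, "\n");              /* Length without newline */
    return len >= 3 && strncmp(path + len - 3, ".so", 3) == 0;
}

/* Print the shared-object mappings listed in 'mapsPath'. Returns the
   number of matching lines, or -1 if the file can't be opened. */
int
listSharedObjects(const char *mapsPath)
{
    FILE *fp = fopen(mapsPath, "r");
    char line[4096];
    int count = 0;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (lineNamesSharedObject(line)) {
            fputs(line, stdout);
            count++;
        }
    }
    fclose(fp);
    return count;
}
```

Because a library is typically mapped as several separate regions, the same pathname may be printed more than once.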
[Figure 41-1 diagram: numbered steps using "gcc -g -c", "gcc -shared -o libfoo.so" (producing a library marked soname=libbar.so), "gcc -o prog" (whose program header records the shared object dependency libbar.so), and "ln -s libfoo.so" (creating the soname symbolic link).]
If a shared library has a soname, then, during static linking, the soname is embedded in the executable file instead of the real name, and subsequently used by the dynamic linker when searching for the library at run time. The purpose of the soname is to provide a level of indirection that permits an executable to use, at run time, a version of the shared library that is different from (but compatible with) the library against which it was linked.
In Section 41.6, we'll look at the conventions used for the shared library real name and soname. For now, we show a simplified example to demonstrate the
The first step in using a soname is to specify it when the shared library is created:
gcc -g -c -fPIC -Wall mod1.c mod2.c mod3.c
gcc -g -shared -Wl,-soname,libbar.so -o libfoo.so mod1.o mod2.o mod3.o
The -Wl,-soname,libbar.so option is an instruction to the linker to mark the shared library libfoo.so with the soname libbar.so
The LD_LIBRARY_PATH environment variable
One way of informing the dynamic linker that a shared library resides in a nonstandard directory is to specify that directory as part of a colon-separated list of directories in the LD_LIBRARY_PATH environment variable. (Semicolons can also be used to separate the directories, in which case the list must be quoted to prevent the shell from interpreting the semicolons.) If LD_LIBRARY_PATH is defined, then the dynamic linker searches for the shared library in the directories it lists before looking in the standard library directories. (Later, we'll see that a production application should never rely on LD_LIBRARY_PATH, but for now, this variable provides us with a
41.4.3 Using a Shared Library
In order to use a shared library, two steps must occur that are not required for programs that use static libraries:
Since the executable file no longer contains copies of the object files that it requires, it must have some mechanism for identifying the shared library that it needs at run time. This is done by embedding the name of the shared library inside the executable during the link phase. (In ELF parlance, the library dependency is recorded in a DT_NEEDED tag in the executable.) The list of all of a program's shared library dependencies is referred to as its dynamic dependency list.
At run time, there must be some mechanism for resolving the embedded library name (that is, for finding the shared library file corresponding to the name specified in the executable file) and then loading the library into memory, if it is not already present.
Embedding the name of the library inside the executable happens automatically when we link our program with a shared library:
gcc -g -Wall -o prog prog.c libfoo.so
If we now attempt to run our program, we receive the following error message:
./prog: error in loading shared libraries: libfoo.so: cannot
open shared object file: No such file or directory
This brings us to the second required step: dynamic linking, which is the task of resolving the embedded library name at run time. This task is performed by the dynamic linker (also called the dynamic linking loader or the run-time linker). The dynamic linker is itself a shared library, named /lib/ld-linux.so.2, which is employed by every ELF executable that uses shared libraries.
The pathname /lib/ld-linux.so.2 is normally a symbolic link pointing to the dynamic linker executable file. This file has the name ld-version.so, where version is the glibc version installed on the system (for example, ld-2.11.so). The pathname of the dynamic linker differs on some architectures. For example, on IA-64, the dynamic linker symbolic link is named /lib/ld-linux-ia64.so.2.
The dynamic linker examines the list of shared libraries required by a program and
are compiler-dependent. Using a different C compiler on another UNIX implementation will probably require different options.
Note that it is possible to compile the source files and create the shared library in a single command:
gcc -g -fPIC -Wall mod1.c mod2.c mod3.c -shared -o libfoo.so
However, to clearly distinguish the compilation and library building steps, we'll write the two as separate commands in the examples shown in this chapter.
Unlike static libraries, it is not possible to add or remove individual object modules from a previously built shared library. As with normal executables, the object files within a shared library no longer maintain distinct identities.
41.4.2 Position-Independent Code
The cc -fPIC option specifies that the compiler should generate position-independent code. This changes the way that the compiler generates code for operations such as accessing global, static, and external variables; accessing string constants; and taking the addresses of functions. These changes allow the code to be located at any virtual address at run time. This is necessary for shared libraries, since there is no way of knowing at link time where the shared library code will be located in memory. (The run-time memory location of a shared library depends on various factors, such as the amount of memory already taken up by the program that is loading the library and which other shared libraries the program has already loaded.)
On Linux/x86-32, it is possible to create a shared library using modules compiled without the -fPIC option. However, doing so loses some of the benefits of shared libraries, since pages of program text containing position-dependent memory references are not shared across processes. On some architectures, it is impossible to build shared libraries without the -fPIC option.
The principal costs of this added functionality are the following:
Shared libraries are more complex than static libraries, both at the conceptual level, and at the practical level of creating shared libraries and building the programs that use them.
Shared libraries must be compiled to use position-independent code (described in Section 41.4.2), which has a performance overhead on most architectures because it requires the use of an extra register ([Hubicka, 2003]).
Symbol relocation must be performed at run time. During symbol relocation, references to each symbol (a variable or function) in a shared library need to be modified to correspond to the actual run-time location at which the symbol is placed in virtual memory. Because of this relocation process, a program using a shared library may take a little more time to execute than its statically linked equivalent.
One further use of shared libraries is as a building block in the Java Native Interface (JNI), which allows Java code to directly access features of the underlying operating system by calling C functions within a shared library. For further information, see [Liang, 1999] and [Rochkind, 2004].
41.4 Creating and Using Shared Libraries: A First Pass
To begin understanding how shared libraries operate, we look at the minimum sequence of steps required to build and use a shared library. For the moment, we'll ignore the convention that is normally used to name shared library files. This convention, described in Section 41.6, allows programs to automatically load the most up-to-date version of the libraries they require, and also allows multiple incompatible versions (so-called major versions) of a library to coexist peacefully.
In this chapter, we concern ourselves only with Executable and Linking Format (ELF) shared libraries, since ELF is the format employed for executables and shared libraries in modern versions of Linux, as well as in many other UNIX implementations.
ELF supersedes the older
COFF
41.4.1 Creating a Shared Library
In order to build a shared version of the static library we created earlier, we perform the following steps:
gcc -g -c -fPIC -Wall mod1.c mod2.c mod3.c
gcc -g -shared -o libfoo.so mod1.o mod2.o mod3.o
The first of these commands creates the three object modules that are to be put into the library. (We explain the cc -fPIC option in the next section.) The cc -shared command creates a shared library containing the three object modules.
By convention, shared libraries have the prefix lib and the suffix .so (for shared object).
In our examples, we use the gcc command, rather than the equivalent cc command, to emphasize that the command-line options we are using to create shared libraries
Having linked the program, we can run it in the usual way:
Called mod1-x1
Called mod2-x2
41.3 Overview of Shared Libraries
When a program is built by linking against a static library (or, for that matter, without using a library at all), the resulting executable file includes copies of all of the object files that were linked into the program. Thus, when several different executables use the same object modules, each executable has its own copy of the object modules. This redundancy of code has several disadvantages:
Disk space is wasted storing multiple copies of the same object modules. Such wastage can be considerable.
If several different programs using the same modules are running at the same time, then each holds its own copy of the object modules in virtual memory, thus increasing the overall virtual memory demands on the system.
If a change is required (perhaps a security or bug fix) to an object module in a static library, then all executables using that module must be relinked in order to incorporate the change. This disadvantage is further compounded by the fact that the system administrator needs to be aware of which applications were linked against the library.
Shared libraries were designed to address these shortcomings. The key idea of a shared library is that a single copy of the object modules is shared by all programs requiring the modules. The object modules are not copied into the linked executable; instead, a single copy of the library is loaded into memory at run time, when the first program requiring modules from the shared library is started. When other programs using the same shared library are later executed, they use the copy of the library that is already loaded into memory. The use of shared libraries means that executable programs require less space on disk and (when running) in virtual memory.
Although the code of a shared library is shared among multiple processes, its variables are not. Each process that uses the library has its own copies of the global and static variables that are defined within the library.
Shared libraries provide the following further advantages:
Because overall program size is smaller, in some cases, programs can be loaded into memory and started more quickly. This point holds true only for large shared libraries that are already in use by another program. The first program to load a shared library will actually take longer to start, since the shared library must be found and loaded into memory.
Since object modules are not copied into the executable files, but instead maintained centrally in the shared library, it is possible (subject to limitations described in Section 41.8) to make changes to the object modules without requiring programs to be relinked in order to see the changes. Such changes can be carried out even while running programs are using an existing version of the shared library.
The options argument consists of a series of letters, one of which is the operation code, while the others are modifiers that influence the way the operation is carried out. Some commonly used operation codes are the following:
r (replace): Insert an object file into the archive, replacing any previous object file of the same name. This is the st
just once, and then link them into different executables as required. Although this technique saves us compilation time, it still suffers from the disadvantage that we must name all of the object files during the link phase. Furthermore, our directories may be inconveniently cluttered with a large number of object files.
To get around these problems, we can group a set of object files into a single unit, known as an object library. Object libraries are of two types: static and shared. Shared libraries are the more modern type of object library, and provide several advantages over static libraries, as we describe in Section 41.3.
An aside: including debugger information when compiling a program
In the cc command shown above, we used the -g option to include debugging information in the compiled program. In general, it is a good idea to always create pro-
n earlier times, debugging information was
FUNDAMENTALS OF
SHARED LIBRARIES
Shared libraries are a technique for placing library functions into a single unit that can be shared by multiple processes at run time. This technique can save both disk space and RAM. This chapter covers the fundamentals of shared libraries. The next chapter covers a number of advanced features of shared libraries.
41.1 Object Libraries
One way of building a program is simply to compile each of its source files to produce corresponding object files, and then link all of
832
Chapter 40
40.8 Summary
Login accounting records the users currently logged in, as well as all past logins. This information is maintained in three files: the utmp file, which maintains a record of all currently logged-in users; the wtmp file, which is an audit trail of all logins and logouts; and the lastlog file, which records the time of last login for each user. Various commands, such as who and last, use the information in these files.
Login Accounting
831
Because the lastlog file is indexed by user ID, it is not possible to distinguish logins under different usernames that have the same user ID. (In Section 8.1, we noted that it is possible, though unusual, to have multiple login names with the same user ID.)
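Indexing by user ID means that the position of a user's record can be computed directly from the user ID. The following is a minimal sketch of that arithmetic, not code from the book's listing; the function name lastlogOffset() is hypothetical, and the record size is passed in as a parameter (for the lastlog file it would be sizeof(struct lastlog)).

```c
#include <sys/types.h>

/* Byte offset of the record for 'uid' in a file of fixed-size records
   that is indexed by user ID. For the lastlog file, 'recsize' would be
   sizeof(struct lastlog). (Hypothetical helper, for illustration.) */
off_t
lastlogOffset(uid_t uid, size_t recsize)
{
    return (off_t) uid * (off_t) recsize;
}
```

A program would typically lseek() to this offset and then read() a single record, which is why two usernames that share a user ID necessarily share one record.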
Listing 40-4:
Displaying information from the
lastlog
loginacct/view_lastlog.c
#include <time.h>
#include <lastlog.h>
#include <paths.h> /* Definition of _PATH_LASTLOG */
#include <fcntl.h>
#include "ugid_functions.h" /* Declaration of userIdFromName() */
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
struct lastlog llog;
int fd, j;
uid_t uid;
if (argc > 1 && strcmp(argv[1], "--help") == 0)
usageErr("%s [username...]\n", argv[0]);
fd = open(_PATH_LASTLOG, O_RDONLY);
if (fd == -1)
errExit("open");
for (j = 1; j < argc; j++) {
uid = userIdFromName(argv[j]);
if (uid == -1) {
printf("No such user: %s\n", argv[j]);
continue;
}
printf("Creating logout entries in utmp and wtmp\n");
int
main(int argc, char *argv[])
struct utmpx ut;
char *devName;
if (argc < 2 || strcmp(argv[1], "--help") == 0)
usageErr("%s username [sleep-time]\n", argv[0]);
/* Initialize login record for utmp and wtmp files */
Next we use our program to examine the contents of the
wtmp
file:
./dump_utmpx /var/log/wtmp
user type PID line id host date/time
cecilia USER_PR 249 tty1 1 Fri Feb 1 21:39:07 2008
mtk USER_PR 1471 pts/7 /7 Fri Feb 1 22:08:06 2008
last mtk
mtk pts/7 Fri Feb 1 22:08 still logged in
Above, we used the last command to show that the output of last derives from wtmp. (For brevity, we have edited the output of the dump_utmpx and last commands in this shell session log to remove lines of output that are irrelevant to our discussion.)
Next, we use the fg command to resume the utmpx_login program in the foreground. It subsequently writes logout records to the utmp and wtmp files.
./utmpx_login mtk
Creating logout entries in utmp and wtmp
We then once more examine the contents of the
utmp
file. We see that the
utmp
record was overwritten:
./dump_utmpx /var/run/utmp
user type PID line id host date/time
cecilia USER_PR 249 tty1 1 Fri Feb 1 21:39:07 2008
DEAD_PR 1471 pts/7 /7 Fri Feb 1 22:09:09 2008
cecilia tty1 Feb 1 21:39
The final line of output shows that
who
ignored the
DEAD_PROCESS
record.
When we examine the
wtmp
file, we see that the
wtmp
record was superseded:
./dump_utmpx /var/log/wtmp
user type PID line id host date/time
cecilia USER_PR 249 tty1 1 Fri Feb 1 21:39:07 2008
mtk USER_PR 1471 pts/7 /7 Fri Feb 1 22:08:06 2008
DEAD_PR 1471 pts/7 /7 Fri Feb 1 22:09:09 2008
last mtk
mtk pts/7 Fri Feb 1 22:08 - 22:09 (00:01)
The final line of output above demonstrates that
last
matches the login and logout
records in
wtmp
to show the starting and ending
When updating the
wtmp
file, we simply open the file and append a record to it.
Because this is a standard operation, glibc encapsulates it in the updwtmpx() function.
The updwtmpx() function appends the utmpx record pointed to by ut to the file specified in wtmpx_file.
SUSv3 doesn't specify updwtmpx(), and it appears on only a few other UNIX implementations. Other implementations provide related functions, login(3), logout(3), and logwtmp(3), which are also in glibc and described in the manual pages. If such functions are not present, we need to write our own equivalents. (The implementation of these functions is not complex.)
Example program
Listing 40-3 uses the functions described in this section to update the utmp and wtmp files. This program performs the required updates to utmp and wtmp in order to log in the user named on the command line, and then, after sleeping a few seconds, log them out again. Normally, such actions would be associated with the creation and termination of a login session for a user. This program uses
sample runs of the program in Listing 40-2.) A record containing exactly the same information is appended to the wtmp file.
The terminal name acts (via the ut_line and ut_id fields) as a unique key for records in the utmp file.
On logout, the record previously written to the utmp file should be erased. This is done by creating a record with ut_type
the other output produced by the program.
wtmp
file (using
ge
), these records can be matched via the
field.
./dump_utmpx /var/log/wtmp
user type PID line id host date/time
lynley USER_PR 10482 tty3 3 Sat Oct 23 10:19:43 2010
DEAD_PR 10482 tty3 3 2.4.20-4G Sat Oct 23 10:32:54 2010
Listing 40-2:
Displaying the contents of a
utmpx

loginacct/dump_utmpx.c
#define _GNU_SOURCE
#include <time.h>
#include <utmpx.h>
#include <paths.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
struct utmpx *ut;
if (argc > 1 && strcmp(argv[1], "--help") == 0)
usageErr("%s [utmp-pathname]\n", argv[0]);
if (argc > 1) /* Use alternate file if supplied */
if (utmpxname(argv[1]) == -1)
errExit("utmpxname");
implementation doesn't perform this type of caching, but we should nevertheless employ this technique for the sake of portability.
creates a child for each terminal line and
virtual console, and
each child execs the
ut_type field is an integer defining the type of record being written to the
<utmpx.h>
BOOT_TIME (2)
This record contains the time of system boot in the ut_tv field. The usual
author of
RUN_LVL
and
BOOT_TIME
records is
file and the
wtmp
file.
NEW_TIME (3)
This record contains the new time after a system clock change, recorded in the ut_tv field.
OLD_TIME (4)
This record contains the old time before a system clock change, recorded in the ut_tv field. Records of type OLD_TIME and NEW_TIME are written to the utmp and wtmp files by the NTP (or a similar) daemon when it makes changes to the system clock.
INIT_PROCESS (5)
This is a record for a process spawned by
init
The SUSv3 specification of the utmpx structure doesn't include the ut_host, ut_exit, ut_session, or ut_addr_v6 fields. The ut_host and ut_exit fields are present on most other implementations; ut_session is present on a few other implementations; and ut_addr_v6 is Linux-specific. SUSv3 specifies the ut_line and ut_user fields, but leaves their lengths unspecified.
The int32_t data type used to define the ut_addr_v6 field of the utmpx structure is a 32-bit integer.
Listing 40-1: Definition of the utmpx structure
#define _GNU_SOURCE /* Without _GNU_SOURCE the two field
names below are prepended by "__" */
struct exit_status {
short e_termination; /* Process termination status (signal) */
short e_exit; /* Process exit status */
};
#define __UT_LINESIZE 32
#define __UT_NAMESIZE 32
#define __UT_HOSTSIZE 256
struct utmpx {
short ut_type; /* Type of record */
pid_t ut_pid; /* PID of login process */
char ut_line[__UT_LINESIZE]; /* Terminal device name */
char ut_id[4]; /* Suffix from terminal name, or
ID field from inittab(5) */
char ut_user[__UT_NAMESIZE]; /* Username */
char ut_host[__UT_HOSTSIZE]; /* Hostname for remote login, or kernel
version for run-level messages */
struct exit_status ut_exit; /* Exit status of process marked
as DEAD_PROCESS (not filled
in by init(8) on Linux) */
long ut_session; /* Session ID */
struct timeval ut_tv; /* Time when entry was made */
int32_t ut_addr_v6[4]; /* IP address of remote host (IPv4
address uses just ut_addr_v6[0],
Each of the string fields in the
structure is null-terminated unless it com-
On Linux, the utmp file resides at /var/run/utmp, and the wtmp file resides at /var/log/wtmp. In general, applications don't need to know about these pathnames, since they are compiled into glibc. Programs that do need to refer to the locations of these files should use the _PATH_UTMP and _PATH_WTMP pathname constants, defined in <paths.h> (and <utmpx.h>), rather than explicitly coding pathnames into the program.
SUSv3 doesn't standardize any symbolic names for the pathnames of the utmp and wtmp files. The names _PATH_UTMP and _PATH_WTMP are used on Linux and the BSDs. Many other UNIX implementations instead define the constants UTMP_FILE and WTMP_FILE for these pathnames. Linux also defines these names in <utmp.h>, but doesn't define them in <utmpx.h> or <paths.h>.
40.2 The
utmpx
and
files have been present in the UNIX system since early times, but
underwent steady evolution and divergence
across various UNIX implementations,
especially BSD versus System V. System V Release 4 greatly extended the API, in the process creating a new (parallel) utmpx structure and associated
utmpx
and
wtmpx
Login accounting is concerned with recording which users are currently logged in to the system, and recording past logins and logouts. This chapter looks at the login
816
Chapter 39
Using capabilities within a program on a system without file capabilities
Even on a system that doesn't support file capabilities, we can nevertheless employ capabilities to improve the security of a program. We do this as follows:
1. Run the program in a process with an effe
Capabilities
815
group, or all processes on the system except init and the caller itself. The final case excludes init because it is fundamental to the operation of the system. It also excludes the caller because the caller may be attempting to remove capabilities from every other process on the system, and we don't want to remove the capabilities from the calling process itself.
However, changing the capabilities of other processes is only a theoretical possibility. On older kernels, and on modern kernels where support for file capabilities is disabled, the capability bounding
but far-reaching - for each privileged program would create an unmanageably complex administration task. By contrast, system administrators are familiar
SECBIT_KEEP_CAPS
and the
prctl()
Since existing applications aren't engineered to make use of the file-capabilities infrastructure, the kernel must maintain the traditional handling of processes with the user ID 0. Nevertheless, we may want an application to run in a purely capability-based environment in which root
if (cap_free(empty) == -1)
cap_t caps;
cap_value_t capList[1];
3. Use the cap_set_proc() function to pass the user-space structure back to the kernel in order to change the process's capabilities.
4. Use the cap_free() function to free the structure that was allocated by the libcap API in the first step.
At the time of writing, work is in progress on
, a new, improved capa-
4. If the file-system user ID is changed from 0 to a nonzero value, then the following file-related capabilities are cleare
If a process has the
other features were added in kernels 2.6.25 and 2.6.26 in orde
Next, we become the superuser, which allows us to successfully change the system time:
sudo date -s '2018-02-01 21:39'
root's password:
appear that the capabilities implementation could provide this feature simply by preserving the process's permitted capabilities across an exec(). However, this approach would not handle the following cases:
Performing the exec might require certain privileges (e.g., CAP_DAC_OVERRIDE) that we don't want to preserve across the exec().
Suppose that we explicitly dropped some permitted capabilities that we didn't want to preserve across the exec(), but then the exec() failed. In this case, the program might need some of the permitted capabilities that it has already (irrevocably) dropped.
For these reasons, a process's permitted capabilities are not preserved across an exec(). Instead, another capability set is introduced: the inheritable set. The inheritable set provides a mechanism by which a process can preserve some of its capabilities
capability set specifies a group of capabilities that may be assigned to the process's permi
39.3.3 Purpose of the Process Perm
CAP_SETPCAP
If file capabilities are not supported, grant and remove capabilities in the
Table 39-1:
Operations permitted by each Linux capability
Capability          Permits process to
CAP_AUDIT_CONTROL   (Since Linux 2.6.11) Enable and disable kernel audit logging; change filtering
: These are the capabilities used by the kernel to perform privilege
checking for the process. As long as it maintains a capability in its permitted
The Linux capability scheme refines the handling of this problem. Rather than using a single privilege (i.e., effective user ID of 0) when performing security checks in the kernel, the superuser privilege is divided into distinct units, called capabilities. Each privileged operation is associated with a particular capability, and a process can perform that operation only if it has the corresponding capability (regardless of its effective user ID). Put another way, everywhere in this book that we talk about a privileged process on Linux, what we really mean is a process that has the relevant capability for performing a particular operation.
Most of the time, the Linux capability scheme is invisible to us. The reason for this is that when an application that is unaware of capabilities assumes an effective user ID of 0, the kernel grants that pr
This chapter describes the Linux capabilities scheme, which divides the traditional all-or-nothing UNIX privilege scheme into individual capabilities that can be independently enabled or disabled. Using capabilities allows a program to perform some privileged operations, while preventing it from performing others.
39.1 Rationale for Capabilities
The traditional UNIX privilege scheme divides processes into two categories: those whose effective user ID is 0 (superuser), which bypass all privilege checks, and all other processes, which are subject to privilege checking according to their user and group IDs.
The coarse granularity of this scheme is a problem. If we want to allow a process to perform some operation that is permitted only to the superuser (for example, changing the system time), then we must run that process with an effective user ID of 0. (If an unprivileged user needs to perform such operations, this is typically
796
Chapter 38
38.13 Exercises
38-1. Log in as a normal, unprivileged user, create an executable file (or copy an existing file such as /bin/sleep
Writing Secure Privileged Programs
795
Even where a system call succeeds, it may be necessary to check its result. For
example, where it matters, a privileged program should check that a successful
open()
Dealing with malformed requests is straightforward: a server should be programmed to rigorously check its inputs and avoi
as described above.
Overload attacks are more difficult to deal with. Since the server can't control the behavior of remote clients or the rate at which they submit requests, such attacks are impossible to prevent. (The server may not even be able to determine the true origin of the attack, since the so
In order to make stack crashing more difficult (in particular, to make such attacks much more time-consuming when conducted remotely against network servers), from kernel 2.6.12 onward, Linux implements address-space randomization. This technique randomly varies the location of the stack over an 8MB range at the top of vi
, the locations of memory mappings may also be randomized, if the soft RLIMIT_STACK limit is not infinite and the Linux-specific /proc/sys/vm/legacy_va_layout file contains the value 0.
More recent x86-32 architectures provide hardware support for marking page tables as NX (no execute). This feature is used to prevent execution of program code on the stack, thus making stack crashing more difficult.
There are safe alternatives to many of the functions mentioned above (for example, snprintf()) that allow the caller to specify the maximum number of characters that should be copied. These functions take the specified maximum into account in order to avoid ove
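As an illustration of a length-limited alternative, the following is a minimal sketch of a bounded copy built on snprintf(). The function name boundedCopy() is hypothetical; the point is that the caller supplies the buffer size and that truncation is detected rather than silently ignored.

```c
#include <stdio.h>

/* Copy 'src' into 'dst', which has room for 'dstSize' bytes including
   the terminating null byte. Returns 0 if the whole string fit, or -1
   if the output was truncated or an error occurred.
   (Hypothetical helper, for illustration.) */
int
boundedCopy(char *dst, size_t dstSize, const char *src)
{
    int n;

    n = snprintf(dst, dstSize, "%s", src);
    if (n < 0 || (size_t) n >= dstSize)
        return -1;                  /* Error, or 'src' did not fit */
    return 0;
}
```

Treating truncation as a failure is a deliberate choice for privileged programs: a silently shortened pathname or username can be as dangerous as an overrun.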
means that only white-space characters are
from process's effective group ID (see
Section 15.3.1), a similar statement
38.6 Beware of Signals
A user may send arbitrary signals to a set-user-ID program that they have started. Such signals may arrive at any time and with any frequency. We need to consider the race conditions that can occur if a signal is delivered at any point in the execution of the program. Where appropriate, signals should be caught, blocked, or ignored to prevent possible security problems. Furthermore, the design of signal handlers should be as simple as possible, in order to reduce the risk of inadvertently creating a race condition.
This issue is particularly relevant with the signals that stop a process (e.g., SIGTSTP and SIGSTOP). The problematic scenario is the following:
If the process receives a signal that causes it to produce a core dump file, then that file may be read to obtain the information.
Following on from the last point, as a general principle, a secure program should prevent core dumps, so that a core dump file can't be inspected for sensitive information. A program can ensure that a core dump file is not created by using
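The sentence above is cut off in this copy. One widely used way to prevent core dumps, shown here as a sketch under that assumption, is to set the RLIMIT_CORE resource limit to 0 with setrlimit(). The function name disableCoreDumps() is hypothetical.

```c
#include <sys/resource.h>

/* Set both the soft and hard RLIMIT_CORE limits to 0, so that the
   process cannot produce a core dump file and cannot later raise the
   limit again (unless privileged). Returns 0 on success, -1 on error.
   (Hypothetical helper, for illustration.) */
int
disableCoreDumps(void)
{
    struct rlimit rlim;

    rlim.rlim_cur = 0;
    rlim.rlim_max = 0;
    return setrlimit(RLIMIT_CORE, &rlim);
}
```

Lowering the hard limit as well as the soft limit means an unprivileged process cannot undo the change later.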
If this program subsequently executes the call
Because of the possibilities listed in the two preceding points, it is highly recommended practice (see, for example, [Tsafrir et al., 2008]) to not only check that a credential-changing system call has succeeded, but also to verify that the change occurred as expected. For example, if we are temporarily dropping or reacquiring a privileged user ID using seteuid(), then we should follow that call with a
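The check-then-verify practice described above can be sketched as follows. This is an illustrative helper rather than code from the book; the name setEuidChecked() is hypothetical, and it assumes that geteuid() is the appropriate call for confirming the result of seteuid().

```c
#include <sys/types.h>
#include <unistd.h>

/* Change the effective user ID to 'uid' and then confirm that the
   change really took effect. Returns 0 on success, or -1 if the call
   failed or the effective user ID is not the one requested.
   (Hypothetical helper, for illustration.) */
int
setEuidChecked(uid_t uid)
{
    if (seteuid(uid) == -1)
        return -1;                  /* The call itself failed */
    if (geteuid() != uid)
        return -1;                  /* Call "succeeded", but no change */
    return 0;
}
```

A privileged program would normally treat a -1 return here as fatal and terminate, rather than continue with credentials it did not expect.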
Instead, we must regain privilege prior to
dropping it permanently, by inserting the
following call between steps 1 and 2 above:
The first call makes the effective user ID of
the calling process the same as its real
ID. The second call restores the effective us
Privileged programs have access to features and resources (files, devices, and so on) that are not available to ordinary users. A program can run with privileges by two general means:
782
Chapter 37
some other UNIX implementations, it is possible to specify
, with the same
meaning as
. However, this feature is not available on all implementations.) Normally, a rule that contains multiple selectors matches messages corresponding to any of the selectors, but specifying a
of
has the effect of excluding all messages belonging to the corresponding
. Thus, this rule sends all messages except those for the
facilities to the file /var/log/messages.
The hyphen (-) preceding the name of this file specifies that a sync to the disk does not occur on each write to the file (refer to Section 13.3). This means that writes are faster, but some data may be lost if the system crashes soon after the write.
Whenever we change the syslog.conf file, we must ask the daemon to reinitialize itself from this file in the usual fashion:
killall -HUP syslogd        Send SIGHUP to syslogd
Further features of the rule syntax allow for much more powerful rules than we have shown.
Daemons
781
Any message whose

The remaining arguments to syslog() are a format string and corresponding arguments in the manner of printf(). One difference from printf() is that the format string doesn't need to include a terminating newline character. Also, the format string may include the 2-character sequence %m, which is replaced by the error string corresponding to the current value of errno (i.e., the equivalent of strerror(errno)).
The following code demonstrates the use of openlog() and syslog():
openlog(argv[0], LOG_PID | LOG_CONS | LOG_NOWAIT, LOG_LOCAL0);
syslog(LOG_ERR, "Bad argument: %s", argv[1]);
syslog(LOG_USER | LOG_INFO, "Exiting");
Since no facility is specified in the first syslog() call, the default specified by openlog() (LOG_LOCAL0) is used. In the second syslog() call, explicitly specifying LOG_USER overrides the default established by openlog().
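The %m sequence is convenient when logging the failure of a system call. The following is a minimal sketch of that usage; the function name openOrLog() is hypothetical, and the example assumes only the standard open() and syslog() calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <syslog.h>
#include <unistd.h>

/* Open 'path' for reading. On failure, log a message in which %m is
   expanded to the error text for errno, and return -1 with errno still
   describing the open() failure. (Hypothetical helper.) */
int
openOrLog(const char *path)
{
    int fd, savedErrno;

    fd = open(path, O_RDONLY);
    if (fd == -1) {
        savedErrno = errno;         /* syslog() itself may change errno */
        syslog(LOG_ERR, "open %s failed: %m", path);
        errno = savedErrno;
    }
    return fd;
}
```

Note that syslog() must be called before anything else can overwrite errno, since %m is expanded from the value of errno at the time of the call.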
From the shell, we can use the logger(1) command to add entries to the system log. This command allows specification of the priority and the ident (tag) to be associated with the logged messa
To write a log message, we call
The
priority
LOG_NDELAY
Open the connection to the logging
system (i.e., the underlying UNIX
37.5.2 The
syslog
API consists of three main functions:
openlog()
function establis
facility has two principal components: the
syslogd
daemon and the
library function.
System Log
daemon,
, accepts log messages from two different
logOpen(LOG_FILE);
readConfigFile(CONFIG_FILE);
[Figure labels: Kernel; User process; Process on remote host; Configuration file; UNIX domain datagram socket; UDP port 514]
convention, configuration files are placed in /etc or one of its subdirectories, while log files are often placed in /var/log. Daemon programs commonly provide command-line options to specify alternative locations instead of the defaults.
Listing 37-3:
Using
SIGHUP
to reinitialize a daemon
daemons/daemon_SIGHUP.c
#include <sys/stat.h>
#include <signal.h>
#include "become_daemon.h"
#include "tlpi_hdr.h"
static const char *LOG_FILE = "/tmp/ds.log";
static const char *CONFIG_FILE = "/tmp/ds.conf";
/* Definitions of logMessage(), logOpen(), logClose(), and
readConfigFile() are omitted from this listing */
static volatile sig_atomic_t hupReceived = 0;
during system shutdown. Those daemons that are not terminated in this fashion will receive a SIGTERM signal, which the init process sends to all of its children during system shutdown. By default, SIGTERM terminates a process. If the daemon needs to perform any cleanup before terminating, it should do so by establishing a handler for this signal. This handler must be designed to perform such cleanup quickly, since init follows up the SIGTERM signal with a SIGKILL signal after 5 seconds. (This doesn't mean that the daemon can perform 5 seconds' worth of CPU work; init signals all of the processes on the system at the same time, and they may all be
Since daemons are long-lived, we must be particularly wary of possible memory leaks (Section 7.1.3) and file descriptor leaks (where an application fails to close all of the file descriptors it opens). If such bugs affect a daemon, the only remedy is to kill it and restart it (after fixing the bug).
Many daemons need to ensure that just one instance of the daemon is active at one time. For example, it makes no sense to have two copies of the cron daemon both trying to execute scheduled jobs. In Section 55.6, we look at a technique for achieving this.
37.4 Using
The fact that many daemons should run continuously presents a couple of programming hurdles:
Typically, a daemon reads operational pa
if (!(flags & BD_NO_UMASK0))
umask(0); /* Clear file mode creation mask */
if (!(flags & BD_NO_CHDIR))
chdir("/"); /* Change to root directory */
if (!(flags & BD_NO_CLOSE_FILES)) { /* Close all open files */
maxfd = sysconf(_SC_OPEN_MAX);
Listing 37-1:
Header file for
become_daemon.c
daemons/become_daemon.h
#ifndef BECOME_DAEMON_H /* Prevent double inclusion */
#define BECOME_DAEMON_H
/* Bit-mask values for 'flags' argument of becomeDaemon() */
#define BD_NO_CHDIR 01 /* Don't chdir("/") */
#define BD_NO_CLOSE_FILES 02 /* Don't close all open files */
#define BD_NO_REOPEN_STD_FDS 04 /* Don't reopen stdin, stdout, and
stderr to /dev/null */
#define BD_NO_UMASK0 010 /* Don't do a umask(0) */
#define BD_MAX_CLOSE 8192 /* Maximum file descriptors to close if
4. Clear the process umask (Section 15.4.6), to ensure that, when the daemon creates files and directories, they have the requested permissions.
5. Change the process's current working directory, typically to the root directory (/). This is necessary because a daemon usually runs until system shutdown; if the daemon's current working directory is on a file system other than the one containing /, then that file system can't be unmounted (Section 14.8.2). Alternatively, the daemon can change its working directory to a location where it does its job or a location defined in its configuration file, as long as we know that the file system containing this directory never needs to be unmounted. For example, cron places itself in /var/spool/cron.
6. Close all open file descriptors that the daemon has inherited from its parent. (A daemon may need to keep certain inherited file descriptors open, so this step is optional, or open to variation.) This is done for a variety of reasons. Since the daemon has lost its controlling terminal and is running in the background, it makes no sense for the daemon to keep file descriptors 0, 1, and 2 open if these refer to the terminal. Furthermore, we can't unmount any file systems on which the long-lived daemon holds files open. And, as usual, we should close unused open file descriptors because file descriptors are a finite
Some UNIX implementations (e.g., Solaris 9 and some of the recent BSD releases) provide a function named closefrom(n) (or similar), which closes all file descriptors greater than or equal to n. This function isn't available on Linux.
7. After having closed file descriptors 0, 1, and 2, a daemon normally opens /dev/null and uses dup2() (or similar) to make all those descriptors refer to this device. This is done for two reasons:
It ensures that if the daemon calls library functions that perform I/O on these descriptors, those functions won't unexpectedly fail.
It prevents the possibility that the daemon later opens a file using descriptor 1 or 2, which is then written to (and thus corrupted) by a library function that expects to treat these descriptors as standard output and standard error.
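The last step in the list above can be sketched as a small helper that makes one descriptor refer to /dev/null. This is an illustration rather than the book's becomeDaemon() code; the function name pointFdAtDevNull() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Make descriptor 'targetFd' refer to /dev/null, closing whatever it
   referred to before. Returns 0 on success, or -1 on error.
   (Hypothetical helper, for illustration.) */
int
pointFdAtDevNull(int targetFd)
{
    int fd;

    fd = open("/dev/null", O_RDWR);
    if (fd == -1)
        return -1;

    if (fd != targetFd) {           /* open() may already have given us
                                       the descriptor we wanted */
        if (dup2(fd, targetFd) == -1) {
            close(fd);
            return -1;
        }
        close(fd);                  /* Keep only the duplicate */
    }
    return 0;
}
```

A daemon would call this once each for STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO, after which library functions that use those descriptors neither fail nor write into an unrelated file.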
/dev/null is a virtual device that always discards the data written to it. When we want to eliminate the standard output or error of a shell command, we can redirect it to this file. Reads from
httpd: the HTTP server daemon (Apache), which serves web pages.
controlling terminal only through an explicit TIOCSCTTY operation, and so this second fork() has no effect with regard to the acquisition of a controlling terminal, but the superfluous fork() does no harm.
This chapter examines the characteristics of daemon processes and looks at the steps required to turn a process into a daemon. We also look at how to log messages from a daemon using the syslog facility.
37.1 Overview
A daemon is a process with the following characteristics:
It is long-lived. Often, a daemon is created at system startup and runs until the system is shut down.
It runs in the background and has no controlling terminal. The lack of a controlling terminal ensures that the kernel never automatically generates any job-control or terminal-related signals (such as SIGINT, SIGTSTP, and SIGHUP) for a daemon.
Daemons are written to carry out specific tasks, as illustrated by the following examples:
cron: a daemon that executes commands at a scheduled time.
sshd: the secure shell daemon, which permits logins from remote hosts using a secure communications protocol.
Process Resources
765
36.4 Summary
Processes consume various system resources. The
764
Chapter 36
In older Linux 2.4 kernels (up to and including 2.4.29), RLIMIT_RSS did have an effect on the behavior of the MADV_WILLNEED operation (Section 50.4). If this operation could not be performed as a result of encountering the RLIMIT_RSS limit, the error EIO
There is also a system-wide limit on the total number of files that may be
opened by all processes. This limit can be
ID of the calling process. When a POSIX message queue is created using mq_open(), bytes are deducted against this limit according to the following formula:
bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
attr.mq_maxmsg * attr.mq_msgsize;
In this formula, attr is the mq_attr structure that is passed as the fourth argument to mq_open(). The addend that includes sizeof(struct msg_msg *) ensures that the user can't queue an unlimited number of zero-length messages. (The msg_msg structure is a data type used internally by the kernel.) This is necessary because, although zero-length messages contain no data, they do consume some system memory for bookkeeping overhead.
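The formula above can be evaluated in user space to estimate the charge for a queue before creating it. The following is a minimal sketch; the function name mqBytesCharged() is hypothetical, and because struct msg_msg is a kernel-internal type, the sketch assumes that sizeof(void *) is an acceptable stand-in for sizeof(struct msg_msg *).

```c
#include <stddef.h>

/* Estimate the number of bytes deducted against RLIMIT_MSGQUEUE for a
   queue with the given attributes, following the formula in the text.
   (Hypothetical helper, for illustration.) */
size_t
mqBytesCharged(size_t maxmsg, size_t msgsize)
{
    /* sizeof(void *) stands in for sizeof(struct msg_msg *), since the
       msg_msg structure is not visible outside the kernel */
    return maxmsg * sizeof(void *) + maxmsg * msgsize;
}
```

For a queue of zero-length messages the second term vanishes, but the first term still grows with mq_maxmsg, which is exactly the bookkeeping overhead the text describes.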
The RLIMIT_MSGQUEUE limit affects only the calling process. Other processes belonging to this user are no
RLIMIT_CPU
The RLIMIT_CPU limit specifies the maximum number of seconds of CPU time (in both system and user mode) that can be used by the process. SUSv3 requires that the SIGXCPU signal be sent to the process when the soft limit is reached, but leaves other details unspecified. (The default action for SIGXCPU is to terminate a process with a core dump.) It is possible to establish a handler for SIGXCPU that does what-
If all possible resource limit values can be represented in rlim_t, then SUSv3 permits an implementation to define RLIM_SAVED_CUR and RLIM_SAVED_MAX to be the same as RLIM_INFINITY. This is how these constants are defined on Linux, implying that all possible resource limit values can be represented in rlim_t. However, this is not the case on 32-bit architectures such as x86-32. On those architectures, in a large-file compilation envi
if (argc < 2 || argc > 3 || strcmp(argv[1], "--help") == 0)
usageErr("%s soft-limit [hard-limit]\n", argv[0]);
printRlimit("Initial maximum process limits: ", RLIMIT_NPROC);
In this example, the program managed to
create only 4 new processes, because
26 processes were already running for this user.
Listing 36-2:
Displaying process resource limits

procres/print_rlimit.c
#include <sys/resource.h>
#include "print_rlimit.h" /* Declares function defined here */
#include "tlpi_hdr.h"
int /* Print 'msg' followed by limits for 'resource' */
printRlimit(const char *msg, int resource)
struct rlimit rlim;
Be aware that, in many cases, the
resource
argument identifies the resource li
Although
who
argument specifies the process(es) for which resource usage information
struct rusage {
struct timeval ru_utime; /* User CPU time used */
struct timeval ru_stime; /* System CPU time used */
As indicated in the comments in Listing 36-1, on Linux, many of the fields in the
structure are not filled in by
Each process consumes system resources such as memory and CPU time. This chapter looks at resource-related system calls. We begin with the
752
Chapter 35
35-3. Write a program that places itself under the SCHED_FIFO scheduling policy and then creates a child process. Both processes should execute a function that causes the process to consume a maximum of 3 seconds of CPU time. (This can be done by using a loop in which the
system call is repeatedly called to determine the amount of CPU time so far consumed.) After each quarter of a second of consumed CPU time, the function should print a message that displays the process ID and the amount of CPU time so far consumed. After each second of consumed CPU time, the function should call sched_yield() to yield the CPU to the other process. (Alternatively, the processes could raise each other's scheduling priority
Process Priorities andScheduling
751
35.5 Summary
The default kernel scheduling algorithm employs a round-robin time-sharing policy. By default, all processes have equal access to the CPU under this policy, but we can set a process's nice value to a number in the range -20 (high priority) to +19 (low priority) to cause the scheduler to favor or disfavor that process. However, even if we give a process the lowest priority, it
The following code confines a process, identified by its process ID, to running on any
CPU other than the first CPU of a four-processor system:
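A call of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the original listing: it assumes Linux with glibc, where defining _GNU_SOURCE exposes cpu_set_t and the CPU_*() macros, and the helper names allButFirstCpu() and avoidFirstCpu() are hypothetical.

```c
#define _GNU_SOURCE             /* Exposes cpu_set_t and the CPU_*() macros */
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: fill 'set' with CPUs 1 .. numCpus-1, omitting CPU 0 */
void
allButFirstCpu(cpu_set_t *set, int numCpus)
{
    int cpu;

    assert(numCpus >= 1 && numCpus <= CPU_SETSIZE);
    CPU_ZERO(set);
    for (cpu = 1; cpu < numCpus; cpu++)
        CPU_SET(cpu, set);
}

/* Hypothetical helper: confine 'pid' (0 means the caller) to CPUs 1 to 3.
   Returns 0 on success, or -1 with errno set (e.g., EINVAL if none of
   those CPUs is present on the system). */
int
avoidFirstCpu(pid_t pid)
{
    cpu_set_t set;

    allButFirstCpu(&set, 4);
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

Building the mask separately from applying it keeps the first step testable on machines that have fewer than four CPUs, where the sched_setaffinity() call itself may fail.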
35.4 CPU Affinity
When a process is rescheduled to run on a multiprocessor system, it doesn't
necessarily run on the same CPU on which it last executed. The usual reason it may run
on another CPU is that the original CPU is already busy.
When a process changes CPUs, there is a performance impact: in order for a
line of the process's data to be loaded into the cache of the new CPU, it must first
be invalidated (i.e., either discarded if it is unmodified, or flushed
if it was modified), if present in the cache of the old CPU. (To prevent cache
inconsistencies, multiprocessor architectures allow data to be kept in only one CPU cache
at a time.) This invalidation costs execution time. Because of this performance impact,
the Linux (2.6) kernel tries to ensure CPU affinity for a process: wherever
possible, the process is rescheduled to run on the same CPU.
A cache line is the cache analog of a page in a virtual memory management system.
It is the size of the unit used for
#include <sched.h>
int
Preventing realtime processes from locking up the system
Since SCHED_RR and SCHED_FIFO processes preempt any lower-priority processes (e.g.,
the shell under which the program is run), when developing applications that use
these policies, we need to be aware of the possibility that a runaway realtime
process could lock up the system by hogging the CPU. Programmatically, there are a
few ways to avoid this possibility:
Establish a suitably low soft CPU time resource limit (RLIMIT_CPU, described in
Section 36.3) using
Upon successful execution,
make arbitrary changes to the scheduling policy and priority of any process.
However, an unprivileged process can also change scheduling policies and priorities,
according to the following rules:
If the process has a nonzero RLIMIT_RTPRIO soft limit, then it can make arbitrary
changes to its scheduling policy and priority, subject to the constraint that the
upper limit on the realtime priority that
Listing 35-2: Modifying process scheduling policies and priorities
SUSv3 defines the
argument as a structure to allow an implementation to
include additional implementation-specific fields, which may be useful if an
implementation provides additional scheduling policies. However, like most UNIX
implementations, Linux provides just the sched_priority field, which specifies the
scheduling priority. For the SCHED_RR and SCHED_FIFO policies, this must be a value in
the range indicated by
For both system calls, the argument specifies the scheduling policy about which we wish to
obtain information. For this argument, we specify either SCHED_RR or SCHED_FIFO.
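To make the use of these two calls concrete, here is a minimal sketch that retrieves the priority range for a policy with sched_get_priority_min() and sched_get_priority_max(). The wrapper name priorityRange() is hypothetical, and treating a return value of -1 as an error assumes that valid priorities are nonnegative, as they are on Linux.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical wrapper: store the lowest and highest valid priority for
   'policy' in *min and *max. Returns 0 on success, or -1 if either call
   fails (e.g., EINVAL for an unknown policy). */
int
priorityRange(int policy, int *min, int *max)
{
    assert(min != NULL && max != NULL);

    *min = sched_get_priority_min(policy);
    *max = sched_get_priority_max(policy);

    return (*min == -1 || *max == -1) ? -1 : 0;
}
```

Querying the range at run time, rather than hard-coding numeric priorities, keeps a program portable across implementations whose ranges differ.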
#include <sched.h>
int
35.2.2 The SCHED_FIFO Policy
The SCHED_FIFO (first-in, first-out) policy is similar to the SCHED_RR policy. The major
difference is that there is no time slice. Once a SCHED_FIFO process gains access to the
CPU, it executes until either:
it voluntarily relinquishes the CPU (in the same manner as described for the
SCHED_RR policy above);
it terminates; or
it is preempted by a higher-priority process (in the same circumstances as
described for the SCHED_RR policy above).
In the first case, the process is placed at the back of the queue for its priority level. In
the last case, when the higher-priority process has ceased execution (by blocking or
terminating), the preempted process continues execution (i.e., the preempted process
remains at the head of the queue for its priority level).
35.2.3 The SCHED_BATCH and SCHED_IDLE Policies
The Linux 2.6 kernel series added two nonstandard scheduling policies: SCHED_BATCH
and SCHED_IDLE. Although these policies are set via the POSIX realtime scheduling
API, they are not actually realtime policies.
The SCHED_BATCH policy, added in kernel 2.6.16, is similar to the default SCHED_OTHER
policy. The difference is that the SCHED_BATCH policy causes jobs that frequently
wake up to be scheduled less often. This policy is intended for batch-style
execution of processes.
The SCHED_IDLE policy, added in kernel 2.6.23, is also similar to SCHED_OTHER,
but provides functionality equivalent to a very low nice value (i.e., lower than +19).
The process nice value has no meaning for this policy. It is intended for running
low-priority jobs that will receive a significant proportion of the CPU only if no
other job on the system requires the CPU.
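Since these policies are set through the same API as the realtime policies, a request for SCHED_BATCH might look like the following sketch. It assumes Linux with glibc, where _GNU_SOURCE is needed for the SCHED_BATCH and SCHED_IDLE constants; policyName() and makeBatch() are hypothetical helpers, not part of any library.

```c
#define _GNU_SOURCE             /* Exposes SCHED_BATCH and SCHED_IDLE */
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: map a policy constant to a printable name */
const char *
policyName(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    default:          return "unknown";
    }
}

/* Hypothetical helper: place 'pid' (0 means the caller) under SCHED_BATCH.
   For the nonrealtime policies, sched_priority must be 0.
   Returns 0 on success, or -1 with errno set. */
int
makeBatch(pid_t pid)
{
    struct sched_param sp;

    assert(pid >= 0);
    sp.sched_priority = 0;
    return sched_setscheduler(pid, SCHED_BATCH, &sp);
}
```

A caller could report the result with policyName(sched_getscheduler(0)) after makeBatch(0) returns.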
35.3 Realtime Process Scheduling API
We now look at the various system calls constituting the realtime process scheduling
API. These system calls allow us to control process scheduling policies and priorities.
Although realtime scheduling has been a part of Linux since version 2.0 of the
kernel, several problems persisted for a long time in the implementation. A
number of features of the implementation remained broken in the 2.2 kernel,
and even in early 2.4 kernels. Most of these problems were rectified by about
35.3.1 Realtime Priority Ranges
Adding support for hard realtime applications is difficult to achieve without
imposing an overhead on the system that conflicts with the performance
requirements of the time-sharing applications that form the majority of applications on
typical desktop and server systems. This is why most UNIX kernels (including,
historically, Linux) have not natively supported realtime applications. Nevertheless,
starting from around version 2.6.18, various features have been added to the Linux
kernel with the eventual aim of allowing Linux to natively provide full support for
hard realtime applications, without imposing the aforementioned overhead for
35.2.1 The SCHED_RR Policy
Under the SCHED_RR (round-robin) policy, processes of equal priority are executed in
a round-robin time-sharing fashion. A process receives a fixed-length time slice
each time it uses the CPU. Once scheduled, a process employing the SCHED_RR policy
maintains control of the CPU until either:
it reaches the end of its time slice;
it voluntarily relinquishes the CPU, either by performing a blocking system call
or by calling the
system call (described in Section 35.3.3);
it terminates; or
it is preempted by a higher-priority process.
For the first two events above, when a process running under the SCHED_RR policy
loses access to the CPU, it is placed at the back of the queue for its priority level. In
the final case, when the higher-priority process has ceased execution, the
preempted process continues execution, consuming the remainder of its time slice
(i.e., the preempted process remains at the head of the queue for its priority level).
In both the SCHED_RR and the SCHED_FIFO policies, the currently running process
may be preempted for one of the following reasons:
a higher-priority process that was blocked became unblocked (e.g., an I/O
A realtime application should be able to control the precise order in which its
component processes are scheduled.
SUSv3 specifies a realtime process scheduling API (originally defined in POSIX.1b)
that partly addresses these requirements. This API provides two realtime scheduling
policies: SCHED_RR and SCHED_FIFO. Processes operating under either of these policies
always have priority over processes scheduled using the standard round-robin
time-sharing policy described in Section 35.1, which the realtime scheduling API
identifies using the constant SCHED_OTHER.
Each of the realtime policies allows for a range of priority levels. SUSv3
requires that an implementation provide at
if (argc != 4 || strchr("pgu", argv[1][0]) == NULL)
usageErr("%s {p|g|u} who priority\n"
fashion, preempting any process that
may currently be running.
A time-critical application may need to take other steps to avoid unacceptable
delays. For example, to avoid being delayed by a page fault, an application can
lock all of its virtual memory into RAM using
or
(described in
A high-priority process should be able to
maintain exclusive access to the CPU
A privileged (CAP_SYS_NICE) process can change the priority of any process. An
unprivileged process may change its own priority (by specifying
as PRIO_PROCESS, and who as 0) or the priority of another (target) process, if its effective
user ID matches the real or effective user
Under the round-robin time-sharing algorithm, processes can't exercise direct control
over when and for how long they will be able to use the CPU. By default, each process
in turn receives use of the CPU until its time slice runs out or it voluntarily gives up the
CPU (for example, by putting itself to sleep or performing a disk read). If all processes
attempt to use the CPU as much as possible (i.e., no process ever sleeps or blocks on an
I/O operation), then they will receive a roughly equal share of the CPU.
However, one process attribute, the nice value, allows a process to indirectly
influence the kernel's scheduling algorithm. Each process has a nice value in the
range -20 (high priority) to +19 (low priority); the default is 0 (refer to Figure 35-1).
In traditional UNIX implementations, only privileged processes can assign themselves
(or other processes) a negative (high) priority. (We'll explain some Linux differences
in Section 35.3.2.) Unprivileged processes can only lower their priority, by
assuming a nice value greater than the default of 0. By doing this, they are being
"nice" to other processes, and this fact gives the attribute its name.
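The rule that an unprivileged process can only raise its nice value can be sketched with getpriority() and setpriority(). This is a minimal illustration, assuming the glibc behavior in which getpriority() returns the nice value directly in the range -20 to +19; getOwnNice() and lowerOwnPriority() are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: fetch the caller's nice value into *niceVal.
   Because -1 is a legitimate nice value, an error can be detected only
   by clearing errno before the call and checking it afterward.
   Returns 0 on success, or -1 on error. */
int
getOwnNice(int *niceVal)
{
    int prio;

    assert(niceVal != NULL);
    errno = 0;
    prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0)
        return -1;

    *niceVal = prio;
    return 0;
}

/* Hypothetical helper: be "nicer" by 'incr' (>= 0) steps, capped at +19.
   Returns 0 on success, or -1 on error. */
int
lowerOwnPriority(int incr)
{
    int cur, target;

    if (incr < 0 || getOwnNice(&cur) == -1)
        return -1;

    target = (cur + incr > 19) ? 19 : cur + incr;
    return setpriority(PRIO_PROCESS, 0, target);
}
```

Refusing negative increments in lowerOwnPriority() mirrors the traditional restriction on unprivileged processes; a privileged caller would use setpriority() directly to request a negative nice value.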
The nice value is inherited by a child created via fork() and preserved across
an exec().
[Figure 35-1: range of nice values, from high priority ((traditionally) only available
to privileged processes) to low priority]
PROCESS PRIORITIES AND SCHEDULING
This chapter discusses various system calls and process attributes that determine
when and which processes obtain access to the CPU(s). We begin by describing the
nice value, a process characteristic that influences the amount of CPU time that a
process is allocated by the kernel scheduler. We follow this with a description of the
POSIX realtime scheduling API. This API allows us to define the policy and priority
used for scheduling processes, giving us much tighter control over how processes
are allocated to the CPU. We conclude with a discussion of the system calls for
setting a process's CPU affinity mask, which
732
Chapter 34
34-4. Modify the program in Listing 34-4 (disc_SIGHUP.c) to verify that, if the controlling
process doesn't terminate as a consequence of receiving SIGHUP, then the kernel
to the members of the foreground process.
34-5. Suppose that, in the signal handler of Listing 34-6, the code that unblocks the
SIGTSTP signal was moved to the start of the handler. What potential race condition
does this create?
34-6. Write a program to verify that when a process in an orphaned process group
attempts to read() from the controlling terminal, the read() fails with the error
34-7. Write a program to verify that if one of the signals SIGTTIN, SIGTTOU, or SIGTSTP is sent
to a member of an orphaned process group, then the signal is discarded (i.e., has
no effect) if it would stop the process (i.e., the disposition is SIG_DFL), but is
delivered if a handler is installed for the signal.
Process Groups, Sessions, and Job Control
731
SIGHUP signal is delivered to many other processes. First, if the controlling
process is a shell (as is typically the case), then, before terminating, the shell sends
SIGHUP to each of the process groups it has created. Second, if delivery of SIGHUP
results in termination of a controlling process, then the kernel also sends SIGHUP
to all of the members of the foreground process group of the controlling terminal.
In general, applications don't need to be cognizant of job-control signals. One
exception is when a program performs screen-handling operations. Such programs
need to correctly handle the SIGTSTP signal, resetting terminal attributes to sane
values before the process is suspended, and restoring the correct
(application-specific) terminal attributes when the application is once more resumed following
a SIGCONT signal.
A process group is considered to be orphaned if none of its member processes
has a parent in a different process group in the same session. Orphaned process
groups are significant because there is no process outside the group that can both
monitor the state of any stopped processes within the group and is always allowed
to send a SIGCONT signal to these stopped processes in order to restart them. This
could result in such stopped processes languishing forever on the system. To avoid
this possibility, when a process group with stopped member processes becomes
orphaned, all members of the process group are sent a SIGHUP signal, followed by a
SIGCONT signal, to notify them that they have become orphaned and ensure that they
Further information
Chapter 9 of [Stevens & Rago, 2005] covers similar material to this chapter, and
includes a description of the steps that occur during login to
for a login shell. The
manual contains a lengthy description of the functions relating to job control and
the implementation of job control within the shell. The
SUSv3 rationale contains an extensive discussion of sessions, process groups, and
job control.
34.9 Exercises
34-1. Suppose a parent process performs the following steps:
/* Call to create a number of child processes, each of which
   remains in same process group as the parent */
SIGTSTP, SIGTTIN, and SIGTTOU
Orphaned process groups also affect the semantics for delivery of the SIGTSTP,
SIGTTIN, and SIGTTOU signals.
In Section 34.7.1, we saw that SIGTTIN is sent to a background process if it tries
to read() from the controlling terminal, and SIGTTOU is sent to a background process
that tries to write to the controlling terminal if the terminal's TOSTOP
raise(SIGSTOP);
} else { /* Wait for signal */
alarm(60); /* So we die if not SIGHUPed */
process group becoming orphaned, the signal handler is invoked, and it displays
the child's process ID and the signal number
Listing 34-7: SIGHUP and orphaned process groups

pgsjc/orphaned_pgrp_SIGHUP.c
Since the shell did not create the child process, it is not aware of the child's existence or
that the child is part of the same process group as the deceased parent. Furthermore,
the init process checks only for a terminated child, and then reaps the resulting zombie
process. Consequently, the stopped child might languish forever, since no other
process knows to send it a SIGCONT signal in order to cause it to resume execution.
Even if a stopped process in an orphaned process group has a still-living parent
in a different session, that parent is not guaranteed to be able to send SIGCONT to the
stopped child. A process may send SIGCONT to any other process in the same session,
but if the child is in a different session, the normal rules for sending signals apply
(Section 20.5), so the parent may not be able to send a signal to the child if the
child is a privileged process that has changed its credentials.
To prevent scenarios such as the one described above, SUSv3 specifies that if a
process group becomes orphaned and has any stopped members, then all members of the
group are sent a SIGHUP signal, to inform them that they have become disconnected
from their session, followed by a SIGCONT signal, to ensure that they resume
execution. If the orphaned process group doesn't have any stopped members, no signals
are sent.
A process group may become orphaned either because the last parent in a
different process group in the same session terminated or because of the termination
of the last process within the group that had a parent in another group. (The latter
case is the one illustrated in Figure 34-3.) In either case, the treatment of a newly
orphaned process group containing stopped children is the same.
Sending SIGHUP and SIGCONT to a newly orphaned process group that contains
stopped members is done in order to eliminate a specific loophole in the job-control
framework. There is nothing to prevent the members of an already-orphaned
process group from later being stopped if another process (with
suitable privileges) sends them a stop signal. In this case, the processes will
remain stopped until some process (again with suitable privileges) sends them a
SIGCONT signal.
When called by a member of an orphaned process group, the
Suppose that we include this code in a program executed from the shell. Figure 34-3
shows the state of processes before and after the parent exits.
After the parent terminates, the child process in Figure 34-3 is not only an
orphaned process, it is also part of an orphaned process group. A process group
is defined as orphaned if the parent of every member is either itself a member of
the group or is not a member of the group's session. Put another way, a process
group is not orphaned if at least one of its members has a parent in the same
session but in a different process group. In Figure 34-3, the process group containing
the child is orphaned because the child is in a process group on its own and its
parent (init) is in a different session.
By definition, a session leader is in an orphaned process group. This follows
[Figure 34-3: state of the parent and child processes before the parent exits, and
after adoption of the child by init]
/* Only establish handler for SIGTSTP if it is not being ignored */
if (sigaction(SIGTSTP, NULL, &sa) == -1)
errExit("sigaction");
if (sa.sa_handler != SIG_IGN) {
Note that the SIGTSTP handler may interrupt certain blocking system calls (as
described in Section 21.5). This point is illustrated in the above program output
by the fact that, after the
call is interrupted, the main program prints the
Listing 34-6: SIGTSTP
pgsjc/handling_SIGTSTP.c
#include <signal.h>
#include "tlpi_hdr.h"
static void /* Handler for SIGTSTP */
tstpHandler(int sig)
can use the wait status value returned by wait() or waitpid() to determine which signal
caused one of its children to stop. If we raise the SIGSTOP signal in the handler for
SIGTSTP, it will (misleadingly) appear to the parent that the child was stopped by
SIGSTOP. The proper approach in this situation is to have the handler raise a further
SIGTSTP signal to stop the process, as follows:
Now all members of the process group are stopped. The output indicates that
process group 1228 was the foreground job. However, after this job was stopped, the
shell became the foreground process group, although we can't tell this from the
output.
We then proceed by restarting the job using the bg command, which delivers a
SIGCONT signal to the processes in the job:
bg
Resume job in background
[2]+ ./job_mon | ./job_mon | ./job_mon &
Process 1230 (3) received signal 18 (Continued)
Process 1229 (2) received signal 18 (Continued)
Terminal FG process group: 1204
The shell is in the foreground
Process 1228 (1) received signal 18 (Continued)
kill %1 %2
We've finished: clean up
[1]- Terminated ./job_mon | ./job_mon
[2]+ Terminated ./job_mon | ./job_mon | ./job_mon
34.7.3 Handling Job-Control Signals
Because the operation of job control is transparent to most applications, they don't
need to take special action for dealing with job-control signals. One exception is
programs that perform screen handling, such as
and
. Such programs control
The following shell session demonstrates the use of the program in Listing 34-5.
We begin by displaying the process ID of the shell (which is the session leader, and
the leader of a process group of which it is the sole member), and then create a
background job containing two processes:
echo $$
Show PID of the shell
1204
./job_mon | ./job_mon &
Start a job containing 2 processes
[1] 1227
Terminal FG process group: 1204
Command PID PPID PGRP SID
1 1226 1204 1226 1204
2 1227 1204 1226 1204
From the above output, we can see that the shell remains the foreground process
for the terminal. We can also see that the new job is in the same session as the shell
and that all of the processes are in the same process group. Looking at the process
IDs, we can see that the processes in the job were created in the same order as the
commands were given on the command line. (Most shells do things this way, but
some shell implementations create the processes in a different order.)
We continue, creating a second background job consisting of three processes:
./job_mon | ./job_mon | ./job_mon &
[2] 1230
Terminal FG process group: 1204
Command PID PPID PGRP SID
1 1228 1204 1228 1204
2 1229 1204 1228 1204
3 1230 1204 1228 1204
We see that the shell is still the foreground process group for the terminal. We also
see that the processes for the new job are in the same session as the shell, but are in
a different process group from the first job. Now we bring the second job into the
foreground and send it a
SIGINT
signal:
./job_mon | ./job_mon | ./job_mon
Type Control-C to generate
SIGINT
(signal 2)
Process 1230 (3) received signal 2 (Interrupt)
Process 1229 (2) received signal 2 (Interrupt)
Terminal FG process group: 1228
Process 1228 (1) received signal 2 (Interrupt)
From the above output, we see that the SIGINT signal was delivered to all of the
processes in the foreground process group. We also see that this job is now the
foreground process group for the terminal. Next, we send a SIGTSTP
Type Control-Z to generate SIGTSTP (signal 20 on Linux/x86-32).
Process 1230 (3) received signal 20 (Stopped)
Process 1229 (2) received signal 20 (Stopped)
Terminal FG process group: 1228
Process 1228 (1) received signal 20 (Stopped)
[2]+ Stopped ./job_mon | ./job_mon | ./job_mon
The program in Listing 34-5 performs the following steps:
On startup, the program installs a single handler for
The handler carries out the following steps:
Display the foreground process group for the terminal. To avoid multiple
identical lines of output, this is done only by the process group leader.
Display the ID of the process, the process's position in the pipeline, and
the signal received
The handler must do some extra work if it catches SIGTSTP, since, when
caught, this signal doesn't stop a process. So, to actually stop the process,
the handler raises the SIGSTOP signal, which always stops a process. (We
SIGTSTP in Section 34.7.3.)
If the program is the initial process in the pipeline, it prints headings for the
output produced by all of the processes
order to be able to carry out these actions, the terminal driver must also record
the session ID (controlling process) and foreground process group ID
associated with a terminal (Figure 34-1).
The shell must support job control (most modern shells do so). This support is
provided in the form of the commands
We can then see the output of the job
by bringing it into the foreground:
date
Tue Dec 28 16:20:51 CEST 2010
The various states of a job under job control, as well as the shell commands and
terminal characters (and the accompanying si
[Figure: job-control states (foreground, background, stopped in background,
terminated) and the transitions between them: Control-Z, kill -STOP, terminal read,
terminal write (+TOSTOP), Control-C, Control-\]
We can stop a background job by sending it a
SIGSTOP
signal:
kill -STOP %1
[1]+ Stopped grep -r SIGHUP /usr/src/linux >x
[1]+ Stopped grep -r SIGHUP /usr/src/linux >x
[2]- Running sleep 60 &
bg %1
Restart job in background
[1]+ grep -r SIGHUP /usr/src/linux >x &
The Korn and C shells provide the command
as a shorthand for
kill stop
Each job that is placed in the background is assigned a unique job number by the
shell. This job number is shown in square brackets after the job is started in the
background, and also when the job is manipulated or monitored by various job-control
commands. The number following the job number is the process ID of the
process created to execute the command, or, in the case of a pipeline, the process
ID of the last process in the pipeline. In the commands described in the following
paragraphs, jobs can be referred to using the notation %num, where num is the
number assigned to this job by the shell.
In many cases, the %num argument can be omitted, in which case the current
job is used by default. The current job is the last job that was stopped in the
foreground (using the suspend character described below), or, if there is no
such job, then the last job that was started in the background. (There are some
SIGHUP when the terminal window is closed. After closing the terminal window, we
find the following lines in the file sig.log:
PID of parent process is: 12733
Foreground process group ID is: 12733
PID=12755 PGID=12755
First child is in a different process group
PID=12756 PGID=12733
Remaining children are in same PG as parent
PID=12757 PGID=12733
PID=12733 PGID=12733
This is the parent process
PID 12756: caught signal 1 (Hangup)
PID 12757: caught signal 1 (Hangup)
Closing the terminal window caused SIGHUP to be sent to the controlling process
(the parent), which terminated as a result. We see that the two children that were in
the same process group as the parent (i.e., the foreground process group for the
terminal) also both received SIGHUP. However, the child that was in a separate
(background) process group did not receive this signal.
34.7 Job Control
Job control is a feature that first appeared around 1980 in the C shell on BSD. Job
control permits a shell user to simultaneously execute multiple commands (jobs),
one in the foreground and the others in the background. Jobs can be stopped and
int
main(int argc, char *argv[])
pid_t parentPid, childPid;
int j;
struct sigaction sa;
if (argc < 2 || strcmp(argv[1], "--help") == 0)
    usageErr("%s {d|s}... [ > sig.log 2>&1 ]\n", argv[0]);
When we examine
, we find the following output, indicating that when the
, it did not send a signal to the process group that it did not create:
cat diffgroup.log
PID=5614; PPID=5613; PGID=5614; SID=5533
PID=5613; PPID=5533; PGID=5613; SID=5533
Parent
5613: caught SIGHUP
Parent was signaled, but not child
SIGHUP and Termination of the Controlling Process
If the SIGHUP signal that is sent to the controlling process as the result of a terminal
disconnect causes the controlling process to terminate, then SIGHUP is sent to all of
the members of the terminal's foreground process group (refer to Section 25.2).
This behavior is a consequence of the termination of the controlling process,
rather than a behavior associated specifically with the SIGHUP signal. If the controlling
process terminates for any reason, then the foreground process group is signaled
with SIGHUP.
On Linux, the SIGHUP signal is followed by a SIGCONT signal to ensure that the
process group is resumed if it had earlier been stopped by a signal. However,
SUSv3 doesn't specify this behavior, and most other UNIX implementations
don't send a SIGCONT in this circumstance.
We can use the program in Listing 34-4 to demonstrate that termination of the
controlling process causes a SIGHUP signal to be sent to all members of the terminal's
foreground process group. This program creates one child process for each of its
command-line arguments. If the corresponding command-line argument is the
int
main(int argc, char *argv[])
pid_t childPid;
struct sigaction sa;
The SIGHUP signal also finds other uses. In Section 34.7.4, we'll see that SIGHUP is
generated when a process group becomes orphaned. In addition, manually sending
SIGHUP is conventionally used as a way of triggering a daemon process
to reinitialize itself or reread its configuration file. (By definition, a daemon
process doesn't have a controlling terminal, and so can't otherwise receive
SIGHUP from the kernel.) We describe the use of SIGHUP with daemon processes
in Section 37.4.
34.6.1 Handling of SIGHUP by the Shell
In a login session, the shell is normally the controlling process for the terminal.
Most shells are programmed so that, when run interactively, they establish a handler
for SIGHUP. This handler terminates the shell, but beforehand sends a SIGHUP signal
to each of the process groups (both foreground and background) created by the
shell. (The SIGHUP signal may be followed by a SIGCONT signal, depending on the shell
If a process has a controlling terminal, opening the special file /dev/tty obtains
a file descriptor for that terminal. This is useful if standard input and output are
redirected, and a program wants to ensure that it is communicating with the
controlling terminal. For example, the
As can be seen from the output, the process successfully places itself in a new
process group within a new session. Since this session has no controlling terminal, the
open() call fails. (In the penultimate line of program output above, we see a shell
prompt mixed with the program output, because the shell notices that the parent
process has exited after the fork() call, and so prints its next prompt before the child
On a few UNIX implementations (e.g., HP-UX 11)
Things are slightly more complex than shown in Listing 34-1, since, when creating
the processes for a pipeline, the parent shell records the process ID of the first
process in the pipeline and uses this as the process group ID (pipelinePgid) for all of the
processes in the group.
Using
    pid_t childPid;
    pid_t pipelinePgid;         /* PGID to which processes in a pipeline
                                   are to be assigned */
    /* Other code */

    childPid = fork();
    switch (childPid) {
    case -1: /* fork() failed */
        /* Handle error */

    case 0: /* Child */
        if (setpgid(0, pipelinePgid) == -1)
            /* Handle error */
        /* Child carries on to exec the required program */

    default: /* Parent (shell) */
        /* EACCES means the child has already performed an exec */
        if (setpgid(childPid, pipelinePgid) == -1 && errno != EACCES)
            /* Handle error */
        /* Parent carries on to do other things */
    }
Figure 34-1 shows the process group and session relationships between the various
processes resulting from the execution of the following commands:
echo $$
Display the PID of the shell
400
find / 2> /dev/null | wc -l &
Creates 2 processes in background group
[1] 659
sort longlist | uniq -c
Creates 2 processes in foreground group
At this point, the shell (bash), find, wc, sort, and uniq are all running.
[Figure 34-1: session 400 and its controlling terminal (Foreground PGID = 660,
Controlling SID = 400). The session contains process group 400 (bash, the session
leader and controlling process), background process group 658 (find, wc), and
foreground process group 660 (sort, uniq).]
#include <unistd.h>
pid_t getpgrp(void);
another process group. The process group leader need not be the last member of a
process group.
A session is a collection of process groups. A process's session membership is
PROCESS GROUPS, SESSIONS, AND JOB CONTROL
Process groups and sessions form a two-level hierarchical relationship between
processes: a process group is a collection of related processes, and a session is a
collection of related process groups. The meaning of the term related in each case will
become clear in the course of this chapter.
Process groups and sessions are abstractions defined to support shell job control,
which allows interactive users to run commands in the foreground or in the
background. The term job is often used synonymously with the term process group.
This chapter describes process groups, sessions, and job control.
34.1 Overview
A process group is a set of one or more processes sharing the same process group identifier (PGID). A process group ID is a number of the same type (pid_t) as a process ID. A process group has a process group leader, which is the process that creates the group and whose process ID becomes the process group ID of the group. A new process inherits its parent's process group ID.
A process group has a
Chapter 33
The range of kernel version numbers that can be specified in LD_ASSUME_KERNEL is subject to some limits. In several common distributions that supply both NPTL and LinuxThreads, specifying the version number as 2.2.5 is sufficient to ensure the use of LinuxThreads. For a fuller description of the use of this environment variable, see http://people.redhat.com/drepper/assumekernel.html.
33.6 Advanced Features
Some advanced features of the Pthreads API include the following:
Realtime scheduling
In kernels prior to 2.6.12, interval timers created using
confstr(3)
Other problems with LinuxThreads
In addition to the above deviations from SUSv3, the LinuxThreads implementation has the following problems:
If the manager thread is killed, then the remaining threads must be manually cleaned up.
A core dump of a multithreaded program may not include all of the threads of the process (or even the one that triggered the core dump).
The TIOCNOTTY operation can remove the process's association with a controlling terminal only when called from the main thread.
33.5.2 NPTL
NPTL was designed to address most of the shortcomings of LinuxThreads. In particular:
NPTL provides much closer conformance to the SUSv3 specification for Pthreads.
Applications that employ large numbers of threads scale much better under NPTL than under LinuxThreads.
NPTL allows an application to create large numbers of threads. The NPTL implementers were able to run test programs that created 100,000 threads. With LinuxThreads, the practical limit on the number of threads is a few thousand. (Admittedly, very few applications need such large numbers of threads.)
Work on implementing NPTL began in 2002 and progressed over the next year or so. In parallel, various changes were made within the Linux kernel to accommodate the requirements of NPTL. The changes that appeared in the Linux 2.6 kernel to support NPTL included the following:
refinements to the implementation of thread groups (Section 28.2.1);
the addition of futexes as a synchronization mechanism (futexes are a generic mechanism that was designed not just for NPTL);
the addition of new system calls (
In addition to the threads created by the application, LinuxThreads creates an additional manager thread that handles thread creation and termination.
The implementation uses signals for its internal operation. With kernels that support realtime signals (Linux 2.2 and later), the first three realtime signals are used. With older kernels, SIGUSR1 and SIGUSR2 are used. Applications can't use these signals. (The use of signals results in high latency for various thread synchronization operations.)
LinuxThreads deviations from specified behavior
LinuxThreads doesn't conform to the SUSv3 specification for Pthreads on a number of points. (The LinuxThreads implementation was constrained by the kernel features available at the time that it was developed; it was as conformant as practicable within those constraints.) The following list summarizes the nonconformances:
Calls to
(or similar)
this is not so; only the thread
that created the child process can
wait
, then, as required by SUSv3, all other threads are terminated.
However, if the exec() is done from any thread other than the main thread, then the resulting process will have the same process ID as the calling thread; that is, a process ID that is different from the main thread's process ID. According to SUSv3, the process ID should be the same as that of the main thread.
Threads don't share credentials (user and group IDs). When a multithreaded
handled entirely within the process by a user-space threading library. The kernel knows nothing about the existence of multiple threads within the process.
M:1 implementations have a few advantages. The greatest advantage is that many threading operations (for example, creating and terminating a thread, context switching between threads, and mutex and condition variable operations) are fast, since a switch to kernel mode is not required. Furthermore, since kernel support for the threading library is not required, an M:1 implementation can be relatively easily ported from one system to another.
However, M:1 implementations suffer from some serious disadvantages:
When a thread makes a system call such as read(), control passes from the user-space threading library to the kernel. This means that if the read() call blocks, then all threads in the process are blocked.
The kernel can't schedule the threads of a process. Since the kernel is unaware of the existence of multiple threads within the process, it can't schedule the separate threads to different processors on multiprocessor hardware. Nor is it possible to meaningfully assign a thread in one process a higher priority than a thread in another process, since the scheduling of the threads is handled entirely within the process.
One-to-one (1:1) implementations (kernel-level threads)
In a 1:1 threading implementation, each thread maps onto a separate KSE. The kernel handles each thread's scheduling separately. Thread synchronization operations are implemented using system calls into the kernel.
1:1 implementations eliminate the disadvantages suffered by M:1 implementations. A blocking system call does not cause all of the threads in a process to block, and the kernel can schedule the threads of a process onto different CPUs on multiprocessor hardware.
However, operations such as thread creation, context switching, and synchronization are slower on 1:1 implementations, since a switch into kernel mode is required. Furthermore, the overhead required to maintain a separate KSE for each of the threads in an application that contains a large number of threads may place a significant load on the kernel scheduler, degrading overall system performance.
Despite these disadvantages, a 1:1 implementation is usually preferred over an M:1 implementation. Both of the Linux threading implementations (LinuxThreads and NPTL) employ the 1:1 model.
During the development of NPTL, significant effort went into rewriting the kernel scheduler and devising a threading implementation that would allow the efficient execution of multithreaded processes containing many thousands of threads. Subsequent testing showed that this goal was achieved.
Many-to-many (M:N) implementations (two-level model)
M:N implementations aim to combine the advantages of the 1:1 and M:1 models, while eliminating their disadvantages.
In the M:N model, each process can have multiple associated KSEs, and several threads may map to each KSE. This design permits the kernel to distribute the threads of an application across multiple CPUs, while eliminating the possible scaling problems associated with applications that employ large numbers of threads.
The operation of sigwait() is the same as sigwaitinfo(), except that:
#include <signal.h>
int sigwait(
More precisely, SUSv3 specifies that there is a separate alternate signal stack for each kernel scheduling entity (KSE). On a system with a 1:1 threading implementation, as on Linux, there is one KSE per thread (see Section 33.4).
33.2.2 Manipulating the Thread Signal Mask
When a new thread is created, it inherits a copy of the signal mask of the thread that created it. A thread can use pthread_sigmask()
#include <signal.h>
int pthread_kill(pthread_t
Returns 0 on success, or a positive error number on error
great depth (perhaps because of recursion). Alternatively, an application may want to reduce the size of per-thread stacks to allow for a greater number of threads within a process. For example, on x86-32, where the user-accessible virtual address space is 3 GB, the default stack size of 2 MB means that we can create a maximum of around 1500 threads. (The precise maximum depends on how much virtual memory is consumed by the text and data segments, shared libraries, and so on.)
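The figure of around 1500 threads follows from dividing the address space by the stack size: 3 GB / 2 MB = 1536, less whatever the rest of the process consumes. The following sketch shows one way that a smaller per-thread stack might be requested using a thread attributes object. The function names runWithStackSize() and noopThread() are hypothetical, and a suitable stack size depends on what the thread actually does.

```c
#include <pthread.h>
#include <stddef.h>

static void *
noopThread(void *arg)
{
    return arg;
}

/* Create and join one thread whose stack is 'stackSize' bytes.
   Returns 0 on success, or a positive Pthreads error number. */
int
runWithStackSize(size_t stackSize)
{
    pthread_attr_t attr;
    pthread_t thr;
    int s;

    s = pthread_attr_init(&attr);
    if (s != 0)
        return s;

    /* Fails if 'stackSize' is below the implementation minimum */
    s = pthread_attr_setstacksize(&attr, stackSize);
    if (s == 0)
        s = pthread_create(&thr, &attr, noopThread, NULL);

    pthread_attr_destroy(&attr);    /* No longer needed after creation */
    if (s != 0)
        return s;

    return pthread_join(thr, NULL);
}
```

For example, runWithStackSize(256 * 1024) asks for a 256-kB stack, while a request for a 1-byte stack is rejected with a nonzero error number.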
The minimum stack that can be employed on a particular architecture can be
Chapter 32
32.6 Asynchronous Cancelability
When a thread is made asynchronously cancelable (cancelability type PTHREAD_CANCEL_ASYNCHRONOUS), it may be canceled at any time (i.e., at any machine-language instruction); delivery of a cancellation is not held off until the thread next reaches a cancellation point.
The problem with asynchronous cancellation is that, although cleanup handlers are still invoked, the handlers have no wa
Threads: Thread Cancellation
    s = pthread_create(&thr, NULL, threadFunc, NULL);
    if (s != 0)
        errExitEN(s, "pthread_create");

    sleep(2);                   /* Give thread a chance to get started */

    if (argc == 1) {            /* Cancel thread */
        printf("main: about to cancel thread\n");
        s = pthread_cancel(thr);
        if (s != 0)
            errExitEN(s, "pthread_cancel");

    } else {                    /* Signal condition variable */
        printf("main: about to signal condition variable\n");
        glob = 1;
        s = pthread_cond_signal(&cond);
        if (s != 0)
            errExitEN(s, "pthread_cond_signal");
    }

    s = pthread_join(thr, &res);
    if (s != 0)
        errExitEN(s, "pthread_join");
    if (res == PTHREAD_CANCELED)
        printf("main: thread was canceled\n");
    else
        printf("main: thread terminated normally\n");

    exit(EXIT_SUCCESS);
threads/thread_cleanup.c
If we invoke the program in Listing 32-2 without any command-line arguments, then main() calls pthread_cancel(), the cleanup handler is invoked automatically, and we see the following:
$ ./thread_cleanup
thread: allocated memory at 0x804b050
main: about to cancel thread
cleanup: freeing block at 0x804b050
cleanup: unlocking mutex
main: thread was canceled
If we invoke the program with a command-line argument, then main() sets glob to 1 and signals the condition variable, the cleanup handler is invoked by pthread_cleanup_pop(), and we see the following:
$ ./thread_cleanup s
thread: allocated memory at 0x804b050
main: about to signal condition variable
Listing 32-2: Using cleanup handlers
threads/thread_cleanup.c
#include <pthread.h>
#include "tlpi_hdr.h"

static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int glob = 0;                    /* Predicate variable */

static void     /* Free memory pointed to by 'arg' and unlock mutex */
cleanupHandler(void *arg)
{
    int s;

    printf("cleanup: freeing block at %p\n", arg);
    free(arg);

    printf("cleanup: unlocking mutex\n");
    s = pthread_mutex_unlock(&mtx);
    if (s != 0)
        errExitEN(s, "pthread_mutex_unlock");
}

static void *
threadFunc(void *arg)
{
    int s;
    void *buf = NULL;                   /* Buffer allocated by thread */

    buf = malloc(0x10000);              /* Not a cancellation point */
    printf("thread: allocated memory at %p\n", buf);

    s = pthread_mutex_lock(&mtx);       /* Not a cancellation point */
    if (s != 0)
        errExitEN(s, "pthread_mutex_lock");

    pthread_cleanup_push(cleanupHandler, buf);

    while (glob == 0) {
        s = pthread_cond_wait(&cond, &mtx);     /* A cancellation point */
        if (s != 0)
            errExitEN(s, "pthread_cond_wait");
    }
call to pthread_cleanup_push() has an accompanying call to pthread_cleanup_pop(). This function removes the topmost function from the stack of cleanup handlers. If the execute argument is nonzero, the handler is also executed. This is convenient if we want to perform the cleanup action even if the thread was not canceled.
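As a minimal sketch of the nonzero execute case, the following thread pushes a handler that frees a buffer and then pops it with an argument of 1, so that the buffer is freed on the normal (uncanceled) path as well. The names freeBuffer(), worker(), runWorker(), and the cleanupRuns counter are hypothetical and exist only so that the effect can be observed.

```c
#include <pthread.h>
#include <stdlib.h>

static int cleanupRuns = 0;     /* Counts handler invocations (demo only) */

static void
freeBuffer(void *arg)
{
    free(arg);
    cleanupRuns++;
}

static void *
worker(void *arg)
{
    char *buf = malloc(1024);
    if (buf == NULL)
        return NULL;

    pthread_cleanup_push(freeBuffer, buf);
    buf[0] = '\0';              /* ... use the buffer ... */
    pthread_cleanup_pop(1);     /* Nonzero: remove the handler and run it */

    return arg;
}

/* Returns the number of times the handler ran, or -1 on error */
int
runWorker(void)
{
    pthread_t thr;

    cleanupRuns = 0;
    if (pthread_create(&thr, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_join(thr, NULL) != 0)
        return -1;
    return cleanupRuns;
}
```

Here runWorker() returns 1: the handler runs exactly once even though the thread is never canceled. Note that the push and pop calls appear in the same lexical block of worker().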
Although we have described pthread_cleanup_push() and pthread_cleanup_pop() as functions, SUSv3 permits them to be implemented as macros that expand to statement sequences that include an opening ({) and closing (}) brace, respectively. Not all UNIX implementations do things this way, but Linux and many others do. This means that each use of pthread_cleanup_push() must be paired with exactly one pthread_cleanup_pop() in the same lexical block. (On implementations that do things this way,
A thread that is executing code that does not otherwise include cancellation points can periodically call pthread_testcancel() to ensure that it responds in a timely fashion to a cancellation request sent by another thread.
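The following sketch illustrates the idea with a compute-bound loop that contains no other cancellation points. The names spin() and cancelSpinningThread() are assumptions made for this example. Without the pthread_testcancel() call, a thread using the default deferred cancelability type would never act on the request, and the pthread_join() call would block indefinitely.

```c
#include <pthread.h>

static void *
spin(void *arg)
{
    volatile unsigned long n = 0;

    (void) arg;                 /* Unused */
    for (;;) {
        n++;                    /* Pure computation: no cancellation point */
        pthread_testcancel();   /* Explicit cancellation point */
    }
}

/* Returns 1 if the thread was canceled, 0 if not, or -1 on error */
int
cancelSpinningThread(void)
{
    pthread_t thr;
    void *res;

    if (pthread_create(&thr, NULL, spin, NULL) != 0)
        return -1;
    if (pthread_cancel(thr) != 0)
        return -1;
    if (pthread_join(thr, &res) != 0)
        return -1;
    return res == PTHREAD_CANCELED;
}
```

In a real program, calling pthread_testcancel() on every iteration may be more frequent than necessary; calling it every so many iterations trades responsiveness against overhead.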
32.5 Cleanup Handlers
If a thread with a pending cancellation were simply terminated when it reached a cancellation point, then shared variables and Pthreads objects (e.g., mutexes) might be left in an inconsistent state, perhaps causing the remaining threads in the
#include <pthread.h>
void pthread_cleanup_push(void (*routine)(void*), void *arg);
void pthread_cleanup_pop(int execute);
    printf("New thread started\n");     /* May be a cancellation point */
    for (j = 1; ; j++) {
        printf("Loop %d\n", j);         /* May be a cancellation point */
        sleep(1);                       /* A cancellation point */
    }
    /* NOTREACHED */
various functions that retrieve information from system files such as the utmp file. A portable program must correctly handle the possibility that a thread may be canceled when calling these functions.
SUSv3 specifies that aside from the two lists of functions that must and may be cancellation points, none of the other functions in the standard may act as cancellation points (i.e., a portable program doesn't need to handle the possibility that calling these other functions could precipitate thread cancellation).
SUSv4 adds
open
to the list of functions that must be cancellation points,
(it moves to the list of functions that
may
be cancellation
(which is dropped from the standard).
An implementation is free to mark additional functions that are not specified in the standard as cancellation points. Any function that might block (perhaps because it might access a file) is a likely candidate to be a cancellation point.
Within
, many nonstandard functions are marked as cancellation points
for this reason.
Upon receiving a cancellation request, a thread whose cancelability is enabled and deferred terminates when it next reaches a cancellation point. If the thread was not detached, then some other thread in the process must join with it, in order to prevent it from becoming a zombie thread. When a canceled thread is joined, the value
The thread's previous cancelability type is returned in the location pointed to by
As with the
Having made the cancellation request,
THREADS: THREAD CANCELLATION
Typically, multiple threads execute in parallel, with each thread performing its task until it decides to terminate by calling
pthrea
Chapter 31
to change its interface. Both of these techniques allow a function to allocate persistent, per-thread storage.
Further information
Refer to the sources of further information listed in Section 29.10.
31.6 Exercises
31-1. Implement a function, one_time_init(control, init), that performs the equivalent of pthread_once(). The control argument should be a pointer to a statically allocated structure containing a Boolean variable and a mutex. The Boolean variable indicates whether the function init has already been called, and the mutex controls access to that variable. To keep the implementation simple, you can ignore possibilities such as init() failing or being canceled when first called from a thread (i.e., it is not necessary to devise a scheme whereby, if such an event occurs, the next thread that calls one_time_init() reattempts the call to init()).
31-2. Use thread-specific data to write thread-safe versions of dirname() and basename() (Section 18.14).
31.3.5 Thread-Specific Data Implementation Limits
As implied by our description of how thread-specific data is typically implemented, an implementation may need to impose limits on the number of thread-specific data keys that it supports. SUSv3 requires that an implementation support at least 128 (_POSIX_THREAD_KEYS_MAX) keys.
Thread-local storage requires support from the kernel (provided in Linux 2.6), the Pthreads implementation (provided in NPTL), and the C compiler (provided on x86-32 with gcc 3.3 and later).
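To make the mechanism concrete, here is a small sketch that uses the gcc __thread specifier to give each thread its own copy of a static buffer. It is not the book's strerror() listing; formatNumber(), otherThread(), and buffersAreDistinct() are hypothetical names chosen for this illustration.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 64

static __thread char buf[BUF_SIZE];     /* One copy per thread */

/* Format 'n' into the calling thread's private buffer and return it */
char *
formatNumber(int n)
{
    snprintf(buf, BUF_SIZE, "value %d", n);
    return buf;
}

static void *
otherThread(void *arg)          /* 'arg' is the main thread's buffer */
{
    char *mine = formatNumber(2);

    return (mine != (char *) arg) ? arg : NULL;
}

/* Returns 1 if the two threads used distinct buffers and the main
   thread's string was left intact, 0 otherwise, or -1 on error */
int
buffersAreDistinct(void)
{
    pthread_t thr;
    void *res;
    char *mainBuf = formatNumber(1);

    if (pthread_create(&thr, NULL, otherThread, mainBuf) != 0)
        return -1;
    if (pthread_join(thr, &res) != 0)
        return -1;
    return res != NULL && strcmp(mainBuf, "value 1") == 0;
}
```

The only change needed relative to a non-thread-safe version is the __thread specifier on the buffer declaration, which is the main attraction of thread-local storage over thread-specific data.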
Listing 31-4 shows a thread-safe implementation of strerror() using thread-local storage. If we compile and link our test program (Listing 31-2) with this version of strerror() to create an executable file, strerror_test_tls, then we see the following results when running the program:
$ ./strerror_test_tls
Main thread has called strerror()
Other thread about to call strerror()
Other thread: str (0x40376ab0) = Operation not permitted
Main thread: str (0x40175080) = Invalid argument
key that is stored in the global variable strerrorKey. The pthread_key_create() call also records the address of the destructor that will be used to free the thread-specific buffers corresponding to this key.
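The overall pattern (one-time key creation, a destructor, and a per-thread buffer allocated on first use) might be sketched as follows. This is an illustration under assumed names (bufKey, createKey(), getThreadBuffer()), not the book's strerror() implementation, and for brevity it ignores a failure of pthread_key_create().

```c
#include <pthread.h>
#include <stdlib.h>

#define BUF_SIZE 256

static pthread_once_t once = PTHREAD_ONCE_INIT;
static pthread_key_t bufKey;

static void                     /* Called at thread exit for non-NULL values */
destructor(void *buf)
{
    free(buf);
}

static void                     /* Runs exactly once, via pthread_once() */
createKey(void)
{
    pthread_key_create(&bufKey, destructor);    /* Error ignored in this sketch */
}

/* Return the calling thread's private buffer, allocating it on the
   thread's first call. Returns NULL on error. */
char *
getThreadBuffer(void)
{
    char *buf;

    if (pthread_once(&once, createKey) != 0)
        return NULL;

    buf = pthread_getspecific(bufKey);
    if (buf == NULL) {          /* First call from this thread */
        buf = malloc(BUF_SIZE);
        if (buf == NULL)
            return NULL;
        if (pthread_setspecific(bufKey, buf) != 0) {
            free(buf);
            return NULL;
        }
    }
    return buf;
}
```

Repeated calls from the same thread return the same address, while each other thread that calls getThreadBuffer() gets a separately allocated buffer that the destructor frees when that thread terminates.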
The strerror() function then calls
On many UNIX implementations, including Linux, the strerror() function provided by the standard C library is thread-safe. However, we use the example of strerror() anyway, because SUSv3 doesn't require this function to be thread-safe, and its implementation provides a simple example of the use of thread-specific data.
Listing 31-1 shows a simple non-thread-safe implementation of strerror(). This function makes use of a pair of global variables defined by glibc: _sys_errlist is an array of pointers to strings corresponding to the error numbers in errno (thus, for example, _sys_errlist[EINVAL] points to the string Invalid argument), and _sys_nerr specifies the number of elements in _sys_errlist.
Listing 31-1: An implementation of strerror() that is not thread-safe
threads/strerror.c
[Figure: per-thread arrays of thread-specific data pointers for threads A, B, and C. Corresponding elements of each array correspond to one entry in pthread_keys (for example, pthread_keys[1]) and point to the TSD buffer for the associated function in thread A, thread B, and thread C, respectively.]
[Figure: the global pthread_keys array. Each element (pthread_keys[0], pthread_keys[1], pthread_keys[2], and so on) holds an "in use" flag and a destructor pointer.]
31.3.1 Thread-Specific Data from the Library Function's Perspective
In order to understand the use of the thread-specific data API, we need to consider things from the point of view of a library function that uses thread-specific data:
The function must allocate a separate block of storage for each thread that calls the function. This block needs to be allocated once, the first time the thread calls the function.
On each subsequent call from the same thread, the function needs to be able to obtain the address of the storage block that was allocated the first time this thread called the function. The function can't maintain a pointer to the block in an automatic variable, since automatic variables disappear when the func-
For several of the functions that have nonreentrant interfaces, SUSv3 specifies reentrant equivalents with names ending with the suffix _r. These functions require the caller to allocate a buffer whose address is then passed to the function and used
    for (j = 0; j < loops; j++) {
        loc = glob;
        loc++;
        glob = loc;
    }
If multiple threads invoke this function concurrently, the final value in glob is unpredictable. This function illustrates the typical reason that a function is not thread-safe: it employs global or static variables that are shared by all threads.
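One way of making such a function safe is to protect each access to the shared variable with a mutex. The sketch below applies this to the loop above; the names incr() and runTwoThreads() are assumptions made for this example, and error checking on the mutex calls is omitted for brevity.

```c
#include <pthread.h>

static int glob = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
incr(void *arg)
{
    int loops = *(int *) arg;
    int loc, j;

    for (j = 0; j < loops; j++) {
        pthread_mutex_lock(&mtx);       /* Only one thread at a time ... */
        loc = glob;
        loc++;
        glob = loc;
        pthread_mutex_unlock(&mtx);     /* ... updates 'glob' */
    }
    return NULL;
}

/* Run two incrementing threads; returns the final value of 'glob',
   or -1 on error */
int
runTwoThreads(int loops)
{
    pthread_t t1, t2;

    glob = 0;
    if (pthread_create(&t1, NULL, incr, &loops) != 0)
        return -1;
    if (pthread_create(&t2, NULL, incr, &loops) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    if (pthread_join(t1, NULL) != 0 || pthread_join(t2, NULL) != 0)
        return -1;
    return glob;
}
```

With the mutex in place, runTwoThreads(100000) yields 200000 on every run, at the cost of serializing the two threads while they execute the loop body.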
THREADS: THREAD SAFETY AND PER-THREAD STORAGE
This chapter extends the discussion of the POSIX threads API, providing a description of thread-safe functions and one-time initialization. We also discuss how to use thread-specific data or thread-local storage to make an existing function thread-safe without changing the function's interface.
Threads: Thread Synchronization
30-2.
and
Chapter 30
SUSv3 specifies that initializing an already initialized condition variable results in undefined behavior; we should not do this.
When an automatically or dynamically allocated condition variable is no longer required, then it should be destroyed using pthread_cond_destroy(). It is not necessary to call pthread_cond_destroy() on a condition variable that was statically initialized using PTHREAD_COND_INITIALIZER.
It is safe to destroy a condition variable only when no threads are waiting on it. If the condition variable resides in a region of dynamically allocated memory, then it should be destroyed before freeing that memory region. An automatically allocated condition variable should be dest
        while (numUnjoined == 0) {
            s = pthread_cond_wait(&threadDied, &threadMutex);
            if (s != 0)
                errExitEN(s, "pthread_cond_wait");
        }

        for (idx = 0; idx < totThreads; idx++) {
            if (thread[idx].state == TS_TERMINATED) {
                s = pthread_join(thread[idx].tid, NULL);
                if (s != 0)
                    errExitEN(s, "pthread_join");

                thread[idx].state = TS_JOINED;
                numLive--;
                numUnjoined--;

                printf("Reaped thread %d (numLive=%d)\n", idx, numLive);
            }
        }

        s = pthread_mutex_unlock(&threadMutex);
        if (s != 0)
            errExitEN(s, "pthread_mutex_unlock");
    }

    exit(EXIT_SUCCESS);
threads/thread_multijoin.c
30.2.5 Dynamically Allocated Condition Variables
The pthread_cond_init() function is used to dynamically initialize a condition variable. The circumstances in which we need to use pthread_cond_init() are analogous to those where pthread_mutex_init() is needed to dynamically initialize a mutex (Section 30.1.5); that is, we must use pthread_cond_init() to initialize automatically and dynamically allocated condition variables, and to initialize a statically allocated condition variable with attributes other than the defaults.
The cond argument identifies the condition variable to be initialized. As with mutexes, we can specify an attr argument that has been previously initialized to
sleep(thread[idx].sleepTime); /* Simulate doing some work */
printf("Thread %d terminating\n", idx);
s = pthread_mutex_lock(&threadMutex);
if (s != 0)
errExitEN(s, "pthread_mutex_lock");
numUnjoined++;
thread[idx].state = TS_TERMINATED;
s = pthread_mutex_unlock(&threadMutex);
if (s != 0)
errExitEN(s, "pthread_mutex_unlock");
s = pthread_cond_signal(&threadDied);
if (s != 0)
errExitEN(s, "pthread_cond_signal");
The following shell session log demonstrates the use of the program in Listing 30-4:
$ ./thread_multijoin 1 1 2 3 3          Create 5 threads
Thread 0 terminating
Thread 1 terminating
Reaped thread 0 (numLive=4)
Reaped thread 1 (numLive=3)
Thread 2 terminating
Reaped thread 2 (numLive=2)
Thread 3 terminating
Thread 4 terminating
Reaped thread 3 (numLive=1)
Reaped thread 4 (numLive=0)
Finally, note that although the threads in the example program are created as joinable and are immediately reaped on termination using pthread_join(), we don't need to use this approach in order to find out about thread termination. We could
are no guarantees about the state of the predicate; therefore, we should immediately recheck the predicate and resume sleeping if it is not in the desired state.
We can't make any assumptions about th
The pthread_cond_wait() function is designed to perform these steps because, normally, we access a shared variable in the following manner:
    s = pthread_mutex_lock(&mtx);
    if (s != 0)
        errExitEN(s, "pthread_mutex_lock");

    while (/* Check that shared variable is not in state we want */)
        pthread_cond_wait(&cond, &mtx);

    /* Now shared variable is in desired state; do some work */

    s = pthread_mutex_unlock(&mtx);
    if (s != 0)
        errExitEN(s, "pthread_mutex_unlock");
(We explain why the pthread_cond_wait() call is placed within a while loop rather than an if statement in the next section.)
In the above code, both accesses to the shared variable must be mutex-protected for the reasons that we explained earlier. In other words, there is a natural association of a mutex with a condition variable:
1. The thread locks the mutex in preparation for checking the state of the shared variable.
2. The state of the shared variable is checked.
3. If the shared variable is not in the desired state, then the thread must unlock the mutex (so that other threads can access the shared variable) before it goes to sleep on the condition variable.
4. When the thread is reawakened because the condition variable has been signaled, the mutex must once more be locked, since, typically, the thread then immediately accesses the shared variable.
The pthread_cond_wait() function automatically performs the mutex unlocking and locking required in the last two of these steps. In the third step, releasing the mutex and blocking on the condition variable are performed atomically. In other words, it is not possible for some other thread to acquire the mutex and signal the condition variable before the thread calling pthread_cond_wait() has blocked on the condition variable.
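The four steps above can be sketched as a complete, if minimal, program fragment in which one thread changes a shared flag and another waits for it. The names ready, setter(), and waitUntilReady() are hypothetical, and return values from the mutex calls are not checked, to keep the example short.

```c
#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;           /* Shared variable (the predicate) */

static void *
setter(void *arg)
{
    pthread_mutex_lock(&mtx);
    ready = 1;                  /* Change state while holding the mutex */
    pthread_mutex_unlock(&mtx);
    pthread_cond_signal(&cond); /* Wake the waiter, if it is waiting */
    return arg;
}

/* Returns the value of 'ready' seen by the waiter, or -1 on error */
int
waitUntilReady(void)
{
    pthread_t thr;
    int seen;

    ready = 0;
    if (pthread_create(&thr, NULL, setter, NULL) != 0)
        return -1;

    pthread_mutex_lock(&mtx);               /* Step 1 */
    while (ready == 0)                      /* Step 2 */
        pthread_cond_wait(&cond, &mtx);     /* Steps 3 and 4 */
    seen = ready;                           /* Mutex is held again here */
    pthread_mutex_unlock(&mtx);

    pthread_join(thr, NULL);
    return seen;
}
```

The sketch works whichever thread runs first: if setter() finishes before the waiter locks the mutex, the waiter finds ready already set and never calls pthread_cond_wait(), so the lost signal does no harm.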
The abstime argument is a timespec structure (Section 23.4.2) specifying an absolute time expressed as seconds and nanoseconds since the Epoch (Section 10.1). If the time interval specified by abstime expires without the condition variable being signaled, then pthread_cond_timedwait() returns the error ETIMEDOUT.
The difference between pthread_cond_signal() and pthread_cond_broadcast() lies in what happens if multiple threads are blocked in pthread_cond_wait(). With pthread_cond_signal(), we are simply guaranteed that at least one of the blocked threads is woken up; with pthread_cond_broadcast(), all blocked threads are woken up.
Using pthread_cond_broadcast() always yields correct results (since all threads should be programmed to handle redundant and spurious wake-ups), but pthread_cond_signal() can be more efficient. However, pthread_cond_signal() should be used only if just one of the waiting threads needs to be woken up to handle the change in state of the shared variable, and it doesn't matter which one of the waiting threads is woken up. This scenario typically applies when all of the waiting threads are designed to perform exactly the same task. Given these assumptions, pthread_cond_signal() can be more efficient than pthread_cond_broadcast(), because it avoids the following possibility:
1. All waiting threads are awoken.
2. One thread is scheduled first. This thread checks the state of the shared variable(s) (under protection of the associated mutex) and sees that there is work to be done. The thread performs the required work, changes the state of the shared variable(s) to indicate that the work has been done, and unlocks the associated mutex.
3. Each of the remaining threads in turn locks the mutex and tests the state of the shared variable. However, because of the change made by the first thread, these threads see that there is no work to be done, and so unlock the mutex and go back to sleep (i.e., call pthread_cond_wait() once more).
By contrast, pthread_cond_broadcast() handles the case where the waiting threads are designed to perform different tasks (in which case they probably have different predicates associated with the condition variable).
A condition variable holds no state information. It is simply a mechanism for communicating information about the application's state. If no thread is waiting on the condition variable at the time that it is signaled, then the signal is lost. A thread that later waits on the condition variable will unblock only when the variable is signaled once more.
The pthread_cond_timedwait() function is the same as pthread_cond_wait(), except that the abstime argument specifies an upper limit on the time that the thread will sleep while waiting for the condition variable to be signaled.
#include <pthread.h>
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
    while (avail > 0) {             /* Consume all available units */
30.2 Signaling Changes of State: Condition Variables
A mutex prevents multiple threads from accessing a shared variable at the same time. A condition variable allows one thread to inform other threads about changes in the state of a shared variable (or other shared resource) and allows the other threads to wait (block) for such notification.
A simple example that doesn't use condition variables serves to demonstrate why they are useful. Suppose that we have a number of threads that produce some "result units" that are consumed by the main thread, and that we use a mutex-protected variable, avail, to represent the number of produced units awaiting consumption:
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static int avail = 0;
The code segments shown in this section can be found in the file threads/prod_no_condvar.c in the source code distribution for this book.
In the producer threads, we would have code such as the following:
    /* Code to produce a unit omitted */
    s = pthread_mutex_lock(&mtx);
    if (s != 0)
        errExitEN(s, "pthread_mutex_lock");
Precisely what happens in each of these cases depends on the type of the mutex. SUSv3 defines the following mutex types:
PTHREAD_MUTEX_NORMAL
    pthread_mutex_t mtx;
    pthread_mutexattr_t mtxAttr;
    int s, type;

    s = pthread_mutexattr_init(&mtxAttr);
    if (s != 0)
        errExitEN(s, "pthread_mutexattr_init");
SUSv3 specifies that initializing an already initialized mutex results in undefined behavior; we should not do this.
Among the cases where we must use pthread_mutex_init() rather than a static initializer are the following:
The mutex was dynamically allocated on the heap. For example, suppose that we create a dynamically allocated linked list of structures, and each structure in the list includes a pthread_mutex_t field that holds a mutex that is used to protect access to that structure.
The mutex is an automatic variable allocated on the stack.
We want to initialize a statically allocated mutex with attributes other than the defaults.
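The first of these cases might look like the following sketch, in which each heap-allocated list element carries its own mutex. The structure layout and the function names nodeCreate(), nodeAdd(), and nodeDestroy() are hypothetical. Note that the mutex is destroyed before the memory that contains it is freed.

```c
#include <pthread.h>
#include <stdlib.h>

struct node {                   /* Hypothetical list element */
    pthread_mutex_t mtx;        /* Protects 'value' */
    int value;
    struct node *next;
};

/* Allocate a node and initialize its mutex with default attributes */
struct node *
nodeCreate(int value)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return NULL;
    if (pthread_mutex_init(&n->mtx, NULL) != 0) {
        free(n);
        return NULL;
    }
    n->value = value;
    n->next = NULL;
    return n;
}

/* Add 'delta' to the node's value under the mutex; returns new value */
int
nodeAdd(struct node *n, int delta)
{
    int result;

    pthread_mutex_lock(&n->mtx);
    n->value += delta;
    result = n->value;
    pthread_mutex_unlock(&n->mtx);
    return result;
}

/* Caller must ensure that the mutex is unlocked and that no thread
   will use the node again. Returns result of pthread_mutex_destroy(). */
int
nodeDestroy(struct node *n)
{
    int s = pthread_mutex_destroy(&n->mtx);

    free(n);
    return s;
}
```

A static initializer can't be used here, since the storage for the mutex doesn't exist until malloc() returns at run time.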
When an automatically or dynamically allocated mutex is no longer required, it should be destroyed using pthread_mutex_destroy(). (It is not necessary to call pthread_mutex_destroy() on a mutex that was statically initialized using PTHREAD_MUTEX_INITIALIZER.)
It is safe to destroy a mutex only when it is unlocked, and no thread will subsequently try to lock it. If the mutex resides in a region of dynamically allocated memory, then it should be destroyed before freeing that memory region. An automatically allocated mutex should be de
30.1.4 Mutex Deadlocks
Thread A                               Thread B
1. pthread_mutex_lock(mutex1);         1. pthread_mutex_lock(mutex2);
2. pthread_mutex_lock(mutex2);         2. pthread_mutex_lock(mutex1);
   blocks                                 blocks
#include <pthread.h>
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
Returns 0 on success, or a positive error number on error
The pthread_mutex_trylock() and pthread_mutex_timedlock() functions are much less frequently used than pthread_mutex_lock(). In most well-designed applications, a thread should hold a mutex for only a short time, so that other threads are not prevented from executing in parallel. This guarantees that other threads that are blocked on the mutex will soon be granted a lock on the mutex. A thread that uses pthread_mutex_trylock() to periodically poll the mutex to see if it can be locked risks being starved of access to the mutex while other queued threads are successively granted access to the mutex via pthread_mutex_lock().
30.1.3 Performance of Mutexes
What is the cost of using a mutex? We have shown two different versions of a program
that increments a shared variable: one without mutexes (Listing 30-1) and
one with mutexes (Listing 30-2). When we run these two programs on an x86-32
system running Linux 2.6.31 (with NPTL), we find that the version without
mutexes requires a total of 0.35 seconds to execute 10 million loops in each thread
(and produces the wrong result), while the version with mutexes requires 3.1 seconds.
At first, this seems expensive. But, consider the main loop executed by the version
that does not employ a mutex (Listing 30-1). In that version, the thread
function executes a for loop that increments a loop variable, compares that
variable against another variable, performs two assignments and another increment
operation, and then branches back to the top of the loop. The version that
uses a mutex (Listing 30-2) performs the same steps, and locks and unlocks the
mutex each time around the loop. In other words, the cost of locking and unlocking
a mutex is somewhat less than ten times the cost of the operations that we listed for
the first program. This is relatively cheap. Furthermore, in the typical case, a thread
would spend much more time doing other work, and perform relatively fewer
mutex lock and unlock operations, so that the performance impact of using a mutex
is not significant in most applications.
To put this further in perspective, running some simple test programs on the
same system showed that 20 million loops locking and unlocking a file region using
fcntl() (Section 55.3) require 44 seconds, and 20 million loops incrementing and
decrementing a System V semaphore (Chapter 47) require 28 seconds. The problem
with file locks and semaphores is that they always require a system call for the lock
and unlock operations, and each system call has a small, but appreciable, cost
(Section 3.1). By contrast, mutexes are implemented using atomic machine-language
operations (performed on memory locations visible to all threads) and require system
calls only in case of lock contention.
On Linux, mutexes are implemented using futexes (an acronym derived from
fast user space mutexes), and lock contentions are dealt with using the futex() system
call. We don't describe futexes in this book (they are not intended for direct
        loc = glob;
        loc++;
        glob = loc;

        s = pthread_mutex_unlock(&mtx);
        if (s != 0)
            errExitEN(s, "pthread_mutex_unlock");
    }
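Since only the tail of the mutex-protected loop survives above, the following self-contained sketch shows the whole pattern: lock, update the shared variable, unlock. It is not the book's Listing 30-2; the names counter, counterMtx, incrLocked() and runTwoThreads() are hypothetical, and error handling is reduced to return codes instead of the errExitEN() helper.

```c
#include <pthread.h>

static int counter = 0;         /* Shared variable (hypothetical name) */
static pthread_mutex_t counterMtx = PTHREAD_MUTEX_INITIALIZER;

/* Thread start function: increment 'counter' *arg times, holding the
   mutex around each read-modify-write sequence. */
static void *
incrLocked(void *arg)
{
    int loops = *(int *) arg;

    for (int j = 0; j < loops; j++) {
        if (pthread_mutex_lock(&counterMtx) != 0)
            return (void *) 1;
        counter++;                      /* The critical section */
        if (pthread_mutex_unlock(&counterMtx) != 0)
            return (void *) 1;
    }
    return NULL;
}

/* Run two threads that each perform 'loops' increments. Returns the
   final value of 'counter', or -1 if a Pthreads call failed. */
static int
runTwoThreads(int loops)
{
    pthread_t t1, t2;

    counter = 0;
    if (pthread_create(&t1, NULL, incrLocked, &loops) != 0)
        return -1;
    if (pthread_create(&t2, NULL, incrLocked, &loops) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    if (pthread_join(t1, NULL) != 0 || pthread_join(t2, NULL) != 0)
        return -1;
    return counter;
}
```

Because every increment happens while the mutex is held, runTwoThreads(n) should yield exactly 2 * n, regardless of how the kernel schedules the two threads.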
To lock a mutex, we specify the mutex in a call to pthread_mutex_lock(). If the mutex
is currently unlocked, this call locks th
If multiple threads try to execute this block of code (a critical section), the fact that
only one thread can hold the mutex (the others remain blocked) means that only
one thread at a time can enter the block, as illustrated in Figure 30-2.
Figure 30-2: Using a mutex to protect a critical section
Finally, note that mutex locking is advisory, rather than mandatory. By this, we
mean that a thread is free to ignore the use of a mutex and simply access the
corresponding shared variable(s). In order to safely handle shared variables, all threads
must cooperate in their use of a mutex, abiding by the locking rules it enforces.
30.1.1 Statically Allocated Mutexes
A mutex can either be allocated as a static variable or be created dynamically at run
time (for example, in a block of memory allocated via malloc()). Dynamic mutex
creation is somewhat more complex, and we delay discussion of it until Section 30.1.5.
A mutex is a variable of the type pthread_mutex_t. Before it can be used, a
mutex must always be initialized. For a statically allocated mutex, we can do this by
assigning it the value PTHREAD_MUTEX_INITIALIZER, as in the following example:
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
According to SUSv3, applying the operations that we describe in the remainder
of this section to a copy of a mutex yields results that are undefined. Mutex
operations should always be performed only on the original mutex that has
been statically initialized using PTHREAD_MUTEX_INITIALIZER or dynamically
initialized using pthread_mutex_init() (described in Section 30.1.5).
30.1.2 Locking and Unlocking a Mutex
After initialization, a mutex is unlocked. To lock and unlock a mutex, we use the
pthread_mutex_lock() and pthread_mutex_unlock() functions.
[Figure 30-2 diagram. Thread A: lock mutex; access shared resource; unlock mutex.
Thread B: lock mutex (blocks until Thread A unlocks, then unblocks, lock granted);
access shared resource; unlock mutex.]
4. Thread 1 receives another time slice and resumes execution where it left off.
Having previously (step 1) copied the value of glob (2000) into its loc, it now
increments loc and assigns the result (2001) to glob. At this point, the effect of
the increment operations performed by thread 2 is lost.
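The interleaving in the steps above can be replayed deterministically in a single thread, which makes the lost update easy to see. This is only an illustration of the sequence of assignments, not a threaded program; the function name replayLostUpdate() and the number of increments attributed to thread 2 are assumptions chosen for the example.

```c
/* Replay the problematic schedule without real threads (hypothetical
   helper). 'start' is the value of glob when thread 1 is interrupted;
   'thread2Incrs' is how many increments thread 2 completes before
   thread 1 runs again. Returns the final value of glob. */
static int
replayLostUpdate(int start, int thread2Incrs)
{
    int glob = start;

    int loc1 = glob;            /* Thread 1: loc = glob; time slice expires */

    for (int j = 0; j < thread2Incrs; j++) {
        int loc2 = glob;        /* Thread 2 runs its loop undisturbed */
        loc2++;
        glob = loc2;
    }

    loc1++;                     /* Thread 1 resumes with its stale copy */
    glob = loc1;                /* Overwrites all of thread 2's work */

    return glob;
}
```

With start = 2000, the result is 2001 no matter how many increments thread 2 performed in between.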
If we run the program in Listing 30-1 multiple times with the same command-line
argument, we see that the printed value of glob fluctuates wildly:
./thread_incr 10000000
glob = 10880429
./thread_incr 10000000
glob = 13493953
This nondeterministic behavior is a consequence of the vagaries of the kernel's
CPU scheduling decisions. In complex pr
Figure 30-1: Two threads incrementing a global variable without synchronization
When we run the program in Listing 30-1 specifying that each thread should
increment the variable 1000 times, all seems well:
./thread_incr 1000
glob = 2000
However, what has probably happened here
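[Figure 30-1 diagram: time lines for the two threads, each repeatedly running its loop,
annotated with the current value of glob at the points where a time slice expires.
Key: executing; waiting for CPU.]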
loc, incrementing loc, and copying loc back to glob. (Since loc is an automatic variable
allocated on the per-thread stack, each thread has its own copy of this variable.)
The number of iterations of the loop is determined by the command-line argument
supplied to the program, or by a default value, if no argument is supplied.
Listing 30-1: Incorrectly incrementing a global variable from two threads
threads/thread_incr.c
#include <pthread.h>
#include "tlpi_hdr.h"

static int glob = 0;

static void *                   /* Loop 'arg' times incrementing 'glob' */
threadFunc(void *arg)
{
    int loops = *((int *) arg);
    int loc, j;

    for (j = 0; j < loops; j++) {
        loc = glob;
        loc++;
        glob = loc;
    }

    return NULL;
}
THREADS: THREAD SYNCHRONIZATION
In this chapter, we describe two tools that threads can use to synchronize their
actions: mutexes and condition variables. Mutexes allow threads to synchronize
their use of a shared resource, so that, for example, one thread doesn't try to access
a shared variable at the same time as another thread is modifying it. Condition
variables perform a complementary task: they allow threads to inform each other that a
shared variable (or other shared resource) has changed state.
30.1 Protecting Accesses to Shared Variables: Mutexes
One of the principal advantages of threads is that they can share information via
global variables. However, this easy sharing comes at a cost: we must take care that
multiple threads do not attempt to modify the same variable at the same time, or
that one thread doesn't try to read the value of a variable while another thread is
modifying it. The term critical section is used to refer to a section of code that
accesses a shared resource and whose execution should be atomic; that is, its
execution should not be interrupted by another thread that simultaneously accesses the
same shared resource.
Listing 30-1 provides a simple example of the kind of problems that can occur
when shared resources are not accessed atomically. This program creates two
threads, each of which executes the same function. The function executes a loop that
repeatedly increments a global variable, glob, by copying glob into the local variable
Chapter 29
number of other attributes, including process ID, open file descriptors, signal
dispositions, current working directory, and resource limits.
The key difference between threads and processes is the easier sharing of
information that threads provide, and this is the main reason that some application
Threads: Introduction
29.9 Threads Versus Processes
In this section, we briefly consider some of the factors that might influence our
Once a thread has been detached, it is no longer possible to use pthread_join() to
int
main(int argc, char *argv[])
{
    pthread_t t1;
    void *res;
    int s;

    s = pthread_create(&t1, NULL, threadFunc, "Hello world\n");
    if (s != 0)
        errExitEN(s, "pthread_create");

    printf("Message from main()\n");
    s = pthread_join(t1, &res);
    if (s != 0)
        errExitEN(s, "pthread_join");
If a thread is not detached (see Section 29.7), then we must join with it using
pthread_join(). If we fail to do this, then, when the thread terminates, it produces
the thread equivalent of a zombie process (Section 26.2). Aside from wasting system
resources, if enough thread zombies accumulate, we won't be able to create
additional threads.
The task that pthread_join() performs for threads is similar to that performed
by waitpid() for processes. However, there are some notable differences:
Threads are peers. Any thread in a process can use pthread_join() to join with
any other thread in the process. For example, if thread A creates thread B,
which creates thread C, then it is possible for thread A to join with thread C, or
vice versa. This differs from the hierarchical relationship between processes.
When a parent process creates a child using fork(), it is the only process that
can wait() on that child. There is no such
try to join with a thread ID that had
already been joined. In other words, a join with any thread operation is
incompatible with modular program design.
Example program
The program in Listing 29-1 creates another thread and then joins with it.
Listing 29-1: A simple program using Pthreads
threads/simple_thread.c
#include <pthread.h>
#include "tlpi_hdr.h"

static void *
threadFunc(void *arg)
{
    char *s = (char *) arg;

    printf("%s", s);
The pthread_equal() function is needed because the pthread_t data type must be
treated as opaque data. On Linux, pthread_t happens to be defined as an unsigned
long, but on other implementations, it could be a pointer or a structure.
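As a small illustration of treating pthread_t as opaque, the sketch below compares thread IDs only through pthread_equal(), never with the == operator. The names mainThread, recordMainThread() and isMainThread() are hypothetical and exist only for this example.

```c
#include <pthread.h>

static pthread_t mainThread;    /* Hypothetical: ID saved once at startup */

/* Remember the ID of the calling thread. */
static void
recordMainThread(void)
{
    mainThread = pthread_self();
}

/* Return 1 if the caller is the recorded thread, 0 otherwise. The
   comparison goes through pthread_equal(), so it remains valid even on
   an implementation where pthread_t is a structure. */
static int
isMainThread(void)
{
    return pthread_equal(pthread_self(), mainThread) != 0;
}
```

A program would call recordMainThread() early in main() and could then use isMainThread() anywhere to check which thread is executing.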
In NPTL, pthread_t is actually a pointer that has been cast to unsigned long.
SUSv3 doesn't require pthread_t to be implemented as a scalar type; it could be a
structure. Therefore, we can't portably use code such as the following to display a
thread ID (though it does work on many implementations, including Linux, and is
is equivalent to performing a
pthrea
ications for the following reasons:
Various Pthreads functions use thread IDs to identify the thread on which they
are to act. Examples of such functions include
wrongly appear that the thread was canceled. In an application that employs
Compiling Pthreads programs
On Linux, programs that use the Pthreads API must be compiled with the
cc -pthread option. The effects of this option include the following:
The _REENTRANT preprocessor macro is defined. This causes the declarations of a
few reentrant functions to be exposed.
The program is linked with the libpthread library (the equivalent of -lpthread).
The precise options for compiling a multithreaded program vary across
implementations (and compilers). Some other implementations (e.g., Tru64) also
use cc -pthread; Solaris and HP-UX use cc -mt.
29.3 Thread Creation
When a program is started, the resulting process consists of a single thread, called
the initial or main thread. In this section, we look at how to create additional
threads.
The pthread_create() function creates a new thread.
The new thread commences execution by calling the function identified by
start with the argument arg (i.e., start(arg)). The thread that calls pthread_create()
continues execution with the next statement that follows the call. (This behavior is the
same as the glibc wrapper function for the clone() system call described in Section 28.2.)
The arg argument is declared as void *, meaning that we can pass a pointer to
any type of object to the start function. Typically, arg points to a global or heap
variable, but it can also be specified as NULL. If we need to pass multiple arguments to
start, then arg can be specified as a pointer to a structure containing the arguments
as separate fields. With judicious casting, we can even specify arg as an int.
Strictly speaking, the C standards don't define the results of casting int to
void * and vice versa. However, most C compilers permit these operations, and they
produce the desired result; that is, int j == (int) ((void *) j).
The return value of start is likewise of type void *, and it can be employed in the
same way as the arg argument. We'll see how this value is used when we describe
the pthread_join() function below.
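The following sketch shows one way to pass several arguments through arg and to hand a result back through the start function's return value. It is an illustration under stated assumptions, not code from the book: struct sumArgs, sumStart() and sumInThread() are hypothetical names, and the result is returned in heap memory so that it outlives the thread.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical argument block bundling two values for the new thread */
struct sumArgs {
    int a;
    int b;
};

/* Start function: unpack the structure and return a heap-allocated
   result (NULL if the allocation fails). */
static void *
sumStart(void *arg)
{
    struct sumArgs *sa = arg;
    long *result = malloc(sizeof(long));

    if (result == NULL)
        return NULL;
    *result = (long) sa->a + sa->b;
    return result;
}

/* Create a thread to add 'a' and 'b', join with it, and store the sum
   in '*sum'. Returns 0 on success, or -1 on error. */
static int
sumInThread(int a, int b, long *sum)
{
    struct sumArgs sa = { a, b };       /* Stays valid until we have joined */
    pthread_t thr;
    void *res;

    if (pthread_create(&thr, NULL, sumStart, &sa) != 0)
        return -1;
    if (pthread_join(thr, &res) != 0)
        return -1;
    if (res == NULL)
        return -1;

    *sum = *(long *) res;
    free(res);
    return 0;
}
```

Passing the address of the automatic variable sa is safe here only because sumInThread() does not return until it has joined with the thread.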
Caution is required when using a cast integer as the return value of a thread's
start function. The reason for this is that PTHREAD_CANCELED, the value returned
when a thread is canceled (see Chapter 32), is usually some implementation-defined
integer value cast to void *. If a thread's start
SUSv3 doesn't specify how these data types should be represented, and portable
programs should treat them as opaque data. By this, we mean that a program
should avoid any reliance on knowledge of the structure or contents of a variable of
one of these types. In particular, we can't compare variables of these types using
the C == operator.
In the traditional UNIX API, errno is a global integer variable. However, this
doesn't suffice for threaded programs. If a thread made a function call that
Among the attributes that are distinct for each thread are the following:
thread ID (Section 29.5);
signal mask;
thread-specific data (Section 31.3);
the errno variable;
floating-point environment (see fenv(3));
realtime scheduling policy and priority (Sections 35.2 and 35.3);
CPU affinity (Linux-specific, described in Section 35.4);
capabilities (Linux-specific, described in Chapter 39); and
stack (local variables and function call linkage information).
As can be seen from Figure 29-1, all of the per-thread stacks reside within the
same virtual address space. This means that, given a suitable pointer, it is possible
for threads to share data on each other's stacks. This is occasionally useful, but
it requires careful programming to handle the dependency that results from
the fact that a local variable remains va
makes it possible to serve multiple clients simultaneously. While this approach
works well for many scenarios, it does have the following limitations in some
applications:
It is difficult to share information between processes. Since the parent and
child don't share memory (other than the read-only text segment), we must use
some form of interprocess communication in order to exchange information
between processes.
Process creation with fork() is relatively expensive. Even with the copy-on-write
technique described in Section 24.2.2, the need to duplicate various process
attributes such as page tables and file descriptor tables means that a fork() call is
still time-consuming.
Threads address both of these problems:
We have simplified things somewhat in Figure 29-1. In particular, the location
of the per-thread stacks may be intermingled with shared libraries and shared
memory regions, depending on the order in which threads are created, shared
libraries loaded, and shared memory regions attached. Furthermore, the location
of the per-thread stacks can vary depending on the Linux distribution.
The threads in a process can execute concurrently. On a multiprocessor system,
multiple threads can execute in parallel. If one thread is blocked on I/O, other
threads are still eligible to execute. (Although it is sometimes useful to create a
separate thread purely for the purpose of performing I/O, it is often preferable to
employ one of the alternative I/O models that we describe in Chapter 63.)
Figure 29-1: Four threads executing in a process (Linux/x86-32)
Threads offer advantages over processes in certain applications. Consider the
traditional UNIX approach to achieving concurrency by creating multiple processes. An
[Figure 29-1 diagram: virtual memory layout of the process (addresses in hexadecimal,
increasing virtual addresses), showing argv and environ; the stack for the main thread;
the stacks for threads 1, 2, and 3; shared libraries and shared memory; uninitialized
data (bss); initialized data; and text (program code), with markers for where the main
thread and threads 1, 2, and 3 are executing.]
THREADS: INTRODUCTION
In this and the next few chapters, we describe POSIX threads, often known as
Pthreads. We won't attempt to cover the entire Pthreads API, since it is rather large.
Various sources of further information about threads are listed at the end of this
chapter.
These chapters mainly describe the standard behavior specified for the
Pthreads API. In Section 33.5, we discuss those points where the two main Linux
threading implementations, LinuxThreads and Native POSIX Threads Library
(NPTL), deviate from the standard.
In this chapter, we provide an overview of the operation of threads, and then
look at how threads are created and how they terminate. We conclude with a
discussion of some factors that may influence the choice of a multithreaded approach
versus a multiprocess approach when designing an application.
29.1 Overview
Like processes, threads are a mechanism that permits an application to perform
multiple tasks concurrently. A single process can contain multiple threads, as
illustrated in Figure 29-1. All of these threads are independently executing the same
program, and they all share the same global memory, including the initialized data,
uninitialized data, and heap segments. (A traditional UNIX process is simply a special
Chapter 28
28.5 Summary
When process accounting is enabled, the kernel writes an accounting record to a
file for each process that terminates on the system. This record contains statistics
on the resources used by the process.
Like fork(), the Linux-specific clone() system call creates a new process, but
allows finer control over which attributes are shared between the parent and child.
This system call is used primarily for implementing threading libraries.
We compared the speed of process creation using
fo
Although
is faster than
fork
Timers
Interval timers          No          Yes
(continued)
fork() or vfork() is followed by an exec(). This is illustrated by the final pair of
data rows in Table 28-3, where each child performs an exec(), rather than
immediately exiting. The program execed was the true command (/bin/true, chosen
because it produces no output). In this case, we see that the relative differences
between the times for fork() and vfork() are much lower.
In fact, the data shown in Table 28-3 doesn't reveal the full cost of an exec(),
because the child execs the same program in each loop of the test. As a result,
the cost of disk I/O to read the program into memory is essentially eliminated,
because the program will be read into the kernel buffer cache on the first
exec(), and then remain there. If each loop of the test execed a different
program (e.g., a differently named copy of the same program), then we would
observe a greater cost for an exec().
28.4 Effect of exec() and fork() on Process Attributes
A process has numerous attributes, some of which we have already described in
previous chapters, and others that we explore in later chapters. Regarding these
attributes, two questions arise:
What happens to these attributes when a process performs an exec()?
Which attributes are inherited by a child when a fork() is performed?
Table 28-4 summarizes the answers to these questions. The exec() column indicates
which attributes are preserved during an exec(). The fork() column indicates which
attributes are inherited (or in some cases, shared) by a child after fork(). Other than
the attributes indicated as being Linux-specific, all listed attributes appear in standard
UNIX implementations, and their handling during exec() and fork() conforms to the
requirements of SUSv3.
Table 28-4: Effect of exec() and fork() on process attributes
Process attribute        exec()      fork()   Interfaces affecting attribute; additional notes
Process address space
Text segment             No          Shared   Child process shares text segment with parent.
Stack segment            No          Yes      Function entry/exit; alloca(), longjmp(), siglongjmp().
Data and heap segments   No          Yes
Environment variables    See notes            putenv()
__WALL (since Linux 2.4)
Wait for all children, regardless of type (clone or non-clone).
__WNOTHREAD (since Linux 2.4)
By default, the wait calls wait not only for children of the calling process,
but also for children of any other processes in the same thread group as
the caller. Specifying the __WNOTHREAD flag limits the wait to children of the
calling process.
These flags can't be used with
28.3 Speed of Process Creation
Table 28-3 shows some speed comparisons fo
ogram that executed a loop that repeat-
edly created a child process and then waited for it to terminate. The table com-
Making the child's parent the same as the caller's: CLONE_PARENT
By default, when we create a new process with clone(), the parent of that process (as
mount namespace. Thereafter, changes to the namespace by one process are not
visible in the other process. (In earlier 2.4.x kernels, as well as in older kernels, we
can consider all processes on the system as sharing a single system-wide mount
namespace.)
Per-process mount namespaces can be used to create environments that are
similar to chroot() jails, but which are more secure and flexible; for example, a jailed
process can be provided with a mount point that is not visible to other processes on the
system. Mount namespaces are also useful in
of the new thread. Note that it isn't sufficient to obtain the thread ID of the new
thread via the return value of clone(), like so:
tid = clone(...);
The problem is that this code can lead to various race conditions, because the
assignment occurs only after
notification of the termination of a thread. Such notification is required by the
pthread_join() function, which is the POSIX threads mechanism by which one thread
can wait for the termination of another thread.
When a thread is created using pthread_create(), NPTL makes a clone() call in
which ptid and ctid point to the same location. (This is why
its own thread ID (this is the same valu
[Figure: a thread group. The process with PID 2001 contains several threads (Thread A,
Thread B, Thread C, ...), each with TGID=2001; the thread group leader is the thread
whose TID matches the TGID.]
Sharing file system-related information: CLONE_FS
If the CLONE_FS flag is specified, then the parent and the child share file
system-related information: umask, root directory, and current working directory. This
means that calls to umask(), chdir(), or chroot() in either process will affect the other
process. If the CLONE_FS
    /* If argc > 1, child shares file descriptor table with parent */

    flags = (argc > 1) ? CLONE_FILES : 0;

    /* Allocate stack for child */

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        errExit("malloc");
    stackTop = stack + STACK_SIZE;      /* Assume stack grows downward */

    /* Ignore CHILD_SIG, in case it is a signal whose default is to
       terminate the process; but don't ignore SIGCHLD (which is ignored
       by default), since that would prevent the creation of a zombie. */

    if (CHILD_SIG != 0 && CHILD_SIG != SIGCHLD)
        if (signal(CHILD_SIG, SIG_IGN) == SIG_ERR)
            errExit("signal");

    /* Create child; child commences execution in childFunc() */

    if (clone(childFunc, stackTop, flags | CHILD_SIG, (void *) &fd) == -1)
        errExit("clone");

    /* Parent falls through to here. Wait for child; __WCLONE is
       needed for child notifying with signal other than SIGCHLD. */

    if (waitpid(-1, NULL, (CHILD_SIG != SIGCHLD) ? __WCLONE : 0) == -1)
        errExit("waitpid");
    printf("child has terminated\n");

    /* Did close of file descriptor in child affect parent? */

    s = write(fd, "x", 1);
    if (s == -1 && errno == EBADF)
        printf("file descriptor %d has been closed\n", fd);
    else if (s == -1)
        printf("write() on file descriptor %d failed "
                "unexpectedly (%s)\n", fd, strerror(errno));
    else
        printf("write() on file descriptor %d succeeded\n", fd);

    exit(EXIT_SUCCESS);
procexec/t_clone.c
When we run the program in Listing 28-3 without a command-line argument, we
see the following:
./t_clone                                 (doesn't use CLONE_FILES)
child has terminated
write() on file descriptor 3 succeeded    (child's close() did not affect parent)
With fork() and vfork(), we have no way to select the termination signal; it is
SIGCHLD.
The remaining bytes of the flags argument hold a bit mask that controls the
operation of clone(). We summarize these bit-mask values in Table 28-2, and describe
them in more detail in Section 28.2.1.
The remaining arguments to clone() are ptid, tls, and ctid. These arguments relate to
the implementation of threads, in particular the use of thread IDs and thread-local
storage. We cover the use of these arguments when describing the flags bit-mask
values in Section 28.2.1. (In Linux 2.4 and earlier, these three arguments are not
provided by clone(). They were specifically added in Linux 2.6 to support the NPTL
POSIX threads implementation.)
Example program
Listing 28-3 shows a simple example of the use of clone() to create a child process.
The main program does the following:
Open a file descriptor (for /dev/null) that will be closed by the child
When using the Version 3 option, the only difference in the operation of process
accounting is in the format of records written to the accounting file. The new
format is defined as follows:
struct acct_v3 {
    char      ac_flag;          /* Accounting flags */
    char      ac_version;       /* Accounting version (3) */
    u_int16_t ac_tty;           /* Controlling terminal for process */
    u_int32_t ac_exitcode;      /* Process termination status */
    u_int32_t ac_uid;           /* 32-bit user ID of process */
    u_int32_t ac_gid;           /* 32-bit group ID of process */
    u_int32_t ac_pid;           /* Process ID */
    u_int32_t ac_ppid;          /* Parent process ID */
    u_int32_t ac_btime;         /* Start time (time_t) */

        loc = localtime(&t);
        if (loc == NULL) {
            printf("???Unknown time??? ");
        } else {
            strftime(timeBuf, TIME_BUF_SIZE, "%Y-%m-%d %T ", loc);
            printf("%s ", timeBuf);
        }

        printf("%5.2f %7.2f ", (double) (comptToLL(ac.ac_utime) +
                    comptToLL(ac.ac_stime)) / sysconf(_SC_CLK_TCK),
In the output, we see one line for each process that was created in the shell session.
and echo commands are shell built-in commands, so they don't result in
the creation of new processes. Note that the entry for sleep appeared in the
accounting file after the cat entry because the sleep command terminated after the
cat command.
Most of the output is self-explanatory. The flags column shows single letters
indicating which of the ac_flag bits is set in each record (see Table 28-1). Section 26.1.3
describes how to interpret the termination status values shown in the term. status
column.
Listing 28-2: Displaying data from a process accounting file
procexec/acct_view.c
#include <fcntl.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/acct.h>
#include <limits.h>
#include "ugid_functions.h"             /* Declaration of userNameFromId() */
#include "tlpi_hdr.h"

#define TIME_BUF_SIZE 100

static long long                /* Convert comp_t value into long long */
comptToLL(comp_t ct)
{
    const int EXP_SIZE = 3;             /* 3-bit, base-8 exponent */
    const int MANTISSA_SIZE = 13;       /* Followed by 13-bit mantissa */
    const int MANTISSA_MASK = (1 << MANTISSA_SIZE) - 1;
    long long mantissa, exp;

    mantissa = ct & MANTISSA_MASK;
    exp = (ct >> MANTISSA_SIZE) & ((1 << EXP_SIZE) - 1);
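The conversion function is cut off above, so here is a self-contained sketch of the same decoding idea. It is not the book's comptToLL(): the name decodeComp() is hypothetical, an unsigned short stands in for comp_t, and the final step (multiplying the mantissa by 8 raised to the exponent) is implemented as a left shift by three bits per exponent unit.

```c
/* Decode a 16-bit compressed value (hypothetical stand-in for comp_t):
   the top 3 bits hold a base-8 exponent and the low 13 bits hold the
   mantissa. The decoded value is mantissa * 8^exp. */
static long long
decodeComp(unsigned short ct)
{
    const int EXP_SIZE = 3;             /* 3-bit, base-8 exponent */
    const int MANTISSA_SIZE = 13;       /* Followed by 13-bit mantissa */
    const int MANTISSA_MASK = (1 << MANTISSA_SIZE) - 1;
    long long mantissa, exp;

    mantissa = ct & MANTISSA_MASK;
    exp = (ct >> MANTISSA_SIZE) & ((1 << EXP_SIZE) - 1);

    return mantissa << (exp * 3);       /* 8^exp == 2^(3 * exp) */
}
```

For instance, a mantissa of 125 with an exponent of 1 decodes to 1000, and the largest encodable value, (2^13 - 1) * 8^7, exceeds the range of a 32-bit integer, which is why the result type is long long.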
The next two commands run programs that we presented in previous chapters
(Listing 27-1, on page 566, and Listing 24-1, on page 517). The first command runs
a program that execs the file /bin/echo; this results in an accounting record with the
command name echo. The second command creates a child process that doesn't
./t_execve /bin/echo
hello world goodbye
./t_fork
PID=18350 (child) idata=333 istack=666
PID=18349 (parent) idata=111 istack=222
Finally, we use the program in Listing 28-2 to
view the contents of the accounting file:
./acct_view pacct
command flags term. user start time CPU elapsed
status time time
acct_on -S-- 0 root 2010-07-23 17:19:05 0.00 0.00
bash ---- 0 root 2010-07-23 17:18:55 0.02 21.10
su -S-- 0 root 2010-07-23 17:18:51 0.01 24.94
cat --XC 0x83 mtk 2010-07-23 17:19:55 0.00 1.72
sleep ---- 0 mtk 2010-07-23 17:19:42 0.00 15.01
grep ---- 0x200 mtk 2010-07-23 17:20:12 0.00 0.00
echo ---- 0 mtk 2010-07-23 17:21:15 0.01 0.01
t_fork F--- 0 mtk 2010-07-23 17:21:36 0.00 0.00
t_fork ---- 0 mtk 2010-07-23 17:21:36 0.00 3.01
The comp_t type is a kind of floating-point number. Values of this type are
sometimes called compressed clock ticks. The floating-point value consists of a 3-bit,
base-8 exponent, followed by a 13-bit mantissa; the exponent can represent a
factor in the range 8^0 = 1 to 8^7 (2,097,152). For example, a mantissa of 125 and an
exponent of 1 represent the value 1000. Listing 28-2 defines a function
(comptToLL()) to convert this type to long long. We need to use the type long long
because the 32 bits used to represent an unsigned long on x86-32 are insufficient
to hold the largest value that can be represented in comp_t, which is (2^13 - 1) * 8^7.
The three time fields defined with the type comp_t represent time in system
clock ticks. Therefore, we must divide these times by the value returned by
sysconf(_SC_CLK_TCK) in order to convert them to seconds.
If the system crashes, no accounting record is written for any processes that
are still executing.
Since writing records to the accounting file can rapidly consume disk space, Linux
provides the /proc/sys/kernel/acct virtual file for controlling the operation of
process accounting. This file contains three numbers, defining (in order) the parameters
high-water, low-water, and frequency. Typical defaults for th
comp_t ac_utime; /* User CPU time (clock ticks) */
comp_t ac_stime; /* System CPU time (clock ticks) */
In kernels before 2.6.10, a separate process accounting record was written
for each thread created using the NPTL threading implementation. Since
kernel 2.6.10, a single accounting record is written for the entire process when
the last thread terminates. Under the older LinuxThreads threading
implementation, a single process accounting record is always written for each thread.
Historically, the primary use of process accounting was to charge users for
consumption of system resources on multiuser UNIX systems. However, process
accounting can also be useful for obtaining information about a process that was
not otherwise monitored and reported on by its parent.
Although available on most UNIX implementations, process accounting is not
specified in SUSv3. The format of the accounting records, as well as the location of
the accounting file, vary somewhat across implementations.
PROCESS CREATION AND PROGRAM EXECUTION IN MORE DETAIL
Chapter 27
27-5. When we run the following program, we find it produces no output. Why is this?
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    printf("Hello world");
    execlp("sleep", "sleep", "0", (char *) NULL);
}
27-6. Suppose that a parent process has established a handler for SIGCHLD and also
blocked this signal. Subsequently, one of its children exits, and the parent then
does a wait() to collect the child's status. What happens when the parent unblocks
SIGCHLD? Write a program to verify your answer. What is the relevance of the result
for a program calling the system() function?
27.9 Exercises
27-1. The final command in the following shell session uses the program in Listing 27-3
to exec the program xyz. What happens?
echo $PATH
/usr/local/bin:/usr/bin:/bin:./dir1:./dir2
ls -l dir1
total 8
-rw-r--r-- 1 mtk users 7860 Jun 13 11:55 xyz
ls -l dir2
total 28
-rwxr-xr-x 1 mtk users 27452 Jun 13 11:55 xyz
./t_execlp xyz
27-2. Use execve() to implement execlp(). You will need to use the stdarg(3) API to handle
the variable-length argument list supplied to execlp(). You will also need to use
functions in the malloc package to allocate space for the argument and environment
vectors. Finally, note that an easy way of ch
status of the child in this case. (Ignoring SIGCHLD causes the status of a child process
to be immediately discarded, as described in Section 26.3.3.)
On some UNIX implementations, system() handles the case that it is called with the
disposition of SIGCHLD
_CS_PATH configuration variable. This value is a PATH-style list of directories
containing the standard system utilities. We can assign this list to PATH, and then use
execlp() to exec the standard shell as follows:
    char path[PATH_MAX];

    if (confstr(_CS_PATH, path, PATH_MAX) == 0)
        _exit(127);
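The fragment above shows only the confstr() call. As a complement, the sketch below retrieves the _CS_PATH value into a buffer of exactly the required size, using the documented behavior that confstr() returns the needed length when called with a zero-length buffer. The function name getStandardPath() is hypothetical, and sizing the buffer dynamically is a variation on the fixed PATH_MAX buffer used in the text.

```c
#include <stdlib.h>
#include <unistd.h>

/* Return a malloc'd copy of the _CS_PATH configuration string, or NULL
   on failure (hypothetical helper). The caller must free() the result. */
static char *
getStandardPath(void)
{
    size_t len;
    char *buf;

    len = confstr(_CS_PATH, NULL, 0);   /* Required size, including '\0' */
    if (len == 0)
        return NULL;

    buf = malloc(len);
    if (buf == NULL)
        return NULL;

    if (confstr(_CS_PATH, buf, len) == 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The returned string could then be assigned to PATH (for example, with setenv()) before calling execlp() to exec the shell, as the text describes.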
sigemptyset(&blockMask); /* Block SIGCHLD */
An improved system() implementation
Listing 27-9 shows an implementation of system() conforming to the rules described
above. Note the following points about this implementation:
As noted earlier, if command is a NULL pointer, then system() should return nonzero
if a shell is available or 0 if no shell is available. The only way to reliably
child. However, both the calling program and the sleep process would, by default, be killed by these signals.
How should the calling process and the executed command respond to these signals? SUSv3 specifies the following:
SIGINT and SIGQUIT should be ignored in the calling process while the command is being executed.
In the child, SIGINT and SIGQUIT should be treated as they would be if the calling process did a fork() and exec(); that is, the disposition of handled signals is reset to the default, and the disposition of other signals remains unchanged.
Figure 27-2: Arrangement of processes during execution of system("sleep 20")
Dealing with signals in the manner specified by SUSv3 is the most reasonable approach, for the following reasons:
It would not make sense to have both processes responding to these signals, since this could lead to confusing behaviors for the user of the application.
Similarly, it would not make sense to ignore these signals in the process executing the command while treating them according to their default dispositions in the calling process. This would allow the user to do things such as killing the calling process while the executed command was left running. It is also inconsistent with the fact that the calling process has actually given up control (i.e., is blocked in a waitpid() call) while the command passed to system() is being executed.
The command executed by system() may be an interactive application, and it makes sense to have this application respond to terminal-generated signals.
SUSv3 requires the treatment of SIGINT and SIGQUIT described above, but notes that this could have an undesirable effect in a program that invisibly uses system() to perform some task. While the command is being executed, typing Control-C or Control-\ will kill only the child of system(), while the application (unexpectedly, to the user) continues to run. A program that uses system() in this way should check the termina-
[Figure 27-2 labels: the foreground process group contains the calling process, the child shell created by system(), and the child process created by the shell (which executes the command given to system()).]
    case 0: /* Child */
        execl("/bin/sh", "sh", "-c", command, (char *) NULL);
        _exit(127);                     /* Failed exec */

    default: /* Parent */
        if (waitpid(childPid, &status, 0) == -1)
            return -1;
        else
            return status;
    }
procexec/simple_system.c
Treating signals correctly inside system()
What adds complexity to the implementation of system() is the correct treatment of signals.
The first signal to consider is SIGCHLD. Suppose that the program calling system() is also directly creating children, and has established a handler for SIGCHLD that performs its own wait(). In this situation, when a SIGCHLD signal is generated by the termination of the child created by system(), it is possible that the signal handler of the main program will be invoked (and collect the child's status) before system() has a chance to call waitpid(). (This is an example of a race condition.) This has two undesirable consequences:
The calling program would be deceived into thinking that one of the children that it created has terminated.
The system() function would be unable to obtain the termination status of the child that it created.
Therefore, system() must block delivery of SIGCHLD while it is executing.
The other signals to consider are those generated by the terminal interrupt (usually Control-C) and quit (usually Control-\) characters, SIGINT and SIGQUIT, respectively. Consider what is happening when we execute the following call:
system("sleep 20");
At this point, three processes are running: the process executing the calling program, a shell, and sleep, as shown in Figure 27-2.
As an efficiency measure, when the string given to the -c option is a simple command (as opposed to a pipeline or a sequence), some shells (including bash) directly exec the command, rather than forking a child shell. For shells that perform such an optimization, Figure 27-2 is not strictly accurate, since there will be only two processes (the calling process and sleep). Nevertheless, the arguments in this section about how system() should handle signals still apply.
All of the processes shown in Figure 27-2 form part of the foreground process
only to the words produced by shell expansions. In addition, modern shells reset IFS (to a string consisting of the three characters space, tab, and newline) on shell startup to ensure that scripts behave consistently if they inherit a strange IFS value.
As a further security measure, bash reverts to the real user (group) ID when
Listing 27-7: Executing shell commands with system()
procexec/t_system.c
#include <sys/wait.h>
#include "print_wait_status.h"
#include "tlpi_hdr.h"

#define MAX_CMD_LEN 200

int
main(int argc, char *argv[])
{
    char str[MAX_CMD_LEN];      /* Command to be executed by system() */
be available if the program called
before calling
. If
command
is
specification of signals, which doesn't specify signal blocking; therefore, C programs written on non-UNIX systems won't know to unblock signals.) For this reason, SUSv3 recommends that signals should not be blocked or ignored across an exec() of an arbitrary program. Here, "arbitrary" means a program that we did not write. It is acceptable to block or ignore signals when execing a program we have written or one with known behavior with respect to signals.
27.6 Executing a Shell Command: system()
The system() function allows the calling program to execute an arbitrary shell command. In this section, we describe the operation of system(), and in the next section we show how system() can be implemented using fork(), exec(), and wait().
In Section 44.5, we look at the popen() and pclose() functions, which can also be used to execute a shell command, but allow the calling program to either read the output of the command or to send input to the command.
The system() function creates a child process that invokes a shell to execute command. Here is an example of a call to system():
system("ls | wc");
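The value returned by system() is a wait status, so it is usually decoded with the macros from <sys/wait.h>. The wrapper below is a small sketch of that step; the name shellExitStatus() is hypothetical, and treating every abnormal outcome as -1 is a simplification chosen for this example.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical wrapper around system(): returns the exit status (0-255)
   of the shell that ran 'command', or -1 if system() itself failed or the
   shell did not terminate normally (e.g., it was killed by a signal). */
int
shellExitStatus(const char *command)
{
    int status = system(command);

    if (status == -1)
        return -1;                  /* Child could not be created or waited for */
    if (!WIFEXITED(status))
        return -1;                  /* Abnormal termination */
    return WEXITSTATUS(status);     /* 127 suggests the shell could not be execed */
}
```

For instance, shellExitStatus("ls | wc") yields the exit status of the pipeline as reported by the shell.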
The principal advantages of system() are simplicity and convenience:
Listing 27-6: Setting the close-on-exec flag for a file descriptor
procexec/closeonexec.c
#include <fcntl.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int flags;

    if (argc > 1) {
If the exec() call fails for some reason, we may want to keep the file descriptors open. If they are already closed, it may be difficult, or impossible, to reopen them so that they refer to the same files.
For these reasons, the kernel provides a close-on-exec flag for each file descriptor.
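Setting the flag is a read-modify-write of the descriptor flags using fcntl(). The following is a minimal sketch; the helper name setCloseOnExec() is an assumption made for this example and is not part of any library.

```c
#include <fcntl.h>

/* Hypothetical helper: turn on the close-on-exec flag for 'fd' while
   preserving any other file descriptor flags. Returns 0 on success,
   or -1 on error (with errno set by fcntl()). */
int
setCloseOnExec(int fd)
{
    int flags;

    flags = fcntl(fd, F_GETFD);             /* Fetch current flags */
    if (flags == -1)
        return -1;

    flags |= FD_CLOEXEC;                    /* Turn on FD_CLOEXEC */

    if (fcntl(fd, F_SETFD, flags) == -1)    /* Update flags */
        return -1;
    return 0;
}
```

After a successful call, the descriptor remains usable in the calling program and is closed automatically only when an exec() succeeds.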
within the shell. Some frequently used commands (such as pwd and echo) are sufficiently simple that it is a worthwhile efficiency to implement them inside the shell. Other commands are implemented within the shell so that they have side effects on the shell itself; that is, they change information stored by the shell, or modify attributes of or affect the execution of the shell process. For example, the cd command must change the working directory of the shell itself, and so can't be executed within a separate process. Other examples of commands that are built in for their side effects include
exec() call results in the following argument list being used:
    /usr/bin/awk -f longest_line.awk input.txt
This successfully invokes awk using the script longest_line.awk to process the file input.txt.
Execution of scripts by execlp() and execvp()
Normally, the absence of a #! line at the start of a script causes the exec() functions to fail. However, execlp() and execvp() do things somewhat differently. Recall that these are the functions that use the PATH environment variable to obtain a list of directories in which to search for a file to be executed. If either of these functions finds a file that has execute permission turned on, but is not a binary executable and does not start with a #! line, then they exec the sh
The limit placed on the length of the #! line varies across UNIX implementations. For example, the limit is 64 characters in OpenBSD 3.1 and 1024 characters on Tru64 5.1. On some historical implementations (e.g., SunOS 4), this limit was as low as 32 characters.
security reasons, it is sometimes preferable to ensure that a program is execed with a known environment list. We consider this point further in Section 38.8.
Listing 27-5 demonstrates that the new program inherits its environment from the caller during an execl() call. This program first uses putenv() to make a change to the environment that it inherits from the shell as a result of fork(). Then the printenv program is execed to display the values of the USER and SHELL environment variables. When we run this program, we see the following:
$ echo $USER $SHELL             Display some of the shell's environment variables
blv /bin/bash
$ ./t_execl
Initial value of USER: blv      Copy of environment was inherited from the shell
britta                          These two lines are displayed by execed printenv
/bin/bash
Listing 27-5: Passing the caller's environment to the new program using execl()
procexec/t_execl.c
#include <stdlib.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
Listing 27-3: Using execlp() to search for a filename in PATH
procexec/t_execlp.c
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s pathname\n", argv[0]);

    execlp(argv[1], argv[1], "hello world", (char *) NULL);
    errExit("execlp");          /* If we get here, something went wrong */
}
procexec/t_execlp.c
27.2.2 Specifying Program Arguments as a List
When we know the number of arguments for an exec() at the time we write a program, we can use execle(), execlp(), or execl() to specify the arguments as a list within the function call. This can be convenient, since it requires less code than assembling the arguments in an argv vector. The program in Listing 27-4 achieves the same result as the program in Listing 27-1 but using execle() instead of execve().
Listing 27-4: Using execle() to specify program arguments as a list
procexec/t_execle.c
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
If the PATH variable is not defined, then execvp() and execlp() assume a default path list of .:/usr/bin:/bin. As a security measure, the superuser account (root
The execle() and execve() functions allow the programmer to explicitly specify the environment for the new program using envp, a NULL-terminated array of pointers to character strings. The name
27.2 The exec() Library Functions
The library functions described in this section provide alternative APIs for performing an exec(). All of these functions are layered on top of execve(), and they differ from one another and from execve() only in the way in which the program name, argument list, and environment of the new program are specified.
The final letters in the names of these functions provide a clue to the differences between them. These differences are summari
Listing 27-1: Using execve() to execute a new program
procexec/t_execve.c
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    char *argVec[10];           /* Larger than required */
ENOENT
The file referred to by pathname doesn't exist.
ENOEXEC
The file referred to by pathname is marked as being executable, but it is not in a recognizable executable format. Possibly, it is a script that doesn't begin with a line (starting with the characters #!
The pathname argument contains the pathname of the new program to be loaded into the process's memory. This pathname can be absolute (indicated by an initial /) or relative to the current working directory of the calling process.
The argv argument specifies the command-line arguments to be passed to the new program. This array corresponds to, and has the same form as, the second (argv) argument to a C main() function; it is a NULL-terminated list of pointers to character strings. The value supplied for argv[0] corresponds to the command name. Typically, this value is the same as the basename (i.e., the final component) of pathname.
The final argument, envp, specifies the environment list for the new program. The envp argument corresponds to the environ array of the new program; it is a NULL-terminated list of pointers to character strings of the form name=value (Section 6.7).
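The three arguments can be seen together in a short sketch. It assumes that /bin/sh exists on the system, and the function name runScriptWithEnv() and the variable GREET are hypothetical names invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that uses execve() to run "/bin/sh -c script" with an
   environment list supplied by us (just GREET=hello). Returns the exit
   status of the shell, or -1 on error or abnormal termination. */
int
runScriptWithEnv(const char *script)
{
    char *argVec[] = { "sh", "-c", (char *) script, NULL };  /* argv[0] is the basename */
    char *envVec[] = { "GREET=hello", NULL };                /* name=value strings */
    int status;
    pid_t childPid;

    childPid = fork();
    if (childPid == -1)
        return -1;

    if (childPid == 0) {                /* Child */
        execve("/bin/sh", argVec, envVec);
        _exit(127);                     /* Reached only if execve() failed */
    }

    if (waitpid(childPid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A script such as test "$GREET" = hello exits with status 0 here, because the new program sees the environment list passed in envp.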
The /proc/PID/exe file is a symbolic link containing the absolute pathname of the executable file being run by the corresponding process.
After an execve(), the process ID of the process remains the same, because the same process continues to exist. A few other process attributes also remain unchanged, as described in Section 28.4.
PROGRAM EXECUTION
This chapter follows from our discussion of process creation and termination in the previous chapters. We now look at how a process can use the execve() system call to replace the program that it is running by a completely new program. We then show how to implement the system() function, which allows its caller to execute an arbitrary shell command.
27.1 Executing a New Program: execve()
The execve() system call loads a new program into a process's memory. During this operation, the old program is discarded, and the process's stack, data, and heap are replaced by those of the new program. After executing various C library run-time startup code and program initialization code (e.g., C++ static constructors or C functions declared with the constructor attribute described in Section 42.4), the new program commences execution at its main() function.
The most frequent use of execve() is in the child produced by a fork(), although it is also occasionally used in applications without a preceding fork().
Various library functions, all with names beginning with exec, are layered on top of the execve() system call. Each of these functions provides a different interface to the same functionality. The loading of a new program by any of these calls is commonly referred to as an exec operation, or simply by the notation exec(). We begin with a description of execve() and then describe the library functions.
562
Chapter 26
26.5Exercises
26-1.
Write a program to verify that when a childs parent terminates, a call to
Monitoring Child Processes
561
The System V SIGCLD signal
On Linux, the name SIGCLD is provided as a synonym for the SIGCHLD signal. The reason for the existence of both names is historical. The SIGCHLD signal originated on BSD, and this name was adopted by POSIX, which largely standardized on the BSD signal model. System V provided the corresponding SIGCLD signal, with slightly different semantics.
The key difference between BSD
and System V
lies in what happens
when the disposition of
stopped children. This status indicates
560
Chapter 26
The SIG_IGN semantics for SIGCHLD have a long history, deriving from System V. SUSv3 specifies the behavior described here, but these semantics were left unspecified in the original POSIX.1 standard. Thus, on some older UNIX implementations, ignoring SIGCHLD
Monitoring Child Processes
559
    /* Parent comes here: wait for SIGCHLD until all children are dead */

    sigemptyset(&emptyMask);
    while (numLiveChildren > 0) {
        if (sigsuspend(&emptyMask) == -1 && errno != EINTR)
            errExit("sigsuspend");
        sigCnt++;
    }

    printf("%s All %d children have terminated; SIGCHLD was caught "
            "%d times\n", currTime("%T"), argc - 1, sigCnt);

    exit(EXIT_SUCCESS);
}
procexec/multi_SIGCHLD.c
26.3.2 Delivery of SIGCHLD
Just as waitpid() can be used to monitor stopped children, so is it possible for a parent process to receive the SIGCHLD signal when one of its children is stopped by a signal. This behavior is controlled by the SA_NOCLDSTOP flag when using sigaction() to establish a handler for the SIGCHLD signal. If this flag is omitted, a SIGCHLD signal is delivered to the parent when one of its children stops; if the flag is present, SIGCHLD is not delivered for stopped children. (The implementation of signal() given in Section 22.7 doesn't specify SA_NOCLDSTOP.)
Since SIGCHLD is ignored by default, the SA_NOCLDSTOP flag has a meaning only if we are establishing a handler for SIGCHLD. Furthermore, SIGCHLD is the only signal for which the SA_NOCLDSTOP flag has an effect.
SUSv3 also allows for a parent to be sent a SIGCHLD signal if one of its stopped children is resumed by being sent a SIGCONT signal. (This corresponds to the WCONTINUED flag for waitpid().) This feature is implemented in Linux since kernel 2.6.9.
26.3.3 Ignoring Dead Child Processes
26.3.3Ignoring Dead Child Processes
558
Chapter 26
    sleep(5);           /* Artificially lengthen execution of handler */
    printf("%s handler: returning\n", currTime("%T"));

    errno = savedErrno;
}

int
main(int argc, char *argv[])
{
    int j, sigCnt;
Monitoring Child Processes
557
16:45:19 Child 2 (PID=17768) exiting            These children terminate during
16:45:21 Child 3 (PID=17769) exiting            first invocation of handler
16:45:23 handler: returning
16:45:23 handler: Caught SIGCHLD                Second invocation of handler
16:45:23 handler: Reaped child 17768 - child exited, status=0
16:45:23 handler: Reaped child 17769 - child exited, status=0
16:45:28 handler: returning
16:45:28 All 3 children have terminated; SIGCHLD was caught 2 times
Note the use of sigprocmask() to block the SIGCHLD signal before any children are created in Listing 26-5. This is done to ensure correct operation of the sigsuspend() loop in the parent. If we failed to block SIGCHLD in this way, and a child terminated
556
Chapter 26
SIGCHLD handler is executing for an already terminated child, then, although SIGCHLD is generated twice, it is queued only once to the parent. As a result, if the parent's SIGCHLD handler called wait() only once each time it was invoked, the handler might fail to reap some zombie children.
The solution is to loop inside the SIGCHLD handler, repeatedly calling waitpid() with the WNOHANG flag until there are no more dead children to be reaped. Often, the body of a SIGCHLD handler simply consists of the following code, which reaps any dead children without checking their status:
    while (waitpid(-1, NULL, WNOHANG) > 0)
        continue;
The above loop continues until waitpid() returns either 0, indicating no more zombie children, or -1, indicating an error (probably ECHILD, meaning that there are no more children).
Design issues for SIGCHLD handlers
Suppose that, at the time we establish a handler for SIGCHLD, there is already a terminated child for this process. Does the kernel then immediately generate a SIGCHLD signal for the parent? SUSv3 leaves this point unspecified. Some System V-derived implementations do generate a SIGCHLD in these circumstances; other implementations, including Linux, do not. A portable application can make this difference invisible by establishing the SIGCHLD handler before creating any children. (This is usually the natural way of doing things, of course.)
A further point to consider is the issue of reentrancy. In Section 21.1.2, we noted that using a system call (e.g., waitpid()) from within a signal handler may change the value of the global variable errno. Such a change could interfere with
Monitoring Child Processes
555
    default: /* Parent */
        sleep(3);               /* Give child a chance to start and exit */
        snprintf(cmd, CMD_SIZE, "ps | grep %s", basename(argv[0]));
        cmd[CMD_SIZE - 1] = '\0';       /* Ensure string is null-terminated */
        system(cmd);            /* View zombie child */

        /* Now send the "sure kill" signal to the zombie */

        if (kill(childPid, SIGKILL) == -1)
            errMsg("kill");
        sleep(3);               /* Give child a chance to react to signal */
        printf("After sending SIGKILL to zombie (PID=%ld):\n", (long) childPid);
        system(cmd);            /* View zombie child again */

        exit(EXIT_SUCCESS);
    }
procexec/make_zombie.c
26.3 The SIGCHLD Signal
The termination of a child process is an event that occurs asynchronously. A parent can't predict when one of its children will terminate. (Even if the parent sends a SIGKILL signal to the child, the exact time of termination is still dependent on when the child is next scheduled for use of a CPU.) We have already seen that the parent should use wait() (or similar) in order to prevent the accumulation of zombie children, and have looked at two ways in which this can be done:
The parent can call wait(), or waitpid() without specifying the WNOHANG flag, in which case the call will block if a child has not already terminated.
The parent can periodically perform a nonblocking check (a poll) for dead children via a call to waitpid() specifying the WNOHANG flag.
Both of these approaches can be inconvenient. On the one hand, we may not want the parent to be blocked waiting for a child to terminate. On the other hand, making repeated nonblocking waitpid() calls wastes CPU time and adds complexity to an
554
Chapter 26
wait() calls in order to ensure that dead children are always removed from the system, rather than becoming long-lived zombies. The parent may perform such wait() calls either synchronously, or asynchronously, in response to delivery of the SIGCHLD signal, as described in Section 26.3.1.
Listing 26-4 demonstrates the creation of a zombie and that a zombie can't be killed by SIGKILL. When we run this program, we see the following output:
$ ./make_zombie
Parent PID=1013
Child (PID=1014) exiting
 1013 pts/4    00:00:00 make_zombie             Output from ps(1)
 1014 pts/4    00:00:00 make_zombie <defunct>
After sending SIGKILL to make_zombie (PID=1014):
 1013 pts/4    00:00:00 make_zombie             Output from ps(1)
 1014 pts/4    00:00:00 make_zombie <defunct>
In the above output, we see that ps(1) displays the string <defunct> to indicate a process in the zombie state.
The program in Listing 26-4 uses the system() function to execute the shell command given in its character-string argument. We describe
Monitoring Child Processes
553
UNIX implementations. Neither is standa
rdized in SUSv3. (SUSv2 did specify
wait
We usually avoid the use of
wa
in this book. Typically, we dont
552
Chapter 26
value in si_pid is 0 or nonzero. Unfortunately, this behavior is not required by SUSv3, and some UNIX implementations leave the siginfo_t structure unchanged in this case. (A future corrigendum to SUSv4 is likely to add a requirement that si_pid and si_signo are zeroed in this case.) The only portable way to distinguish these two cases is to zero out the siginfo_t structure before calling waitid(), as in the following code:
siginfo_t info;
...
Monitoring Child Processes
551
The following additional flags may be ORed in
options
WNOHANG
This flag has the same meaning as for
. If none of the children
matching the specification in
550
Chapter 26
same signal once more, which this time will terminate the process. The signal
handler would contain code
such as the following:
void
handler(int sig)
{
    /* Perform cleanup steps */

    signal(sig, SIG_DFL);           /* Disestablish handler */
    raise(sig);                     /* Raise signal again */
}
26.1.5The
System Call
waitpi
waitid
Monitoring Child Processes
549
    if (argc > 1 && strcmp(argv[1], "--help") == 0)
        usageErr("%s [exit-status]\n", argv[0]);

    switch (fork()) {
    case -1: errExit("fork");

    case 0:             /* Child: either exits immediately with given
                           status or loops waiting for signals */
548
Chapter 26
$ kill -STOP 15871
$ waitpid() returned: PID=15871; status=0x137f (19,127)
child stopped by signal 19 (Stopped (signal))
$ kill -CONT 15871
$ waitpid() returned: PID=15871; status=0xffff (255,255)
child continued
The last two lines of output will appear only on Linux 2.6.10 and later, since earlier kernels don't support the waitpid() WCONTINUED option. (This shell session is made slightly hard to read by the fact that output from the program executing in the background is in some cases intermingled with the prompt produced by the shell.)
We continue the shell session by sending a SIGABRT signal to terminate the child:
$ kill -ABRT 15871
$ waitpid() returned: PID=15871; status=0x0006 (0,6)
child killed by signal 6 (Aborted)
Press Enter, in order to see shell notification that background job has terminated
[1]+ Done ./child_status
$ ls -l core
ls: core: No such file or directory
$ ulimit -c                                     RLIMIT_CORE
0
Although the default action of SIGABRT is to produce a core dump file and terminate the process, no core file was produced. This is because core dumps were disabled: the RLIMIT_CORE soft resource limit (Section 36.3), which specifies the maximum size of a core file, was set to 0, as shown by the ulimit command above.
We repeat the same experiment, but this time enabling core dumps before sending SIGABRT to the child:
$ ulimit -c unlimited                           Allow core dumps
$ ./child_status &
[1] 15902
$ Child started with PID = 15903
kill -ABRT 15903                                Send SIGABRT to child
$ waitpid() returned: PID=15903; status=0x0086 (0,134)
child killed by signal 6 (Aborted) (core dumped)
Press Enter, in order to see shell notification that background job has terminated
[1]+ Done ./child_status
$ ls -l core
Monitoring Child Processes
547
    if (msg != NULL)
        printf("%s", msg);

    if (WIFEXITED(status)) {
        printf("child exited, status=%d\n", WEXITSTATUS(status));

    } else if (WIFSIGNALED(status)) {
        printf("child killed by signal %d (%s)",
                WTERMSIG(status), strsignal(WTERMSIG(status)));
#ifdef WCOREDUMP        /* Not in SUSv3, may be absent on some systems */
        if (WCOREDUMP(status))
            printf(" (core dumped)");
#endif
        printf("\n");

    } else if (WIFSTOPPED(status)) {
        printf("child stopped by signal %d (%s)\n",
                WSTOPSIG(status), strsignal(WSTOPSIG(status)));

#ifdef WIFCONTINUED     /* SUSv3 has this, but older Linux versions and
                           some other UNIX implementations don't */
    } else if (WIFCONTINUED(status)) {
        printf("child continued\n");
#endif

    } else {            /* Should never happen */
        printf("what happened to this child? (status=%x)\n",
                (unsigned int) status);
    }
}
procexec/print_wait_status.c
The printWaitStatus() function is used in Listing 26-3. This program creates a child process that either loops continuously calling pause() (during which time signals can be sent to the child) or, if an integer command-line argument was supplied, exits immediately using this integer as the exit status. In the meantime, the parent monitors the child via waitpid()
546
Chapter 26
#include <sys/wait.h>
Monitoring Child Processes
545
In its rationale for waitpid(), SUSv3 notes that the name WUNTRACED is a historical artifact of this flag's origin in BSD, where a process could be stopped in one of two ways: as a consequence of being traced by the ptrace() system call, or by being stopped by a signal (i.e., not being traced). When a child is being traced by ptrace(), then delivery of any signal (other than SIGKILL) causes the child to be stopped, and a SIGCHLD signal is consequently sent to the parent. This behavior occurs even if the child is ignoring the signal. However, if the child is blocking the signal, then it is not stopped (unless the signal is SIGSTOP, which can't be blocked).
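26.1.3 The Wait Status Value
[Figure residue: layout of the wait status value (bits 15 to 8 and 7 to 0) for four cases: normal termination (exit status 0-255), killed by signal (termination signal != 0, core dumped flag), stopped by signal (stop signal), and continued by signal.]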
544
Chapter 26
26.1.2 The waitpid() System Call
The wait() system call has a number of limitations, which waitpid() was designed to address:
If a parent process has
created multiple children
, it is not possible to
for the
dication of this fact.
Using wait(), we can find out only about children that have terminated. It is not possible to be notified when a child is stopped by a signal (such as SIGSTOP or SIGTTIN) or when a stopped child is resumed by delivery of a SIGCONT signal.
The return value and status arguments of waitpid() are the same as for wait(). (See Section 26.1.3 for an explanat
543
Listing 26-1:
Creating and waiting for multiple children

procexec/multi_wait.c
#include <sys/wait.h>
#include <time.h>
#include "curr_time.h"              /* Declaration of currTime() */
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int numDead;            /* Number of children so far waited for */
    pid_t childPid;         /* PID of waited for child */
    int j;

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s sleep-time...\n", argv[0]);
542
Chapter 26
system call does the following:
MONITORING CHILD PROCESSES
In many application designs, a parent process needs to know when one of its child processes changes state: when the child terminates or is stopped by a signal. This chapter describes two techniques used to monitor child processes: the wait() system call (and its variants) and the use of the SIGCHLD signal.
26.1 Waiting on a Child Process
In many applications where a parent creates child processes, it is useful for the parent to be able to monitor the children to find out when and how they terminate. This facility is provided by wait() and a number of related system calls.
26.1.1 The wait() System Call
The wait() system call waits for one of the children of the calling process to terminate and returns the termination status of that child in the buffer pointed to by status.
Process Termination
539
Normal termination is accomplished by calling _exit() or, more usually, exit(), which is layered on top of _exit(). Both _exit() and exit() take an integer argument whose least significant 8 bits define the termination status of the process. By convention, a status of 0 is used to indicate successful termination, and a nonzero status indicates unsuccessful termination.
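The way a termination status reaches the parent, including the truncation to 8 bits, can be observed with a minimal sketch. The function name statusSeenByParent() is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates with _exit(childStatus) and return the
   exit status that the parent obtains via waitpid(), or -1 on error. */
int
statusSeenByParent(int childStatus)
{
    int status;
    pid_t childPid = fork();

    if (childPid == -1)
        return -1;
    if (childPid == 0)
        _exit(childStatus);         /* Child terminates immediately */

    if (waitpid(childPid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);     /* Only the low 8 bits survive */
}
```

With this helper, a child status of 257 is seen by the parent as 1, and a status of -1 is seen as 255.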
As part of both normal and abnormal process termination, the kernel performs various cleanup steps. Terminating a process normally by calling exit() additionally causes exit handlers registered using atexit() and on_exit() to be called (in reverse order of registration), and causes stdio buffers to be flushed.
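The reverse-order rule for exit handlers can be checked with a short sketch. The handlers report through a pipe so that the parent can observe the order; the names handlerA(), handlerB(), and exitHandlerOrder() are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int reportFd = -1;           /* Write end of pipe, used by the handlers */

static void handlerA(void) { if (write(reportFd, "A", 1) != 1) _exit(1); }
static void handlerB(void) { if (write(reportFd, "B", 1) != 1) _exit(1); }

/* Run a child that registers handlerA() and then handlerB() before calling
   exit(). The order in which the handlers ran is stored as a string in
   'order' (at least 3 bytes). Returns 0 on success, or -1 on error. */
int
exitHandlerOrder(char *order)
{
    int pfd[2];
    ssize_t numRead;
    pid_t childPid;

    if (pipe(pfd) == -1)
        return -1;
    fflush(NULL);                   /* Don't duplicate buffered stdio output */

    switch (childPid = fork()) {
    case -1:
        close(pfd[0]);
        close(pfd[1]);
        return -1;

    case 0:                         /* Child */
        close(pfd[0]);
        reportFd = pfd[1];
        if (atexit(handlerA) != 0 || atexit(handlerB) != 0)
            _exit(1);
        exit(EXIT_SUCCESS);         /* Runs handlerB(), then handlerA() */

    default:                        /* Parent */
        close(pfd[1]);
        waitpid(childPid, NULL, 0);
        numRead = read(pfd[0], order, 2);
        close(pfd[0]);
        if (numRead != 2)
            return -1;
        order[2] = '\0';
        return 0;
    }
}
```

The string stored in order is "BA": the handler registered last is called first.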
Further information
Refer to the sources of further information listed in Section 24.6.
25.6Exercise
25-1.
If a child process makes the call
exit(1)
538
Chapter 25
To understand why the message written with printf() appears twice, recall that the stdio buffers are maintained in a process's user-space memory (refer to Section 13.2). Therefore, these buffers are duplicated in the child by fork(). When standard output is directed to a terminal, it is line-buffered by default, with the result that the newline-terminated string written by printf() appears immediately. However, when standard output is directed to a file, it is block-buffered by default. Thus, in our example, the string written by printf() is still in the parent's stdio buffer at the time of the fork(), and this string is duplicated in the child. When the parent and the child later call exit(), they both flush their copies of the stdio buffers, resulting in duplicate output.
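The duplication, and the effect of flushing before the fork(), can be reproduced with a sketch that uses a temporary file in place of redirected standard output. The function name bytesWrittenAfterFork() is hypothetical, and the child uses fclose() followed by _exit() (rather than exit()) so that only this one stream is flushed.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a 6-byte line to a block-buffered temporary file, fork(), and let
   both processes flush their copy of the stdio buffer. Returns the final
   size of the file in bytes, or -1 on error. */
long
bytesWrittenAfterFork(int flushBeforeFork)
{
    FILE *fp = tmpfile();           /* Regular file, so block-buffered */
    struct stat sb;
    pid_t childPid;
    long size = -1;

    if (fp == NULL)
        return -1;

    fputs("Hello\n", fp);           /* Stays in the user-space stdio buffer */
    if (flushBeforeFork)
        fflush(fp);                 /* Buffer is empty when fork() copies it */

    switch (childPid = fork()) {
    case -1:
        break;

    case 0:                         /* Child: flush its copy of the buffer */
        fclose(fp);
        _exit(0);

    default:                        /* Parent: flush after the child is done */
        if (waitpid(childPid, NULL, 0) != -1 && fflush(fp) == 0 &&
                fstat(fileno(fp), &sb) == 0)
            size = (long) sb.st_size;
        break;
    }

    fclose(fp);
    return size;
}
```

Without the flush, the file ends up holding the line twice (12 bytes); with the flush before fork(), it holds the line once (6 bytes).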
We can prevent this duplicated output from occurring in one of the following ways. To deal with the stdio buffering issue, we can use fflush() to flush the stdio buffer prior to a fork() call. Alternatively, we could use
Process Termination
537
    if (on_exit(onexitFunc, (void *) 10) != 0)
        fatal("on_exit 1");
    if (atexit(atexitFunc1) != 0)
        fatal("atexit 1");
    if (atexit(atexitFunc2) != 0)
        fatal("atexit 2");
    if (on_exit(onexitFunc, (void *) 20) != 0)
        fatal("on_exit 2");

    exit(2);
}
procexec/exit_handlers.c
536
Chapter 25
When called,
is passed two arguments: the
argument supplied to
exit
and a copy of the
arg
argument supplied to
at the time the function was
registered. Although defined as a pointer type,
is open to programmer-defined
Process Termination
535
some systems, this causes all of the exit handlers to once more be invoked, which can result in an infinite recursion (until a stack overflow kills the process). Portable applications should avoid calling exit() inside an exit handler.
SUSv3 requires that an implementation allow a process to be able to register at least 32 exit handlers. Using the call sysconf(_SC_ATEXIT_MAX)
534
Chapter 25
An exit handler is a programmer-supplied function that is registered at some point during the life of the process and is then automatically called during normal process termination via exit(). Exit handlers are not called if a program calls _exit() directly or if the process is terminated abnormally by a signal.
To some extent, the fact that exit handlers are not called when a process is terminated by a signal limits their utility. The best we can do is to establish handlers for the signals that might be sent to th
533
The C99 standard requires that falling off the end of the main program should be equivalent to calling exit(0). This is the behavior we obtain on Linux if we compile a program using gcc -std=c99.
532
Chapter 25
unsuccessfully. There are no fixed rules ab
out how nonzero status values are to be
PROCESS TERMINATION
This chapter describes what happens wh
en a process terminates. We begin by
describing the use of
exit
and
to terminate a process. We then discuss the use
of exit handlers to automatically pe
rform cleanups when a process calls
conclude by considering some interactions between
fork
buffers, and
25.1Terminating a Process:
exit
A process may terminate in two general ways. One of these is abnormal termination,
caused by the delivery of a signal whose default action is to terminate the process
(with or without a core dump), as described in Section 20.1. Alternatively, a process can
terminate normally, using the _exit() system call.
The status argument given to _exit() defines the termination status of the process,
which is available to the parent of this process when it calls wait(). Although
defined as an int, only the bottom 8 bits of status are actually made available to the
parent. By convention, a termination status
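The point that only the bottom 8 bits of the status reach the parent is easy to check. The following is a minimal sketch under our own naming (statusSeenByParent() is hypothetical, not a function from the book): the child terminates with _exit(status), and the parent reports what waitpid() hands back.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates with _exit(status), and return the exit
   status that the parent obtains via waitpid() (or -1 on error) */
int
statusSeenByParent(int status)
{
    int wstatus;
    pid_t childPid;

    childPid = fork();
    if (childPid == -1)
        return -1;

    if (childPid == 0)
        _exit(status);              /* Child terminates immediately */

    if (waitpid(childPid, &wstatus, 0) == -1)
        return -1;

    /* WEXITSTATUS() yields just the low 8 bits of the child's status */
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

For example, a child that calls _exit(257) is seen by the parent as having exited with status 1, since 257 & 0xFF is 1. This is one reason for keeping exit statuses in the range 0 to 255.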
Chapter 24
Further information
[Bach, 1986] and [Goodheart & Cox, 1994] provide details of the implementation
exit
Process Creation
printf("[%s %ld] Child started - doing some work\n",
assume that the state of the signal mask in the child is irrelevant; if necessary, we
can unblock SIGUSR1 in the child after the fork().
The following shell session log shows what happens when we run the program
in Listing 24-6:
./fork_sig_sync
[17:59:02 5173] Child started - doing some work
[17:59:02 5172] Parent about to wait for signal
[17:59:04 5173] Child about to signal parent
[17:59:04 5172] Parent got signal
Listing 24-6: Using signals to synchronize process actions
procexec/fork_sig_sync.c
#include <signal.h>
#include "curr_time.h" /* Declaration of currTime() */
#include "tlpi_hdr.h"
#define SYNC_SIG SIGUSR1 /* Synchronization signal */
To see the argument for the "children first after fork()" behavior, consider
what happens with copy-on-write semantics when the child of a fork() performs
an immediate exec(). In this case, as the parent carries on after the fork() to
modify data and stack pages, the kernel duplicates the to-be-modified pages for
the child. Since the child performs an exec() as soon as it is scheduled to run,
Listing 24-5: Parent and child race to write a message after fork()
procexec/fork_whos_on_first.c
#include <sys/wait.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int numChildren, j;
    pid_t childPid;

    if (argc > 1 && strcmp(argv[1], "--help") == 0)
        usageErr("%s [num-children]\n", argv[0]);
from 2.2.19. Although this change was later dropped from the 2.4 kernel series,
it was subsequently adopted in Linux 2.6. Thus, programs that assume the 2.2.19
behavior would be broken by the 2.6 kernel.
Some more recent experime
nts reversed the kernel
sessment of
Where it is used, vfork() should generally be immediately followed by a call to
exec(). If the exec() call fails, the child process should terminate using _exit(). (The
child of a vfork() should not terminate by calling exit(), since that would cause the
parent's stdio buffers to be flushed and closed. We go into more detail on this point
in Section 25.4.)
Other uses of vfork() (in particular, those relying on its unusual semantics for
memory sharing and process scheduling) are likely to render a program nonportable,
especially to implementations where vfork() is implemented simply as a call to
fork().
24.4 Race Conditions After fork()
After a fork(), it is indeterminate which process (the parent or the child) next has
access to the CPU. (On a multiprocessor system, they may both simultaneously
get access to a CPU.) Applications that implicitly or explicitly rely on a particular
sequence of execution in order to achieve correct results are open to failure due to
race conditions, which we described in Section 5.1. Such bugs can be hard to find, as
their occurrence depends on scheduling decisions that the kernel makes according
to system load.
We can use the program in Listing 24-5
Listing 24-4 shows the use of vfork(), demonstrating both of the semantic features
that distinguish it from fork(): the child shares the parent's memory, and the parent
is suspended until the child terminates or calls exec(). When we run this program,
we see the following output:
./t_vfork
Child executing
Even though child slept, parent was not scheduled
Parent executing
istack=666
From the last line of output, we can see that the change made by the child to the
variable istack was performed on the parent's variable.
Listing 24-4: Using vfork()
procexec/t_vfork.c
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
{
    int istack = 222;

    switch (vfork()) {
    case -1:
        errExit("vfork");

    case 0:             /* Child executes first, in parent's memory space */
        sleep(3);                   /* Even if we sleep for a while,
                                       parent still is not scheduled */
        write(STDOUT_FILENO, "Child executing\n", 16);
        istack *= 3;                /* This change will be seen by parent */
        _exit(EXIT_SUCCESS);

    default:            /* Parent is blocked until child exits */
        write(STDOUT_FILENO, "Parent executing\n", 17);
        printf("istack=%d\n", istack);
        exit(EXIT_SUCCESS);
    }
}
procexec/t_vfork.c
Except where speed is absolutely critical, new programs should avoid the use of
vfork() in favor of fork(). This is because, when fork() is implemented using
copy-on-write semantics (as is done on most modern UNIX implementations), it
approaches the speed of vfork(), and we avoid the eccentric behaviors associated with
vfork() described above. (We show some speed comparisons between fork() and
vfork() in Section 28.3.)
SUSv3 marks vfork() as obsolete, and SUSv4 goes further, removing the
specification of vfork(). SUSv3 leaves many details of the operation of vfork()
unspecified, allowing the possibility that it is implemented as a call to fork(). When
implemented in this manner, the BSD semantics for vfork() are not preserved. Some
UNIX systems do indeed implement vfork() as a call to fork(), and Linux also did
this in kernel 2.0
Modern UNIX implementations employing copy-on-write for implementing fork() are
much more efficient than older fork() implementations, thus largely eliminating the
need for vfork(). Nevertheless, Linux (like many other UNIX implementations)
provides a vfork() system call with BSD semantics for programs that require the fastest
possible fork. However, because the unusual semantics of vfork() can lead to some
subtle program bugs, its use should normally be avoided, except in the rare cases
where it provides worthwhile performance gains.
Like fork(), vfork() is used by the calling process to create a new child process.
However, vfork() is expressly designed to be used in programs where the child
performs an immediate exec() call.
Two features distinguish the vfork() system call from fork() and make it more efficient:
No duplication of virtual memory pages or page tables is done for the child
process. Instead, the child shares the parent's memory until it either performs
a successful exec() or calls _exit() to terminate.
Execution of the parent process is suspended until the child has performed an
exec() or _exit().
These points have some important implications. Since the child is using the parent's
memory, any changes made by the child to the data, heap, or stack segments will be
visible to the parent once it resumes. Furthermore, if the child performs a function
process attributes.
The semantics of vfork() mean that after the call, the child is guaranteed to be
scheduled for the CPU before the parent. In Section 24.2, we noted that this is not
a guarantee made by
fork
or the child may be sched-
#include <unistd.h>
pid_t vfork(void);
the process's data, heap, and stack segments. Most modern UNIX implementations,
including Linux, use two techniques to avoid such wasteful copying:
The kernel marks the text segment of each process as read-only, so that a process
can't modify its own code. This means that the parent and child can share
the same text segment. The fork() system call creates a text segment for the
[Figure: parent and child page tables (PT entry 211), physical page frames, and unused page frames, shown before modification and after modification]
Figure 24-2: Duplication of file descriptors during fork(), and closing of unused descriptors
24.2.2 Memory Semantics of fork()
Conceptually, we can consider fork() as creating copies of the parent's text, data,
heap, and stack segments. (Indeed, in some early UNIX implementations, such
duplication was literally performed: a new process image was created by copying the
parent's memory to swap space, and making that swapped-out image the child
process while the parent kept its own memory.) However, actually performing a simple
copy of the parent's virtual memory pages into the new child process would be
wasteful for a number of reasons, one being that a fork() is often followed by an
immediate exec(), which replaces the process's text with a new program and reinitializes
[Figure 24-2 panels: a) descriptors and open file table entries before fork(); b) descriptors after fork(); c) after closing unused descriptors in parent and in child. Each panel shows parent (and child) file descriptors, with close-on-exec flag, pointing to entries in the open file table.]
    switch (fork()) {
    case -1:
        errExit("fork");

    case 0:     /* Child: change file offset and status flags */
file offset (as modified by
read
) and the open file status flags
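Since the file offset is held in the open file description that parent and child share, a change made by the child is visible to the parent. The following is a minimal sketch of such an experiment, using our own hypothetical function offsetSeenByParent() and a temporary file created with mkstemp(); it is an illustration, not the book's listing.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Open a temporary file, let a child change the file offset to
   'childOffset', and return the offset that the parent then sees
   (or -1 on error) */
off_t
offsetSeenByParent(off_t childOffset)
{
    char template[] = "/tmp/fork_fd_XXXXXX";
    int fd;
    off_t offset;
    pid_t childPid;

    fd = mkstemp(template);
    if (fd == -1)
        return -1;
    unlink(template);           /* File disappears when descriptor closes */

    childPid = fork();
    if (childPid == -1) {
        close(fd);
        return -1;
    }

    if (childPid == 0) {        /* Child: change offset of shared open file */
        if (lseek(fd, childOffset, SEEK_SET) == -1)
            _exit(EXIT_FAILURE);
        _exit(EXIT_SUCCESS);
    }

    if (waitpid(childPid, NULL, 0) == -1) {     /* Wait for child to finish */
        close(fd);
        return -1;
    }

    offset = lseek(fd, 0, SEEK_CUR);            /* Parent: query the offset */
    close(fd);
    return offset;
}
```

The parent never moves the offset itself, so any nonzero value it observes must be the result of the child's lseek().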
The key point to understanding
fo
is to realize that af
Figure 24-1: Overview of the use of fork(), exit(), wait(), and execve()
24.2 Creating a New Process: fork()
In many applications, creating multiple processes can be a useful way of dividing
up a task. For example, a network server process may listen for incoming client
requests and create a new child process to handle each request; meanwhile, the
server process continues to listen for further client connections. Dividing tasks up
in this way often makes application design simpler. It also permits greater
concurrency (i.e., more tasks or requests can be handled simultaneously).
The fork() system call creates a new process, the child, which is an almost exact
duplicate of the calling process, the
[Figure 24-1 labels: parent process running program A; memory of parent copied to child; child process running program A; parent may perform other actions here; child may perform further actions here; execution of program B; exit(status); child status passed to parent; wait(&status); kernel restarts parent and optionally delivers a signal]
The exit() library function is layered on top of the _exit() system call. In Chapter 25,
wait() is likewise optional. The parent can simply ignore its child and continue
executing. However, we'll see later that the use of wait() is usually desirable, and
is often employed within a handler for the SIGCHLD signal, which the kernel
generates for a parent process when one of its children terminates. (By default,
SIGCHLD is ignored, which is why we label it as being optionally delivered in
In this and the next three chapters, we look at how a process is created and
terminates, and how a process can execute a new program. This chapter covers process
creation. However, before diving into that subject, we present a short overview of
the main system calls covered in these four chapters.
24.1 Overview of fork(), exit(), wait(), and execve()
The principal topics of this and the next few chapters are the system calls fork(),
exit(), wait(), and execve(). Each of these system calls has variants, which we'll also
look at. For now, we provide an overview of these four system calls and how they
copies of the parent's stack, data, heap, and text segments (Section 6.3). The term
fork derives from the fact that we can envisage the parent process as dividing to
yield two copies of itself.
The exit(status) library function terminates a process, making all resources
(memory, open file descriptors, and so on) used by the process available for
subsequent reallocation by the kernel. The status argument is an integer that
Chapter 23
Timers and Sleeping
fd = timerfd_create(CLOCK_REALTIME, 0);
if (fd == -1)
errExit("timerfd_create");
the maximum number of expirations of the timer that the program should wait for
before terminating; the default for this argument is 1.
The program creates a timer using timerfd_create(), and arms it using
new_value.it_value
to be able to simultaneous
ly monitor one or more time
    if (argc < 2)
        usageErr("%s secs[/nsecs][:int-secs[/int-nsecs]]...\n", argv[0]);

    tidlist = calloc(argc - 1, sizeof(timer_t));
    if (tidlist == NULL)
        errExit("malloc");

    sev.sigev_notify = SIGEV_THREAD;            /* Notify via thread */
    sev.sigev_notify_function = threadFunc;     /* Thread start function */
    sev.sigev_notify_attributes = NULL;
            /* Could be pointer to pthread_attr_t structure */

    /* Create and start one timer for each command-line argument */

    for (j = 0; j < argc - 1; j++) {
        itimerspecFromStr(argv[j + 1], &ts);

        sev.sigev_value.sival_ptr = &tidlist[j];
                /* Passed as argument to threadFunc() */

        if (timer_create(CLOCK_REALTIME, &sev, &tidlist[j]) == -1)
            errExit("timer_create");
        printf("Timer ID: %ld (%s)\n", (long) tidlist[j], argv[j + 1]);
Listing 23-7: POSIX timer notification using a thread function
timers/ptmr_sigev_thread.c
#include <signal.h>
#include <time.h>
#include <pthread.h>
#include "curr_time.h"              /* Declaration of currTime() */
#include "tlpi_hdr.h"
#include "itimerspec_from_str.h"    /* Declares itimerspecFromStr() */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int expireCnt = 0;           /* Number of expirations of all timers */

static void                         /* Thread notification function */
threadFunc(union sigval sv)
{
    timer_t *tidptr;
    int s;

    tidptr = sv.sival_ptr;

    printf("[%s] Thread notify\n", currTime("%T"));
    printf("    timer ID=%ld\n", (long) *tidptr);
address of the timer ID (tidlist[j]) to this field so that the notification
function can obtain the ID of the timer that caused its invocation.
Having created and armed all of the timers, the main program enters a loop
that waits for timer expirations. Each time through the loop, the program uses
pthread_cond_wait() to wait for a condition variable (cond) to be signaled by
the thread that is handling a timer notification.
The threadFunc() function is invoked on each timer expiration. After printing
a message, it increments the value of the global variable expireCnt. To allow
for the possibility of timer ov
that can be queued. Therefore, the POSIX.1b committee decided on a different
approach: if we choose to receive timer notification via a signal, then multiple
instances of the signal are never queued, even if we use a realtime signal. Instead,
after receiving the signal (either via a signal handler or by using sigwaitinfo()), we
Use the value in the
field of the
notification of timer expiration via the invocation of a function in a separate
thread. Understanding this flag requires knowledge of POSIX threads that we
present later, in Chapters 29 and 30. Readers unfamiliar with POSIX threads may
want to read those chapters before examining the example program that we present
in this section.
Listing 23-7 demonstrates the use of SIGEV_THREAD. This program takes the same
command-line arguments as the program in Listing 23-5. The program performs
the following steps:
For each command-line argument, the program creates and arms a POSIX
timer that uses the SIGEV_THREAD notification mechanism.
Each time this timer expires, the function specified by sev.sigev_notify_function
will be invoked in a separate thread. When this function is invoked, it receives
the value specified in sev.sigev_value.sival_ptr as an argument. We assign the
#define _POSIX_C_SOURCE 199309
#include <time.h>
int
    if (cptr == NULL) {
        tsp->it_interval.tv_sec = 0;
        tsp->it_interval.tv_nsec = 0;
    } else {
        sptr = strchr(cptr + 1, '/');
        if (sptr != NULL)
            *sptr = '\0';
        tsp->it_interval.tv_sec = atoi(cptr + 1);
        tsp->it_interval.tv_nsec = (sptr != NULL) ? atoi(sptr + 1) : 0;
    }

timers/itimerspec_from_str.c
We demonstrate the use of the program in Listing 23-5 in the following shell
session, creating a single timer with an initial timer expiry of 2 seconds and an interval
of 5 seconds.
$ ./ptmr_sigev_signal 2:5
Timer ID: 134524952 (2:5)
[15:54:56] Got signal 64                    SIGRTMAX is signal 64 on this system
    *sival_ptr = 134524952                  sival_ptr points to the variable tid
Each of the command-line arguments of the program in Listing 23-5 specifies the
initial value and interval for a timer. The syntax of these arguments is described in
the program's usage message and demonstrated in the shell session below. This
program performs the following steps:
Establish a handler for the signal that is used for timer notifications.
For each command-line argument, create and arm a POSIX timer that uses the
SIGEV_SIGNAL notification mechanism. The itimerspecFromStr() function that we
use to convert the command-line arguments to itimerspec structures is shown in
Listing 23-6.
On each timer expiration, the signal specified in sev.sigev_signo will be delivered
to the process. The handler for this signal displays the value that was supplied in
sev.sigev_value.sival_ptr (i.e., the timer ID, tidlist[j]) and the overrun value for
the timer.
Having created and armed the timers, wait for timer expirations by executing a
loop that repeatedly calls pause().
Listing 23-6 shows the function that converts each of the command-line arguments
for the program in Listing 23-5 into a corresponding itimerspec structure. The for-
    printf("[%s] Got signal %d\n", currTime("%T"), sig);
    printf("    *sival_ptr = %ld\n", (long) *tidptr);
further information about the signal. (To take advantage of this feature in a signal
handler, we specify the SA_SIGINFO flag when establishing the handler.) The following
fields are set in the siginfo_t structure:
si_signo: This field contains the signal generated by this timer.
the signal. (Alternatively,
may be assigned the address of a
structure that contains the
given to
timer_c
Linux also supplies the following nonstandard field in the siginfo_t structure:
si_overrun: This field contains the overrun count for this timer (described in
Section 23.6.6).
Linux also supplies another nonstandard field: si_timerid. This field contains
an identifier that is used internally by the system to identify the timer (it is not
On each expiration of the timer, th
siginfo_t
structure (Section 21.4) that provides
#define _POSIX_C_SOURCE 199309
#include <time.h>
int
#define _POSIX_C_SOURCE 199309
#include <time.h>
int timer_delete(timer_t timerid);
Returns 0 on success, or -1 on error
23.6.2 Arming and Disarming a Timer: timer_settime()
SIGEV_SIGNAL
When the timer expires, generate the signal specified in the sigev_signo field
for the process. If sigev_signo is a realtime signal, then the sigev_value field
specifies data (an integer or a pointer) to accompany the signal (Section 22.8.1).
clockid
can specify any of the values shown in Table 23-1, or the
value
23.6 POSIX Interval Timers
The request and remain arguments serve similar purposes to the analogous
arguments for nanosleep().
By default (i.e., if flags is 0), the sleep interval specified in request is relative
(like nanosleep()). However, if we specify TIMER_ABSTIME in flags (see the example in
Listing 23-4), then request specifies an absolute time as measured by the clock
identified by clockid. This feature is essential in applications that need to sleep
accurately until a specific time. If we instead
struct timespec request;
23.5.3 Obtaining the Clock ID of a Specific Process or Thread
The functions described in this section allow us to obtain the ID of a clock that
measures the CPU time consumed by a particular process or thread. We can use
#define _XOPEN_SOURCE 600
#include <time.h>
int clock_nanosleep(clockid_t clockid, int flags,
                    const struct timespec *request, struct timespec *remain);
CLOCK_REALTIME
clock is a system-wide clock that measures wall-clock time. By
contrast with the
CLOCK_MONOTONIC
dont cause any access to the hardware clock (which can be expensive for some
hardware clock sources), and the resoluti
request = remain; /* Next sleep is with remaining time */
}
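The assignment request = remain shown in the fragment above is the core of a common idiom: if nanosleep() is interrupted by a signal handler, we call it again with the unslept time. A minimal sketch of that idiom as a reusable function follows; the name sleepFully() is our own, and the function is an illustration rather than part of Listing 23-3.

```c
#include <errno.h>
#include <time.h>

/* Sleep for the full interval, restarting nanosleep() with the remaining
   time whenever it is interrupted by a signal handler. Returns 0 on
   success, or -1 on error (e.g., EINVAL for an invalid tv_nsec value). */
int
sleepFully(time_t secs, long nsecs)
{
    struct timespec request, remain;

    request.tv_sec = secs;
    request.tv_nsec = nsecs;

    while (nanosleep(&request, &remain) == -1) {
        if (errno != EINTR)
            return -1;
        request = remain;       /* Next sleep is with remaining time */
    }

    return 0;
}
```

If signals arrive at a high rate, repeated restarts of this kind can make the total sleep drift beyond the requested interval; sleeping until an absolute time with clock_nanosleep() and TIMER_ABSTIME avoids that problem.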
Listing 23-3: Using nanosleep()
timers/t_nanosleep.c
#define _POSIX_C_SOURCE 199309
#include <sys/time.h>
#include <time.h>
#include <signal.h>
#include "tlpi_hdr.h"
static void
sigintHandler(int sig)
this program expects seconds
and nanosecond values for
loops repeatedly, executing
until the total sleep interval is passed. If
is interrupted by the handler for
SIGINT
(generated by typing
Control-C
If the sleep completes,
slee
#define _POSIX_C_SOURCE 199309
#include <time.h>
int nanosleep(const struct timespec *request, struct timespec *remain);
sa.sa_handler = handler;
if (sigaction(SIGALRM, &sa, NULL) == -1)
errExit("sigaction");
An example of the use of alarm() is shown in Section 23.3.
In some later example programs in this book, we use alarm() to start a timer
without establishing a corresponding SIGALRM handler, as a technique for
ensuring that a process is killed if it is not otherwise terminated.
Each time the timer expires, the
. If we call
TIMERS AND SLEEPING
A timer allows a process to schedule a notification for itself to occur at some time
in the future. Sleeping allows a process (or thread) to suspend execution for a
period of time. This chapter
Chapter 22
Although signals can be viewed as a method of IPC, many factors make them
generally unsuitable for this purpose, including their asynchronous nature, the fact
that they are not queued, and their low bandwidth. More usually, signals are used
Signals: Advanced Features
well as the process ID and real user ID of the sending process.
The sigsuspend() system call allows a program to atomically modify the process
signal mask and suspend execution until a signal arrives. The atomicity of
sigsuspend() is essential to avoid race conditions when unblocking a signal and then
suspending execution until that signal arrives.
We can use sigwaitinfo() and sigtimedwait() to synchronously wait for a signal.
This saves us the work of designing and writing a signal handler, which may be
unnecessary if our only aim is to wait for the delivery of a signal.
Like sigwaitinfo() and sigtimedwait(), the Linux-specific signalfd() system call can
be used to synchronously wait for a signal. The distinctive feature of this interface
is that signals can be read via a file descriptor. This file descriptor can also be
monitored using
disposition to
The sigpause() function is similar to sigsuspend(), but removes just one signal
from the process signal mask before suspending the process until the arrival of a
signal.
The BSD signal API
The POSIX signal API drew heavily on the 4.2BSD API, so the BSD functions are
mainly direct analogs of those in POSIX.
As with the functions in the System V signal API described above, we present
the prototypes of the functions in the BSD signal API, and briefly explain the
operation of each function.
22.13 System V and BSD
Our discussion of signals has focused on the POSIX signal API. We now briefly
look at the historical APIs provided by System V and BSD. Although all new
applications should use the POSIX API, we
A signalfd file descriptor can be monitored along with other descriptors using
select(), poll(), and epoll (described in Chapter 63). Among other uses, this feature
provides an alternative to the self-pipe trick described in Section 63.5.2. If signals are
pending, then these techniques indicate the file descriptor as being readable.
When we no longer require a signalfd file descriptor, we should close it, in
order to release the associated kernel resources.
Listing 22-7 (on page 473) demonstrates the use of signalfd(). This program
creates a mask of the signal numbers specified in its command-line arguments, blocks
those signals, and then creates a signalfd file descriptor to read those signals. It then
loops, reading signals from the file descriptor and displaying some of the informa-
mask
printf(" si_pid=%ld, si_uid=%ld\n",
(long) si.si_pid, (long) si.si_uid);
}
signals/t_sigwaitinfo.c
The sigtimedwait() system call is a variation on sigwaitinfo(). The only difference is
that the timeout argument allows us to specify a time limit for waiting.
The timeout argument specifies the maximum time that sigtimedwait() should wait
for a signal. It is a pointer to a structure of the following type:
struct timespec {
    time_t tv_sec;      /* Seconds ('time_t' is an integer type) */
    long   tv_nsec;     /* Nanoseconds */
};
The fields of the timespec structure are filled in to specify the maximum number of
seconds and nanoseconds that sigtimedwait() should wait. Specifying both fields of
the structure as 0 causes an immediate timeout; that is, a poll to check if any of the
specified set of signals is pending. If the call times out without a signal being
delivered, sigtimedwait() fails with the error EAGAIN.
If the timeout argument is specified as NULL, then sigtimedwait() is exactly
equivalent to sigwaitinfo(). SUSv3 leaves the meaning of a NULL timeout unspecified, and
some UNIX implementations instead interp
#include <sys/signalfd.h>
int
signalfd
(int
In the output for the accepted SIGUSR1 signal, we see that the si_value field has the
value 100. This is the value to which the field was initialized by the preceding signal
that was sent using sigqueue(). We noted earlier that the si_value field contains valid
information only for signals sent using sigqueue().
Listing 22-6: Synchronously waiting for a signal with sigwaitinfo()
signals/t_sigwaitinfo.c
#define _GNU_SOURCE
#include <string.h>
#include <signal.h>
#include <time.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
int sig;
siginfo_t si;
An example of the use of sigwaitinfo() is shown in Listing 22-6. This program
first blocks all signals, then delays for the number of seconds specified in its
optional command-line argument. This allows signals to be sent to the program
before sigwaitinfo(). The program then loops continuously using sigwaitinfo() to
accept incoming signals, until SIGINT or SIGTERM is received.
The following shell session log demonstrates the use of the program in
Listing 22-6. We run the program in the background, specifying that it should
delay 60 seconds before calling sigwaitinfo(), and then send it two signals:
$ ./t_sigwaitinfo 60 &
./t_sigwaitinfo: PID is 3837
./t_sigwaitinfo: signals blocked
./t_sigwaitinfo: about to delay 60 seconds
[1] 3837
$ ./t_sigqueue 3837 43 100                  Send signal 43
./t_sigqueue: PID is 3839, UID is 1000
$ ./t_sigqueue 3837 42 200                  Send signal 42
./t_sigqueue: PID is 3840, UID is 1000
Eventually, the program completes its sleep interval, and the sigwaitinfo() loop
accepts the queued signals. (We see a shell prompt mixed with the next line of the
program's output because the t_sigwaitinfo program is writing output from the
background.) As with realtime signals caught with a handler, we see that signals are
delivered lowest number first, and that the siginfo_t structure passed to the signal
handler allows us to obtain the process ID and user ID of the sending process:
$ ./t_sigwaitinfo: finished delay
got signal: 42
    si_signo=42, si_code=-1 (SI_QUEUE), si_value=200
    si_pid=3840, si_uid=1000
got signal: 43
    si_signo=43, si_code=-1 (SI_QUEUE), si_value=100
    si_pid=3839, si_uid=1000
We continue, using the shell kill command to send a signal to the process. This
time, we see that the
The main program continues its loop:
=== LOOP 2
Starting critical section, signal mask is:
2 (Interrupt)
3 (Quit)
Type Control-\ to generate
SIGQUIT
Before sigsuspend() - pending signals:
                3 (Quit)
Caught signal 3 (Quit)                      sigsuspend() is called, signals are unblocked
=== Exited loop
    if (sigaction(SIGINT, &sa, NULL) == -1)
        errExit("sigaction");
    if (sigaction(SIGQUIT, &sa, NULL) == -1)
        errExit("sigaction");

    for (loopNum = 1; !gotSigquit; loopNum++) {
        printf("=== LOOP %d\n", loopNum);

        /* Simulate a critical section by delaying a few seconds */

        printSigMask(stdout, "Starting critical section, signal mask is:\n");
        for (startTime = time(NULL); time(NULL) < startTime + 4; )
            continue;                   /* Run for a few seconds elapsed time */

        printPendingSigs(stdout,
                "Before sigsuspend() - pending signals:\n");
        if (sigsuspend(&origMask) == -1 && errno != EINTR)
            errExit("sigsuspend");
    }
Loop until gotSigquit is set. Each loop iteration performs the following steps:
Display the current value of the signal mask using our printSigMask() function.
Simulate a critical section by executing a CPU busy loop for a few seconds.
Display the mask of pending signals using our printPendingSigs() function
(Listing 20-4).
Use sigsuspend() to unblock SIGINT and SIGQUIT and wait for a signal (if one
is not already pending).
Use sigprocmask() to restore the process signal mask to its original state, and
then display the signal mask using printSigMask().
Listing 22-5: Using sigsuspend()
signals/t_sigsuspend.c
and the main program resumes, the pause() call will block until a second instance
of SIGINT is delivered. This defeats the purpose of the code, which was to unblock
SIGINT and then wait for its first occurrence.
Even if the likelihood of SIGINT
22.9Waiting for a Si
Before we explain what sigsuspend() does, we first describe a situation where we
need to use it. Consider the following scenario that is so
There is a problem with the code in Listing 22-4. Suppose that the SIGINT signal is
delivered after execution of the second sigprocmask(), but before the pause() call.
(The signal might actually have been generated at any time during the execution of
the critical section, and then be delivered only when it is unblocked.) Delivery of the
SIGINT signal will cause the handler to be
int
main(int argc, char *argv[])
struct sigaction sa;
int sig;
We continue by using the shell kill command to send a signal to the catch_rtsigs
program. As before, we see that the siginfo_t structure received by the handler includes
the process ID and user ID of the sending process, but in this case, the si_code value
is SI_USER:
Press Enter to see next shell prompt
echo $$
Display PID of shell
12780
kill -40 12842
Uses kill(2) to send a signal
$ caught signal 40
si_signo=40, si_code=0 (SI_USER), si_value=0
si_pid=12780, si_uid=1000
PID is that of the shell
Press Enter to see next shell prompt
kill 12842
Kill catch_rtsigs by sending
SIGTERM
Caught 6 signals
Press Enter to see notification from shell about terminated background job
[1]+ Done ./catch_rtsigs 60
Listing 22-3:

signals/catch_rtsigs.c
#define _GNU_SOURCE
#include <string.h>
#include <signal.h>
#include "tlpi_hdr.h"
static volatile int handlerSleepTime;
static volatile int sigCnt = 0; /* Number of signals received */
static volatile int allDone = 0;
static void /* Handler for signals established using SA_SIGINFO */
siginfoHandler(int sig, siginfo_t *si, void *ucontext)
    /* UNSAFE: This handler uses non-async-signal-safe functions
       (printf()); see Section 21.1.2) */
/* SIGINT or SIGTERM can be used to terminate program */
if (sig == SIGINT || sig == SIGTERM) {
allDone = 1;
If the first argument is supplied, the main program blocks all signals, and then
sleeps for the number of seconds specified by this argument. During this time, we
can queue multiple realtime signals to the process and observe what happens when
the signals are unblocked. The second argument specifies the number of seconds
that the signal handler should sleep befo
A call to sigqueue() may fail if the limit on the number of queued signals has been
reached. In this case,
Listing 22-2: Using sigqueue() to send realtime signals
signals/t_sigqueue.c
#define _POSIX_C_SOURCE 199309
#include <signal.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int sig, numSigs, j, sigData;
    union sigval sv;

    if (argc < 4 || strcmp(argv[1], "--help") == 0)
        usageErr("%s pid sig-num data [num-sigs]\n", argv[0]);

    /* Display our PID and UID, so that they can be compared with the
       corresponding fields of the siginfo_t argument supplied to the
       handler in the receiving process */

    printf("%s: PID is %ld, UID is %ld\n", argv[0],
this case). However, SUSv3 doesn't require implementations to guarantee this
behavior, so we can't portably rely on it.
22.8.1 Sending Realtime Signals
The sigqueue() system call sends the realtime signal specified by sig to the process
specified by pid.
The same permissions are required to send a signal using sigqueue() as are required
with kill() (see Section 20.5). A null signal (i.e., signal 0) can be sent, with the same
meaning as for kill(). (Unlike kill(), we can't use sigqueue() to send a signal to an
entire process group by specifying a negative value in pid.)
#define _POSIX_C_SOURCE 199309
#include sign&#xsign;zl.;&#xh000;al.h
int
sigqueue
(pid_t
, int
, const union sigval
Returns 0 on success, or 1 on error
When sending a realtime signal, it is possible to specify data (an integer or pointer value) that accompanies the signal. The signal handler in the receiving
by default. In older versions of the library, the earlier unreliable (System V-compatible) semantics are provided.
The Linux kernel contains an implementation of signal() as a system call. This implementation provides the older, unreliable semantics. However, glibc bypasses this system call by providing a signal() library function that calls sigaction().
If we want to obtain unreliable signal semantics with modern versions of glibc, we can explicitly replace our calls to signal() with calls to the (nonstandard) sysv_signal() function. The sysv_signal() function takes the same arguments as signal().
If the _BSD_SOURCE feature test macro is not defined when compiling a program, glibc implicitly redefines all calls to signal() to be calls to sysv_signal(), meaning that signal() has unreliable semantics. By default, _BSD_SOURCE is defined, but it is disabled (unless also explicitly defined) if other feature test macros such as _SVID_SOURCE or _XOPEN_SOURCE are defined when compiling a program.
sigaction() is the preferred API for establishing a signal handler
Because of the System V versus BSD (and old versus recent glibc) portability issues described above, it is good practice always to use sigaction(), rather than signal(), to establish signal handlers. We follow this practice throughout the remainder of this book. (An alternative is to write our own version of signal(), probably similar to Listing 22-1, specifying exactly the flags that we require, and employ that version with our applications.) Note, however, that it is portable (and
Delivery of further occurrences of a signal was not blocked during execution of a signal handler. (This corresponds to the SA_NODEFER flag described in Section 20.13.) This meant that if the signal was delivered again while the handler was still executing, then the handler would be recursively invoked. Given a sufficiently rapid stream of signals, the resulting recursive invocations of the handler could overflow the stack.
As well as being unreliable, early UNIX implementations did not provide automatic restarting of system calls (i.e., the behavior described for the SA_RESTART flag in Section 21.5).
The 4.2BSD reliable signals implementation rectified these limitations, and several other UNIX implementations followed suit. However, the older semantics live on today in the System V implementation of signal(), and even contemporary standards such as SUSv3 and C99 leave these aspects of signal() deliberately unspecified.
Tying the above information together, we implement signal() as shown in Listing 22-1. By default, this implementation provides the modern signal semantics. If compiled with -DOLD_SIGNAL, then it provides the earlier unreliable signal semantics and doesn't enable automatic restarting of system calls.
Listing 22-1: An implementation of signal()

signals/signal.c
#include <signal.h>
typedef void (*sighandler_t)(int);
sighandler_t
signal(int sig, sighandler_t handler)
{
    struct sigaction newDisp, prevDisp;
    newDisp.sa_handler = handler;
    sigemptyset(&newDisp.sa_mask);
#ifdef OLD_SIGNAL
Order of delivery of multiple unblocked signals
If a process has multiple pending signals that are unblocked using sigprocmask(), then all of these signals are immediately delivered to the process.
As currently implemented, the Linux kernel delivers the signals in ascending order. For example, if pending SIGINT (signal 2) and SIGQUIT (signal 3) signals were both simultaneously unblocked, then the SIGINT signal would be delivered before SIGQUIT, regardless of the order in which the two signals were generated.
We can't, however, rely on (standard) signals being delivered in any particular order, since SUSv3 says that the delivery order of multiple signals is implementation-defined. (This statement applies only to standard signals. As we'll see in Section 22.8, the standards governing realtime signals do provide guarantees about the order in which multiple unblocked realtime signals are delivered.)
The model we have implicitly considered so far is asynchronous signal generation, in which the signal is sent either by another process or generated by the kernel for an event that occurs independently of the execution of the process (e.g., the user types the interrupt character or a child of this process terminates). For asynchronously generated signals, the earlier statement that a process can't predict when the signal will be delivered holds true.
However, in some cases, a signal is generated while the process itself is executing. We have already seen two examples of this:
The hardware-generated signals (SIGBUS, SIGFPE, SIGILL, SIGSEGV, and SIGEMT) described in Section 22.4 are generated as a consequence of executing a specific machine-language instruction that results in a hardware exception.
A process can use raise(), kill(), or killpg() to send a signal to itself.
In these cases, the generation of the signal is synchronous: the signal is delivered immediately (unless it is blocked, but see Section 22.4 for a discussion of what happens when blocking hardware-generated signals). In other words, the earlier statement about the unpredictability of the delivery of a signal doesn't apply. For synchronously generated signals, delivery is predictable and reproducible.
Note that synchronicity is an attribute of how a signal is generated, rather than of the signal itself. All signals may be generated synchronously (e.g., when a process sends itself a signal using kill()) or asynchronously (e.g., when the signal is sent by another process using kill()).
22.6 Timing and Order of Signal Delivery
As the first topic of this section, we consider exactly when a pending signal is delivered. We then consider what happens if multiple pending blocked signals are simultaneously unblocked.
When is a signal delivered?
As noted in Section 22.5, synchronously generated signals are delivered immediately. For example, a hardware exception triggers an immediate signal, and when a process sends itself a signal using raise(), the signal is delivered before the raise() call returns.
22.4 Hardware-Generated Signals
SIGBUS, SIGFPE, SIGILL, and SIGSEGV can be generated as a consequence of a hardware exception or, less usually, by being sent by kill(). In the case of a hardware exception,
nored terminal-generated signals
If, at the time it was execed, a program finds that the disposition of a terminal-
22.2 Special Cases for Delivery
For certain signals, special rules apply regarding delivery, disposition, and handling, as described in this section.
SIGKILL and SIGSTOP
It is not possible to change the default action for SIGKILL, which always terminates a process, and SIGSTOP, which always stops a process. Both signal() and sigaction()
The file system on which the current working directory resides is mounted read-only, is full, or has run out of i-nodes. Alternatively, the user has reached their quota limit on the file system.
22.1 Core Dump Files
Certain signals cause a process to create a core dump and terminate (Table 20-1, page 396). A core dump is a file containing a memory image of the process at the time it terminated. (The term core derives from an old memory technology.) This memory image can be loaded into a debugger in order to examine the state of a program's code and data at the moment when the signal arrived.
One way of causing a program to produce a core dump is to type the quit character (usually Control-\), which causes the SIGQUIT signal to be generated:
ulimit -c unlimited                     Explained in main text
Type Control-\
Quit (core dumped)
ls -l core                              Shows core dump file
-rw------- 1 mtk users 57344 Nov 30 13:39 core
In this example, the message Quit (core dumped) is printed by the shell, which
Similar functionality is available on Linux by attaching to a running process using gdb and then using the gcore command.
Circumstances in which core dump files are not produced
A core dump is not produced in the following circumstances:
The process doesn't have permission to write the core dump file. This could happen because the process doesn't have write permission for the directory in which the core dump file is to be created, or because a file with the same name already exists and either is not writable or is not a regular file (e.g., it is a directory or a symbolic link).
A regular file with the same name already exists, and is writable, but there is more than one (hard) link to the file.
The directory in which the core dump file is to be created doesn't exist.
The process resource limit on the size of a core dump file
of the code of a program that they would otherwise be unable to read.
SIGNALS: ADVANCED FEATURES
446
Chapter 21
We can use sigaltstack() to define an alternate signal stack for a process. This is an area of memory that is used instead of the standard process stack when invoking a signal handler. An alternate signal stack is useful in cases where the standard stack has been exhausted by growing too large (at which point the kernel sends a SIGSEGV signal to the process).
The sigaction() SA_SIGINFO flag allows us to establish a signal handler that receives additional information about a signal. This information is supplied via a siginfo_t structure whose address is passed as an argument to the signal handler.
When a signal handler interrupts a blocked system call, the system call fails with the error EINTR. We can take advantage of this behavior to, for example, set a timer on a blocking system call. Interrupted system calls can be manually restarted if desired. Alternatively, establishing the signal handler with the sigaction() SA_RESTART flag causes many (but not all) system calls to be automatically restarted.
Further information
See the sources listed in Section 20.15.
21.7 Exercise
21-1. Implement abort().
Signals: Signal Handlers
445
If flag is true (1), then a handler for the signal sig will interrupt blocking system calls. If flag is false (0), then blocking system calls will be restarted after execution of a handler for sig.
The siginterrupt() function works by using sigaction()
time of signal delivery, the input and output system calls will be interrupted,
If we frequently write code such as the above, it can be useful to define a macro
such as the following:
#define NO_EINTR(stmt) while ((stmt) == -1 && errno == EINTR);
Using this macro, we can rewrite the earlier read() call as follows:
NO_EINTR(cnt = read(fd, buf, BUF_SIZE));
if (cnt == -1) /* read() failed with other than EINTR */
errExit("read");
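The same retry logic can also be written as a small function, which some readers may find easier to step through than a macro. This is only a sketch; readRetry() is a hypothetical name.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Call read(), restarting it whenever a signal handler interrupts it.
   Returns what read() returns for any outcome other than EINTR. */
static ssize_t
readRetry(int fd, void *buf, size_t count)
{
    ssize_t cnt;

    do {
        cnt = read(fd, buf, count);
    } while (cnt == -1 && errno == EINTR);
    return cnt;
}
```

A return of -1 from readRetry() then always indicates an error other than EINTR.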
The GNU C library provides a (nonstandard) macro with the same purpose as our NO_EINTR() macro in <unistd.h>. The macro is called
The final argument passed to a handler established with the SA_SIGINFO flag, ucontext, is a pointer to a structure of type ucontext_t (defined in <ucontext.h>). (SUSv3 uses a void pointer for this argument because it doesn't specify any of the
Table 21-2:
(continued)
The following two fields are set only for the delivery of a SIGIO signal (Section 63.3):
si_band
This field contains the band event value associated with the I/O event. (In glibc up until 2.3.2, si_band was typed as int.)
si_fd
This field contains the number of the file descriptor associated with the I/O event. This field is not specified in SUSv3, but it is present on many other
Upon entry to a signal handler, the fields of the siginfo_t
structure uses a union to combine the sa_sigaction and sa_handler fields. (Most other UNIX implementations similarly use a union for this purpose.) Using a union is possible because only one of these fields is required during a particular call to sigaction(). (However, this can lead to strange bugs if we naively
sigstack.ss_size = SIGSTKSZ;
sigstack.ss_flags = 0;
if (sigaltstack(&sigstack, NULL) == -1)
errExit("sigaltstack");
printf("Alternate stack is at %10p-%p\n",
sigstack.ss_sp, (char *) sbrk(0) - 1);
sa.sa_handler = sigsegvHandler; /* Establish handler for SIGSEGV */
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_ONSTACK; /* Handler uses alternate stack */
if (sigaction(SIGSEGV, &sa, NULL) == -1)
errExit("sigaction");
overflowStack(1);
signals/t_sigaltstack.c
21.4 The SA_SIGINFO Flag
about a signal when it is delivered. In order to obtain this information, we must declare the handler as follows:
void handler(int sig, siginfo_t *siginfo, void *ucontext);
The first argument, sig, is the signal number, as for a standard signal handler. The second argument, siginfo, is a structure used to provide the additional information about the signal. We describe this structure below. The last argument, ucontext, is also described below.
Since the above signal handler has a different prototype from a standard signal handler, C typing rules mean that we can't use the sa_handler field of the sigaction structure to specify the address of the handler. Instead, we must use an alternative field: sa_sigaction. In other words, the definition of the sigaction structure is somewhat more complex than was shown in Section 20.13. In full, the structure is
struct sigaction {
    union {
        void (*sa_handler)(int);
        void (*sa_sigaction)(int, siginfo_t *, void *);
    } __sigaction_handler;
Call 2145 - top of stack near 0x4024cfac
Caught signal 11 (Segmentation fault)
Top of handler stack near 0x804c860
In this shell session, the ulimit command is used to remove any RLIMIT_STACK
specifying NULL for the sigstack argument. Otherwise, each of these arguments points to a structure of the following type:
typedef struct {
    void  *ss_sp;       /* Starting address of alternate stack */
    int    ss_flags;    /* Flags: SS_ONSTACK, SS_DISABLE */
    size_t ss_size;     /* Size of alternate stack */
} stack_t;
The ss_size and ss_sp fields specify the size and location of the alternate signal stack. When actually using the alternate signal stack, the kernel automatically takes care of aligning the value given in ss_sp to an address boundary that is suitable for the hardware architecture.
Typically, the alternate signal stack is either statically allocated or dynamically allocated on the heap. SUSv3 specifies the constant SIGSTKSZ to be used as a typical value when sizing the alternate stack, and MINSIGSTKSZ as the minimum size required to invoke a signal handler. On Linux/x86-32, these constants are defined with the values 8192 and 2048, respectively.
The kernel doesn't resize an alternate signal stack. If the stack overflows the space we have allocated for it, then chaos results (e.g., overwriting of variables beyond the limits of the stack). This is not usually a problem: because we normally use an alternate signal stack to handle the special case of the standard stack overflowing, typically only one or a few frames are allocated on the stack. The job of the SIGSEGV handler is either to perform some cleanup and terminate the process or to unwind the standard stack using a nonlocal goto.
The ss_flags field contains one of the following values:
SS_ONSTACK
abort() always terminates the process. In most implementations, termination is guaranteed as follows: if the process still hasn't terminated after raising SIGABRT once (i.e., a handler catches the signal
int
main(int argc, char *argv[])
{
    struct sigaction sa;
    printSigMask(stdout, "Signal mask at startup:\n");
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = handler;
    if (sigaction(SIGINT, &sa, NULL) == -1)
        errExit("sigaction");
Note that using #ifdef was the simplest way of writing the program in Listing 21-2 in a standards-conformant fashion. In particular, we could not have replaced the #ifdef with the following run-time check:
if (useSiglongjmp)
From the program output, we can see that, after a longjmp() from the signal
Real-world applications should avoid calling non-async-signal-safe functions from signal handlers. To make this clear, each signal handler in the example programs that uses one of these functions is marked with a comment indicating that the usage is unsafe:
printf("Some message\n");               /* UNSAFE */
21.1.3 Global Variables and the sig_atomic_t Data Type
Notwithstanding reentrancy issues, it can be useful to share global variables
SUSv3 notes that all functions not listed in Table 21-1 are considered to be unsafe with respect to signals, but points out that a function is unsafe only when invocation of a signal handler interrupts the execution of an unsafe function, and the handler itself also calls an unsafe function. In other words, when writing signal handlers, we have two choices:
Ensure that the code of the signal handler itself is reentrant and that it calls only async-signal-safe functions.
Block delivery of signals while executing code in the main program that calls unsafe functions or works with global data structures also updated by the signal handler.
The problem with the second approach is that, in a complex program, it can be difficult to ensure that a signal handler will never interrupt the main program while it is calling an unsafe function. For this reason, the above rules are often simplified to the statement that we must not call unsafe functions from within a signal handler.
If we set up the same handler function to deal with several different signals or use the SA_NODEFER flag to sigaction(), then a handler may interrupt itself. As a consequence, the handler may be nonreentrant if it updates global (or static) variables, even if they are not used by the main program.
Use of errno inside signal handlers
Because they may update errno, use of the functions listed in Table 21-1 can nevertheless render a signal handler nonreentrant, since they may overwrite the errno
The following functions are added: execv(), fchmodat(), fchownat(), fstatat(), linkat(), mkdirat(), mkfifoat(), mknod(), mknodat(), openat(), readlinkat(), renameat(), unlinkat()
Table 21-1: Functions required to be async-signal-safe by POSIX.1-1990, SUSv2, and SUSv3
_Exit() (v3)
_exit()
abort() (v3)
int
main(int argc, char *argv[])
{
    char *cr1;
    int callNum, mismatch;
    struct sigaction sa;
    if (argc != 3)
        usageErr("%s str1 str2\n", argv[0]);
    str2 = argv[2];                     /* Make argv[2] available to handler */
    cr1 = strdup(crypt(argv[1], "xx")); /* Copy statically allocated string
                                           to another buffer */
    if (cr1 == NULL)
        errExit("strdup");
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = handler;
    if (sigaction(SIGINT, &sa, NULL) == -1)
        errExit("sigaction");
    /* Repeatedly call crypt() using argv[1]. If interrupted by a
       signal handler, then the static storage returned by crypt()
       will be overwritten by the results of encrypting argv[2], and
Example program
Listing 21-1 demonstrates the nonreentrant nature of the crypt() function (Section 8.5). As command-line arguments, this program accepts two strings. The program performs the following steps:
1. Call crypt() to encrypt the string in the first command-line argument, and copy this string to a separate buffer using strdup().
2. Establish a handler for SIGINT (generated by typing Control-C). The handler calls crypt() to encrypt the string supplied in the second command-line argument.
3. Enter an infinite for loop that uses crypt() to encrypt the string in the first
relevant for programs that employ signal handlers. Because a signal handler may asynchronously interrupt the execution of a program at any point in time, the main program and the signal handler in effect form two independent (although not concurrent) threads of execution within the same process.
A function is said to be reentrant if it can safely be simultaneously executed by multiple threads of execution in the same process. In this context, "safe" means that the function achieves its expected result, regardless of the state of execution of any other thread of execution.
The SUSv3 definition of a reentrant function is one whose effect, when called by two or more threads, is guaranteed to be as if the threads each executed the function one after the other in an undefined order, even if the actual execution is interleaved.
A function may be nonreentrant if it updates global or static data structures. (A function that employs only local variables is guaranteed to be reentrant.) If two invocations of (i.e., two threads executing) the function simultaneously attempt to update the same global variable or data structure, then these updates are likely to interfere with each other and produce incorrect results. For example, suppose that one thread of execution is in the middle of updating a linked list data structure to add a new list item when another thread also attempts to update the same linked list. Since adding a new item to the list requires updating multiple pointers, if another thread interrupts these steps and updates the same pointers, chaos will result.
Such possibilities are in fact rife within the standard C library. For example, we already noted in Section 7.1.3 that malloc() and free() maintain a linked list of freed memory blocks available for reallocation from the heap. If a call to malloc() in the main program is interrupted by a signal handler that also calls malloc(), then this linked list can be corrupted. For this reason, the malloc() family of functions, and other library functions that use them, are nonreentrant.
21.1 Designing Signal Handlers
In general, it is preferable to write simple signal handlers. One important reason for this is to reduce the risk of creating race conditions. Two common designs for signal handlers are the following:
SIGNALS: SIGNAL HANDLERS
This chapter continues the description of signals begun in the previous chapter. It focuses on signal handlers, and extends the discussion started in Section 20.4. Among the topics we consider are the following:
how to design a signal handler, which necessitates a discussion of reentrancy and async-signal-safe functions;
Signals: Fundamental Concepts
419
The sigaction() system call provides more control and flexibility than signal() when setting the disposition of a signal.
418
Chapter 20
20.14 Waiting for a Signal: pause()
Calling pause() suspends execution of the process until the call is interrupted by a signal handler (or until an unhandled signal terminates the process).
When a signal is handled,
20.13 Changing Signal Dispositions: sigaction()
The sigaction() system call is an alternative to signal()
    if (sig == SIGINT)
        gotSigint = 1;
    else
        sigCnt[sig]++;
}
int
main(int argc, char *argv[])
{
    int n, numSecs;
./sig_sender 5368 1000000 10 2          Send SIGUSR1 signals, plus a SIGINT
./sig_sender: sending signal 10 to process 5368 1000000 times
./sig_sender: exiting
./sig_receiver: pending signals are:
2 (Interrupt)
10 (User defined signal 1)
./sig_receiver: signal 10 caught 1 time
[1]+ Done ./sig_receiver 15
The command-line arguments to the sending program specified the SIGUSR1 and SIGINT signals, which are signals 10 and 2, respectively, on Linux/x86.
From the output above, we can see that even though one million signals were sent, only one was delivered to the receiver.
Even if a process doesn't block signals, it may receive fewer signals than are sent to it. This can happen if the signals are sent so fast that they arrive before the receiving process has a chance to be scheduled for execution by the kernel, with the result that the multiple signals are recorded just once in the process's pending signal set. If we execute the program in Listing 20-7 with no command-line arguments (so that it doesn't block signals and sleep), we see the following:
./sig_receiver &
[1] 5393
./sig_receiver: PID is 5393
./sig_sender 5393 1000000 10 2
./sig_sender: sending signal 10 to process 5393 1000000 times
./sig_sender: exiting
./sig_receiver: signal 10 caught 52 times
[1]+ Done ./sig_receiver
Of the million signals sent, just 52 were caught by the receiving process. (The precise number of signals caught will vary depending on the vagaries of decisions made by the kernel scheduling algorithm.) The reason for this is that each time the sending program is scheduled to run, it sends multiple signals to the receiver. However, only one of these signals is marked as pending and then delivered when the receiver has a chance to run.
Listing 20-7: Catching and counting signals

signals/sig_receiver.c
#define _GNU_SOURCE
#include <signal.h>
To temporarily prevent delivery of a signal, we can use the series of calls shown in Listing 20-5 to block the signal, and then reset the signal mask to its previous state.
SUSv3 specifies that if any pending signals are unblocked by a call to sigprocmask(), then at least one of those signals will be delivered before the call returns.
20.10 The Signal Mask (Blocking Signal Delivery)
For each process, the kernel maintains a signal mask, a set of signals whose delivery to the process is currently blocked. If a signal that is blocked is sent to a process, delivery of that signal is delayed until it is unblocked by being removed from the process signal mask. (In Section 33.2.1, we'll see that the signal mask is actually a per-thread attribute, and that each thread in a multithreaded process can independently examine and modify its signal mask using the pthread_sigmask() function.)
A signal may be added to the signal mask in the following ways:
When a signal handler is invoked, the signal that caused its invocation can be automatically added to the signal mask.
functions are also not async-signal-safe (i.e., beware of
indiscriminately calling them from signal handlers). */
void /* Print list of signals within a signal set */
The GNU C library implements three nonstandard functions that perform tasks that are complementary to the standard signal set functions just described. These functions perform the following tasks:
Multiple signals are represented using a data structure called a signal set
20.8 Displaying Signal Descriptions
Each signal has an associated printable description. These descriptions are listed in the array sys_siglist. For example, we can refer to sys_siglist[SIGPIPE]
Listing 20-3: Using the kill() system call
signals/t_kill.c
#include <signal.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
{
    int s, sig;
    if (argc != 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s sig-num pid\n", argv[0]);
IPC channels such as pipes and FIFOs
The SIGCONT signal is treated specially. An unprivileged process may send this signal to any other process in the same session, regardless of user ID checks. This rule allows job-control shells to restart stopped jobs (process groups), even if the processes of the job have changed their user IDs (i.e., they are privileged processes that have used the system calls described in Section 9.7 to change their credentials).
Figure 20-2: Permissions required for an unprivileged process to send a signal
If a process doesn't have permissions to send a signal to the requested pid, then
termination status, as described in Section 26.2).
Various other techniques can also be used
The pid argument identifies one or more processes to which the signal specified by
various example programs, we'll nevertheless call printf() from a signal handler as a simple means of seeing when the handler is called.
Listing 20-2: Establishing the same handler for two different signals
signals/intquit.c
#include <signal.h>
#include "tlpi_hdr.h"
static void
sigHandler(int sig)
{
    static int count = 0;
    /* UNSAFE: This handler uses non-async-signal-safe functions
       (printf()); see Section 21.1.2) */
    if (sig == SIGINT) {
        count++;
        printf("Caught SIGINT (%d)\n", count);
Listing 20-1 (on page 399) shows a simple example of a signal handler function and a main program that establishes it as the handler for the SIGINT signal. (The terminal driver generates this signal when we type the terminal interrupt character, usually
Invocation of a signal handler may interrupt the main program flow at any time; the kernel calls the handler on the process's behalf, and when the handler
[Figure: flow of execution. The main program runs from the start of the program; on delivery of a signal, the kernel calls the signal handler on behalf of the process; the handler is executed; the program resumes at the point of interruption.]
temporarily establish a handler for a signal
SIGIO is ignored by default on several UNIX implementations (particularly BSD derivatives).
Although not specified by any standards, SIGEMT appears on most UNIX implementations. However, this signal typically results in termination with a core dump on other implementations.
In SUSv1, the default action for SIGURG was specified as process termination, and this is the default in some older UNIX implementations. SUSv2 adopted the current specification (ignore).
20.3 Changing Signal Dispositions: signal()
UNIX systems provide two ways of changing the disposition of a signal: signal() and sigaction(). The signal() system call, which is described in this section, was the original
Note the following points regarding the default behavior shown for certain signals in Table 20-1:
On Linux 2.2, the default action for the signals SIGXCPU, SIGXFSZ, SIGSYS, and SIGBUS is to terminate the process without producing a core dump. From kernel 2.4 onward, Linux conforms to the requirements of SUSv3, with these signals causing termination with a core dump. On several other UNIX implementations, SIGXCPU and SIGXFSZ are treated in the same way as on Linux 2.2.
SIGPWR is typically ignored by default on those other UNIX implementations where it appears.
Table 20-1: Linux signals
Name      Signal number            Description                    SUSv3   Default
SIGABRT   6                        Abort process
SIGALRM   14                       Real-time timer expired
SIGBUS    7 (SAMP=10)              Memory access error
SIGCHLD   17 (SA=20, MP=18)        Child terminated or stopped
SIGCONT   18 (SA=19, M=25, P=26)   Continue if stopped
SIGEMT    undef (SAMP=7)           Hardware fault                         term
SIGFPE
SIGUSR2
See the description of SIGUSR1.
SIGVTALRM
The kernel generates this signal upon
SIGTRAP
This signal is used to implement debugger breakpoints and system call tracing, as performed by strace(1) (Appendix A). See the ptrace(2) manual page for further information.
SIGTSTP
This is the job-control stop signal, sent to stop the foreground process group when the user types the suspend character (usually Control-Z) on the keyboard. Chapter 34 describes process groups (jobs) and job control in
SIGQUIT
When the user types the
character (usually
Control-\
) on the keyboard,
this signal is sent to the foreground
process group. By default, this signal
terminates a process and causes it to produce a core dump, which can then
be used for debugging. Using
SIGQUIT
in this manner is useful with a pro-
gram that is stuck in an infinite lo
op or is otherwise not responding. By
Control-\
and then loading the resulting core dump with the
gdb
debugger and using the
backtrace
command to obtain a stack trace, we can
find out which part of the program
code was executing. ([Matloff, 2008]
describes the use of
SIGSEGV
This very popular signal is generated when a program makes an invalid memory
reference. A memory reference may be invalid because the referenced page
doesn't exist (e.g., it lies in an unmapped area somewhere between the heap
and the stack), the process tried to update a location in read-only memory
(e.g., the program text segment or a region of mapped memory marked
read-only), or the process tried to access a part of kernel memory while
running in user mode (Section 2.1). In C, these events often result from
dereferencing a pointer containing a bad address (e.g., an uninitialized
pointer) or passing an invalid argument in a function call. The name of this
signal derives from the term
SIGSTKFLT
Documented as "stack fault on coprocessor", this signal is defined, but is
unused on Linux.
SIGSTOP
This is the sure stop signal. It can't be blocked, ignored, or caught by a
handler; thus, it always stops a process.
SIGSYS
This signal is generated if a process makes a bad system call. This means that
the process executed an instruction that was interpreted as a system call trap,
but the associated system call number
was not valid (refer to Section 3.1).
SIGTERM
This is the standard signal used for
terminating a process and is the default
signal sent by the
and
SIGINT
When the user types the terminal interrupt character (usually Control-C), the
terminal driver sends this signal to the foreground process group. The
default action for this signal is to terminate the process.
SIGIO
Using the fcntl() system call, it is possible to arrange for this signal to be
generated when an I/O event (e.g., input becoming available) occurs on
certain types of open file descriptors,
SIGCLD
This is a synonym for SIGCHLD.
SIGCONT
When sent to a stopped process, this signal causes the process to resume
(i.e., to be rescheduled to run at some later time). When received by a
process that is not currently stopped, this signal is ignored by default. A
process may catch this signal, so that it carries out some action when it
resumes.
mation about the foreground process group.
Signals appeared in very early UNIX implementations, but have gone through
some significant changes since their inception. In early implementations, signals
could be lost (i.e., not delivered to the targ
Upon delivery of a signal, a process carries out one of the following default
actions, depending on the signal:
The signal is ignored; that is, it is discarded by the kernel and has no effect on
the process. (The process never even knows that it occurred.)
The process is terminated (killed). This is so
20.1 Concepts and Overview
A signal is a notification to a process that an event has occurred. Signals are
sometimes described as software interrupts. Signals are analogous to hardware
interrupts in that they interrupt the normal flow of execution of a program; in
most cases, it is not possible to predict exactly when a signal will arrive.
One process can (if it has suitable permissions) send a signal to another process.
In this use, signals can be employed as a synchronization technique, or even as a
primitive form of interprocess communication (IPC). It is also possible for a
process to send a signal to itself. However, the usual source of many signals sent
to a process is the kernel. Among the types of events that cause the kernel to
generate a signal for a process are the following:
A hardware exception occurred, meaning
This chapter and the next two chapters discuss signals. Although the fundamental
concepts are simple, our discussion is quite leng
Chapter 19
19.6 An Older System for Mo
Linux provides another mechanism for monitoring file events. This mechanism,
known as dnotify, has been available since kernel
Monitoring File Events
These two events share the same cookie value, allowing the application to link them.
When we create a subdirectory under one of the monitored directories, the
mask in the resulting event includes the IN_ISDIR bit, indicating that the subject
of the event is a directory:
mkdir dir2/ddd
Read 32 bytes from inotify fd
wd = 1; mask = IN_CREATE IN_ISDIR
name = ddd
At this point, it is worth repeating that inotify monitoring is not recursive. If the
application wanted to monitor events in the newly created subdirectory, then it
would need to issue a further inotify_add_watch() call specifying the pathname of
that subdirectory.
Finally, we remove one of the monitored directories:
Read 32 bytes from inotify fd
wd = 1; mask = IN_DELETE_SELF
wd = 1; mask = IN_IGNORED
The last event, IN_IGNORED, was generated to inform the application that the kernel
has removed this watch item from the watch list.
19.5 Queue Limits and
Queuing events requires kernel memory. For this reason, the kernel places
various limits on the operation of the inotify mechanism. The superuser can
configure these limits via three files in the directory /proc/sys/fs/inotify:
max_queued_events
When inotify_init() is called, this value is used
The program in Listing 19-1 performs the following steps:
Use inotify_init() to create an inotify file descriptor.
Use inotify_add_watch() to add a watch item for each of the files named in the
command-line argument of the program. Each watch item watches for all
possible events.
Execute an infinite loop that:
Reads a buffer of events from the inotify file descriptor.
Calls the displayInotifyEvent() function to display the contents of each of the
inotify_event structures within that buffer.
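The inner step of that loop, walking the variable-length inotify_event structures in a buffer returned by read(), can be sketched as follows. This is an illustrative helper, not the book's displayInotifyEvent() function: the name count_events() and its interface are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/types.h>

/* Hypothetical helper: walks 'numRead' bytes of inotify data in 'buf'.
   Returns the number of complete events found and, if 'masks' is not
   NULL, stores the OR of their 'mask' fields there. */
int count_events(const char *buf, ssize_t numRead, uint32_t *masks)
{
    const char *p = buf;
    const char *end = buf + numRead;
    uint32_t all = 0;
    int count = 0;

    while (end - p >= (ptrdiff_t) sizeof(struct inotify_event)) {
        struct inotify_event hdr;

        /* Copy the fixed-size header; avoids alignment assumptions */
        memcpy(&hdr, p, sizeof(hdr));

        /* Stop if the name that follows the header is truncated */
        if ((size_t) (end - p) < sizeof(hdr) + hdr.len)
            break;

        all |= hdr.mask;
        count++;

        /* Each event occupies the header plus 'len' bytes of name */
        p += sizeof(hdr) + hdr.len;
    }

    if (masks != NULL)
        *masks = all;
    return count;
}
```

The key point is the stride: each event is sizeof(struct inotify_event) + len bytes long, where len covers the name together with its terminating and padding null bytes.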
The following shell session demonstrates the use of the program in Listing 19-1.
We start an instance of the program that runs in the background monitoring two
directories:
./demo_inotify dir1 dir2 &
[1] 5386
Watching dir1 using wd 1
Watching dir2 using wd 2
Then we execute commands that generate events in the two directories. We begin
by creating a file using cat:
cat > dir1/aaa
Read 64 bytes from inotify fd
wd = 1; mask = IN_CREATE
name = aaa
wd = 1; mask = IN_OPEN
name = aaa
The above output produced by the background program shows that read() fetched
a buffer containing two events. We continue by typing some input for the file and
then the terminal end-of-file character:
Hello world
Read 32 bytes from inotify fd
wd = 1; mask = IN_MODIFY
name = aaa
Type Control-D
Read 32 bytes from inotify fd
wd = 1; mask = IN_CLOSE_WRITE
name = aaa
We then rename the file into the other monitored directory. This results in two
events, one for the directory from which the file moves (watch descriptor 1), and
the other for the destination directory (watch descriptor 2):
mv dir1/aaa dir2/bbb
Read 64 bytes from inotify fd
wd = 1; cookie = 548; mask = IN_MOVED_FROM
name = aaa
wd = 2; cookie = 548; mask = IN_MOVED_TO
name = bbb
#define BUF_LEN (10 * (sizeof(struct inotify_event) + NAME_MAX + 1))

int
main(int argc, char *argv[])
{
    int inotifyFd, wd, j;
    char buf[BUF_LEN];
    ssize_t numRead;
    char *p;
    struct inotify_event *event;

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s pathname... \n", argv[0]);

    inotifyFd = inotify_init();                 /* Create inotify instance */
    if (inotifyFd == -1)
        errExit("inotify_init");

    for (j = 1; j < argc; j++) {
        wd = inotify_add_watch(inotifyFd, argv[j], IN_ALL_EVENTS);
        if (wd == -1)
            errExit("inotify_add_watch");

        printf("Watching %s using wd %d\n", argv[j], wd);
    }

    for (;;) {                                  /* Read events forever */
        numRead = read(inotifyFd, buf, BUF_LEN);
        if (numRead == 0)
When appending a new event to the end of the event queue, the kernel will
coalesce that event with the event at the tail of the queue (so that the new event
is not in fact queued), if the two events have the same values for wd, mask,
cookie, and name. This is done because many applications don't need to know
about repeated instances of the same event, and dropping the excess events
reduces the amount of kernel memory required for the event queue. However, this
means we can't use inotify
The IN_UNMOUNT event informs the application that the file system containing the
monitored object has been unmounted. After this event, a further event
containing the IN_IGNORED bit will be delivered.
We describe the IN_Q_OVERFLOW event in Section 19.5, which discusses limits on
queued inotify events.
The cookie field is used to tie related events
Figure 19-2: An input buffer containing three inotify_event structures
The wd field tells us the watch descriptor for which this event occurred. This field
contains one of the values returned by a previous call to inotify_add_watch(). The
wd field is useful when an application is monitoring multiple files or directories
via the same inotify file descriptor. It provides the link
The meanings of most of the bits in Table 19-1 are evident from their names. The
The inotify_rm_watch() system call removes the watch item specified by wd from
the inotify instance referred to by the file descriptor fd.
Table 19-1: inotify
Bit value          In   Out   Description
IN_ACCESS                     File was accessed (re
IN_ATTRIB                     File metadata changed
IN_CLOSE_WRITE                File opened for writing was closed
IN_CLOSE_NOWRITE              File opened read-only was closed
IN_CREATE                     File/directory created inside watched directory
Starting with kernel 2.6.27, Linux supports a new, nonstandard system call,
inotify_init1(). This system call performs the same task as inotify_init(), but
provides an additional argument, flags, that can be used to modify the behavior of
the system call. Two flags are supported. The IN_CLOEXEC flag causes the kernel to
enable the close-on-exec flag (FD_CLOEXEC) for the new file descriptor. This flag is
useful for the same reasons as the open() O_CLOEXEC flag described in Section 4.3.1.
The IN_NONBLOCK flag causes the kernel to enable the O_NONBLOCK flag on the
underlying open file description, so that future reads will be nonblocking. This
saves additional calls to fcntl() to achieve the same result.
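The effect of the two flags can be checked with fcntl(). The following is a minimal sketch, assuming a kernel and C library that provide inotify_init1(); the helper name init1_sets_both_flags() is hypothetical.

```c
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a descriptor created with
   inotify_init1(IN_CLOEXEC | IN_NONBLOCK) has both FD_CLOEXEC and
   O_NONBLOCK set, 0 if either is missing, or -1 on error. */
int init1_sets_both_flags(void)
{
    int fd, fdFlags, flFlags;

    fd = inotify_init1(IN_CLOEXEC | IN_NONBLOCK);
    if (fd == -1)
        return -1;

    fdFlags = fcntl(fd, F_GETFD);       /* File descriptor flags */
    flFlags = fcntl(fd, F_GETFL);       /* Open file status flags */
    close(fd);

    if (fdFlags == -1 || flFlags == -1)
        return -1;

    return ((fdFlags & FD_CLOEXEC) && (flFlags & O_NONBLOCK)) ? 1 : 0;
}
```

With plain inotify_init(), the same result would need two further fcntl() calls (F_SETFD and F_SETFL) after the descriptor is created, which leaves a window in which another thread could call fork() and exec().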
The inotify_add_watch() system call either adds a new watch item to or modifies
an existing watch item in the watch list for the inotify instance referred to by the
file descriptor fd. (Refer to Figure 19-1.)
Figure 19-1: An inotify instance and associated kernel data structures
The pathname argument identifies the file for which a watch item is to be created
or modified. The caller must have read permission for this file. (The file
permission check is performed once, at the time of the inotify_add_watch() call. As
long as the watch item continues to exist, the caller will continue to receive file
notifications even if the file permissions are later changed so that the caller no
longer has read permission on the file.)
The mask argument is a bit mask that specifies the events to be monitored for
pathname. We say more about the bit values that can be specified in mask shortly.
If pathname has not previously been added to the watch list for
19.1 Overview
The key steps in the use of the inotify API are as follows:
1. The application uses inotify_init() to create an inotify instance. This system
call returns a file descriptor that is used to refer to the inotify instance in later
operations.
2. The application informs the kernel about which files are of interest by using
inotify_add_watch() to add items to the watch list of the inotify instance created
in the previous step. Each watch item consists of a pathname and an associated
bit mask. The bit mask specifies the set of events to be monitored for the
pathname. As its function result,
3. In order to obtain event notifications, the application performs read()
operations on the inotify file descriptor. Each successful read() returns one or
more inotify_event structures, each containing information about an event that
occurred on one of the pathnames being watched via this inotify instance.
4. When the application has finished monitoring, it closes the inotify file
descriptor. This automatically removes all watch items associated with the inotify
instance.
The inotify mechanism can be used to monitor files or directories. When
monitoring a directory, the application will be informed about events for the
directory itself and for files inside the directory.
The inotify monitoring mechanism is not recursive. If an application wants to
monitor events within an entire directory subtree, it must issue
inotify_add_watch() calls for each directory in the tree.
An inotify file descriptor can be monitored using select(), poll(), epoll, and, since
Linux 2.6.25, signal-driven I/O. If events are available to be read, then these
interfaces indicate the inotify file descriptor as being readable. See Chapter 63 for
fur-
MONITORING FILE EVENTS
Some applications need to be able to moni
Directories and Links
18.16Exercises
18-1.
In Section 4.3.2, we noted that it is not possible to open a file for writing if it is
currently being executed (
open()
Chapter 18
Both dirname() and basename() may modify the string pointed to by pathname.
Therefore, if we wish to preserve a pathname string, we must pass copies of it to
dirname() and basename(), as shown in Listing 18-5 (page 371). This program uses
strdup() (which calls malloc()) to make copies of the strings to be passed to
dirname() and basename(), and then uses free() to deallocate the duplicate strings.
Finally, note that both dirname() and basename()
If pathname is a NULL pointer or an empty string, then both dirname() and
basename()
if (lstat(argv[1], &statbuf) == -1)
errExit("lstat");
if (!S_ISLNK(statbuf.st_mode))
fatal("%s is not a symbolic link", argv[1]);
numBytes = readlink(argv[1], buf, BUF_SIZE - 1);
if (numBytes == -1)
errExit("readlink");
buf[numBytes] = '\0'; /* Add terminating null byte */
printf("readlink: %s --> %s\n", argv[1], buf);
if (realpath(argv[1], buf) == NULL)
    errExit("realpath");
printf("realpath: %s --> %s\n", argv[1], buf);
exit(EXIT_SUCCESS);
dirs_links/view_symlink.c
18.14 Parsing Pathname Strings: dirname() and basename()
The dirname() and basename() functions break a pathname string into directory and
filename parts. (These functions perform a similar task to the dirname(1) and
basename(1) commands.)
For example, given the pathname /home/britta/prog.c, dirname()
18.13 Resolving a Pathname: realpath()
The realpath() library function dereferences all symbolic links in pathname (a
null-terminated string) and resolves all references to /. and /.. to produce a
null-terminated string containing the corresponding absolute pathname.
The resulting string is placed in the buffer pointed to by resolved_path, which
should be a character array of at least PATH_MAX bytes. On success, realpath() also
The chroot() system call was not conceived as a completely secure jail mechanism.
To begin with, there are various ways in which a privileged program can
subsequently use a further chroot() call to break out of the jail. For example, a
privileged (CAP_MKNOD) program can use mknod() to create a memory device file
(similar to /dev/mem) giving access to the contents of RAM, and, from that point,
anything is possible. In general, it is
18.12 Changing the Root Directory of a Process: chroot()
Every process has a root directory, which is the point from which absolute pathnames
(i.e., those beginning with /) are interpreted.
under their new root directory, so they can't roam around the entire file system.
(This relies on the fact that the root directory is its own parent; that is, /.. is a
link to /, so that changing directory to / and then attempting a cd .. leaves the
user in the same directory.)
Some UNIX implementations (but not Linux) allow multiple hard links to a
directory, so that it is possible to create a hard link within a subdirectory to its
parent (or a further removed ancestor). On implementations permitting this,
the presence of a hard link that reaches outside the jail directory tree
compromises the jail. Symbolic links to directories outside the jail don't pose a
problem
The openat() system call is similar to the traditional open() system call, but adds
an argument, dirfd, that is used as follows:
If pathname specifies a relative pathname, then
The equivalent using
chdir()
is as follows:
char buf[PATH_MAX];
getcwd(buf, PATH_MAX); /* Remember where we are */
chdir(somepath); /* Go somewhere else */
If the
argument is
and
is 0, then the
wrapper function for
working directory. The BSD-derived
open()
We can use fchdir() to change the process's current working directory to another
FTW_SKIP_SIBLINGS
Don't process any further entries in the current directory; resume
processing in the parent directory.
FTW_SKIP_SUBTREE
If pathname is a directory (i.e., typeflag is FTW_D), then don't call func() for
entries under that directory. Processing resumes with the next sibling of
FTW_STOP
Don't process any further entries in the directory tree, as with the tradi-
tion of
The program in Listing 18-3 displays an indented hierarchy of the filenames in a
directory tree, one file per line, as well as the file type and i-node number.
Command-line options can be used
switch (sbuf->st_mode & S_IFMT) {           /* Print file type */
case S_IFREG:  printf("-"); break;
case S_IFDIR:  printf("d"); break;
case S_IFCHR:  printf("c"); break;
case S_IFBLK:  printf("b"); break;
case S_IFLNK:  printf("l"); break;
case S_IFIFO:  printf("p"); break;
case S_IFSOCK: printf("s"); break;
default:       printf("?"); break;          /* Should never happen (on Linux) */
}

printf(" %s ",
        (type == FTW_D)  ? "D  " : (type == FTW_DNR) ? "DNR" :
        (type == FTW_DP) ? "DP " : (type == FTW_F)   ? "F  " :
        (type == FTW_SL) ? "SL " : (type == FTW_SLN) ? "SLN" :
        (type == FTW_NS) ? "NS " : "   ");

if (type != FTW_NS)
    printf("%7ld ", (long) sbuf->st_ino);
else
    printf("        ");

printf(" %*s", 4 * ftwb->level, "");        /* Indent suitably */
printf("%s\n", &pathname[ftwb->base]);      /* Print basename */
FTW_SL
The flags argument to nftw() is created by ORing (|) zero or more of the
following constants, which modify the operation of the function:
FTW_CHDIR
Do a chdir() into each directory before processing its contents. This is
useful if func() is designed to do some work in the directory in which the file
specified by its pathname argument resides.
FTW_DEPTH
Do a postorder traversal of the directory tree. This means that nftw() calls
func() on all of the files (and subdirectories) within a directory before
executing func() on the directory itself. (The name of this flag is somewhat
misleading: nftw() always does a depth-first, rather than a breadth-first,
traversal of the directory tree. All that this flag does is convert the traversal
from preorder to postorder.)
FTW_MOUNT
Don't cross over into another file system. Thus, if one of the subdirectories
of the tree is a mount point, it is not traversed.
FTW_PHYS
By default, nftw() dereferences symbolic links. This flag tells it not to do so.
Instead, a symbolic link is passed to func() with a typeflag value of FTW_SL, as
described below.
For each file, nftw() passes four arguments when calling func(). The first of these
arguments, pathname, is the pathname of the file. This pathname may be absolute,
if dirpath was specified as an absolute pathname, or relative to the current working
directory of the calling process at the time of the call to nftw(), if dirpath was
expressed as a relative pathname. The second argument, statbuf, is a pointer to a
stat structure (Section 15.1) containing information about this file. The third
argument, typeflag, provides further information about the file, and has one of the
following symbolic values:
FTW_D
This is a directory.
FTW_DNR
This is a directory that can't be read (and so nftw() doesn't traverse any of
its descendants).
FTW_DP
We are doing a postorder traversal (FTW_DEPTH) of a directory, and the current
item is a directory whose files and subdirectories have already been
FTW_F
This is a file of any type other than a directory or symbolic link.
FTW_NS
Calling stat() on this file failed, probably because of permission
restrictions. The value in statbuf is undefined.
The
n (i.e., calling some programmer-defined
function) for each file in the subtree.
The nftw() function is an enhancement of the older ftw() function, which performs
a similar task. New applications should use nftw() (new ftw) because it provides
more functionality, and predictable handling of symbolic links (SUSv3 permits
ftw() either to follow or not follow symbolic links). SUSv3 specifies both nftw()
and ftw(), but the latter function is marked obsolete in SUSv4.
The GNU C library also provides the BSD-derived fts API (fts_open(), fts_read(),
fts_children(),
else
for (argv++; *argv; argv++)
listFiles(*argv);
exit(EXIT_SUCCESS);
dirs_links/list_files.c
The readdir_r() function
The readdir_r() function is a variation on readdir(). The key semantic difference
between readdir_r() and readdir() is that the former is reentrant, while the latter
is not. This is because readdir_r()
Listing 18-2: Scanning a directory
dirs_links/list_files.c
#include <dirent.h>
#include "tlpi_hdr.h"

static void             /* List all files in directory 'dirpath' */
listFiles(const char *dirpath)
{
    DIR *dirp;
    struct dirent *dp;
    Boolean isCurrent;          /* True if 'dirpath' is "." */

    isCurrent = strcmp(dirpath, ".") == 0;

    dirp = opendir(dirpath);
    if (dirp == NULL) {
        errMsg("opendir failed on '%s'", dirpath);
The telldir() and seekdir() functions, which are also specified in SUSv3,
allow random access within a directory stream. Refer to the manual pages for further
information about these functions.
Directory streams and file descriptors
A directory stream has an associated file descriptor. The
Further information about the file referred to by d_name can be obtained by calling
stat() on the pathname constructed using the dirpath argument that was specified to
opendir()
The fdopendir() function is like opendir(), except that the directory for which a
stream is to be created is specified via the open file descriptor fd.
The fdopendir() function is provided so that applications can avoid the kinds of
race conditions described in Section 18.11.
After a successful call to fdopendir(), this file descriptor is under the control of
the system, and the program should not access it in any way other than by using the
functions described in the remainder of this section.
The fdopendir() function is specified in SUSv4 (but not in SUSv3).
The readdir() function reads successive entries from a directory stream.
Each call to readdir() reads the next directory entry from the directory stream
referred to by dirp
ext4
#include <dirent.h>
DIR *fdopendir(int fd);
18.7 Removing a File or Directory: remove()
The remove() library function removes a file or an empty directory.
If pathname is a file, remove() calls unlink(); if pathname is a directory, remove()
calls rmdir().
Like unlink() and rmdir(), remove() doesn't dereference symbolic links. If pathname
is a symbolic link, remove() removes the link itself, rather than the file to
which it refers.
If we want to remove a file in preparation for creating a new file with the same
name, then using remove()
#include <dirent.h>
DIR *opendir(const char *dirpath);
an octal number. The value given in
mode
is ANDed against the process umask
argument is an integer used to tell
the number of bytes avail-
buffer
If no errors occur, then
read
#include <sys/stat.h>
int mkdir(const char *pathname, mode_t mode);
Returns 0 on success, or -1 on error
If oldpath refers to a file other than a directory, then newpath can't specify the
pathname of a directory (the error is EISDIR). To rename a file to a location
inside a directory (i.e., move the file to another directory), newpath must
include the new filename. The following call both moves a file into a different
directory and changes its name:
rename("sub1/x", "sub2/y");
Specifying the name of a directory in oldpath allows us to rename that directory.
In this case, newpath either must not exist or must be the name of an
empty directory. If newpath is an existing file or an existing, nonempty directory,
then an error results (respectively, ENOTDIR and ENOTEMPTY).
If oldpath is a directory, then newpath can't contain a directory prefix that is the
same as oldpath. For example, we could not rename /home/mtk to /home/mtk/bin
(the error is EINVAL).
The files referred to by oldpath and newpath must be on the same file system.
This is required because a directory is a list of hard links that refer to i-nodes in
the same file system as the directory. As stated earlier, rename() is merely
manipulating the contents of directory lists. Attempting to rename a file into a
different file system fails with the error EXDEV. (To achieve the desired result, we
must instead copy the contents of the file from one file system to another and
The program then closes the file descriptor, at which point the file is removed,
and uses df(1) once more to show that the amount of disk space in use has decreased.
The following shell session demonstrates the use of the program in Listing 18-1:
./t_unlink /tmp/tfile 1000000
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda10 5245020 3204044 2040976 62% /
********** Closed file descriptor
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda10 5245020 2201128 3043892 42% /
In Listing 18-1, we use the system() function to execute a shell command. We
describe system() in detail in Section 27.6.
18.4 Changing the Name of a File: rename()
The rename() system call can be used both to rename a file and to move it into
another directory on the same file system.
The oldpath argument is an existing pathname, which is renamed to the pathname
given in newpath.
The rename() call just manipulates directory entries; it doesn't move file data.
Renaming a file doesn't affect other hard links to the file, nor does it affect any
processes that hold open descriptors for the file, since these descriptors refer to
open file descriptions, which (after the open() call) have no connection with
filenames.
The following rules apply to the use of rename():
If newpath already exists, it is overwritten.
If newpath and oldpath refer to the same file, then no changes are made (and
the call succeeds). This is rather counterintuitive. Following on from the previous
point, we normally expect that if two filenames x and y exist, then the call
rename("x", "y") would remove the name x. This is not the case if x and y are links
to the same file.
The rationale for this rule, which comes from the original BSD implementation,
was probably to simplify the checks that the kernel must perform in order
to guarantee that calls such as rename("x", "./x") and
rename("x", "somedir/../x") don't remove the file.
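The second rule can be checked with a small experiment. The following is a minimal sketch under assumptions: the helper name rename_same_file_keeps_both() and the /tmp scratch location are hypothetical, and the file system holding /tmp is assumed to support hard links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: in a fresh temporary directory, creates file "x",
   adds a hard link "y" to it, and calls rename() on the two names.
   Returns 1 if rename() succeeded and both names still exist afterward,
   0 if not, or -1 on a setup error. */
int rename_same_file_keeps_both(void)
{
    char dir[] = "/tmp/rename_demo_XXXXXX";     /* assumed scratch location */
    char x[64], y[64];
    int fd, result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(x, sizeof(x), "%s/x", dir);
    snprintf(y, sizeof(y), "%s/y", dir);

    fd = open(x, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd != -1) {
        close(fd);
        if (link(x, y) == 0) {          /* x and y now name one i-node */
            int rc = rename(x, y);

            result = (rc == 0 && access(x, F_OK) == 0 &&
                      access(y, F_OK) == 0) ? 1 : 0;
        }
    }

    unlink(x);                          /* Clean up; errors are ignored */
    unlink(y);
    rmdir(dir);
    return result;
}
```

If the two names referred to different files, the same rename() call would instead replace y and remove the name x.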
The rename() system call doesn't dereference symbolic links in either of its
arguments. If oldpath is a symbolic link, then the symbolic link is renamed. If
newpath is a symbolic link, then it is treated as a normal pathname to which
oldpath is to be renamed (i.e., the existing newpath symbolic link is removed).
#include <stdio.h>
int rename(const char *oldpath, const char *newpath);
Returns 0 on success, or -1 on error
Listing 18-1: Removing a link with unlink()
dirs_links/t_unlink.c
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

#define CMD_SIZE 200
#define BUF_SIZE 1024

int
main(int argc, char *argv[])
{
    int fd, j, numBlocks;
    char shellCmd[CMD_SIZE];            /* Command to be passed to system() */
    char buf[BUF_SIZE];                 /* Random bytes to write to file */

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s temp-file [num-1kB-blocks] \n", argv[0]);
implementations behave in the manner specified by SUSv3. One notable exception
is Solaris, which provides the same behavior as Linux by default, but provides
SUSv3-conformant behavior if appropriate compiler options are used. The upshot
of this inconsistency across implementations is that portable applications should
avoid specifying a symbolic link for the oldpath argument.
SUSv4 recognizes the inconsistency across existing implementations and spec-
Given the pathname of an existing file in oldpath, the link() system call creates a
new link, using the pathname specified in newpath. If newpath already exists, it is
not overwritten; instead, an error (EEXIST) results.
The link() system call doesn't dereference symbolic links. If oldpath is
a symbolic link, then newpath is created as a new hard link to the same symbolic
link file. (In other words, newpath is also a symbolic link to the same file to which
oldpath refers.) This behavior doesn't conform to SUSv3, which says that all
functions that perform pathname resolution should dereference symbolic links
unless otherwise specified (and there is no exception specified for link()). Most
other UNIX
Table 18-1:
Some UNIX file systems perform an optimization not mentioned in the main
text nor shown in Figure 18-2. When the total length of the string forming the
symbolic link's contents is small enough to fit in the part of the i-node that
would normally be used for data pointers, the link string is instead stored
there. This saves allocating a disk block and also speeds access to the symbolic
Figure 18-2:
Representation of hard and symbolic links
Since a symbolic link refers to a filename, rather than an i-node number, it can be
used to link to a file in a different file system. Symbolic links also do not suffer
the other limitation of hard links: we can create symbolic links to directories.
Tools such as
at least not portably and unambiguously, since a file descriptor refers to an i-node,
and multiple filenames (or even, as described in Section 18.3, none at all) may refer
On Linux, we can see which files a process currently has open by using
readdir() (Section 18.8) to scan the contents of the Linux-specific /proc/PID/fd
directory, which contains symbolic links for each of the file descriptors
currently opened by the process. The lsof(1) and fuser(1) tools, which have been
ported to many UNIX systems, can also be useful in this regard.
Hard links have two limitations, both of which can be circumvented by the use of
symbolic links:
Because directory entries (hard links) refer to files using just an i-node number,
and i-node numbers are unique only within a file system, a hard link must
reside on the same file system as the file to which it refers.
A hard link can't be made to a directory. This prevents the creation of circular
links, which would confuse many system programs.
Early UNIX implementations permitted the superuser to create hard links to
directories. This was necessary because these implementations did not provide
a mkdir() system call. Instead, a directory was created using mknod(), and then
links for the . and .. entries were created ([Vahalia, 1996]). Although this fea-
If we review the list of information stored in a file i-node (Section 14.4), we see that
the i-node doesn't contain a filename; it is only the mapping within a directory list
that defines the name of a file. This has a useful consequence: we can create
multiple names, in the same or in different directories, each of which refers to the
same i-node. These multiple names are known as links, or sometimes as hard links
to distinguish them from symbolic links, which we discuss shortly.
All native Linux and UNIX file systems support hard links. However, many
non-UNIX file systems (e.g., Microsoft's VFAT) do not. (Microsoft's NTFS file
system does support hard links.)
From the shell, we can create new hard links to an existing file using the ln
command, as shown in the following shell session log:
echo -n 'It is good to collect things,' > abc
ls -li abc
122232 -rw-r--r-- 1 mtk users 29 Jun 15 17:07 abc
ln abc xyz
On most native Linux file systems, filenames can be up to 255 characters long. The
relationship between directories and i-nodes is illustrated in Figure 18-1, which
shows the partial contents of the file system i-node table and relevant directory files
that are maintained for an example file (
[Figure 18-1: i-node table entries (type, perm, UID, GID, data block pointers) and
directory files for the example file]
In this chapter, we conclude our discussion of file-related topics by looking at
directories and links. After an overview of their implementation, we describe the
system calls used to create and remove directories and links. We then look at
library functions that allow a program to scan the contents of a single directory
and to walk through (i.e., examine each file in) a directory tree.
Each process has two directory-related a
Access Control Lists
printf("%c", (permVal == 1) ? 'r' : '-');
Chapter 17
Listing 17-1: Display the access or default ACL on a file

acl/acl_view.c
#include <acl/libacl.h>
#include <sys/acl.h>
#include "ugid_functions.h"
#include "tlpi_hdr.h"

static void
usageError(char *progName)
{
    fprintf(stderr, "Usage: %s [-d] filename\n", progName);
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    acl_t acl;
    acl_type_t type;
    acl_entry_t entry;
    acl_tag_t tag;
    uid_t *uidp;
    gid_t *gidp;
acl_entry_t entry;
status = acl_create_entry(&acl, &entry);
The new entry can then be populated using the functions described previously.
We now look briefly at the various ACL functions. In most cases, we don't describe
Overview
The functions that constitute the ACL API are listed in the acl(5) manual page. At
ACL entry requires 8 bytes, so that the maximum number of ACL entries for a file is somewhat less (because of some overhead for the name of the extended attribute for the ACL) than one-eighth of the block size. Thus, a 4096-byte block size allows for a maximum of around 500 ACL entries. (Kernels before 2.6.11 imposed an arbitrary limitation of 32 entries for ACLs on ext2 and ext3.)
On XFS, an ACL is limited to 25 entries.
On Reiserfs and JFS, ACLs can contain up to 8191 entries. This limit is a consequence of the size limitation (64 kB) imposed by the VFS on the value of an extended attribute (Section 16.2).
At the time of writing, Btrfs limits ACLs to around 500 entries. However, since Btrfs was still under heavy development, this limit may change.
Although most of the above file systems allow large numbers of entries to be created in an ACL, this should be avoided for the following reasons:
The maintenance of lengthy ACLs becomes a complex and potentially error-prone system administration task.
The amount of time required to scan the ACL for the matching entry (or matching entries in the case of group ID checks) increases linearly with the number of ACL entries.
Generally, we can keep the number of ACL entries on a file down to a reasonable number by defining suitable groups in the system group file (Section 8.3) and using those groups within the ACL.
17.8 The ACL API
The POSIX.1e draft standard defined a large suite of functions and data structures for manipulating ACLs. Since they are so numerous, we won't attempt to describe the details of all of these functions. Instead, we provide an overview of their usage and conclude with an example program.
Programs that use the ACL API should include <sys/acl.h>. It may also be necessary to include <acl/libacl.h> if the program makes use of various Linux extensions to the POSIX.1e draft standard. (A list of the Linux extensions is provided in the acl(5) manual page.) Programs using this API must be compiled with the -lacl option, in order to link against the libacl library.
As already noted, on Linux, ACLs are implemented using extended attributes, and the ACL API is implemented as a set of library functions that manipulate user-space data structures, and, where necessary, make calls to
If a directory has a default ACL, then:
A new subdirectory created in this directory inherits the directory's default ACL as its default ACL. In other words, default ACLs propagate down through a directory tree as new subdirectories are created.
A new file or subdirectory created in this directory inherits the directory's default ACL as its access ACL. The ACL entries that correspond to the traditional file permission bits are masked (ANDed) against the corresponding bits of the mode argument in the system call (open(), mkdir(), and so on) used to create the file or subdirectory. By corresponding ACL entries, we mean:
ACL_USER_OBJ;
ACL_MASK or, if ACL_MASK is absent, then ACL_GROUP_OBJ; and
ACL_OTHER.
When a directory has a default ACL, the process umask (Section 15.4.6) doesn't
We then use ls -l to once more view the traditional permission bits of the file. We see that the displayed group class permission bits reflect the permissions in the ACL_MASK entry (--x), rather than those in the ACL_GROUP entry:
$ ls -l tfile
-rwx--x--x+ 1 mtk users 0 Dec 3 15:42 tfile
entries with three lines showing the name and ownership of the file. We can prevent these lines from being displayed by specifying the --omit-header option.
Next, we demonstrate that changes to a file's permissions using the traditional chmod command are carried through to the ACL.
$ chmod u=rwx,g=rx,o=x tfile
To avoid these problems, we might consider making the
ACL_GROUP_OBJ
entry the
issions are granted to the
ACL_MASK
17.5 The getfacl and setfacl Commands
From the shell, we can use the
17.4 The ACL_MASK Entry and the ACL Group Class
If an ACL contains ACL_USER or ACL_GROUP entries, then it must contain an ACL_MASK entry. If the ACL doesn't contain any ACL_USER or ACL_GROUP entries, then the ACL_MASK entry is optional.
The ACL_MASK entry acts as an upper limit on the permissions granted by ACL entries in the so-called group class. The group class is the set of all ACL_USER, ACL_GROUP, and ACL_GROUP_OBJ entries in the ACL.
The purpose of the ACL_MASK entry is to provide consistent behavior when running ACL-unaware applications. As an example of why the mask entry is needed, suppose that the ACL on a file includes the following entries:
user::rwx # ACL_USER_OBJ
user:paulh:r-x # ACL_USER
group::r-x # ACL_GROUP_OBJ
group:teach:--x # ACL_GROUP
other::--x # ACL_OTHER
Now suppose that a program executes the following
call on this file:
17.3 Long and Short Text Forms for ACLs
When manipulating ACLs using the
tag-type:[tag-qualifier]: permissions
The tag-type is one of the values shown in the first column of Table 17-1. The tag-type may optionally be followed by a tag-qualifier, which identifies a user or group, either by name or numeric identifier. The tag-qualifier is present only for ACL_USER and ACL_GROUP entries.
The following are all short text form ACLs corresponding to a traditional permissions mask of 0650:
u::rw-,g::r-x,o::---
u::rw,g::rx,o::-
user::rw,group::rx,other::-
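The correspondence between a permissions mask and its short text form ACL can be sketched in code. The function below is an illustration only: modeToShortAcl() is a hypothetical name, it does not use the ACL library, and it always emits the fully spelled-out variant (the first of the three forms above).

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Format the minimal ACL equivalent to the permission bits in 'mode'
   as a short text form string, for example "u::rw-,g::r-x,o::---".
   Returns 0 on success, or -1 if 'buf' is too small.
   (Hypothetical helper for illustration.) */
int
modeToShortAcl(mode_t mode, char *buf, size_t len)
{
    int n = snprintf(buf, len, "u::%c%c%c,g::%c%c%c,o::%c%c%c",
            (mode & S_IRUSR) ? 'r' : '-',
            (mode & S_IWUSR) ? 'w' : '-',
            (mode & S_IXUSR) ? 'x' : '-',
            (mode & S_IRGRP) ? 'r' : '-',
            (mode & S_IWGRP) ? 'w' : '-',
            (mode & S_IXGRP) ? 'x' : '-',
            (mode & S_IROTH) ? 'r' : '-',
            (mode & S_IWOTH) ? 'w' : '-',
            (mode & S_IXOTH) ? 'x' : '-');

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Passing a mode of 0650 yields the string u::rw-,g::r-x,o::---, matching the first example.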
The following short text form ACL includes two named users, a named group, and a mask entry:
u::rw,u:paulh:rw,u:annabel:rw,g::r,g:teach:rw,m::rwx,o::-
Table 17-1:
2. If the effective user ID of the process matches the owner (user ID) of the file, then the process is granted the permissions specified in the ACL_USER_OBJ entry. (To be strictly accurate, on Linux, it is the process's file-system IDs, rather than its effective IDs, that are used for the checks described in this section, as described in Section 9.5.)
3. If the effective user ID of the process matches the tag qualifier in one of the ACL_USER entries, then the process is granted the permissions specified in that entry, masked (ANDed) against the value of the ACL_MASK entry.
4. If one of the process's group IDs (i.e., the effective group ID or any of the supplementary group IDs) matches the file group (this corresponds to the ACL_GROUP_OBJ entry) or the tag qualifier of any of the ACL_GROUP entries, then
ACL_GROUP_OBJ
This entry specifies permissions granted to the file group. Each ACL contains exactly one ACL_GROUP_OBJ entry. This entry corresponds to the traditional file group permissions, unless the ACL also contains an ACL_MASK entry.
ACL_GROUP
This entry specifies the permissions granted to the group identified by the tag qualifier. An ACL may contain zero or more ACL_GROUP entries, but at most one ACL_GROUP entry may be defined for a particular group.
ACL_MASK
This entry specifies the maximum permissions that may be granted by ACL_USER, ACL_GROUP_OBJ, and ACL_GROUP entries. An ACL contains at most one ACL_MASK entry. If the ACL contains ACL_USER or ACL_GROUP entries, then an ACL_MASK entry is mandatory. We say more about this tag type shortly.
ACL_OTHER
This entry specifies the permissions that are granted to users that don't match any other ACL entry. Each ACL contains exactly one ACL_OTHER entry. This entry corresponds to the traditional file other permissions.
The tag qualifier is employed only for ACL_USER and ACL_GROUP entries. It specifies either a user ID or a group ID.
Minimal and extended ACLs
A minimal ACL is one that is semantically equivalent to the traditional file permis-
springing from the incompleteness of the draft standards), writing portable programs that use ACLs presents some difficulties.
This chapter provides a description of ACLs and a brief tutorial on their use. It also describes some of the library functi
[Figure: an ACL shown as a list of entries, each consisting of a tag type, a tag qualifier, and permissions; annotations mark the entries that correspond to the traditional owner, group, and other permissions]
Section 15.4 described the traditional UNIX (and Linux) file permissions scheme. For many applications, this scheme is sufficient. However, some applications need finer control over the permissions grante
Chapter 16
    for (j = optind; j < argc; j++) {
        listLen = listxattr(argv[j], list, XATTR_SIZE);
        if (listLen == -1)
            errExit("listxattr");

        printf("%s:\n", argv[j]);

        /* Loop through all EA names, displaying name + value */

        for (ns = 0; ns < listLen; ns += strlen(&list[ns]) + 1) {
            printf("        name=%s; ", &list[ns]);
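The loop above relies on the layout of the buffer filled in by listxattr(): a sequence of null-terminated names packed one after another. The following stand-alone sketch isolates that traversal; countEaNames() is a hypothetical helper introduced only for this illustration, and it operates on any buffer with that layout rather than calling listxattr() itself.

```c
#include <string.h>
#include <sys/types.h>

/* Count the names in a buffer laid out the way listxattr() returns it:
   consecutive null-terminated strings occupying 'listLen' bytes in total.
   (Hypothetical helper for illustration.) */
int
countEaNames(const char *list, ssize_t listLen)
{
    int count = 0;

    /* Each step skips one name plus its terminating null byte */
    for (ssize_t ns = 0; ns < listLen; ns += (ssize_t) strlen(&list[ns]) + 1)
        count++;

    return count;
}
```

For a 14-byte buffer holding "user.x" and "user.y", each followed by a null byte, the function counts two names.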
Extended Attributes
Example program
Removing an EA
The removexattr(), lremovexattr(), and fremovexattr() system calls remove an EA from a file.
The null-terminated string given in name identifies the EA that is to be removed. An attempt to remove an EA that doesn't exist fails with the error ENODATA.
failure could also happen if another proc
The name argument is a null-terminated string that defines the name of the EA. The value argument is a pointer to a buffer that defines the new value for the EA. The size argument specifies the length of this buffer.
By default, these system calls create a new EA if one with the given name doesn't already exist, or replace the value of an EA if it does already exist. The flags argument provides finer control over this behavior. It may be specified as 0 to obtain the default behavior, or as one of the following constants:
XATTR_CREATE
Fail (EEXIST) if an EA with the given name already exists.
XATTR_REPLACE
Fail (ENODATA) if an EA with the given name doesn't already exist.
Here is an example of the use of
This prevents arbitrary users from attaching EAs to directories such as , which are publicly writable (and so would allow arbitrary users to manipulate EAs on the directory), but which have th
user.x="The past is not dead."
user.y="In fact, it's not even past."
EAs have names of the form namespace.name. The namespace component serves to separate EAs into functionally distinct classes. The name component uniquely identifies an EA within the given namespace.
Four values are supported for namespace: user, trusted, system, and security. These four types of EAs are used as follows:
User EAs may be manipulated by unprivileged processes, subject to file permis-
EXTENDED ATTRIBUTES
This chapter describes extended attributes (EAs), which allow arbitrary metadata, in the form of name-value pairs, to be associated with file i-nodes. EAs were added to Linux in version 2.6.
16.1 Overview
EAs are used to implement access control lists (Chapter 17) and file capabilities (Chapter 39). However, the design of EAs is general enough to allow them to be used for other purposes as well. For example, EAs could be used to record a file version number, information about the MIME type
JFS
XFS
Support for EAs is optional for each file system, and is controlled by kernel configuration options under the File systems menu. EAs are supported on Reiserfs since Linux 2.6.7.
File Attributes
15.7 Exercises
15-1.
about the permissions required for various file-system operations. Use shell commands or write programs to verify or answer the following:
a) Removing all owner permissions from a file denies the file owner access, even though group and other do have access.
b) On a directory with read permission but not execute permission, the names of files in the directory can be listed, but the files themselves can't be accessed, regardless of the permissions on them.
c) What permissions are required on the parent directory and the file itself in order to create a new file, open a file for reading, open a file for writing,
process umask while leaving it unchanged?
15-6.
The chmod a+rX file command enables read permission for all categories of user, and likewise enables execute permission for all categories of user if file is a directory or execute permission is enabled for any of the user categories for file, as demonstrated in the following example:
$ ls -ld dir file prog
dr-------- 2 mtk users 48 May 4 12:28 dir
-r-------- 1 mtk users 19794 May 4 12:22 file
-r-x------ 1 mtk users 19336 May 4 12:21 prog
$ chmod a+rX dir file prog
$ ls -ld dir file prog
dr-xr-xr-x 2 mtk users 48 May 4 12:28 dir
-r--r--r-- 1 mtk users 19794 May 4 12:22 file
-r-xr-xr-x 1 mtk users 19336 May 4 12:21 prog
Write a program that uses stat() and chmod() to perform the equivalent of chmod a+rX.
15-7.
Write a simple version of the chattr(1) command, which modifies file i-node flags. See the chattr(1)
Chapter 15
Within a program, i-node flags can
FS_NODUMP_FL
Don't include this file in backups made using dump(8). The effect of this flag is dependent on the -h option described in the dump(8) manual page.
FS_NOTAIL_FL
Disable tail packing. This flag is supported only on the Reiserfs file system. It disables the Reiserfs tail-packing feature, which tries to pack small files (and the final fragment of larger files) into the same disk block as the file
The various FL_* flags and their meanings are as follows:
FS_APPEND_FL
The file can be opened for writing only if the O_APPEND flag is specified (thus forcing all file updates to append to the end of the file). This flag could be used for a log file, for example. Only privileged (CAP_LINUX_IMMUTABLE) processes can set this flag.
FS_COMPR_FL
Store the file on disk in a compressed format. This feature is not implemented as a standard part of any of the major native Linux file systems. (There are packages that implement this feature for ext2 and ext3.) Given the low cost of disk storage, the CPU overhead involved in compression and decompression, and the fact that compressing a file means that it is no longer a simple matter to randomly access the file's contents (via lseek()), file compression is undesirable for many applications.
FS_DIRSYNC_FL (since Linux 2.6)
Make directory updates (e.g., open(pathname, O_CREAT), unlink(), and mkdir()) synchronous. This is analogous to the synchronous file update mechanism described in Section 13.3. As with synchronous file updates, there is a performance impact associated with synchronous directory
The first Linux file system to support i-node flags was ext2, and these flags are sometimes referred to as ext2 extended file attributes. Subsequently, support for i-node flags has been added on other file systems, including Reiserfs (since Linux 2.4.19), XFS (since Linux 2.4.25 and 2.6), and JFS (since Linux 2.6.17).
The range of i-node flags supported varies somewhat across file systems. In order to use i-node flags on a Reiserfs file system, we must use the mount -o attrs option when mounting the file system.
From the shell, i-node flags can be set and viewed using the chattr and lsattr commands, as shown in the following example:
$ lsattr myfile
-------- myfile
$ chattr +ai myfile          Turn on Append Only and Immutable flags
$ lsattr myfile
----ia-- myfile
Within a program, i-node flags can
In order to modify selected bits of the file
permissions, we firs
if (stat(MYDIR, &sb) == -1)
errExit("stat-%s", MYDIR);
printf("Requested dir. perms: %s\n", filePermStr(DIR_PERMS, 0));
printf("Process umask: %s\n", filePermStr(u, 0));
printf("Actual dir. perms: %s\n", filePermStr(sb.st_mode, 0));
if (unlink(MYFILE) == -1)
errMsg("unlink-%s", MYFILE);
if (rmdir(MYDIR) == -1)
errMsg("rmdir-%s", MYDIR);
exit(EXIT_SUCCESS);

files/t_umask.c
15.4.7 Changing File Permissions: chmod() and fchmod()
The chmod() and fchmod() system calls change the permissions of a file.
The chmod() system call changes the permissions of the file named in pathname. If this argument is a symbolic link, chmod() changes the permissions of the file to which it refers, rather than the permissions of the link itself. (A symbolic link is always created with read, write, and execute permissions enabled for all users, and these permissions can't be changed. These permissions are ignored when dereferencing the link.)
The fchmod() system call changes the permissions on the file referred to by the open file descriptor fd.
The mode argument specifies the new permissions of the file, either numerically (octal) or as a mask formed by ORing (|) the permission bits listed in Table 15-4. In order to change the permissions on a file, either the process must be privileged (CAP_FOWNER) or its effective user ID must match the owner (user ID) of the file. (To be strictly accurate, on Linux, for an unprivileged process, it is the process's file-system user ID, rather than its effective user ID, that must match the user ID of the file, as described in Section 9.5.)
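Because chmod() replaces the whole permissions mask, changing only some bits means reading the current mask first. The following is a minimal sketch of that read-modify-write pattern using stat() and chmod(); adjustPerms() is a hypothetical helper name, and the sketch ignores the window between the two calls in which another process could change the file.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Turn on the bits in 'set' and turn off the bits in 'clear' in the
   permissions of 'pathname', leaving all other permission bits as they
   are. Returns 0 on success, or -1 on error (errno is set by stat() or
   chmod()). (Hypothetical helper for illustration.) */
int
adjustPerms(const char *pathname, mode_t set, mode_t clear)
{
    struct stat sb;
    mode_t mode;

    if (stat(pathname, &sb) == -1)
        return -1;

    /* Keep only the 12 permission bits; drop the file-type bits */
    mode = ((sb.st_mode | set) & ~clear) & 07777;

    return chmod(pathname, mode);
}
```

For example, adjustPerms("myfile", S_IWUSR, S_IROTH) would turn on owner-write and turn off other-read, leaving the remaining bits untouched.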
Listing 15-5 illustrates the use of umask() in conjunction with open() and mkdir(). When we run this program, we see the following:
$ ./t_umask
Requested file perms: rw-rw----          This is what we asked for
Process umask:        ----wx-wx          This is what we are denied
Actual file perms:    rw-r-----          So this is what we end up with

Requested dir. perms: rwxrwxrwx
Process umask:        ----wx-wx
Actual dir. perms:    rwxr--r--
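The arithmetic behind this output is a single bit operation: the permissions actually assigned are the requested permissions with every umask bit turned off. The sketch below reproduces the calculation without creating any files; effectivePerms() is a hypothetical name used only here.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Return the permissions a new file or directory receives when
   'requested' is passed to open() or mkdir() while the process umask
   is 'mask'. Only the 9 basic permission bits are considered.
   (Hypothetical helper for illustration.) */
mode_t
effectivePerms(mode_t requested, mode_t mask)
{
    return requested & ~mask & 0777;
}
```

With the values from the session above, a request for rw-rw---- (0660) under a umask of ----wx-wx (033) gives 0640, and a request for rwxrwxrwx (0777) gives 0744.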
In Listing 15-5, we employ the mkdir() and rmdir() system calls to create and remove a directory, and the unlink() system call to remove a file. We describe these system calls in Chapter 18.
Listing 15-5: Using umask()

files/t_umask.c
#include <sys/stat.h>
#include <fcntl.h>
#include "file_perms.h"
#include "tlpi_hdr.h"
#define MYFILE "myfile"
#define MYDIR "mydir"
#define FILE_PERMS (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP)
#define DIR_PERMS (S_IRWXU | S_IRWXG | S_IRWXO)
A file's sticky permission bit is set via the chmod command (chmod +t file) or via the chmod() system call. If the sticky bit for a file is set, ls -l shows a lowercase or uppercase letter T in the other-execute permission
The problem is that if the pathname given to access() is a symbolic link, and a malicious user manages to change the link so that it refers to a different file before
other UNIX implementations, a privileged process can execute a file even when no permission category grants execute permission. When accessing a directory, a privileged process is always granted execute (search) permission.
We can rephrase our description of a privileged process in terms of two Linux process capabilities: CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE (Section 39.2). A process with the CAP_DAC_READ_SEARCH capability always has read permission for any type of file, and always has read and execute permissions for a directory (i.e., can always access files in a directory and read the list of files in a directory). A process with the CAP_DAC_OVERRIDE capability always has read and write permissions for any type of file, and also has execute permission if the file is a directory or if execute permission is granted to at least one of the permission categories for the file.
15.4.4 Checking File Accessibility: access()
As noted in Section 15.4.3, the effective user and group IDs, as well as supplemen-
The rules applied by the kernel when checking permissions are as follows:
1. If the process is privileged, all access is granted.
2. If the effective user ID of the process is the same as the user ID (owner) of the file, then access is granted according to the owner permissions on the file. For example, read access is granted if the owner-read permission bit is turned on in the file permissions mask; otherwise, read access is denied.
3. If the effective group ID of the process or any of the process's supplementary group IDs matches the group ID (group owner) of the file, then access is granted according to the group permissions on the file.
4. Otherwise, access is granted according to the other permissions on the file.
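The four rules can be written out as a small user-space model. This is only a sketch of the decision order, not the kernel's code: accessGranted() and its parameters are hypothetical, privilege is reduced to a single flag (ignoring the special cases for execute permission and capabilities), and the requested access is encoded as a combination of 4 (read), 2 (write), and 1 (execute).

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Model of the permission-checking rules. 'gids' holds the effective
   group ID plus any supplementary group IDs of the process. 'want' is
   a combination of 4 (read), 2 (write), and 1 (execute).
   (Hypothetical helper for illustration.) */
bool
accessGranted(bool privileged, uid_t euid, const gid_t *gids, size_t ngids,
        uid_t fileUid, gid_t fileGid, mode_t fileMode, int want)
{
    if (privileged)                             /* Rule 1 */
        return true;

    if (euid == fileUid)                        /* Rule 2: owner bits only */
        return (int) ((fileMode >> 6) & 7 & (mode_t) want) == want;

    for (size_t i = 0; i < ngids; i++)          /* Rule 3: group bits only */
        if (gids[i] == fileGid)
            return (int) ((fileMode >> 3) & 7 & (mode_t) want) == want;

    return (int) (fileMode & 7 & (mode_t) want) == want;   /* Rule 4 */
}
```

Because the first matching rule decides the outcome, an owner whose permissions are all turned off is denied access even if the group and other categories are granted it, since the checks stop at the owner rule.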
In the kernel code, the above tests are actually constructed so that the test to
15.4.2 Permissions on Directories
Directories have the same permission scheme as files. However, the three permis-
Listing 15-3: Header file for file_perms.c

files/file_perms.h
#ifndef FILE_PERMS_H
#define FILE_PERMS_H
#include <sys/types.h>
Execute: The file may be executed (i.e., it is a program or a script). In order to execute a script file (e.g., a bash script), both read and execute permissions are
The permissions and ownership of a file can be viewed using the command ls -l, as in the following example:
$ ls -l myscript.sh
-rwxr-x--- 1 mtk users 1667 Jan 15 09:22 myscript.sh
In the above example, the file permissions are displayed as rwxr-x--- (the initial hyphen preceding this string indicates the type of this file: a regular file). To interpret this string, we break these 9 characters
    } else {                            /* Turn group name into GID */
        gid = groupIdFromName(argv[2]);
        if (gid == -1)
            fatal("No such group (%s)", argv[2]);
    }

    /* Change ownership of all files named in remaining arguments */

    errFnd = FALSE;
    for (j = 3; j < argc; j++) {
        if (chown(argv[j], uid, gid) == -1) {
            errMsg("chown: %s", argv[j]);
            errFnd = TRUE;
        }
    }

    exit(errFnd ? EXIT_FAILURE : EXIT_SUCCESS);

files/t_chown.c
15.4 File Permissions
In this section, we describe the permission scheme applied to files and directories. Although we talk about permissions here mainly as they apply to regular files and directories, the rules that we describe apply to all types of files, including devices,
When changing the owner or group of a file, the set-group-ID permission bit is not turned off if the group-execute permission bit is already off or if we are changing the ownership of a directory. In both of these cases, the set-group-ID bit is being
The distinction between these three system calls is similar to the stat() family of system calls:
chown() changes the ownership of the file named in the pathname argument;
lchown() does the same, except that if pathname is a symbolic link, ownership of the link file is changed, rather than the file to which it refers; and
fchown() changes the ownership of a file referred to by the open file descriptor, fd.
The owner argument specifies the new user ID for the file, and the group argument specifies the new group ID for the file. To change just one of the IDs, we can specify -1 for the other argument to leave that ID unchanged.
Prior to Linux 2.2, chown() did not dereference symbolic links. The semantics of chown() were changed with Linux 2.2, and the new lchown() system call was added to provide the behavior of the old chown() system call.
Only a privileged (CAP_CHOWN) process may use chown() to change the user ID of a file. An unprivileged process can use chown() to change the group ID of a file that it owns (i.e., the process's effective user ID matches the user ID of the file) to any of the groups of which it is a member. A privileged process can change the group ID of a file to any value.
If the owner or group of a file is chan
15.3 File Ownership
Each file has an associated user ID (UID
If times is specified as NULL, then both file timestamps are updated to the current time. If times is not NULL, then the new last access timestamp is specified in times[0] and the new last modification timestamp is specified in times[1]. Each of the elements of the array times is a structure of the following form:
struct timespec {
    time_t tv_sec;      /* Seconds ('time_t' is an integer type) */
    long   tv_nsec;     /* Nanoseconds */
};
The fields in this structure specify a time in seconds and nanoseconds since the Epoch (Section 10.1).
With futimes(), the file is specified via an open file descriptor, fd.
With lutimes(), the file is specified via a pathname, with the difference from utimes() that if the pathname refers to a symbolic link, then the link is not dereferenced; instead, the timestamps of the link itself are changed.
The futimes() function is supported since glibc 2.3. The lutimes() function is supported since
15.2.2 Changing File Timestamps with utimensat() and futimens()
The utimensat() system call (supported since kernel 2.6.22) and the futimens() library function (supported since glibc
rent time.
These interfaces are not specified in SUSv3, but are included in SUSv4.
The utimensat() system call updates the timestamps of the file specified by pathname to the values specified in the array times.
#include <sys/time.h>
int futimes(int fd, const struct timeval tv[2]);
int lutimes(const char *pathname, const struct timeval tv[2]);

#define _XOPEN_SOURCE 700     /* Or define _POSIX_C_SOURCE >= 200809 */
#include <sys/stat.h>
int utimensat(int dirfd, const char *pathname, const struct timespec times[2], int flags);
                                        Returns 0 on success, or -1 on error
(To be accurate, on Linux, it is the process's file-system user ID, rather than its effective user ID, that is checked against the file's user ID, as described in
If buf is specified as pointer to a utimbuf structure, then the last file access and modification times are updated using the corresponding fields of this struc-
of the process must match the file's user ID (having write permission on the file is not sufficient) or the caller must be privileged (CAP_FOWNER
To change just one of the file timestamps, we first use
Nanosecond timestamps
With version 2.6, Linux supports nanosecond resolution for the three timestamp fields of the stat structure. Nanosecond resolution improves the accuracy of programs that need to make decisions based on the relative order of file timestamps (e.g., make(1)).
SUSv3 doesn't specify nanosecond timestamps for the stat structure, but SUSv4 adds this specification.
Not all file systems support nanosecond timestamps.
Btrfs
Reiserfs
Under the glibc API (since version 2.3), the timestamp fields are each defined as a timespec structure (we describe this structure when we discuss utimensat() later in this section), which represents a time in seconds and nanoseconds components. Suitable macro definitions make the seconds component of these structures visible using the traditional field names (st_atime, st_mtime, and st_ctime). The nanosecond components can be accessed using field names such as st_atim.tv_nsec, for the nanosecond component of the last file access time.
15.2.1 Changing File Timestamps with utime() and utimes()
The last file access and modification timestamps stored in a file i-node can be explicitly changed using utime() or one of a related set of system calls. Programs
such as
and
use these system calls to
In Sections 14.8.1 and 15.5, we describe mount(2) options and per-file flags that prevent updates to the last access time of a file. The open() O_NOATIME flag described in Section 4.3.1 also serves a similar purpose. In some applications, this can be useful for performance reasons, since it reduces the number of disk operations that are required when a file is accessed.
Although most UNIX systems don't record the creation time of a file, on recent BSD systems, this time is recorded in a field named
Table 15-2:
Effect of various functions on file timestamps
Function
File or
Parent
amcamc
fc
chown()
lchown()
fchown
exec
xxx
Affects parent directory
of second argument
mkdir()
xxxxx
mkfifo
xxxxx
mknod()
xxxxx
mmap
st_mtime
are changed only on updates
MAP_SHARED
msync()
Changed only if file is modified
open()
creat()
xxxxx
When creating new file
open()
creat()
When truncating existing file
pipe
xxx
re
prea
, and
re
may buffer directory entries; timestamps
updated only if directory is read
remo
fremov
lremovexattr()
rename
Affects timestamps in both parent directories; SUSv3 doesn't specify file st_ctime change, but notes that some implementations do this
rmdir()
sendfile()
Timestamp changed for input file
    printf("Last file access:         %s", ctime(&sb->st_atime));
    printf("Last file modification:   %s", ctime(&sb->st_mtime));
    printf("Last status change:       %s", ctime(&sb->st_ctime));

int
main(int argc, char *argv[])
{
    struct stat sb;
    Boolean statLink;           /* True if "-l" specified (i.e., use lstat) */
    int fname;                  /* Location of filename argument in argv[] */

    statLink = (argc > 1) && strcmp(argv[1], "-l") == 0;
                                /* Simple parsing for "-l" */
    fname = statLink ? 2 : 1;

    if (fname >= argc || (argc > 1 && strcmp(argv[1], "--help") == 0))
        usageErr("%s [-l] file\n"
                "        -l = use lstat() instead of stat()\n", argv[0]);

    if (statLink) {
        if (lstat(argv[fname], &sb) == -1)
            errExit("lstat");
    } else {
        if (stat(argv[fname], &sb) == -1)
            errExit("stat");
    }

    displayStatInfo(&sb);

    exit(EXIT_SUCCESS);
}
files/t_stat.c
15.2 File Timestamps
The st_atime, st_mtime, and st_ctime fields of the stat structure contain file timestamps. These fields record, respectively, the times of last file access, last file modification, and last file status change (i.e., last change to the file's i-node information). Timestamps are recorded in seconds since the Epoch (1 January 1970; see Section 10.1).
Most native Linux and UNIX file systems support all of the timestamp fields, but some non-UNIX file systems may not.
Table 15-2 summarizes which of the timestamp fields (and in some cases, the analogous fields in the parent directory) are changed by various system calls and library functions described in this book. In the headings of this table, a, m, and c represent the st_atime, st_mtime, and st_ctime fields, respectively. In most cases, the
Listing 15-1:
The st_blocks field records the number of disk blocks actually allocated. If the file contains holes (Section 4.7), this will be smaller than might be expected from the corresponding number of bytes (st_size) in the file. (The disk usage command, du -k file, displays the actual space allocated for a file, in kilobytes; that is, a figure calculated from the st_blocks value for the file, rather than the st_size value.)
The st_blksize field is somewhat misleadingly named. It is not the block size of the underlying file system, but rather the optimal block size (in bytes) for I/O on files on this file system. I/O in blocks smaller than this size is less efficient (refer to
The full set of file-type macros (defined in <sys/stat.h>) is shown in Table 15-1. All of the file-type macros in Table 15-1 are specified in SUSv3 and appear on Linux. Some other UNIX implementations define additional file types (e.g., S_IFDOOR, for door files on Solaris). The type S_IFLNK
In order to obtain the definitions of S_IFSOCK and S_ISSOCK() from <sys/stat.h>, we must either define the _BSD_SOURCE feature test macro or define _XOPEN_SOURCE with a value greater than or equal to 500. (The rules have varied somewhat across glibc versions: in some cases, _XOPEN_SOURCE must be defined with a value of 600 or greater.)
The bottom 12 bits of the st_mode field define the permissions for the file. We describe the file permission bits in Section 15.4. For now, we simply note that the 9 least significant of the permission bits are the read, write, and execute permissions for each of the categories owner, group, and other.
File size, blocks allocated, and optimal I/O block size
For regular files, the st_size field is the total size of the file in bytes. For a symbolic link, this field contains the length (in bytes) of the pathname pointed to by the link. For a shared memory object (Chapter 54), this field contains the size of the object.
The st_blocks field indicates the total number of blocks allocated to the file, in 512-byte block units. This total includes space allocated for pointer blocks (see Figure 14-2, on page 258). The choice of the 512-byte unit of measurement is historical: this is the smallest block size on any of the file systems that have been implemented under UNIX. More modern file systems use larger logical block sizes. For example, under ext2, the value in st_blocks is always a multiple of 2, 4, or 8,
Device IDs and i-node number
The st_dev field identifies the device on which the file resides. The st_ino field contains the i-node number of the file. The combination of st_dev and st_ino uniquely identifies a file across all file systems. The dev_t type records the major and minor IDs of a device (Section 14.1).
If this is the i-node for a device, then the st_rdev field contains the major and minor IDs of the device.
The major and minor IDs of a dev_t value can be extracted using two macros: major() and minor(). The header file required to obtain the declarations of these two macros varies across UNIX implementations. On Linux, they are exposed by <sys/types.h> if the _BSD_SOURCE macro is defined.
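As a minimal sketch of these macros in use, the following helper reports the device on which a file resides. The function name deviceOfFile() is hypothetical, and the include of <sys/sysmacros.h> is an assumption about the build environment: recent glibc versions expose major() and minor() there rather than through <sys/types.h>.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* Assumption: major() and minor() live here */
#include <sys/types.h>

/* Look up the device holding 'pathname' and split its ID into major and
   minor parts. Returns 0 on success, or -1 on error (errno set by stat()). */
int
deviceOfFile(const char *pathname, unsigned int *maj, unsigned int *min)
{
    struct stat sb;

    if (stat(pathname, &sb) == -1)
        return -1;

    *maj = major(sb.st_dev);
    *min = minor(sb.st_dev);
    return 0;
}
```

For a device special file, the same macros would be applied to st_rdev instead of st_dev.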
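[Figure: layout of the st_mode bit mask, showing the file type bits followed by the permissions bits, with R, W, and X bits for each category (labels include Group and Other)]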
280
Chapter 15
These three system calls differ only in the way that the file is specified:
FILE ATTRIBUTES
In this chapter, we investigate various a
278
Chapter 14
various information about the file, including its type, size, link count, ownership, permissions, timestamps, and pointers to the file's data blocks.
Linux provides a range of journaling file systems. A journaling file system records metadata updates (and optionally on some file systems, data updates) to a log file before the actual file updates are performed. This means that in the event of a system crash, the log file can be replayed to quickly restore the file system to a consistent state. The key benefit of journaling file systems is that they avoid the lengthy file-system consistency checks required by conventional UNIX file systems after a system crash.
All file systems on a Linux system are mounted under a single directory tree, with the directory / at its root. The location at which a file system is mounted in the directory tree is called its mount point.
A privileged process can mount and unmount a file system using the mount() and umount() system calls. Information about a mounted file system can be
File Systems
277
Many native UNIX and Linux file systems support the notion of reserving a certain portion of the blocks of a file system for the superuser, so that if the file system fills up, the superuser can still log in to the system and do some work to resolve the problem. If there are reserved blocks in the file system, then the difference in values of the f_bfree and f_bavail fields in the statvfs structure tells us how many blocks are reserved.
The f_flag field is a bit mask of the flags used to mount the file system; that is, it contains information similar to the mountflags argument given to mount(2). However, the constants used for the bits in this field have names starting with ST_ instead of the MS_ used for mountflags. SUSv3 requires only the ST_RDONLY and ST_NOSUID constants, but the glibc implementation supports a full range of constants with names corresponding to the MS_* constants described for the mount() mountflags argument.
f_fsid
field is used on some UNIX
mation by scanning
/proc/mounts
14.11 Obtaining Information About a File System: statvfs()
The statvfs() and fstatvfs() library functions obtain information about a mounted file system.
The only difference between these two functions is in how the file system is identified. For statvfs(), we use pathname to specify the name of any file in the file system. For fstatvfs(), we specify an open file descriptor, fd, referring to any file in the file system.
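A minimal sketch of statvfs() in use follows. It relies on the f_bfree, f_bavail, and f_flag fields and the ST_RDONLY constant discussed in this section; the helper names reservedBlocks() and isMountedReadOnly() are hypothetical.

```c
#include <sys/statvfs.h>

/* Store in '*reserved' the number of blocks reserved for privileged
   processes on the file system containing 'pathname'.
   Returns 0 on success, or -1 on error. */
int
reservedBlocks(const char *pathname, unsigned long long *reserved)
{
    struct statvfs sv;

    if (statvfs(pathname, &sv) == -1)
        return -1;

    /* Guard against file systems that report f_bavail above f_bfree */
    *reserved = (sv.f_bfree > sv.f_bavail) ?
                (unsigned long long) (sv.f_bfree - sv.f_bavail) : 0;
    return 0;
}

/* Return 1 if the file system holding 'pathname' is mounted read-only,
   0 if it is not, or -1 on error. */
int
isMountedReadOnly(const char *pathname)
{
    struct statvfs sv;

    if (statvfs(pathname, &sv) == -1)
        return -1;
    return (sv.f_flag & ST_RDONLY) != 0;
}
```

Both helpers accept the name of any file within the file system of interest, so "/" is enough to query the root file system.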
Various memory-based file systems have been developed for Linux. The most sophisticated of these to date is the tmpfs file system, which first appeared in Linux 2.4. The tmpfs file system differs from other memory-based file systems in that it is a virtual memory file system. This means that tmpfs uses not only RAM, but also the swap space, if RAM is exhausted. (Although the tmpfs file system described here is Linux-specific, most UNIX implementations provide some form of memory-based file system.)
The tmpfs file system is an optional Linux kernel component that is configured via the CONFIG_TMPFS option.
To create a tmpfs file system, we use a command of the following form:
mount -t tmpfs
source
We begin by creating a directory tree (src1) mounted under top. This tree includes a submount (src2) at top/sub.
Password:
mkdir top                       # This is our top-level mount point
mkdir src1                      # We'll mount this under top
touch src1/aaa
mount --bind src1 top           # Create a normal bind mount
mkdir top/sub                   # Create directory for a submount under top
mkdir src2                      # We'll mount this under top/sub
touch src2/bbb
mount --bind src2 top/sub       # Create a normal bind mount
find top                        # Verify contents under top mount tree
top
top/aaa
top/sub                         # This is the submount
top/sub/bbb
Now we create another bind mount (dir1) using top as the source. Since this new mount is nonrecursive, the submount is not replicated.
mount --bind top dir1           # Here we use a normal bind mount
find dir1
dir1
dir1/aaa
dir1/sub
The absence of dir1/sub/bbb in the output of find shows that the submount top/sub was not replicated.
Now we create a recursive bind mount (dir2) using top as the source.
mount --rbind top dir2
find dir2
dir2
dir2/aaa
dir2/sub
dir2/sub/bbb
The presence of dir2/sub/bbb in the output of find shows that the submount top/sub was replicated.
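The two kinds of bind mount shown in the shell session can also be requested from a program. The following is a minimal sketch, assuming that the MS_BIND and MS_REC flags of mount(2) correspond to the --bind and --rbind options; the wrapper name bindMount() is hypothetical, and the call needs privilege to succeed.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Make the tree at 'source' visible at 'target'. If 'recursive' is nonzero,
   submounts under 'source' are replicated as well.
   Returns 0 on success, or -1 with errno set. */
int
bindMount(const char *source, const char *target, int recursive)
{
    unsigned long flags = MS_BIND;

    if (recursive)
        flags |= MS_REC;

    /* The fstype and data arguments are ignored for a bind mount */
    return mount(source, target, NULL, flags, NULL);
}
```

With this sketch, bindMount("top", "dir1", 0) would mirror the nonrecursive case above and bindMount("top", "dir2", 1) the recursive one.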
14.10 A Virtual Memory File System: tmpfs
All of the file systems we have described so far in this chapter reside on disks. However, Linux also supports the notion of virtual file systems that reside in memory. To applications, these look just like any other file system: the same operations (open(), read(), mkdir(), and so on) can be applied to files and directories in such file systems. There is, however, one important difference: file operations are much faster, since no disk access is involved.
We can create a bind mount from the shell using the --bind option to mount(8), as shown in the following examples.
In the first example, we bind mount a directory at another location and show that files created in one directory are visible at the other location:
su                              # Privilege is required to use mount(8)
Password:
/testfs
mkdir d1
Create directory to be bound at another location
touch d1/x

Create file in the directory

Create mount point to which
will be bound
mount --bind d1 d2

Create bind mount:
visible via

Verify that we can see contents of

Create second file in directory

Verify that this change is visible via
x y
In the second example, we bind mount a file at another location and demonstrate that changes to the file via one mount are visible via the other mount:
cat > f1                        # Create file to be bound to another location
touch /testfs/newfile           # Create a file in this subtree
ls /testfs                      # View files in this subtree
newfile
umount /testfs                  # Pop a mount from the stack
mount | grep testfs
/dev/sda12 on /testfs type ext3 (rw)    # Now only one mount on /testfs
ls /testfs                      # Previous mount is now visible
lost+found myfile
One use of mount stacking is to stack a new mount on an existing mount point that is busy. Processes that hold file descriptors open, that are chroot()-jailed, or that have current working directories within the old mount point continue to operate under that mount, but processes making new accesses to the mount point use the new mount. Combined with a MNT_DETACH unmount, this can provide a smooth migration off a file system without needing to take the system into single-user mode. We'll see another example of how stacking mounts is useful when we discuss the tmpfs file system in Section 14.10.
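A MNT_DETACH unmount of the kind mentioned above could be requested from a program roughly as follows. This is a sketch only: it assumes the flag is passed to the Linux umount2() system call, and the wrapper name lazyUnmount() is hypothetical.

```c
#include <sys/mount.h>

/* Perform a lazy unmount of the mount point 'target': the mount point is
   made unavailable for new accesses, and the actual unmount happens once
   the mount is no longer busy.
   Returns 0 on success, or -1 with errno set. */
int
lazyUnmount(const char *target)
{
    return umount2(target, MNT_DETACH);
}
```

Like mount(), this call requires privilege, so an unprivileged caller should expect it to fail.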
14.9.3 Mount Flags That Are Per-Mount Options
In kernel versions before 2.4, there was a one-to-one correspondence between file systems and mount points. Because this no longer holds in Linux 2.4 and later, some of the mountflags values described in Section
14.9 Advanced Mount Features
We now look at a number of more advanced features that can be employed when mounting file systems. We demonstrate the use of most of these features using the mount(8) command. The same effects can also be accomplished from a program via calls to mount(2).
14.9.1 Mounting a File System at Multiple Mount Points
In kernel versions before 2.4, a file system could be mounted only on a single mount point. From kernel 2.4 onward, a file system can be mounted at multiple locations within the file system. Because each of the mount points shows the same subtree, changes made via one mount point are visible through the other(s), as demonstrated by the following shell session:
su                              # Privilege is required to use mount(8)
Password:
mkdir /testfs                   # Create two directories for mount points
mount /dev/sda12 /testfs        # Mount file system at one mount point
mount /dev/sda12 /demo          # Mount file system at second mount point
mount | grep sda12
previously visible at that mount point.
When the mount at the top of the stack is unmounted, the previously hidden mount becomes visible once more, as demonstrated by the following shell session:
su                              # Privilege is required to use mount(8)
Password:
mount /dev/sda12 /testfs        # Create first mount on /testfs
touch /testfs/myfile            # Make a file in this subtree
mount /dev/sda13 /testfs        # Stack a second mount on /testfs
mount | grep testfs
On Linux 2.2 and earlier, the file system can be identified in two ways: by the mount point or by the name of the device containing the file system. Since kernel 2.4, Linux doesn't allow the latter possibility, because a single file system can now be mounted at multiple locations, so that specifying a file system
        case 'o':                       /* File-system-specific options */
            data = optarg;
            break;

        case 't':                       /* File-system type */
            fstype = optarg;
            break;

        case 'f':                       /* One letter per mount flag */
            for (j = 0; j < strlen(optarg); j++) {
                switch (optarg[j]) {
                case 'b': flags |= MS_BIND;             break;
                case 'd': flags |= MS_DIRSYNC;          break;
                case 'l': flags |= MS_MANDLOCK;         break;
                case 'm': flags |= MS_MOVE;             break;
                case 'A': flags |= MS_NOATIME;          break;
                case 'V': flags |= MS_NODEV;            break;
                case 'D': flags |= MS_NODIRATIME;       break;
                case 'E': flags |= MS_NOEXEC;           break;
                case 'S': flags |= MS_NOSUID;           break;
                case 'r': flags |= MS_RDONLY;           break;
                case 'c': flags |= MS_REC;              break;
                case 'R': flags |= MS_REMOUNT;          break;
                case 's': flags |= MS_SYNCHRONOUS;      break;
                default:  usageError(argv[0], NULL);
                }
            }
            break;

        default:
            usageError(argv[0], NULL);
        }
    }

    /* Exactly two arguments must remain: the source and the target */
    if (argc != optind + 2)
        usageError(argv[0], "Wrong number of arguments\n");

    if (mount(argv[optind], argv[optind + 1], fstype, flags, data) == -1)
        errExit("mount");

    exit(EXIT_SUCCESS);
}
filesys/t_mount.c
14.8.2 Unmounting a File System: umount()
The umount() system call unmounts a mounted file system.
#include <sys/mount.h>
int umount(const char *target);
The target argument specifies the mount point of the file system to be unmounted.
Finally, we move the mount point to a new location within the directory hierarchy:
./t_mount -f m /testfs /demo
cat /proc/mounts | grep sda12   # Verify change
/dev/sda12 /demo ext3 ro 0
Listing 14-1: Using mount()
filesys/t_mount.c
#include <sys/mount.h>
#include "tlpi_hdr.h"

static void
usageError(const char *progName, const char *msg)
{
    if (msg != NULL)
        fprintf(stderr, "%s", msg);
working directory located within, the file system (this will always be true of the root file system). Another example of where we need to use MS_REMOUNT is with tmpfs (memory-based) file systems (Section 14.10), which can't be unmounted without losing their contents. Not all mountflags are modifiable; see the mount(2) manual page.
The new flags are MS_PRIVATE, MS_SHARED, MS_SLAVE, and MS_UNBINDABLE. (These flags can be used in conjunction with MS_REC to propagate their effects to all of the submounts under a mount subtree.) Shared subtrees are designed for use with certain advanced file-system features, such as per-process mount namespaces (see the description of CLONE_NEWNS in Section 28.2.1), and the Filesystem in Userspace (FUSE) facility. The shared subtree facility permits file-system mounts to be propagated between mount namespaces in a controlled fashion. Details on shared subtrees can be found in the kernel source code file Documentation/filesystems/sharedsubtree.txt and [Viro & Pai, 2006].
Example program
The program in Listing 14-1 provides a command-level interface to the mount(2) system call. In effect, it is a crude version of the mount(8) command. The following shell session log demonstrates the use of this program. We begin by creating a directory to be used as a mount point and mounting a file system:
su                              # Need privilege to mount a file system
Password:
mkdir /testfs
./t_mount -t ext2 -o bsdgroups /dev/sda12 /testfs
cat /proc/mounts | grep sda12   # Verify the setup
/dev/sda12 /testfs ext3 rw 0 0
MS_NODEV
Don't allow access to block and character devices on this file system. This is a security feature designed to prevent users from doing things such as inserting a removable disk containing device special files that would allow arbitrary access to the system.
MS_NODIRATIME
Don't update the last access time for directories on this file system. (This flag provides a subset of the functionality of MS_NOATIME, which prevents updates to the last access time for all file types.)
MS_NOEXEC
Don't allow programs (or scripts) to be executed from this file system. This is useful if the file system contains non-Linux executables.
MS_NOSUID
MS_RELATIME (since Linux 2.6.20)
Update the last access timestamp for files on this file system only if the cur-
The mountflags argument is a bit mask of flags that modify the operation of mount(). Zero or more of the following flags can be specified in mountflags:
MS_BIND (since Linux 2.4)
Create a bind mount. We describe this feature in Section 14.9.4. If this flag is specified, then the fstype, mountflags, and data arguments are ignored.
MS_DIRSYNC (since Linux 2.6)
Make directory updates synchronous. This is similar to the effect of the open() O_SYNC flag (Section 13.3), but applies only to directory updates. The MS_SYNCHRONOUS flag described below provides
ing at a different location, except that there is no point in time when the subtree is unmounted. The source argument should be a string specified as
14.8.1 Mounting a File System: mount()
The mount() system call mounts the file system contained on the device specified by source under the directory (the mount point) specified by target.
The mountflags argument is a bit mask constructed by ORing (|) zero or more
Before looking at these system calls, it is useful to know about three files that contain information about the file systems that are currently mounted or can be
A list of the currently mounted file systems can be read from the Linux-specific /proc/mounts virtual file. /proc/mounts is an interface to kernel data structures, so it always contains accurate information about mounted file systems.
With the arrival of the per-process mount namespace feature mentioned earlier, each process now has a /proc/PID/mounts file that lists the mount points constituting its mount namespace, and /proc/mounts is just a symbolic link to /proc/self/mounts.
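One way a program might read such a mount table is with the glibc getmntent() family of functions. This is offered as a hedged sketch rather than as the approach used in this chapter: the helper name listMounts() is hypothetical, and the table path is passed in so that any file in the same format (for example, /proc/mounts) can be scanned.

```c
#include <mntent.h>
#include <stdio.h>

/* Print one line per entry in the mount table at 'tablePath' and return
   the number of entries, or -1 if the table can't be opened. */
int
listMounts(const char *tablePath)
{
    FILE *fp;
    struct mntent *ent;
    int count = 0;

    fp = setmntent(tablePath, "r");
    if (fp == NULL)
        return -1;

    while ((ent = getmntent(fp)) != NULL) {
        printf("%s on %s type %s (%s)\n", ent->mnt_fsname, ent->mnt_dir,
                ent->mnt_type, ent->mnt_opts);
        count++;
    }

    endmntent(fp);                      /* Also closes the underlying stream */
    return count;
}
```

Calling listMounts("/proc/mounts") produces output in roughly the style of the mount command run with no arguments.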
The mount(8) and umount(8) commands automatically maintain the file
To list the currently mounted file systems, we can use the command mount, with no arguments, as in the following example (whose output has been somewhat abridged):
/dev/sda6 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sda8 on /home type ext3 (rw,acl,user_xattr)
/dev/sda1 on /windows/C type vfat (rw,noexec,nosuid,nodev)
/dev/sda9 on /home/mtk/test type reiserfs (rw)
Figure 14-4 shows a partial directory and file structure for the system on which the mount command was performed. This diagram shows how the mount points
Figure 14-4: Example directory hierarchy showing file-system mount points (diagram labels: mount points; sda6, sda9, sda8, and sda1 file systems)
14.8 Mounting and Unmounting File Systems
The mount() and umount() system calls allow a privileged (CAP_SYS_ADMIN) process to mount and unmount file systems. Most UNIX implementations provide versions of these system calls. However, they are not standardized by SUSv3, and their operation varies both across UNIX implementations and across file systems.
JFS was developed at IBM. It was integrated into the 2.4.20 kernel.
XFS (http://oss.sgi.com/projects/xfs/) was originally developed by Silicon Graphics (SGI) in the early 1990s for Irix, it
Btrfs (B-tree FS, usually pronounced "butter FS") is a new file system designed from the ground up to provide a range of modern features, including extents, writable snapshots (which provide functionality
overall hierarchy. The superuser uses a command of the following form to mount a file system:
mount device directory
This command attaches the file system on the named device into the directory hierarchy at the specified directory, the file system's mount point. It is possible to change the location at which a file system is mounted: the file system is unmounted using the umount command, and then mounted once more at a different point.
With Linux 2.4.19 and later, things became more complicated. The kernel now supports per-process mount namespaces. This means that each process
14.6 Journaling File Systems
The ext2 file system is a good example of a traditional UNIX file system, and suffers from a classic limitation of such file systems: after a system crash, a file-system consistency check (fsck) must be performed on reboot in order to ensure the integrity of the file system. This is necessary because, at the time of the system crash, a file update may have been only partially
One other benefit conferred by this design is that files can have holes, as described in Section 4.7. Rather than allocate blocks of null bytes for the holes in a file, the file system can just mark (with the value 0) appropriate pointers in the i-node and in the indirect pointer blocks to indicate that they don't refer to actual disk blocks.
14.5 The Virtual File System (VFS)
Each of the file systems available on Linux differs in the details of its implementation. Such differences include, for example, the way in which the blocks of a file are allocated and the manner in which directories are organized. If every program that worked with files needed to understand the specific details of each file system, the task of writing programs that worked with all of the different file systems would be nearly impossible. The virtual file system (VFS, sometimes also referred to as the virtual file switch) is a kernel feature that resolves this problem by creating an abstraction layer for file-system operations (see Figure 14-3). The ideas behind the VFS are straightforward:
- The VFS defines a generic interface for file-system operations. All programs that work with files specify their operations in terms of this generic interface.
- Each file system provides an implementation for the VFS interface.
Under this scheme, programs need to understand only the VFS interface and can ignore details of individual file-system implementations.
The VFS interface includes operations corresponding to all of the usual system calls for working with file systems and directories, such as open(), read(), write(), close(), mount(), mmap(), mkdir(), unlink(), and rename().
The VFS abstraction layer is closely modeled on the traditional UNIX file-system model. Naturally, some file systems, especially non-UNIX file systems, don't support all of the VFS operations (e.g., Microsoft's VFAT doesn't support the notion of symbolic links, created using symlink()). In such cases, the underlying file system passes an error code back to the VFS layer indicating the lack of support, and the VFS in turn passes this error code back to the application.
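The idea of a generic interface with per-file-system implementations can be illustrated with a table of function pointers. The sketch below is purely hypothetical and far simpler than the real kernel structures: the type struct fsOps, the two sample file systems, and the choice of EPERM as the "not supported" error are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with a single operation; a real VFS
   interface covers many more operations. */
struct fsOps {
    const char *name;
    int (*symlink)(const char *target, const char *linkpath);
};

/* Stand-in implementation for a file system that supports symbolic links */
static int
ext2LikeSymlink(const char *target, const char *linkpath)
{
    (void) target;
    (void) linkpath;
    return 0;                           /* Pretend the link was created */
}

const struct fsOps ext2Like = { "ext2-like", ext2LikeSymlink };
const struct fsOps vfatLike = { "vfat-like", NULL };    /* No symlink support */

/* Generic entry point: dispatch to the file system's implementation, or
   pass back an error code if the operation isn't provided. */
int
vfsSymlink(const struct fsOps *fs, const char *target, const char *linkpath)
{
    if (fs->symlink == NULL)
        return -EPERM;                  /* Assumed error for "unsupported" */
    return fs->symlink(target, linkpath);
}
```

A caller uses only vfsSymlink() and never needs to know which implementation, if any, sits behind the table.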
Figure 14-3: The virtual file system (diagram labels: Virtual File System (VFS); ext2, Reiserfs, VFAT, NFS)
Figure 14-2: Structure of file blocks for a file in an ext2 file system
As an example, the author measured one system containing somewhat more than 150,000 files. Just over 30% of the files were less than 1000 bytes in size, and 80% occupied 10,000 bytes or less. Assuming a 1024-byte block size, all of the latter files could be referenced using just the 12 direct pointers, which can refer to blocks containing a total of 12,288 bytes. Using a 4096-byte block size, this limit rises to 49,152 bytes (95% of the files on the system fell under that limit).
This design also allows for enormous file sizes; for a block size of 4096 bytes, the
[Figure 14-2 labels: i-node entry containing other file information, direct pointers to file blocks, and pointers to indirectly addressed file blocks. Key: DB = Data block; IPB = Indirect pointer block; 2IPB = Double IPB; 3IPB = Triple IPB. Note: not all blocks are shown.]
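The figures quoted for the direct pointers (12,288 and 49,152 bytes) can be checked with a short calculation. The sketch below also estimates the theoretical maximum file size reachable through the single, double, and triple indirect pointer blocks, assuming 4-byte block pointers (an assumption not stated in the surrounding text) and ignoring any other limits a real file system imposes. The function names are hypothetical.

```c
#define DIRECT_POINTERS 12      /* Number of direct pointers, from the text */
#define POINTER_SIZE    4       /* Assumption: each block pointer is 4 bytes */

/* Bytes addressable using only the direct pointers */
unsigned long long
directBytes(unsigned long long blockSize)
{
    return DIRECT_POINTERS * blockSize;
}

/* Theoretical bytes addressable using the direct pointers plus the single,
   double, and triple indirect pointer blocks */
unsigned long long
maxFileBytes(unsigned long long blockSize)
{
    unsigned long long n = blockSize / POINTER_SIZE;    /* Pointers per block */
    unsigned long long blocks = DIRECT_POINTERS + n + n * n + n * n * n;

    return blocks * blockSize;
}
```

Under these assumptions, directBytes(1024) is 12,288, directBytes(4096) is 49,152, and maxFileBytes(4096) is 4,402,345,721,856 bytes, slightly more than 4 TB.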
- Three timestamps: time of last access to the file (shown by ls -lu), time of last modification of the file (the default time shown by ls -l), and time of last status change (last change to i-node information, shown by ls -lc). As on other UNIX implementations, it is notable that most Linux file systems don't record the creation time of a file.
- Number of hard links to the file.
- Size of the file in bytes.
- Number of blocks actually allocated to the file, measured in units of 512-byte blocks. There may not be a simple correspondence between this number and the size of the file in bytes, since a file can contain holes (Section 4.7), and thus require fewer allocated blocks than would be expected according to its nominal size in bytes.
- Pointers to the data blocks of the file.
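The relationship between the size in bytes and the count of allocated 512-byte blocks suggests a simple heuristic for spotting files with holes. The following is a sketch under that assumption; the function name looksSparse() is hypothetical, and a result of 1 suggests, but doesn't prove, that the file contains holes.

```c
#include <sys/stat.h>

/* Return 1 if fewer 512-byte blocks are allocated than the nominal size of
   the file would require, or 0 otherwise. */
int
looksSparse(const struct stat *sb)
{
    long long allocated = (long long) sb->st_blocks * 512;

    return allocated < (long long) sb->st_size;
}
```

The structure passed in would normally be filled by stat() or fstat().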
I-nodes and data block pointers in ext2
Like most UNIX file systems, the ext2 file system doesn't store the data blocks of a file contiguously or even in sequential order (though it does attempt to store them close to one another). To locate the file
A file system contains the following parts:
- Boot block: This is always the first block in a file system. The boot block is not used by the file system; rather, it contains information used to boot the operating system. Although only one boot block is needed by the operating system, all file systems have a boot block (most of which are unused).
- Superblock: This is a single block, immediately following the boot block, which
The file-system types currently known by the kernel can be viewed in the Linux-specific /proc/filesystems file.
Linux 2.6.14 added the Filesystem in Userspace (FUSE) facility. This mechanism adds hooks to the kernel that allow a fi
[Figure labels: a disk divided into partitions; a file system within a partition, consisting of boot block, superblock, i-node table, and data blocks]
Each disk is divided into one or more (nonoverlapping) partitions. Each partition is treated by the kernel as a separate device residing under the /dev directory.
[Kroah-Hartman, 2003] provides an overview of udev, and outlines the reasons it is considered superior to devfs, the Linux 2.4 solution to the same problems. Information about the sysfs file system can be found in the Linux 2.6 kernel source file Documentation/filesystems/sysfs.txt and in [Mochel, 2005].
Device IDs
Each device file has a major ID number and a minor ID number. The major ID identifies the general class of device, and is used by the kernel to look up the appropriate driver for this type of device. The minor ID uniquely identifies a particular device within a general class. The major and minor IDs of a device file are displayed by the ls -l command.
A device's major and minor IDs are recorded in the i-node for the device file. (We describe i-nodes in Section 14.4.) Each device driver registers its association with a specific major device ID, and this association provides the connection between the device special file and the device driver. The name of the device file has no relevance when the kernel looks for the device driver.
On Linux 2.4 and earlier, the total number of devices on the system is limited by the fact that device major and minor IDs are each represented using just 8 bits. The fact that major device IDs are fixed and centrally assigned (by the Linux Assigned Names and Numbers Authority; see http://www.lanana.org/) exacerbates this limitation. Linux 2.6 eases this limitation by using more bits to hold the major and minor device IDs (respectively, 12 and 20 bits).
14.2 Disks and Partitions
Regular files and directories typically reside on hard disk devices. (Files and directories may also exist on other devices, such as CD-ROMs, flash memory cards, and virtual disks, but for the present discussion, we are interested primarily in hard disk devices.) In the following sections, we look at how disks are organized and divided into partitions.
Disk drives
A hard disk drive is a mechanical device consisting of one or more platters that rotate at high speed (of the order of thou
14.1 Device Special Files (Devices)
This chapter frequently mentions disk devices, so we start with a brief overview of the concept of a device file.
A device special file corresponds to a device on the system. Within the kernel, each device type has a corresponding device driver, which handles all I/O requests for the device. A device driver is a unit of kernel code that implements a set of operations that (normally) correspond to input and output actions on an associated piece of hardware. The API provided by device drivers is fixed, and includes operations corresponding to system calls such as write() and ioctl(). The fact that each device driver provides a consistent interface, hiding the differences in operation of individual devices, allows for universality of I/O (Section 4.2).
Some devices are real, such as mice, disks, and tape drives. Others are virtual, meaning that there is no corresponding hardware; rather, the kernel provides (via a device driver) an abstract device with an API that is the same as a real device.
Devices can be divided into two types:
- Character devices handle data on a character-by-character basis. Terminals and keyboards are examples of character devices.
- Block devices handle data a block at a time. The size of a block depends on the type of device, but is typically some multiple of 512 bytes. Examples of block devices include disks and tape drives.
Device files appear within the file system, just like other files, usually under the /dev directory. The superuser can create a device file using the mknod command, and the same task can be performed in a privileged (CAP_MKNOD) program using the mknod() system call.
We don't describe the mknod() (make file-system i-node
FILE SYSTEMS
In Chapters 4, 5, and 13, we looked at file I/O, with a particular focus on regular
(i.e., disk) files. In this and the following
about mounted file systems.
250
Chapter 13
Further information
[Bach, 1986] describes the implementation and advantages of the buffer cache on System V. [Goodheart & Cox, 1994] and [Vahalia, 1996] also describe the rationale and implementation of the System V buffer cache. Further relevant information specific to Linux can be found in [B
). Are the results similar? Are the trends the same when going
from small to large buffer sizes?
13-2.
Time the operation of the
filebuff/write_bytes.c
program (provided in the source
code distribution for this book) for various buffer sizes and file systems.
13-3.
What is the effect of the following statements?
fflush(fp);
fsync(fileno(fp));
13-4.
Explain why the output of the following
File I/O Buffering
249
The fdopen() function is the converse of fileno(). Given a file descriptor, it creates a corresponding stream that uses this descriptor for its I/O. The mode argument is the same as for fopen(); for example, r for read, w for write, or a for append. If this argument is not consistent with the access mode of the file descriptor fd, then fdopen() fails.
The fdopen() function is especially useful for descriptors referring to files other than regular files. As we'll see in later ch
Failure to observe any of these restrictions results in the error EINVAL. In the above list, block size means the physical block size of the device (typically 512 bytes).
When performing direct I/O, Linux 2.4 is more restrictive than Linux 2.6: the
#include <fcntl.h>
#include <malloc.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    ssize_t numRead;
    size_t length, alignment;
The specification of posix_fadvise() is new in SUSv3, and not all UNIX implementations support this interface. Linux provides posix_fadvise() since kernel 2.6.
13.6 Bypassing the Buffer Cache: Direct I/O
Starting with kernel 2.4, Linux allows an application to bypass the buffer cache when performing disk I/O, thus transferring data directly from user space to a file or disk device. This
Alignment restrictions for direct I/O
Because direct I/O (on both disk devices and files) involves direct access to the disk, we must observe a number of restrictions:
- The data buffer being transferred must be aligned on a memory boundary that is a multiple of the block size.
- The file or device offset at which data transfer commences must be a multiple of the block size.
- The length of the data to be transferred must be a multiple of the block size.
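The three restrictions listed above can be expressed as a simple check, together with a way of obtaining a suitably aligned buffer. This is a sketch under stated assumptions: the function names are hypothetical, posix_memalign() is used here as one possible allocator, and the block size passed in is assumed to be a power of two (as posix_memalign() requires of its alignment argument).

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Allocate 'length' bytes aligned on a multiple of 'blockSize'.
   Returns NULL on failure; the caller releases the buffer with free(). */
void *
allocAligned(size_t blockSize, size_t length)
{
    void *buf;

    if (posix_memalign(&buf, blockSize, length) != 0)
        return NULL;
    return buf;
}

/* Return 1 if the buffer address, file offset, and transfer length are all
   multiples of 'blockSize', or 0 otherwise. */
int
directIoAligned(const void *buf, off_t offset, size_t length, size_t blockSize)
{
    return blockSize != 0 &&
           (uintptr_t) buf % blockSize == 0 &&
           offset % (off_t) blockSize == 0 &&
           length % blockSize == 0;
}
```

A program could call directIoAligned() before issuing a transfer on a file opened for direct I/O, rather than discovering a misalignment through a failed call.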
bytes from
Figure 13-1: Summary of I/O buffering
13.5 Advising the Kernel About I/O Patterns
The posix_fadvise() system call allows a process to inform the kernel about its likely pattern for accessing file data.
The kernel may (but is not obliged to) use the information provided by posix_fadvise() to optimize its use of the buffer cache, thereby improving I/O performance for the process and for the system as a whole. Calling posix_fadvise() has no effect on the semantics of a program.
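As a minimal sketch of how such advice might be given, the helper below opens a file for reading and tells the kernel that the whole file is expected to be read sequentially. The function name openForSequentialRead() is hypothetical, and POSIX_FADV_SEQUENTIAL is just one of the advice values defined by the standard. Because the advice is only a hint, a failure to apply it is reported to the caller but isn't treated as fatal.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <fcntl.h>
#include <unistd.h>

/* Open 'pathname' read-only and advise the kernel of sequential access to
   the whole file (an offset and length of 0 mean "from the start through
   to the end of the file"). Returns the descriptor, or -1 if the open
   fails. If 'adviceErr' is not NULL, it receives the posix_fadvise()
   result: 0 on success, or a positive error number. */
int
openForSequentialRead(const char *pathname, int *adviceErr)
{
    int fd, err;

    fd = open(pathname, O_RDONLY);
    if (fd == -1)
        return -1;

    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (adviceErr != NULL)
        *adviceErr = err;

    return fd;
}
```

The caller remains responsible for closing the returned descriptor.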
The fd argument is a file descriptor identifying the file about whose access patterns we wish to inform the kernel. The
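[Figure 13-1 labels: user data passes via stdio library calls to the stdio buffer in user memory, then via I/O system calls to the kernel buffer cache in kernel memory, and is written out by kernel-initiated writes. Annotations show the calls used to force a buffer flush, and how to make flushing automatic on each I/O call with open(path, flags | O_SYNC, mode).]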
#define _XOPEN_SOURCE 600
#include <fcntl.h>
int posix_fadvise(int fd, off_t
The O_DSYNC and O_RSYNC flags
SUSv3 specifies two further open file status flags related to synchronized I/O: O_DSYNC and O_RSYNC.
The O_DSYNC flag causes writes to be performed according to the requirements of synchronized I/O data integrity completion (like fdatasync()). This contrasts with O_SYNC, which causes writes to be performed according to the requirements of syn-
open() call, every write() to the file automatically flushes the file data and
fdatasync() potentially reduces the number of disk operations from the two required by fsync() to one. For example, if the file data has changed, but the file size has not, then calling fdatasync() only forces the data to be updated. (We noted
Flushing a stdio buffer
Regardless of the current buffering mode, at any time, we can force the data in a stdio output stream to be written (i.e., flushed to a kernel buffer via write()) using the fflush() library function. This function flushes the output buffer for the specified stream.
If stream is NULL, fflush() flushes all stdio buffers.
The fflush() function can also be applied to an input stream. This causes any buffered input to be discarded. (The buffer will be refilled when the program next tries to read from the stream.)
A stdio buffer is automatically flushed when the corresponding stream is closed.
In many C library implementations, including glibc, if stdin and stdout refer to a terminal, then an implicit fflush(stdout) is performed whenever input is read from stdin. This has the effect of flushing any prompts written to stdout that don't include a terminating newline character (e.g., printf("Date: ")). However, this behavior is not specified in SUSv3 or C99 and is not implemented in all C libraries. Portable programs should use explicit fflush(stdout) calls to ensure that such prompts are displayed.
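The portable approach to displaying a prompt can be sketched as a small helper that writes the text and then flushes the stream explicitly. The function name showPrompt() is hypothetical.

```c
#include <stdio.h>

/* Write a prompt that has no terminating newline and push it out of the
   stdio buffer immediately. Returns 0 on success, or EOF on error. */
int
showPrompt(FILE *out, const char *text)
{
    if (fputs(text, out) == EOF)
        return EOF;
    return fflush(out);                 /* Don't rely on an implicit flush */
}
```

A call such as showPrompt(stdout, "Date: ") followed by a read from stdin then behaves the same way regardless of whether the C library performs the implicit flush.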
The C99 standard makes two requirements if a stream is opened for both input and output. First, an output operation can't be directly followed by an input operation without an intervening call to fflush() or one of the file-positioning functions (fseek(), fsetpos(), or rewind()). Second, an input operation can't be directly followed by an output operation without an intervening call to one of the file-positioning functions, unless the input operation encountered end-of-file.
13.3 Controlling Kernel Buffering of File I/O
It is possible to force flushing of kernel buffers for output files. Sometimes, this is necessary if an application (e.g., a database journaling process) must ensure that output really has been written to the disk (or at least to the disk's hardware cache)
Before we describe the system calls used to control kernel buffering, it is useful to consider a few relevant definitions from SUSv3.
Synchronized I/O data integrity and synchronized I/O file integrity
SUSv3 defines the term
_IOLBF
Employ line-buffered I/O. This flag is the default for streams referring to terminal devices. For output streams, data is buffered until a newline character is output (unless the buffer fills first). For input streams, data is read a line at a time.
_IOFBF
Employ fully buffered I/O. Data is read or written (via calls to read() or write()) in units equal to the size of the buffer. This mode is the default for streams referring to disk files.
The following code demonstrates the use of
File I/O Buffering
237
13.2 Buffering in the stdio Library
Buffering of data into large blocks to reduce system calls is exactly what is done by the C library I/O functions (e.g., fprintf(), fscanf
input file into the buffer cache is unavoidable. However, we already saw that
ext2 file system with a 4096-byte block size, and each row shows the average of 20 runs. We don't show the test program (filebuff/write_bytes.c), but it is available in the source code distribution for this book.
Table 13-2 shows the costs just for making write() system calls and transferring data from user space to the kernel buffer cache using different write() buffer sizes. For larger buffer sizes, we see significant differences from the data shown in Table 13-1. For example, for a 65,536-byte buffer size, the elapsed time in Table 13-1 is 2.06 seconds, while for Table 13-2 it is 0.09 seconds. This is because no actual disk I/O is being performed in the latter case. In other words, the majority of the time required for the large buffer cases in Table 13-1 is due to the disk reads.
As we'll see in Section 13.3, when we force output operations to block until data is transferred to the disk, the times for write() calls rise significantly.
Finally, it is worth noting that the information in Table 13-2 (and later, in Table 13-3) represents just one form of (naive) benchmark for a file system. Furthermore, the results will probably show some variation across file systems. File systems can be measured by various other criteria, such as performance under heavy multiuser
Each row shows the average of 20 runs for the given buffer size. In these tests, as in other tests shown later in this chapter, the file system was unmounted and
Correspondingly, for input, the kernel reads data from the disk and stores it in a kernel buffer. Calls to read()
FILE I/O BUFFERING
In the interests of speed and efficiency, I/O system calls (i.e., the kernel) and the I/O functions of the standard C library (i.e., the stdio functions) buffer data when operating on disk files. In this chapter, we describe both types of buffering and consider how they affect application performance. We also look at various techniques for influencing and disabling both types of buffering, and look at a technique called direct I/O, which is useful for bypassing kernel buffering in certain circumstances.
13.1 Kernel Buffering of File
When working with disk files, the read() and write() system calls don't directly initiate disk access. Instead, they simply co
System and Process Information
231
12.3 Summary
The /proc file system exposes a range of kernel information to application programs. Each /proc/PID subdirectory contains files and subdirectories that provide information about the process whose ID matches PID. Various other files and directories under /proc expose system-wide information that programs can read and, in some cases, modify.
The uname() system call allows us to discover the UNIX implementation and the type of machine on which an application is running.
Further information
Further information about the /proc file system can be found in the proc(5) manual page, in the kernel source file Documentation/filesystems/proc.txt, and in various files in the Documentation/sysctl directory.
12.4 Exercises
12-1. Write a program that lists the process ID and command name for all processes being run by the user named in the program's command-line argument. (You may find the userIdFromName() function from Listing 8-1, on page 159, useful.) This can be done by inspecting the Name: and Uid: lines of all of the /proc/PID/status files on the system. Walking through all of the /proc/PID directories on the system requires the use of readdir(3), which is described in Section 18.8. Make sure your program correctly handles the possibility that a /proc/PID
230
Chapter 12
The gethostname() system call, which is the converse of
12.2 System Identification: uname()
The uname() system call returns a range of identifying information about the host system on which an application is running, in the structure pointed to by utsbuf. The utsbuf argument is a pointer to a utsname structure, which is defined as follows:
#define _UTSNAME_LENGTH 65
struct utsname {
    char sysname[_UTSNAME_LENGTH];      /* Implementation name */
Example program
Listing 12-1 demonstrates how to read and modify a /proc file. This program reads and displays the contents of /proc/sys/kernel/pid_max. If a command-line argument is supplied, the program updates the file using that value. This file (which is new in Linux 2.6) specifies an upper limit for process IDs (Section 6.2). Here is an example of the use of this program:
su                                      Privilege is required to update pid_max
Password:
./procfs_pidmax 10000
Old value: 32768
/proc/sys/kernel/pid_max now contains 10000
Listing 12-1: /proc/sys/kernel/pid_max

sysinfo/procfs_pidmax.c
#include <fcntl.h>
#include "tlpi_hdr.h"

#define MAX_LINE 100

int
main(int argc, char *argv[])
{
    int fd;
    char line[MAX_LINE];
    ssize_t n;

    /* Open read-write only if a new value was supplied */
    fd = open("/proc/sys/kernel/pid_max", (argc > 1) ? O_RDWR : O_RDONLY);
    if (fd == -1)
        errExit("open");

    n = read(fd, line, MAX_LINE);
    if (n == -1)
        errExit("read");

    if (argc > 1)
        printf("Old value: ");
    printf("%.*s", (int) n, line);

    if (argc > 1) {
        if (write(fd, argv[1], strlen(argv[1])) != strlen(argv[1]))
            fatal("write() failed");

        system("echo /proc/sys/kernel/pid_max now contains "
               "`cat /proc/sys/kernel/pid_max`");
    }

    exit(EXIT_SUCCESS);
}
sysinfo/procfs_pidmax.c
Other than the files in the /proc/PID subdirectories, most files under /proc are owned by root, and the files that are modifiable can be modified only by root.
Figure 12-1: Selected files and subdirectories under /proc
Accessing files in /proc/PID
The /proc/PID directories are volatile. Each of these directories comes into existence when a process with the corresponding process ID is created and disappears when
[Figure 12-1 contents, partially recovered. Files: filesystems, kallsyms, loadavg, locks, meminfo, partitions, stat, swaps, uptime, version, vmstat; sockstat, sockstat6. Directory fs: file-max. Directory kernel: acct, core_pattern, hostname, msgmax, msgmnb, msgmni, pid_max, sem, shmall, shmmax, shmmni. Directory vm: overcommit_memory, overcommit_ratio.]
where TID is the thread ID of the thread. (This is the same number as would be
CapEff: 00000000fffffeff                Effective capabilities
CapBnd: 00000000ffffffff
In order to provide easier access to kernel information, many modern UNIX implementations provide a /proc virtual file system. This file system resides under the /proc directory and contains various files that expose kernel information, allowing processes to conveniently read that information, and change it in some cases, using normal file I/O system calls. The /proc file system is said to be virtual because the files and subdirectories that it contains don't reside on a disk. Instead, the kernel creates them on the fly as processes access them.
In this section, we present an overview of the /proc file system. In later chapters, we describe specific /proc files, as they relate to the topics of each chapter. Although many UNIX implementations provide a /proc file system, SUSv3 doesn't
Information About a Process: /proc/PID
For each process on the system, the kernel provides a corresponding directory named /proc/PID, where PID is the ID of the process. Within this directory are various files and subdirectories containing information about that process. For example, we can obtain information about the init process, which always has the process ID 1, by looking at files under the directory /proc/1.
Among the files in each /proc/PID directory is one named status, which provides a range of information about the process:
cat /proc/1/status
Name:   init                            Name of command run by this process
State:  S (sleeping)                    State of this process
Tgid:   1
SYSTEM AND PROCESS INFORMATION
In this chapter, we look at ways of accessing a variety of system and process information. The primary focus of the chapter is a discussion of the /proc file system. We also describe the uname() system call, which is used
222
Chapter 11
Further information
Chapter 2 of [Stevens & Rago, 2005] and Chapter 2 of [Gallmeister, 1995] cover similar ground to this chapter. [Lewine, 1991] also provides much useful (although now slightly outdated) background. Some information about POSIX options with glibc
System Limits and Options
221
11.6 Summary
SUSv3 specifies limits that an implementation may enforce and system options that an implementation may support.
Often, it is desirable not to hard-code assumptions about system limits and options into a program, since these may vary across implementations and also on a single implementation, either at run time or across file systems. Therefore, SUSv3 specifies methods by which an implementation can advertise the limits and options it supports. For most limits, SUSv3 specifies a minimum value that all implementations must support. Additionally, each implementation can advertise its implementation-specific limits and options at compile time (via a constant definition in <limits.h> or <unistd.h>) and/or run time (via a call to sysconf(), pathconf(), or fpathconf()). These techniques may similarly be used to find out which SUSv3 options an implementation supports. In some cases, it may not be
Each option constant, if defined, has one of the following values:
A value of -1 means that the option is not supported. In this case, header files, data types, and function interfaces associated with the option need not be defined by the implementation. We may need to handle this possibility by conditional compilation using #if preprocessor directives.
A value of 0 means that the option may be supported. An application must check
Listing 11-2 shows the use of
fpathconf()
sysconfPrint("_SC_OPEN_MAX: ", _SC_OPEN_MAX);
sysconfPrint("_SC_NGROUPS_MAX: ", _SC_NGROUPS_MAX);
sysconfPrint("_SC_PAGESIZE: ", _SC_PAGESIZE);
sysconfPrint("_SC_RTSIG_MAX: ", _SC_RTSIG_MAX);
exit(EXIT_SUCCESS);
syslim/t_sysconf.c
SUSv3 requires that
file-related limits at run time.
The only difference between pathconf() and fpathconf() is the manner in which a file or directory is specified. For pathconf(), specification is by pathname; for fpathconf(), specification is via a (previously opened) file descriptor.
The name argument is one of the _PC_* constants defined in <unistd.h>, some of which are listed in Table 11-1. Table 11-2
The name argument is one of the _SC_* constants defined in <unistd.h>, some of which are listed in Table 11-1. The value of the limit is returned as the function
with the value 8) specifies the SUSv3-required minimum corresponding to the RTSIG_MAX implementation constant. The third column specifies the constant name that can be given at run time to
implemented by a particular UNIX implementation. The general form of this command is as follows:
#include <unistd.h>
long sysconf(int
(Section 9.6). SUSv3 defines the corresponding minimum value, _POSIX_NGROUPS_MAX, with the value 8. At run time, an applicat
If defined, this limit will always be at least the size of the minimum value described above, that is:
XXX_MAX >= _POSIX_XXX_MAX
SUSv3 divides the limits that it specifies into three categories: runtime invariant values, pathname variable values, and runtime increasable values. In the following paragraphs, we describe these categories and provide some examples.
Runtime invariant values (possibly indeterminate)
A runtime invariant value is a limit whose value, if defined in <limits.h>, is fixed for the implementation. However, the value ma
From one file system to another: For example, traditional System V file systems allow a filename to be up to 14 bytes, while traditional BSD file systems and most native Linux file systems allow filenames of up to 255 bytes.
Since system limits and options affect what an application may do, a portable appli-
SYSTEM LIMITS AND OPTIONS
210
Chapter 10
time and human-readable character strings. Describing such conversions took us into a discussion of locales and internationalization.
Using and displaying times and dates is an important part of many applications, and we'll make frequent use of the functions described in this chapter in later parts of this book. We also say a little more about the measurement of time in Chapter 23.
Further information
Time
209
    if (clockTicks == 0) {      /* Fetch clock ticks on first call */
        clockTicks = sysconf(_SC_CLK_TCK);
        if (clockTicks == -1)
            errExit("sysconf");
    }
    clockTime = clock();
    if (clockTime == -1)
        errExit("clock");
Although the
clock_t
clock_t tms_cutime; /* User CPU time of all (waited for) children */
clock_t tms_cstime; /* System CPU time of all (waited for) children */
The first two fields of the
value by this number to convert to seconds. (We describe sysconf() in Section 11.2.)
On most Linux hardware architectures, sysconf(_SC_CLK_TCK)
with greater accuracy and time measurements can be made with greater precision.
the real time required to run the program:
time ./myprog
real 0m4.84s
user 0m1.030s
sys 0m3.43s
The times() system call retrieves process time in
Abrupt changes in the system time of the sort caused by calls to
(Section 10.2.3), as shown by the results from strftime() when we run the program in Listing 10-4 in a number of different locales:
LANG=de_DE ./show_time
ctime() of time() value is:  Tue Feb  1 12:23:39 2011
asctime() of local time is:  Tue Feb  1 12:23:39 2011
strftime() of local time is: Dienstag, 01 Februar 2011, 12:23:39 CET
LC_TIME has precedence over LANG:
LANG=de_DE LC_TIME=it_IT ./show_time            German and Italian locales
ctime() of time() value is:  Tue Feb  1 12:24:03 2011
asctime() of local time is:  Tue Feb  1 12:24:03 2011
strftime() of local time is: martedì, 01 febbraio 2011, 12:24:03 CET
And this run demonstrates that LC_ALL has precedence over LC_TIME:
LC_ALL=fr_FR LC_TIME=en_US ./show_time          French and US locales
ctime() of time() value is:  Tue Feb  1 12:25:38 2011
asctime() of local time is:  Tue Feb  1 12:25:38 2011
Under each locale subdirectory is a standa
Locale information is maintained in a directory hierarchy under /usr/share/locale (or /usr/lib/locale in some distributions). Each subdirectory under this directory contains information about a particular locale. These directories are named using the following convention:
language
adjustment to add to the local time to convert it to UTC. The final four components provide a rule describing when the change from standard time to DST
Listing 10-4: Demonstrate the effect of timezones and locales
time/show_time.c
#include <time.h>
#include <locale.h>
#include "tlpi_hdr.h"

#define BUF_SIZE 200

int
main(int argc, char *argv[])
{
    time_t t;
    struct tm *loc;
    char buf[BUF_SIZE];
These files reside in the directory /usr/share/zoneinfo. Each file in this directory contains information about the timezone regime in a particular country or region. These files are named according to the timezone they describe, so we may find files with names such as EST (US Eastern Standard Time),
Listing 10-3:
The conversion specifications are similar to those given to strftime() (Table 10-1). The major difference is that the specifiers are more general. For example, both %a and %A can accept a weekday name in either full or abbreviated form, and %d or %e can be used to read a day of the month with or without a leading 0 in the case of single-digit days. In addition, case is ignored; for example, May and MAY are equivalent month names. The string %% is used to match a percent character in the input string.
char *
currTime(const char *format)
{
    static char buf[BUF_SIZE];  /* Nonreentrant */
    time_t t;
    size_t s;
    struct tm *tm;

    t = time(NULL);
    tm = localtime(&t);
    if (tm == NULL)
Listing 10-2:
The strftime() function provides us with more precise control when converting a broken-down time into printable form. Given a broken-down time pointed to by timeptr, strftime() places a corresponding null-terminated, date-plus-time string in the buffer pointed to by
#define SECONDS_IN_TROPICAL_YEAR (365.24219 * 24 * 60 * 60)

int
main(int argc, char *argv[])
{
    time_t t;
    struct tm *gmp, *locp;
    struct tm gm, loc;
    struct timeval tv;

    t = time(NULL);
    printf("Seconds since the Epoch (1 Jan 1970): %ld", (long) t);
    printf(" (about %6.3f years)\n", t / SECONDS_IN_TROPICAL_YEAR);
number of seconds that the represented time falls east of UTC. The second field, tm_zone, is the abbreviated timezone name (e.g., CEST for Central European Summer Time). SUSv3 doesn't specify either of these fields, and they appear on only a few other UNIX implementations (mainly BSD derivatives).
The mktime() function translates a broken-down time, expressed as local time, into a time_t value.
A reentrant version of ctime() is provided in the form of ctime_r(). (We explain reentrancy in Section 21.1.2.) This function permits the caller to specify an additional argument that is a pointer to a (caller-supplied) buffer that is used
Figure 10-1: [Diagram not recoverable. It relates the kernel, calendar time, struct timeval, struct tm (broken-down time), a fixed-format string, and a user-formatted, localized string, and marks with * the functions affected by an environment variable.]
#include <time.h>
char *ctime(const time_t *
If the tz argument is supplied, then it returns a timezone structure whose fields
10.1 Calendar Time
Regardless of geographic location, UNIX systems represent time internally as a measure of seconds since the Epoch; that is, since midnight on the morning of 1 January 1970, Universal Coordinated Time (UTC, previously known as Greenwich Mean Time, or GMT). This is approximately the date when the UNIX system came into being. Calendar time is stored in variables of type time_t, an integer type specified by SUSv3.
On 32-bit Linux systems, time_t, which is a signed integer, can represent dates in the range 13 December 1901 20:45:52 to 19 January 2038 03:14:07. (SUSv3 leaves the meaning of negative time_t values unspecified.) Thus, many current
Within a program, we may be interested in two kinds of time:
Real time: This is the time as measured either from some standard point (calendar time) or from some fixed point (typically the start) in the life of a process (elapsed or wall clock time). Obtaining the calendar time is useful to programs that, for example, timestamp database records or files. Measuring elapsed time is useful for a program that takes periodic actions or makes regular measurements from some external input device.
Process time: This is the amount of CPU time used by a process. Measuring process time is useful for checking or optimizing the performance of a program or
Most computer architectures have a built-in hardware clock that enables the kernel to measure real and process time. In this chapter, we look at system calls for dealing with both sorts of time, and library
184
Chapter 9
9.9 Exercises
9-1. Assume in each of the following cases that the initial set of process user IDs is real=1000 effective=0 saved=0 file-system=0. What would be the state of the user IDs after the following calls?
Process Credentials
183
    p = groupNameFromId(egid);
    printf("eff=%s (%ld); ", (p == NULL) ? "???" : p, (long) egid);
    p = groupNameFromId(sgid);
    printf("saved=%s (%ld); ", (p == NULL) ? "???" : p, (long) sgid);
    p = groupNameFromId(fsgid);
    printf("fs=%s (%ld); ", (p == NULL) ? "???" : p, (long) fsgid);
    printf("\n");
9.7.5 Example: Displaying Process Credentials
The program in Listing 9-1 uses the system calls and library functions described in
Note the following supplementary information to Table 9-1:
glibc implementations of
argument by reading the group ID field from the user's record in the password file. This is slightly confusing, since the group ID from the password file is not really a supplementary group. Instead, it defines the initial real group ID, effective group ID, and
On Linux, as on most UNIX implementations,
time techniques can then be used to dynamically allocate a grouplist array for a future
Although setresuid() and setresgid() provide the most straightforward API for changing process credentials, we can't portably employ them in applications; they are not specified in SUSv3 and are available on only a few other UNIX implementations.
SUSv3 says that it is unspecified whether an unprivileged process can use
In older versions of the GNU C library (glibc 2.0 and earlier),
changes, rule 1 applies exactly as stated. In rule 2, since changing the group IDs doesn't cause a process to lose privileges
Modifying effective IDs
ownership of new files, in terms of the effective IDs of a process. Even though the process's file-system IDs are really used for these purposes on Linux, in practice, their presence seldom makes an effective difference.
9.6 Supplementary Group IDs
ls -l check_password
-rwsr-xr-x 1 root users 18150 Oct 28 10:49 check_password
whoami
This is an unprivileged login
mtk
./check_password
But we can now access the shadow
Username:
avr
Password:
Successfully authenticated: UID=1001
As shown in this example, it is possible fo
(Section 8.1). When a new process is created (e.g., when the shell executes a program), it inherits these identifiers from its parent.
9.2 Effective User ID and Effective Group ID
On most UNIX implementations (Linux is a little different, as explained in Section 9.5), the effective user ID and group ID, in conjunction with the supplementary
166
Chapter 8
memory. This minimizes the possibility of a program crash producing a core dump file that could be read
There are other possible ways in which the unencrypted password could be exposed. For example, the password could be read from the swap file by a privileged program if the virtual memory page containing the password is swapped out. Alternatively, a process with sufficient privilege could read /dev/mem (a virtual device that presents the physical memory of a computer as a sequential stream of bytes) in an attempt to discover the password.
The
Users and Groups
165
lnmax = sysconf(_SC_LOGIN_NAME_MAX);
The crypt() algorithm takes a key (i.e., a password) of up to 8 characters, and applies a variation of the Data Encryption Standard (DES) to it. The salt argument is a 2-character string whose value is used to perturb (vary) the algorithm, a technique designed to make it more difficult to crack the encrypted password.
value from the encrypted password value already stored in
a password, it will eventually cease to be usable. */
long sp_lstchg; /* Time of last password change
(days since 1 Jan 1970) */
if (name == NULL || *name == '\0') /* On NULL or empty string */
or, if one or more usernames are supplied as command-line arguments, then the group memberships of those users.)
group file
On a stand-alone system, all the password information resides in the file
In order, these fields are as follows:
Login name: This is the unique name that the user must enter in order to log in. Often, this is also called the username. We can also consider the login name to be the human-readable (symbolic) identifier corresponding to the numeric user identifier (described in a moment). Programs such as ls(1) display this name, rather than the numeric user ID associated with the file, when asked to show the ownership of a file (as in ls -l).
Encrypted password: This field contains a 13-character encrypted password,
Every user has a unique login name and an associated numeric user identifier (UID). Users can belong to one or more groups. Each group also has a unique name and a group identifier (GID).
The primary purpose of user and group ID
152
Chapter 7
7.4 Exercises
7-1. Modify the program in Listing 7-1 (free_and_sbrk.c) to print out the current value of the program break after each execution of malloc(). Run the program specifying a small allocation block size. This will demonstrate that malloc() doesn't employ sbrk() to adjust the program break on each call, but instead periodically allocates larger chunks of memory from which it passes back small pieces to the caller.
7-2. (Advanced) Implement malloc()
Memory Allocation
151
Older versions of glibc, and some other UNIX implementations (mainly BSD derivatives), require the inclusion of <stdlib.h> instead of <alloca.h> to obtain the declaration of alloca().
If the stack overflows as a consequence of calling alloca(), then program behavior is
The posix_memalign() function differs from memalign() in two respects:
from the stack by increasing the size of the stack frame. This is possible because the calling function is the one whose stack frame is, by definition, on the top of the stack. Therefore, there is space above the frame for expansion, which can be accomplished by simply modifying the value of the stack pointer.
The size argument specifies the number of bytes to allocate on the stack. The
Since realloc() may relocate the block of me
SUSv3 doesn't specify memalign(), but instead specifies a similar function, posix_memalign(). This function is a recent creation of the standards committees, and appears on only a few UNIX implementations.
#include <malloc.h>
void *memalign(size_t boundary, size_t
The numitems argument specifies how many items to allocate, and size specifies their size. After allocating a block of memory of the appropriate size, calloc()
settings are: 0, meaning ignore errors; 1, meaning print diagnostic errors on stderr; and 2, meaning call abort() to terminate the program. Not all memory allocation and deallocation erro
To avoid these types of errors, we should observe the following rules:
After we allocate a block of memory, we should be careful not to touch any bytes outside the range of that block. This could occur, for example, as a result
Looking at the implementation of free(), things start to become more interesting. When free() places a block of memory onto the free list, how does it know what size that block is? This is done via a trick. When malloc() allocates the block, it allocates extra bytes to hold an integer containing the size of the block. This integer is located at the beginning of the block; the address actually returned to the caller points to the location just past this length value, as shown in Figure 7-1.
Figure 7-1: [Diagram not recoverable. Allocated, in-use block: length of block (L), followed by memory for use by caller. Block on free list: length of block (L), pointer to previous free block (P), pointer to next free block (N), followed by remaining bytes of free block. A marked pointer value indicates the end of the free list.]
In this case, the free() function is able to recognize that an entire region at the top end of the heap is free, since, when releasing blocks, it coalesces neighboring free blocks into a single larger block. (Such coalescing is done to avoid having a large number of small fragments on the free list, all of which may be too small to satisfy subsequent malloc() requests.)
The free() function calls sbrk() to lower the program break only when the free block at the top end is sufficiently
    for (j = 0; j < numAllocs; j++) {
        ptr[j] = malloc(blockSize);
        if (ptr[j] == NULL)
            errExit("malloc");
    }

    printf("Program break is now:           %10p\n", sbrk(0));

    printf("Freeing blocks from %d to %d in steps of %d\n",
           freeMin, freeMax, freeStep);
    for (j = freeMin - 1; j < freeMax; j += freeStep)
        free(ptr[j]);

    printf("After free(), program break is: %10p\n", sbrk(0));

    exit(EXIT_SUCCESS);
}
memalloc/free_and_sbrk.c
Running the program in Listing 7-1 with the following command line causes the program to allocate 1000 blocks of memory and then free every second block:
./free_and_sbrk 1000 10240 2
The output shows that after these blocks have been freed, the program break is left unchanged from the level it reached when all memory blocks were allocated:
Initial program break:          0x804a6bc
Allocating 1000*10240 bytes
Program break is now:           0x8a13000
Freeing blocks from 1 to 1000 in steps of 2
After free(), program break is: 0x8a13000
The following command line specifies that all but the last of the allocated blocks should be freed. Again, the program break remains at its high-water mark.
./free_and_sbrk 1000 10240 1 1 999
Initial program break:          0x804a6bc
Allocating 1000*10240 bytes
Program break is now:           0x8a13000
Freeing blocks from 1 to 999 in steps of 1
After free(), program break is: 0x8a13000
If, however, we free a complete set of blocks at the top end of the heap, the program break decreases from its peak value, indicating that free() has used sbrk() to lower the program break. Here, we free the last 500 blocks of allocated memory:
./free_and_sbrk 1000 10240 1 500 1000
Initial program break:          0x804a6bc
Allocating 1000*10240 bytes
Program break is now:           0x8a13000
Freeing blocks from 500 to 1000 in steps of 1
After free(), program break is: 0x852b000
If the argument given to free() is a NULL pointer, then the call does nothing. (In other words, it is not an error to give a NULL pointer to free().)
Making any use of ptr after the call to free() (for example, passing it to free() a second time) is an error that can lead to unpredictable results.
Example program
The program in Listing 7-1 can be used to illustrate the effect of free() on the program break. This program allocates multiple blocks of memory and then frees some or all of them, depending on its (optional) command-line arguments.
The first two command-line arguments specify the number and size of blocks to allocate. The third command-line argument specifies the loop step unit to be used when freeing memory blocks. If we specify 1 here (which is also the default if this argument is omitted), then the program frees every memory block; if 2, then every second allocated block; and so on. The fourth and fifth command-line arguments specify the range of blocks that we wish to free. If these arguments are omitted, then all allocated blocks (in steps given by the third command-line argument) are freed.
Listing 7-1: Demonstrate what happens to the program break when memory is freed
memalloc/free_and_sbrk.c
#include "tlpi_hdr.h"

#define MAX_ALLOCS 1000000

int
main(int argc, char *argv[])
{
    char *ptr[MAX_ALLOCS];
    int freeStep, freeMin, freeMax, blockSize, numAllocs, j;

    printf("\n");

    if (argc < 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s num-allocs block-size [step [min [max]]]\n", argv[0]);
provide a simple interface that allows memory to be allocated in small units; and
allow us to arbitrarily deallocate blocks of memory, which are maintained on a free list and recycled in future calls to allocate memory.
The malloc() function allocates size bytes from the heap and
After the program break is increased, the program may access any address in the newly allocated area, but no physical
MEMORY ALLOCATION
Many system programs need to be able to allocate extra memory for dynamic data structures (e.g., linked lists and binary trees), whose size depends on information that is available only at run time. This chapter describes the functions that are used to allocate memory on the heap or the stack.
7.1 Allocating Memory on the Heap
A process can allocate memory by increasing the size of the heap, a variable-size segment of contiguous virtual memory that begins just after the uninitialized data segment of a process and grows and shrinks as memory is allocated and freed (see Figure 6-1 on page 119). The current limit of the heap is referred to as the program break.
To allocate memory, C programs normally use the malloc family of functions, which we describe shortly. However, we begin with a description of brk() and sbrk(), upon which the malloc functions are based.
7.1.1 Adjusting the Program Break: brk() and sbrk()
Resizing the heap (i.e., allocating or deallocating memory) is actually as simple as telling the kernel to adjust its idea of where the process's program break is. Initially, the program break lies just past the end of the uninitialized data segment
(i.e., the same location as
, shown in Figure 6-1).
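The effect of moving the program break can be observed directly. The following is a minimal sketch, assuming Linux with glibc; the helper name moveBreak() is hypothetical, and the sketch relies on the fact that sbrk(0) reports the current break without changing it.

```c
#define _DEFAULT_SOURCE         /* Expose the sbrk() declaration in glibc */
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: raise the program break by 'increment' bytes.
   Returns the distance the break actually moved, or -1 on error. */
long
moveBreak(intptr_t increment)
{
    assert(increment >= 0);             /* This sketch only grows the heap */

    void *oldBrk = sbrk(increment);     /* Returns the previous break */
    if (oldBrk == (void *) -1)
        return -1;

    void *newBrk = sbrk(0);             /* sbrk(0) reports the current break */
    return (long) ((char *) newBrk - (char *) oldBrk);
}
```

Calling moveBreak(4096) should report that the break moved by 4096 bytes, provided nothing else in the process adjusts the break between the two sbrk() calls.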
138
Chapter 6
6.9 Summary
Each process has a unique process ID and maintains a record of its parent's process ID.
The virtual memory of a process is logically divided into a number of segments: text, (initialized and uninitialized) data, stack, and heap.
The stack consists of a series of frames, with a new frame being added as a function is invoked and removed when th
Processes
137
However, when we compile with optimi
This means that optimized variables may end up with incorrect values as a consequence of a longjmp() operation. We can see an example of this by examining the behavior of the program in Listing 6-6.
Listing 6-6:
A demonstration of the interaction of compiler optimization and
as part of a comparison operation (==, !=, <, and so on), where the other operand is an integer constant expression and the resulting expression is the entire controlling expression of a selection or iteration statement; or
as a free-standing function call that is not embedded inside some larger expression.
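A short sketch of the comparison form listed above may help. It assumes hypothetical helper names (maybeJump(), runGuarded()); the setjmp() call is compared against an integer constant, and that comparison is the whole condition of the if statement.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Hypothetical helper: jump back to the setjmp() point on request */
static void
maybeJump(int shouldJump)
{
    if (shouldJump)
        longjmp(env, 1);
}

/* Returns 0 if maybeJump() returned normally, 1 if control came back
   through longjmp() */
int
runGuarded(int shouldJump)
{
    assert(shouldJump == 0 || shouldJump == 1);

    if (setjmp(env) == 0) {     /* Comparison is the entire condition */
        maybeJump(shouldJump);
        return 0;               /* No jump occurred */
    }
    return 1;                   /* Arrived here via longjmp() */
}
```

No local variable is modified between setjmp() and longjmp() here, so the sketch sidesteps the optimization problem discussed in connection with Listing 6-6.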
Note that the C assignment statement doesn't figure in the list above. A statement of the following form is not standards-conformant:
Listing 6-5: Demonstrate the use of setjmp() and longjmp()

proc/longjmp.c
same
variable, to perform th
handled by its caller. This is perfectly va
lid, and, in many cases, the desirable
Listing 6-4:
Modifying the process environment

proc/modify_env.c
#define _GNU_SOURCE     /* To get various declarations from <stdlib.h> */
#include <stdlib.h>
#include "tlpi_hdr.h"

extern char **environ;

int
main(int argc, char *argv[])
{
    int j;
    char **ep;

    clearenv();         /* Erase entire environment */

    for (j = 1; j < argc; j++)
        if (putenv(argv[j]) != 0)
            errExit("putenv: %s", argv[j]);
In some circumstances, the use of
SUSv3 permits an implementation of
and there is no variable (corresponding to argc) that specifies the size of the environment list. (For similar reasons, we don't number the elements of the environ array in Figure 6-5.)
Listing 6-3:
Displaying the process environment
proc/display_env.c
#include "tlpi_hdr.h"

extern char **environ;

int
main(int argc, char *argv[])
{
    char **ep;

    for (ep = environ; *ep != NULL; ep++)
        puts(*ep);

    exit(EXIT_SUCCESS);
}
proc/display_env.c
An alternative method of accessing the environment list is to declare a third argument to the main() function:
    int main(int argc, char *argv[], char *envp[])
This argument can then be treated in the same way as environ, with the difference that its scope is local to main(). Although this feature is widely implemented on UNIX systems, its use should be avoided since, in addition to the scope limitation, it is not specified in SUSv3.
The env command runs a program using a modified copy of the shell's environment list. The environment list can be modified to both add and remove definitions from the list copied from the shell. See the env(1) manual page for
6.7 Environment List
Each process has an associated array of strings called the environment list, or simply the environment. Each of these strings is a definition of the form name=value. Thus, the environment represents a set of name-value pairs that can be used to hold arbitrary
e list are referred to as
When a new process is created, it inherits a copy of its parent's environment. This is a primitive but frequently used form of interprocess communication: the environment provides a way to transfer information from a parent process to its child(ren). Since the child gets a copy of its parent's environment at the time it is created, this transfer of information is one-way and once-only. After the child process has been created, either process may change its own environment, and these changes are not seen by the other process.
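The one-way, once-only nature of this transfer can be checked with a small experiment. The following is a sketch only, using the standard fork(), setenv(), and getenv() calls; the helper name childChangeIsPrivate() and the variable name passed to it are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* Expose setenv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child inherited the parent's value and the child's
   later change stayed invisible to the parent, 0 if the parent saw the
   change, or -1 on error */
int
childChangeIsPrivate(const char *name)
{
    assert(name != NULL);

    if (setenv(name, "parent", 1) == -1)
        return -1;

    pid_t childPid = fork();
    if (childPid == -1)
        return -1;

    if (childPid == 0) {                /* Child: works on its own copy */
        const char *val = getenv(name);
        int inherited = (val != NULL && strcmp(val, "parent") == 0);
        setenv(name, "child", 1);       /* Changes only the child's copy */
        _exit(inherited ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    int status;
    if (waitpid(childPid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != EXIT_SUCCESS)
        return -1;

    const char *val = getenv(name);     /* Parent's copy is unchanged */
    return val != NULL && strcmp(val, "parent") == 0;
}
```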
A common use of environment variables is in the shell. By placing values in its
own environment, the shell can ensure that these values are passed to the processes
that it creates to execute user commands. For example, the environment variable
SHELL
Since the argv list is terminated by a NULL value, we could alternatively code the body of the program in Listing 6-2 as follows, to output just the command-line arguments one per line:
    char **p;

    for (p = argv; *p != NULL; p++)
        puts(*p);
One limitation of the argc/argv mechanism is that these variables are available only as arguments to main(). To portably make the command-line arguments available in other functions, we must either pass argv as an argument to those functions or
errnrrn;o.h;o.h
by defining the
_GNU_SOURCE
As shown in Figure 6-1, the argv and environ arrays, as well as the strings they initially point to, reside in a single contiguous area of memory just above the process stack. (We describe environ, which holds the program's environment list, in the next section.) There is an upper limit on the total number of bytes that can be stored in this area. SUSv3 prescribes the use of the ARG_MAX constant (defined in <limits.h>) or the call sysconf(_SC_ARG_MAX)
two arguments to the function main(). The first argument, int argc, indicates how many command-line arguments there are. The second argument, char *argv[], is an array of pointers to the command-line arguments, each of which is a null-terminated character string. The first of these strings, in argv[0], is (conventionally) the name of the program itself. The list of pointers in argv is terminated by a NULL pointer (i.e., argv[argc] is NULL).
The fact that argv[0] contains the name used to invoke the program can be employed to perform a useful trick. We can create multiple links to (i.e., names for) the same program, and then have the program look at argv[0] and take different actions depending on the name used to invoke it. An example of this technique is provided by the gzip(1), gunzip(1), and zcat(1) commands, all of which are links to the same executable file. (If we employ this technique, we must be careful to handle the possibility that the user might invoke the program via a link with a name other than any of those that we expect.)
Figure 6-4 shows an example of the data structures associated with argc and argv when executing the program in Listing 6-2. In this diagram, we show the terminating null bytes at the end of each string using the C notation \0.
Figure 6-4: Values of argc and argv for the command necho hello world
The program in Listing 6-2 echoes its command-line arguments, one per line of output, preceded by a string showing which element of argv is being displayed.
Listing 6-2: Echoing command-line arguments
proc/necho.c
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int j;

    for (j = 0; j < argc; j++)
        printf("argv[%d] = %s\n", j, argv[j]);

    exit(EXIT_SUCCESS);
}
proc/necho.c
In virtual memory terms, the stack segment increases in size as stack frames are allocated, but on most implementations, it won't decrease in size after these frames are deallocated (the memory is simply reused when new stack frames are allocated). When we talk about the stack segment growing and shrinking, we are considering things from the logical perspective of frames being added to and removed from the stack.
[Figure: a process stack, showing the frames for the C run-time startup functions, further frames including those for doCalc and square(), and the direction of stack growth]
Virtual memory management separates the virtual address space of a process from the physical address space of RAM. This provides many advantages:
Processes are isolated from one another and from the kernel, so that one process can't read or modify the memory of another process or the kernel. This is accomplished by having the page-table entries for each process point to dis-
Figure 6-2: Overview of virtual memory
[Figure: pages 0 to 4 of the process virtual address space mapped through the page table to page frames in physical memory (RAM)]
In order to support this organization, the kernel maintains a page table for each process (Figure 6-2). The page table describes the location of each page in the pro-
Figure 6-1:
a process on Linux/x86-32
The upshot of locality of reference is that it is possible to execute a program while maintaining only part of its address space in RAM.
A virtual memory scheme splits the memory used by each program into small, fixed-size units called pages. Correspondingly, RAM is divided into a series of page frames of the same size. At any one time, only some of the pages of a program need to be resident in physical memory page frames; these pages form the so-called
[Figure 6-1 labels, from high to low virtual addresses: kernel (mapped into process virtual memory, but not accessible to program); argv, environ; stack (grows downwards), with the top of stack marked; unallocated memory; heap (grows upwards), ending at the program break; uninitialized data (bss); initialized data; text (program code)]
An
application binary interface
the next byte past, respectively, the end of the program text, the end of the initialized data segment, and the end of the uninitialized data segment. To make use of these symbols, we must explicitly declare them, as follows:
compiler and an application binary interface in which all arguments are passed on the stack. In practice, an optimizing compiler may allocate frequently used variables in registers, or optimize a variable
many processes may be running the same program, the text segment is made sharable so that a single copy of the program code can be mapped into the virtual address space of all of the processes.
The initialized data segment contains global and static variables that are explicitly initialized. The values of these variables are read from the executable file when the program is loaded into memory.
The uninitialized data segment contains global and static variables that are not explicitly initialized. Before starting the program, the system initializes all memory in this segment to 0. For historical reasons, this is often called the bss segment, a name derived from an old assembler mnemonic for "block started by symbol." The main reason for placing global and static variables that are initialized into a separate segment from those that are uninitialized is that, when a program is stored on disk, it is not necessary to allocate space for the uninitialized data. Instead, the executable merely needs to record the location and size required for the uninitialized data segment, and this space is allocated by the program loader at run time.
The stack is a dynamically growing and shrinking segment containing stack frames. One stack frame is allocated for each currently called function. A frame stores the function's local variables (so-called automatic variables), argu-
With the exception of a few system processes such as init (process ID 1), there is no fixed relationship between a program and the process ID of the process that is created to run that program.
The Linux kernel limits process IDs to being less than or equal to 32,767. When a new process is created, it is assigned the next sequentially available process ID. Each time the limit of 32,767 is reached, the ke
Machine-language instructions: These encode the algorithm of the program.
Program entry-point address: This identifies the location of the instruction at which execution of the program should commence.
Data: The program file contains values used to initialize variables and also literal constants used by the program (e.g., strings).
Symbol and relocation tables: These describe the locations and names of functions and variables within the program. These tables are used for a variety of purposes, including debugging and run-time symbol resolution (dynamic linking).
Shared-library and dynamic-linking information: The program file includes fields listing the shared libraries that the program needs to use at run time and the pathname of the dynamic linker that should be used to load these libraries.
Other information: The program file contains various other information that describes how to construct a process.
One program may be used to construct many processes, or, put conversely, many processes may be running the same program.
We can recast the definition of a process given at the start of this section as follows: a process is an abstract entity, defined by the kernel, to which system resources are allocated in order to execute a program.
From the kernel's point of view, a process consists of user-space memory containing program code and variables used by that code, and a range of kernel data structures that maintain information about the state of the process. The information recorded in the kernel data structures includes various identifier numbers (IDs) associated with the process, virtual memory tables, the table of open file descriptors, information relating to signal delivery and handling, process resource usages and limits, the current working directory, and a host of
6.2 Process ID and Parent Process ID
Each process has a process ID (PID), a positive integer that uniquely identifies the
In this chapter, we look at the structure of a process, paying particular attention to the layout and contents of a process's virtual memory. We also examine some of the attributes of a process. In later chapters, we examine further process attributes (for example, process credentials in Chapter 9, and process priorities and scheduling in Chapter 35). In Chapters 24 to 27, we look at how processes are created, how they terminate, and how they can be made to execute new programs.
6.1 Processes and Programs
A process is an instance of an executing program. In this section, we elaborate on
110
Chapter 5
The files in the /dev/fd directory are rarely used within programs. Their most common use is in the shell. Many user-level commands take filename arguments, and
if (argc != 3 || strcmp(argv[1], "--help") == 0)
If we attempt to access a large file using 32-bit functions (i.e., from a program
We say more about nonblocking I/O in Section 44.9 and in Chapter 63.
Historically, System V-derived implementations provided the O_NDELAY flag, with similar semantics to O_NONBLOCK. The main difference was that a nonblocking
out making any source code changes.
(or by some other means).
5.9 Nonblocking I/O
Specifying the O_NONBLOCK flag when opening a file serves two purposes:
If the file can't be opened immediately, then open()
, we must enable this flag using the fcntl()
Gather output
The writev() system call performs gather output. It concatenates ("gathers") data from all of the buffers specified by iov and writes them as a sequence of contiguous bytes to the file referred to by the file descriptor fd. The buffers are gathered in array order, starting with the buffer defined by iov[0].
Like readv(), writev() completes atomically, with all data being transferred in a single operation from user memory to the file referred to by fd. Thus, when writing to a regular file, we can be sure that all of the requested data is written contiguously to the file, rather than being interspersed with writes by other processes (or threads).
As with write(), a partial write is possible. Therefore, we must check the return value from writev() to see if all requested bytes were written.
The primary advantages of readv() and writev() are convenience and speed. For example, we could replace a call to writev() by either:
code that allocates a single large buffer, copies the data to be written from other locations in the process's address space into that buffer, and then calls write() to output the buffer; or
a series of write() calls that output the buffers individually.
The first of these options, while semantically equivalent to using writev(), leaves us with the inconvenience (and inefficiency) of allocating buffers and copying data in user space.
The second option is not semantically equivalent to a single call to writev(), since the write() calls are not performed atomically. Furthermore, performing a single writev() system call is cheaper than performing multiple write() calls (refer to the discussion of system calls in Section 3.1).
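To make the gather operation concrete, here is a minimal sketch that outputs two separate strings with one writev() call and checks for a partial write. The helper name writeTwo() is hypothetical; struct iovec and writev() come from <sys/uio.h>.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write 'head' followed by 'tail' to 'fd' using a single writev() call.
   Returns 0 if all bytes were written, or -1 on error or partial write. */
int
writeTwo(int fd, const char *head, const char *tail)
{
    assert(head != NULL && tail != NULL);

    struct iovec iov[2];
    iov[0].iov_base = (void *) head;    /* writev() only reads the buffers */
    iov[0].iov_len = strlen(head);
    iov[1].iov_base = (void *) tail;
    iov[1].iov_len = strlen(tail);

    ssize_t total = (ssize_t) (iov[0].iov_len + iov[1].iov_len);
    ssize_t numWritten = writev(fd, iov, 2);    /* Gathered in array order */

    return (numWritten == total) ? 0 : -1;
}
```

A production version would normally retry after a partial write rather than treat it as a failure; the sketch keeps only the check that the text calls for.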
Performing scatter-gather
SUSv3 allows an implementation to place a limit on the number of elements in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the
This call makes a duplicate of oldfd by using the lowest unused file descriptor greater than or equal to startfd. This is useful if we want a guarantee that the new descriptor (newfd) falls in a certain range of values. Calls to dup() and dup2() can always be recoded as calls to close() and fcntl(), although the former calls are more concise. (Note also that some of the
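The range guarantee can be sketched as follows, assuming the call under discussion is the fcntl() F_DUPFD operation. The wrapper name dupAtLeast() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Duplicate 'oldfd' onto the lowest unused descriptor that is greater
   than or equal to 'minFd'. Returns the new descriptor, or -1 on error. */
int
dupAtLeast(int oldfd, int minFd)
{
    assert(minFd >= 0);

    return fcntl(oldfd, F_DUPFD, minFd);
}
```

The call fails if minFd is at or above the process's limit on open file descriptors, so callers should still check for -1.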
#include <unistd.h>
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
In process A, descriptors 1 and 20 both refer to the same open file description (labeled 23). This situation may arise as a result of a call to dup(), dup2(), or fcntl() (see Section 5.5).
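One visible consequence of two descriptors sharing an open file description is that they share a file offset. The sketch below illustrates this under stated assumptions: the scratch file template under /tmp and the helper name offsetSeenByDup() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* Expose mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Write 'len' bytes through one descriptor, then report the file offset
   as seen through a duplicate of it. Returns -1 on error. */
off_t
offsetSeenByDup(size_t len)
{
    static const char data[] = "0123456789";
    char path[] = "/tmp/dup_demo_XXXXXX";   /* Hypothetical scratch file */

    assert(len <= sizeof(data) - 1);

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* File disappears once it is closed */

    off_t result = -1;
    int dupFd = dup(fd);        /* Refers to the same open file description */
    if (dupFd != -1) {
        if (write(fd, data, len) == (ssize_t) len)
            result = lseek(dupFd, 0, SEEK_CUR);
        close(dupFd);
    }
    close(fd);
    return result;
}
```

Had the file instead been opened twice with open(), each descriptor would have its own open file description and therefore its own offset.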
Descriptor 2 of process A and descriptor 2 of process B refer to a single open file description (73). This scenario could occur after a call to fork() (i.e., process A is the parent of process B, or vice versa), or if one process passed an open descriptor to another process usin
[Figure: the file descriptor tables of Process A and Process B, the system-wide open file table (including status flags), and a further system-wide table that records file locks]
Using fcntl() to modify open file status flags is particularly useful in the following cases:
The file was not opened by the calling program, so that it had no control over the flags used in the open() call (e.g., the file may be one of the three standard descriptors that are opened before the program is started).
The file descriptor was obtained from a system call other than open(). Examples of such system calls are pipe()
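For the second case, a descriptor returned by pipe() can have O_NONBLOCK enabled after the fact. This is a minimal sketch with a hypothetical helper name, setNonblocking(); it fetches the current flags with F_GETFL so that the other status flags are preserved.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Enable O_NONBLOCK on an already open descriptor, leaving the other
   open file status flags untouched. Returns 0 on success, -1 on error. */
int
setNonblocking(int fd)
{
    assert(fd >= 0);

    int flags = fcntl(fd, F_GETFL);     /* Fetch current status flags */
    if (flags == -1)
        return -1;

    flags |= O_NONBLOCK;
    return (fcntl(fd, F_SETFL, flags) == -1) ? -1 : 0;
}
```

After this call, a read() from an empty pipe fails with the error EAGAIN instead of blocking.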
If we run two simultaneous instances of the program in Listing 5-1, we see that they both claim to have exclusively created the file:
$ ./bad_exclusive_open tfile sleep &
[Figure: timeline for Process A and Process B, marking where time slices begin, end, and expire, when each process is executing or waiting for the CPU, where the open() calls fail, and where the file is created]
5.1 Atomicity and Race Conditions
Atomicity is a concept that we'll encounter repeatedly when discussing the operation of system calls. All system calls are executed atomically. By this, we mean that the kernel guarantees that all of the step
ussed in subsequent chapters. Building on this model, we then explain how to duplicate file descriptors.
We then consider some system calls that provide extended read and write functionality. These system calls allow us to perform I/O at a specific location in a file
File I/O: The Universal I/O Model
Chapter 4
$ ls -l tfile                         Check size of file
-rw-r--r-- 1 mtk users 100003 Feb 10 10:35 tfile
$ ./seek_io tfile s10000 R5
using open(). I/O is then performed using read() and write(). After performing all I/O, we should free the file descriptor and its associated resources using close(). These system calls can be used to perform I/O on all types of files.
The fact that all file types and device drivers implement the same I/O interface allows for universality of I/O, meaning that a program can typically be used with any type of file without requiring code that is specific to the file type.
#include <sys/ioctl.h>
int ioctl(int fd, int request, ... /* argp */);
Value returned on success depends on request, or -1 on error
buf = malloc(len);
if (buf == NULL)
    errExit("malloc");

numRead = read(fd, buf, len);
if (numRead == -1)
    errExit("read");

if (numRead == 0) {
    printf("%s: end-of-file\n", argv[ap]);
} else {
    printf("%s: ", argv[ap]);
    for (j = 0; j < numRead; j++) {
        if (argv[ap][0] == 'r')
            printf("%c", isprint((unsigned char) buf[j]) ?
                                    buf[j] : '?');
        else
            printf("%02x ", (unsigned int) buf[j]);
    }
    printf("\n");
}

free(buf);
break;
Section 14.4 describes how holes are represented in a file, and Section 15.1 describes the stat() system call, which can tell us the current size of a file, as well as the number of blocks actually allocated to the file.
Example program
Listing 4-3 demonstrates the use of lseek() in conjunction with read() and write(). The first command-line argument to this program is the name of a file to be opened. The remaining arguments specify I/O operations to be performed on the file. Each of these operations consists of
We can't apply lseek() to all types of files. Applying lseek() to a pipe, FIFO,
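Whether a descriptor supports seeking can be probed at run time. The following sketch relies on the documented behavior that lseek() on a pipe fails with the error ESPIPE; the helper name isSeekable() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if 'fd' supports lseek(), 0 if it does not (ESPIPE),
   or -1 on any other error */
int
isSeekable(int fd)
{
    assert(fd >= 0);

    if (lseek(fd, 0, SEEK_CUR) != (off_t) -1)   /* Offset is left unchanged */
        return 1;

    return (errno == ESPIPE) ? 0 : -1;
}
```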
It is usually good practice to close unneeded file descriptors explicitly, since this makes our code more readable and reliable in the face of subsequent modifications. Furthermore, file descriptors are a consumable resource, so failure to close a file descriptor could result in a process running out of descriptors. This is a particularly important issue when writing long-lived programs that deal with multiple
#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);
doesn't place a terminating null byte at the end of the string that printf() is being asked to print. A moment's reflection leads us to realize that this must be so, since read() can be used to read any sequence of bytes from a file. In some cases, this input might be text, but in other cases, the input might be binary integers or C structures in binary form. There is no way for read() to tell the difference, and so it can't attend to the C convention of null terminating character strings. If a terminating null byte is required at the end of the input buffer, we must put it there explicitly:
char buffer[MAX_READ + 1];
ssize_t numRead;
numRead = read(STDIN_FILENO, buffer, MAX_READ);
if (numRead == -1)
errExit("read");
buffer[numRead] = '\0';
printf("The input data was: %s\n", buffer);
Because the terminating null byte requires a byte of memory, the size of buffer must be at least one greater than the largest string we expect to read.
4.5 Writing to a File: write()
The write() system call writes data to an open file.
The arguments to write() are similar to those for read(): buffer is the address of the data to be written; count is the number of bytes to write from buffer; and fd is a file descriptor referring to the file to which data is to be written.
On success,
EISDIR
The specified file is a directory, and the caller attempted to open it for writing. This isn't allowed. (On the other hand, there are occasions when it can be useful to open a directory for reading. We consider an example in Section 18.11.)
EMFILE
The process resource limit on the number of open file descriptors has
been reached (
RLIMIT_NOFILE
, described in Section 36.3).
ENFILE
The system-wide limit on the number of open files has been reached.
ENOENT
The specified file doesn't exist, and O_CREAT was not specified, or O_CREAT was specified, and one of the directories in pathname doesn't exist or is a symbolic link pointing to a nonexistent pathname (a dangling link).
EROFS
The specified file is on a read-only file system and the caller tried to open it
for writing.
(In reality, for an unprivileged process, it is the process's file-system user ID, rather than its effective user ID, that must match the user ID of the file when opening a file with the O_NOATIME flag, as described in Section 9.5.) This flag is a nonstandard Linux extension. To expose its definition from <fcntl.h>, we must define the _GNU_SOURCE feature test macro. The O_NOATIME flag is intended for use by indexing and backup programs. Its use can significantly reduce the amount of disk activity, because repeated disk seeks back and forth across the disk are not required to read the contents of a file and to update the last access time in the file's i-node (Section 14.4). Functionality similar to O_NOATIME is available using the MS_NOATIME mount() flag (Section 14.8.1) and the FS_NOATIME_FL flag (Section 15.5).
O_NOCTTY
If the file being opened is a terminal device, prevent it from becoming the controlling terminal. Controlling terminals are discussed in Section 34.4. If the file being opened is not a terminal, this flag has no effect.
O_NOFOLLOW
Normally, open() dereferences pathname if it is a symbolic link. However, if the O_NOFOLLOW flag is specified, then open() fails (with
could result in open file descriptors being unintentionally passed to unsafe programs. (We say more about race conditions in Section 5.1.)
O_CREAT
If the file doesn't already exist, it is created as a new, empty file. This flag is effective even if the file is being opened only for reading. If we specify O_CREAT, then we must supply a mode argument in the open() call; otherwise, the permissions of the new file will be
The constants in Table 4-3 are divided into the following groups:
File access mode flags: These are the O_RDONLY, O_WRONLY, and O_RDWR flags described
t the file descriptors of any process on the system. There is one file in this directory for each of the process's open file descriptors, with a name that matches the number of the descriptor. The
field in this file shows the curr
if (close(STDIN_FILENO) == -1) /* Close file descriptor 0 */
errExit("close");
fd = open(pathname, O_RDONLY);
if (fd == -1)
errExit("open");
Since file descriptor 0 is unused, open() is guaranteed to open the file using that descriptor. In Section 5.5, we look at the use of dup2() and fcntl() to achieve a similar result, but with more flexible control over the file descriptor used. In that section, we also show an example of why it can be useful to control the file descriptor on which a file is opened.
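The lowest-numbered-descriptor guarantee can be checked without disturbing standard input. The sketch below assumes a single-threaded process, so that no other descriptor is opened or closed between the calls; the helper name lowestFdIsReused() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open 'pathname', close it, and open it again. Returns 1 if the second
   open() received the same descriptor number as the first, 0 if not,
   or -1 on error. */
int
lowestFdIsReused(const char *pathname)
{
    assert(pathname != NULL);

    int first = open(pathname, O_RDONLY);
    if (first == -1)
        return -1;
    if (close(first) == -1)     /* 'first' is again the lowest unused slot */
        return -1;

    int second = open(pathname, O_RDONLY);
    if (second == -1)
        return -1;
    close(second);

    return first == second;
}
```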
4.3.1 The open() flags Argument
In some of the example open() calls shown in Listing 4-2, we included other bits (O_CREAT, O_TRUNC, and O_APPEND) in addition to the file access mode. We now consider the
/* Open existing file for reading */
fd = open("startup", O_RDONLY);
if (fd == -1)
errExit("open");
/* Open new or existing file for reading and writing, truncating to zero
bytes; file permissions read+write for owner, nothing for all others */
fd = open("myfile", O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
if (fd == -1)
errExit("open");
/* Open new or existing file for writing; writes should always
append to end of file */
fd = open("w.log", O_WRONLY | O_CREAT | O_TRUNC | O_APPEND,
S_IRUSR | S_IWUSR);
if (fd == -1)
errExit("open");
File descriptor number returned by open()
SUSv3 specifies that if open() succeeds, it is guaranteed to use the lowest-numbered unused file descriptor for the process. We can use this feature to ensure that a file is opened using a particular file descriptor. For example, the following sequence ensures that a file is opened using file descriptor 0.
File access modes
Access mode    Description
O_RDONLY       Open the file for reading only
O_WRONLY       Open the file for writing only
O_RDWR         Open the file for both reading and writing
4.2 Universality of I/O
One of the distinguishing features of the UNIX I/O model is the concept of universality of I/O. This means that the same four system calls (open(), read(), write(), and close()) are used to perform I/O on all types of files, including devices such as terminals. Consequently, if we write a program using only these system calls, that program will work on any type of file. For example, the following are all valid uses of the program in Listing 4-1:
$ ./copy test test.old                Copy a regular file
$ ./copy a.txt /dev/tty               Copy a regular file to this terminal
$ ./copy /dev/tty b.txt               Copy input from this terminal to a regular file
$ ./copy /dev/pts/16 /dev/tty         Copy input from another terminal
Universality of I/O is achieved by ensuring that each file system and device driver
We can use the program in Listing 4-1 as follows:
$ ./copy oldfile newfile
Listing 4-1: Using I/O system calls
fileio/copy.c
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

#ifndef BUF_SIZE        /* Allow "cc -D" to override definition */
#define BUF_SIZE 1024
#endif

int
main(int argc, char *argv[])
{
    int inputFd, outputFd, openFlags;
    mode_t filePerms;
    ssize_t numRead;
    char buf[BUF_SIZE];

    if (argc != 3 || strcmp(argv[1], "--help") == 0)
        usageErr("%s old-file new-file\n", argv[0]);

    /* Open input and output files */

    inputFd = open(argv[1], O_RDONLY);
    if (inputFd == -1)
        errExit("opening file %s", argv[1]);

    openFlags = O_CREAT | O_WRONLY | O_TRUNC;
    filePerms = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
                S_IROTH | S_IWOTH;      /* rw-rw-rw- */
    outputFd = open(argv[2], openFlags, filePerms);
    if (outputFd == -1)
        errExit("opening file %s", argv[2]);

    /* Transfer data until we encounter end of input or an error */

    while ((numRead = read(inputFd, buf, BUF_SIZE)) > 0)
        if (write(outputFd, buf, numRead) != numRead)
            fatal("couldn't write whole buffer");
    if (numRead == -1)
        errExit("read");

    if (close(inputFd) == -1)
        errExit("close input");
    if (close(outputFd) == -1)
        errExit("close output");

    exit(EXIT_SUCCESS);
}
fileio/copy.c
behalf by the shell, before the program is started. Or, more precisely, the program inherits copies of the shell's file descriptors, and the shell normally operates with these three file descriptors always open. (In an interactive shell, these three file descriptors normally refer to the terminal under which the shell is running.) If I/O redirections are specified on a command line, then the shell ensures that the file descriptors are suitably modified before starting the program.
When referring to these file descriptors in a program, we can use either the numbers (0, 1, or 2) or, preferably, the POSIX standard names defined in <unistd.h>.
Although the variables stdin, stdout, and stderr initially refer to the process's standard input, output, and error, they can be changed to refer to any file by using the freopen() library function. As part of its operation, freopen() may change the file descriptor underlying the reopened stream. In other words, after an freopen() on stdout, for example, it is no longer safe to assume that the underlying file descriptor is still 1.
The following are the four key system calls for performing file I/O (programming languages and software packages typically employ these calls only indirectly, via I/O libraries):
fd = open(pathname, flags, mode) opens the file identified by
FILE I/O: THE UNIVERSAL I/O MODEL
We now start to look in earnest at the system call API. Files are a good place to start, since they are central to the UNIX philosophy. The focus of this chapter is the system calls used for performing file input and output.
We introduce the concept of a file descriptor, and then look at the system calls that constitute the so-called universal I/O model. These are the system calls that open and close a file, and read and write data.
We focus on I/O on disk files. However, much of the material covered here is relevant for later chapters, since the same system calls are used for performing I/O on all types of files, such as pipes and terminals.
Chapter 3
is not required on Linux or by SUSv3, but because some (especially older) implementations may require it, we should include it in portable programs.
For many of the functions that it specified, POSIX.1-1990 required that the header <sys/types.h> be included before any other headers associated with the function. However, this requirement was redundant, because most contemporary UNIX implementations did not require applications to include this header for these functions. Consequently, SUSv1 removed this requirement. Nevertheless, when writing portable programs, it is wise to make this one of the first header files included. (However, we omit this header from our example programs because it is not required on Linux and omitting it allows us to make the example programs one line shorter.)
3.7 Summary
System calls allow processes to request services from the kernel. Even the simplest system calls have a significant overhead by comparison with a user-space function call, since the system must temporarily switch to kernel mode to execute the system call, and the kernel must verify system call arguments and transfer
System Programming Concepts
Although SUSv3 specifies structures such as sembuf, it is important to realize the following:
In general, the order of field definitions within such structures is not specified.
In some cases, extra implementation-specific fields may be included in such
Consequently, it is not portable to use a structure initializer such as the following:
struct sembuf s = { 3, -1, SEM_UNDO };
Although this initializer will work on Linux, it won't work on another implementation where the fields in the structure are defined in a different order. To portably initialize such structures, we must use explicit assignment statements, as in
struct sembuf s;
s.sem_num = 3;
s.sem_op = -1;
s.sem_flg = SEM_UNDO;
If we are using C99, then we can employ that language's new syntax for structure initializers to write an equivalent initialization:
struct sembuf s = { .sem_num = 3, .sem_op = -1, .sem_flg = SEM_UNDO };
Considerations about the order of the members of standard structures also apply if we want to write the contents of a standard structure to a file. To do this portably, we can't simply do a binary write of the structure. Instead, the structure fields must be written individually (probably in text form) in a specified order.
Using macros that may not be pr
In some cases, a macro may not be defined on all UNIX implementations. For
example, the
WCORE
Printing system data type values
When printing values of one of the numeric system data types shown in Table 3-1 (e.g., pid_t and uid_t), we must be careful not to include a representation dependency in the printf() call. A representation dependency can occur because C's argument promotion rules convert values of type short to int, but leave values of type int and long unchanged. This means that, depending on the definition of the system data type, either an int or a long is passed in the printf() call. However, because
When discussing the data types in Table 3-1 in later chapters, we'll often make statements that some type "is an integer type [specified by SUSv3]." This means that SUSv3 requires the type to be defined as an integer, but doesn't require that a particular native integer type (e.g., short, int, or long) be used. (Often, we won't say which particular native data type is actually used to represent each of the system data types in Linux, because a portable application should be written so that it doesn't care which data type is used.)
msgqnum_t   unsigned integer   Counts of messages in System V message queue (Section 46.4)
nfds_t      unsigned integer   Number of file descriptors for poll() (Section 63.2.2)
nlink_t     integer            Count of (hard) links to a file (Section 15.1)
off_t
Each of these types is defined using the C typedef feature. For example, the pid_t data type is intended for representing process IDs, and on Linux/x86-32 this type is defined as follows:
typedef int pid_t;
Most of the standard system data types have names ending in _t. Many of them are declared in the header file <sys/types.h>, although a few are defined in other
An application should employ these type definitions to portably declare the variables it uses. For example, the following declaration would allow an application to correctly represent process IDs on any SUSv3-conformant system:
pid_t mypid;
Table 3-1 lists some of the system data types we'll encounter in this book. For certain types in this table, SUSv3 requires that the type be implemented as an
_POSIX_C_SOURCE, _XOPEN_SOURCE, and POSIX.1/SUS
Only the _POSIX_C_SOURCE and _XOPEN_SOURCE feature test macros are specified in POSIX.1-2001/SUSv3, which requires that these macros be defined with the values 200112 and 600, respectively, in conforming applications. Defining _POSIX_C_SOURCE as 200112 provides conformance to the POSIX.1-2001 base specification (i.e., POSIX conformance, excluding the XSI extension). Defining _XOPEN_SOURCE as 600 provides conformance to SUSv3 (i.e., XSI conformance, the base specification plus the XSI extension). Analogous statements apply for POSIX.1-2008/SUSv4, which requires that the two macros be defined with the values 200809 and 700.
glibc
if (res > INT_MAX || res < INT_MIN)
static void
gnFail(const char *fname, const char *msg, const char *arg, const char *name)
{
    fprintf(stderr, "%s error", fname);
    if (name != NULL)
        fprintf(stderr, " (in %s)", name);
    fprintf(stderr, ": %s\n", msg);
    if (arg != NULL && *arg != '\0')
        fprintf(stderr, "        offending text: %s\n", arg);

    exit(EXIT_FAILURE);
}

static long
getNum(const char *fname, const char *arg, int flags, const char *name)
{
    long res;
    char *endptr;
    int base;

    if (arg == NULL || *arg == '\0')
        gnFail(fname, "null or empty string", arg, name);

    base = (flags & GN_ANY_BASE) ? 0 : (flags & GN_BASE_8) ? 8 :
                        (flags & GN_BASE_16) ? 16 : 10;

    errno = 0;                  /* So that we can detect a strtol() error */
    res = strtol(arg, &endptr, base);
    if (errno != 0)
        gnFail(fname, "strtol() failed", arg, name);

    if (*endptr != '\0')
        gnFail(fname, "nonnumeric characters", arg, name);

    if ((flags & GN_NONNEG) && res < 0)
        gnFail(fname, "negative value not allowed", arg, name);

    if ((flags & GN_GT_0) && res <= 0)
        gnFail(fname, "value must be > 0", arg, name);

    return res;
}
If the name argument is non-NULL, it should contain a string identifying the argument in arg. This string is included as part of any error message displayed by these functions.
The flags argument provides some control over the operation of the
Listing 3-4:
Linux error names (x86-32 version)

lib/ename.c.inc
static char *ename[] = {
/* 0 */ "",
/* 1 */ "EPERM", "ENOENT", "ESRCH", "EINTR", "EIO", "ENXIO", "E2BIG",
/* 8 */ "ENOEXEC", "EBADF", "ECHILD", "EAGAIN/EWOULDBLOCK", "ENOMEM",
/* 13 */ "EACCES", "EFAULT", "ENOTBLK", "EBUSY", "EEXIST", "EXDEV",
/* 19 */ "ENODEV", "ENOTDIR", "EISDIR", "EINVAL", "ENFILE", "EMFILE",
/* 25 */ "ENOTTY", "ETXTBSY", "EFBIG", "ENOSPC", "ESPIPE", "EROFS",
/* 31 */ "EMLINK", "EPIPE", "EDOM", "ERANGE", "EDEADLK/EDEADLOCK",
/* 36 */ "ENAMETOOLONG", "ENOLCK", "ENOSYS", "ENOTEMPTY", "ELOOP", "",
/* 42 */ "ENOMSG", "EIDRM", "ECHRNG", "EL2NSYNC", "EL3HLT", "EL3RST",
/* 48 */ "ELNRNG", "EUNATCH", "ENOCSI", "EL2HLT", "EBADE", "EBADR",
/* 54 */ "EXFULL", "ENOANO", "EBADRQC", "EBADSLT", "", "EBFONT", "ENOSTR",
void
cmdLineErr(const char *format, ...)
{
    va_list argList;

    fflush(stdout);             /* Flush any pending stdout */

    fprintf(stderr, "Command-line usage error: ");
    va_start(argList, format);
    vfprintf(stderr, format, argList);
    va_end(argList);

    fflush(stderr);             /* In case stderr is not line-buffered */
    exit(EXIT_FAILURE);
}

lib/error_functions.c
The file enames.c.inc included by Listing 3-3 is shown in Listing 3-4. This file defines an array of strings, ename, that are the symbolic names corresponding to each of the possible errno values. Our error-handling functions use this array to print out the symbolic name corresponding to a particular error number. This is a workaround to deal with the facts that
void
err_exit(const char *format, ...)
{
    va_list argList;

    va_start(argList, format);
    outputError(TRUE, errno, FALSE, format, argList);
    va_end(argList);

    terminate(FALSE);
}

void
errExitEN(int errnum, const char *format, ...)
{
    va_list argList;

    va_start(argList, format);
    outputError(TRUE, errnum, TRUE, format, argList);
    va_end(argList);

    terminate(TRUE);
}

void
fatal(const char *format, ...)
{
    va_list argList;

    va_start(argList, format);
    outputError(FALSE, 0, TRUE, format, argList);
    va_end(argList);

    terminate(TRUE);
}

void
usageErr(const char *format, ...)
{
    va_list argList;

    fflush(stdout);             /* Flush any pending stdout */

    fprintf(stderr, "Usage: ");
    va_start(argList, format);
    vfprintf(stderr, format, argList);
    va_end(argList);

    fflush(stderr);             /* In case stderr is not line-buffered */
    exit(EXIT_FAILURE);
}
static void
outputError(Boolean useErr, int err, Boolean flushStdout,
        const char *format, va_list ap)
{
#define BUF_SIZE 500
    char buf[BUF_SIZE], userMsg[BUF_SIZE], errText[BUF_SIZE];

    vsnprintf(userMsg, BUF_SIZE, format, ap);

    if (useErr)
        snprintf(errText, BUF_SIZE, " [%s %s]",
                (err > 0 && err <= MAX_ENAME) ?
                ename[err] : "?UNKNOWN?", strerror(err));
    else
        snprintf(errText, BUF_SIZE, ":");

    snprintf(buf, BUF_SIZE, "ERROR%s %s\n", errText, userMsg);

    if (flushStdout)
        fflush(stdout);         /* Flush any pending stdout */
    fputs(buf, stderr);
    fflush(stderr);             /* In case stderr is not line-buffered */
}

void
errMsg(const char *format, ...)
{
    va_list argList;
    int savedErrno;

    savedErrno = errno;         /* In case we change it here */

    va_start(argList, format);
    outputError(TRUE, errno, TRUE, format, argList);
    va_end(argList);

    errno = savedErrno;
}

void
errExit(const char *format, ...)
{
    va_list argList;

    va_start(argList, format);
    outputError(TRUE, errno, TRUE, format, argList);
    va_end(argList);

    terminate(TRUE);
}
The fatal() function is used to diagnose general errors, including errors from
nonempty string value, by calling abort() to produce a core dump file for use with the debugger. (We explain core dump files in Section 22.1.)
The err_exit() function is similar to errExit(), but differs in two respects:
It doesn't flush standard output before printing the error message.
It terminates the process by calling _exit() instead of exit(). This causes the process to terminate without flushing stdio buffers or invoking exit handlers.
The details of these differences in the operation of err_exit() will become clearer in Chapter 25, where we describe the differences between _exit() and exit(), and consider the treatment of stdio buffers and exit handlers in a child created by fork(). For now, we simply note that err_exit() is especially useful if we write a library function that creates a child process that needs to terminate because of an error. This termination should occur without flushing the child's copy of the parent's (i.e., the calling process's) stdio buffers and without invoking exit handlers established by the parent.
The errExitEN() function is the same as errExit(), except that instead of printing the error text corresponding to the current value of errno, it prints the text corresponding to the error number (thus, the EN suffix) given in the argument errnum.
Mainly, we use errExitEN() in programs that employ the POSIX threads API.
Unlike traditional UNIX system calls, whic
Listing 3-2:
Declarations for common error-handling functions
lib/error_functions.h
#ifndef ERROR_FUNCTIONS_H
#define ERROR_FUNCTIONS_H
void errMsg(const char *format, ...);
#ifdef __GNUC__
/* This macro stops 'gcc -Wall' complaining that "control reaches
end of non-void function" if we use the following functions to
   terminate main() or some other non-void function. */
Each of our example programs that has a nontrivial command-line syntax provides a simple help facility for the user: if invoked with the --help option, the program displays a usage message that indicates the syntax for command-line options
3.5.2 Common Functions and Header Files
Most of the example programs include a header file containing commonly required definitions, and they also use a set of common functions. We discuss the header file and functions in this section.
Common header file
Listing 3-1 is the header file used by nearly every program in this book. This header file includes various other header files used by many of the example programs, defines a Boolean data type, and defines macros for calculating the minimum and maximum of two numeric values. Using this header file allows us to make the example programs a bit shorter.
Listing 3-1:
Header file used by most example programs
lib/tlpi_hdr.h
#ifndef TLPI_HDR_H
#define TLPI_HDR_H /* Prevent accidental double inclusion */
#include <sys/types.h>  /* Type definitions used by many programs */
#include <stdio.h>      /* Standard I/O functions */
#include <stdlib.h>     /* Prototypes of commonly used library functions,
                           plus EXIT_SUCCESS and EXIT_FAILURE constants */
#include <unistd.h>     /* Prototypes for many system calls */
#include <errno.h>      /* Declares errno and defines error constants */
#include <string.h>     /* Commonly used string-handling functions */
if (close(fd) == -1) {
    /* Code to handle the error */
}
When a system call fails, it sets the global integer variable errno to a positive value that identifies the specific error. Including the <errno.h> header file provides a declaration of errno, as well as a set of constants for the various error numbers. All of these symbolic names commence with E. The section headed ERRORS in each manual page provides a list of possible errno values that can be returned by each system call. Here is a simple example of the use of errno to diagnose a system call error:
cnt = read(fd, buf, numbytes);
if (cnt == -1) {
    if (errno == EINTR)
        fprintf(stderr, "read was interrupted by a signal\n");
    else {
        /* Some other error occurred */
    }
}
Successful system calls and library functions never reset errno to 0, so this variable may have a nonzero value as a consequence of an error from a previous call. Furthermore, SUSv3 permits a successful function call to set errno to a nonzero value (although few functions do this). Therefore, when checking for an error, we should
function to
call just outputs a block of bytes. Similarly, the malloc() and free() functions perform various bookkeeping tasks that make them a much easier way to allocate and free memory than the underlying brk() system call.
3.3 The Standard C Library; The GNU C Library (glibc)
There are different implementations of the standard C library on the various UNIX implementations. The most commonly used implementation on Linux is the GNU C library (glibc, http://www.gnu.org/software/libc/).
The principal developer and maintainer of the GNU C library was initially Roland McGrath. Nowadays, this task is carried out by Ulrich Drepper.
Various other C libraries are available for Linux, including libraries with smaller memory requirements for use in embedded device applications. Examples include
Figure 3-1: Steps in the execution of a system call
Appendix A describes the strace command, which can be used to trace the system calls made by a program, either for debugging purposes or simply to investigate what a program is doing.
More information about the Linux system call mechanism can be found in
[Figure 3-1 diagram; surviving labels: program; switch to kernel mode; trap handler; system call service routine; (arguments: __NR_execve, path, argv, envp); arch/x86/kernel/entry_32.S; switch to user mode; Kernel Mode]
routine performs the required task, which may involve modifying values at
addresses specified in the given argu
new process, performing I/O, and creating a pipe for interprocess communication. (The syscalls(2) manual page lists the Linux system calls.)
SYSTEM PROGRAMMING
This chapter covers various topics that are prerequisites for system programming. We begin by introducing system calls and detailing the steps that occur during their execution. We then consider library functions and how they differ from system calls, and couple this with a description of the (GNU) C library.
Whenever we make a system call or call a library function, we should always
Chapter 2
don't strictly qualify as realtime, most UNIX implementations now support some or all of these extensions. (During the course of this book, we describe those features of POSIX.1b that are supported by Linux.)
In this book, we use the term real time to refer to the concept of calendar or elapsed time, and the term realtime to denote an operating system or application providing the type of responsiveness described in this section.
2.19 The /proc File System
Like several other UNIX implementations, Linux provides a /proc file system, which consists of a set of directories and files mounted under the /proc directory.
The /proc file system is a virtual file system that provides an interface to kernel data structures in a form that looks like files and directories on a file system. This provides an easy mechanism for viewing and changing various system attributes. In addition, a set of directories with names of the form /proc/PID, where PID is a process ID, allows us to view information about each process running on the system.
The contents of /proc files are generally in human-readable text form and can be parsed by shell scripts. A program can simply open and read from, or write to, the desired file. In most cases, a process must be privileged to modify the contents of files in the /proc directory.
As we describe various parts of the Linux programming interface, we'll also describe the relevant /proc files. Section 12.1 provides further general information on this file system. The /proc file system is not specified by any standards, and the details that we describe are Linux-specific.
2.20 Summary
In this chapter, we surveyed a range of fundamental concepts related to Linux system programming. An understanding of these concepts should provide readers with limited experience on Linux or UNIX with enough background to begin learning system programming.
Fundamental Concepts
The client and server may reside on the same host computer or on separate
program undergoes the usual input processing performed by the terminal driver (for example, in the default mode, a carri
All major shells, except the Bourne shell, provide an interactive feature called job control, which allows the user to simultaneously execute and manipulate multiple commands or pipelines. In job-control shells, all of the processes in a pipeline are placed in a new process group or job. (In the simple case of a shell command line containing a single command, a new process group containing just a single process is created.) Each process in a process group has the same integer process group identifier, which is the same as the process ID of one of the processes in the group, termed the process group leader.
The kernel allows for various actions, notably the delivery of signals, to be performed on all members of a process group. Job-control shells use this feature to allow the user to suspend or resume all of the processes in a pipeline, as described in the next section.
2.14 Sessions, Controlling Terminals, and Controlling Processes
A session is a collection of process groups (jobs). All of the processes in a session have the same session identifier. A session leader is the process that created the session, and its process ID becomes the session ID.
Sessions are used mainly by job-control shells. All of the process groups created by a job-control shell belong to the same session as the shell, which is the session leader.
Sessions usually have an associated controlling terminal. The controlling terminal is established when the session leader process first opens a terminal device. For a session created by an interactive shell, this is the terminal at which the user logged in. A terminal may be the controlling terminal of at most one session.
As a consequence of opening the controlling terminal, the session leader becomes the controlling process for the terminal. The controlling process receives a SIGHUP signal if a terminal disconnect occurs (e.g., if the terminal window is closed).
At any point in time, one process group in a session is the foreground process group (foreground job), which may read input from the terminal and send output to it. If the user types the interrupt character (usually Control-C) or the suspend character (usually Control-Z) on the controlling terminal, then the terminal driver sends a signal that kills or suspends (i.e., stops) the foreground process group. A session can have any number of background process groups (background jobs), which are created by terminating a command with the ampersand (&) character.
Job-control shells provide commands for listing all jobs, sending signals to jobs,
When a process receives a signal, it takes one of the following actions, depend-
ing on the signal:
it ignores the signal;
it is killed by the signal; or
it is suspended until later being resumed by receipt of a special-purpose signal.
For most signal types, instead of accepting the default signal action, a program can choose to ignore the signal (useful if the
Therefore, Linux, like all modern UNIX implementations, provides a rich set of mechanisms for interprocess communication (IPC), including the following:
signals, which are used to indicate that an event has occurred;
pipes (familiar to shell users as the | operator) and FIFOs, which can be used to
Static libraries
Static libraries (sometimes also known as archives) were the only type of library on early UNIX systems. A static library is essentially a structured bundle of compiled object modules. To use functions from a static library, we specify that library in the link command used to build a program. After resolving the various function references from the main program to the modules in the static library, the linker extracts copies of the required object modules from the library and copies these into the resulting executable file. We say that such a program is statically linked.
The fact that each statically linked program includes its own copy of the object modules required from the library creates a number of disadvantages. One is that the duplication of object code in different executable files wastes disk space. A corresponding waste of memory occurs when statically linked programs using the same library function are executed at the same time; each program requires its own copy of the function to reside in memory. Additionally, if a library function requires modification, then, after recompiling that function and adding it to the static library, all applications that need to use the updated function must be relinked against the library.
Shared libraries
Shared libraries were designed to address the problems with static libraries.
If a program is linked against a shared library, then, instead of copying object modules from the library into the executable, the linker just writes a record into the executable to indicate that at run time the executable needs to use that shared library. When the executable is loaded into memory at run time, a program called the dynamic linker ensures that the shared libraries required by the executable are found and loaded into memory, and performs run-time linking to resolve the function calls in the executable to the corresponding definitions in the shared libraries. At run time, only a single copy of the code of the shared library needs to be resident in memory; all running programs can use that copy.
The fact that a shared library contains the sole compiled version of a function saves disk space. It also greatly eases the job of ensuring that programs employ the newest version of a function. Simply rebuilding the shared library with the new function definition causes existing programs to automatically use the new definition when they are next executed.
2.10 Interprocess Communication and Synchronization
A running Linux system consists of numerous processes, many of which operate independently of each other. Some processes, however, cooperate to achieve their intended purposes, and these processes ne
hard limit, which is a ceiling on the value to which the soft limit may be adjusted. An unprivileged process may change its soft limit for a particular resource to any value in the range from zero up to the corresponding hard limit, but can only lower its
When a new process is created with fork(), it inherits copies of its parent's
A daemon is a special-purpose process that is created and handled by the system in the same way as other processes, but which is distinguished by the following characteristics:
It is long-lived. A daemon process is often started at system boot and remains in existence until the system is shut down.
It runs in the background, and has no controlling terminal from which it can read input or to which it can write output.
Examples of daemon processes include syslogd, which records messages in the system log, and httpd, which serves web pages via the Hypertext Transfer Protocol (HTTP).
Each process has an environment list, which is a set of environment variables that are maintained within the user-space memory of the process. Each element of this list consists of a name and an associated value. When a new process is created via fork(), it inherits a copy of its parent's environment. Thus, the environment provides a mechanism for a parent process to communicate information to a child process. When a process replaces the program that it is running using exec(), the new program either inherits the environment used by the old program or receives a new environment specified as part of the exec() call.
Environment variables are created with the export command in most shells (or
Effective user ID and effective group ID: These two IDs (in conjunction with the supplementary group IDs discussed in a moment) are used in determining the permissions that the process has when accessing protected resources such as files and interprocess communication objects. Typically, the process's effective IDs have the same values as the corresponding real IDs. Changing the effective IDs is a mechanism that allows a process to assume the privileges of another user or group, as described in a moment.
Supplementary group IDs: These IDs identify additional groups to which a process belongs. A new process inherits its supplementary group IDs from its parent.
parent process. The child inherits copies of the parent's data, stack, and heap segments, which it may then modify independently of the parent's copies. (The program text, which is placed in memory marked as read-only, is shared by the two
The child process goes on either to exec
Filters
A filter is the name often applied to a program that reads its input from stdin, performs some transformation of that input, and writes the transformed data to stdout. Examples of filters include cat, grep, tr, sort, wc, sed, and awk.
Command-line arguments
In C, programs can access the command-line arguments, the words that are supplied on the command line when the program is run. To access the command-line arguments, the main() function of the program is declared as follows:
int main(int argc, char *argv[])
The argc variable contains the total number of command-line arguments, and the individual arguments are available as strings pointed to by members of the array argv. The first of these strings, argv[0], identifies the name of the program itself.
2.7 Processes
Put most simply, a process is an instance of an executing program. When a program is executed, the kernel loads the code of the program into virtual memory, allo-
, and so on) are used to perform I/O on all types of files, including devices. (The kernel translates the application's I/O requests into appropriate file-system or device-driver operations that pe
open() takes a pathname argument specifying a file upon which I/O is to be performed.
Normally, a process inherits three open file descriptors when it is started by the shell: descriptor 0 is standard input, the file from which the process takes its input; descriptor 1 is standard output, the file to which the process writes its output; and descriptor 2 is standard error, the file to which the process writes error messages and notification of exceptional or abnormal conditions. In an interactive shell or program, these three descriptors are normally connected to the terminal. In the stdio library, these descriptors correspond to the file streams stdin, stdout, and stderr.
The stdio library
To perform file I/O, C programs typically employ I/O functions contained in the
A pathname describes the location of a file within the single directory hierarchy, and is either absolute or relative:
An absolute pathname begins with a slash (/) and specifies the location of a file with respect to the root directory. Examples of absolute pathnames for files in Figure 2-1 are /home/mtk/.bashrc, /usr/include, and / (the pathname of the root directory).
A relative pathname specifies the location of a file relative to a process's current working directory (see below), and is distinguished from an absolute pathname by the absence of an initial slash. In Figure 2-1, from the directory usr, the file types.h could be referenced using the relative pathname include/sys/types.h, while from the directory avr, the file .bashrc could be accessed using the relative pathname ../mtk/.bashrc.
Current working directory
Each process has a current working directory
Symbolic links
Like a normal link, a symbolic link provides an alternative name for a file. But whereas a normal link is a filename-plus-pointer entry in a directory list, a symbolic link is a specially marked file containing the name of another file. (In other words, a symbolic link has a filename-plus-pointer entry in a directory, and the file referred to by the pointer contains a string that names another file.) This latter file is often called the
2.4 Single Directory Hierarchy,
The kernel maintains a single hierarchical directory structure to organize all files in the system. (This contrasts with operating systems such as Microsoft Windows, where each disk device has its own directory hierarchy.) At the base of this hierarchy is the root directory, named / (slash). All files and directories are children or further removed descendants of the root directory. Figure 2-1 shows an example of this hierarchical file structure.
Figure 2-1: [directory hierarchy diagram; surviving labels: boot, vmlinuz; legend: directory, regular file]
2.3 Users and Groups
Each user on the system is uniquely id
Users
Every user of the system has a unique username and a corresponding user ID (UID). For each user, these are defined by a line in the system
doesn't know where it is located in RAM or
The Linux kernel executable typically resides at the pathname /boot/vmlinuz
This chapter introduces a range of concepts related to Linux system programming. It is intended for readers who have worked primarily with other operating systems, or who have only limited experience with Linux or another UNIX implementation.
2.1 The Core Operating System: The Kernel
The term operating system is commonly used with two different meanings:
To denote the entire package consisting of the central software managing a computer's resources and all of the accompanying standard software tools, such as command-line interpreters, graphical user interfaces, file utilities, and editors.
More narrowly, to refer to the central software that manages and allocates computer resources (i.e., the CPU, RAM, and devices).
The term kernel is often used as a synonym for the second meaning, and it is with this meaning of the term operating system that we are concerned in this book.
Although it is possible to run programs on a computer without a kernel, the presence of a kernel greatly simplifies the writing and use of other programs, and increases the power and flexibility available to programmers. The kernel does this by providing a software layer to manage the limited resources of a computer.
Chapter 1
join him in improving the kernel. Many programmers did so, and, over time, Linux was extended and ported to a wide
History and Standards
integrated into the mainline. For example, version 3 of the journaling file system was part of some Linux distributions long before it was accepted into the mainline 2.4 kernel.
The upshot of the preceding points is that there are (mostly minor) differences in the systems offered by the various Linux distribution companies. On a much smaller scale, this is reminiscent of the splits in implementations that occurred in the early years of UNIX. The Linux Standard Base (LSB) is an effort to ensure compatibility among the various Linux distributions. To do this, the LSB (
www.linux-foundation.org/en/LSB
defined the interface that a UNIX implementation must provide in order to be able to call itself System V Release 4. (The SVID is available online at http://www.sco.com/
Because the behavior of some system ca
and it would need to repeat this testing with each new distribution release. Nevertheless, it is the de facto near-conformance to various standards that has enabled
Figure 1-1: [Figure omitted. Recoverable labels: C89, ISO C 90 (1989); POSIX 1003.1 (1988, IEEE; 1990, ISO); Realtime; Threads; Shell & utilities (1992); SUS, UNIX; UNIX 98, XPG5; Additional real-time extensions; Advanced real-time extensions; 2001, Austin CSRG; 2008, Austin CSRG.]
Some functions specified as options in SUSv3 become a mandatory part of the
base standard in SUSv4. For example, a number of functions that were part of
the XSI extension in SUSv3 become part of the base standard in SUSv4.
Among the functions that become mandatory in SUSv4 are those in the dlopen API (Section 42.1), the realtime signals API (Section 22.8), the POSIX semaphore API (Chapter 53), and the POSIX timers API (Section 23.6).
The additional interfaces and behaviors required for XSI conformance are collectively known as the XSI extension. They include support for features such as threads, mmap() and munmap(), the dlopen API, resource limits, pseudoterminals, System V IPC, the syslog API, poll(), and login accounting.
In later chapters, when we talk about SUSv3 conformance, we mean XSI conformance.
Because POSIX and SUSv3 are now part of the same document, the additional interfaces and the selection of mandatory options required for SUSv3 are indicated via the use of shading and margin markings within the document text.
Unspecified and weakly specified
Occasionally, we refer to an interface as being "unspecified" or "weakly specified" within SUSv3.
By an unspecified interface, we mean one that is not defined at all in the formal standard, although in a few cases there are background notes or rationale text that mention the interface.
Saying that an interface is weakly specified is shorthand for saying that, while the interface is included in the standard, im
The SUSv3 base specifications consist of around 3700 pages, divided into the following four parts:
Base Definitions (XBD): This part contains definitions, terms, concepts, and specifications of the contents of header files. A total of 84 header file specifications are provided.
System Interfaces (XSH): This part begins with various useful background information. Its bulk consists of the specification of various functions (which are implemented as either system calls or library functions on specific UNIX implementations). A total of 1123 system interfaces are included in this part.
Shell and Utilities (XCU): This specifies the operation of the shell and various UNIX commands. A total of 160 utilities are specified in this part.
Rationale (XRAT): This part includes informative text and justifications relating to the earlier parts.
In addition, SUSv3 includes the X/Open CURSES Issue 4 Version 2 (XCURSES) specification, which specifies 372 functions and 3 header files for the curses screen-handling API.
In all, 1742 interfaces are specified in SUSv3. By contrast, POSIX.1-1990 (with FIPS 151-2) specified 199 interfaces, and POSIX.2-1992 specified 130 utilities.
SUSv3 is available online at http://www.unix.org/version3/online.html. UNIX implementations certified against SUSv3 can call themselves UNIX 03.
There have been various minor fixes and improvements for problems discovered since the ratification of the original SUSv3 text. These have resulted in the appearance of Technical Corrigendum Number 1, whose improvements were incorporated in a 2003 revision of SUSv3, and Technical Corrigendum Number 2, whose improvements were incorporated in a 2004 revision.
POSIX conformance, XSI conformance, and the XSI extension
Historically, the SUS (and XPG) standards deferred to the corresponding POSIX standards and were structured as functional
is both an IEEE standard and an Open Group Technical Standard (i.e., as noted already, it is a consolidation of earlier POSIX and SUS standards). This document defines two levels of conformance:
POSIX conformance: This defines a baseline of interfaces that a conforming implementation must provide. It permits the implementation to provide other optional interfaces.
X/Open System Interface (XSI) conformance: To be XSI conformant, an implemen-
UNIX 03
1.3.3 X/Open Company and The Open Group
X/Open Company was a consortium formed by an international group of computer vendors to adopt and adapt existing standards in order to produce a compre-
POSIX.1 was initially based on an earlier (1984) standard produced by an association of UNIX vendors called /usr/group.
POSIX.1 documents an API for a set of services that should be made available to a program by a conforming operating system. An operating system that does this can be certified as POSIX.1 conformant.
POSIX.1 is based on the UNIX system call and the C library function API, but it doesn't require any particular implementation to be associated with this interface. This means that the interface can be implemented by any operating system, not specifically a UNIX operating system. In fact, some vendors have added APIs to
function prototypes, structure assignment, type qualifiers (const and volatile), enumeration types, and the void keyword.
These factors created a drive for C standardization that culminated in 1989 with the approval of the American National Standards Institute (ANSI) C standard (X3.159-1989), which was subsequently adopted in 1990 as an International Standards Organization (ISO) standard (ISO/IEC 9899:1990). As well as defining the syntax and semantics of C, this standard described the operation of the standard C library, which includes the stdio functions, string-handling functions, math functions, various header files, and so on. This version of C is usually known as C89 or (less commonly) ISO C90, and is fully described in the second (1988) edition of Kernighan and Ritchie's The C Programming Language.
A revision of the C standard was adopted by ISO in 1999 (ISO/IEC 9899:1999; see http://www.open-std.org/jtc1/sc22/wg14/www/standards). This standard is usually referred to as C99, and includes a range of changes to the language and its standard library. These changes include the addition of long long and Boolean data types, C++-style (//) comments, restricted pointers, and variable-length arrays. (At the time of writing, work is in progress on a further revision of the C standard, informally named C1X. The new standard is expected to be ratified in 2011.)
architectures began to appear, starting with an early port to the Digital Alpha chip. The list of hardware architectures to which Linux has been ported continues to grow and includes x86-64, Motorola/IBM PowerPC and PowerPC64, Sun SPARC and SPARC64 (UltraSPARC), MIPS, ARM (Acorn), IBM zSeries (formerly System/390), Intel IA-64 (Itanium; see [Mosberger & Eranian, 2002]), Hitachi SuperH, HP PA-RISC, and Motorola 68000.
Linux distributions
Precisely speaking, the term Linux refers just to the kernel developed by Linus Torvalds and others. However, the term Linux is commonly used to mean the kernel, plus a wide range of other software (tools and
and bug fixes. When the current development branch was deemed suitable for release, it became the new stable branch and was assigned an even minor version number. For example, the 2.3.x development kernel branch resulted in the 2.4 stable kernel branch.
Following the 2.6 kernel release, the development model was changed. The main motivation for this change arose from problems and frustrations caused by
DragonFly BSD, appeared after a split from FreeBSD 4.x. DragonFly BSD takes a different approach from FreeBSD 5.x
Torvalds therefore started on a project to create an efficient, full-featured UNIX kernel to run on the 386. Over a few months, Torvalds developed a basic kernel that allowed him to compile and run various GNU programs. Then, on October 5, 1991, Torvalds requested the help of other programmers, making the following now much-quoted announcement of version 0.02 of his kernel in the comp.os.minix Usenet newsgroup:
Do you pine for the nice days of Minix-1.1, when men were men and wrote their own device drivers? Are you without a nice
software. Much of the software in a Li
licensed under the GPL or one of a number of similar licenses. Software licensed under the GPL must be made available in source code form, and must be freely redistributable under the terms of the GPL. Modifications to GPL-licensed software are freely permitted, but any distribution of such modified software must also be under the terms of the GPL. If the modified software is distributed in executable form, the author must also allow any recipients the option of obtaining the modified source for no more than the cost of distribution. The first version of the GPL was released in 1989. The current version of the license, version 3, was released in 2007. Version 2 of the license, released in 1991, remains in wide use, and is the license used for the Linux kernel. (Discussions of various free software licenses can be found in [St. Laurent, 2004] and [Rosen, 2005].)
The GNU project did not initially produce a working UNIX kernel, but did produce a wide range of other programs. Since these programs were designed to run on a UNIX-like operating system, they could be, and were, used on existing UNIX implementations and, in some cases, even ported to other operating systems. Among the more well-known programs produced by the GNU project are the Emacs text editor, GCC (originally the GNU C compiler, but now renamed the GNU compiler collection, comprising compilers for C, C++, and other languages), the bash shell, and glibc (the GNU C library).
By the early 1990s, the GNU project had produced a system that was virtually
acquisitions, HP Tru64 UNIX), IBM's AIX, Hewlett-Packard's (HP's) HP-UX, NeXT's NeXTStep, A/UX for the Apple Macintosh, and Microsoft and SCO's XENIX for the Intel x86-32 architecture. (Throughout this book, the Linux implementation for x86-32 is referred to as Linux/x86-32.) This situation was in sharp
and a FORTRAN 77 compiler. The release of Seventh Edition is also significant because, from this point, UNIX diverged into two important variants: BSD and System V, whose origins we now briefly describe.
Thompson spent the 1975/1976 academic year as a visiting professor at the University of California at Berkeley, the university from which he had graduated. There, he worked with several graduate students, adding many new features to UNIX. (One of these students, Bill Joy, subsequently went on to cofound Sun Microsystems, an early entry in the UNIX
UNIX First through Sixth editions
The other common meaning attached to the term UNIX denotes those systems that look and behave like classical UNIX systems (i.e., the original Bell Laboratories UNIX and its later principal offshoots, System V and BSD). By this definition, Linux is generally considered to be a UNIX system (as are the modern BSDs). Although we give close attention to the Single UNIX Specification in this book, we'll follow this second definition of UNIX, so that we'll often say things such as "Linux, like other UNIX implementations. . . ."
1.1 A Brief History of UNIX and C
The first UNIX implementation was developed in 1969 (the same year that Linus Torvalds was born) by Ken Thompson at Bell Laboratories, a division of the telephone corporation, AT&T. It was written in assembler for a Digital PDP-7 minicomputer. The name UNIX was a pun on MULTICS (Multiplexed Information and Computing Service), the name of an earlier operating system project in which AT&T collaborated with Massachusetts Institute of Technology (MIT) and General Electric. (AT&T had by this time withdrawn from the project in frustration at its initial failure to develop an economically useful system.) Thompson drew several ideas for his new operating system from MULTICS, including a tree-structured file sys-
Linux is a member of the UNIX family of operating systems. In computing terms,
UNIX has a long history. The first part of
this chapter provides a brief outline of
that history. We begin with a description
of the origins of the UNIX system and the
C programming language, and then consider
the two key currents that led to the
Linux system as it exists today: the GNU
project and the development of the Linux
One of the notable features of the UNIX system is that its development was not controlled by a single vendor or organization. Rather, many groups, both commercial and noncommercial, contributed to its evolution. This history resulted in many innovative features being added to UNIX, but also had the negative consequence that UNIX implementations diverged over time, so that writing applications that worked on all UNIX implementations became increasingly difficult. This led to a drive for standardization of UNIX implementations, which we discuss in the second part of this chapter.
Two definitions of the term UNIX are in common use. One of these denotes operating systems that have passed the official conformance tests for the Single UNIX Specification and thus are officially granted the right to be branded as "UNIX" by The Open Group (the holders of the UNIX trademark). At the time of writing, none of the free UNIX implementations (e.g., Linux and FreeBSD) has obtained this branding.
Preface
Thanks to the team at No Starch Press for all sorts of help on an enormous project. Thanks to Bill Pollock for being straight-talking from the start, having rock-solid faith in the project, and patiently keeping an eye on the project. Thanks to my initial production editor, Megan Dunchak. Thanks to my copyeditor, Marilyn Smith, who, despite my best efforts at clarity and consistency, still found many things to fix. Riley Hoffman had overall responsibility for layout and design of the book, and also took up the reins as production editor as we came into the home straight. Riley graciously bore with my many requests to achieve the right layout and produced a superb final result. Thank you.
I now know the truth of the cliché that a writer's family also pays the price of the writer's work. Thanks to Britta and Cecilia for their support, and for putting up with the many hours that I had to be away from family as I finished the book.
Permissions
The Institute of Electrical and Electronics Engineers and The Open Group have kindly given permission to quote portions of text from IEEE Std 1003.1, 2004 Edition, Standard for Information Technology--Portable Operating System Interface (POSIX), The Open Group Base Specific
Aside from technical review, I received many other kinds of help from various people and organizations.
Thanks to the following people for answering technical questions: Jan Kara, Dave Kleikamp, and Jon Snader. Thanks to Claus Gratzl and Paul Marshall for system
Thanks to the Linux Foundation (LF), which, during 2008, funded me as a Fellow to work full time on the man-pages project and on testing and design review of the Linux programming interface. Although the Fellowship provided no direct financial support for working on this book, it did keep me and my family fed, and the ability to focus full time on documenting and testing the Linux programming interface was a boon to my "private" project. At a more individual level, thanks to Jim Zemlin for being my "interface" while working at the LF, and to the members of the LF Technical Advisory Board, who supported my application for the Fellowship.
Thanks to Alejandro Forero Cuervo fo
More than 25 years ago, Robert Biddle intrigued me during my first degree with tales of UNIX, C, and Ratfor; thank you. Thanks to the following people, who, although not directly connected with this project, encouraged me on the path of writing during my second degree at the University of Canterbury, New Zealand: Michael Howard, Jonathan Mane-Wheoki,
Paul Pluzhnikov (Google) was formerly the technical lead and a key developer of the memory-debugging tool.
Andreas Grünbacher (SUSE Labs) is a kernel hacker and author of the Linux implementation of extended attributes and POSIX access control lists. Andreas provided thorough review of many chapters, much encouragement, and the single comment that probably most changed the structure of the book.
Christoph Hellwig is a Linux storage and file-systems consultant and a well-known kernel hacker who has worked on many parts of the Linux kernel. Christoph kindly took time out from writing and reviewing Linux kernel patches to review several chapters of this book, suggesting many useful corrections and improvements.
Andreas Jaeger led the development of the Linux port to the x86-64 architecture. As a GNU C Library developer, he ported the library to x86-64, and helped make the library standards-conformant in several areas, especially in the math library. He is currently Program Manager for openSUSE at Novell. Andreas reviewed far more chapters than I could possibly have hoped, suggested a multitude of improvements, and warmly encouraged the ongoing work on the book.
and was lead architect of the threads implementation for OpenVMS and Digital UNIX. David reviewed the threads chapters, suggested many improvements,
I've been using Linux for about half as long as I've been using UNIX, and, over that time, my interest has increasingly
I've tested most of the example programs presented in this book (other than those that exploit features that are noted as being Linux-specific) on some or all of Solaris, FreeBSD, Mac OS X, Tru64 UNIX, and HP-UX. To improve portability to some of these systems, the web site for this book provides alternative versions of certain example programs with extra code that doesn't appear in the book.
Linux kernel and C library versions
The primary focus of this book is on Linux 2.6.x, the kernel version in widest use at the
constant definitions and function declarations (except in the case of C++), and some extra work may be needed to pass function arguments in the manner required by C linkage conventions. Notwithstanding these differences, the essential concepts are the same, and you'll find the information in this book is applicable even if you are working in another programming language.
About the author
I started using UNIX and C in 1987, when I spent several weeks sitting in front of an HP Bobcat workstation with a copy of the first edition of Marc Rochkind's Advanced UNIX Programming and what ultimately became a very dog-eared printed copy of the C shell manual page. My approach then was one that I still try to follow today, and that I recommend to anyone approaching a new software technology: take the
(but increasingly large) test programs until you become confident of your understanding of the software. I've found that, in the long run, this kind of self-training more than pays for itself in terms of saved time. Many of the programming examples in this book are constructed in ways that encourage this learning approach.
I've primarily been a software engineer and designer. However, I'm also a passionate teacher, and have spent several years teaching in both academic and commercial environments. I've run many week-long courses teaching UNIX system programming, and that experience informs the writing of this book.
i-node flags;
system call;
file system; and
Intended audience
This book is aimed primarily at the following audience:
programmers and software designers building applications for Linux, other UNIX systems, or other POSIX-conformant systems;
fication of file I/O events;
inotify, a mechanism for monitoring changes in files and directories;
capabilities, a mechanism for granting a
Subject
In this book, I describe the Linux programming interface: the system calls, library functions, and other low-level interfaces provided by Linux, a free implementation of the UNIX operating system. These interfaces are used, directly or indirectly, by every program that runs on Linux. They allow applications to perform tasks such as file I/O, creating and deleting files and directories, creating new processes, executing
ptsn
64.3 Opening a Master: ptyMasterOpen()
64.4 Connecting Processes with a Pseudoterminal: ptyFor
64.5 Pseudoterminal I/O
64.6 Implementing script(1)
64.7 Terminal Attributes and Window Size
64.8 BSD Pseudoterminals
64.9 Summary
64.10 Exercises
A TRACING SYSTEM CALLS
B PARSING COMMAND-LINE OPTIONS
C CASTING THE NULL POINTER
D KERNEL CONFIGURATION
E FURTHER SOURCES OF INFORMATION
F SOLUTIONS TO SELECTED EXERCISES
BIBLIOGRAPHY
Contents in Detail
45.8 IPC Limits
45.9 Summary
45.10 Exercises
46 SYSTEM V MESSAGE QUEUES
46.1 Creating or Opening a Message Queue
46.2 Exchanging Messages
46.2.1 Sending Messages
46.2.2 Receiving Messages
46.3 Message Queue Control Operations
46.4 Message Queue Associated Data Structure
46.5 Message Queue Limits
46.6 Displaying All Message Queues on the System
46.7 Client-Server Programming with Message Queues
46.8 A File-Server Application Using Message Queues
46.9 Disadvantages of System V Message Queues
46.10 Summary
46.11 Exercises
47 SYSTEM V SEMAPHORES
47.1 Overview
dlsy
42.1.4 Closing a Shared Library: dlclos
42.1.5 Obtaining Information About Loaded Symbols: dl
42.1.6 Accessing Symbols in the Main Program
42.2 Controlling Symbol Visibility
42.3 Linker Version Scripts
42.3.1 Controlling Symbol Visibility with Version Scripts
42.3.2 Symbol Versioning
42.4 Initialization and Finalization Functions
42.5 Preloading Shared Libraries
42.6 Monitoring the Dynamic Linker: LD_DEBUG
42.7 Summary
42.8 Exercises
43 INTERPROCESS COMMUNICATION OVERVIEW
43.1 A Taxonomy of IPC Facilities
43.2 Communication Facilities
43.3 Synchronization Facilities
43.4 Comparing IPC Facilities
43.5 Summary
43.6 Exercises
37.3 Guidelines for Writing Daemons
37.4 Using SIGHUP to Reinitialize a Daemon
37.5 Logging Messages and Errors Using syslog
37.5.1 Overview
37.5.2 The syslog API
37.5.3 The /etc/syslog.conf File
37.6 Summary
37.7 Exercise
38 WRITING SECURE PRIVILEGED PROGRAMS
29.8 Thread Attributes
29.9 Threads Versus Processes
29.10 Summary
29.11 Exercises
30 THREADS: THREAD SYNCHRONIZATION
30.1 Protecting Accesses to Shared Variables: Mutexes
30.1.1 Statically Allocated Mutexes
30.1.2 Locking and Unlocking a Mutex
30.1.3 Performance of Mutexes
30.1.4 Mutex Deadlocks
30.1.5 Dynamically Initializing a Mutex
30.1.6 Mutex Attributes
30.1.7 Mutex Types
30.2 Signaling Changes of State: Condition Variables
30.2.1 Statically Allocated Condition Variables
30.2.2 Signaling and Waiting on Condition Variables
30.2.3 Testing a Condition Variable's Predicate
30.2.4 Example Program: Joining Any Terminated Thread
30.2.5 Dynamically Allocated Condition Variables
30.3 Summary
30.4 Exercises
Specific Process or Thread
23.5.4 Improved High-Resolution Sleeping: clock_nanoslee
23.6 POSIX Interval Timers
23.6.1 Creating a Timer: timer_crea
23.6.2 Arming and Disarming a Timer: timer_settime
23.6.3 Retrieving the Current Value of a Timer:
18.14Parsing Pathname Strings:
dirname()
basena
.................................................... 370
18.15Summary..................................................................................................................
18.16Exercises.................................................................................................................
. 373
19 MONITORING FILE EVENTS
19.1 Overview
19.2 The inotify API
19.3 inotify Events
19.4 Reading inotify Events
19.5 Queue Limits and /proc Files
19.6 An Older System for Monitoring File Events: dnotify
19.7 Summary
19.8 Exercise
20 SIGNALS: FUNDAMENTAL CONCEPTS
20.1 Concepts and Overview
20.2 Signal Types and Default Actions
20.3 Changing Signal Dispositions:
20.4 Introduction to Signal Handlers
20.5 Sending Signals:
20.6 Checking for the Existence of a Process
20.7 Other Ways of Sending Signals: ki
20.8 Displaying Signal Descriptions
11 SYSTEM LIMITS AND OPTIONS
11.1 System Limits
11.2 Retrieving System Limits (and Options) at Run Time
11.3 Retrieving File-Related Limits (and Options) at Run Time
/proc/
12.1.2 System Information Under /proc
12.1.3 Accessing /proc Files
12.2 System Identification:
12.3 Summary
12.4 Exercises
13 FILE I/O BUFFERING
13.1 Kernel Buffering of File I/O: The Buffer Cache
13.2 Buffering in the stdio Library
13.3 Controlling Kernel Buffering of File I/O
13.4 Summary of I/O Buffering
13.5 Advising the Kernel About I/O Patterns
13.6 Bypassing the Buffer Cache: Direct I/O
13.7 Mixing Library Functions and System Calls for File I/O
13.8 Summary
13.9 Exercises
14 FILE SYSTEMS
14.1 Device Special Files (Devices)
14.2 Disks and Partitions
14.3 File Systems
14.4 I-nodes
14.5 The Virtual File System (VFS)
14.6 Journaling File Systems
14.7 Single Directory Hierarchy and Mount Points
14.8 Mounting and Unmounting File Systems
14.8.1 Mounting a File System:
14.8.2 Unmounting a File System: umount() and
14.9 Advanced Mount Features
14.9.1 Mounting a File System at Multiple Mount Points
14.9.2 Stacking Multiple Mounts on the Same Mount Point
14.9.3 Mount Flags That Are Per-Mount Options
14.9.4 Bind Mounts
14.9.5 Recursive Bind Mounts
14.10 A Virtual Memory File System:
14.11 Obtaining Information About a File System: statvf
14.12 Summary
14.13 Exercise
Brief Contents
Chapter 47: System V Semaphores
Chapter 48: System V Shared Memory
Chapter 49: Memory Mappings
Chapter 50: Virtual Memory Operations
Chapter 51: Introduction to POSIX IPC
Chapter 52: POSIX Message Queues
Chapter 53: POSIX Semaphores
Chapter 54: POSIX Shared Memory
Chapter 55: File Locking
Chapter 20: Signals: Fundamental Concepts
Chapter 21: Signals: Signal Handlers
Chapter 22: Signals: Advanced Features
Chapter 23: Timers and Sleeping
Chapter 24: Process Creation
Chapter 25: Process Termination
Chapter 26: Monitoring Child Processes
Chapter 27: Program Execution
Chapter 28: Process Creation and Program
Preface
Chapter 1: History and Standards
Chapter 2: Fundamental Concepts
Chapter 3: System Programming Concepts
Chapter 4: File I/O: The Universal I/O Model
For Cecilia, who lights up my world.
THE LINUX PROGRAMMING INTERFACE.
Copyright 2010 by Michael Kerrisk.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage
. . . encyclopedic in the breadth and depth of its coverage, and textbook-like in its wealth of worked examples and exercises. Each topic is clearly and comprehensively covered, from theory to hands-on working code. Professionals, students, educators, this is the Linux/UNIX reference that you have been waiting for.
ANTHONY ROBINS, PROFESSOR, COMPUTER SCIENCE, UNIVERSITY
I've been very impressed by the precisio
THE LINUX PROGRAMMING INTERFACE
If I had to choose a single book to sit next to my machine when writing software for Linux, this would be it.
ENGINEER
THE DEFINITIVE GUIDE TO LINUX AND UNIX SYSTEM PROGRAMMING
Covers current UNIX standards (POSIX.1-2001/SUSv3 and POSIX.1-2008/SUSv4)
ISBN: 978-1-59327-220-3

THE FINEST IN GEEK ENTERTAINMENT
www.nostarch.com