.\" Copyright (c) 2025
.\" Manual page for cdf(1)
.\"
.Dd $Mdocdate$
.Dt CDF 1
.Os
.Sh NAME
.Nm cdf
.Nd calculate cumulative distribution function from count data
.Sh SYNOPSIS
.Nm cdf
.Op Fl f | u
.Op Fl r
.Op Fl h
.Op Ar file
.Sh DESCRIPTION
The
.Nm
utility computes cumulative distribution functions from input data consisting
of count-value pairs.
It reads data from
.Ar file ,
or from standard input if no file is specified, calculates relative
frequencies, and outputs probability-value pairs suitable for statistical
analysis and plotting.
.Pp
Input data must consist of whitespace-separated pairs where the first field
is the count (frequency) and the second field is the data value.
Each output line consists of a cumulative probability followed by the
corresponding data value, separated by a tab.
.Sh OPTIONS
.Bl -tag -width Ds
.It Fl f
Read input values as floats.
.It Fl u
Read input values as unsigned integers.
.It Fl r
Generate the complementary cumulative distribution function (CCDF),
which is P(X > x) = 1 - F(x).
.It Fl h
Display usage information and exit.
.El
.Pp
If neither
.Fl f
nor
.Fl u
is specified, input values are read as signed integers.
.Sh INPUT FORMAT
Each input line must contain exactly two whitespace-separated fields:
.Bd -literal -offset indent
count value
.Ed
.Pp
Here
.Em count
is a positive integer representing the frequency of occurrence, and
.Em value
is the data point.
.Pp
Example input:
.Bd -literal -offset indent
15  1.25
23  2.30
8   3.75
.Ed
.Pp
This format was chosen for compatibility with the output of
.Ql uniq -c .
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
Calculating the CDF of integer data in a file:
.Bd -literal -offset indent
$ cdf data.txt
0.300000000000000   10
0.650000000000000   15
1.000000000000000   20
.Ed
.Pp
Calculating the complementary CDF as part of a pipeline:
.Bd -literal -offset indent
$ awk '{print $2, $1}' measurements.dat | cdf -r -f
1.000000000000000   1.234000
0.750000000000000   2.567000
0.400000000000000   4.890000
.Ed
.Pp
Using standard tools to preprocess a raw list of measurements
into a CDF:
.Bd -literal -offset indent
$ sort -n data.txt | uniq -c | cdf > dist.cdf
.Ed
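.Pp
The default calculation can be approximated with
.Xr awk 1 ,
which may help clarify what the output columns mean.
This is an illustrative sketch, not the implementation used by
.Nm .
It assumes the input is already sorted by value with one line per distinct
value, and it separates output fields with a space rather than a tab.
The function name
.Ql cdf_sketch
is hypothetical.
.Bd -literal -offset indent
cdf_sketch() {
    awk '
    # Remember every count-value pair and accumulate the total count.
    { count[NR] = $1; value[NR] = $2; total += $1 }
    END {
        # Print the running fraction of the total next to each value.
        for (i = 1; i <= NR; i++) {
            running += count[i]
            print sprintf("%.15f", running / total), value[i]
        }
    }' "$@"
}
.Ed
.Pp
Once defined in a shell, the function reads count-value pairs from standard
input or from the files named as arguments.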
.Sh SEE ALSO
.Xr sort 1 ,
.Xr uniq 1 ,
.Xr gnuplot 1
.Sh AUTHORS
.An Douglas B. Rumbaugh Aq Mt dbrumbaugh@harrisburgu.edu
.Sh BUGS
The program must read the entire input before it can calculate the
frequency table.
It currently holds this data in memory, so very large datasets may cause
crashes from memory allocation failures when RAM is limited.