.\" Copyright (c) 2025
.\" Manual page for cdf(1)
.\"
.Dd $Mdocdate$
.Dt CDF 1
.Os
.Sh NAME
.Nm cdf
.Nd calculate cumulative distribution function from count data
.Sh SYNOPSIS
.Nm cdf
.Op Fl f | u
.Op Fl r
.Op Fl h
.Op Ar file
.Sh DESCRIPTION
The
.Nm
utility computes a cumulative distribution function from input data
consisting of count-value pairs.
It reads data from
.Ar file ,
or from standard input if no file is specified, calculates relative
frequencies, and outputs probability-value pairs suitable for
statistical analysis and plotting.
.Pp
Input data must consist of whitespace-separated pairs where the first field
is the count (frequency) and the second field is the data value.
Each output line consists of a cumulative probability followed by the
corresponding data value, separated by a tab.
.Sh OPTIONS
.Bl -tag -width Ds
.It Fl f
Read input values as floating-point numbers.
.It Fl u
Read input values as unsigned integers.
.It Fl r
Generate the complementary cumulative distribution function (CCDF),
which is P(X > x) = 1 - F(x).
.It Fl h
Display usage information and exit.
.El
.Pp
If neither
.Fl f
nor
.Fl u
is specified, input values are read as signed integers.
.Sh INPUT FORMAT
Each input line must contain exactly two whitespace-separated fields:
.Bd -literal -offset indent
count value
.Ed
.Pp
Where
.Em count
is a positive integer representing the frequency of occurrence, and
.Em value
is the data point.
.Pp
Example input:
.Bd -literal -offset indent
15  1.25
23  2.30
8   3.75
.Ed
.Pp
This format is compatible with the output of
.Ql uniq -c .
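.Pp
Input can be checked against these rules before it is passed to
.Nm .
The following shell function is a sketch only; the name
.Ql check_cdf_input
is hypothetical, and how
.Nm
itself reacts to malformed lines is not specified here:
.Bd -literal -offset indent
# Hypothetical pre-check, not part of cdf: report lines that do not
# hold exactly two fields with a positive integer count in the first.
check_cdf_input() {
    awk 'NF != 2 || $1 !~ /^[0-9]+$/ || $1 + 0 <= 0 {
            print "line " NR ": expected <count> <value>"
            bad = 1
        }
        END { exit bad }' "$@"
}
.Ed
.Pp
The function exits with status 0 when every line passes and 1 otherwise.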
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
Calculate the CDF of integer data in a file:
.Bd -literal -offset indent
$ cdf data.txt
0.300000000000000   10
0.650000000000000   15
1.000000000000000   20
.Ed
.Pp
Generate the complementary CDF in a pipeline:
.Bd -literal -offset indent
$ awk '{print $2, $1}' measurements.dat | cdf -r -f
1.000000000000000   1.234000
0.750000000000000   2.567000
0.400000000000000   4.890000
.Ed
.Pp
Use standard tools for pre-processing:
.Bd -literal -offset indent
$ sort -n data.txt | uniq -c | cdf > dist.cdf
.Ed
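.Pp
The computation itself can be approximated with standard tools.
The following shell function is a minimal sketch, not the
.Nm
implementation; the name
.Ql cdf_sketch
is hypothetical, and it assumes that values should be ordered
numerically and that each value appears on only one input line:
.Bd -literal -offset indent
# Hypothetical sketch, not the cdf source: read count-value lines,
# order them by value, and print the running share of the total count.
cdf_sketch() {
    sort -k 2,2n "$@" |
    awk 'BEGIN { tab = sprintf("%c", 9) }    # character 9 is a tab
        { count[NR] = $1; value[NR] = $2; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += count[i]
                print sprintf("%.15f", running / total) tab value[i]
            }
        }'
}
.Ed
.Pp
With the example input from
.Sx INPUT FORMAT ,
the running counts are 15, 38 and 46 out of a total of 46, so the
sketch prints probabilities of about 0.326, 0.826 and 1.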
.Sh SEE ALSO
.Xr sort 1 ,
.Xr uniq 1 ,
.Xr gnuplot 1
.Sh AUTHORS
.An Douglas B. Rumbaugh Aq Mt dbrumbaugh@harrisburgu.edu
.Sh BUGS
The program must read the entire input before it can calculate the
frequency table.
It currently holds the input in memory, so very large datasets may
cause crashes due to memory allocation failures when RAM is limited.