Diffstat (limited to 'doc/cdf.1')
-rw-r--r-- doc/cdf.1 106
1 files changed, 106 insertions, 0 deletions
diff --git a/doc/cdf.1 b/doc/cdf.1
new file mode 100644
index 0000000..06f3a49
--- /dev/null
+++ b/doc/cdf.1
@@ -0,0 +1,106 @@
+.\" Copyright (c) 2025
+.\" Manual page for cdf(1)
+.\"
+.Dd $Mdocdate$
+.Dt CDF 1
+.Os
+.Sh NAME
+.Nm cdf
+.Nd calculate cumulative distribution function from count data
+.Sh SYNOPSIS
+.Nm cdf
+.Op Fl f | u
+.Op Fl r
+.Op Fl h
+.Op Ar file
+.Sh DESCRIPTION
+The
+.Nm
+utility computes a cumulative distribution function from input data consisting
+of count-value pairs. It reads data from
+.Ar file
+or from standard input if no file is specified, calculates relative frequencies,
+and outputs probability-value pairs suitable for statistical analysis and
+plotting.
+.Pp
+Input data must consist of whitespace-separated pairs where the first field
+is the count (frequency) and the second field is the data value. Output
+consists of cumulative probability values followed by the corresponding data
+values, separated by tabs.
+.Pp
+.Sh OPTIONS
+.Bl -tag -width Ds
+.It Fl f
+Read input values as floats.
+.It Fl u
+Read input values as unsigned integers.
+.It Fl r
+Generate the complementary cumulative distribution function (CCDF),
+which is P(X > x) = 1 - F(x).
+.It Fl h
+Display usage information and exit.
+.El
+.Pp
+If neither
+.Fl f
+nor
+.Fl u
+is specified, input values are read as signed integers by
+default.
+.Sh INPUT FORMAT
+Each input line must contain exactly two whitespace-separated fields:
+.Bd -literal -offset indent
+count value
+.Ed
+.Pp
+Where
+.Em count
+is a positive integer representing the frequency of occurrence, and
+.Em value
+is the data point.
+.Pp
+Example input:
+.Bd -literal -offset indent
+15 1.25
+23 2.30
+8 3.75
+.Ed
+.Pp
+This format was selected to be compatible with the output of the uniq -c command.
+.Sh EXIT STATUS
+.Ex -std
+.Sh EXAMPLES
+Calculate the CDF from integer data in a file:
+.Bd -literal -offset indent
+$ cdf data.txt
+0.300000000000000 10
+0.650000000000000 15
+1.000000000000000 20
+.Ed
+.Pp
+Swap the columns of a value-count file and generate the complementary CDF:
+.Bd -literal -offset indent
+$ awk '{print $2, $1}' measurements.dat | cdf -r -f
+1.000000000000000 1.234000
+0.750000000000000 2.567000
+0.400000000000000 4.890000
+.Ed
+.Pp
+Build the count-value input from raw data with standard tools:
+.Bd -literal -offset indent
+$ sort -n data.txt | uniq -c | cdf > dist.cdf
+.Ed
+.Sh SEE ALSO
+.Xr sort 1 ,
+.Xr uniq 1 ,
+.Xr gnuplot 1
+.Sh AUTHORS
+.An Douglas B. Rumbaugh
+.Mt "dbrumbaugh@harrisburgu.edu"
+.Sh BUGS
+The program must read the entire input before it can compute the
+distribution. The input is currently held in memory, so very large
+datasets may cause memory allocation failures and crashes on systems
+with limited RAM.
+
+
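The calculation the man page describes can be sketched in a few lines of Python. This is an illustration only, not the cdf source: the function name cdf_from_counts, the handling of malformed lines, and the merging of repeated values are all assumptions, and the complementary form follows the formula given under -r, P(X > x) = 1 - F(x).

```python
from collections import Counter


def cdf_from_counts(lines, complementary=False):
    """Turn 'count value' lines into (probability, value) pairs.

    Hypothetical helper for illustration; not part of cdf(1).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        count, value = int(fields[0]), float(fields[1])
        counts[value] += count  # assumption: repeated values are merged

    total = sum(counts.values())
    pairs = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        p = running / total  # F(x) = P(X <= x)
        pairs.append((1.0 - p if complementary else p, value))
    return pairs


# The example input from the INPUT FORMAT section.
pairs = cdf_from_counts(["15 1.25", "23 2.30", "8 3.75"])
for p, v in pairs:
    print(f"{p:.15f}\t{v:f}")
```

Passing complementary=True returns 1 - F(x) for each value instead. Unlike the sketch, the real utility distinguishes signed, unsigned, and floating-point values through its -f and -u options.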