stats {<ranges>} 'filename' {matrix | using N{:M}} {name 'prefix'} {{no}output}

This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See **plot** for details on the **index**, **every**, and **using** directives. Data points are filtered against both xrange and yrange before analysis. See **set xrange**. The summary is printed to the screen by default. Output can be redirected to a file by prior use of the command **set print**, or suppressed altogether using the **nooutput** option.

In addition to printed output, the program stores the individual statistics into three sets of variables. The first set of variables reports how the data is laid out in the file:

`STATS_records` | N | total number of in-range data records |

`STATS_outofrange` | number of records filtered out by range limits | |

`STATS_invalid` | number of invalid/incomplete/missing records | |

`STATS_blank` | number of blank lines in the file | |

`STATS_blocks` | number of indexable blocks of data in the file | |

`STATS_columns` | number of data columns in the first row of data |

The second set reports properties of the in-range data from a single column. This column is treated as y. If the y axis is autoscaled then no range limits are applied. Otherwise only values in the range [ymin:ymax] are considered.

If two columns are analysed jointly by a single **stats** command, the suffix "_x" or "_y" is appended to each variable name. I.e. STATS_min_x is the minimum value found in the first column, while STATS_min_y is the minimum value found in the second column. In this case points are filtered by testing against both xrange and yrange.

`STATS_min` | min(y) | minimum value of in-range data points | |

`STATS_max` | max(y) | maximum value of in-range data points | |

`STATS_index_min` | i | y_{i} = min(y)
| index i for which data[i] == STATS_min | |

`STATS_index_max` | i | y_{i} = max(y)
| index i for which data[i] == STATS_max | |

`STATS_mean` | 11#11 = | 12#1213#13y
| mean value of the in-range data points |

`STATS_stddev` | σ_{y} = | 14#14 | population standard deviation of the in-range data |

`STATS_ssd` | s_{y} = | 15#15 | sample standard deviation of the in-range data |

`STATS_lo_quartile` | value of the lower (1st) quartile boundary | ||

`STATS_median` | median value | ||

`STATS_up_quartile` | value of the upper (3rd) quartile boundary | ||

`STATS_sum` | 13#13y | sum | |

`STATS_sumsq` | 13#13y^{2} | sum of squares | |

`STATS_skewness` | 16#1613#13(y-11#11)^{3}
| skewness of the in-range data points | |

`STATS_kurtosis` | 17#1713#13(y-11#11)^{4}
| kurtosis of the in-range data points | |

`STATS_adev` | 12#1213#13| y-11#11|
| mean absolute deviation of the in-range data | |

`STATS_mean_err` | σ_{y}/18#18
| standard error of the mean value | |

`STATS_stddev_err` | σ_{y}/19#19
| standard error of the standard deviation | |

`STATS_skewness_err` | 20#20 | standard error of the skewness | |

`STATS_kurtosis_err` | 21#21 | standard error of the kurtosis |

The third set of variables is only relevant to analysis of two data columns.

`STATS_correlation` | sample correlation coefficient between x and y values | |

`STATS_slope` | A corresponding to a linear fit y = Ax + B | |

`STATS_slope_err` | uncertainty of A | |

`STATS_intercept` | B corresponding to a linear fit y = Ax + B | |

`STATS_intercept_err` | uncertainty of B | |

`STATS_sumxy` | sum of x*y | |

`STATS_pos_min_y` | x coordinate of a point with minimum y value | |

`STATS_pos_max_y` | x coordinate of a point with maximum y value |

When **matrix** is specified, all matrix entries are included in the analysis. The matrix dimensions are saved in the variables STATS_size_x and STATS_size_y.

It may be convenient to track the statistics from more than one file or data column in parallel. The **name** option causes the default prefix "STATS" to be replaced by a user-specified string. For example, the mean value of column 2 data from two different files could be compared by

stats "file1.dat" using 2 name "A" stats "file2.dat" using 2 name "B" if (A_mean < B_mean) {...}The keyword

do for [COL=5:8] { stats 'datafile' using COL name columnheader }

The index reported in STATS_index_xxx corresponds to the value of pseudo-column 0 ($0) in plot commands. I.e. the first point has index 0, the last point has index N-1.

Data values are sorted to find the median and quartile boundaries. If the total number of points N is odd, then the median value is taken as the value of data point (N+1)/2. If N is even, then the median is reported as the mean value of points N/2 and (N+2)/2. Equivalent treatment is used for the quartile boundaries.

For an example of using the **stats** command to annotate a subsequent plot, see stats.dem.

The **stats** command in this version of gnuplot can handle log-scaled data, but not the presence of time/date fields (**set xdata time** or **set ydata time**). This restriction may be relaxed in a future version.

Copyright 1986 - 1993, 1998, 2004 Thomas Williams, Colin Kelley

Distributed under the gnuplot license (rights to distribute modified versions are withheld).