Man Linux: Main Page and Category List

NAME

       ent - pseudorandom number sequence test

SYNOPSIS

       ent [options] [file]

DESCRIPTION

       ENT Logo

       ent  performs  a  variety  of  tests on the stream of bytes in file (or
       standard input if no file is specified) and produces output on standard
       output; for example:

       Entropy = 7.980627 bits per character.

       Optimum compression would reduce the size
       of this 51768 character file by 0 percent.

       Chi square distribution for 51768 samples is 1542.26, and randomly
       would exceed this value 0.01 percent of the times.

       Arithmetic mean value of data bytes is 125.93 (127.5 = random).
       Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
       Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).

       The values calculated are as follows:

   ENTROPY
       The  information  density  of  the contents of the file, expressed as a
       number of bits per character. The results above,  which  resulted  from
       processing  an  image file compressed with JPEG, indicate that the file
       is  extremely   dense   in   information—essentially   random.   Hence,
       compression  of  the  file is unlikely to reduce its size. By contrast,
       the C source code of the program has entropy  of  about  4.9  bits  per
       character, indicating that optimal compression of the file would reduce
       its size by 38%. [Hamming, pp. 104-108]

   CHI-SQUARE TEST
       The chi-square test is the most commonly used test for  the  randomness
       of  data, and is extremely sensitive to errors in pseudorandom sequence
       generators. The chi-square distribution is calculated for the stream of
       bytes  in the file and expressed as an absolute number and a percentage
       which indicates how frequently a truly random sequence would exceed the
       value  calculated.  We  interpret the percentage as the degree to which
       the sequence tested is suspected of being non-random. If the percentage
       is  greater  than 99% or less than 1%, the sequence is almost certainly
       not random. If the percentage is between 99% and 95% or between 1%  and
       5%,  the  sequence  is suspect.  Percentages between 90% and 95% and 5%
       and 10% indicate the sequence is "almost suspect". Note that  our  JPEG
       file,  while  very dense in information, is far from random as revealed
       by the chi-square test.

       Applying this test to  the  output  of  various  pseudorandom  sequence
       generators  is  interesting.  The  low-order  8  bits  returned  by the
       standard Unix rand(1) function, for example, yields:

       Chi square distribution for 500000 samples is 0.01, and randomly
       would exceed this value 99.99 percent of the times.

       While an improved generator [Park & Miller] reports:

       Chi square distribution for 500000 samples is 212.53, and randomly
       would exceed this value 95.00 percent of the times.

       Thus, the standard Unix generator (or at least the low-order  bytes  it
       returns)  is  unacceptably  non-random, while the improved generator is
       much better but still sufficiently  non-random  to  cause  concern  for
       demanding  applications.   Contrast  both  of these software generators
       with the chi-square result of a  genuine  random  sequence  created  by
       timing radioactive decay events[1]:

       Chi square distribution for 32768 samples is 237.05, and randomly
       would exceed this value 75.00 percent of the times.

       See  [Knuth, pp. 35-40] for more information on the chi-square test. An
       interactive chi-square calculator[2] is available at this site.

   ARITHMETIC MEAN
       This is simply the result of summing the all the bytes (bits if the  -b
       option  is  specified)  in the file and dividing by the file length. If
       the data are close to random, this should be about 127.5  (0.5  for  -b
       option  output).  If  the  mean departs from this value, the values are
       consistently high or low.

   MONTE CARLO VALUE FOR PI
       Each successive sequence of six bytes  is  used  as  24  bit  X  and  Y
       coordinates  within a square. If the distance of the randomly-generated
       point is less than the radius of a circle inscribed within the  square,
       the six-byte sequence is considered a "hit". The percentage of hits can
       be used to calculate the value of Pi.  For  very  large  streams  (this
       approximation  converges  very  slowly),  the  value  will approach the
       correct value of Pi if the sequence is close to random.  A  32768  byte
       file created by radioactive decay yielded:

       Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).

   SERIAL CORRELATION COEFFICIENT
       This  quantity  measures  the  extent  to  which  each byte in the file
       depends upon the previous byte. For random sequences, this value (which
       can  be positive or negative) will, of course, be close to zero. A non-
       random byte stream such as a C program will yield a serial  correlation
       coefficient  on  the  order  of  0.5.  Wildly  predictable data such as
       uncompressed  bitmaps  will  exhibit  serial  correlation  coefficients
       approaching 1. See [Knuth, pp. 64-65] for more details.

OPTIONS

       -b     The  input  is  treated as a stream of bits rather than of 8-bit
              bytes.  Statistics  reported  reflect  the  properties  of   the
              bitstream.

       -c     Print a table of the number of occurrences of each possible byte
              (or bit, if the -b option is  also  specified)  value,  and  the
              fraction  of  the overall file made up by that value.  Printable
              characters in the ISO-8859-1 (Latin-1) character set  are  shown
              along  with their decimal byte values. In non-terse output mode,
              values with zero occurrences are not printed.

       -f     Fold  upper  case  letters  to  lower  case   before   computing
              statistics.  Folding  is  done based on the ISO-8859-1 (Latin-1)
              character set, with accented letters correctly processed.

       -t     Terse mode: output is written in  Comma  Separated  Value  (CSV)
              format,  suitable for loading into a spreadsheet and easily read
              by any programming language. See Terse Mode Output Format  below
              for additional details.

       -u     Print how-to-call information.

FILES

       If  no  file  is  specified, ent obtains its input from standard input.
       Output is always written to standard output.

TERSE MODE

       Terse mode is selected by specifying the -t option on the command line.
       Terse  mode  output  is  written in Comma Separated Value (CSV) format,
       which can be directly loaded into  most  spreadsheet  programs  and  is
       easily  read  by  any programming language. Each record in the CSV file
       begins with a record type field, which identifies the  content  of  the
       following  fields.  If  the  -c option is not specified, the terse mode
       output will consist of two records, as follows:

       0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
       1,file_length,entropy,chi_square,mean,Pi_value,correlation

       where the italicised values in the type  1  record  are  the  numerical
       values  for  the quantities named in the type 0 column title record. If
       the_ -b_ option is specified, the second field of  the  type  0  record
       will be "File-bits", and the file_length field in type 1 record will be
       given in bits  instead  of  bytes.  If  the  -c  option  is  specified,
       additional  records are appended to the terse mode output which contain
       the character counts:

       2,Value,Occurrences,Fraction
       3,v,count,fraction
       ...

       If the -b option is specified, only two type 3 records will appear  for
       the  two  bit  values  v=0  and  v=1. Otherwise, 256 type 3 records are
       included, one for each possible byte value. The second field of a  type
       3  record  indicates  how many bytes (or bits) of value v appear in the
       input, and fraction gives the decimal fraction of the  file  which  has
       value  v  (which  is equal to the count value of this record divided by
       the file_length field in the type 1 record).

BUGS

       Note that the "optimal compression" shown for the file is computed from
       the byte- or bit-stream entropy and thus reflects compressibility based
       on a reading frame of the chosen width (8-bit bytes or individual  bits
       if  the  -b option is specified). Algorithms which use a larger reading
       frame, such as the Lempel-Ziv [Lempel &  Ziv]  algorithm,  may  achieve
       greater compression if the file contains repeated sequences of multiple
       bytes.

COPYING

       This software is in the public domain. Permission to use, copy, modify,
       and  distribute this software and its documentation for any purpose and
       without fee is hereby granted, without any conditions or  restrictions.
       This  software is provided "as is" without express or implied warranty.

       Original text and program by John Walker[3] October 20th, 1998

       Modifications   by   Wesley    J.    Landaker    <    wjl@icecavern.netmailto:wjl@icecavern.net〉 >, released under the same terms as above.

SEE ALSO

       Introduction to Probability and Statistics[4]

       [Hamming]
              Hamming,  Richard  W.  Coding  and Information Theory. Englewood
              Cliffs NJ: Prentice-Hall, 1980.

       [Knuth]
              Knuth, Donald E. The Art of Computer  Programming,  Volume  2  /
              Seminumerical  Algorithms.  Reading  MA:  Addison-Wesley,  1969.
              ISBN 0-201-89684-2.

       [Lempel & Ziv]
              Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
              Compression". IEEE Transactions on Information Theory 23, 3, pp.
              337-343.

       [Park & Miller]
              Park, Stephen K. and Keith W. Miller. "Random Number Generators:
              Good  Ones Are Hard to Find". Communications of the ACM, October
              1988, p. 1192.

       [1]    〈http://www.fourmilab.ch/hotbits/〉

       [2]    〈http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html〉

       [3]    〈http://www.fourmilab.ch/〉

       [4]    〈http://www.fourmilab.ch/rpkp/experiments/statistics.html〉

                                 20 July 2010                           ent(1)