***********************************************************************
1. Testing with PractRand

1.A - Connecting your RNG to PractRand
1.A.1 - Testing a programs output
1.A.2 - Testing a file:
1.A.3 - Testing your RNG at full speed:
1.A.4 - Options NOT supported by PractRand

1.B - Interpretting PractRand Test Results
1.B.1 - RNG_test results for people in a hurry
1.B.2 - RNG_test results in more detail
1.B.3 - Notes on RNG_test p-values
1.B.4 - TestResult objects from calling PractRand directly
1.B.5 - Misc PractRand test result quality notes
1.B.6 - Metatests performed on PractRand p-values
1.B.7 - What specific failures might mean about a PRNG

2. General Discussion of RNG Testing

3. General Information on Interpretting Test Results
3.A - Forms of test results
3.B - False positives
3.C - Knowing the false-positive rates is IMPORTANT
3.D - Metatests

4. Brief Reviews of Non-PractRand Test Suites
***********************************************************************


***********************************************************************
1. Testing with PractRand
***********************************************************************

1.A - Connecting your RNG to PractRand
If you're testing an RNG that is not built in to PractRand then your 
first challenge is getting the data to PractRand in a manner that it 
can test.  The three general ways to interface an external RNG to a 
tester are by dumping random data to a file, by piping random data, or 
by linking the RNG to the tester.  

1.A.1 - Testing a programs output:
If you have a program that generates raw random data and sends them to 
standard output then you can pipe them in to the PractRand RNG testing 
tool and it can work with them.  If your program is called My_RNG then 
the command line looks like this on linux:
My_RNG | ./RNG_test stdin
Or like this on Windows (from a command prompt):
My_RNG | RNG_test stdin

If your program stops sending out random data before the PractRand RNG 
testing tool is done testing then it may print "error reading input".  
In that case you might want to try adding command line options like 
"-tlmin 1KB", which will tell RNG_test that you want it to start 
printing output ASAP.  It will still print "error reading input" when 
the input stops, but at least you'll get some test results as long as 
the input is at least 1024 bytes long.  

If you know that your program produces data in 8 bit chunks then you 
should change "stdin" to "stdin8".  Similar variants exist for 16, 32, 
and 64.  If you don't know the format of your data or none of those 
make any more sense than the others then just use "stdin".  

1.A.2 - Testing a file:
If you have a file of random data named My_Data that you want to test, 
technically PractRand doesn't support that.  But it's easily adapted 
to piped in data using standard tools on most OSes.  
On linux, try:
cat My_Data | ./RNG_test stdin
On windows instead use:
type My_Data | RNG_test stdin

The comments that apply to testing data from a program apply here as 
well.  In fact the comment on "error reading input" becomes much more 
important.  

Note that reading random data from a file tends to be much much slower 
than getting it by any other method.  

1.A.3 - Testing your RNG at full speed:
It's also possible to link your RNG directly with PractRand and/or 
RNG_test.  This eliminates the overhead of piping the data around, 
increasing performance significantly for fast RNGs.  However, it tends 
to require more work as you have to understand some of the subtleties 
of the internals of PractRand.  The easiest way may be to to edit the 
"dummy" RNG in RNG_test.cpp, but other ways are possible too.  

1.A.4 - Options NOT supported by PractRand
PractRand supports testing ONLY on packed binary data as input.  You 
can not give it ASCII 0s and 1s or ASCII hex or anything else that 
doesn't have 8 random bits per byte.  

1.B - Interpretting Results
The next challenge is figuring out what the results mean.  

1.B.1 - RNG_test results for people in a hurry
If you are using RNG_test the simplest way to interpret test 
results is to look for the word "FAIL" in the output.  It 
will appear on the right-hand side and be obvious, like this:
  Test Name                     Raw         Processed     Evaluation
  BCFN(2+0,13-0)                R=  -6.5    p =1-3.5e-3   unusual
  [Low4/16]BCFN(2+0,13-0)       R= +39.2    p =  4.4e-20    FAIL !!!
  [Low4/16]BCFN(2+1,13-1)       R= +28.8    p =  3.1e-14    FAIL !!
  [Low1/16]BCFN(2+0,13-1)       R= +27.1    p =  2.1e-13    FAIL !!
  ...and 50 test result(s) without anomalies

1.B.3 - RNG_test results in more detail
RNG_test produces a series of interim result summaries as it goes 
along.  Each result summary has a header showing the RNG name, the 
number of bytes tested, the time taken, and the RNG seed used (so 
you can reproduce the results later if desired).  The body of the 
result summary is a table showing all anomolous results followed 
by a statement of how many results were omitted from the table 
because they were not anomolous.  If the table would have zero 
entries then the table is skipped.  If no results were omitted then 
the number of omitted results is skipped.  The threshold for what 
kinds of results are considered anomolous is very adjustable via 
command line parameters.  Adding "-p 1" will force all results to 
be shown in the table regardless of how suspicious they were or 
weren't, while adding "-e 0.001" will make it show only highly 
suspicious results.  The default is "-e 0.1".  
The table has four columns:
"Test Name" - a name for the subtest the line corresponds to
"Raw" - not of much use to end users, but I use this a lot
"Processed" - either a p-value or "pass" or "fail".  However, 
see 3.B.3 for more details on RNG_test p-values
"Evaluation" - an english word or phrase describing the 
result.  "FAIL" means that the RNG unambigiously failed that 
subtest, while "suspicious" means that that result should not 
happen often on a good RNG but should happen occasionally.  
The formula used to assign evaluations is not adjustable from the 
command line, and is a relatively complex function (not just a 
hardwired p-value threshold).  The most extreme evaluations 
get an extra space or two of indentation to make them easier to 
spot when scanning a large chart by eye.  The evaluations treat 
all PractRand p-values as having two symmetric error regions - 
both zero and one are considered equally bad, despite the fact 
that PractRand tests are designed to produce zero for most 
expected failure circumstances.  

1.B.4 - Notes on RNG_test p-values
When RNG_test prints a p-value it can be in any of the 
following forms:
* normal p-value (e.g. "p = 0.188") - exactly what you'd expect
* p-value near zero (e.g. "p = 2.3e-4") - scientific notation in 
a fairly standard form.  Note however that the exponent can reach 
extreme values (up to 9999) that do not fit in normal floating 
point formats.  
* p-value near one (e.g. "p =1-2.7e-3") - one minus a number in 
scientific notation.  This format is a little less standard but 
still fairly self-explanatory and it allows finer resolution to 
be displayed for numbers close to one.  
* p-values equal to zero (e.g. "p = 0") - values VERY close to 
zero may get rounded to zero.  
* p-values equal to one (e.g. "p = 1") - values VERY close to one 
may get rounded to one.  
* p-values that are only very crudely approximated - these can be 
distinguished from normal p-values by the "~" between the "p" and 
the "=" in the Processed results column of RNG_test output.  Such 
results may be shown with fewer significant digits displayed than 
normal p-values.  In the core test set, BRank and TMFn currently 
produce such low-quality p-values.  
* "pass" or "fail" in place of a p-value - Some subtests can't 
produce p-values.  If you must have a p-value, assume that "pass" 
corresponds to a p-value randomly distributed between 0 and 1, 
while "fail" corresponds to a p-value of exactly zero.  Generally 
PractRand subtests that produce this kind of output are very 
boring tests that never fail any but the most absurdly extreme 
RNGs (things like RNGs that produce only zeroes).  These results 
are necessary because some normal subtests results cannot be 
calculated at all under such absurd circumstances, but the test 
still needs some way to report those circumstances occuring.  
An example results chart showing one of each category of result:
  Test Name                     Raw         Processed     Evaluation
  BRank(12):512(2)              R=  +0.4    p~= 0.4       normal
  DC6-9x1Bytes-1                R=  +0.4    p =1-1.1e-3   unusual
  Gap-16:!                      R=  +0.0      "pass"      normal
  Gap-16:A                      R=  +0.5    p = 0.511     normal
  Gap-16:B                      R=  +3.3    p =  9.9e-3   normal

1.B.5 - TestResult objects from calling PractRand directly
PractRand tests natively produce sets of results of type 
PractRand::TestResult.  Each of these objects correspond to a 
subtest result.  The simplest thing to do with them is call their 
get_pvalue() method.  See tests.h for for details on their 
interface, or RNG_test.cpp for examples of more complicated things 
you can do with them.  

1.B.5 - Misc PractRand test result quality notes:
The various tests within PractRand vary widely in terms of the 
quality of p-values they produce - the tests used in the normal test 
set tend to produce cleaner p-values than other tests in PractRand, 
because I spent more effort on them.  In all cases however I have made 
effort to minimize any chance of the tests themselves contributing to 
a false positive - when there is uncertainty what the true p-value 
result should be, I attempt to report the least extreme possibility 
within the range of uncertainty.  

The worst p-value qualities tend to come from tests on short sequences 
of non-power-of-2 lengths (usually 6 KB) or on very long sequences 
(hundreds of gigabytes or more).  

Do NOT assume that p-values from the same results summary have zero 
correlation.  While the normal test set is *mostly* orthogonal it's 
not *completely* orthogonal.  The significant intra-correlations in 
the recommended test set are:
* Gap-16:A is correlated with Gap-16:B.  The strength of correlation 
is very strong for shorter sequences and slowly decreases as sequence 
length increases.  The Gap-16 results should have no significant 
correlation with the other standard tests.  
* BCFN(X1,Y1) should correlate with BCFN(X2,Y2) if and only if X1 and 
X2 are similar - if they differ by 1 then the correlation is moderate, 
if they differ by 2 then the correlation is minor, if they differ by 3 
then the correlation should be insignificant, and at 4 or more the 
correlation should generally be zero.  The Y values are generally not 
very relevant.  
* DC6-9x1Bytes-1 is slightly correlated with BCFN(2,whatever) and 
insignificantly correlated with BCFN(3,whatever).  Higher X BCFNs 
should have zero correlation with DC6-9x1Bytes-1.  
* All of the correlations for pairs of subtests described above also 
exist for pairs of subtests that have names identical to the previous 
described names but with matching [LowX/Y] prefixes added on.  
* A subtest with a prefix of the form [LowX/Y] may correlate with 
subtests of the same name but different or no prefixes.  The degree 
of correlation is (at most) moderate among such cases in the standard 
test sets.  

(why, you ask, are there moderate and strong correlations present in 
an optimized selection of tests?  Because the moderate and strong 
correlations are between subtests that enjoy synergies that allow 
them to be more efficiently calculated at once - the efficiency 
bonuses they enjoy thusly are slightly offset by non-orthogonality 
penalties, but only slightly.  The strongest correlation that is 
not a byproduct of a major efficiency boost is that between 
DC6-9x1Bytes-1 and BCFN(2,whatever), which is only a slight 
correlation)

The expanded test set is not very orthogonal at all - many of its 
tests are strongly correlated with each other.  

1.B.6 - Metatests performed on PractRand p-values:
Normal PractRand p-values are usually fine for performing meta-tests 
on, assuming no ~ is shown with them.  However, using large numbers 
of p-values in a single meta-test may be enough to amplify biases in 
the p-values to noticable levels, depending upon the particular type 
of metatest used and how tolerant you are of bias in the result.  
50 p-values should generally be okay, 5 should always be okay, 
50,000 should only be acceptable on a rather limited range of tests 
and metatests (check emperically).  

If RNG_test printed a "~" or (for those linking with PractRand 
directly) TestResult had a type with BAD in the name (eg TYPE_BAD_P), 
or the result type was a pass/fail result, then metatests are 
probably a bad idea.  

Results summaries from the same run of RNG_test are based upon some 
of the same RNG output and thus have some degree of correlation.  
Normally two consecutive results summaries have 50% overlap, producing 
a substantial degree of correlation, but that can vary with the 
command line parameters used.  

1.B.7 - What specific failures might mean about a PRNG

Failures on test names that begin with [Low1/??] (replacing the ?? 
with 8, 16, 32, or 64) typically mean low-bit issues - not enough 
right-ward mixing (barrel shifts and right-shifts mainly, though also 
table-lookups and some more complex / slower math) in the PRNG.  
However, there are other possibilities - it could be that the bit 
focused on didn't need to be the lowest bit, it just had to be one bit 
per word to emphasize the relationship at that particular spacing.  

Simultaneous failures on multiple unrelated tests often means that the 
PRNGs cycle length was exhausted, or it otherwise changed behaviors to 
something awful near the end of testing.  

There are currently 7 tests in the core test set: one 
parameterization each of BCFN, DC6, Gap-16, FPF, BRank, mod3n, and 
TMFn.  These 7 are not created equal.  The first four of those 
spend a lot of time trying to detect broad categories of biases.  
The last three are only alotted tiny amounts of CPU time, but they 
try to catch narrowly defined type of bias that are known to slip 
past the broader tests sometimes.  

Failures on short-ranged tests, especially DC6 parameterizations, are 
typical of small chaotic PRNGs with insufficient mixing / avalanche.  
Failures on FPF parameterizations can have similar meaning.  However, 
both of those find bias in a wide variety of PRNGs, sometimes for 
non-obvious reasons.  

BCFN failures where the first parameter listed is high (say, 5 or 
more) typically reflect a cyclic buffer being traversed, where 
patterns in the content of that buffer show up at specific spacings.  
On the other hand, BCFN failures where the first parameter listed is 
low are typically similar in meaning to a failure on a DC6 
parameterization (see the previous paragraph).  Note that when I 
refer to the value of the first BCFN parameter, I'm refering to the 
evaluated value listed in the output, meaning that for a test result 
listed under "BCFN(2+3,13-1)" the first parameter would be 5 (2+3).  

Gap16 failures with p-values near 1 generally mean that the PRNG 
output was too regular, typically reflecting a single-cycle PRNG 
nearing the end of its cycle length.  Those with p-value near zero on 
the other hand indicate that some values showed up at more extreme 
spacings than expected, seen most often on single-cycle PRNGs and 
sometimes on rc4-like PRNGs.  

Failures on BRank suggest that in some way the PRNG output, or at 
least part of it, was extremely linear, producable strictly by 
xoring bits of previous PRNG output.  This is classically seen on 
LFSRs / GFSRs, but sometimes it shows up on PRNGs that are 
*supposed* to be chaotic, but actually have some component that 
isn't really.  

For mod3n failures, where the subtest has a name of the form 
"mod3n(X):(Y,Z)", look at the value of Y there.  If the Y value 
is near zero, this corresponds to a short-range correlation.  If 
the Y value is high it corresponds to a long-range correlation.  
This actually looks for simple patterns in the value-modulo-3 of 
the PRNG output.  Most such patterns will be detected by DC6 or 
BCFN, but those two tests both look at hamming weights to spot 
such patterns, and once in a while an otherwise very simple bias 
will show up in such a way that is no impact on the distribution 
of hamming weights.  You might ask "why 3?  why not 5 or 17?".  
Some biases will slip past mod3n but could be caught by a 
hypothetical mod5n.  However, most biases that could be caught 
by a mod5n but not a mod3n will instead be caught by DC6 or 
BCFN - the biases that can get past DC6, BCFN, and mod3n, but 
could get caught by mod5n are very very few.  I think.  

Failures on TMFn mainly happen on LCGs.  TMFn can only find 
simple long-range patterns that only occur on extremely 
structured PRNG output.  It's only included because LCGs are 
popular and all the other included tests are mediocre or 
worse at detecting LCGs.  

Most failures on the core tests (other thatn BRank) should have 
started small and gotten steadily worse as more bytes were tested 
until failure was reached.  Thus, if "FAIL" was reported at 2**36 
bytes then at the previous results summary (at 2**35 bytes) there 
should have been some kind of anomaly on the same subtest, maybe a 
"suspicious" or "very suspicious" test result.  If there was no 
preceding warning and the failure was particularly extreme, that may 
mean that the RNG somehow worsened its behavior in some way shortly 
before the failure was reported, similar to exhausting the cycle 
length, but perhaps in some way detectable to only one of the tests 
in PractRand.  


***********************************************************************
2. General Information on Testing
***********************************************************************

The classic source of empirical statistical tests for random number 
generators is "The Art of Computer Programming", by Donald Knuth.  Several 
empirical tests are described in that book.  

The purpose of such tests is to distinguish data produced by a truly random 
process from data produced by a non-random process.  They look for patterns 
associated with types of not-entirely-random systems that the data might have 
come from instead of coming from a truly random process.  There is a 
theoretical problem here - there are an infinite number of such patterns and 
possible not-entirely-random systems, and if you consider ALL of them then 
all blocks of data are equally likely to be random or non-random.  In 
practice however, this is not actually a problem - most of those infinite 
number of patterns are infinitely complex, utterly impractical to test for, 
and don't really make much sense in the context of math and hardware as we 
know them.  In practice, looking for patterns that tend to arise from non-
random processes of limited complexity and very finite state size built out 
of standard mathematical operations (particularly those that are most 
efficient in binary logic) does a good job.  

In empirical testing, a classic is the Diehard program by Marsaglia.  That 
program is badly outdated today and should not be used anymore, but it was, 
so far as I know, the first real standardized battery of such tests.  A 
standardized battery of tests is just a more clearly defined (and usually 
more convenient) version of running multiple tests on a block of data.  The 
basic results of a standardized battery of tests can be communicated quickly 
and effectively with just the name and version of the standard battery of 
tests and the length of sequence tested (though some batteries operate on 
fixed-size sequence lengths and don't need the last).  That is much easier 
to interpret (not to mention more concise) than a listing of every major 
test name / version / implementation / parameterization tried.  


***********************************************************************
3. General Information on Interpretting Test Results
***********************************************************************

This is all general theory, not anything specific to PractRand.  

3.A - Forms of test results

Most such tests produce results of a form called a "p-value".  p-values 
are numbers between 0 and 1 that represent where a tests results fall 
within the range of results expected on truly random data (ie truly random 
data should produce p-values uniformly distributed between 0 and 1).  The 
results can never truly lie outside the range expected from truly random 
data, but they can lie absurdly close to an edge of the expected range.  
Tests are generally constructed such that the pattern they are looking for 
produces p-values that are very close to zero.  Results that are extremely 
close to 1 are frequently also considered suspicious.  Exactly how close to 
zero or one might be considered a failure varies extremely widely - I have 
seen sources that differed by more than 8 orders of magnitude on the issue.  

Some tests will produce flat pass/fail or pass/ambiguous/fail results in 
place of p-values.  Any pass/fail result should in theory correspond to 
whether or not a hypothetical p-value would be above or below some 
specific threshold (the percent of the time that truly random data would 
produce a fail result).  

Typically p-value calculations are not 100% accurate (ie not 100% 
uniformly distributed between 0 and 1 on truly random data).  Instead they 
have subtle and potentially complex patterns of deviations from uniformity.
In some cases these inaccuracies are insignificant, in other cases they 
are very important.  

Similarly a pass/fail result will fail a percentage of truly random data 
that differs slightly from the intended percentage.  

A p-value can be converted to a pass/fail result simply by comparing it 
to an arbitrary threshold.  Is P less than 0.001?  Then it failed.  That 
kind of thing.  The exact value of the arbitrary threshold used varies 
widely depending upon who picks it - some prefer thresholds of 0.05, 
others prefer thresholds of 0.0000000001.  This is assuming single-ended 
failure criteria, in which a p-value is only considered bad near zero, 
not near one.  If the p-value has double-ended failures you might have 
one threshold at 0.0001 (anything below that is considered a failure) 
and another threshold at 0.9999 (anything above that is considered a 
failure).  

3.B - False positives

The theoretical false positive rate is the chance that a perfect RNG 
would fail a test.  All meaningful evaluations of RNG quality have a 
non-zero false positive rate, though it can be very very close to 
zero in some cases.  

In theory, if you're dealing with p-values (under the idealized 
assumption that only values near zero are failures, values near one 
are not), then the false positive rate is equal to the failure 
threshold.  A failure threshold of 0.001 means a 0.1% false positive 
rate.  

Unfortunately, that does not take in to account inaccuracies in the 
calculations of the p-value, which can have large effects on false- 
positive rates.  

(In PractRand, the target false-positive rate is done on a 
per-results-summary basis rather than the usual per-p-value basis, 
and is currently set so that approximately one in 300 million result 
summaries should show a false positive, meaning that if 300 million 
result summaries are produced from testing good PRNGs, approximately 
1 results summary would include a "FAIL" evaulation.  At that rate, 
if you have PractRand producing a results summary every 2 seconds, 
it would take an average of 20 CPU-years to produce a single 
false-positive.  In practice however, because of biases in 
PractRand's p-values, the actual false-positive rate should be 
substantially lower than that when working with the standard test 
set.)

3.C - Knowing the false-positive rates is important

The important point of test evaluation is that you need to be aware of 
false-positive rates.  Any meaningful scheme for evaluating test 
results must have a non-zero false-positive rate, meaning a chance of 
incorrectly producing a failure on good (ie produced by a random process) 
data.  If the false-positive rate is significantly greater than zero then 
your test results may not mean much on their own.  How much greater than 
zero is significant depends upon the cost of making an error, and how 
many chances to make an error you will have.  If you are producing a 
million test results and want the chance of having any errors at all to 
be on the order of one in a million, then you want a false-positive rate 
on the order of one in a trillion or less.  

The standard approach to dealing with false positive rates that are too 
high is to run tests repeatedly (on different data from the same RNG), 
and ignore failures that did not occur repeatedly, with the number of 
failures required to be considered significant adjusted for the expected 
false-positive rate.  This is a convenient way to handle the issue that 
often meets peoples needs.  However, there are pitfalls to this approach 
that may not be intuitively obvious - having the tests run a variable 
number of times can undermine reproducibility to a surprising degree, 
while running them a constant number of times can slow things down more 
than you might expect if your desired effective false positive rate is 
extremely low.  This sort of thing is a type of metatest.  

(PractRand attempts to set its false-positive rate for "FAIL" messages 
low enough to make it unlikely that any user will ever see a false-
positive from it.  The effectiveness of this strategy is limited by 
two things: 1. I did say "attempts" to set the false-positive rate 
that low.  In practice, for the standard test set, I think I've even 
succeeded.  The expanded test set, and tests not used in either, 
don't necessarily meet the same standards though.  2. users rightly 
pay attention to the number of less extreme result evaluations as 
well, as a kind of informal done-by-eye metatest, for which there is 
no quantified false-positive rate.)

3.D - Metatests

Sometimes instead of considering a tests p-values as a final product 
someone will use it as random data to be put in to another set of 
statistical tests.  Such a test using other tests p-values as input 
is referred to as a metatest.  

P-values for data from good RNGs are supposed to be uniformly 
distributed between 0 and 1.  Test results on data produced by 
different seeds should show no correlation.  Thus p-values from a good 
test run on different seedings of a good RNG should behave like the 
output of a good RNG.  Care must be taken to avoid using correlated (or 
at least strongly correlated) test results in such a way, but this is 
generally a managable problem for a canny user.  

Some test suites have a habit of completely hiding p-values produced 
directly by their basic statistical tests on the random numbers and 
displaying instead metatest results.  

But why would such a thing be desirable?  Here are a few reasons, 
and some associate drawbacks for each:
* If the original test results p-values have biases with known 
properties, this can be an effective way to remove their biases.  
- However this can amplify errors if done incorrectly, and is 
not necessarily the most efficient way to go about correcting 
well understood p-value biases.  
* By combining multiple tests, the sensitivity of a test can be 
scaled up.  For some tests, this may be the *only* practical way 
to scale up their sensitivity.  
- However, in practice this is a very poor way of scaling up a 
tests, so poor that in all cases I have examined you would be 
at least as well off leaving the test unscaled or abandoning the 
test entirely.  
* If there are multiple tests and we don't want to look at their 
individual results but a combined result instead.  
- I have found using extremely crude metatests to be more useful 
for this purpose than sophisticated metatests due to the costs 
imposed by having to eliminate all correlation between the 
various base tests (sophisticated metatests generally require 
that all inter- p-value correlations be completely understood, 
which is generally only possible if all such correlations are 
indistinguishable from zero).  

(PractRand does not encourage the use of metatests by end 
users; metatests are used in the validation and calibration 
of PractRand subtests though.)


***********************************************************************
4. Brief Reviews of Non-PractRand Test Suites
***********************************************************************

gjrand:
	Very good.  Comparable to PractRand.  
	I've only used it on linux, I suspect a windows build would be difficult.  
	Handles all types of RNGs well in my experience.  
	DO NOT RUN MULTIPLE TESTS AT ONCE.  It will silently fail, producing incorrect results.  

	I pipe my data in to it.  I think that's the way it's intended to be used.  

TestU01:
	Decent, but not as good as PractRand or gjrand in my experience.  
	Difficult to build on windows.  
	Slightly worse on chaotic PRNGs than non-chaotic PRNGs.  
	I recommend its SmallCrush, Crush, and BigCrush batteries of tests.  

	I link my PRNGs with it, but I think you can give it files or pipe 
	in data too.  

RaBiGeTe:
	If the new version is out yet it's okay (I've been using a pre-release copy), not as good as TestU01 or PractRand or gjrand but sometimes worth using.  
	Windows only, recent versions are closed-source.  
	Does not need many bits - a big bonus on slow PRNGs.  
	Much worse on chaotic PRNGs than non-chaotic PRNGs.  It tends to work best in my experience on LCG-like PRNGs.  
	Lots of bit-level tests, maybe good for slow non-byte-oriented PRNGs?  
	Numerous minor bugs and presentation issues, but no show-stoppers.  

	I compile my PRNGs in to .dll files for it, but you can also give 
	it files.  

Dieharder:
	Not good, but parts of it show potential.  
	*nix-only.  
	I *strongly* recommend using a customized set of tests for this, never the -a option.  
	I recommend using the dab_ tests and only those tests.  
	dab_monobit2 is the best test, dab_monobit2 + dab_dct is the best pair of tests.  
	Adjust their sequence lengths.  Max out the possible sequence length on dab_monobit2 (it's a 32 bit signed value IIRC).  
	Beware of false positives.  

	I pipe in my PRNG output, but you can also link with Dieharder (it has a viral license for linking though IIRC)

NIST STS:
	Don't bother.  It has multiple critical problems and no significant strengths.  

Diehard:
	Don't bother.  It's of historic value only, very badly outdated.  

