NPL
Neurological Programs and Libraries
Statistical Functions

Classes

struct  npl::RegrResult
 
class  npl::StudentsT
 Student's T-distribution. A cache of the Probability Density Function and cumulative density function is created using the analytical PDF. More...
 
class  npl::Classifier
 Base class for all ND classifiers. More...
 
class  npl::KMeans
 K-means classifier. More...
 
class  npl::ExpMax
 Expectation Maximization With Gaussian Mixture Model. More...
 

Functions

double npl::sample_var (const Ref< const VectorXd > vec)
 Computes the statistical variance of a column vector. More...
 
void npl::regressOutLS (VectorXd &signal, const MatrixXd &X, const MatrixXd &covInv)
 Removes the effects of X from signal (y). Note that this takes both X and the pseudoinverse of X because the bulk of the computation will be on the pseudoinverse. More...
 
template<typename T >
void npl::fillGaussian (Ref< T > m)
 
double npl::gaussianPDF (double mean, double sd, double x)
 1D Gaussian distribution function More...
 
double npl::gaussianCDF (double mean, double sd, double x)
 1D Gaussian cumulative distribution function More...
 
double npl::gammaPDF_MS (double mean, double sd, double x)
 PDF for the gamma distribution, if mean is negative then it is assumed that x should be negated as well. Unlike gammaPDF, this takes the mean and standard deviation (_MS) More...
 
double npl::mutualInformation (size_t len, double *a, double *b, size_t mbin)
 Computes mutual information between signal a and signal b which are of length len. Marginal bins used is mbin. More...
 
double npl::correlation (size_t len, double *a, double *b)
 Computes correlation between signal a and signal b which are of length len. More...
 
double npl::sample_var (int count, double sum, double sumsqr)
 Takes a count, sum and sumsqr and returns the sample variance. This is slightly different than the variance definition and I can never remember the exact formulation. More...
 
double npl::sample_corr (int count, double sum1, double sum2, double sumsq1, double sumsq2, double s1s2)
 Computes the sample correlation. More...
 
void npl::regress (RegrResult *out, const Ref< const VectorXd > y, const Ref< const MatrixXd > X, const Ref< const VectorXd > covInv, const Ref< const MatrixXd > Xinv, const StudentsT &distrib)
 Computes the Ordinary Least Square predictors, beta for. More...
 
void npl::regress (RegrResult *out, const Ref< const VectorXd > y, const Ref< const MatrixXd > X)
 Computes the Ordinary Least Square predictors, beta for. More...
 
void npl::randomizePowerIterationSVD (const Ref< const MatrixXd > A, size_t subsize, size_t poweriters, MatrixXd &U, VectorXd &E, MatrixXd &V)
 
void npl::randomizePowerIterationSVD (const Ref< const MatrixXd > A, double tol, size_t startrank, size_t maxrank, size_t poweriters, MatrixXd &U, VectorXd &E, MatrixXd &V)
 
MatrixXd npl::pca (const Ref< const MatrixXd > X, double varth=1, int odim=-1)
 Computes the Principal Components of input matrix X. More...
 
MatrixXd npl::rpiPCA (const Ref< const MatrixXd > X, double varth, int odim)
 Computes the Principal Components of input matrix X using the randomized power iteration SVD algorithm. More...
 
MatrixXd npl::symICA (const Ref< const MatrixXd > Xin, MatrixXd *unmix=NULL)
 Computes the Independent Components of input matrix X using symmetric decorrlation. Note that you should run PCA on X before running ICA. More...
 
MatrixXd npl::asymICA (const Ref< const MatrixXd > Xin, MatrixXd *unmix=NULL)
 Computes the Independent Components of input matrix X using sequential component extraction. Note that you should run PCA on X before running ICA. More...
 
MatrixXd npl::pseudoInverse (const Ref< const MatrixXd > X)
 Computes the pseudoinverse of the input matrix. More...
 
VectorXd npl::shootingRegr (const Ref< const MatrixXd > X, const Ref< const VectorXd > y, double gamma)
 Performs LASSO regression using the 'shooting' algorithm of. More...
 
VectorXd npl::activeShootingRegr (const Ref< const MatrixXd > X, const Ref< const VectorXd > y, double gamma)
 Performs LASSO regression using the 'activeShooting' algorithm of. More...
 
MatrixXd npl::pcacov (const Ref< const MatrixXd > cov, double varth)
 Computes the Principal Components of input matrix X using the covariance matrix. More...
 
void npl::expMax1D (const Ref< const VectorXd > data, vector< std::function< double(double, double, double)>> pdfs, Ref< VectorXd > mean, Ref< VectorXd > sd, Ref< VectorXd > prior, std::string plotfile="")
 Computes the mean and standard deviation of multiple distributions based on 1D data. The probability distribution functions should be passed in through a vector of function objects (pdfs) taking mu/sd/x. More...
 
void npl::gaussGammaMixtureModel (const Ref< const VectorXd > data, Ref< VectorXd > mu, Ref< VectorXd > sd, Ref< VectorXd > prior, std::string plotfile)
 Computes the mean and standard deviation of multiple distributions based on 1D data. This is a special version in which a negative gamma, a gaussian and positive gamma. The means of the negative and positive gamma are relative to the center of the gaussian. More...
 
void npl::approxKMeans (const Ref< const MatrixXd > samples, size_t nclass, MatrixXd &means)
 Approximates k-means using the algorithm of: More...
 
int npl::fastSearchFindDP (const Eigen::MatrixXf &samples, double thresh, double outthresh, Eigen::VectorXi &classes, bool brute=false)
 Algorithm of unsupervised learning (clustering) based on density. More...
 
int npl::findDensityPeaks (const Eigen::MatrixXf &samples, double thresh, Eigen::VectorXf &rho, Eigen::VectorXf &delta, Eigen::VectorXi &parent)
 Computes Density and Peak computation for Fast Search and Find of Density Peaks algorithm. More...
 
int npl::findDensityPeaks_brute (const Eigen::MatrixXf &samples, double thresh, Eigen::VectorXf &rho, Eigen::VectorXf &delta, Eigen::VectorXi &parent)
 Computes Density and Peak computation for Fast Search and Find of Density Peaks algorithm. This is a slower, non-bin based version. More...
 

Detailed Description

Function Documentation

VectorXd npl::activeShootingRegr ( const Ref< const MatrixXd >  X,
const Ref< const VectorXd >  y,
double  gamma 
)

Performs LASSO regression using the 'activeShooting' algorithm of.

Peng, J., Wang, P., Zhou, N., & Zhu, J. (2009). Partial Correlation Estimation by Joint Sparse Regression Models. Journal of the American Statistical Association, 104(486), 735–746. doi:10.1198/jasa.2009.0126

Essentially solves the equation:

y = X * beta

where beta is mostly 0's

Parameters
XDesign matrix
yMeasured value
gammaWeight of regularization (larger values forces sparser model)
Returns
Beta vector
void npl::approxKMeans ( const Ref< const MatrixXd >  samples,
size_t  nclass,
MatrixXd &  means 
)

Approximates k-means using the algorithm of:

'Fast Approximate k-Means via Cluster Closures' by Wang et al

This is not really meant to be used by outside users, but as an initializer

Parameters
samplesMatrix of samples, one sample per row,
nclassNumber of classes to break samples up into
meansEstimated mid-points/means
MatrixXd npl::asymICA ( const Ref< const MatrixXd >  Xin,
MatrixXd *  unmix = NULL 
)

Computes the Independent Components of input matrix X using sequential component extraction. Note that you should run PCA on X before running ICA.

Outputs reduced dimension (fewer cols) in output

Note that this whole problem is the transpose of the version listed on wikipedia.

In: I number of components In: X RxC matrix with rows representing C-D samples for p in 1 to I: wp = random weight while wp changes: wp = g(X wp)X/R - wp SUM(g'(X wp))/R wp = wp - SUM wp^T wj wj normalize(wp)

Output: W = [w0 w1 w2 ... ] Output: S = XW, where each column is a dimension, each row a sample

Parameters
XinRxC matrix where each row is a sample, each column a dimension (or feature). The number of columns in the output will be fewer because there will be fewer features. Columns should be zero-mean and uncorrelated with one another.
unmixoutput unmixing matrix, can be null if you don't need it
Returns
RxP matrix, where P is the number of independent components
double npl::correlation ( size_t  len,
double *  a,
double *  b 
)

Computes correlation between signal a and signal b which are of length len.

Parameters
lenLength of signal and and b
aSignal a
bSignal b
Returns
void npl::expMax1D ( const Ref< const VectorXd >  data,
vector< std::function< double(double, double, double)>>  pdfs,
Ref< VectorXd >  mean,
Ref< VectorXd >  sd,
Ref< VectorXd >  prior,
std::string  plotfile = "" 
)

Computes the mean and standard deviation of multiple distributions based on 1D data. The probability distribution functions should be passed in through a vector of function objects (pdfs) taking mu/sd/x.

Parameters
dataData points to fit
pdfsProbability distribution functions of the form pdf(mu, sd, x) and returning the probability density at x
meanOutput mean of each distribution, value when called will be used for initialization so it should be pre-set and pre-allocated
sdOutput standard deviation of each distribution, value when called will be used for initialization so it should be pre-set and pre-allocated
priorOutput prior probability of each distribution, initial value is not used but it should be pre-allocated
plotfilefile to plot histogram in (svg)
int npl::fastSearchFindDP ( const Eigen::MatrixXf &  samples,
double  thresh,
double  outthresh,
Eigen::VectorXi &  classes,
bool  brute = false 
)

Algorithm of unsupervised learning (clustering) based on density.

see "Clustering by fast search and find of density peaks" by Rodriguez, a. Laio, A.

Parameters
samplesSamples, S x D matrix with S is the number of samples and D is the dimensionality. This must match the internal dimension count.
threshThreshold distance for density calculation
outthreshthreshold for outlier, ratio of standard devation. Should be > 2 because you want to be well outside the center of the distribution.
classesOutput classes
brutewhether ther use slower brute force method for density calculation (for testing purposes only)

return -1 if maximum number of iterations hit, 0 otherwise (converged)

template<typename T >
void npl::fillGaussian ( Ref< T >  m)

Definition at line 60 of file statistics.h.

int npl::findDensityPeaks ( const Eigen::MatrixXf &  samples,
double  thresh,
Eigen::VectorXf &  rho,
Eigen::VectorXf &  delta,
Eigen::VectorXi &  parent 
)

Computes Density and Peak computation for Fast Search and Find of Density Peaks algorithm.

Parameters
samplesSamples, S x D matrix with S is the number of samples and D is the dimensionality. This must match the internal dimension count.
threshThreshold for density calculation
rhoPoint densities
deltaDistance to nearest peak
parentIndex (point) that is the nearest peak
Returns
0 if successful
int npl::findDensityPeaks_brute ( const Eigen::MatrixXf &  samples,
double  thresh,
Eigen::VectorXf &  rho,
Eigen::VectorXf &  delta,
Eigen::VectorXi &  parent 
)

Computes Density and Peak computation for Fast Search and Find of Density Peaks algorithm. This is a slower, non-bin based version.

Parameters
samplesSamples, S x D matrix with S is the number of samples and D is the dimensionality. This must match the internal dimension count.
threshThreshold for density calculation
rhoPoint densities
deltaDistance to nearest peak
parentIndex (point) that is the nearest peak
Returns
0 if successful
double npl::gammaPDF_MS ( double  mean,
double  sd,
double  x 
)

PDF for the gamma distribution, if mean is negative then it is assumed that x should be negated as well. Unlike gammaPDF, this takes the mean and standard deviation (_MS)

mean = k theta var = k theta theta

theta = var / mean k = mean / theta

prob(mu, sd, x) = x^{k-1}exp(-x/theta)/(gamma(k) theta^k) log prob(mu, sd, x) = (k-1)log(x)-x/theta-log(gamma(k)) - k*log(theta)

Parameters
meanMean to compute density of
sdStandard deviation of distribution
xX position to compute density of (<0 produces 0 probability unless mu is negative
Returns
Probability Density
void npl::gaussGammaMixtureModel ( const Ref< const VectorXd >  data,
Ref< VectorXd >  mu,
Ref< VectorXd >  sd,
Ref< VectorXd >  prior,
std::string  plotfile 
)

Computes the mean and standard deviation of multiple distributions based on 1D data. This is a special version in which a negative gamma, a gaussian and positive gamma. The means of the negative and positive gamma are relative to the center of the gaussian.

Parameters
dataData points to fit
muOutput mean of each distribution, value when called will be used for initialization so it should be pre-set and pre-allocated
sdOutput standard deviation of each distribution, value when called will be used for initialization so it should be pre-set and pre-allocated
priorOutput prior probability of each distribution, initial value is not used but it should be pre-allocated
plotfilefile to plot histogram in (svg)
double npl::gaussianCDF ( double  mean,
double  sd,
double  x 
)

1D Gaussian cumulative distribution function

Parameters
meanMean of gaussian distribution
sdStandard deviation
xPosition to get probability of
Returns
Probability of value falling below x
double npl::gaussianPDF ( double  mean,
double  sd,
double  x 
)

1D Gaussian distribution function

Parameters
meanMean of gaussian distribution
sdStandard deviation
xPosition to get probability of
Returns
Probability Density at Point
double npl::mutualInformation ( size_t  len,
double *  a,
double *  b,
size_t  mbin 
)

Computes mutual information between signal a and signal b which are of length len. Marginal bins used is mbin.

Parameters
lenLength of signal and and b
aSignal a
bSignal b
mbinBins to use in marginal distribution (mbin*mbin) used in joint
Returns
MatrixXd npl::pca ( const Ref< const MatrixXd >  X,
double  varth = 1,
int  odim = -1 
)

Computes the Principal Components of input matrix X.

Outputs reduced dimension (fewer cols) in output. Note that prior to this, the columns of X should be 0 mean otherwise the first component will be the mean

Parameters
XRxC matrix where each column row is a sample, each column a dimension (or feature). The number of columns in the output will be fewer because there will be fewer features
varthVariance threshold. This is the ratio (0-1) of variance to include in the output. This is used to determine the dimensionality of the output. If this is 1 then all variance will be included. If this < 1 and odim > 0 then whichever gives a larger output dimension will be selected.
odimThreshold for output dimensions. If this is <= 0 then it is ignored, if it is > 0 then max(dim(varth), odim) is used as the output dimension.
Returns
RxP matrix, where P is the number of principal components
MatrixXd npl::pcacov ( const Ref< const MatrixXd >  cov,
double  varth 
)

Computes the Principal Components of input matrix X using the covariance matrix.

Outputs projections in the columns of a matrix.

Parameters
covCovariance matrix of XtX
varthVariance threshold. Don't include dimensions after this percent of the variance has been explained.
Returns
RxP matrix, where P is the number of principal components
MatrixXd npl::pseudoInverse ( const Ref< const MatrixXd >  X)

Computes the pseudoinverse of the input matrix.

$ P = UE^-1V^* $

Returns
Psueodinverse
void npl::randomizePowerIterationSVD ( const Ref< const MatrixXd >  A,
size_t  subsize,
size_t  poweriters,
MatrixXd &  U,
VectorXd &  E,
MatrixXd &  V 
)
Parameters
AM x N
subsizeColumns in projection matrix,
poweritersNumber of power iterations to perform
UOutput Left Singular Vectors
EOUtput Singular Values
VOutput Right Singular Vectors
void npl::randomizePowerIterationSVD ( const Ref< const MatrixXd >  A,
double  tol,
size_t  startrank,
size_t  maxrank,
size_t  poweriters,
MatrixXd &  U,
VectorXd &  E,
MatrixXd &  V 
)
void npl::regress ( RegrResult out,
const Ref< const VectorXd >  y,
const Ref< const MatrixXd >  X,
const Ref< const VectorXd >  covInv,
const Ref< const MatrixXd >  Xinv,
const StudentsT distrib 
)

Computes the Ordinary Least Square predictors, beta for.

$ y = \hat \beta X $

Returning beta. This is the same as the other regress function, but allows for cacheing of pseudoinverse of X

Parameters
outStruct with Regression Results.
yresponse variables
Xindependent variables
covInvInverse of covariance matrix, to compute us pseudoinverse(X^TX)
XinvPseudo inverse of X. Compute with pseudoInverse(X)
distribPre-computed students' T distribution. Example: auto v = students_t_cdf(X.rows()-1, .1, 1000);
Returns
Struct with Regression Results.
void npl::regress ( RegrResult out,
const Ref< const VectorXd >  y,
const Ref< const MatrixXd >  X 
)

Computes the Ordinary Least Square predictors, beta for.

$ y = \hat \beta X $

Returning beta. This is the same as the other regress function, but allows for cacheing of pseudoinverse of X

Parameters
outStruct with Regression Results.
yresponse variables
Xindependent variables
void npl::regressOutLS ( VectorXd &  signal,
const MatrixXd &  X,
const MatrixXd &  covInv 
)
inline

Removes the effects of X from signal (y). Note that this takes both X and the pseudoinverse of X because the bulk of the computation will be on the pseudoinverse.

Each column of X should be a regressor, Xinv should be the pseudoinverse of X

Beta in OLS may be computed with the pseudoinverse (P): B = Py where P is the pseudoinverse: P = VE^{-1}U^*

Parameters
signalresponse term (y), will be modified to remove the effects of X
XDesign matrix, or independent values in colums
covInvthe pseudoinverse of XtX

Definition at line 53 of file statistics.h.

MatrixXd npl::rpiPCA ( const Ref< const MatrixXd >  X,
double  varth,
int  odim 
)

Computes the Principal Components of input matrix X using the randomized power iteration SVD algorithm.

Outputs reduced dimension (fewer cols) in output. Note that prior to this, the columns of X should be 0 mean otherwise the first component will be the mean

Parameters
XRxC matrix where each column row is a sample, each column a dimension (or feature). The number of columns in the output will be fewer because there will be fewer features
varthVariance threshold. This is the ratio (0-1) of variance to include in the output. This is used to determine the dimensionality of the output. If this is 1 then all variance will be included. If this < 1 and odim > 0 then whichever gives a larger output dimension will be selected.
odimThreshold for output dimensions. If this is <= 0 then it is ignored, if it is > 0 then max(dim(varth), odim) is used as the output dimension.
Returns
RxP matrix, where P is the number of principal components
double npl::sample_corr ( int  count,
double  sum1,
double  sum2,
double  sumsq1,
double  sumsq2,
double  s1s2 
)
inline

Computes the sample correlation.

Parameters
countNumber of samples
sum1Sum of group 1
sum2Sum of group 2
sumsq1Sum of the sqaure of the elements of group1
sumsq2Sum of the sqaure of the elements of group2
s1s2Elements of group1*elements of group 2
Returns
Sample Correlation

Definition at line 179 of file statistics.h.

double npl::sample_var ( const Ref< const VectorXd >  vec)

Computes the statistical variance of a column vector.

Parameters
vecInput samples
Returns
Variance
double npl::sample_var ( int  count,
double  sum,
double  sumsqr 
)
inline

Takes a count, sum and sumsqr and returns the sample variance. This is slightly different than the variance definition and I can never remember the exact formulation.

Parameters
countNumber of samples
sumsum of samples
sumsqrsum of square of the samples
Returns
sample variance

Definition at line 161 of file statistics.h.

VectorXd npl::shootingRegr ( const Ref< const MatrixXd >  X,
const Ref< const VectorXd >  y,
double  gamma 
)

Performs LASSO regression using the 'shooting' algorithm of.

Fu, W. J. (1998). Penalized Regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7(3), 397. doi:10.2307/1390712

Essentially solves the equation:

y = X * beta

where beta is mostly 0's

Parameters
XDesign matrix
yMeasured value
gammaWeight of regularization (larger values forces sparser model)
Returns
Beta vector
MatrixXd npl::symICA ( const Ref< const MatrixXd >  Xin,
MatrixXd *  unmix = NULL 
)

Computes the Independent Components of input matrix X using symmetric decorrlation. Note that you should run PCA on X before running ICA.

Same number of cols as output

Note that this whole problem is the transpose of the version listed on wikipedia.

In: I number of components In: X RxC matrix with rows representing C-D samples for p in 1 to I: wp = random weight while wp changes: wp = g(X wp)X/R - wp SUM(g'(X wp))/R wp = wp - SUM wp^T wj wj normalize(wp)

Output: W = [w0 w1 w2 ... ] Output: S = XW, where each column is a dimension, each row a sample

Parameters
XinRxC matrix where each row is a sample, each column a dimension (or feature). The number of columns in the output will be fewer because there will be fewer features. Columns should be zero-mean and uncorrelated with one another.
unmixoutput unmixing matrix, can be null if you don't need it
Returns
RxP matrix, where P is the number of independent components