
S+ / R


[Under Construction]
I have extensive experience with R's built-in procedures, which I used to fit time series and GLM models in the corresponding courses at Purdue. I also used S+FinMetrics for my energy market project. The example below is a project in which I had to modify someone else's R package, which can be even harder than writing a new one from scratch.

Project 1: Multiple Testing Procedure

"Large-scale multiple testing" means testing hundreds or thousands ("large scale") of statistical hypotheses simultaneously ("multiple testing"). The problem originated in biostatistical gene studies, which involve a large number of hypotheses aimed at determining the role of each gene.

A multiple inference procedure proposed by Prof. Efron is implemented in the R package locfdr. It assumes that the test statistics of interest (e.g., t-values or F-values) have first been converted into "z-values", that is, put on the N(0, 1) scale, so that each z-value is marginally N(0, 1) if the corresponding null hypothesis is true. The procedure relies on the so-called "empirical null distribution", which is estimated (e.g., via MLE) over a pre-specified "zero interval". All z-values in the zero interval are assumed to be null.
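The conversion to z-values can be illustrated with a minimal base-R sketch. It assumes t-statistics with a known number of degrees of freedom; the helper name t_to_z, the simulated data and the choice df = 20 are hypothetical and are not part of locfdr.

```r
# Convert t-statistics to z-values: z = qnorm(pt(t, df)).
# Under the null, pt(t, df) is Uniform(0, 1), so z is N(0, 1).
t_to_z <- function(t_stat, df) {
  qnorm(pt(t_stat, df = df))
}

set.seed(1)                        # simulated data, for illustration only
df     <- 20                       # hypothetical degrees of freedom
t_null <- rt(5000, df = df)        # t-statistics from true null hypotheses
z_null <- t_to_z(t_null, df)

# The resulting z-values should have mean near 0 and sd near 1.
round(c(mean = mean(z_null), sd = sd(z_null)), 2)
```

The same idea applies to other statistics: replace pt() with the null distribution function of the statistic in question.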

Choosing the zero interval is itself a bias-variance tradeoff problem. In locfdr, the interval for the MLE is chosen automatically, based on a fixed model for the distribution of z-values. That model is only a rough guess; a much better method can be found in Turnbull (2007), but unfortunately it was never implemented in R. My goal is to use a similar bias-variance tradeoff procedure, and for that I need to be able to specify the zero interval, which is impossible in locfdr.
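To make the role of the zero interval concrete, here is a simplified base-R sketch that fits a normal empirical null by maximum likelihood, treating the z-values inside a user-chosen interval as a truncated normal sample. This is an illustration under my own assumptions, not the code of locfdr or locfdr2; the function name fit_null_mle, the interval endpoints and the simulated mixture are all hypothetical.

```r
# MLE of the empirical null N(mu0, sigma0^2) from the z-values that fall
# inside a user-chosen zero interval [lo, hi], treated as truncated normal.
fit_null_mle <- function(z, lo, hi) {
  z0 <- z[z >= lo & z <= hi]           # assumed to be null cases only
  negloglik <- function(par) {
    mu    <- par[1]
    sigma <- exp(par[2])               # log scale keeps sigma positive
    mass  <- pnorm(hi, mu, sigma) - pnorm(lo, mu, sigma)
    -sum(dnorm(z0, mu, sigma, log = TRUE)) + length(z0) * log(mass)
  }
  opt <- optim(c(mean(z0), log(sd(z0))), negloglik)
  c(mu0 = opt$par[1], sigma0 = exp(opt$par[2]), n0 = length(z0))
}

set.seed(2)
z <- c(rnorm(4500, mean = 0.1, sd = 1.1),   # null cases (simulated)
       rnorm(500,  mean = 3.5, sd = 1))     # non-null cases (simulated)

# Narrow interval: little non-null contamination, but fewer points (variance).
fit_null_mle(z, lo = -1.5, hi = 1.5)
# Wide interval: more points, but more non-null contamination (bias).
fit_null_mle(z, lo = -2.5, hi = 2.5)
```

Comparing the two fits shows why the interval matters: the estimates of mu0 and sigma0 move with the endpoints, so a fixed automatic choice cannot suit every data set.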

locfdr also provides the post-hoc power measure Efdr for the original sample and for hypothetical multiples of the original sample. However, it does not provide EfdrLeft and EfdrRight, which measure the power separately for the left and right tails of the z-distribution. These quantities are of interest when the null hypothesis is one-sided. Another power-related measure is the proportion of non-null cases whose tail false discovery rate, Fdr, is below a certain threshold. It would be useful to have this calculated for the original sample and for hypothetical multiples of it, to see how much the study would benefit from a larger sample size.
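The tail-Fdr proportion can be sketched in base R on simulated data. I assume here the usual tail-area definition, null proportion times null tail probability divided by the empirical tail proportion, with a known null N(0, 1) and a known null proportion p0. In a simulation the non-null labels are known, which is not the case for real data, so this only illustrates the quantity and is not how locfdr2 computes it. The names tail_fdr, fdr_cut and is_nonnull are hypothetical.

```r
# Tail false discovery rates with F the empirical cdf of the z-values:
#   left:  p0 * F0(z) / F(z)
#   right: p0 * (1 - F0(z)) / (share of cases >= z)
tail_fdr <- function(z, p0 = 1, mu0 = 0, sigma0 = 1) {
  n <- length(z)
  f_left  <- rank(z,  ties.method = "max") / n   # share of cases <= z
  f_right <- rank(-z, ties.method = "max") / n   # share of cases >= z
  list(left  = pmin(1, p0 * pnorm(z, mu0, sigma0) / f_left),
       right = pmin(1, p0 * pnorm(z, mu0, sigma0, lower.tail = FALSE) / f_right))
}

set.seed(3)
is_nonnull <- rep(c(FALSE, TRUE), c(4500, 500))  # known only in a simulation
z <- c(rnorm(4500), rnorm(250, mean = -3), rnorm(250, mean = 3))

fdr     <- tail_fdr(z, p0 = 0.9)
fdr_cut <- c(0.12, 0.15)                         # left and right thresholds

# Proportion of non-null cases whose tail Fdr is below each threshold
c(left  = mean(fdr$left[is_nonnull]  <= fdr_cut[1]),
  right = mean(fdr$right[is_nonnull] <= fdr_cut[2]))
```

Rerunning the sketch with stronger non-null effects, which is roughly what a larger sample would produce, raises both proportions.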

Accordingly, I had to modify the locfdr package. The new package, locfdr2, has the following features:

1) A new argument "mle.pct0" that allows the user to specify the zero interval for MLE estimation of the empirical null distribution. When "mle.pct0" is omitted, the zero interval is selected just as in locfdr.

2) A new argument "mult.cut". For instance, when mult.cut = (0.12, 0.15), the above-mentioned thresholds for Fdr are 0.12 on the left and 0.15 on the right.

3) An additional output "mult2" containing the power-related statistics.

To see how it works, do the following:

1) Install and launch R version 2.7.1 or higher.

2) Load the original locfdr package (this will also load the packages used by locfdr) via "Packages -> Load package".

3) Load locfdr2: "File -> Source R code", then select this file.

4) Now you can use locfdr2 as shown in this locfdr2 demo or in a similar way.

I successfully applied locfdr2 to the problem of mutual fund performance evaluation, and the final results can be found here.

