I have lots of experience with the canned procedures of R which I used to fit Time Series and GLM models while taking the corresponding courses at Purdue. In addition, S+FinMetrics was utilized for my energy market project. Below you can see an example where I had to modify someone else's R package (which may be even harder than creating a new one from scratch).
Project 1: Multiple Testing Procedure
"Large-scale
multiple testing" means testing hundreds or thousands ("large scale") of
statistical hypotheses simultaneously ("multiple testing"). It originated in
biostatistical gene studies that involve a large number of hypotheses aimed at
figuring out the role of each gene.
A multiple inference procedure
proposed by
Prof. Efron is implemented in R package
locfdr. It is assumed that the test
statistics of interest (e.g., t-values or F-values) are first
converted into "z-values", that is, N(0, 1) scale. Each z-value is marginally
N(0, 1) if the corresponding null hypothesis is true. The multiple testing
procedure relies on so-called "empirical null distribution" which is estimated
(e.g. via MLE) over a pre-specified "zero interval". All z-values in the zero
interval are assumed to be null.
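The two steps above, converting test statistics to the z-scale and then fitting the empirical null by MLE over a zero interval, can be sketched as follows. This is a generic illustration under simulated pure-null data, not locfdr's actual code; the truncated-normal likelihood and the interval c(-1, 1) are my own choices for the example.

```r
# Sketch (assumptions noted above): map simulated t-statistics to z-values,
# then estimate an empirical null N(delta, sigma^2) by MLE using only the
# z-values that fall inside a pre-specified zero interval.
set.seed(1)
t.stats <- rt(2000, df = 20)            # simulated t-statistics (all null here)
z.vals  <- qnorm(pt(t.stats, df = 20))  # z-values: marginally N(0,1) under the null

zero.int <- c(-1, 1)                    # an assumed, user-chosen zero interval
z0 <- z.vals[z.vals >= zero.int[1] & z.vals <= zero.int[2]]

# Negative log-likelihood of a normal distribution truncated to the zero interval
negll <- function(par) {
  mu <- par[1]; sig <- exp(par[2])      # optimize log(sigma) to keep sigma positive
  -sum(dnorm(z0, mu, sig, log = TRUE)) +
    length(z0) * log(pnorm(zero.int[2], mu, sig) - pnorm(zero.int[1], mu, sig))
}
fit <- optim(c(0, 0), negll)
emp.null <- c(delta = fit$par[1], sigma = exp(fit$par[2]))
emp.null   # for pure-null data the estimates should be close to (0, 1)
```

Because only the z-values inside the zero interval enter the likelihood, non-null cases in the tails do not distort the estimate, which is the point of the zero-interval idea.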
Choosing the zero interval itself
is a bias-variance tradeoff problem. In the package locfdr the interval
for MLE estimation is chosen automatically based on a certain fixed model for
the distribution of z-values. That model is nothing but a rough guess, and a
much better method can be found in
Turnbull (2007). Unfortunately, it was
never implemented in R. My goal is to use a similar bias-variance tradeoff
procedure but for that I have to be able to specify the zero interval which is
impossible in locfdr. Also, locfdr provides the
post-hoc
power measure, Efdr, for the
original sample and for hypothetical multiples of the original sample. However,
it doesn't provide EfdrLeft (EfdrRight), which separately measure the power
for the left (right) tail of the z-distribution. These quantities are of interest if the
null hypothesis is one-sided. Another power-related measure is the proportion of
non-null cases that have tail false discovery rate, Fdr, below a certain
threshold. It would be nice to have it calculated for the original sample and
for hypothetical multiples of it. That way we could see how much the study would
benefit from the increased sample size. Correspondingly, I had to modify
the locfdr package. The new package locfdr2 has the following
features:
1) A new argument "mle.pct0" that allows the user to specify the zero interval for MLE estimation of the empirical null distribution. When "mle.pct0" is omitted, the zero interval is selected just as in locfdr.
2) A new argument "mult.cut". For instance, when mult.cut = c(0.12, 0.15), the above-mentioned thresholds for Fdr are 0.12 on the left and 0.15 on the right.
3) Additional output "mult2" is produced for the power-related statistics.
To see how it works, do the following:
1) Install and launch R version 2.7.1 or higher.
2) Load the original locfdr package (this will also load the packages used by locfdr) via "Packages -> Load package".
3) Load locfdr2: "File -> Source R code", then select this file.
4) Now you can use locfdr2 as shown in this locfdr2 demo or in a similar way.
I successfully applied locfdr2 to the problem of mutual fund performance evaluation, and the final results can be found here.
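Going only by the argument names described above, a call might look like the sketch below. This is hypothetical: the value passed to mle.pct0 and the exact signature are assumptions, and the real usage is in the locfdr2 demo.

```r
# Hypothetical usage sketch -- mle.pct0, mult.cut, and mult2 are the names
# described above; "zz" stands for a vector of z-values, and locfdr2 is
# assumed to have been sourced after loading locfdr.
# res <- locfdr2(zz, mle.pct0 = 0.25,             # assumed form of the zero interval
#                mult.cut = c(0.12, 0.15))        # left/right Fdr thresholds
# res$mult2   # power-related statistics for the sample and its multiples
```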