A Systematic Approach to Increase Reproducibility in Simulation Studies

Citation: Wu X, Rai SN (2017) A Systematic Approach to Increase Reproducibility in Simulation Studies. Int J Clin Biostat Biom 3:012. doi.org/10.23937/2469-5831/1510012
Received: July 25, 2017; Accepted: October 05, 2017; Published: October 07, 2017
Copyright: © 2017 Wu X, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
Reproducibility of scientific discovery is an important issue in science, medicine, engineering and other fields, because it provides essential validation [2][3][4][5][6]. Despite its importance, many scientific findings lack reproducibility [7][8][9][10]. Failure to reproduce results has been a major concern of journal editorial boards [11][12][13][14]. Risks from multiplicity and misinterpretation of P-values are widespread [15][16][17]. Extra variability in the P-value may lead to irreproducible results [1]. Thus, the lack of reproducibility presents a fundamental problem in statistical inference. Although inference based on the P-value continues to occupy a prominent place in research, any reports or [...]
[...] underlying true means significantly differ. However, in 4 simulations the P-values vary from highly significant to highly non-significant. Furthermore, the P-value reproducibility is affected by potential outliers and small sample sizes in the normally distributed populations A ($n_1$) and B ($n_2$), with $n_1 = n_2 = 10$.
To draw statistical inference, we construct hypotheses. Usually the null hypothesis ($H_0$) states that the means of two populations A and B are the same, whereas the alternative hypothesis ($H_1$) states that they are different. Using the sample data, we evaluate these claims. To test these hypotheses, we construct the critical region (range of the mean) and calculate the probability of the critical region under the two hypotheses. The significance level (commonly represented by $\alpha$) is the probability of the critical region when the null hypothesis is true. The power (commonly represented by $1-\beta$) is the probability of the critical region when the alternative hypothesis is true.
We expand this example further to study the effect of reproducibility. First, we examine the reliability of sample values between replicates. Populations A and B have a common variance of 1 but different means of 0.5 and 0.0, respectively. From each of the populations A and B, we randomly draw 6 sets of samples of sizes 20, 50 and 80, respectively. Note that in these settings ($n_1 = n_2 = 20$; $n_1 = n_2 = 50$; or $n_1 = n_2 = 80$), the power is 33.8%, 69.7% or 88.2%, respectively, at the significance level $\alpha = 0.05$ for comparing means using a two-sided two-sample t-test.
Figure 2 reports the sample values against 6 replicates for each of the sample sizes from each of the populations. As shown in Figure 2, the sample values vary markedly across replicates, especially for small samples. The values of a large sample drawn from a population might tend to represent that population because the sample is less subject to random variation, while the values of a small sample drawn from a population might [...]
[...] are 34.2%, 68.0% and 88.9%, and are in agreement with the theoretical power. From Figure 1 and Figure 3 we further see that the simulated results tend to be more reliable as the sample sizes increase. Figure 1 of Halsey, et al. [1] shows contradicting P-values for smaller sample sizes ($n_1 = 10$ and $n_2 = 10$). Therefore, there is a need for robust replications irrespective of the sample sizes. In the following section, we develop a systematic approach to reduce the variability in replications.
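As a rough check on the power values quoted above for sample sizes 20, 50 and 80, the following sketch simulates the two-sample comparison (means 0.5 and 0.0, common variance 1) and reports the share of replicates with a P-value below 0.05. It is a minimal illustration under stated assumptions, not the authors' code: the pooled-variance form of the t-test, the function names, the seed and the numerical integration used for the t distribution are all choices made here.

```python
import math
import random


def t_two_sided_p(t_stat, df, steps=200):
    """Two-sided P-value for a Student t statistic.

    Integrates the t density from 0 to |t| with Simpson's rule
    (steps must be even); an assumption made to stay within the stdlib.
    """
    x = abs(t_stat)
    if x == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3          # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((v - m) ** 2 for v in xs) / (n - 1)


def pooled_t_test(a, b):
    """Two-sided two-sample t-test with pooled variance (assumed form)."""
    na, nb = len(a), len(b)
    ma, va = mean_and_var(a)
    mb, vb = mean_and_var(b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    t_stat = (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t_two_sided_p(t_stat, df)


def empirical_power(n, mean_a=0.5, mean_b=0.0, sd=1.0,
                    reps=1000, alpha=0.05, seed=2017):
    """Share of simulated replicates whose P-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(mean_a, sd) for _ in range(n)]
        b = [rng.gauss(mean_b, sd) for _ in range(n)]
        if pooled_t_test(a, b) < alpha:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (20, 50, 80):
        print(f"n1 = n2 = {n}: empirical power = {empirical_power(n):.3f}")
```

With 1000 replicates the estimates carry a Monte Carlo standard error of roughly 1 to 1.5 percentage points, so they should land near, but not exactly on, the theoretical values.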

Two-Stage Approach for Variability Reduction
In replicate studies, some of the replicates may be overdispersed, and discarding those should lead to better inference. We develop a systematic approach, in two stages, to evaluate the quality of simulations before any estimation. In the first stage of our approach, we reduce variability. In the second stage, we apply commonly used statistical methods such as point estimation, confidence intervals and hypothesis testing to the reduced replicates. In the following, we consider two scenarios: one-sample and two-sample inference.
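The two stages can be sketched in code for the one-sample case. This is only an illustration of the general idea, under loudly stated assumptions: the screening conditions below (a normal bound on the replicate mean and a chi-square bound on the replicate variance, both controlled by a hypothetical parameter `gamma`) are stand-ins chosen here and are not claimed to be the paper's own criterion. The chi-square quantiles use the Wilson-Hilferty approximation so that only the standard library is needed.

```python
import math
import random
from statistics import NormalDist


def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-square quantile with k df."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * math.sqrt(2 / (9 * k))) ** 3


def mean_sd(xs):
    """Sample mean and sample standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in xs) / (n - 1))


def two_stage(sample, n_boot=1000, gamma=0.05, seed=1):
    """Screen parametric bootstrap replicates, then summarise the kept ones.

    gamma is a hypothetical screening level (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(sample)
    y_bar, s = mean_sd(sample)
    z = NormalDist().inv_cdf(1 - gamma / 2)
    mean_bound = z * s / math.sqrt(n)
    var_lo = chi2_quantile(gamma / 2, n - 1)
    var_hi = chi2_quantile(1 - gamma / 2, n - 1)

    all_means, kept_means = [], []
    for _ in range(n_boot):
        # Parametric bootstrap replicate drawn from N(y_bar, s^2).
        boot = [rng.gauss(y_bar, s) for _ in range(n)]
        m_b, s_b = mean_sd(boot)
        all_means.append(m_b)
        # Stage 1: flag the whole replicate if either assumed condition fails.
        mean_ok = abs(m_b - y_bar) <= mean_bound
        var_ok = var_lo <= (n - 1) * s_b ** 2 / s ** 2 <= var_hi
        if mean_ok and var_ok:
            kept_means.append(m_b)
    # Stage 2: ordinary inference on the retained replicates
    # (here reduced to collecting their means for later summaries).
    return {"n_kept": len(kept_means),
            "all_means": all_means,
            "kept_means": kept_means}


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.5, 1.0) for _ in range(20)]
    result = two_stage(data)
    print("replicates kept:", result["n_kept"], "of", len(result["all_means"]))
    print("SD of means, all replicates:", round(mean_sd(result["all_means"])[1], 4))
    print("SD of means, kept replicates:", round(mean_sd(result["kept_means"])[1], 4))
```

Whole replicates are dropped rather than individual observations, so the number of retained replicates is itself random even though the number drawn is fixed.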

One-sample setting
[...] not tend to represent that population because, in this case, the sample is subject to random variation. However, whether the sample size is small or large, the sample values remain variable, possibly with extra variability.
Next, we examine the variability in the P-value, the estimated difference and the range of the confidence interval. Figure 1 reports these statistical measures against replicates for each of the sample sizes. Although the P-values tend to be more reliable as the corresponding sample sizes increase, variability still exists. Similarly, the estimated differences in means vary considerably across replicates, even for large samples. The ranges of the confidence intervals vary across replicates for small samples as well. This variability between replicates results in variability in the P-values; this issue is discussed at length in a recently published paper in Nature Methods [1].
Further, we examine the effect of sample size on the variability of statistical measures. From each of the populations A and B, we generate 1000 samples of sizes 20, 50 and 80. Figure 3 reports the distributions of the P-values for a two-sided test of the null hypothesis (no difference in means) and the distributions of the point estimates of the mean difference along with 95% confidence intervals. We calculate an empirical power, defined as the percentage of replicates in which the difference between population means is declared significant, that is, in which the P-value is less than 0.05. The empirical power corresponding to three sample sizes [...]
[...]. Then, we know that [...]. Therefore, [...]
The above two equations can be re-expressed as [...], where [...]. Here $\chi^2$ represents a chi-square distribution with $n - t$ degrees of freedom. Note that [...]
[...] expressions in Equation (1) provide a tool to measure the quality of replicated samples. Now, we define settings and quantitative measures in replicated studies which are generated using parametric bootstrap samples [18] [...] satisfies at least one condition in Equation (2), then the sample is called extra-variant and is removed from further statistical inference.
To estimate the parameters µ or [...], we remove all the possible extra-variant samples and then perform statistical inference. Even though the number of replicate studies (bootstrap size) is predetermined, the resulting bootstrap size will be a random variable. Note that we do not discard individual observations within a sample but discard the entire sample.
[...] to declare an extra-variant pair of samples and to remove it from further inference.
As before, we estimate parameters [...]. Statistical analyses are performed on the reduced replicates to compare means in one-sample and two-sample settings, which we call the second stage of the entire analysis. This increases the reproducibility in simulation studies. As in the one-sample setting, we estimate and compare [...]

Simulation Study
First, the advantage of our two-stage approach can be explicitly visualized in simulation studies. To compare statistical inference for the P-value, point estimate or confidence interval using the original replicates and using the reduced replicates, we generated 1000 data sets, each including 20 observations from [...]. We also calculate the P-values for a one-sample t-test where the null hypothesis $H_0$ is that the population mean is zero and the alternative hypothesis $H_1$ is that the population mean is not zero. Figure 4 reports the distributions of the P-values along with the distributions of the point estimates and ranges of 95% confidence intervals for the population means using the original replicates and using the reduced replicates. As shown in this figure, the P-values using the reduced replicates are less variable than those with the original replicates. Also, we gain in power using this reduction approach, as expected. Furthermore, from an inference point of view, the point estimates and confidence intervals using the reduced replicates are more precise than without reduction.
To compare the P-values for a one-sample t-test using the original data and using the reduced data where the null hypothesis $H_0$ is true, we generate 1000 data sets in which each data set includes 20, 40 and 60 observations from a normal population with mean 0 and variance 1, respectively. Figure 5 reports the P-values using the original data and using the reduced data where the null hypothesis is true. This figure shows that the P-values using the reduced data are approximately greater than 0.4, which means our approach significantly reduces the type I error (the significance level is 0.05). To compare the P-values for a one-sample t-test using the original data and using the reduced data where the alternative hypothesis $H_1$ is true, we generate 1000 data sets in which each data set includes 20, 40 and 60 observations from a normal population with mean 0.5 and variance 1, respectively. Figure 6 reports the P-values using the original data and using the reduced data where the alternative hypothesis is true. This figure shows that the P-values for the t-test using the reduced data are approximately smaller than 0.2 for a sample size of 20 and are around 0 for a higher sample size, which means our approach might significantly improve the power, especially for big samples (e.g., sample size = 40 or 60), since power may be calculated as the proportion of P-values less than the significance level $\alpha = 0.05$ for data sets where the alternative hypothesis is true. This advantage in power can be confirmed numerically by the results in Table 2.
Next, the advantage of our two-stage approach can be shown numerically in simulation studies. To compare the type I error for a one-sample t-test using the original data and using the reduced data, we generate 1000 data sets where the null hypothesis $H_0$ is true and in which each data set includes 20, 40 and 60 observations from a normal population with mean 0 and variance 1, respectively. The type I error is calculated as the proportion of P-values less than the significance level $\alpha = 0.05$ among all 1000 data sets. To compare the power for a one-sample t-test using the original data and using the reduced data, we generate 1000 data sets where the alternative hypothesis $H_1$ is true and in which each data set comes from a normal population; the settings of population parameters and sample sizes are given in Table 2. The results for type I errors and powers are summarized in Table 2, which shows that our approach drastically reduces the type I error because it removes all the abnormal samples prior to performing a t-test, and that the powers for the t-test using the reduced data are larger than those using the original data. Now we generate 1000 data sets for each of two normal populations A and B with a common standard deviation, where the sample sizes from the two populations in each data set are the same; the settings of population parameters and sample sizes are given in Table 3. A similar conclusion is summarized in Table 3. The powers for the two-sample t-test using the reduced data are larger than those using the original data in most cases, except when the sample size is small (e.g., sample size is 40) or when the difference in population means is small (e.g., mean of A is 0.8 and mean of B is 0.6).
[...] in a one-sample case (a similar representation in two-sample case). In Equation (4), different values of [...] will lead to a different degree of reproducibility, which needs to be studied further. As we know, the type I and II errors cannot be simultaneously minimized if parameters are fixed, including bootstrap size. However, for a fixed type I error, we can select the optimal reproducibility parameter to minimize the type II error. Altogether, we discuss practical issues about variability reduction here; it is important to explore the methodology of our approach when studying properties of estimators using bootstrap simulations in real data.
There are some limitations of the proposed method, an 'out of box' solution for increasing reproducibility. First, the bootstrap size becomes a random variable; how the method compares with methods based on trimmed data (deleting outliers) for inference is another issue that can be studied further. Also, when the distributional assumptions (normal/Gaussian distribution) are not valid, the method may still work with symmetric and unimodal distributions such as the Student t-distribution. However, generalizing to a very skewed and/or multimodal distribution requires additional work.
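The type I error and power estimates used in the simulation study, each taken as the proportion of P-values below the significance level 0.05 across 1000 data sets, can be sketched as follows for the one-sample t-test with 20, 40 and 60 observations. The sketch covers the original (unreduced) data only; the reduction step is not applied here. The function names, the seed and the Simpson-rule evaluation of the t distribution are assumptions made for this illustration.

```python
import math
import random


def t_two_sided_p(t_stat, df, steps=200):
    """Two-sided P-value of a Student t statistic (Simpson's rule, even steps)."""
    x = abs(t_stat)
    if x == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def one_sample_p(xs, mu0=0.0):
    """Two-sided one-sample t-test of H0: population mean equals mu0."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((v - m) ** 2 for v in xs) / (n - 1)
    t_stat = (m - mu0) / math.sqrt(var / n)
    return t_two_sided_p(t_stat, n - 1)


def rejection_rate(true_mean, n, reps=1000, alpha=0.05, seed=1):
    """Proportion of simulated data sets with P-value below alpha.

    With true_mean = 0 this estimates the type I error;
    with true_mean != 0 it estimates the power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        data = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if one_sample_p(data) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for n in (20, 40, 60):
        type_one = rejection_rate(0.0, n)
        power = rejection_rate(0.5, n)
        print(f"n = {n}: type I error = {type_one:.3f}, power = {power:.3f}")
```

Running the same two functions on a screened set of replicates would give the corresponding figures for reduced data.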

Discussion
To improve the reproducibility of the results in replicate studies, we propose a two-stage approach to reduce variability. Our simulation studies reveal that the variability of statistical measures such as the P-value, point estimator and confidence interval is considerably reduced. Compared with one- and two-sample t-tests using the original data, our approach markedly improves the power while reducing the potential type I error. Also, the results from statistical inference are interest[...]

Figure 1: The scatter plot of the P-values, estimated differences in means and ranges of confidence intervals against replicates.


and let Y and S be the sample mean and standard deviation, respectively, i.e.,

Figure 2: The scatter plot of sample values against replicates, where blue marks the sample mean.

Figure 3: Frequencies of the P-values, estimated differences and ranges for the sample sizes of 20, 50 and 80.

[...] $S$ [...] be the sample mean and standard deviation from the $j$-th population in the $b$-th bootstrap sample, respectively, $b = 1, \ldots$ [...]; $j = 1, 2$; [...]

[...] $N(\mu, \sigma^2)$ in the $b$-th bootstrap sample, $\bar{Y}^{*b}$ and $S^{*b}$ be the sample mean and standard deviation of the $b$-th bootstrap sample, respectively, $b = 1, \ldots$ [...] two-stage approach to reduce variability is described as follows. If the $b$-th [...]

Thus, in the first stage,

Figure 4: Frequencies of the P-values, sample means and ranges using the original data and using the reduced data.
entire replicates and the reduced replicates.

Figure 5: P-values across replicates in the original and reduced null data where the null hypothesis is true.

Figure 6: P-values across replicates in the original and reduced null data where the alternative hypothesis is true.

Table 1: Summary of simulation characteristics in Halsey, et al.