STATISTICS IN R
#To measure the probability of an outcome, one must know the number of desired outcomes and the total number of possible outcomes. For example, a six-sided die has 6 possible outcomes. If you roll it once, the probability of getting any particular number is 1/6, or about 16.67%. This is what is called a discrete uniform probability.
#punif() = a probability function that measures the probability of falling at or below a set value using a continuous uniform distribution. For example, to find the probability of waiting seven minutes or less (<=) for a bus that arrives every 0 to 12 minutes: punif(7, min = 0, max = 12). For the probability of waiting between four and seven minutes: punif(7, min = 0, max = 12) - punif(4, min = 0, max = 12). For the probability of waiting seven minutes or more (>=; the right tail of the distribution): punif(7, min = 0, max = 12, lower.tail = FALSE).
#runif() = one of several functions (dunif(), punif(), qunif()) that work with the uniform distribution. runif() generates random numbers within a predetermined range. For example, runif(1000, min = 0, max = 30) generates 1,000 random numbers between 0 and 30.
#rbinom() = generates a vector of binomially distributed random variables given a vector length (n), number of trials (size), and probability of success on each trial (prob). Ex. rbinom(n, size, prob)
#pbinom() = put simply, pbinom returns the area to the left of a given value q in the binomial distribution. If you're interested in the area to the right of a given value q, simply add the argument lower.tail = FALSE. Ex. pbinom(q, size, prob, lower.tail = FALSE)
#dbinom() = put simply, dbinom finds the probability of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob). Ex. dbinom(x, size, prob)
#qbinom() = put simply, qbinom finds the pth quantile of the binomial distribution. Ex. qbinom(p, size, prob)
#expected value = n * p (number of trials * probability of success) is the expected value of the binomial distribution, i.e., the expected number of outcomes of interest.
#pnorm() = calculates the proportion of the area under the normal curve to the left of the value of interest.
#qnorm() = the inverse of pnorm(): given a percentile (cumulative probability), it returns the corresponding value of the distribution.
#rnorm() = generates a given number of random observations from a normal distribution with a specified mean and SD. A short sketch pulling these functions together follows.
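#Example = a minimal sketch tying the functions above together; the bus range of 0-12 minutes, the 10-trial binomial, and the N(100, 15) normal are illustrative assumptions.
punif(7, min = 0, max = 12)                               # P(wait <= 7 minutes)
punif(7, min = 0, max = 12) - punif(4, min = 0, max = 12) # P(4 <= wait <= 7)
punif(7, min = 0, max = 12, lower.tail = FALSE)           # P(wait >= 7)
dbinom(6, size = 10, prob = 0.5)   # P(exactly 6 successes in 10 trials)
pbinom(6, size = 10, prob = 0.5)   # P(6 or fewer successes)
qbinom(0.9, size = 10, prob = 0.5) # number of successes at the 90th percentile
10 * 0.5                           # expected value: n * p
pnorm(110, mean = 100, sd = 15)    # P(X <= 110) for X ~ N(100, 15)
qnorm(0.975, mean = 100, sd = 15)  # value at the 97.5th percentile
rnorm(5, mean = 100, sd = 15)      # 5 random draws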
#dpois() = finds the probability that a certain number of successes occur based on an average rate of success, using the following syntax: dpois(x, lambda)
#ppois() = finds the probability that a certain number of successes or fewer occur based on an average rate of success, using the following syntax: ppois(q, lambda)
#qpois() = finds the number of successes that corresponds to a certain percentile based on an average rate of success, using the following syntax: qpois(p, lambda)
#rpois() = generates a list of random variables that follow a Poisson distribution with a certain average rate of success, using the following syntax: rpois(n, lambda)
#dexp() = the probability density function (PDF) of the exponential distribution. Ex. dexp(x, rate = 1, log = FALSE), where rate is the lambda (rate) parameter
#pexp() = the cumulative distribution function (CDF) of the exponential distribution. Ex. pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)
#qexp() = the quantile function of the exponential distribution. Ex. qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)
#rexp() = generates a vector of n random observations from an exponential distribution. Ex. rexp(n, rate = 1)
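#Example = a minimal sketch; the rate of 3 arrivals per hour is an assumed value.
dpois(2, lambda = 3)    # P(exactly 2 arrivals in an hour)
ppois(2, lambda = 3)    # P(2 or fewer arrivals)
qpois(0.9, lambda = 3)  # arrival count at the 90th percentile
rpois(10, lambda = 3)   # 10 random Poisson draws
pexp(0.5, rate = 3)     # P(waiting 0.5 hours or less between arrivals)
rexp(10, rate = 3)      # 10 random exponential waiting times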
GGPLOT SERIES
#factor() = can be used within ggplot's aes() function to designate a variable as a factor. Ex. ggplot(df, aes(factor(var1), var2)) + geom_point()
#an alpha argument can be used within the geom_point() function to specify the transparency of the points (alpha ranges from 0, fully transparent, to 1, fully opaque)
#the aes() function can be used within the geom_point() function to map aesthetics to categories. Ex. geom_point(aes(color = variable)). This is only necessary when not all layers should inherit the aesthetics.
#typical visible aesthetics: aes(x, y, fill, color, size) and/or aes(alpha, linetype, labels, shape)
#fill changes the inside of a geom while color changes the outline. However, geom_point() is an exception, as color is used instead.
NOTE: Notice that mapping a categorical variable onto fill doesn't change the colors, although a legend is generated! This is because the default shape for points only has a color attribute and not a fill attribute! Use fill when you have another shape (such as a bar), or when using a point that has both a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.
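#Example = a minimal sketch of the fill vs. color distinction, using the mtcars data bundled with R; the variable choices are illustrative.
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(fill = factor(cyl)),  # fill works here because shape 21 has a fill attribute
             shape = 21, size = 3, color = "black", alpha = 0.6)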
#geom_text() = plots the data onto the graph as text. For example, instead of points or shapes, each observation is displayed as a name or number.
#an aes() argument can be the result of a calculation. For example, aes(x, y, size = var1/var2).
#within geom_bar() we can set the position of the bars to stacked (on top of each other) or side by side by using the position argument. Ex. geom_bar(position = "dodge"), as sketched below.
#the alpha argument is used to show data with transparency, to avoid overplotting or hiding data behind other data.
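#Example = a minimal sketch using the mpg data bundled with ggplot2; the variable choices are illustrative.
library(ggplot2)
ggplot(mpg, aes(class, fill = drv)) +
  geom_bar(position = "dodge")  # bars side by side; the default, "stack", places them on top of each other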
SAMPLING IN R
#slice_sample() = takes a data frame and returns a random sample of n rows. Ex. df %>% select(var1, var2) %>% slice_sample(n = 10)
#sample() = the base R equivalent, which works on vectors rather than data frames (slice_sample() builds on it). Ex. sample(df$var1, size = 10)
#rowid_to_column() = adds a first column with a sequence of numbers to be used as an ID (from the tibble package). Ex. df %>% rowid_to_column(var = "ID")
#seq_len() = creates an integer sequence from 1 up to the input value. Ex. if sample1 = 5, then seq_len(sample1) returns the vector 1 2 3 4 5.
#We can create an "interval" by dividing the population size by the sample size using the integer division operator "%/%". This leaves out the remainder and creates the fixed interval at which rows are chosen in systematic sampling, as sketched below.
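#Example = a minimal sketch of systematic sampling; df and the sample size of 10 are assumptions, and this version starts at the first interval rather than at a random offset.
library(dplyr)
sample_size <- 10
interval <- nrow(df) %/% sample_size           # integer division drops the remainder
df %>% slice(seq_len(sample_size) * interval)  # keep every interval-th row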
#For proportional stratified sampling, the group_by() and ungroup() functions can be used to sample within each level of a variable. For example, taking a random sample within each country instead of sampling from the global population as a whole.
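#Example = a minimal sketch; df, the country column, and the 10% proportion are assumptions.
library(dplyr)
df %>%
  group_by(country) %>%
  slice_sample(prop = 0.1) %>%  # sample 10% of the rows within each country
  ungroup()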
#Weighted random sampling can be used to perform a random sample in which each row's chance of selection is proportional to its weight relative to the entire population.
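#Example = a minimal sketch; df and the weight_var column are assumptions.
library(dplyr)
df %>% slice_sample(n = 10, weight_by = weight_var)  # rows with larger weights are more likely to be selected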
HYPOTHESIS TESTING IN R
#conf_int = to calculate a confidence interval we can use the following code: conf_int <- bootstrap_distribution %>% summarize(lower = quantile(var1, 0.025), upper = quantile(var1, 0.975))
#z_score = to calculate a z-score we can use the following code: z_score <- (sample_mean - hypothesis_mean) / std_error
#p_value = to calculate a p-value we can use pnorm(z_score) for "less than" alternatives, pnorm(z_score, lower.tail = FALSE) for "greater than" alternatives, and p_value <- 2 * pnorm(-abs(z_score)) for "not equal" (two-tailed) alternatives. A worked sketch follows this list.
#pt() = the distribution function of the Student's t distribution. pt(t_score, df [degrees of freedom], lower.tail = TRUE, log.p = FALSE)
#t.test() = the easiest way to perform a t-test. t.test(x [can be variables in columns], y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
#pairwise.t.test() = calculates pairwise comparisons between group levels with corrections for multiple testing. Ex. pairwise.t.test(x, g, p.adjust.method = p.adjust.methods, pool.sd = !paired, paired = FALSE, alternative = c("two.sided", "less", "greater"), ...)
#prop_test() = a function to test the proportions of two samples, from the "infer" package (library(infer)). Ex. df %>% prop_test(proportions_var1 ~ category_var1, response = NULL, explanatory = NULL, p = NULL, order = NULL, alternative = "two-sided", conf_int = TRUE, conf_level = 0.95, success = NULL [element], correct = NULL [Yates' continuity correction, needed for very small samples], z = FALSE, ...)
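#Example = a minimal sketch of the z-score workflow; all numbers are assumed for illustration.
sample_mean <- 5.2
hypothesis_mean <- 5.0
std_error <- 0.1
z_score <- (sample_mean - hypothesis_mean) / std_error  # (5.2 - 5.0) / 0.1 = 2
p_value <- 2 * pnorm(-abs(z_score))                     # two-tailed p-value, about 0.0455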
#weighted.mean() = averages values weighted by the second argument given. Ex. weighted.mean(x, w)
#chisq_test() = the chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is a non-parametric test, meaning it does not make assumptions about the underlying distribution of the data. Ex. chisq_test(x, formula = var1 ~ var2 [order does not matter], response = NULL, explanatory = NULL)
#a goodness-of-fit test is a one-sample chi-square test. Ex. df %>% chisq_test(response = category_var1, p = proportions)
#The infer pipeline for hypothesis testing requires four steps to calculate the null distribution: specify, hypothesize, generate, and calculate. The workflow for the null distribution and the observed statistic is as follows (a fuller sketch appears below): null_distribution <- df %>% specify() %>% hypothesize() %>% generate() %>% calculate() ; observed_stat <- df %>% specify(category_var1 ~ category_var2, success = "[element]") %>% calculate() ; get_p_value(null_distribution, observed_stat, direction = "two-sided")
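#Example = a minimal sketch of the infer pipeline testing independence between two categorical variables, using the gss data bundled with infer; the 1000 reps and the chosen columns are illustrative.
library(infer)
null_distribution <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
observed_stat <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
get_p_value(null_distribution, observed_stat, direction = "two-sided")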
#specify(x, formula, response = NULL, explanatory = NULL, success = NULL)
#hypothesize(x, null, p = NULL, mu = NULL, med = NULL, sigma = NULL)
#generate(x, reps = 1, type = NULL, variables = !!response_expr(x), ...)
#calculate(x, stat = c("mean", "median", "sum", "sd", "prop", "count", "diff in means", "diff in medians", "diff in props", "Chisq", "F", "slope", "correlation", "t", "z", "ratio of props", "odds ratio", "ratio of means"), order = c("var1", "var2"), ...)
#wilcox.test(numerical_var1 ~ categorical_var2, data = df, alternative = "two.sided", correct = FALSE)
#kruskal.test(numerical_var1 ~ categorical_var2, data = df)
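#Example = a minimal sketch of the rank-based tests above, using the mtcars data bundled with R; the variable choices are illustrative.
wilcox.test(mpg ~ am, data = mtcars, alternative = "two.sided")  # two groups
kruskal.test(mpg ~ factor(cyl), data = mtcars)                   # three or more groups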