Biostatistics Tutorial Full course for Beginners to Experts

Module 2 – Describing Data: Shape

require("moments")

Loading required package: moments

This module covers the shape of a distribution.

Learning Objectives

Outline

Statistical Notation

N: the number of observations in a population; n: the number of observations in a sample.
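As a small illustration of the notation, the sketch below draws a sample from a made-up population. The population values, the seed, and both sizes are assumptions chosen only for the example.

```r
# Hypothetical population: 1,000 simulated heights in cm (not real data)
set.seed(1)
population <- rnorm(1000, mean = 170, sd = 10)
N <- length(population)          # N: size of the population

# Draw a simple random sample without replacement
smp <- sample(population, size = 50)
n <- length(smp)                 # n: size of the sample

print(c(N = N, n = n))
```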

Organizing Data

Frequency Distribution:

An organized table of the number of scores located in each category on the measurement scale; that is, the number of times each possible value of a variable occurs in a dataset.
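A minimal sketch of a frequency distribution in base R, using made-up scores rather than any dataset from this tutorial:

```r
# Hypothetical scores on a 1-5 scale (made up for illustration)
scores <- c(3, 1, 4, 3, 5, 3, 2, 4, 4, 3)

# table() counts how many times each distinct value occurs
freq <- table(scores)
print(freq)

# The same counts as a two-column data frame: value and frequency
freq_df <- as.data.frame(freq, responseName = "frequency")
print(freq_df)
```

Each column of the printed table is one category of the measurement scale, and the number under it is how often that score occurs.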

# Load the PostgreSQL driver and create a connection to the postgres database
# Assumption: the "PostgreSQL" driver comes from the RPostgreSQL package, which also attaches DBI
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "smokyants", host = "localhost", port = 5432, user = "ruser", password = "ruser")
sql_string <- "SELECT * FROM smokyants"
smokyants <- data.frame(dbGetQuery(con, sql_string))
dbDisconnect(con)

[1] TRUE

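# Order counts and elevations together so each row sum stays paired with its own site's elevation
ord <- order(smokyants$elevation_m)
doesit <- data.frame(rowSums(smokyants[5:42])[ord], smokyants$elevation_m[ord])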
plot(doesit[,1]~doesit[,2], ylab="RowSum", xlab="Elevation", axes=FALSE, main = "Counts at Elevation")
axis(1, sort(smokyants$elevation_m), las=2)

axis(2, sort(rowSums(smokyants[5:42])), las=2)

FileDirectory <- "/home/daiten/Programming/R/Projects/Biostatistics Tutorial/Media/"
# png() writes PNG data, so the file gets a .png extension
png(paste(FileDirectory, "AntFrequency.png", sep = ""), width = 1000, height = 500)

plot(doesit[,1]~doesit[,2], ylab="RowSum", xlab="Elevation", axes=FALSE)
axis(1, sort(smokyants$elevation_m), las=2)
axis(2, sort(rowSums(smokyants[5:42])), las=2)

dev.off()

png 
  2

Cumulative Frequency Distribution:

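A cumulative frequency distribution is a running total: for each value, it gives the number of observations at or below that value. The code that follows plots a running total of dissolved nitrate results (mg/L) for Alameda County over the sampling period.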

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "cali_water", host = "localhost", port = 5432,user = "ruser", password = "ruser")
# Order by date so the running total follows the sampling sequence
sql_string <- "SELECT * FROM lab_results WHERE (units = 'mg/L' AND county_name = 'Alameda' AND parameter = 'Dissolved Nitrate') ORDER BY sample_date"
Lab_results <- data.frame(dbGetQuery(con, sql_string))

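# Remove missing results before accumulating; a single NA would turn every later running total into NA
# The x-axis is the sample index, so the default axis is kept and the date range goes in the label
plot(cumsum(na.omit(Lab_results$result)), type="l", 
     main = "Cumulative Dissolved Nitrate, Alameda", 
     ylab = "mg/L",
     xlab = paste("Number of samples from ", range(Lab_results$sample_date)[1], 
                  " to ", range(Lab_results$sample_date)[2]))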

dbDisconnect(con)

[1] TRUE
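The plot above accumulates measured concentrations, so it is a running total of values rather than of counts. A cumulative frequency distribution in the strict sense accumulates frequencies. A minimal sketch with made-up scores (not the nitrate data):

```r
# Hypothetical scores on a 1-5 scale (made up for illustration)
scores <- c(3, 1, 4, 3, 5, 3, 2, 4, 4, 3)

freq     <- table(scores)               # frequency of each value
cum_freq <- cumsum(freq)                # number of scores at or below each value
cum_rel  <- cum_freq / length(scores)   # the same as a proportion

print(rbind(frequency = freq, cumulative = cum_freq, cumulative_proportion = cum_rel))

# ecdf() returns the cumulative proportion as a step function
Fn <- ecdf(scores)
print(Fn(3))   # proportion of scores at or below 3
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value was dropped.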

Histogram/Bar Graph

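A histogram is a graphical representation of a grouped frequency distribution with continuous classes; it gives an approximate picture of the distribution of numerical data. A bar graph plays the same role for categorical data, with one separated bar per category.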
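To make the grouping explicit, the sketch below bins simulated values into classes and reads the class counts back from hist(). The data, the seed, and the class width of 10 are assumptions for illustration.

```r
set.seed(42)
heights <- rnorm(200, mean = 70, sd = 10)   # simulated values, not real measurements

# Class boundaries every 10 units, wide enough to cover every value
breaks <- seq(floor(min(heights) / 10) * 10, ceiling(max(heights) / 10) * 10, by = 10)

# plot = FALSE returns the grouped frequency distribution without drawing it
h <- hist(heights, breaks = breaks, plot = FALSE)
print(data.frame(lower = head(h$breaks, -1), upper = tail(h$breaks, -1), count = h$counts))

# Draw the histogram from the same classes
hist(heights, breaks = breaks, main = "Simulated values", xlab = "Value")
```

Changing the class width changes the picture but not the data, which is why a histogram is only an approximate representation.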

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "covertype", host = "localhost", port = 5432,user = "ruser", password = "ruser")
sql_string <- paste("SELECT * FROM covertype", sep="")
CoverType <- data.frame(dbGetQuery(con, sql_string))
dbDisconnect(con)

[1] TRUE

CoverTypes <- c("1. Spruce/Fir", "2. Lodgepole Pine", "3. Ponderosa Pine", "4. Cottonwood/Willow", "5. Aspen", "6. Douglas-fir", "7. Krummholz")
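# cover_type is a category code from 1 to 7, so centre one bin on each code
hist(CoverType$cover_type, breaks = seq(0.5, 7.5, by = 1), main = "Cover Type Representation", xlab = "Tree Type")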
legend("topright", CoverTypes)

Pie Chart

A circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is proportional to the quantity it represents.
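The proportionality between a quantity and its slice can be worked out directly. A small sketch with made-up category counts:

```r
# Hypothetical category counts (made up for illustration)
counts <- c(A = 10, B = 30, C = 60)

proportions <- counts / sum(counts)   # share of the whole for each category
angles      <- proportions * 360      # central angle of each slice, in degrees

print(round(cbind(proportion = proportions, angle_degrees = angles), 1))

# pie() performs the same scaling internally when given raw counts
pie(counts, main = "Hypothetical categories")
```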

CoverTypes <- c("1. Spruce/Fir", "2. Lodgepole Pine", "3. Ponderosa Pine", "4. Cottonwood/Willow", "5. Aspen", "6. Douglas-fir", "7. Krummholz")
pie(table(CoverType$cover_type), labels = CoverTypes, main = "Cover Type Representation")

# Generic pie chart example:
# slices <- c(10, 12, 4, 16, 8)
# lbls <- c("US", "UK", "Australia", "Germany", "France")
# pie(slices, labels = lbls, main = "Pie Chart of Countries")

Shape of a Distribution

Normal Distribution

hist(rnorm(1000, mean = 70, sd = 10))


x= rbeta(10000,5,5)
hist(x, main="Symmetrical", freq=FALSE)
lines(density(x), col='red', lwd=3)

abline(v = c(mean(x),median(x)),  col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))

Skewed Distribution

# Source: https://stackoverflow.com/questions/28099590/create-sample-vector-data-in-r-with-a-skewed-distribution-with-limited-range
# The beta distribution takes values from 0 to 1. To get values from 0 to 5, for instance, multiply them by 5.
# Its two shape parameters control the skew, so it can produce symmetric, left-skewed, and right-skewed samples.
# In the plots below, the green vertical line marks the mean and the red vertical line marks the median.

x= rbeta(10000,5,2)
hist(x, main="Negative or Left Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)

abline(v = c(mean(x),median(x)),  col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))


x= rbeta(10000,2,5)
hist(x, main="Positive or Right Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)

abline(v = c(mean(x),median(x)),  col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))


set.seed(3)
hist(sample(1:10, size = 100, replace = TRUE, prob = 10:1))
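The ordering of mean and median shown by the vertical lines in the beta plots can also be checked numerically. A sketch using the same shape parameters; the seed is an arbitrary assumption added for reproducibility:

```r
set.seed(1)
left_skewed  <- rbeta(10000, 5, 2)   # long tail to the left
symmetric    <- rbeta(10000, 5, 5)   # no long tail on either side
right_skewed <- rbeta(10000, 2, 5)   # long tail to the right

# The tail pulls the mean toward it, so mean - median is negative for
# left skew, near zero for symmetry, and positive for right skew
summary_table <- data.frame(
  sample = c("left_skewed", "symmetric", "right_skewed"),
  mean   = c(mean(left_skewed), mean(symmetric), mean(right_skewed)),
  median = c(median(left_skewed), median(symmetric), median(right_skewed))
)
summary_table$mean_minus_median <- summary_table$mean - summary_table$median
print(summary_table)
```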

Kurtosis

Kurtosis is a statistical measure of whether a distribution is heavy-tailed or light-tailed relative to a normal distribution; it describes the degree to which scores cluster in the tails or the peak of a frequency distribution. The peak is the tallest part of the distribution, and the tails are the ends of the distribution. There are three types of kurtosis: mesokurtic, leptokurtic, and platykurtic.

Mesokurtic: Moderate breadth and a medium-height peak; the normal distribution is the reference case.
Leptokurtic: More values in the distribution tails and more values close to the mean (i.e. sharply peaked with heavy tails).
Platykurtic: Fewer values in the tails and fewer values close to the mean (i.e. the curve has a flat peak and has more dispersed scores with lighter tails).

The kurtosis of a normal distribution is 3.

If a given distribution has a kurtosis less than 3, it is said to be platykurtic, which means it tends to produce fewer and less extreme outliers than the normal distribution.
If a given distribution has a kurtosis greater than 3, it is said to be leptokurtic, which means it tends to produce more outliers than the normal distribution.
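The threshold of 3 can be illustrated by simulation. The sketch below is an assumption-laden example rather than part of the tutorial's data: it defines a hypothetical helper, pearson_kurtosis(), in base R and applies it to three simulated samples whose theoretical kurtosis values are 1.8, 3, and 6.

```r
# Hypothetical helper: fourth central moment divided by the squared second central moment
pearson_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(1)
n_draws <- 100000
flat   <- runif(n_draws)                 # uniform: theoretical kurtosis 1.8 (platykurtic)
normal <- rnorm(n_draws)                 # normal: theoretical kurtosis 3 (mesokurtic)
heavy  <- rexp(n_draws) - rexp(n_draws)  # Laplace, as a difference of two exponentials:
                                         # theoretical kurtosis 6 (leptokurtic)

print(round(c(uniform = pearson_kurtosis(flat),
              normal  = pearson_kurtosis(normal),
              laplace = pearson_kurtosis(heavy)), 2))
```

The sample values vary from run to run without the seed, but they should fall below, near, and above 3 respectively.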

data = c(88, 95, 92, 97, 96, 97, 94, 86, 91, 95, 97, 88, 85, 76, 68)
hist(data)


#calculate skewness
print(paste("Skewness: ", skewness(data)))

[1] "Skewness:  -1.39177658345157"

#calculate kurtosis
print(paste("Kurtosis: ", kurtosis(data)))

[1] "Kurtosis:  4.17786452179821"
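The two printed values can be reproduced from central moments in base R. This sketch assumes the moment-based definitions: the third central moment over the second raised to the power 1.5 for skewness, and the fourth central moment over the square of the second for kurtosis. The intermediate names m2, m3, and m4 are introduced here for illustration.

```r
data <- c(88, 95, 92, 97, 96, 97, 94, 86, 91, 95, 97, 88, 85, 76, 68)

d  <- data - mean(data)   # deviations from the mean
m2 <- mean(d^2)           # second central moment (divides by n, not n - 1)
m3 <- mean(d^3)           # third central moment
m4 <- mean(d^4)           # fourth central moment

skew <- m3 / m2^1.5
kurt <- m4 / m2^2

print(c(skewness = skew, kurtosis = kurt))
```

The negative skewness reflects the two low scores (76 and 68) stretching the left tail, and a kurtosis above 3 places this small dataset on the leptokurtic side.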

Rank & Percentile

Percentiles

A percentile is a score below which a given percentage (“k”) of the scores in its frequency distribution falls (exclusive definition), or a score at or below which that percentage falls (inclusive definition). For example, 90% of the scores fall below the 90th percentile.
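The difference between the exclusive and inclusive definitions shows up whenever a score is tied with others. A short sketch using the 15 scores from the kurtosis example; the choice of 95 as the score to rank is arbitrary:

```r
data <- c(88, 95, 92, 97, 96, 97, 94, 86, 91, 95, 97, 88, 85, 76, 68)
score <- 95

# Exclusive: percentage of scores strictly below the score
exclusive_rank <- 100 * mean(data < score)

# Inclusive: percentage of scores at or below the score
inclusive_rank <- 100 * mean(data <= score)

print(c(exclusive = exclusive_rank, inclusive = inclusive_rank))
```

Nine of the 15 scores are below 95 and two equal it, so the exclusive rank is 60 and the inclusive rank is about 73.3.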

Quantiles

Cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. Common quantiles have special names, such as quartiles (four groups), deciles (ten groups), and percentiles (100 groups). The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).
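In R, quantile() returns these cut points for a sample. The sketch below reuses the 15 scores from the kurtosis example and relies on the function's default interpolation method (type = 7); other type values can give slightly different cut points for small samples.

```r
data <- c(88, 95, 92, 97, 96, 97, 94, 86, 91, 95, 97, 88, 85, 76, 68)

# Quartiles: the 25th, 50th, and 75th percentiles
q <- quantile(data, probs = c(0.25, 0.50, 0.75))
print(q)

# Deciles: nine cut points that split the data into ten groups
deciles <- quantile(data, probs = seq(0.1, 0.9, by = 0.1))
print(deciles)

# The second quartile is the median
print(median(data))
```

For these scores the quartiles are 87, 92, and 95.5, so the middle half of the data spans 8.5 points.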

Key Terms

github: https://github.com/cmenefee/BiostatisticsTutorial

Kinoko_Mori
