Iris flower data set

The Iris flower data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”. Four features were measured on each sample: the length and width of the sepals and of the petals, in centimeters.

The three Iris species are:

Iris setosa

Iris virginica

Iris versicolor

Data set

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). In R it is available as the built-in object iris, a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four are numeric variables and the last one is a factor with three levels.
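The structure described above can be checked directly in R. This is a minimal sketch that assumes only base R, where iris ships with the datasets package; the name species_counts is introduced here for illustration.

```r
# iris is part of the "datasets" package that loads with base R.
data(iris)

dim(iris)   # number of rows, then number of columns
str(iris)   # type of each column: four numeric, one factor

# Count the samples belonging to each species.
species_counts <- table(iris$Species)
species_counts
```

If the description holds, dim reports 150 rows and 5 columns, and every entry of species_counts is 50.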

Below are some basic statistics on each of the columns of the data frame using the summary function:

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Data visualization

To visualize the data graphically and improve our understanding of the iris data set we can create the following plots:

In the first plot, we use the ggplot function from ggplot2 to construct a histogram for each of the four numerical variables. A histogram shows the distribution of a variable, which helps us see its center, spread, and shape. Because ggplot expects one row per measurement, the data frame is first reshaped to long format with melt from reshape2.

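library(ggplot2)   # plotting
library(reshape2)  # melt() for wide-to-long reshaping
# Stack the four numeric columns into long format: one row per measurement
iris2 <- melt(iris)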

## Using Species as id variables

p <- ggplot(iris2, aes(x = value, fill = Species))
p <- p + geom_histogram(binwidth = 0.2, alpha = 0.5)  # bins of 0.2 cm
p <- p + facet_grid(. ~ variable, scales = "free")    # one panel per variable
p <- p + scale_fill_manual(values = c("#00BFFF", "#FFC125", "#FF7256"))
# The panels show widths as well as lengths, so use a neutral axis label
p <- p + xlab("Measurement (cm)")
p <- p + ylab("Count")
p <- p + theme_bw()
p

We can already see that there are some morphological differences between the three Iris species, especially in the petal measurements.
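The histograms depend on the data being in long format, with one row per measurement. As a rough illustration of that layout, the sketch below rebuilds an equivalent table in base R only. It is not how reshape2 works internally, and the names measure_cols and iris_long are hypothetical, chosen for this example.

```r
# Names of the four numeric measurement columns.
measure_cols <- names(iris)[sapply(iris, is.numeric)]

# Build the long table: the columns are stacked one after another,
# so Species is repeated once per measurement column.
iris_long <- data.frame(
  Species  = rep(iris$Species, times = length(measure_cols)),
  variable = factor(rep(measure_cols, each = nrow(iris)),
                    levels = measure_cols),
  value    = unlist(iris[measure_cols], use.names = FALSE)
)

head(iris_long)   # columns: Species, variable, value
nrow(iris_long)   # 150 flowers times 4 measurements
```

Each flower now contributes four rows, one per variable, which is what lets facet_grid draw a separate panel for each measurement.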

Another figure we can make is a boxplot of each numerical variable for each of the three Iris species. A boxplot summarizes numerical data through its quartiles, so it shows the differences between the species for the four variables more clearly:

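# Reuses the long-format data frame iris2 created above
p <- ggplot(iris2, aes(x = Species, y = value, fill = Species))
p <- p + geom_boxplot()
p <- p + scale_fill_manual(values = c("#00BFFF", "#FFC125", "#FF7256"))
p <- p + facet_grid(. ~ variable)   # one panel per variable, shared y axis
p <- p + xlab("")
# The panels show widths as well as lengths, so use a neutral axis label
p <- p + ylab("Measurement (cm)")
p <- p + theme_bw()
p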

I. virginica tends to have the highest values for most variables, I. setosa the smallest, and I. versicolor intermediate ones. Sepal.Width is the exception: there I. setosa has the largest values. Furthermore, the petal length and petal width of I. setosa vary over a much narrower range than in the other two species.
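The patterns read off the boxplots can be checked numerically. The sketch below uses base R's aggregate to compute, per species, the median (the center line of each box) and the interquartile range (the height of each box). The names med and iqr are introduced here for illustration.

```r
# Median of every measurement within each species.
med <- aggregate(. ~ Species, data = iris, FUN = median)
med

# Interquartile range (3rd quartile minus 1st quartile) within each species.
iqr <- aggregate(. ~ Species, data = iris, FUN = IQR)
iqr
```

Comparing the rows of med shows which species ranks highest for each variable, and the rows of iqr show how tightly each species is distributed.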

Homework first session

Diana Bonilla
