The Iris flower data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris versicolor and Iris virginica). In R, the built-in iris object is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four are numeric variables and the last one is a factor.
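As a quick check of the structure described above, the following sketch inspects the data frame using only base R. The variable names iris_dim, iris_classes and iris_counts are illustrative and do not come from the original text.

```r
# iris ships with base R (package "datasets"), so no library() call is needed.
iris_dim     <- dim(iris)             # c(rows, columns)
iris_classes <- sapply(iris, class)   # class of each column
iris_counts  <- table(iris$Species)   # number of samples per species

iris_dim
iris_classes
iris_counts
str(iris)                             # compact overview of the whole data frame
```

If the description above is right, this should report 150 rows, 5 columns, four numeric columns, one factor, and 50 samples per species.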
Below are some basic statistics on each of the columns of the data frame using the summary function:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
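The summary above pools all three species. As a complementary sketch, the same kind of statistic can be computed per species with the base R aggregate function. The name species_means is illustrative, and the mean is just one possible choice of statistic.

```r
# Mean of every numeric column, computed separately for each species.
# The formula ". ~ Species" means "all other columns, grouped by Species".
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

Swapping FUN = mean for FUN = median, or another summary function, gives the corresponding per-species table.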
To visualize the data and improve our understanding of the iris data set, we can create the following plots.
In the first, we use the ggplot function from ggplot2 to construct a histogram for each of the four numerical variables. Histograms represent the distribution of the data, helping us to see its center, spread, and shape.
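library(ggplot2)
library(reshape2)
# Reshape to long format: one row per (Species, variable, value) measurement
iris2 <- melt(iris)
## Using Species as id variables
# Histogram of the measured values, filled by species (bars stack by default)
p <- ggplot(iris2, aes(x=value, fill = Species))
p <- p + geom_histogram(binwidth = 0.2, alpha=.5)
# One panel per variable, each with its own axis range
p <- p + facet_grid(. ~ variable, scales = "free")
p <- p + scale_fill_manual(values=c("#00BFFF", "#FFC125", "#FF7256"))
# The panels mix lengths and widths, so label the axis generically
p <- p + xlab("Measurement (cm)")
p <- p + ylab("Count")
p <- p + theme_bw()
p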
We can already see that there are some morphological differences between the three Iris species, especially in the petal measurements.
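The plotting code relies on melt to turn the wide iris data frame into a long one with the columns Species, variable and value. To make that reshaping step explicit, here is a minimal base R sketch that builds a data frame of the same shape. The name iris_long is illustrative, and this is an approximation of what melt returns for this data set, not reshape2's actual implementation.

```r
# stack() concatenates the four numeric columns into one vector ("values")
# and records the source column of each value in a factor ("ind").
stacked <- stack(iris[, 1:4])

# Repeat the species labels once per stacked column so they stay aligned.
iris_long <- data.frame(
  Species  = rep(iris$Species, times = 4),
  variable = stacked$ind,
  value    = stacked$values
)

dim(iris_long)    # 4 measurements x 150 flowers, 3 columns
head(iris_long)
```

Each flower now contributes four rows, one per measurement, which is what lets facet_grid split the plot into one panel per variable.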
Another figure we can make is a boxplot of each numerical variable as a function of the three Iris species. A boxplot represents numerical data graphically through their quartiles. In this one, we can observe more clearly the differences between the species for the four variables:
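# iris2 is the long-format data frame created with melt() above
p <- ggplot(iris2, aes(x=Species, y=value, fill = Species))
p <- p + geom_boxplot()
p <- p + scale_fill_manual(values=c("#00BFFF", "#FFC125", "#FF7256"))
# One panel per variable, all sharing the same y axis
p <- p + facet_grid(. ~ variable)
p <- p + xlab("")
# The panels mix lengths and widths, so label the axis generically
p <- p + ylab("Measurement (cm)")
p <- p + theme_bw()
p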
I. virginica tends to have the highest values for most variables, I. setosa the lowest, and I. versicolor intermediate values. Sepal.Width is the exception: there I. setosa has the largest values. Furthermore, the I. setosa measurements appear to be less spread out than those of the other two species, most clearly for Petal.Length and Petal.Width.
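The visual impression of spread can be checked numerically. The sketch below computes the standard deviation and interquartile range of each variable per species using base R. The names species_sd and species_iqr are illustrative, and these two statistics are only one reasonable way to quantify how "restricted" a distribution is.

```r
# Standard deviation of each measurement within each species.
species_sd <- aggregate(. ~ Species, data = iris, FUN = sd)

# Interquartile range (the height of each box in the boxplot).
species_iqr <- aggregate(. ~ Species, data = iris, FUN = IQR)

species_sd
species_iqr
```

Comparing the rows of these tables column by column shows for which variables a species is tightly clustered and for which it is not.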