• IRIS DATASET
    • About the dataset
    • Variables
    • Summary and Visualization

IRIS DATASET

About the dataset

The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems”. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.

This famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are:
  • Iris setosa
  • Iris versicolor
  • Iris virginica

Variables

Five variables are included in Iris:

  • A factor with 3 levels indicating the species (I. setosa, I. versicolor, I. virginica), named Species
  • The length of the sepals in centimeters, named Sepal.Length
  • The width of the sepals in centimeters, named Sepal.Width
  • The length of the petals in centimeters, named Petal.Length
  • The width of the petals in centimeters, named Petal.Width
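
The structure listed above can be checked directly in an R session. This is a minimal sketch that assumes the built-in `iris` data frame from base R's datasets package, which is what the rest of this page uses:

```r
# Load the built-in data set (already available in a default R session)
data(iris)

dim(iris)             # 150 rows (flowers) and 5 columns (variables)
sapply(iris, class)   # four numeric measurements and one factor
levels(iris$Species)  # "setosa" "versicolor" "virginica"
table(iris$Species)   # 50 flowers per species
```

Note that in the data frame itself Species is the last column, after the four measurements.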

Summary and Visualization

The summary below gives basic statistics for each variable and shows that each species has 50 measurements.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The next plot shows the pairwise relationships between the four measurements. Petal length and petal width are strongly correlated, as are sepal length and petal length.

pairs(iris[1:4], main= "Iris Data", pch=19)
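
The visual impression from the scatterplot matrix can be backed up with numbers. The sketch below computes Pearson correlations for the same four columns. The variable name `cor_mat` is only illustrative and not part of the original analysis:

```r
# Pearson correlation matrix of the four numeric measurements
cor_mat <- cor(iris[1:4])
round(cor_mat, 2)

# The two pairs discussed in the text
cor_mat["Sepal.Length", "Petal.Length"]  # about 0.87
cor_mat["Petal.Length", "Petal.Width"]   # about 0.96
```

These figures pool all three species, so part of the correlation reflects differences between species rather than within them.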

Finally, the following box plots show the distribution of each variable by species. I. virginica has the largest petals and sepals on every measurement except sepal width, which is largest in I. setosa.

# Arrange the four box plots in a 2 x 2 grid
par(mfrow = c(2, 2))
# One fill color per species, in factor-level order (setosa, versicolor, virginica)
species_cols <- c("slateblue2", "orange", "darkgreen")
boxplot(Sepal.Length ~ Species, data = iris, xlab = "Species", ylab = "Sepal length", col = species_cols)
boxplot(Sepal.Width ~ Species, data = iris, xlab = "Species", ylab = "Sepal width", col = species_cols)
boxplot(Petal.Length ~ Species, data = iris, xlab = "Species", ylab = "Petal length", col = species_cols)
boxplot(Petal.Width ~ Species, data = iris, xlab = "Species", ylab = "Petal width", col = species_cols)
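
The comparison between species can also be read from per-species means. This is a rough sketch using base R's `aggregate`. The names `species_means` and `largest` are illustrative, and means are only one possible summary (the box plots show medians):

```r
# Mean of every measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# For each measurement, the species with the largest mean
largest <- sapply(species_means[-1], function(x) {
  as.character(species_means$Species[which.max(x)])
})
largest
```

In this summary `largest` names virginica for every measurement except Sepal.Width, where it names setosa.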

That is all for now! Thanks for your attention!