Example: Fisher Iris Data Set

The Fisher iris data set is frequently used as a sample statistical data set. This example reads the data set in CSV (comma-separated values) format.

The first few lines of the data set are as follows:

Species,Sepal Length,Sepal Width,Petal Length,Petal Width
 1.0, 5.1, 3.5, 1.4, .2
 1.0, 4.9, 3.0, 1.4, .2
 1.0, 4.7, 3.2, 1.3, .2
 1.0, 4.6, 3.1, 1.5, .2
 1.0, 5.0, 3.6, 1.4, .2
 1.0, 5.4, 3.9, 1.7, .4

The first line contains the column names, separated by commas. Each remaining line holds one observation as comma-separated double values.
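
The layout described above can be parsed with the standard library alone. The following is a minimal sketch and not part of the JMSL example: the class IrisCsvSketch and its helpers parseHeader and parseRow are hypothetical names, and the code assumes that fields never contain embedded commas or quotes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper class, for illustration only.
class IrisCsvSketch {
    /** Splits the header line on commas and trims each column name. */
    public static String[] parseHeader(String line) {
        String[] names = line.split(",");
        for (int i = 0; i < names.length; i++) {
            names[i] = names[i].trim();
        }
        return names;
    }

    /** Parses one observation line into double values. */
    public static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            // Double.parseDouble accepts values without a leading zero, such as ".2".
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Two observations copied from the sample lines shown above.
        String sample = "Species,Sepal Length,Sepal Width,Petal Length,Petal Width\n"
                + " 1.0, 5.1, 3.5, 1.4, .2\n"
                + " 1.0, 4.9, 3.0, 1.4, .2\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            String[] names = parseHeader(reader.readLine());
            String line;
            while ((line = reader.readLine()) != null) {
                double[] row = parseRow(line);
                // Print the third column next to its name.
                System.out.println(names[2] + " = " + row[2]);
            }
        }
    }
}
```

Splitting on a bare comma is adequate for this file because every field is a plain number or a simple name; a general CSV reader would also need to handle quoting.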

The class FlatFileEx1 extends com.imsl.io.FlatFile. The FlatFileEx1 constructor constructs a BufferedReader object and calls the com.imsl.io.FlatFile constructor. It then reads the line containing the column names. The column names are parsed and used to set the column names in com.imsl.io.FlatFile. Every column is also set to type Double.

The class FlatFileEx1 is used in the method main. The data set is assumed to be in a file called "FisherIris.csv" in the same location as the example class file, so getResourceAsStream can be used to open the file as a stream. A com.imsl.stat.Summary is created and used to compute statistics for the "Sepal Width" column.

import com.imsl.io.FlatFile;
import com.imsl.stat.Summary;
import java.io.*;
import java.sql.SQLException;
import java.util.StringTokenizer;


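public class FlatFileEx1 extends FlatFile {
   // Wraps the stream in a BufferedReader and configures the columns from the header line.
   public FlatFileEx1(InputStream is) throws IOException {
      super(new BufferedReader(new InputStreamReader(is)));
      // The first line holds the comma-separated column names.
      String line = readLine();
      StringTokenizer st = new StringTokenizer(line, ",");
      for (int j = 0;  st.hasMoreTokens();  j++) {
         // Name each column and declare its values to be of type Double.
         setColumnName(j+1, st.nextToken().trim());
         setColumnClass(j, Double.class);
      }
   }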

   public static void main(String[] args) throws SQLException, IOException {
      // FisherIris.csv is expected next to the compiled class file.
      InputStream is = FlatFileEx1.class.getResourceAsStream("FisherIris.csv");
      FlatFileEx1 iris = new FlatFileEx1(is);

      // Accumulate the "Sepal Width" value of every observation.
      Summary summary = new Summary();
      while (iris.next()) {
         summary.update(iris.getDouble("Sepal Width"));
      }

      System.out.println("Sepal Width mean " + summary.getMean());
      System.out.println("Sepal Width variance " + summary.getVariance());
   }
}

Output

Sepal Width mean 3.057333333333334
Sepal Width variance 0.18871288888888907
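
For readers who want to see what an update-then-query accumulator of this kind might do, here is a minimal sketch using Welford's running-update method. It is an assumption for illustration and not the implementation of com.imsl.stat.Summary; the class RunningStats and its method names are hypothetical. The variance printed above appears consistent with the population form (dividing by n rather than n - 1), so the sketch exposes both forms.

```java
// Hypothetical accumulator, for illustration only; not the JMSL Summary class.
class RunningStats {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Adds one observation using Welford's update. */
    public void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public long getCount() { return n; }

    public double getMean() { return n > 0 ? mean : Double.NaN; }

    /** Variance with divisor n. */
    public double getPopulationVariance() { return n > 0 ? m2 / n : Double.NaN; }

    /** Variance with divisor n - 1. */
    public double getSampleVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    public static void main(String[] args) {
        // Sepal widths from the six sample lines shown at the top of this example.
        double[] sepalWidths = {3.5, 3.0, 3.2, 3.1, 3.6, 3.9};
        RunningStats stats = new RunningStats();
        for (double w : sepalWidths) {
            stats.update(w);
        }
        System.out.println("mean " + stats.getMean());
        System.out.println("population variance " + stats.getPopulationVariance());
        System.out.println("sample variance " + stats.getSampleVariance());
    }
}
```

A single-pass update like this avoids storing the observations, which suits a reader that yields one row at a time.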

Reference

Fisher, R.A. (1936), The use of multiple measurements in taxonomic problems, The Annals of Eugenics, 7, 179-188.