Types of Data
Chapters 1 and 2

In science, to be able to study something we must be able to observe it and its characteristics. To this end scientists collect what is called data and these data may take a variety of forms. We are going to start with some definitions and then introduce the various types of data that biologists may collect.

Definitions (chapter 2):

Observations -- These are measurements taken on the smallest sampling unit. Another, much less used term for the same thing is variate. For example, if you were studying the size of raccoons, you might decide to weigh them. The weight of one raccoon would be one observation, as one raccoon is the smallest sampling unit in your study.

Sample of observations -- The collection of observations made during the study constitutes the sample. For example, in our study of raccoons, we recorded the weights of 23 different raccoons. This collection of 23 weights constitutes the sample for our study. It is assumed that the 23 raccoons were taken randomly from the population under study.

Character -- The actual property measured by the individual observations. Another term that is often used is variable. For example, in our study on raccoons, body weight is the character or variable that we are measuring.

Population -- The totality of individual observations about which inferences are to be made. A population exists in a defined space and at a defined period of time. For example, in our study on raccoons, we are interested in drawing inferences about the average size of raccoons in Lyon County, Kansas during the summer of 1999. Thus our population is defined as all raccoons that exist in Lyon County during the summer of 1999. Obviously, we can not measure all of these raccoons, so we take a sample and from the sample make inferences about the average size of raccoons in this population. Based on our sample of 23 raccoons from Lyon County, we can not make inferences about the average size of raccoons that live in North America. We did not collect observations in such a way that we can do that. Populations are, in reality, finite, though in practice they are generally large enough that statistically they can be treated as infinite.

Types of measurements (chapter 1.1)

The collection of observations constitutes our data, and data come in the form of measurements. There are four basic kinds of measurements, which can be thought of as a hierarchy of data. It is important to keep the types of data in mind, as certain types of analyses are appropriate for one level but not for another.

1) nominal data -- This means simply to name an object and thus to assign it to a class. For example, by classifying individuals as male or female we assign each individual to a class and thus have nominal data. Creating lists of species and numbers of individuals is another type of nominal data that ecologists often collect. Statistics can be applied to these data after they have been converted to percentages or proportions, for example 45% female and 55% male.

2) ordinal data -- Objects are measured as being either greater than or less than, higher or lower than a comparative object. For example, suppose we were at the store to buy a head of lettuce and the price was \$1.50 per head. We might pick up several heads and hold them in our hands in an attempt to determine which head was largest. We would be collecting ordinal data in that we are simply determining their relative weights. Another example would be finishing positions in a race (first, second, third, etc.).

3) interval level of measurement -- This is the point where we introduce a basic standard interval but not necessarily a true zero (that is, the zero does not indicate the complete absence of whatever we are measuring). An example is temperature on the Celsius or Fahrenheit scales. Temperature is a measure of the average kinetic energy of the molecules of a substance: the higher the temperature, the higher the kinetic energy. On either scale, zero does not indicate a complete absence of kinetic energy, and 60 degrees is not twice as much energy as 30 degrees. However, the difference in kinetic energy between 0 degrees and 1 degree is the same as the difference between 2400 degrees and 2401 degrees, because the interval of measurement is standard along the scale. Each interval represents one unit of kinetic energy.

4) ratio level of measurement -- There is a basic standard interval, as above, and a meaningful zero value. For example, in weight the basic standard interval could be defined as a gram, and zero grams would indicate no weight; also, 60 grams is twice as heavy as 30 grams. If you want to measure temperature using a ratio scale, then you would have to use the Kelvin scale, where zero means the complete absence of kinetic energy in a substance and 60 kelvins is twice as hot as 30 kelvins.

5) derived variables -- This represents a separate class of measurements, which is not really part of the hierarchy, in which the measurement is derived from two or more independently measured variables. For example, you could define a stoutness index in humans that would be equal to body weight divided by height. Similarly, you could define a motor development index as the sum of the scores on each of six different motor tests. These derived variables often have non-normal distributions (which we will define later), and thus you will often have problems when analyzing the data by methods that assume that the data are normally distributed. More will be said about this later.
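
The distinction between the interval and ratio levels (items 3 and 4 above) can be checked with a little arithmetic. The sketch below is only an illustration: the function name `celsius_to_kelvin` is made up for this example, and it relies on the standard conversion K = C + 273.15. It shows that differences agree on both scales, while the "twice as hot" ratio only makes sense on the Kelvin scale.

```python
# Compare two temperatures on an interval scale (Celsius) and a ratio scale (Kelvin).
# celsius_to_kelvin is a hypothetical helper written for this illustration.
def celsius_to_kelvin(deg_c):
    """Convert degrees Celsius to kelvins (0 K is absolute zero)."""
    return deg_c + 273.15

warm_c, cool_c = 60.0, 30.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences agree on both scales because the interval is the same size.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are only meaningful on the Kelvin scale, which has a true zero.
ratio_c = warm_c / cool_c   # arithmetic gives 2, but this has no physical meaning
ratio_k = warm_k / cool_k   # roughly 1.1, not 2

print(f"difference: {diff_c:.1f} Celsius degrees, {diff_k:.1f} kelvins")
print(f"naive Celsius ratio: {ratio_c:.2f}")
print(f"Kelvin ratio: {ratio_k:.2f}")
```

So 60 degrees Celsius is only about 10% "hotter" than 30 degrees Celsius in terms of absolute temperature, even though the numbers on the Celsius scale differ by a factor of two.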

Continuous versus discrete data

Data are continuous when the measurements could conceivably take on any value. For example, body weight in raccoons is continuous, as we could have a raccoon that weighed 8 kg and we could have one that weighed 9 kg. We could also have one that weighed 8.7 kg or 8.67 kg or 8.679 kg. As you can see, any real number within the range of body weights is possible.

Data are discrete (also called meristic) when the measurements or observations can only take on integral values. For example, data based on counts are discrete, such as the number of legs that an animal may have, the number of scales in the lateral line of a fish, or the number of chromosomes in the cells of an organism.

Some data may appear discrete but are really continuous, such as scores on a Likert scale. For example, you will be asked to evaluate the effectiveness of my teaching by circling either a 1, 2, 3, 4, or 5 on a computer-scorable sheet. The effectiveness of an instructor is a continuous variable, as it could take on any value between 1 and 5. However, it is very difficult (if not impossible) to discriminate between observations any more finely than this when recording them.

Frequency distributions (chapters 1.3 and 1.4)

Frequency distributions are a graphical representation of the frequency of occurrence of observations across the range of observations. Frequency distributions are a very useful method for examining your data and gaining insights into the structure of the data that would not be obvious to you from the calculation of descriptive statistics (which we will cover in the next section). For example, you may find that your data follow a bimodal distribution (that is, one with two peaks), or that the distribution is U-shaped or J-shaped and clearly not normal.

Frequency distributions take different forms depending upon whether the data are discrete or continuous. If you are plotting discrete data, then you should use a bar graph with the bars not touching each other to show that the data are discrete. For example, if you were producing a graph that shows the frequency of occurrence of numbers of species of fish in a lake, your data may consist of 2 lakes with 1 species, 4 lakes with 2 species, 5 lakes with 3 species, 7 lakes with 4 species, 6 lakes with 5 species, 3 lakes with 6 species, 1 lake with 7 species, and 1 lake with 9 species. As the number of species can only take integral values, the data are discrete. Thus the bars used to represent the number of species can not touch adjacent bars.
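
Before drawing the graph, it can help to lay the counts out as a simple text chart. The following is a minimal sketch in Python (not part of the SigmaPlot workflow used in this course); the variable names are made up for illustration, and the counts are the lake data given above. Each row stands for one separate bar, and the empty class (8 species) is kept so the gap on the x-axis stays visible.

```python
# Number of fish species per lake -> number of lakes (counts from the example above).
lakes_by_species = {1: 2, 2: 4, 3: 5, 4: 7, 5: 6, 6: 3, 7: 1, 9: 1}

total_lakes = sum(lakes_by_species.values())

# Include every integer from the smallest to the largest species count so that
# empty classes (here, 8 species) still appear on the x-axis.
species_range = range(min(lakes_by_species), max(lakes_by_species) + 1)
frequencies = [lakes_by_species.get(n, 0) for n in species_range]

print("species  lakes")
for n, freq in zip(species_range, frequencies):
    # Each row is a separate bar, because species counts are discrete.
    print(f"{n:>7}  {'#' * freq} ({freq})")
print(f"total lakes: {total_lakes}")
```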

Plot the number of species along the x-axis and the number of lakes along the y-axis. To do this we will use a computer package called SigmaPlot. You can access this program on the computers in SH 158. Your graph should look like the one below.

If you are producing a frequency distribution of continuous data, then the bars will touch each other, indicating that the observations can take on any conceivable value. The data below come from Sokal and Rohlf and represent the femur lengths of 25 aphids.
```
3.8      3.6      4.3     3.5     4.3     3.3     4.3     3.9     4.3     3.8
3.9      4.4      3.8     4.7     3.6     4.1     4.4     4.5     3.6     3.8
4.4      4.1      3.6     4.2     3.9
```
We will first arrange the observations in ascending order from lowest to highest. Note that when we record a value of 3.3, that means that the true length of the femur lies between 3.25 and 3.35 and when we record a value of 4.2, that means that the true length lies between 4.15 and 4.25. We are only recording our data to the nearest tenth (read chapter 1.2). See the column labeled implied limits in the table below. We have actually divided our data into 15 different classes with a class interval of 0.1. If you look closely at the column labeled frequency, you will see several classes that have zero observations (3.4, 3.7, 4.0, and 4.6). This is not unexpected when you have 25 observations divided among 15 classes.

| class mark | implied class limits | frequency | cumulative frequency |
|---|---|---|---|
| 3.3 | 3.25 - 3.35 | 1 | 1 |
| 3.4 | 3.35 - 3.45 | 0 | 1 |
| 3.5 | 3.45 - 3.55 | 1 | 2 |
| 3.6 | 3.55 - 3.65 | 4 | 6 |
| 3.7 | 3.65 - 3.75 | 0 | 6 |
| 3.8 | 3.75 - 3.85 | 4 | 10 |
| 3.9 | 3.85 - 3.95 | 3 | 13 |
| 4.0 | 3.95 - 4.05 | 0 | 13 |
| 4.1 | 4.05 - 4.15 | 2 | 15 |
| 4.2 | 4.15 - 4.25 | 1 | 16 |
| 4.3 | 4.25 - 4.35 | 4 | 20 |
| 4.4 | 4.35 - 4.45 | 3 | 23 |
| 4.5 | 4.45 - 4.55 | 1 | 24 |
| 4.6 | 4.55 - 4.65 | 0 | 24 |
| 4.7 | 4.65 - 4.75 | 1 | 25 |

Class mark represents the mid-point of the implied class limits. In this case, it represents the actual measurement recorded for each aphid. Frequency is the number of times that a measurement with that class mark occurs in the set of data. Cumulative frequency is the sum of the frequencies for each class from the smallest class mark up through the class mark of interest.
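
The frequency table for the aphid data can also be built by a short program rather than by hand. The following is a minimal sketch in Python (an illustration only, not the procedure used in Sokal and Rohlf); the variable names are made up. It assumes the implied limits lie half a recording unit (0.05) on either side of each class mark, as described above.

```python
from collections import Counter

# Femur lengths of 25 aphids, as listed above.
femur_lengths = [
    3.8, 3.6, 4.3, 3.5, 4.3, 3.3, 4.3, 3.9, 4.3, 3.8,
    3.9, 4.4, 3.8, 4.7, 3.6, 4.1, 4.4, 4.5, 3.6, 3.8,
    4.4, 4.1, 3.6, 4.2, 3.9,
]

# Work in whole tenths (33, 34, ...) to avoid floating-point comparison problems.
tenths = [round(x * 10) for x in femur_lengths]
counts = Counter(tenths)

rows = []
cumulative = 0
# Step through every class from the smallest to the largest class mark,
# including classes that contain no observations.
for mark in range(min(tenths), max(tenths) + 1):
    freq = counts.get(mark, 0)
    cumulative += freq
    lower, upper = (mark - 0.5) / 10, (mark + 0.5) / 10  # implied class limits
    rows.append((mark / 10, lower, upper, freq, cumulative))

print("mark  implied limits  freq  cum")
for mark, lower, upper, freq, cum in rows:
    print(f"{mark:.1f}   {lower:.2f} - {upper:.2f}   {freq:>4}  {cum:>3}")
```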

Plot these data using SigmaPlot. Your graph should look like the one below.

Because we had so many classes with zero observations, we are next going to lump our data into fewer classes, in this case 5 classes (see the second table below). This will produce a graph in which it is easier for us to see the pattern in our data.

Practical class limits are the ranges of recorded values that fall within a particular class. Note here that the class interval is 0.3.

| class mark | implied class limits | practical class limits | frequency | cumulative frequency |
|---|---|---|---|---|
| 3.4 | 3.25 - 3.55 | 3.3 - 3.5 | 2 | 2 |
| 3.7 | 3.55 - 3.85 | 3.6 - 3.8 | 8 | 10 |
| 4.0 | 3.85 - 4.15 | 3.9 - 4.1 | 5 | 15 |
| 4.3 | 4.15 - 4.45 | 4.2 - 4.4 | 8 | 23 |
| 4.6 | 4.45 - 4.75 | 4.5 - 4.7 | 2 | 25 |

Plot these data using SigmaPlot. Your graph should look like the one below.
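
The lumping of the 15 narrow classes into 5 wider ones can also be checked numerically. The sketch below is an illustration only, with made-up variable names. It starts from the frequencies in the first frequency table and assumes that three adjacent classes are combined at a time, which gives the class interval of 0.3.

```python
# Frequencies of the 15 narrow classes (class marks 3.3 through 4.7, interval 0.1),
# taken from the first frequency table. Class marks are stored in whole tenths.
class_marks_tenths = list(range(33, 48))
frequencies = [1, 0, 1, 4, 0, 4, 3, 0, 2, 1, 4, 3, 1, 0, 1]

group_size = 3  # lump three adjacent classes, giving a class interval of 0.3

grouped = []
cumulative = 0
for start in range(0, len(frequencies), group_size):
    marks = class_marks_tenths[start:start + group_size]
    freq = sum(frequencies[start:start + group_size])
    cumulative += freq
    new_mark = marks[len(marks) // 2] / 10        # middle of the three old marks
    practical = (marks[0] / 10, marks[-1] / 10)   # recorded values in the class
    implied = ((marks[0] - 0.5) / 10, (marks[-1] + 0.5) / 10)
    grouped.append((new_mark, implied, practical, freq, cumulative))

print("mark  implied limits  practical limits  freq  cum")
for new_mark, implied, practical, freq, cum in grouped:
    print(f"{new_mark:.1f}   {implied[0]:.2f} - {implied[1]:.2f}   "
          f"{practical[0]:.1f} - {practical[1]:.1f}        {freq:>4}  {cum:>3}")
```

Choosing a different `group_size` changes how much detail is smoothed away: wider classes make the overall shape easier to see, but hide more of the variation within each class.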

Last updated on 24 August 2000.
Provide comments to Dwight Moore at mooredwi@emporia.edu.