Types of Data
Chapters 1 and 2
In science, to study something we must be able to observe it and its characteristics. To
this end, scientists collect what are called data, and these data may take a variety of forms. We are
going to start with some definitions and then introduce the various types of data that biologists collect.
Definitions (chapter 2):
Observations -- These are measurements taken on the smallest sampling unit. Another, much
less used term for the same thing is variate. For example, if you were studying the size of
raccoons, you might decide to weigh them. The weight of one raccoon would be one
observation, as one raccoon is the smallest sampling unit in your study.
Sample of observations -- The collection of observations made during the study constitutes the
sample. For example, in our study of raccoons, we recorded the weights of 23 different raccoons.
This collection of 23 weights constitutes the sample for our study. It is assumed that the 23 raccoons were taken randomly from the population under study.
Character -- The actual property measured by the individual observations. Another term that is
often used is variable. For example, in our study on raccoons, body weight is the character or
variable that we are measuring.
Population -- The totality of individual observations about which inferences are to be made. A
population exists in a defined space and at a defined period of time. For example, in our study
on raccoons, we are interested in drawing inferences about the average size of raccoons in Lyon
County, Kansas during the summer of 1999. Thus our population is defined as all raccoons that
exist in Lyon County during the summer of 1999. Obviously, we cannot measure all of these
raccoons, so we take a sample and from the sample make inferences about the average size of
raccoons in this population. Based on our sample of 23 raccoons from Lyon County, we cannot
make inferences about the average size of raccoons that live in North America. We did not
collect observations in such a way that we can do that. Populations are, in reality, finite, though
in practice they are generally large enough that statistically they can be treated as infinite.
Types of measurements (chapter 1.1)
The collection of observations constitutes our data and data come in the form of measurements.
There are four basic kinds of measurements, which can be thought of as a hierarchy of data. It is
important to keep the types of data in mind, as certain types of analyses are appropriate for one
level but not for another.
1) nominal data -- This means simply to name an object and thus to assign it to a class. For
example, by classifying individuals as male or female we assign each individual to a class and
thus have nominal data. Creating lists of species and numbers of individuals is another type of
nominal data that ecologists often collect. Statistics can be applied to these data after they have
been converted to percentages or proportions, for example 45% female and 55% male.
2) ordinal data -- Objects are measured as being either greater than or less than, higher or lower
than, a comparative object. For example, suppose we were at the store to buy a head of lettuce and the
price was $1.50 per head. We might pick up several heads and hold them in our hands in an attempt
to determine which head was largest. We would be collecting ordinal data, in that we are simply
determining their relative weights. Another example would be finishing positions in a race
(first, second, third, etc.).
3) interval level of measurement -- This is the point where we introduce a basic standard interval
but not necessarily a true zero (that is, the zero does not indicate the
complete absence of whatever we are measuring). An example is temperature on the Celsius or
Fahrenheit scales. Temperature is a measure of the energy in the molecules of a substance: the
higher the temperature, the higher the kinetic energy. On either scale, zero does not indicate a
complete absence of kinetic energy, and 60 degrees is not twice as much energy as 30 degrees.
However, the difference in kinetic energy between 0 degrees and 1 degree is the same as the
difference between 2400 degrees and 2401 degrees, because the interval of measurement is
standard along the scale. Each interval represents one unit of kinetic energy.
4) ratio level of measurement -- There is a basic standard interval, as above, and a meaningful
zero value. For example, in weight the basic standard interval could be defined as a gram, and
zero grams would indicate no weight; also, 60 grams is twice as heavy as 30 grams. If you want
to measure temperature using a ratio scale, then you would have to use the Kelvin scale, where
zero means the complete absence of kinetic energy in a substance and 60 kelvins
is twice as hot as 30 kelvins.
5) derived variables -- This represents a separate class of measurements, which is not really part of the hierarchy, in which the measurement is
derived from two or more independently measured variables. For example, you could define a
stoutness index in humans that would be equal to body weight divided by height. You could
also define a motor development index as the sum of the scores on six different motor tests.
These derived variables often have non-normal distributions (which we will define later), and thus
you will often have problems when analyzing the data by methods that assume that the data are
normally distributed. More will be said about this later.
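The distinctions among the measurement levels above can be sketched in a few lines of Python. This is only an illustration, not part of the course software: the counts of 45 females and 55 males are hypothetical numbers chosen to reproduce the 45%/55% example, and the function names `celsius_to_kelvin` and `stoutness_index` are invented for this sketch.

```python
# Nominal: classes only have names, so we summarize them as percentages.
# Hypothetical counts chosen to match the 45% female / 55% male example.
sex_counts = {"female": 45, "male": 55}
total = sum(sex_counts.values())
percentages = {sex: 100 * n / total for sex, n in sex_counts.items()}

# Ordinal: only the order is meaningful, not the size of the gaps.
finish_order = ["first", "second", "third"]
second_beats_third = finish_order.index("second") < finish_order.index("third")

# Interval vs. ratio: ratios are meaningful only on a scale with a true zero.
def celsius_to_kelvin(c):
    return c + 273.15

ratio_celsius = 60 / 30  # 2.0, but 60 C is not "twice as hot" as 30 C
ratio_kelvin = celsius_to_kelvin(60) / celsius_to_kelvin(30)  # roughly 1.1

# Derived variable: computed from two independently measured variables.
def stoutness_index(weight, height):
    return weight / height

print(percentages, second_beats_third, ratio_celsius, round(ratio_kelvin, 3))
```

Note that the same two temperatures give a ratio of 2 on the Celsius scale but only about 1.1 on the Kelvin scale, which is why ratios should be computed only for ratio-level data.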
Continuous versus discrete data
Data are continuous when the measurements could conceivably take on any value. For example,
body weight in raccoons is continuous as we could have a raccoon that weighed 8 kg and we
could have one that weighed 9 kg. We could also have one that weighed 8.7 kg or 8.67 kg or 8.679
kg. As you can see, any real number within the range of body weights is possible.
Data are discrete (also called meristic) when the measurements or observations can only take on
integral values. For example, data based on counts are discrete, such as the number of legs that an
animal may have, the number of scales in the lateral line of a fish, or the number of chromosomes in
the cells of an organism.
Some data may appear discrete but are really continuous, for example scores on a Likert scale.
You will be asked to evaluate the effectiveness of my teaching by circling
a 1, 2, 3, 4, or 5 on a computer-scorable sheet. The effectiveness of an
instructor is a continuous variable, as it could take on any value between 1 and 5. However, it is
very difficult (if not impossible) to discriminate between observations any more finely than
whole numbers when recording them.
Frequency distributions (chapters 1.3 and 1.4)
Frequency distributions are a graphical representation of the frequency of occurrence of
observations across the range of observations. They are a very useful method
for examining your data and gaining insights into the structure of the data that would not be
obvious to you from the calculation of descriptive statistics (which we will cover in the next
section). For example, you may find that your data have a bimodal distribution (that is, one with two peaks), or
that the distribution is U-shaped or J-shaped and clearly not normal.
Frequency distributions take different forms depending upon whether the data are discrete or
continuous. If you are plotting discrete data, then you should use a bar graph with the bars not
touching each other to show that the data are discrete. For example, if you were producing a
graph that shows the frequency of occurrence of numbers of species of fish in a lake, your data
may consist of 2 lakes with 1 species, 4 lakes with 2 species, 5 lakes with 3 species, 7 lakes with
4 species, 6 lakes with 5 species, 3 lakes with 6 species, 1 lake with 7 species, and 1 lake with 9
species. As the number of species can only take integral values, the data are discrete. Thus the
bars used to represent the number of species should not touch adjacent bars.
Plot the number of species along the x-axis and the number of lakes along the y-axis. To do this we
will use a computer package called SigmaPlot. You can access this program on the computers
in SH 158. Your graph should look like the one below.
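If you want a quick check of the lake data before opening the plotting package, a rough text version of the bar graph can be sketched in Python. This is not a substitute for the SigmaPlot graph; the function name `text_bar_chart` is made up for this sketch, and each row stands for one separate (non-touching) bar.

```python
# Number of fish species per lake -> number of lakes, from the example above.
lakes_by_species = {1: 2, 2: 4, 3: 5, 4: 7, 5: 6, 6: 3, 7: 1, 9: 1}

def text_bar_chart(freq):
    """Return one line per integer class, including classes with no lakes."""
    lines = []
    for value in range(min(freq), max(freq) + 1):
        count = freq.get(value, 0)
        lines.append(f"{value:>2} | {'#' * count} ({count})")
    return lines

chart = text_bar_chart(lakes_by_species)
print("\n".join(chart))

total_lakes = sum(lakes_by_species.values())
print("Total lakes:", total_lakes)
```

The class for 8 species is printed with an empty bar, which makes the gap between 7 and 9 species visible instead of silently skipping it.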
If you are producing a frequency distribution of continuous data, then the bars will touch each
other, indicating that the observations can take on any conceivable value. The data below come
from Sokal and Rohlf and represent the femur lengths of 25 aphids.
3.8 3.6 4.3 3.5 4.3 3.3 4.3 3.9 4.3 3.8
3.9 4.4 3.8 4.7 3.6 4.1 4.4 4.5 3.6 3.8
4.4 4.1 3.6 4.2 3.9
We will first arrange the observations in ascending order from lowest to highest. Note that when
we record a value of 3.3, that means that the true length of the femur lies between 3.25 and 3.35
and when we record a value of 4.2, that means that the true length lies between 4.15 and 4.25.
We are only recording our data to the nearest tenth (read chapter 1.2). See the column labeled implied limits in the
table below. We have actually divided our data into 15 different classes with a class interval of
0.1. If you look closely at the column labeled frequency, you will see several classes that have
zero observations (3.4, 3.7, 4.0, and 4.6). This is not unexpected when you have 25 observations
divided among 15 classes.
Class mark represents the mid-point of the implied class limits. In this case, it represents the actual measurement recorded for each aphid. Frequency is the number of times that a measurement with that class mark occurs in the set of data. Cumulative frequency is the sum of the frequencies for each class from the smallest class mark up through the class mark of interest.
| Class mark || Implied limits || Frequency || Cumulative frequency |
| 3.3 || 3.25 - 3.35 || 1 || 1 |
| 3.4 || 3.35 - 3.45 || 0 || 1 |
| 3.5 || 3.45 - 3.55 || 1 || 2 |
| 3.6 || 3.55 - 3.65 || 4 || 6 |
| 3.7 || 3.65 - 3.75 || 0 || 6 |
| 3.8 || 3.75 - 3.85 || 4 || 10 |
| 3.9 || 3.85 - 3.95 || 3 || 13 |
| 4.0 || 3.95 - 4.05 || 0 || 13 |
| 4.1 || 4.05 - 4.15 || 2 || 15 |
| 4.2 || 4.15 - 4.25 || 1 || 16 |
| 4.3 || 4.25 - 4.35 || 4 || 20 |
| 4.4 || 4.35 - 4.45 || 3 || 23 |
| 4.5 || 4.45 - 4.55 || 1 || 24 |
| 4.6 || 4.55 - 4.65 || 0 || 24 |
| 4.7 || 4.65 - 4.75 || 1 || 25 |
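The table above can be rebuilt from the 25 raw measurements with a short Python sketch. This is one possible way to do the tally, not the method used in the text; the variable names are chosen for illustration. Values are converted to whole tenths first so that class marks such as 3.3 and 3.4 are compared as integers rather than as floating-point numbers.

```python
from collections import Counter

# Femur lengths of the 25 aphids, as listed above.
femur = [3.8, 3.6, 4.3, 3.5, 4.3, 3.3, 4.3, 3.9, 4.3, 3.8,
         3.9, 4.4, 3.8, 4.7, 3.6, 4.1, 4.4, 4.5, 3.6, 3.8,
         4.4, 4.1, 3.6, 4.2, 3.9]

# Work in integer tenths (3.3 -> 33) to avoid floating-point surprises.
tenths = [round(x * 10) for x in femur]
counts = Counter(tenths)

rows = []
running_total = 0
for t in range(min(tenths), max(tenths) + 1):
    frequency = counts.get(t, 0)          # zero for empty classes
    running_total += frequency            # cumulative frequency so far
    class_mark = t / 10
    lower, upper = (t - 0.5) / 10, (t + 0.5) / 10   # implied limits
    rows.append((class_mark, lower, upper, frequency, running_total))

for class_mark, lower, upper, frequency, cum in rows:
    print(f"{class_mark:.1f}  {lower:.2f} - {upper:.2f}  {frequency:>2}  {cum:>2}")
```

The loop runs over every tenth between the smallest and largest value, so the empty classes (3.4, 3.7, 4.0, and 4.6) appear with a frequency of zero, just as in the table.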
Plot these data using SigmaPlot. Your graph should look like the one below.
Because we had so many classes with zero observations, we are next going to lump our data into fewer classes, in this case
5 classes (see the second table below). This will produce a graph in which it is easier for us to see the pattern in our data.
Plot these data using SigmaPlot. Your graph should look like the one below.
Practical class limits are the ranges of recorded values that fall within a particular class. Note here that the class interval is 0.3.
| Class mark || Implied limits || Practical limits || Frequency || Cumulative frequency |
| 3.4 || 3.25 - 3.55 || 3.3 - 3.5 || 2 || 2 |
| 3.7 || 3.55 - 3.85 || 3.6 - 3.8 || 8 || 10 |
| 4.0 || 3.85 - 4.15 || 3.9 - 4.1 || 5 || 15 |
| 4.3 || 4.15 - 4.45 || 4.2 - 4.4 || 8 || 23 |
| 4.6 || 4.45 - 4.75 || 4.5 - 4.7 || 2 || 25 |
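The lumping into 5 classes can be sketched the same way. The sketch below assumes, as in the table, that the first class starts at the smallest recorded value (3.3) and that each class spans three recorded values (a class interval of 0.3); the dictionary keys are names invented for this illustration.

```python
from collections import Counter

# Femur lengths of the 25 aphids, as listed earlier.
femur = [3.8, 3.6, 4.3, 3.5, 4.3, 3.3, 4.3, 3.9, 4.3, 3.8,
         3.9, 4.4, 3.8, 4.7, 3.6, 4.1, 4.4, 4.5, 3.6, 3.8,
         4.4, 4.1, 3.6, 4.2, 3.9]

tenths = [round(x * 10) for x in femur]   # 3.3 -> 33, and so on
width = 3                                  # class interval of 0.3, in tenths
start = min(tenths)                        # first class begins at 3.3

# Class index 0 holds 3.3-3.5, index 1 holds 3.6-3.8, and so on.
grouped = Counter((t - start) // width for t in tenths)

rows = []
running_total = 0
for k in range(max(grouped) + 1):
    low = start + k * width                # lowest recorded value in class
    high = low + width - 1                 # highest recorded value in class
    running_total += grouped.get(k, 0)
    rows.append({
        "class_mark": (low + high) / 2 / 10,
        "implied": ((low - 0.5) / 10, (high + 0.5) / 10),
        "practical": (low / 10, high / 10),
        "frequency": grouped.get(k, 0),
        "cumulative": running_total,
    })

for row in rows:
    print(row)
```

Changing `width` or `start` gives a different grouping of the same data, which is a quick way to see how the choice of class interval affects the apparent shape of the distribution.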
Last updated on 24 August 2000.
Provide comments to Dwight Moore at firstname.lastname@example.org.