# Data

To start off with, there is some terminology you need to be familiar with.

- A **population** is the whole set of items that are of interest.
- A **sampling unit** is an individual unit of a population.
- A **sampling frame** is a named or numbered list of the sampling units in the population.
- A **quantitative variable** is one associated with numerical observations.
- A **qualitative variable** is one associated with non-numerical observations.
- A **continuous variable** can take any value in a given range, e.g. decimals.
- A **discrete variable** can only take specific values, e.g. integers.
- A **census** measures every member of a population.
- A **sample** is a selection of observations taken from a subset of the population.

There are advantages and disadvantages to both forms of statistical investigation:

- A **census** should be entirely accurate (because it measures every sampling unit), but it is time consuming, cannot be used with destructive testing (where the item is destroyed by being tested), and produces a vast amount of data to be processed.
- A **sample** is less time consuming because fewer units have to be tested and less data is produced; however, it may not be as accurate, and the sample may not reflect the population well.

## Sampling

Broadly speaking, there are two types of sampling: **random** and **non-random**. Each has its own subtypes, too.

### Random Sampling

In random sampling, each member of a population has an equal chance of being chosen. This means the sample should be both representative and unbiased.

**Simple Random Sampling**

For a simple random sample, a sampling frame is created where each member is given a number. Then, a random number generator or a lottery is used to create the sample.

Advantages are that it is free from bias, easy and cheap to use on small populations and samples, and the probability of being selected is known.

Disadvantages are that a sampling frame needs to be constructed, and it is difficult when the population/sample is large.
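As a minimal sketch of the procedure described above, the Python snippet below numbers a sampling frame and uses a random number generator to draw the sample. The function name `simple_random_sample`, the frame of 50 members and the `seed` argument are assumptions made for illustration only; the seed simply makes the example repeatable.

```python
import random


def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw sample_size distinct units, each equally likely to be chosen."""
    rng = random.Random(seed)  # seeded only so the example can be repeated
    return rng.sample(sampling_frame, sample_size)


# Hypothetical sampling frame: 50 members numbered 1 to 50.
frame = list(range(1, 51))
srs = simple_random_sample(frame, 10, seed=1)
print(srs)
```

Because `random.sample` selects without replacement, no member can appear in the sample twice.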

**Systematic Sampling**

For a systematic sample, the required elements are selected at regular, chosen intervals from an ordered list. For example, if you had a population of 50 and wanted a sample of 10, use a random number generator to pick a number between one and five to find the first person, then choose every fifth person after that.

Advantages include that it is simple and quick and works for large samples and populations.

Disadvantages include that a sampling frame is needed and, if the list is not in random order, bias can be introduced.
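The example of 50 people and a sample of 10 can be sketched in Python as follows. This is an illustration under assumptions: the function name `systematic_sample` is hypothetical, and the population size is assumed to be an exact multiple of the sample size so that the interval is a whole number.

```python
import random


def systematic_sample(ordered_list, sample_size, seed=None):
    """Pick a random start within the first interval, then take every k-th unit."""
    interval = len(ordered_list) // sample_size  # k = 5 for 50 people, sample of 10
    rng = random.Random(seed)
    start = rng.randrange(interval)  # 0-based position of the first pick
    return ordered_list[start::interval][:sample_size]


# Hypothetical ordered list: people numbered 1 to 50.
people = list(range(1, 51))
sys_sample = systematic_sample(people, 10, seed=3)
print(sys_sample)
```

Only the starting point is random; once it is fixed, every other member of the sample is determined by the interval.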

**Stratified Sampling**

For stratified sampling, the population is divided into mutually exclusive strata, and a random sample is taken from each. These strata could be gender, eye colour etc. The share of the sample taken from each stratum should match that stratum's share of the population: for example, if 40% of a population is male and 60% female, a sample of 10 should have 4 males and 6 females.

Advantages are that it reflects the population structure and gives proportional representation.

Disadvantages are that the population must be divided into mutually exclusive strata, and that the selection of members within each stratum has the same issues as simple random sampling.
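The 40%/60% example can be sketched in Python as below. The function name `stratified_sample` and the made-up member labels are assumptions for illustration. Each stratum's share of the sample is rounded to the nearest whole number, which in less tidy cases may not add up exactly to the requested sample size.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Take a simple random sample from each stratum, in proportion to its size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    chosen = {}
    for name, members in strata.items():
        # Number to take = sample size x (stratum size / population size).
        k = round(sample_size * len(members) / total)
        chosen[name] = rng.sample(members, k)
    return chosen


# Hypothetical population of 100: 40 males and 60 females.
strata = {
    "male": [f"M{i}" for i in range(1, 41)],
    "female": [f"F{i}" for i in range(1, 61)],
}
strat_sample = stratified_sample(strata, 10, seed=0)
print({name: len(members) for name, members in strat_sample.items()})
```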

### Non-Random Sampling

There are two main types of non-random sampling:

**Quota Sampling**

Quota sampling is when a researcher selects a sample that reflects the characteristics of the whole population. Individuals are screened to see which quota they fit into, and this continues until each quota is filled.

Advantages include that it allows a small sample to represent a large population, no sampling frame is needed, it is quick and easy and allows for comparison between different groups.

Disadvantages include that it can introduce bias, group divisions can be vastly inaccurate, and people who do not easily fit into a group are ignored.
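The screening process can be sketched in Python as follows. The function name `quota_sample`, the age groups and the names are all hypothetical. People are considered in the order the researcher meets them, which is why no sampling frame is needed and also why bias can creep in.

```python
def quota_sample(people, quotas):
    """Screen people in the order met; keep each one only if their quota is still open."""
    remaining = dict(quotas)  # copy so the caller's quotas are not modified
    selected = []
    for name, group in people:
        if remaining.get(group, 0) > 0:
            selected.append((name, group))
            remaining[group] -= 1
        if all(count == 0 for count in remaining.values()):
            break  # every quota is filled, so stop screening
    return selected


# Hypothetical passers-by, in the order they were approached.
passers_by = [
    ("Ana", "under 30"),
    ("Ben", "30 and over"),
    ("Cal", "under 30"),
    ("Dee", "under 30"),
    ("Eli", "30 and over"),
    ("Fay", "30 and over"),
]
quota_filled = quota_sample(passers_by, {"under 30": 2, "30 and over": 2})
print(quota_filled)
```

Here Dee is turned away because the "under 30" quota is already full, and Fay is never screened because both quotas are filled once Eli is added.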

**Opportunity Sampling**

Opportunity sampling, also known as convenience sampling, involves taking the sample from whoever is readily available at the time and fits the criteria. For example, this might just be the first 10 people you find.

Advantages are that it is extremely quick and easy.

Disadvantages are that it is very unlikely to represent the population and is highly dependent on the individual researcher.

## Location & Spread

A **measure of location**, such as the mean, median or mode, is a single value that describes a position in a data set:

- The **mode** is the value that occurs most often.
- The **median** is the middle value when the data points are in order.
- The **mean** is calculated using:

$$\bar{x} = \frac{\sum x}{n}$$
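As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The data set is made up, and the mean is also worked out by hand as the sum of the values divided by how many there are, which is the usual definition assumed here.

```python
from statistics import mean, median, mode

# Hypothetical data set, already in order.
data = [2, 3, 3, 5, 7, 10]

data_mode = mode(data)      # the value that occurs most often
data_median = median(data)  # middle value; with six values, the average of the 3rd and 4th
data_mean = mean(data)      # sum of the values divided by the number of values

# The same mean, written out by hand.
mean_by_hand = sum(data) / len(data)

print(data_mode, data_median, data_mean, mean_by_hand)
```

For this data the mode is 3, the median is 4 (halfway between 3 and 5) and the mean is 30 ÷ 6 = 5.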

### Variance & Standard Deviation

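The **variance** is a measure used to describe the spread of a data set:

$$\sigma^2 = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2 = \frac{S_{xx}}{n}, \quad \text{where } S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

The **standard deviation** is the square root of the variance:

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{S_{xx}}{n}}$$

Generally, it is easiest to use the first form of the equation (without $S_{xx}$) when you have raw data.

The second form (with $S_{xx}$) is best used when you can find $S_{xx}$ quickly with a calculator.

When working with frequencies, use this equation for the variance, σ², instead:

$$\sigma^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2$$

Again, the standard deviation, σ, is given as the square root of this.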
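The sketch below checks that the different routes to the variance agree. It assumes the standard population formulas, σ² = Σx²/n − (Σx/n)² and σ² = Sxx/n with Sxx = Σx² − (Σx)²/n, plus the frequency version σ² = Σfx²/Σf − (Σfx/Σf)². The function names and the data are made up for illustration.

```python
from math import sqrt


def variance_raw(data):
    """Mean of the squares minus the square of the mean."""
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2


def variance_sxx(data):
    """Sxx divided by n, where Sxx = sum(x^2) - (sum(x))^2 / n."""
    n = len(data)
    sxx = sum(x * x for x in data) - sum(data) ** 2 / n
    return sxx / n


def variance_freq(values, freqs):
    """Variance from a frequency table of distinct values and their frequencies."""
    total_f = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return sum_fx2 / total_f - (sum_fx / total_f) ** 2


# Hypothetical raw data, and the same data as a frequency table.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
values, freqs = [2, 4, 5, 7, 9], [1, 3, 2, 1, 1]

var_a = variance_raw(raw)
var_b = variance_sxx(raw)
var_c = variance_freq(values, freqs)
std_dev = sqrt(var_a)

print(var_a, var_b, var_c, std_dev)
```

For this data Σx = 40 and Σx² = 232, so the variance is 232/8 − 5² = 4 and the standard deviation is 2, whichever form is used.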

### Ranges

Another way of describing the spread of a data set is to use ranges.

- **'The' range** is the difference between the largest and the smallest value.
- **Interquartile range** is t