To start off with, there is some terminology you need to be familiar with.
A population is the whole set of items that are of interest.
A sampling unit is an individual unit of a population.
A sampling frame is a named or numbered list of the sampling units in the population.
A quantitative variable is one associated with numerical observations.
A qualitative variable is one associated with non-numerical observations, e.g. colours.
A continuous variable can take any value within a given range, e.g. a height of 1.73 m.
A discrete variable can only take specific values, e.g. a number of siblings or a shoe size.
A census measures every member of a population.
A sample is a selection of observations taken from a subset of the population.
There are advantages and disadvantages to all forms of statistical investigations:
A census should be completely accurate (because it measures every sampling unit), but it is time-consuming, cannot be used when testing is destructive (when testing destroys the item), and produces a vast amount of data to process.
A sample is less time-consuming because fewer units are tested and less data is produced. However, it may not be as accurate, and the sample may not reflect the population well.
Broadly speaking, there are two types of sampling: random and non-random. Each has its own subtypes.
In random sampling, each member of a population has an equal chance of being chosen. This means the sample should be both representative and unbiased.
Simple Random Sampling
For a simple random sample, a sampling frame is created where each member is given a number. Then, a random number generator or a lottery is used to create the sample.
Advantages are that it is free from bias, easy and cheap to use on small populations and samples, and the probability of being selected is known.
Disadvantages are that a sampling frame needs to be constructed, and it is difficult when the population/sample is large.
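As a rough illustration of the procedure described above, the sketch below numbers a sampling frame and lets a random number generator pick the sample. The function name `simple_random_sample`, the frame of 50 members, and the `seed` parameter are assumptions made for this example only.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample, without replacement, from a sampling frame."""
    # The seed is only here so the sketch can be repeated exactly.
    rng = random.Random(seed)
    # rng.sample picks distinct members, each equally likely to be chosen.
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame: 50 members numbered 1 to 50.
frame = list(range(1, 51))
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Because the members are drawn without replacement, nobody can appear in the sample twice.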
For a systematic sample, the required elements are selected at regular, chosen intervals from an ordered list. For example, if you had a population of 50 and wanted a sample of 10, you would use a random number generator to pick a number between one and five to find the first person, then choose every fifth person after that.
Advantages include that it is simple and quick, and that it works for large samples and populations.
Disadvantages include that a sampling frame is needed and, if the order of the list is not random, bias can be introduced.
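The example with a population of 50 and a sample of 10 can be sketched as follows. This is a minimal illustration, assuming the ordered list is simply the numbers 1 to 50; the function name `systematic_sample` and the `seed` parameter are hypothetical.

```python
import random

def systematic_sample(ordered_list, sample_size, seed=None):
    """Pick a random starting point, then take every k-th member of the list."""
    rng = random.Random(seed)
    # Sampling interval k: 50 // 10 = 5 in the example from the text.
    interval = len(ordered_list) // sample_size
    # Random starting position between 1 and k, inclusive.
    start = rng.randint(1, interval)
    # Convert the 1-based start to a 0-based index, then step through the list.
    return ordered_list[start - 1::interval][:sample_size]

# Hypothetical ordered list: 50 people numbered 1 to 50.
population = list(range(1, 51))
sample = systematic_sample(population, 10, seed=0)
print(sample)
```

Only the starting point is random. Every later member is fixed by the interval, which is why the order of the list matters.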
For stratified sampling, the population is divided into mutually exclusive strata, and a random sample is taken from each. These strata could be gender, eye colour etc. The proportion of the sample taken from each stratum should match that stratum's proportion of the population. For example, if 40% of a population are male and 60% are female, a sample of 10 should contain 4 males and 6 females.
Advantages are that it reflects the population structure and gives proportional representation
Disadvantages are that the population must be divided into mutually exclusive strata, and that the selection of members within each stratum has the same issues as simple random sampling.
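The 40%/60% example can be sketched with proportional allocation, as below. The strata contents, the function name `stratified_sample`, and the `seed` parameter are made up for illustration.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Take a simple random sample from each stratum, in proportion to its size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: stratum size / population size * sample size.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population of 100: 40 males and 60 females.
strata = {
    "male": [f"M{i}" for i in range(40)],
    "female": [f"F{i}" for i in range(60)],
}
sample = stratified_sample(strata, 10, seed=2)
print({name: len(chosen) for name, chosen in sample.items()})
# {'male': 4, 'female': 6}
```

When the proportions do not divide evenly, the rounded allocations may not add up to exactly the intended sample size, so a real study would need a rule for adjusting them.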
There are two main types of non-random sampling:
Quota sampling is when a researcher selects a sample that reflects the characteristics of the whole population. Individuals are screened to see which quota they fit into, and this continues until each quota is filled.
Advantages include that it allows a small sample to represent a large population, no sampling frame is needed, it is quick and easy and allows for comparison between different groups.
Disadvantages include that it can introduce bias, group divisions can be vastly inaccurate, and people who do not easily fit into a group are ignored.
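The screening process for quota sampling might look something like the sketch below. The list of passers-by, the quotas, and the function name `quota_sample` are all invented for illustration; in practice the researcher decides who to approach, which is where bias can enter.

```python
def quota_sample(people, quotas, group_of):
    """Screen people in the order they are met until every quota is filled."""
    remaining = dict(quotas)  # copy, so the caller's quotas are not changed
    sample = []
    for person in people:
        group = group_of(person)
        # Accept the person only if their group's quota still has room.
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        # Stop as soon as every quota has been filled.
        if not any(remaining.values()):
            break
    return sample

# Hypothetical people, in the order the researcher meets them.
passers_by = [
    ("Ann", "female"), ("Bea", "female"), ("Cat", "female"),
    ("Dee", "female"), ("Bob", "male"), ("Dan", "male"), ("Gus", "male"),
]
sample = quota_sample(passers_by, {"male": 2, "female": 3}, group_of=lambda p: p[1])
print(sample)
```

Here Dee is turned away because the female quota is already full, and Gus is never approached because both quotas are filled before the researcher reaches him.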
Opportunity sampling, also known as convenience sampling, involves taking the sample from whoever is readily available at the time and fits the criteria. For example, this might just be the first 10 people you find.
Advantages are that it is extremely quick and easy
Disadvantages are that it is very unlikely to represent the population and is highly dependent on the individual researcher.
Location & Spread
A measure of location is a single value that describes a position in a data set. The most common measures are the mean, median and mode:
The mode is the value that occurs most often
The median is the middle value when data points are in order
The mean is calculated using:
$$\bar{x} = \frac{\sum x}{n}$$
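As a quick check of the three measures of location, the sketch below uses Python's standard `statistics` module on a small data set made up for illustration.

```python
import statistics

# Hypothetical data set, already in order.
data = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(data)      # 3 occurs most often
median = statistics.median(data)  # even number of values: (3 + 5) / 2
mean = statistics.mean(data)      # sum of the values / number of values = 30 / 6

print(mode, median, mean)
```

With an even number of values there is no single middle value, so the median is taken halfway between the two middle values.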
Variance & Standard Deviation
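The variance is a measure used to describe the spread of a data set:
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2$$
$$\sigma^2 = \frac{S_{xx}}{n}, \quad \text{where } S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$
The standard deviation is the square root of the variance:
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{S_{xx}}{n}}$$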
Generally, it is easiest to use the first form of the equation (without the Sxx) when you have raw data.
The second one (with Sxx) is best used when you can use a calculator to find out Sxx quickly.
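When working with frequencies, use this equation for the variance, σ², instead:
$$\sigma^2 = \frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2$$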
Again, standard deviation, σ, is given as the square root of this.
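The different routes to the variance can be compared numerically. The sketch below assumes the variance divides by n (the number of observations) and computes it three ways: from the mean of the squares minus the square of the mean, from Sxx, and from a frequency table. The data and the function names are made up for illustration.

```python
import math
import statistics

def variance_raw(data):
    """Mean of the squares minus the square of the mean."""
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

def variance_sxx(data):
    """Sxx / n, where Sxx = sum of x^2 - (sum of x)^2 / n."""
    n = len(data)
    s_xx = sum(x * x for x in data) - sum(data) ** 2 / n
    return s_xx / n

def variance_freq(values, freqs):
    """Variance from a frequency table of values and their frequencies."""
    total_f = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total_f
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / total_f
    return mean_of_squares - mean ** 2

# Hypothetical raw data, and the same data as a frequency table.
data = [2, 3, 3, 5, 7, 10]
values, freqs = [2, 3, 5, 7, 10], [1, 2, 1, 1, 1]

std_dev = math.sqrt(variance_raw(data))  # standard deviation = sqrt(variance)
print(variance_raw(data), variance_sxx(data), variance_freq(values, freqs), std_dev)
```

All three functions should agree with `statistics.pvariance(data)`, which also divides by n, up to floating-point rounding.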
Another way to describe the spread of a data set is to use ranges.
'The' range is the difference between the largest and the smallest value
Interquartile range is t