Origin of Statistics
The word “statistics” derives from the Latin word “Status” and the Italian word “Statista”,
both of which mean “political state”. In those days, the scope of statistics
was limited to the interests of rulers and the fulfillment of their needs.
Statistics was mainly used for assessing population, military potential, state-owned wealth,
agricultural production, taxation, etc. Statistics has developed gradually
over the last few centuries, most rapidly toward the end of the nineteenth century. With the
advent of electronic computers and other computing facilities, it developed significantly
during the 20th century.
It is obvious that, at present, almost all day-to-day human activities are directly or indirectly related to statistics.
Some examples are listed below:
- Leading stores use statistics to plan their stock of consumer goods.
- Producers use statistics to test the quality of their products.
- Insurance companies use statistics to calculate the risk factor that determines the premium rate for a location of interest.
- Medical scientists use statistics to test the validity and effectiveness of drugs before they are released to the market.
- Political analysts use statistics to predict the winner of an election.
- Many more …
Definition of Statistics
The science of statistics deals with the study of the principles and the methods applied in collecting,
organizing, classifying, analyzing, presenting and interpreting the data in any field of inquiry.
In other words, statistics are statements of facts that are capable of analysis and interpretation.
Branches of Statistics
There are two main branches of statistics: descriptive and inferential.
- Descriptive statistics are used to describe the basic features of the dataset in a study. This branch deals mainly with collecting,
classifying, summarizing, and displaying the data.
- Inferential statistics, on the other hand, tries to infer what the population looks like from the sample data. It focuses on estimating
the population parameters, making use of the descriptive statistics.
A researcher is interested in studying housing prices in a certain locality. There may be so many houses in the locality that the researcher cannot enumerate
all of them because of time and resource constraints. The researcher takes a sample of a few houses from the locality and collects the prices
of the sampled houses, then classifies the data using statistical tools and finds summary values of the collected data. This process of collecting,
classifying, and summarizing the data is the descriptive branch of statistics. The researcher then analyzes the collected data further, making use
of the descriptive statistics, to infer population parameters, for example to forecast future house prices. This process of inferring population
parameters with the help of descriptive statistics is the inferential branch of statistics.
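The two branches in the house-price example can be sketched in a few lines of Python. The prices below are made-up values used only for illustration, and the interval relies on a normal approximation (z = 1.96), which is a rough assumption for a sample this small rather than a recommended method.

```python
import math
import statistics

# Hypothetical sale prices (in thousands of dollars) for 10 sampled houses.
prices = [210, 245, 198, 305, 260, 232, 275, 220, 250, 285]

# Descriptive statistics: summarize the collected sample.
sample_mean = statistics.mean(prices)
sample_median = statistics.median(prices)
sample_sd = statistics.stdev(prices)  # sample standard deviation (divides by n - 1)

# Inferential statistics: use the sample to estimate the population mean.
# Rough 95% interval based on the normal approximation (an assumption).
margin = 1.96 * sample_sd / math.sqrt(len(prices))
interval = (sample_mean - margin, sample_mean + margin)

print(f"mean={sample_mean}, median={sample_median}, sd={sample_sd:.1f}")
print(f"approximate 95% interval for the population mean: "
      f"({interval[0]:.1f}, {interval[1]:.1f})")
```

The first three values describe the sample itself. The interval is a statement about the unobserved population.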
Terminology used in Statistics
Statistical studies focus on a particular group of interest called the population.
It is the entire collection of items within the realm
of study, which is difficult to handle and is rarely observed in full. The term population has a wider scope
in statistics than in ordinary vocabulary, where it refers only to a count of people. In the
statistical sense, the population encompasses all persons and things defined in the domain of study about which inferential
decisions are to be made.
In a household survey, the population can be restricted as follows, according to the “particular group of interest”
within the scope of the researcher’s study:
- All the households in the USA
- Households in a state, city, county, etc.
Similarly, in a study of the average GPA of students, the population can be restricted as follows:
- At universities throughout the country, state, or city
- At a college
- At a high school
- In a particular class, etc.
Likewise, in a study of the average salary of professors at public universities, the population can be considered as:
- In the entire country
- In a particular state
- In a particular city, etc.
Population parameter :
A numerical description or summary of a particular population characteristic is called a population parameter.
Summaries of a population such as the mean, median, variance, standard deviation, minimum value, and maximum value
are population parameters.
Special notes on parameters:
- A parameter is a fixed number.
- It is often an unknown quantity, because it is usually impossible to enumerate each and every individual unit in the population.
- Generally, a population parameter is estimated from a sample.
As stated earlier, the population is rarely observed in full. When every individual unit in the population is enumerated,
the study is known as a census: the data are obtained from each and every member of the population. Most
countries conduct a census of their population at regular intervals, commonly every 10 years.
A sample is a subset of a population.
As already mentioned, studying a population at the individual level (each and every unit in the population) is rare,
so the parameter values are almost always unknown. A representative portion of the population, one that appropriately reflects the whole,
is therefore a viable basis for estimating population parameters. In short, a sample is a subset of the population
from which data are collected for analysis and for making inferences about population parameters. It is therefore the main focus of the study.
One must be very careful when choosing the sample so that the population is well represented. The sample should be random in nature, that is, each member of the
population should have an equal chance of being selected. The sample data should also be well preserved so that the results of the study are meaningful.
A researcher wants to know the average income of households in Dallas County. All the households in Dallas County form the population. The researcher surveys
400 households and records each household's income. The incomes of the 400 surveyed households are the sample for the study.
Sample statistic :
We have defined a population parameter as a measure that describes a population characteristic. A similar measure that describes a sample characteristic
is a sample statistic. Examples are the sample mean, sample median, sample variance, sample standard deviation, and the minimum and maximum values in the sample.
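The difference between a parameter and a statistic can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the population is generated artificially (10,000 household incomes drawn from a normal distribution), whereas a real population is rarely available in full.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: annual incomes of 10,000 households.
population = [random.gauss(60000, 15000) for _ in range(10000)]

# Population parameter: a fixed number, usually unknown in a real study.
population_mean = statistics.mean(population)

# Sample statistic: computed from a random sample of 400 households.
sample = random.sample(population, 400)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.0f}")
print(f"sample mean (statistic):     {sample_mean:.0f}")
```

Drawing a different sample changes the statistic, while the parameter stays the same.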
For our study, we consider a population and concentrate on a portion of it called a sample. A variable is a name given to a certain characteristic of a population or sample.
As the name suggests, it varies: the value of the variable changes from one individual member of the population or sample to another.
Age is a variable whose value changes from member to member. Similarly, height, weight, grade, income, expenditure, etc. can be considered
variables. The name of a variable is the researcher's choice, but it should be a single word and follow the usual variable-naming conventions.
The information collected on a specific variable constitutes data.
Suppose we collect the GPAs of a class of students. GPA is the variable name. Each student in the class has a GPA, so the collection of all the GPAs in the class forms the data.
Data classification :
Data collected in a statistical study can be classified into different categories. Initially, we can classify a data set into two subdivisions: qualitative and quantitative.
Qualitative data, also referred to as categorical data, consist of data values that are labels or names.
- Gender is a variable name and the values of the variable are "male" or "female".
- EyeColor is a variable name and the values of the variable are "blue", "brown", "green", etc.
- Logical is a variable name and the values of the variable are "true" or "false".
- Rank is a variable name and the values of the variable are "small", "medium", "large" or "first", "second", "third", etc.
Note: Variable values such as zip codes and players' jersey numbers are numerical, but they are qualitative. These numerical values are used only
for identification. Mathematical operations like addition and subtraction are not meaningful for qualitative data.
Quantitative data reflect a notion of magnitude that is measurable, so quantitative data are numerical, such as counts or measurements.
Examples of quantitative data include
the age, height, and weight of a person, which can be measured as numerical values, as well as the grade point average of students, the average rainfall of a locality, etc.
A student's grade is quantitative data if it is recorded as a numerical score, but qualitative data if it is recorded as a letter grade. Mathematical operations like addition, subtraction, and multiplication
are valid for quantitative data.
Quantitative data can be further classified into two subdivisions: discrete and continuous.
Discrete data are count data, so they are also called countable. They can take only particular values and cannot take values in between. They jump from one value
to the next, leaving gaps between the values. Examples are the number of telephone calls we receive in a day, the number of children in a family, and the score in a basketball game.
These are typically integer values: we do not receive a fractional number of telephone calls. Continuous data, on the other hand,
can take any value in a given range of numbers,
typically an interval. Examples of continuous data are measurements such as the height of a person, temperature, and the rainfall of a city.
Levels of Measurement :
A further step in data classification is the level of measurement, which Stanley Smith Stevens introduced in the 1940s. There are four levels of measurement:
nominal, ordinal, interval, and ratio.
Of these four levels, nominal and ordinal belong to qualitative data, and interval and ratio belong to quantitative data.
Nominal-level data consist of names and labels only, such as gender with two levels (male and female), eye color (blue, brown, green), or brand name (Ford, Toyota, Nissan, etc.).
Ordinal-level data are also labels or names, but at this level the data can be arranged in a meaningful order, such as ranks (first, second, third, ...) or t-shirt sizes (small, medium, large,
and extra large). Even when ordinal data look numerical, mathematical calculations such as addition or division do not make sense. In short, ordinal data have all the attributes of nominal data and, in addition,
a meaningful order.
The next level of measurement, the first for quantitative data, is the interval
level. As with ordinal data, data at this level can be ordered, but here the difference between two values is
meaningful and addition and subtraction are valid, whereas for ordinal data they are not. Temperature in degrees Celsius or Fahrenheit is a good example of this level.
The last level of measurement for quantitative data is the ratio
level. It is similar to the interval level, but here ratio comparisons are meaningful and zero has special significance. Objects can be compared
by saying that one is a specified number of times bigger or smaller than another. The value zero at this level indicates the absence of the quantity. For example, with age, one person can be 2 times as old as another
person, and an age of zero means that no time has been lived.
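The four levels and the comparisons each one supports can be summarized in a short sketch. The class, function, and variable names below are illustrative choices, not part of any standard library for measurement scales.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of measurement, from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def is_quantitative(level):
    # Interval and ratio data are quantitative; nominal and ordinal are qualitative.
    return level >= Level.INTERVAL


def has_meaningful_order(level):
    return level >= Level.ORDINAL


def has_meaningful_differences(level):
    return level >= Level.INTERVAL


def has_meaningful_ratios(level):
    return level == Level.RATIO


# Hypothetical variable names, classified as in the examples in the text.
examples = {
    "eye_color": Level.NOMINAL,
    "tshirt_size": Level.ORDINAL,
    "temperature_celsius": Level.INTERVAL,
    "age": Level.RATIO,
}

for name, level in examples.items():
    kind = "quantitative" if is_quantitative(level) else "qualitative"
    print(f"{name}: {level.name.lower()} ({kind}), "
          f"order={has_meaningful_order(level)}, "
          f"differences={has_meaningful_differences(level)}, "
          f"ratios={has_meaningful_ratios(level)}")
```

Each level keeps the properties of the levels before it and adds one more, which is why a simple ordered enumeration is enough here.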
The Process of Statistical Study :
In the previous section we learned how to classify data. In this section, we focus on the basic process of a statistical study with respect to the researcher’s objective in the field concerned.
That is, the researcher has a question in mind and seeks a valid answer that fulfills the objective of the study. For example, a researcher might ask
“What is the trend of the housing market after the recession?” or
“Does a dose of aspirin each morning reduce the risk of heart attack?”
There are various ways to answer such questions, but a statistical study generally follows the recognized steps listed below.
- Determine the design of the study:
  - Define the objective of the study, stating the question to be studied
  - Specify the population of study, identify the domain, and recognize the variable(s) to be studied
  - Determine the sampling method for selecting a representative sample
- Collect the data
- Organize the data
- Use an appropriate statistical tool to analyze the data and answer the question that meets the objective of the study
The following examples illustrate how to start designing a statistical study.
Has the new method of teaching adopted in the college improved the standard of education?
Population: Registered college students
Variable: GPA
Did the discounted price of certain commodities during Christmas increase sales?
Population: Prospective buyers of a particular commodity
Variable: Type of commodity, price, sales volume
Once the objective of the study is defined and the population is identified, the next step is to collect the data. Since enumerating the whole population is usually not viable, a sample serves as a reasonable
representation of the population for collecting data. Before collecting the data, however, we need to select an appropriate sampling method that ensures a representative and randomized sample of the
population to be studied.
Of the several sampling methods available for collecting samples from a population, a few are briefly discussed below. The choice of method depends on many factors, such as the type of population under study,
the study question, the available resources, the geographic region, etc.
- Random Sampling: In this method of sampling, every member of the population has an equal chance of being selected. A lottery is a good example: an
identification number is assigned to each member of the population, and then members are selected at random.
Example: Drawing a name out of a box to select the winner of a lottery.
- Simple Random Sampling: In this method, every sample of a given size from the population has an equal chance of being chosen. Conceptually, one lists all possible
samples of a predetermined size and then chooses one sample at random from this list.
- Stratified Sampling: In this method, the population is divided into two or more subgroups, called strata, such as age group, gender, region, or education level, and members are then sampled from every stratum. This method ensures
that each subgroup is represented in the sample. With other methods, it may happen that not even a single member of some subgroup is included in the sample.
If a certain percentage of the sample is preassigned to each group before sampling, this type of stratification is called quota sampling.
Example: Representation in the college student union from all levels (strata: freshmen, sophomores, juniors, and seniors).
- Cluster Sampling: This method is similar to stratified sampling in that the members of the population are divided into subgroups, called clusters. A sample of clusters is then selected randomly.
Unlike in stratified sampling, every member of the selected clusters is included in the sample.
Example: An educator randomly selects 10 schools from the Dallas Independent School District and asks each household served by the selected schools how many school-age children are in the home.
- Systematic Sampling: In this type of sampling, every nth member of the population is selected to form a sample.
Example: A quality control engineer selects every 5th bolt from the assembly line to inspect.
- Convenience Sampling: The sample is selected according to the convenience of the researcher, so this method may not yield a representative sample. In some cases,
it is preferred because it is personal and easy.
Example: A teacher asks questions only of the students sitting in the front row of the class.
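The sampling methods above can be sketched with Python's standard `random` module. The population of 100 students, the function names, and the random starting point in the systematic sample are assumptions made for this illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 students, 25 in each class year.
years = ["Freshman", "Sophomore", "Junior", "Senior"]
population = [{"id": i, "year": years[i % 4]} for i in range(100)]


def simple_random_sample(pop, n):
    # Every subset of size n is equally likely to be chosen.
    return random.sample(pop, n)


def stratified_sample(pop, key, n_per_stratum):
    # Group members into strata, then sample randomly within each stratum.
    strata = {}
    for member in pop:
        strata.setdefault(member[key], []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample


def cluster_sample(pop, cluster_size, n_clusters):
    # Split the population into clusters, choose whole clusters at random,
    # and keep every member of the chosen clusters.
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = random.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]


def systematic_sample(pop, step):
    # Take every step-th member, starting from a random offset.
    start = random.randrange(step)
    return pop[start::step]


srs = simple_random_sample(population, 12)
strat = stratified_sample(population, "year", 3)
clus = cluster_sample(population, 10, 2)
syst = systematic_sample(population, 5)

print(len(srs), len(strat), len(clus), len(syst))
```

Only the stratified sample is guaranteed to contain members of every class year. The other methods may or may not, depending on the draw.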
Basically, there are two ways to acquire data: by collection or by generation. We have briefly discussed sampling methods, which are techniques of data collection. Next, we shall discuss the design
of experiments, a process of data generation. But first, let us highlight two types of studies: observational studies and experimental studies.
Observational studies :
In this type of study, we observe or collect data that already exist (a secondary source of data).
Experimental studies :
In this type of study, data are generated
by actually performing an experiment (a primary source of data), which helps us identify cause-and-effect relationships.
Suppose a researcher wants to know the average age of college students across the nation. The researcher can use college records that already exist, so this is an observational study.
On the other hand, suppose a researcher wishes to determine whether flu shots actually help prevent severe cases of the flu. The researcher needs to perform an experiment to establish the cause-and-effect
relationship between flu shots and flu prevention, so this is an experimental study.
In an observational study, a researcher collects data that already exist in the population. A census is not always possible because of time and resource constraints. In such situations, a researcher must be content
with collecting sample data from the population. The collected data should represent the whole population. For this purpose, the researcher has to adopt an appropriate sampling
method, as discussed briefly above.
An observational study depends on existing data, which come in several types. Observational studies can be subdivided according to the type of data they analyze. Some of the subdivisions are listed below:
Cross-sectional study :
This type of study looks at data collected from a population at a single point in time, so it is relatively fast and inexpensive. The participants in the study
are selected based on particular variables of interest. Most polls and surveys fall into this category. These studies are especially useful in public health planning, monitoring, and evaluation.
An educator surveys 200 students at the end of the semester, after using technology-based teaching, to test whether the students’ GPAs have improved.
Longitudinal study :
In this type of study, data are collected repeatedly from a particular group over a period of time. It is particularly useful for evaluating the relationship between risk factors and the development
of disease, and the outcomes of treatments over different lengths of time.
A group of 300 patients is followed for 10 years in order to determine the long-term health effects of a kidney transplant.
Meta-analysis :
A meta-analysis is a study that compiles information from previous studies. A benefit of meta-analysis is that many smaller samples can be combined into a single larger sample. A drawback is that the combined study is only
as good as its weakest link. In other words, if you use poorly constructed studies, your study will not be able to draw strong conclusions.
Oceanographers study research on tsunamis dating from 1900 to 2000 to determine their effects on the ocean floor. Because the oceanographers are looking at multiple studies relating to the single variable of tsunamis' effects on the ocean floor,
this is a meta-analysis.
Case study :
A case study looks at multiple variables that affect a single event. When you want to examine a single case in depth, along with all of the possible variables associated with it, a case study is the most appropriate choice.
Meteorologists study the Indian Ocean tsunami of December 2004 to try to identify warning signs. In order to identify tsunami warning signs, meteorologists would most likely look at multiple variables relating to the 2004 tsunami.
Because they are studying several aspects of a single tsunami, it is a case study.
Experimental studies :
An experiment is a commonly used method for data generation. In an experiment, researchers apply a treatment
to a group of people or things, called subjects, and measure the response. (If the subjects are people,
they can also be referred to as participants.)
A treatment is simply some condition that is applied to a group of subjects for experimental purposes, such as asking one group of people in an experiment to take a vitamin. The variable that responds to the treatment is called the response variable.
The variable that causes the change in the response variable is called the explanatory variable.
The three main principles of experimental design are randomization, local control, and replication.
Randomization :
Randomization involves randomly allocating the experimental units across the treatment groups. For example, if an experiment compares a new drug against a standard drug, then the patients should be allocated to either
the new drug or the standard drug (the control) using randomization.
Local control :
Local control means controlling all factors except the ones under investigation. Local control, like replication, is a device to reduce or control the variation due to extraneous
factors and increase the precision of the experiment.
Replication :
Replication is where each treatment is assigned to many participants. In other words, the experiment is repeated on a large group of subjects.
The treatment group is the group of subjects to which the treatment is applied, while the control group
does not receive the treatment.
The control and treatment groups may be the same group of participants measured before and after the treatment is applied, as in a pretest/posttest scenario,
or they may be separate, similar groups. In order to generalize the results, it is important that the two groups are as similar as possible so that any difference
between the groups can be attributed to the treatment. This establishes that the treatment is the cause of any effects that are seen. One method that researchers use
to create similar groups is to randomly assign the volunteers to the two groups.
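Random assignment of volunteers to two groups can be sketched as follows. The participant IDs and the group size of 20 are hypothetical values chosen for the illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical participant IDs for 40 volunteers.
volunteers = [f"P{i:02d}" for i in range(1, 41)]

# Shuffle a copy of the list, then split it in half, so that chance alone
# decides who ends up in which group.
shuffled = volunteers[:]
random.shuffle(shuffled)
treatment_group = shuffled[:20]
control_group = shuffled[20:]

print("treatment:", sorted(treatment_group))
print("control:  ", sorted(control_group))
```

Because the researcher has no say in who lands in which group, characteristics such as age or motivation tend to balance out between the groups on average.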
When researchers directly assign participants to the various groups, they are controlling for confounding variables, which are factors other than the
treatment that cause an effect on the groups. This is just one of many ways that researchers can control for confounding variables. If we are studying the effect of
a new weight-loss program, factors such as diet, heredity, exercise, and motivation might all affect the outcome of weight loss. These can all be confounding variables,
and the design of the experiment should try to control for these factors.
Another effect that researchers need to control for is called the placebo effect. Because people respond to suggestion, giving someone a drug for the common
cold and telling them that it will cure them will often produce effects caused by the suggestion alone and not by the drug itself. This response is known as the placebo effect.
To counteract the placebo effect, subjects in the control group are given a placebo, which appears identical to the actual treatment but contains no intrinsic
beneficial elements. For example, if the treatment were a small orange pill that contains vitamin C, then the placebo would be a small orange pill that does not contain vitamin C.
With both groups taking what they believe to be the treatment, the placebo effect will be the same for both groups. Thus, differences in the groups can then be attributed
to the actual treatment instead of the suggestion from the placebo.
In experiments that use a placebo, the subjects do not know whether they are in the control group or the treatment group. This is called a single-blind experiment.
In a single-blind experiment, the people interacting with the subjects know in which group each subject has been placed. Knowing which
subjects are taking the actual treatment can cause the researchers to subconsciously influence the results of the study through their interactions with the participants.
In a double-blind experiment, however, neither the subjects nor the people interacting with the subjects, such as doctors or nurses, know to which group each subject belongs.
Consider a study, in which neurologists want to determine if taking an intravenous dose of vitamin C will reduce the amount of nerve pain reported by patients.
Suppose that the study was narrowed to focus only on patients with the nerve disorder, multiple sclerosis (MS). After study approval, the neurologists solicit
volunteers who are patients with MS who are reporting nerve pain. The participants are then randomly assigned to two groups, each having 20 participants.
Participants in Group A are administered intravenous doses of vitamin C, and their nerve pain is tracked. Participants in Group B are administered intravenous doses
of saline (which has no active ingredients) and their pain levels are also tracked. The patients are not told which of the two groups they are in; however, the nurses
administering the IVs are aware of the group assignments. After a predetermined length of time, the amounts of pain reported by the separate groups are compared to determine
if an intravenous dose of vitamin C will reduce the amount of nerve pain.
- Identify the explanatory and response variables.
- What is the treatment?
- Which group is the treatment group and which group is the control group?
- What is the purpose of administering saline to Group B?
- Is this a single-blind or double-blind study? Do you think this is the best choice for this study?
- The explanatory variable is what "explains" the changes in the response variable. Since the neurologists are trying to determine if the dose of vitamin C can reduce nerve pain,
the explanatory variable is the dose of vitamin C and the response variable is the amount of nerve pain reported by each patient.
- The treatment is what is being applied to the group, so the treatment is the dose of vitamin C.
- The group that received the treatment of vitamin C, namely Group A, is the treatment group. The group that did not receive the treatment, Group B, is the control group.
- The saline that is administered to Group B is a placebo, and is administered to compensate for the placebo effect, so that all patients are responding to the same suggestion
that they are receiving treatment.
- Since the patients do not know the group assignments, but the nurses who are interacting with the patients do know to which group the patients were assigned, this is a single-blind study.
As reported pain is a subjective measure, it would be easy for nurses to unintentionally influence the patients' responses; thus a double-blind study would probably be a better design for this study.
Institutional Review Boards
Once a researcher has formulated a question and designed a study to explore that question, there is one more step that must be taken before actually gathering data.
The researcher must get approval to conduct the study from an Institutional Review Board (IRB), particularly in the academic and medical communities.
The IRB is usually made up of people from the institution with which the researcher is affiliated (such as a university or workplace), as well as people from the community
where the institution is located. The job of the IRB is to review the design of the study to make sure that it is appropriate and that no unnecessary harm will come to the
subjects involved. The IRB will require the researcher to fill out documents describing the proposed study in detail, including how issues of informed consent, human or animal subjects,
and confidentiality will be handled. Once the IRB approves the design of a study, the researcher is ready to begin collecting data.
When collecting data, researchers must get the informed consent of participants. Informed consent involves completely disclosing to participants the goals and procedures involved
in a study and obtaining their agreement to participate. However, getting a participant's informed consent is not as straightforward as it sounds. There are gray areas, such as
studies involving child participants or people with intellectual disabilities, in which questions arise about the meaning of informed consent and who should give it.
Furthermore, not all studies involve human subjects. Since animal subjects cannot give their informed consent to participate, it is the job of the IRB to provide that
consent on behalf of the animals. This means that the IRB will require more extensive documentation on study procedures involving animals and any possible harm that might
happen to the animals in the course of the study.
Studies involve a high level of trust between the researcher and the participants. Part of that trust is that any information acquired will be kept confidential.
An IRB will require evidence that any data gathered from participants will be kept in a secure location and that only people who need to see the raw data will be
allowed to do so. However, this does not mandate that the data must be gathered anonymously. For example, an educator studying the numbers of class absences and
their effect on students' final grades can keep the raw data confidential by keeping students' records in a locked file cabinet and only allowing authorized people
to see the data; however, it is likely that the educator will know which student is associated with what data. Therefore, although the data are kept confidential,
they are not collected anonymously.
How to Critique a Published Study
In the previous sections we discussed the researcher's interests and how to work with data. We also looked into design principles, to better understand the behind-the-scenes
setup of a study and to better identify the valid conclusions that can be drawn from the data. Now let us look at some other things to consider when a study's results are in hand.
Consider the source :
The word "source" should be understood in a wide sense that includes the funding source, the data source, and the sources of other resources.
In this connection, we might ask questions such as: Who paid for the study? Where were the data collected? When was the information collected? Who published the study?
These questions may be relevant in certain situations, though not necessarily while we are working on our own research project.
Consider the setup :
In this section we discuss some terminology that is helpful in a research study, especially in the design and analysis of experiments.
Bias means the favoring of a certain outcome.
Sampling bias :
As mentioned previously, the sample chosen for a study should accurately reflect the population. So an appropriate question to ask is, "Does it?"
If it does not, we say the results are biased because they do not accurately represent the population being studied. This type of bias is called sampling bias.
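A small simulation can make sampling bias concrete. The income ranges, the neighborhood sizes, and the idea of surveying only one neighborhood are all assumptions invented for this sketch.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population: 800 households in lower-income neighborhoods
# and 200 households in one higher-income neighborhood.
lower = [random.uniform(30000, 60000) for _ in range(800)]
higher = [random.uniform(90000, 150000) for _ in range(200)]
population = lower + higher
population_mean = statistics.mean(population)

# Random sample: every household has the same chance of being selected.
random_sample = random.sample(population, 100)

# Biased sample: only the higher-income neighborhood is surveyed.
biased_sample = random.sample(higher, 100)

print(f"population mean:    {population_mean:.0f}")
print(f"random sample mean: {statistics.mean(random_sample):.0f}")
print(f"biased sample mean: {statistics.mean(biased_sample):.0f}")
```

The biased sample overstates the average income because lower-income households never had a chance of being selected, no matter how many households are surveyed.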
Dropouts :
Participants who begin the study but fail to finish it. Dropouts reduce the size of your sample and can thus affect how representative your sample is of the population.
Processing errors :
Errors that occur simply from the data being processed, such as typos made when data are entered or illegible handwriting on a survey.
Although unintentional, such errors could sway the outcome and hence have a biasing effect.
Non-adherents :
Participants who remain in the study until the end but stray from the directions they were given at the beginning. For example, consider a participant
who was asked to exercise 30 minutes per day, did not do so faithfully, but tells the researcher that they did.
Researcher bias :
Occurs when the researcher influences the results of the study to favor a certain outcome. Researcher bias may be intentional or unintentional.
The researcher might intentionally choose a favorable sample or unintentionally influence the sample's responses by his or her actions.
Response bias :
Occurs when participants give answers that do not reflect their true feelings or experience. The researcher's facial expression, tone of voice, or physical proximity to the participant could all encourage the participant to respond with the answer
they believe the researcher wants, instead of their true feelings. Researchers should also be mindful of how they handle very sensitive information in a setting that
would make the subjects feel uneasy. Participants in an uncomfortable situation might be less likely to give truthful information. Response bias can also come from the participant,
not the researcher. Study questions that ask participants to remember, for example, how many caffeinated beverages they have had in the last week often lead participants to "make up"
an answer in order to participate in the study.
Participation bias :
Created when there is a problem with the participation, or lack thereof, of those chosen for the study.
Nonresponse bias :
Occurs when there is a lack of participation from certain segments of a population in a self-selected sample, when a person refuses to participate in a survey,
or when a respondent omits questions when answering a survey.