##
Elektronická kniha:
Modern Analysis of Biological Data

Autor:
Stanislav Pekár; Marek Brabec

Podnázev: Generalized Linear Models in R

Nakladatelství: | » Masarykova univerzita |

Dostupné formáty ke stažení: | |

Upozornění: | většina e-knih je zabezpečena proti tisku a kopírování |

Médium: | e-book |

Rok vydání: | 2016 |

Počet stran: | 256 |

Úprava: | 1 online zdroj (247 pages): illustrations (some color) |

Jazyk: | anglicky |

ADOBE DRM: | bez |

ISBN: | 978-80-210-8019-5 |

Ukázka: | » zobrazit ukázku |

Kniha je zaměřena na regresní modely, konkrétně jednorozměrné zobecněné lineární modely (GLM). Je určena především studentům a kolegům z biologických oborů a vyžaduje pouze základní statistické vzdělání, jakým je např. jednosemestrový kurz biostatistiky. Text knihy obsahuje nezbytné minimum statistické teorie, především však řešení 18 reálných příkladů z oblasti biologie. Každý příklad je rozpracován od popisu a stanovení cíle přes vývoj statistického modelu až po závěr. K analýze dat je použit populární a volně dostupný statistický software R. Příklady byly záměrně vybrány tak, aby upozornily na leckteré problémy a chyby, které se mohou v průběhu analýzy dat vyskytnout. Zároveň mají čtenáře motivovat k tomu, jak o statistických modelech přemýšlet a jak je používat. Řešení příkladů si může čtenář vyzkoušet sám na datech, jež jsou dodávána spolu s knihou.

(generalized linear models in R)Kvantitativní výzkum

### Elektronické knihy > Naučná a odborná literatura

### Katalog předmětový > K > Kvantitativní výzkum

### Katalog vybraných autorů > P > Pekár – Stanislav Pekár

### Katalog vybraných autorů > B > Brabec – Marek Brabec

### Katalog nakladatelství > M > Masarykova univerzita

### Elektronické knihy > Ostatní > Nezařazeno

MODERN ANALYSIS OF BIOLOGICAL DATA

GENERALIZED LINEAR MODELS IN R

STANO PEKÁR, MAREK BRABEC

MODERN

ANALYSIS

OF BIOLOGICAL DATA

GENERALIZED LINEAR MODELS

IN R

Masaryk University, Brno 2016

STANO PEKÁR

MAREK BRABEC

http://www.muni.cz/press/books/pekar_en

Pekár S. & Brabec M. 2016. Modern Analysis of Biological Data:

Generalized Linear Models in R. Masaryk University Press, Brno.

Th is book was supported by Masaryk University Project No. MUNI/FR/1304/2014.

Text © 2016 Stano Pekár, Marek Brabec

Illustrations © 2016 Stano Pekár

Design © 2016 Ivo Pecl, Stano Pekár, Grafi que

© 2016 Masarykova univerzita

ISBN 978-80-210-8106-2 (online : pdf )

ISBN 978-80-210-8019-5 (Paperback)

V

CONTENTS

Foreword 1 Introduction

1.1 How to read the book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Types of variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Statistical soft ware 2.1 Th e R Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Installation and use of R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Basic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11

2.4 Data frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Exploratory data analysis (EDA)

3.1 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23

3.2 Variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Confi dence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.4 Summary tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27

3.5 Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5.1 Distribution plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5.2 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.5.3 Box plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.5.4 Lattice plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.5.5 Interaction plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.5.6 Bar plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.5.7 Paired plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.8 3D plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.9 Plots with whiskers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40

3.5.10 Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 VI

CONTENTS

4 Statistical modelling

4.1 Regression model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2 General linear model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Generalized linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.4 Searching for the “correct” model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.5 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.6 Model diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .54 5 Th e fi rst trial

5.1 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 EDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3 Presumed model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.4 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.4.1 ANOVA table of Type I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.2 Nonlinear trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4.3 Removal of model terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.4.4 Comparison of levels using contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.4.5 Contrasts and the model parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4.6 Posterior simplifi cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4.7 Diagnosis of the fi nal model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6 Systematic part

6.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2 ANOVA and ANODEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.3 ANCOVA and ANCODEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.4 Syntax of the systematic part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7 Random part

7.1 Continuous measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7.2 Counts and frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.3 Relative frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 8 Gaussian distribution

8.1 Description of LM and GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

8.3 Weighted regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

8.4 Multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

VII

CONTENTS

8.5 Two-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

8.6 One-way ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 9 Gamma and lognormal distributions

9.1 Description of the Gamma model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

9.2 Description of the lognormal model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

9.3 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

9.4 Two-way ANODEV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

9.5 Two-way ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 10 Poisson distribution

10.1 Description of the Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

10.2 One-way ANODEV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

10.3 Overdispersion and underdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

10.4 Multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

10.5 One-way ANCODEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 10.6 Th ree-way ANODEV (Contingency table) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 11 Negative-binomial distribution

11.1 Description of the negative-binomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

11.2 One-way ANODEV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 12 Binomial distribution

12.1 Description of binomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

12.2 Two-way ANODEV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

12.3 Overdispersion and underdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

12.4 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

12.5 One-way ANCODEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

12.6 Binary one-way ANCODEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 References Index

Subject index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

R functions and their arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

I X

Th is book is meant especially for students and scholars of biology, i.e. biologists who work

in natural science, at agricultural, veterinary, pharmaceutical and medical faculties or at

research institutes of a similar orientation. It has been written for people who have only a

basic knowledge of statistics (for example, people who have attended only a Basic statistics/

Biostatistics course) but who need to correctly analyse the data resulting from their observa

tions or experiments.

Th e generally negative attitude of biologists towards mathematics is well known. It is pre

cisely why we have tried to write the book in a relatively simple style – with minimal math

ematical requirements. Sometimes, the task turned out to be easy, other times not that easy,

and sometimes it became almost impossible. Th at is why there are still some mathematical

equations in almost all chapters of the book (even though they are used in a simplifi ed form

in order to be more apprehensible to less experienced readers). Despite this fact, the book

includes much less mathematical and statistical theories than it is common for standard

statistical literature.

Th e book is mainly built on examples of real data analyses. Th ey are presented from the very

beginning to the end, from a description and determination of objectives and assumptions

to study conclusions. Th ey thus simulate (even though in a simplifi ed way) the procedure

usually used when preparing a paper for a scientifi c journal. We believe that practical experi

ence with data analyses is irreplaceable. Because of the anticipated biology-oriented readers,

we selected examples from the areas of ecology, ethology, toxicology, physiology, zoology

and agricultural production. All these data were analysed during various previous research

projects. Th ey have been adjusted in order to suit the pedagogical intentions of this book.

For example, the original long and complex Latin names of species have been replaced with

a generic short name (e.g., specA).

Finally, we would like to thank all our colleagues without whose help this book would never

have been written. First of all, we would like to thank to Vojtěch Jarošík (in memoriam), for

introducing GLM to the fi rst author of the book during his studies at the university, thus

igniting his interest in statistics generally; Alois Honěk for many consultations, and our col

leagues from the Crop Research Institute in Prague-Ruzyně and the students of the Faculty

of Science of the Masaryk University in Brno for inspiring comments to the original text

of the book. Finally, we would also like to thank the following colleagues of ours who have

kindly let us use their (though adjusted) data for presenting examples in this book: T. Bilde,

A. Honěk, J. Hubert, D. Chmelař, J. Lipavský, M. Řezáč, P. Saska and V. Stejskal.

FOREWORD

X

We welcome any comments regarding the text and content of the book. Please direct them to

the following email addresses: pekar@sci.muni.cz and/or mbrabec@cs.cas.cz.

December 2015

Stano Pekár

Marek Brabec

FOREWORD

1

INTRODUCTION

1

Let us start with a demonstration of two standard situations. Th e fi rst one took place

aft er a thesis defence when a student complained to another student: “Supposedly I used

the wrong statistical test.” Other situations can occur, for example, in a hallway of a re

search institute where a biologist reproaches a colleague for the way he/she presented

his/her results: “Th e data should be analysed more properly.” Both situations have one

thing in common – a desperate reference to the statistics. Indeed, statistical data analy

sis forms an integral part of scientifi c publications in many biological (and other) fi elds,

thus accompanying bio-logists throughout their careers. In some fi elds, such as taxono

my, statistical analyses may play just a marginal role. For other fi elds, such as ecology or

physiology, it is oft en almost a cornerstone of many new fi ndings. In these fi elds, short

comings in statistical analysis can have catastrophic consequences. It can easily happen

that a report on a good experimental study (e.g., in the form of a scientifi c paper) will

not be complete without a statistical analysis, in which case the results of the study can

become completely useless and all the previous eff ort of its authors can thus be wasted.

Th e only way to prevent such disasters is to strive to understand the statistics or, at least, to

fi nd somebody who understands it. Obviously, conducting practical data analyses are much

easier today than ever before. Th is is due to the development of personal computers and sub

sequent developments in computational algorithms, which have literally meant a revolution

for data analyses. Fift y years ago, even a simple statistical analysis, using a calculator and

a pen, would take several hours and sometimes even days (while, at the same time, it was not

easy to avoid calculation errors etc.). Today, using computers, even a relatively complicated

analysis can take less than a minute or even just a few milliseconds. Preparing a plot is oft en

easier and faster than, for example, preparing a coff ee. However, technical improvements

have led to increased demands for using adequate statistical methods. While simpler stat

istical methods were preferred in the past, despite the fact they were not the most suitable,

today the emphasis is put on using methods that correspond in the relevant aspects to the

actual data at hand as closely as possible. Th is is because the computing requirements do

not represent an insurmountable obstacle any more. Unlike the former “universal” simple

procedures, application of statistics today utilises such methods and models that realistically

consider the important characteristics of the data and of the studied processes. Very oft en

this means that a given method or a statistical model can and should be adjusted to the real

situation at hand. In short, models should be adjusted to the data and not vice versa!

Nevertheless, it is obviously not always easy to comply with this requirement. In fact, cre

ative and useful practical analyses need theoretical knowledge about numerous models and

methods as a pre-requisite. Moreover, certain experience with practical data analyses, with

2

the application of various models on real data, with model building as well as estimation

procedures, with conducting appropriate tests, etc. is also necessary. It is clear that one can

not become an experienced data analyst only from reading books. To get such experience,

you have to put in some work and, most of all, some thinking. Guidance and examples can

assist you in this process.

Th is book attempts to help exactly along these lines by presenting examples of particular

analyses, including the specifi cation of the problem of interest, development of a statistical

model and formulation of possible conclusions. Th e book is not, and does not even aspire

to be, a manual for selecting the “best” method for a particular data analysis (it is our strong

opinion that, for various reasons, such a manual cannot be ever made). Instead, the book

tries to demonstrate how to think about particular statistical models, how to use them and

also how not to use them – to point out many of the problems and errors that can occur

(and do indeed occur in practical analyses and even in published papers). We demonstrate

various general approaches on particular (hopefully biologist-engaging) examples and we

also present their detailed implementation in the R language, thus allowing everybody to

try them for themselves using the data provided in the book as well as their own datasets of

a similar nature.

Since regression is a very powerful instrument, used very oft en in biological studies, this

book is almost exclusively dedicated to regression models. It assists in solving important

questions of the following type: how does variable y depend on variable x or how can we

predict the value of a new observation when we know the value of one or several so-called

explanatory variables. However, we will talk about regression in a somewhat wider context

compared to what you know from basic statistics courses. Namely, we will use the concept

of the so-called Generalized Linear Models (GLM). Th eir name explicitly emphasises that

they are generalized linear regression models – that is, generalisations of models that many

people know under the term of general linear model. Th e GLM generalization is very useful

from a practical point of view since it allows correct analyses of various data that cannot be

processed using a standard (linear) regression without (very) rough approximations and

(crude) simplifi cations. GLM represents a relatively large class of models with a unifi ed stat

istical theory and very universal computational implementations. It is a class by the means

of which you can analyse a wide range of data, such as weight measurements, concentration,

number of individuals, or relative frequency of a phenomenon (i.e. various continuous un

restricted, continuous positive, discrete positive data).

From a formal point of view, we will use only univariate GLM models (or statistical

methods based on them), and only those that are suitable exclusively for independent mea

surements (observations). In order to analyse data with some kind of correlation (for ex

ample, in the case of repeated measurements on the same individuals), somewhat diff erent

and more complex models and methods need to be used. We address these in another book

(Pekár & Brabec 2012).

As we have already stated, the objective of this book is to assist practical users of statistics

in the process of formulating and using statistical models – and thus also in selecting suit

INTRODUCTION

3

able methods for analyses of various data types. At the same time, we would like to motivate

them to seek assistance from a professional statistician if they fi nd that the complexity of

the model, study design and/or data complexity exceed their abilities and experience with

statistical modelling. Th ings are not always as easy as they may appear at fi rst. Th e same

data can be analysed using various methods, depending, for example, on the objectives of

a particular analyst. In fact, data include a large amount of information but we are oft en

interested only in extracting just some part of it at a given moment. Th at is why the method

selected depends on what we regard as salient features of the data (and what we are willing to

leave out), as well as what (and for what purposes) we want to extract from the observations/

measurements. Nevertheless, it is oft en the case that multiple procedures (with diff erent

characteristics) can be used for analysing even a single aspect. In this book, we will mostly

present only one procedure without denying the existence of other approaches. Otherwise,

the book would become signifi cantly longer.

We have applied a similar approach when describing the R language and discussing the use

of particular objects. Th e R language is, in fact, very rich. Like in any other programming

language environment, the same result can be oft en obtained in several ways. Th ere is not

enough room here to list them all (if you are interested, you can consult the corresponding

manuals, e.g. Zoonekynd 2007). We chose those that we consider to be the easiest and most

practical for a beginner.

1.1 How to read the book

Th e text of this book combines examples of selected statistical methods and descriptions

of the R language as a statistical environment. Both are then illustrated on practical data

analyses.

How to read this book? It depends on your knowledge and experience with statistical analy

ses as well as with the R environment. Th ose of you who have not done any (or almost

any) data analysis since they attended their basic statistics/biostatistics course and who have

never worked with R before should read the book from the beginning to the end. You who

already know the environment R (at least the basics) and have been using regression in your

work, can read the book somewhat non-systematically – picking the chapter that deals with

data and/or methods you are currently interested in.

Th e most important part of the fi rst chapter is the section 1.2, which defi nes variables. You

certainly should not skip this part even if you believe that you are clear about distinguishing

among variable types. Th is is because our defi nitions may diff er from the defi nitions you

may have encountered in other books and/or courses.

Chapter 2 describes the installation and use of the R soft ware, which we will be using in this

book for various data analyses. If you have never heard about R (or you have heard about it but

never actually worked with it), this chapter is here especially for you. Apart from the instal

lation of the soft ware, you will also learn both some general principles for working in the R

environment and important commands used repeatedly for various data manipulations later

1.1 HOW TO READ THE BOOK

4

in this book. Moreover, the chapter explains how to enter data into the R environment. All of

the above is absolutely essential for being able to use the soft ware successfully, not only for

practicing the examples of this book, but also for practical data analyses of your own.

Chapter 3 shows some of the methods of so-called exploratory data analysis. Th is analysis

means calculation of the basic statistical characteristics of the examined data sets as well

as their informal comparison using plots, tables and other instruments. You will fi nd there

a general description of several useful R commands, by means of which you can create the

majority of plots, with which we will work in later chapters.

Th e next four chapters (Chapters 4-8) are oriented rather more generally and even include

some theory. Do not be misled and do not skip them thinking that they are not important

from a practical point of view! In fact, they are some of the most important ones. Th is is

because you can use their content for data analyses even if you use diff erent soft ware or, for

example, for general considerations related to the strategy of statistical analyses. Particularly,

Chapter 4 addresses the issue of how to work with statistical models of the type that will be

used later in the book. Model formulation (traditionally in the form of a mathematical for

mula) forms a steppingstone, on which statistical analysis is based. In this chapter, we will

talk about regression models and we will discuss what GLM actually is.

Chapter 5 presents the fi rst example of a concrete and non-trivial data analysis. Th e pre

sented example is not as simple as motivational examples usually are. It is more demanding,

which allows us to discuss more of the aspects you will encounter when working with other

examples (and in practical analyses of your own). Th e analysis is described in detail and

commented abundantly (with references to the basic rules, which analysts should observe).

In addition to the previous, some R outputs are described and interpreted and important

defi nitions are stated (for example, the defi nition of contrasts).

Chapter 6 then goes deeper and deals with various systematic parts of a GLM model. It de

scribes basic model classifi cation based on the character of the explanatory variables. However,

most importantly, it demonstrates in general terms how to translate a mathematical model

into the R language and suggests how to interpret the model aft er the analysis is done.

Chapter 7 can be used as a simple “key”, based on which readers (beginners) can try to

decide which particular method to use (you will see yourself that, once you acquire more

knowledge and experience, this decision will require even more thinking). We assume that

the reader will be coming back to the book to use it as an aid for his/her own data analyses.

If the reader remembers the general knowledge from Chapters 1–6, he/she can go straight

to Chapter 7, which will direct him/her to a chapter where he/she will fi nd an analysis of an

example similar to his/her own; however, if you cannot decide which method to use, con

tinue reading subsequent chapters in the order they are presented. Either way, we certainly

recommend that you initially read Chapters 8–12 completely.

Chapters 8–12 are based on several examples, but they explain also theory, which you will

encounter during these (and other) analyses. In general, the analyses of all examples from

INTRODUCTION

5

these chapters follow a similar plan. Th e chapters are written in a way that makes them mu

tually independent (they can be read separately). As a result of this, you encounter similar

problems repeatedly. When you pick a chapter, you should always read it as a whole. Do not

try to follow it merely as a concrete report of a particular analysis.

1.2 Types of variables

Defi nitions and a detailed description of the characteristics of various variable types form

the topic of a basic statistical course, which we certainly do not intend to repeat here. Let us

just remind you of a few important facts that will be the most useful later in the book.

Th ere are several possible viewpoints based on which variables can be classifi ed. For our

purposes, we will only need the following variable types:

Response variable: a (random) variable, the variability of which we attempt to explain by

means of a statistical (regression) model. In other words, it is a variable that we want to

model using a single explanatory variable or multiple explanatory variables. In this book we

will exclusively address univariate models. Th at is, we will always model only one random

response variable at a time (in contrast to multivariate models with several response vari

ables considered in one model simultaneously). Depending on the particular GLM model

type, our response variable will be either continuous or discrete, but always numeric (i.e. the

values are numbers).

Explanatory variable: a variable by the means of which we explain the values of a response

variable. Even in the case of univariate models, a response variable can be modelled by a single

explanatory variable or by multiple explanatory variables. Th ese can be either numeric or

categorical (the values are characters and character chains that correspond to a given code

marking groups/categories). Numerical variables can be continuous or discrete. We will

mark continuous explanatory variables using lowercase letters, for example x. We will mark

their value for the ith observation using index i (e.g. x

i

). We will call continuous variables

covariates. We will mark categorical variables using uppercase letters (for example A).

a categorical variable can always take at least two diff erent values, which we call levels (for

example, “male“ and “female“). Th en the jth level of such a variable is marked with index j,

for example as A

j

. a categorical explanatory variable (with categories denoted by characters

or numbers) will be also called a factor, as in the analysis of variance ( ANOVA).

Weights represent a special case of variable. Th ey determine relative weights of individu

al observations within a given data set. By default, functions in R use the same (unitary)

weights for all observations. Externally entered weights are useful when a scheme with equal

weights is not satisfactory and we need to change it. For example, weights can be based on

known relative accuracy of observations (when data are averages of samples of diff erent size,

the sample sizes are oft en taken as weights). Th e weights must take non-negative values. Zero

weights are possible, but somewhat extreme (they exclude the given measurement from the

analysis), and we do not recommend using them (selected observations can be excluded

from the analysis in a much more elegant manner).

1.2 TYPES OF VARIABLES

1.3 Conventions

Th e book uses several font types. We use them for distinguishing the basic text of the book

from soft ware commands (and other key words) of the R language. For the names of com

mands and their arguments, we use the bold Courier New font. Th e names of objects that

we create during analyses are typed in the normal Courier New font. Th e other text is

typed in the Times New Roman font. Names of variables, parameter values and mathemat

ical formulas are written in italics, while the names of factor levels are enclosed in quotation

marks. Th e names of packages are underlined.

For transcribing everything that takes place in the command window of the R environment,

we use the Courier New font of a smaller size. For better orientation, we diff erentiate

between commands entered by the user, which we write in bold (for example, a <-1:5 ),

and program responses in normal style (for example, a). To save space, some rows of output

were cut off and substituted with three dots.

Plots made at the beginning of analyses were created using as few commands as possible and

that is why they oft en do not have appropriate captions, legends, etc. Only plots made at the

end of the analysis are closer to real presentation-quality and hence include various details

(at the cost of longer commands).

Th e use of the natural logarithm (with base e) is prevalent in the statistics. We will thus label it simply by log. You need to be aware of the fact that this symbol can have a diff erent meaning elsewhere (for example, in MS Excel it means the decadic logarithm – with base 10).

INTRODUCTION

7

Th ere are many commercial as well as non-commercial programs available for data analyses.

Th ey widely diff er by their quality, extent and price. Many inexperienced users work just

with the basic soft ware, for example, MS Excel. Some of the popular specialised programs

include Statistica or JMP. And fi nally, there are large (and expensive) packages, such as SPSS

or SAS. Th e diff erence among statistical soft ware packages is not only in the number of im

plemented analysis types, but also in their fl exibility, i.e. in the possibilities of user program

ming and other “customised” modifi cations and the automation of selected procedures. A

simple analysis can, of course, be executed using any of the statistical soft ware, however,

more advanced methods are off ered only by specialised packages.

2.1 The R environment

Th e good news is that one of the best statistical soft ware packages for modern data analyses

is available completely for free. It is called R (R Core Team 2015) and this book is based on its

extensive possibilities. It actually covers a much wider area than is presented in this book.

Th e R programming environment is similar to the commercial program S-Plus (© Insightful

Corporation), which used the S programming language; it was developed in the 1980’s in the

American AT&T Bell Laboratories (it has been improved several times since then). R uses

the dialect of the S language in combination with the Scheme language, which means that,

from the user’s point of view, the overwhelming majority of the language for R and S-Plus

are either the same or similar. R was created by New Zealanders Ihaka & Gentleman (1996).

Today, it is managed by a group of people from around the globe, who call themselves the

R Core Team. Th anks to that, the development of R is extremely dynamic. It is constantly

evolving and self-improving.

On its own, R is not a user-friendly program of the MS Excel or Statistica type. You will not

fi nd nice, colourful windows, pull-down menus and clickable buttons, basically an environ

ment where the main control instrument is the mouse. Be prepared that when you open R,

you will only see an empty grey window (Fig. 2-1). R is mainly controlled by commands

entered on a keyboard.

Th ere are specialised environments available which can provide a standard user higher com

fort when using R (for example, the Rstudio downloadable from http://www.rstudio.com/).

As we are concentrating on the GLM models and data analysis here, we will not use them

in this book.

2

STATISTICAL SOFTWARE

8 You may ask why we have chosen such “unfriendly” soft ware. We have several reasons for it: • R is one of the most extensive statistical packages that contain all essential types for mod

ern analyses, which are continuously amended thanks to the continuous work of many

people from around the world. Commercial statistical soft ware is usually a closed system

(it can be expanded only by purchasing yet another version), while R has been built as an

open system. It means that new and additional methods can be added easily and for free

at any time. • Control of this program is not (for a non-statistician) that easy. However, that can be, para

doxically, an advantage because it forces its users to acquire certain knowledge about what

they are doing and to think about it (at least) a little bit. Data analyses using commercial

soft ware can be somewhat “dangerous” since overly-friendly soft ware allows people who

have no idea what they are doing to perform an analysis by a sequence of more or less

random clicks. Needless to say that the results of such analyses are oft en misleading or

completely wrong. • Yet another reason is that “friendly” statistical soft ware will produce a huge volume of

information, which they will spout out to you once the analysis is completed. It may not be

easy for you to make sense of it. Th e philosophy of R is completely diff erent – it will display

only what the user asks for with the commands. Th is approach is based on the assump

tion that users will only use commands that they know something about or, alternatively,

that they will look for in help pages or reference literature, thus being stimulated to study

the given topic. By default, R outputs are typically quite modest in size. And, in fact, that

STATISTICAL SOFTWARE

Fig. 2-1 Window of the R environment.

9

is good indeed. As you will see, it is easier to make general sense of the analysis this way.

At any rate, specifi c details can be explicitly requested from the R environment later, if

needed. • One of the strengths of R lies in powerful modern graphics, focused on transparent and

effi cient data presentation. Taken together, R is certainly not a hostile program. You will see while reading this book and while analysing your own data that, quite to the contrary, R control is relatively easy and, most of all, very effi cient. To make your learning easier, we have described and explained the way to call various important procedures (and their outputs) in detail. As we have already stated above, R is an environment used for handling objects. Objects can be data, results of their analyses (or results of various intermediate computations and manipulations), and handling means mathematical and statistical computations, manipulations, and constructions of tables and plots, etc. In reality, R is more than just a statistical package: it is a powerful programming language (for object programming), similar to C, C+ or Java. In order to successfully use R, you need to, among other things, learn its basic control commands. Making you familiar with these basic commands is the main objective of this chapter. It is the basic R philosophy that you can learn more details later when you need them. However, fi rst of all, you need to install R on your computer. 2.2 Installation and use of R Th e R installation fi le can be found on the Internet at http://www.r-project.org/. You can get it via the Download dialogue box where, upon clicking on CRAN, a list of servers, from which you can actually download the fi le, is displayed. Upon selecting a server, a window comes up where you choose the platform on which you will work with R: Linux, Mac OS or MS Windows. All calculations in this book have been done in MS Windows (use of R in alternative platforms is quite similar). Apart from the latest version of the installation fi le of R, the base folder also includes several info fi les. What we need now is the installation fi le with EXE extension. New versions of the soft ware are uploaded in a relatively regular manner every few months. At the time of writing this book, the latest available version was 3.0.0. Th e name of the installation fi le thus was R-3.0.0-win.exe. Th is version incorporates 29 basic packages (i.e. libraries that implement various statistical methods and other computations). Th ere are several other packages that include additional methods and other useful commands (at the time of writing this book, there were more than a thousand of them). You can see them by clicking on Packages on the taskbar on the left . A list of packages with a brief description of included functions is then displayed. If you are interested in any of them, you can download them (of course for free and in unlimited number). We will talk about how to do it later. Once you have successfully installed the soft ware, an icon called R 3.0.0 will appear on the desktop. Upon launching the program, the main (large) window opens with a command (smaller) window in it – this is the R Console (Fig. 2-1). Th e upper taskbar of the main window includes several basic commands accessible either via a pull-down menu or using a button.

2.2 INSTALLATION AND USE OF R

10

We will describe the functions of the most important ones. In the File menu, you can fi nd basic

options, such as Display fi le(s)... for displaying fi les in the R-3.0.0 folder. Th is is a standard,

pre-set working folder. It is quite convenient to store your data in this folder because then you

do not have to specify the full path when you want to read/save them. To keep things organ

ised, we have saved all data that we will work with in a folder called MABD (abbreviation of the

name of the book), located in folder C:Program FilesRR-3.0.0. If you want to do the same,

you have to fi rst download the data fi les from http://www.muni.cz/press/books/pekar_en and

save them in the folder called MABD. Th en you can change the working directory via the

Change dir... option from inside of the launched R program (using the File option from the

main menu).

By selecting Load Workspace... (again using the File option from the main menu), you can

upload your previous work, provided you saved it in the end of your previous session using

the Save Workspace... option. Th ese options are especially useful when you want to continue

your analysis aft er an interruption, for example, on the next day. You can just upload what you

did earlier and you thus do not have to do everything from the scratch again. Nevertheless, be

careful! Only the commands and (result) objects are saved (uploaded), not outputs and graph

ics. Th is means that if you want to see an output from some of the previous commands, you

have to call and launch it again (provided you have not already copied the text output into, for

example, a text fi le using Copy and Paste).

Th e Edit menu includes the copy and inserts functions in the same way we are used to from

other programs. Furthermore, the menu off ers an option for clearing the console window:

Clear console, and an option for opening the spreadsheet editor: Data editor..., and an option

for modifying the program appearance: GUI preferences... Here, you can change the size of

the windows, font type, background and font colour, etc.

Th e Misc menu includes the useful Stop current computation option for suspending a given

computation (for example, when the computation “freezes” for some reason, or when you

realise that you have entered an input incorrectly, etc.). Th is can be done even faster using a

shortcut key – by pressing the ESC key or by clicking on the STOP button. Th e Remove all

objects option is used for removing all currently defi ned or uploaded objects.

Th e Packages menu off ers options for working with additional packages, for example those

that contain additional functions. Packages (with the exception of a few basic ones) are not

automatically loaded into memory upon launching the program so that they are not occupy

ing it with functions that we do not need at a given moment. When you want to use a function

included in a certain package, you need to load the package by selecting the Load package...

option. However, if the required package is not included in the library of your computer, you

have to install it fi rst. We will practice it by installing the sciplot package, which we will use

for creating some plots. If you are connected to the Internet, click on the Install package(s)...

option and select a server from the displayed list; from this, you will download the package.

When you do that, your computer connects to the given server and displays a list of all cur

rently available packages. Select the one called sciplot (its installation on your computer is then

executed automatically). If you are not connected to the Internet, you have to fi rst download

STATISTICAL SOFTWARE

1 1

the zip form of the package from the above address separately and install the package from

the zip fi le manually aft erwards. When doing so, you need to be aware of the version of R that

you are using since packages can be specifi c for diff erent versions. Install the zip fi le into your

library using the Install package(s) from local zip fi les... option. Do not forget that in order to

activate functions of an already installed package, you need to always load it for each session.

Th e Windows menu includes options for switching between the command and graphical win

dows. Th e most useful is probably the Tile option, which allows you to see two windows next

to each other.

Finally, the Help menu off ers many useful options. Th e Console option will explain to you how

to control the console window. Commands are entered into the console following prompt (>).

Th e prompt is normally red, while program answers are (typically) displayed in blue. Com

mands can be typed by the user line by line (hitting “Enter” aft er each complete command), or

on the same line, separated by semicolons. No additional spaces are necessary between com

mands that are separated by a semicolon. Previous commands can be recalled one aft er an

other by pressing the up arrow key and back by pressing the down arrow key. Using the arrow

keys makes the work much easier. For example, if you make a mistake entering a command,

you can recall it by pressing the up arrow key and correct it. Using the left and right arrow keys,

you can move within the currently edited line.

You can learn many useful facts about R when using the FAQ on R or Manuals in (PDF) op

tions. Th e R functions (text)... option is very useful; it displays a detailed description of the

function, name of which you enter. Here you can fi nd a description of what the given function

can be used for, a list of all its arguments, their legal values, references to literature (related to

the method, on which the function is based), and a few examples. Th is command works only

for functions from the package(s) that are currently uploaded. Descriptions of all functions

(including those that are included in the installed packages but not uploaded) can be found by

using the Html help option. If you select this option, pages with some basic information about

R are displayed, including the functions included in the pre-installed packages. If you do not

know the name of a given function but you know what it is supposed to do, you can try to fi nd

it using the Search help option. For example, if we want to fi nd a function that computes the

Shapiro-Wilk test, type the key word „shapiro“. Th e program then searches all installed pack

ages and displays the functions that include the searched word in their names or descriptions.

Th e name of the package, in which the given function can be found, is displayed in the brackets

behind the name of the function. In our case, the function is called shapiro. test and it

is in the stats package.

2.3 Basic operations

Th e possibilities of R are really wide-ranging – the program includes hundreds and hun

dreds of diff erent commands. Of course we cannot mention and explain all of them here.

We will focus only on the most basic ones and on those you will encounter in this book most

oft en. Some others will be introduced within the context of examples later in the book. A

2.2 INSTALLATION AND USE OF R

12

brief list of the most important commands called the “R Reference Card” can be found at the

following address: http://cran.r-project.org/doc/contrib/Short-refcard.pdf. It is basically a four

page sheet. You can also read more about R, in the book by one of the R Core Team experts,

Dalgaard (2008).

Let us try some operations, starting with simple manipulations and proceeding to more

complicated procedures. Firstly, we use the environment as a scientifi c calculator. Just type

on the command line 2+5, press ENTER and the soft ware produces a result on a new line

(which begins with [ 1] ). Basic mathematical operators are: addition (+), subtraction (-),

multiplication (*), division (/), and power (^). Logic operators have the following form: less

than (<), greater than (>), equal (==), not equal (!=), less than or equal (<=), greater than

or equal (>=). Th ey do not produce a number but a logical value: either TRUE (abbreviated

T), or FALSE (abbreviated F).

Names of mathematical functions are in R very intuitive and thus easily memorable, for

example, absolute value ( abs), logarithm with base of e ( log), logarithm with base of 2

( log2), logarithm with base of 10 ( log10), exponent ( exp), sine ( sin), cosine ( cos),

tangent ( tan), arc sine ( asin), arc cosine ( acos), arc tangent ( atan), sum of several

numbers ( sum), product of several numbers ( prod). Th e command for square root

is sqrt. For other roots it is necessary to type them as powers (the power operator ^

is general and allows even negative and as well as non-integer arguments when they are

mathematically correct). Th ese simple functions are called by their name, followed by

a number in parentheses, as you can see in the following examples:

> 3*2

[1] 6

> 3^4

[1] 81

> sqrt(9)

[1] 3

> 8^(1/3)

[1] 2

> 3==4

[1] FALSE

> log(10)

[1] 2.302585

> log10(10)

[1] 1

> exp(2)

[1] 7.389056

> prod(2,3,4)

[1] 24

Answers to these commands were only displayed on the screen. You cannot directly work

with them any further. You can copy them and move them to the current command line

using Copy and Paste. Th is can be inconvenient, especially if we want to conduct further

operations with the result. In that case, you need to save the result into an object (and

subsequently use its name in parentheses when using it as an argument of some other

function). Th e simplest object is a scalar, i.e. a vector with a single element. Creating a scalar

STATISTICAL SOFTWARE

1 3

is easy. Choose its name, for example a, type an arrow (composed of ‘less than’ and a dash,

<-) or = behind it, and enter the value you want to enter in it, for example 8. All of this is

meant as an assignment of the value to the object a. You can easily display the content of

a vector by simply typing its name on a new line or on the same line behind a semicolon.

To create vectors of length larger than 1, you have to use the c (concatenate) function. Th is

function will bind all entered values (given as arguments, within parentheses and separated

by commas) into a vector in the order entered:

> a<-8; a

[1] 8

> b<-c(2,1/2,95); b

[1] 2.0 0.5 95.0

All names in R are case sensitive – including keywords (thus by typing b and B you are calling

diff erent objects). Object names can be almost arbitrary. Nevertheless, it is a good practice

to remember some restrictions. Names must not begin with a number or a special symbol

(such as comma or dot, etc.). Some names are better not to be used to avoid confusion.

Th is applies to standard R commands and functions which have a predefi ned function, for

example break, c, C, D, diff, else, F, FALSE, for, function, I, if, in, Inf, mean,

NA, NaN, next, NULL, pi, q, range, rank, repeat, s, sd, t, tree, TRUE, T, var and

while. If you use them then the predefi ned function will be masked (replaced with the new

object). Th is can cause a number of nasty problems that are diffi cult to discover. You had

best avoid using names that collide with the names of R functions (unless your intention is

to replace the standard R function by a version of your own).

Values of vector components (and hence of vectors themselves) can be of diff erent types:

numeric, e.g. c (1 .5,20,-3.1), logical, e.g. c(TRUE,FALSE,TRUE), or character.

Characters are always enclosed in quotation marks, e.g. c ("blue","red","green").

A vector should always include values of the same type. If there are various types, e.g. numbers

and characters, the vector will be automatically classifi ed by R as the most general appropriate

type (i.e. character). Character vectors cannot be used for mathematical operations. Th is is

useful to remember when searching for errors and reasons why something does not work.

Numeric vectors can be created by entering every number or using a colon, which requests

creation of an integer sequence of values from:to.

> y <-1:11; y

[1] 1 2 3 4 5 6 7 8 9 10 11

You can also create the same sequence in a diff erent way – using the seq command. Th is

command is not as simple as the previous command since it can do more – it can create

a sequence that does not consist of integers only. To make it work, you need to specify the

arguments that you need for the given objective. Particularly, you need to specify the initial

value ( from), fi nal value (to) and the step size ( by). It is typical for functions in R to

include multiple arguments that have their own name and pre-defi ned positions (there are

also functions with arguments that do not have names. It lets you to specify the arguments

either using their names in any order or without names but in the appropriate order. Th e

2.3 BASIC OPERATIONS