PH 30004

November 1, 2021

Upcoming and updates

Exam 3 Nov 12

Final small assignment – Nov 22

Adjustments this week:

QDAS today; extended time and flexibility for participation points

Possible guest on Friday (will verify on Wed)

Review final proposal details on Wed; revise/refine topics during this week

Adjustments going forward:

Work days (final proposal) plus exam next week – no meetings

Visualization content week of Nov 15th (with posters)

New schedule in Canvas – Nov 1

Aims of qualitative analysis software:

Facilitate organization of data and codes

Facilitate coding processes – especially reapplication of the same code

Enable multiple comparisons:

Code comparisons – how often is a code used? Where is a code used?

Source comparisons – length, type, etc.

Demographics (aka attributes or descriptors) – usually coded at source/case level

Attribute by code

Attribute by source

Visualization – vary by program
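
The comparisons listed above can be illustrated outside any QDAS program. The following is only a sketch in R with made-up data: the segments, code names, and the gender attribute are all hypothetical, and this is not how NVivo, Quirkos, or any other program stores or counts codes internally.

```r
# Hypothetical coded segments: one row per coded passage
segments <- data.frame(
  source = c("Interview1", "Interview1", "Interview2", "Interview2", "Interview3"),
  code   = c("access", "cost", "access", "access", "cost"),
  gender = c("F", "F", "M", "M", "F")   # attribute recorded at the source/case level
)

# Code comparison: how often is each code used?
code_counts <- table(segments$code)

# Code comparison: where (in which source) is each code used?
code_by_source <- table(segments$source, segments$code)

# Attribute by code: how do code counts differ across an attribute?
attribute_by_code <- table(segments$gender, segments$code)

print(code_counts)
print(code_by_source)
print(attribute_by_code)
```

Each table is a simple count, which is the kind of categorical comparison that QDAS programs automate across many sources.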

Terminology for QDAS

Codes are not always called codes

Nodes – NVivo

Quirks – Quirkos

Codes – Dedoose

Codes – Hyperresearch

MAXQDA

Atlas.ti

F4 analyze

QDAS program comparison

Quirkos –

Web-based or local

Project sharing

Projects can be transferred to other programs

Inexpensive perpetual student license

No audio/visual files

NVivo –

Primarily local

Project sharing with cloud or server

Projects can be transferred

Limited student license

Kent has limited number of perpetual licenses that can be obtained

QDAS program comparison

Hyperresearch –

Local license

Data not imported into program

Has transcribe add on

Free unlimited trial version

Perpetual student license

Dedoose –

Desktop version (was web until Flash went away)

Monthly fee

Promotes itself as “mixed methods”

When to use QDAS

Multiple coders – use a program that facilitates sharing (web-based)

Multi-media files (all but Quirkos)

Need to do categorical/quantitative comparison

Meta-studies

Data are in electronic format (i.e., not books or physical photos, unless you can scan)

Appropriate qualitative approach:

Grounded theory; descriptive; case study; some narrative; large samples

Not phenomenological; IPA; some narrative; small samples; other non-coding approaches

Can at times do later cycles in QDAS after initial work done in Word or similar program

Update/recap

Review final proposal

Friday – Mendeley presentation

Participation points for attendance

Participation points for turning in demo practice

Reviewing last small assignment – can use Mendeley to complete

QDAS – changing from participation to extra credit (10 points)

Review a couple of other programs today

Next week – two work days plus exam

Exam most likely available on Wed due to holiday

Next week, I will request verification of final proposal topic and make up of any groups

For Friday

Download Mendeley Cite and Mendeley Reference manager

Find three articles to practice with – any journal articles

Download to computer

Newer is better!

PH 30004

Week of Oct 25, 2021

Schedule/recap

Grading status

Exam items

Guest lectures coming up:

Ms. Essel on Mendeley (will verify date)

Mr. Coetzer-Liversage on substance abuse research (mid November)

Others as possible

Exam 3 – Nov 12 (shorter time gap); study guide on Nov 8

Last small assignment – Nov 22 (no live meeting that day)

SURE opportunities – applications in Spring

This week

Ethical research

Basic math

R / R studio – problems on Wed

Quirkos – Friday or next Monday

Overview of other available QDAS

Miro exercise – next week

Three opportunities for participation points

Ethical research practice

CITI

Office of Human Research Protections

Is it research?

Example – journalistic interviews and oral histories are not research – systematic but not generalizable

Does it fall within the exempt categories?

Example – interviews aimed to contribute to generalizable knowledge

Secondary analysis of things that are not research (see above) aimed at developing generalizable knowledge

Challenging human participant contexts

Children – usually defined as under 18

Parental consent and children’s assent/consent are both required

Other vulnerable/disadvantaged populations

Illegal immigrants; refugees and individuals seeking political asylum

Persons in community-based settings where poor research practice has occurred in the past

Not necessarily viewed by IRB as vulnerable but may view themselves as vulnerable

Low income and/or individuals with low levels of educational attainment

Critical need to understand consent wording, research process, and right to withdraw

Three major ongoing challenges – #1 incentives

Increasing emphasis on incentives

Potential for coercive impact on recruitment

Potential to sway responses – participants feel they “owe” something

Can limit research efforts (no budget for incentives = no research)

Might dissuade some otherwise eligible participants (identifying info needed for incentives including at times tax information)

What about incentivizing parents for children’s participation? Is this a good practice?

Research results on the persuasive power of incentives are very mixed

Many will participate due to interest in the subject rather than rewards

Participants should not have to pay (parking, equipment, etc.)

Current/ongoing challenges #2 – archived data

Archiving and re-using data

Adding to archives/secondary data is essential and resource-effective

Where to archive?

De-identifying qualitative data – how much detail can be removed while preserving integrity?

IRBs requesting re-contact

Challenges in finding participants

Is this intrusive? Participants did not consent to re-contact

Ensure consent describes potential for and type of re-use

Historical precedents for reuse that was not always appropriate

How companies like FitBit/Google and MedProctor get your “consent”

Current/ongoing challenges #3 – regulated data

PHI – protected health information – and educational records

The latter include enrollment in a course, attendance, etc. not just grades and scores

HIPAA and FERPA

Records exist; access ranges from impossible to routine

De-identification processes

Storage processes

Privacy officer needs to be involved, even for non-research projects

Ex – follow up calls to remind of medical or advising appointments

Quantitative data analysis

Use math when possible

Phone/online calculator

Physical calculator or machine

Why use a machine with a tape?

Qualtrics built in functions

MS Excel***

Basic mathematical processes

Mean/average (add then divide by n)

Frequency (how many times does something happen in a given time period?)

Expressed as simple tally
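
A minimal sketch of the mean and a frequency tally in R. The visit counts are made up for illustration, and the variable names are assumptions rather than anything from the course code.

```r
# Hypothetical data: number of clinic visits reported by six participants
visits <- c(2, 4, 4, 5, 7, 8)

# Mean/average: add the values, then divide by n
n <- length(visits)
mean_by_hand <- sum(visits) / n

# The built-in function gives the same result
mean_builtin <- mean(visits)

# Frequency: a simple tally of how many times each value occurs
tally <- table(visits)

print(mean_by_hand)
print(mean_builtin)
print(tally)
```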

Proportions example

N = 50

12.5 purchased a course text = (12.5*100)/50 = 25%

Quick calculations – determine 1% of n by dividing 1 by 100, then multiplying by n:

(1/100) * 50 = 0.5

Multiply the 1% value (0.5) by the desired percentage (for example, 0.5 * 25 = 12.5)
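
The proportion arithmetic above can be checked in R. This sketch reuses the numbers from the example (N = 50, 12.5, 25%); the variable names are chosen only for illustration.

```r
n <- 50             # total, from the example
purchased <- 12.5   # amount from the example

# Percentage: (part * 100) / n
percent <- (purchased * 100) / n

# Quick method: find 1% of n, then multiply by the desired percentage
one_percent <- (1 / 100) * n
amount_for_25_percent <- one_percent * 25

print(percent)                # percentage that purchased a course text
print(one_percent)            # what 1% of n amounts to
print(amount_for_25_percent)  # matches the 'purchased' value
```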

Index example

An index is a measure with a flexible standard or base that allows you to compare one value to another

Primary uses are to determine percent increase or decrease

Example: Tuition in 1980 was $500 per term; tuition in 1982 was $550 per term

Making 1980 the base value of 100, the 1982 value is 10% more so reflects an index of 110. This can be reported as a 10% increase or as an index value of 110.

An index change is merely the difference = + 10 index points

Indexes are always relative to the base value and are not necessarily the same as a percentage change
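
A short sketch of the tuition index in R. The 1980 and 1982 values come from the example; the third value (605) is hypothetical and is added only to show how index points and percentage change can differ once the comparison is no longer against the base.

```r
base_value <- 500    # 1980 tuition per term (base, index = 100)
value_1982 <- 550    # 1982 tuition per term
value_later <- 605   # hypothetical later value, for illustration only

# Index: value relative to the base, with the base set to 100
index_1982  <- (value_1982 / base_value) * 100
index_later <- (value_later / base_value) * 100

# Index change is the difference in index points
change_from_base <- index_1982 - 100
change_between   <- index_later - index_1982

# Percentage change between the two later values (not relative to the base)
percent_between <- (value_later - value_1982) / value_1982 * 100

print(c(index_1982 = index_1982, index_later = index_later))
print(c(points = change_between, percent = percent_between))
```

Against the base, the index change and the percentage change agree (10 points, 10%). Between the two later values they do not: the index rises by 11 points while the percentage change is 10%.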

R and R studio

Download before class

https://www.r-project.org

Work through example problems on Wed

Will turn something in for participation credit

No computer? take good notes

Exam debrief

M/C 4 is just wrong. All received credit. I believe it should have been “fail to reject” but other answers were close

HCD versus Scientific Method – two differences means two pairs of differences, not A about X and B about Y

Banned – lots of bads and no goods – will people just be passive and sit by? The point of a brainstorm is to look at problems but also consider work arounds. Will people just let the piles of garbage grow?

Ex: The Duxbury Dump

Is it worth spending time to try for a higher score?

R – basic info

Open source, open access, sustained by international users and contributors community

Basic package plus growing add ons

Multiple options for most functions/procedures

Can expand as needed rather than wait for updates

“Reads” data from a designated location

Can enter data

.csv is preferred file type – looks like Excel, saved with .csv suffix

R

Code – equations, formulae, directions

Click programs use code, too, behind the scenes

Excel for stats – not as intuitive (working in cells)

SAS, SPSS, Stata – license fee, periodic updates, not freely expandable

Fully functional SAS is Windows only

Most modern datasets are amenable to R

Python is increasingly popular

R – basic processes

Using a script document

Basic math

Setting a working directory

Create a group

Read in data

Friday –

Install package

Basic graph

Modifying script
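
The steps listed above might look roughly like this in a script. This is a sketch under assumptions, not the course's 10/29/2021 code: "Create a group" is read here as building a vector with c(), the file name practice.csv and the scores are made up, and a temporary folder stands in for your own working directory.

```r
# Basic math
2 + 3

# Set a working directory (a temporary folder stands in for a course folder)
old_dir <- setwd(tempdir())

# Create a group of values (a vector)
scores <- c(78, 85, 92, 88)

# Save a small .csv so there is something to read in, then read it back
practice <- data.frame(id = 1:4, score = scores)
write.csv(practice, "practice.csv", row.names = FALSE)
dat <- read.csv("practice.csv")

# Look at the first rows and the structure of the data
head(dat)
str(dat)

# Install a package (one time only); commented out so the script runs offline
# install.packages("ggplot2")

# Basic graph using base R
hist(dat$score, main = "Practice scores", xlab = "Score")

# Return to the original working directory
setwd(old_dir)
```

Modifying the script means changing values such as the scores or the graph title and running the lines again.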

To do for participation points

Directions at end of 10/29/2021 code

R Studio

Most people who use R Studio – an interface for R – prefer it.

More point and click options

Easier to access data and other files

Help readily available

Difficult to switch back to R once you have downloaded R studio – opening any files will default to opening R Studio, not R

https://www.rstudio.com/products/rstudio/download/

Exam 3

Questions on this exam come from class meetings, readings, a guest lecture and software demo and practice opportunities

Part 3 Longer Response – 4 questions; 6 points each. Please read the instructions carefully. Partial credit is available for partially correct responses.

1. Use the screenshot below to respond to the following:

a. What does # mean in front of a line of code?

b. What does the command “library” do?

c. What is the point of the word “score” before the “gets” (<-) symbol?

2. Use the screenshot below to respond to the following:

a. What does the code “head” show?

b. What does the code “str” show?

c. In the line of code with the command “as.factor” – what is R being asked to do?

3. Review the screenshots of Mendeley Cite for the next series of questions:

a) Refer to the screenshot above. What needs to be done to correct the placement of these two references in the text of a paper or article?

b) What citation style is shown in the screenshot above (in text and reference list)?

c) Which citation style is shown in the screenshot below (in text and reference list)?

d) Identify one citation error in the screenshot associated with item b

e) Identify one citation error in the screenshot associated with item c.

4. The following questions are all about the screenshot below from a library search:

a. Which databases were searched?

b. What is the best way to keep track of specific results I want to be able to look at later on this website?

c. What is one thing that is probably incorrect about the way these search terms were set up?

d. Describe two ways that are consistent with the general aims of academic literature searching, that would reduce the number of results (note correcting the incorrect thing you identified in item c does not count as one of the two ways).

Nicholas J. Horton
Randall Pruim
Daniel T. Kaplan

A Student’s
Guide to

R

Project MOSAIC


Copyright (c) 2015 by Nicholas J. Horton, Randall
Pruim, & Daniel Kaplan.

Edition 1.2, November 2015

This material is copyrighted by the authors under a
Creative Commons Attribution 3.0 Unported License.
You are free to Share (to copy, distribute and transmit
the work) and to Remix (to adapt the work) if you
attribute our work. More detailed information about
the licensing is available at this web page: http:
//www.mosaic-web.org/go/teachingRlicense.html.

Cover Photo: Maya Hanna.

Contents

1 Introduction

2 Getting Started with RStudio

3 One Quantitative Variable

4 One Categorical Variable

5 Two Quantitative Variables

6 Two Categorical Variables

7 Quantitative Response, Categorical Predictor

8 Categorical Response, Quantitative Predictor

9 Survival Time Outcomes

10 More than Two Variables

11 Probability Distributions & Random Variables

12 Power Calculations

13 Data Management

14 Health Evaluation (HELP) Study

15 Exercises and Problems

16 Bibliography

17 Index

About These Notes

We present an approach to teaching introductory and intermediate statistics courses that is tightly coupled with computing generally and with R and RStudio in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling based inference, and multivariate graphical techniques. A secondary goal is to facilitate computing with data through use of small simulation studies and appropriate statistical analysis workflow. This follows the philosophy outlined by Nolan and Temple Lang [1]. The importance of modern computation in statistics education is a principal component of the recently adopted American Statistical Association’s curriculum guidelines [2].

[1] D. Nolan and D. Temple Lang. Computing in the statistics curriculum. The American Statistician, 64(2):97–107, 2010.

[2] ASA Undergraduate Guidelines Workgroup. 2014 curriculum guidelines for undergraduate programs in statistical science. Technical report, American Statistical Association, November 2014. http://www.amstat.org/education/curriculumguidelines.cfm

Throughout this book (and its companion volumes),
we introduce multiple activities, some appropriate for
an introductory course, others suitable for higher levels,
that demonstrate key concepts in statistics and modeling
while also supporting the core material of more tradi-
tional courses.

A Work in Progress
Caution!

Despite our best efforts, you
WILL find bugs both in this
document and in our code.
Please let us know when you
encounter them so we can call
in the exterminators.

These materials were developed for a workshop entitled
Teaching Statistics Using R prior to the 2011 United States
Conference on Teaching Statistics and revised for US-
COTS 2011, USCOTS 2013, eCOTS 2014, ICOTS 9, and
USCOTS 2015. We organized these workshops to help
instructors integrate R (as well as some related technolo-
gies) into statistics courses at all levels. We received great
feedback and many wonderful ideas from the participants
and those that we’ve shared this with since the work-
shops.

Consider these notes to be a work in progress. We ap-
preciate any feedback you are willing to share as we con-
tinue to work on these materials and the accompanying
mosaic package. Drop us an email at [email protected]
org with any comments, suggestions, corrections, etc.

Updated versions will be posted at http://mosaic-web.
org.

Two Audiences

We initially developed these materials for instructors of
statistics at the college or university level. Another audi-
ence is the students these instructors teach. Some of the
sections, examples, and exercises are written with one or
the other of these audiences more clearly at the forefront.
This means that

1. Some of the materials can be used essentially as is with
students.

2. Some of the materials aim to equip instructors to de-
velop their own expertise in R and RStudio to develop
their own teaching materials.

Although the distinction can get blurry, and what
works “as is” in one setting may not work “as is” in an-
other, we’ll try to indicate which parts fit into each cate-
gory as we go along.

R, RStudio and R Packages

R can be obtained from http://cran.r-project.org/.
Download and installation are quite straightforward for
Mac, PC, or linux machines.

RStudio is an integrated development environment (IDE) that facilitates use of R for both novice and expert users. We have adopted it as our standard teaching environment because it dramatically simplifies the use of R for instructors and for students. RStudio can be installed as a desktop (laptop) application or as a server application that is accessible to users via the Internet.

More Info
Several things we use can be done only in RStudio (for instance manipulate() or RStudio’s integrated support for reproducible research).

The RStudio server version works well with starting students. All they need is a web browser, avoiding any potential problems with oddities of students’ individual computers.

In addition to R and RStudio, we will make use of several packages that need to be installed and loaded separately. The mosaic package (and its dependencies) will be used throughout. Other packages appear from time to time as well.

Marginal Notes

Marginal notes appear here and there. Sometimes these are side comments that we wanted to say, but we didn’t want to interrupt the flow to mention them in the main text. Others provide teaching tips or caution about traps, pitfalls and gotchas.

Have a great suggestion for a marginal note? Pass it along.

What’s Ours Is Yours – To a Point

This material is copyrighted by the authors under a Creative Commons Attribution 3.0 Unported License. You are free to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work) if you attribute our work. More detailed information about the licensing is available at this web page: http://www.mosaic-web.org/go/teachingRlicense.html.

Digging Deeper
If you know LaTeX as well as R, then knitr provides a nice solution for mixing the two. We used this system to produce this book. We also use it for our own research and to introduce upper level students to reproducible analysis methods. For beginners, we introduce knitr with RMarkdown, which produces PDF, HTML, or Word files using a simpler syntax.

Document Creation

This document was created on November 15, 2015, using

• knitr, version 1.11

• mosaic, version 0.12.9003

• mosaicData, version 0.12.9003

• R version 3.2.2 (2015-08-14)

Inevitably, each of these will be updated from time to
time. If you find that things look different on your com-
puter, make sure that your version of R and your pack-
ages are up to date and check for a newer version of this
document.

Project MOSAIC

This book is a product of Project MOSAIC, a community
of educators working to develop new ways to introduce
mathematics, statistics, computation, and modeling to
students in colleges and universities.

The goal of the MOSAIC project is to help share ideas
and resources to improve teaching, and to develop a cur-
ricular and assessment infrastructure to support the dis-
semination and evaluation of these approaches. Our goal
is to provide a broader approach to quantitative stud-
ies that provides better support for work in science and
technology. The project highlights and integrates diverse
aspects of quantitative work that students in science, tech-
nology, and engineering will need in their professional
lives, but which are today usually taught in isolation, if at
all.

In particular, we focus on:

Modeling The ability to create, manipulate and investigate
useful and informative mathematical representations of
real-world situations.

Statistics The analysis of variability that draws on our
ability to quantify uncertainty and to draw logical in-
ferences from observations and experiment.

Computation The capacity to think algorithmically, to
manage data on large scales, to visualize and inter-
act with models, and to automate tasks for efficiency,
accuracy, and reproducibility.

Calculus The traditional mathematical entry point for col-
lege and university students and a subject that still has
the potential to provide important insights to today’s
students.

Drawing on support from the US National Science
Foundation (NSF DUE-0920350), Project MOSAIC sup-
ports a number of initiatives to help achieve these goals,
including:

Faculty development and training opportunities, such as the
USCOTS 2011, USCOTS 2013, eCOTS 2014, and ICOTS
9 workshops on Teaching Statistics Using R and RStu-
dio, our 2010 Project MOSAIC kickoff workshop at the
Institute for Mathematics and its Applications, and
our Modeling: Early and Often in Undergraduate Calculus
AMS PREP workshops offered in 2012, 2013, and 2015.

M-casts, a series of regularly scheduled webinars, de-
livered via the Internet, that provide a forum for in-
structors to share their insights and innovations and
to develop collaborations to refine and develop them.
Recordings of M-casts are available at the Project MO-
SAIC web site, http://mosaic-web.org.

The construction of syllabi and materials for courses that
teach MOSAIC topics in a better integrated way. Such
courses and materials might be wholly new construc-
tions, or they might be incremental modifications of
existing resources that draw on the connections be-
tween the MOSAIC topics.

More details can be found at http://www.mosaic-web.
org. We welcome and encourage your participation in all
of these initiatives.

Computational Statistics

There are at least two ways in which statistical software
can be introduced into a statistics course. In the first ap-
proach, the course is taught essentially as it was before
the introduction of statistical software, but using a com-
puter to speed up some of the calculations and to prepare
higher quality graphical displays. Perhaps the size of
the data sets will also be increased. We will refer to this
approach as statistical computation since the computer
serves primarily as a computational tool to replace pencil-
and-paper calculations and drawing plots manually.

In the second approach, more fundamental changes in
the course result from the introduction of the computer.
Some new topics are covered, some old topics are omit-
ted. Some old topics are treated in very different ways,
and perhaps at different points in the course. We will re-
fer to this approach as computational statistics because
the availability of computation is shaping how statistics is
done and taught. Computational statistics is a key com-
ponent of data science, defined as the ability to use data
to answer questions and communicate those results.

Students need to see aspects of
computation and data science
early and often to develop
deeper skills. Establishing
precursors in introductory
courses helps them get started.

In practice, most courses will incorporate elements of
both statistical computation and computational statistics,
but the relative proportions may differ dramatically from
course to course. Where on the spectrum a course lies
will depend on many factors including the goals of the
course, the availability of technology for student use, the
perspective of the text book used, and the comfort-level of
the instructor with both statistics and computation.

Among the various statistical software packages avail-
able, R is becoming increasingly popular. The recent addi-
tion of RStudio has made R both more powerful and more
accessible. Because R and RStudio are free, they have become
widely used in research and industry. Training in R
and RStudio is often seen as an important additional skill
that a statistics course can develop. Furthermore, an in-
creasing number of instructors are using R for their own
statistical work, so it is natural for them to use it in their
teaching as well. At the same time, the development of R
and of RStudio (an optional interface and integrated de-
velopment environment for R) are making it easier and
easier to get started with R.

Information about the mosaic
package, including vignettes
demonstrating features and
supplementary materials (such
as this book) can be found at
https://cran.r-project.org/
web/packages/mosaic.

We developed the mosaic R package (available on
CRAN) to make certain aspects of statistical computation
and computational statistics simpler for beginners, with-
out limiting their ability to use more advanced features of
the language. The mosaic package includes a modelling
approach that uses the same general syntax to calculate
descriptive statistics, create graphics, and fit linear mod-
els.
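
To give a feel for what "the same general syntax" can mean, here is a sketch that uses only base R's formula interface (response ~ predictor) on the built-in mtcars data. It is a stand-in, not the mosaic package's own functions: mosaic is not loaded here, and the choice of dataset and variables is an assumption made for illustration.

```r
# Built-in data: mpg = miles per gallon, cyl = number of cylinders, wt = weight

# Descriptive statistics: mean mpg within each cylinder group
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Graphics: side-by-side boxplots of mpg by cylinder group
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Linear model: mpg as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

print(group_means)
print(coef(fit))
```

All three calls share the pattern of a formula plus a data argument, so a beginner who learns it once can reuse it for summaries, plots, and models.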

1 Introduction

In this reference book, we briefly review the commands
and functions needed to analyze data from introductory
and second courses in statistics. This is intended to com-
plement the Start Teaching with R and Start Modeling with
R books.

Most of our examples will use data from the HELP
(Health Evaluation and Linkage to Primary Care) study:
a randomized clinical trial of a novel way to link at-risk
subjects with primary care. More information on the
dataset can be found in chapter 14.

Since the selection and order of topics can vary greatly from textbook to textbook and instructor to instructor, we have chosen to organize this material by the kind of data being analyzed. This should make it straightforward to find what you are looking for. Some data management skills are needed by students [1]. A basic introduction to key idioms is provided in Chapter 13.

[1] N.J. Horton, B.S. Baumer, and H. Wickham. Setting the stage for data science: integration of data management skills in introductory and second courses in statistics (http://arxiv.org/abs/1401.3269). CHANCE, 28(2):40–50, 2015.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses, and the mosaicData package which includes a number of data sets. A short summary of the R commands needed to teach introductory statistics can be found in the mosaic package vignette: https://cran.r-project.org/web/packages/mosaic.

Other related resources from Project MOSAIC may be helpful, including an annotated set of examples from the sixth edition of Moore, McCabe and Craig’s Introduction to the Practice of Statistics [2] (see http://www.amherst.edu/~nhorton/ips6e), the second and third editions of the Statistical Sleuth [3] (see http://www.amherst.edu/~nhorton/sleuth), and Statistics: Unlocking the Power of Data by Lock et al (see https://github.com/rpruim/Lock5withR).

[2] D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics. W.H.Freeman and Company, 6th edition, 2007.

[3] F. Ramsey and D. Schafer. Statistical Sleuth: A Course in Methods of Data Analysis. Cengage, 2nd edition, 2002.

To use a package within R, it must be installed (one time), and loaded (each session). The mosaic package can be installed using the following command:

> install.packages("mosaic") # note the quotation marks

The # character begins a comment in R, and all text after that on the current line is ignored.

RStudio features a simplified package installation tab (in the bottom right panel).

Once the package is installed (one time only), it can be loaded by running the command:

> require(mosaic)

The knitr/LaTeX system allows experienced users to combine R and LaTeX in the same document. The reward for learning this more complicated system is much finer control over the output format. But RMarkdown is much easier to learn and is adequate even for professional-level work.

Using Markdown or knitr/LaTeX requires that the markdown package be installed.

The RMarkdown system provides a simple markup
language and renders the results in PDF, Word, or HTML.
This allows students to undertake their analyses using a
workflow that facilitates “reproducibility” and avoids cut
and paste errors.

We typically introduce students to RMarkdown very early, requiring students to use it for assignments and reports [4].

[4] B.S. Baumer, M. Çetinkaya-Rundel, A. Bray, L. Loi, and N. J. Horton. R Markdown: Integrating a reproducible analysis tool into introductory statistics. Technology Innovations in Statistics Education, 8(1):281–283, 2014.

2 Getting Started with RStudio

RStudio is an integrated development environment (IDE)
for R that provides an alternative interface to R that has
several advantages over the default R interfaces:

A series of getting started videos is available at http://www.amherst.edu/~nhorton/rstudio.

• RStudio runs on Mac, PC, and Linux machines and pro-
vides a simplified interface that looks and feels identical
on all of them.
The default interfaces for R are quite different on the
various platforms. This is a distractor for students and
adds an extra layer of support responsibility for the
instructor.

• RStudio can run in a web browser.
In addition to stand-alone desktop versions, RStudio can be set up as a server application that is accessed via the internet.
The web interface is nearly identical to the desktop version. As with other web services, users log in to access their account. If students log out and log in again later, even on a different machine, their session is restored and they can resume their analysis right where they left off. With a little advanced set up, instructors can save the history of their classroom R use and students can load those history files into their own environment.

Caution!
The desktop and server versions of RStudio are so similar that if you run them both, you will have to pay careful attention to make sure you are working in the one you intend to be working in.

Note
Using RStudio in a browser is like Facebook for statistics. Each time the user returns, the previous session is restored and they can resume work where they left off. Users can log in from any device with internet access.

• RStudio provides support for reproducible research.
RStudio makes it easy to include text, statistical analysis (R code and R output), and graphical displays all in the same document. The RMarkdown system provides a simple markup language and renders the results in HTML. The knitr/LaTeX system allows users to combine R and LaTeX in the same document. The reward for learning this more complicated system is much finer control over the output format. Depending on the level of the course, students can use either of these for homework and projects.

To use Markdown or knitr/LaTeX requires that the knitr package be installed on your system.

• RStudio provides integrated support for editing and executing R code and documents.

• RStudio provides some useful functionality via a graph-
ical user interface.

RStudio is not a GUI for R, but it does provide a
GUI that simplifies things like installing and updating
packages; monitoring, saving and loading environ-
ments; importing and exporting data; browsing and
exporting graphics; and browsing files and documenta-
tion.

• RStudio provides access to the manipulate package.
The manipulate package provides a way to create

simple interactive graphical applications quickly and
easily.

While one can certainly use R without using RStudio,
RStudio makes a number of things easier and we highly
recommend using RStudio. Furthermore, since RStudio is
in active development, we fully expect more useful fea-
tures in the future.

We primarily use an online version of RStudio. RStudio
is an innovative and powerful interface to R that runs in a
web browser or on your local machine. Running in the
browser has the advantage that you don’t have to install
or configure anything. Just login and you are good to go.
Furthermore, RStudio will “remember” what you were
doing so that each time you login (even on a different
machine) you can pick up right where you left off. This
is “R in the cloud” and works a bit like GoogleDocs or
Facebook for R.

R can also be obtained from http://cran.r-project.org/.
Download and installation are pretty straightforward
for Mac, PC, or Linux machines. RStudio is available
from http://www.rstudio.org/.

2.1 Connecting to an RStudio server

RStudio servers have been set up at a number of schools to
facilitate cloud-based computing.

RStudio servers have been installed at many institutions.
More details about (free) academic licenses for RStudio
Server Pro as well as setup instructions can be found at
http://www.rstudio.com/resources/faqs under the
Academic tab.

Once you connect to the server, you should see a login
screen:

The RStudio server doesn’t tend
to work well with Internet
Explorer.

Once you authenticate, you should see the RStudio
interface:

Notice that RStudio divides its world into four panels.
Several of the panels are further subdivided into multiple
tabs. Which tabs appear in which panels can be customized
by the user.

R can do much more than a simple calculator, and we
will introduce additional features in due time. But per-
forming simple calculations in R is a good way to begin
learning the features of RStudio.

Commands entered in the Console tab are immediately
executed by R. A good way to familiarize yourself with
the console is to do some simple calculator-like compu-
tations. Most of this will work just like you would expect
from a typical calculator. Try typing the following com-
mands in the console panel.

> 5 + 3

[1] 8

> 15.3 * 23.4

[1] 358.02

> sqrt(16) # square root

[1] 4

This last example demonstrates how functions are
called within R as well as the use of comments. Com-
ments are prefaced with the # character. Comments can
be very helpful when writing scripts with multiple com-
mands or to annotate example code for your students.

You can save values to named variables for later reuse.

It’s probably best to settle on
using one or the other of the
right-to-left assignment opera-
tors rather than to switch back
and forth. We prefer the arrow
operator because it represents
visually what is happening in
an assignment and because it
makes a clear distinction be-
tween the assignment operator,
the use of = to provide values to
arguments of functions, and the
use of == to test for equality.

> product = 15.3 * 23.4 # save result

> product # display the result

[1] 358.02

> product <- 15.3 * 23.4 # <- can be used instead of =

> product

[1] 358.02
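The three operators that are easy to confuse can be seen side by side in the short sketch below. The variable names (scores, average, is_six) are made up for illustration and do not appear elsewhere in this guide.

```r
# <- assigns a value to a name
scores <- c(4, 8, NA, 6)

# = inside a function call supplies a value to a named argument
average <- mean(scores, na.rm = TRUE)

# == tests for equality and returns TRUE or FALSE
is_six <- average == 6

average
is_six
```

Here na.rm = TRUE tells mean() to skip the missing value, so the average is taken over 4, 8, and 6.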

Once variables are defined, they can be referenced in
other operations and functions.

> 0.5 * product # half of the product

[1] 179.01

> log(product) # (natural) log of the product

[1] 5.880589

> log10(product) # base 10 log of the product

[1] 2.553907

> log2(product) # base 2 log of the product

[1] 8.483896

> log(product, base=2) # base 2 log of the product, another way

[1] 8.483896

The semi-colon can be used to place multiple com-
mands on one line. One frequent use of this is to save and
print a value all in one go:

> product <- 15.3 * 23.4; product # save result and show it

[1] 358.02

2.1.1 Version information

At times it may be useful to check what version of the
mosaic package, R, and RStudio you are using. Running
sessionInfo() will display information about the version
of R and the packages that are loaded, and RStudio.Version()
will provide information about the version of RStudio.

> sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-apple-darwin13.4.0 (64-bit)

Running under: OS X 10.10.5 (Yosemite)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] grid stats graphics grDevices utils datasets

[7] methods base

other attached packages:

[1] mosaic_0.12.9003 mosaicData_0.9.9001 car_2.1-0

[4] ggplot2_1.0.1 dplyr_0.4.3 lattice_0.20-33

[7] knitr_1.11

loaded via a namespace (and not attached):

[1] Rcpp_0.12.1 magrittr_1.5 splines_3.2.2

[4] MASS_7.3-45 munsell_0.4.2 colorspace_1.2-6

[7] R6_2.1.1 ggdendro_0.1-17 minqa_1.2.4

[10] highr_0.5.1 stringr_1.0.0 plyr_1.8.3

[13] tools_3.2.2 nnet_7.3-11 parallel_3.2.2

[16] pbkrtest_0.4-2 nlme_3.1-122 gtable_0.1.2

[19] mgcv_1.8-9 quantreg_5.19 DBI_0.3.1

[22] MatrixModels_0.4-1 lme4_1.1-10 assertthat_0.1

[25] digest_0.6.8 Matrix_1.2-2 gridExtra_2.0.0

[28] nloptr_1.0.4 reshape2_1.4.1 formatR_1.2.1

[31] evaluate_0.8 stringi_1.0-1 scales_0.3.0

[34] SparseM_1.7 proto_0.3-10
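When only one or two version numbers are needed, base R offers more targeted queries than the full sessionInfo() listing. The sketch below is one possible approach; the stats package is used only because it ships with every R installation, and the minimum version "3.2.0" is an arbitrary value chosen for illustration.

```r
# Version of R itself, as a single string
r_version <- R.version.string
r_version

# Version of one installed package
stats_version <- packageVersion("stats")
stats_version

# Check for a minimum R version before relying on newer features
recent_enough <- getRversion() >= "3.2.0"
recent_enough
```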

2.2 Working with Files

2.2.1 Working with R Script Files

As an alternative, R commands can be stored in a file.
RStudio provides an integrated editor for editing these
files and facilitates executing some or all of the com-
mands. To create a file, select File, then New File, then R
Script from the RStudio menu. A file editor tab will open
in the Source panel. R code can be entered here, and but-
tons and menu items are provided to run all the code
(called sourcing the file) or to run the code on a single
line or in a selected section of the file.
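Sourcing can also be done from the console with the source() function. The sketch below writes a small script to a temporary file so that it is self-contained; in practice the file would be one you created in the editor, and the names script_file and half are only illustrative.

```r
# Write two commands to a temporary script file
script_file <- tempfile(fileext = ".R")
writeLines(c("product <- 15.3 * 23.4",
             "half <- 0.5 * product"),
           script_file)

# Run every command in the file, in order
source(script_file)

# Variables created by the script are now available
half
```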

2.2.2 Working with RMarkdown, and knitr/LATEX

A third alternative is to take advantage of RStudio’s support
for reproducible research. If you already know LaTeX,
you will want to investigate the knitr/LaTeX capabilities.
For those who do not already know LaTeX, the simpler
RMarkdown system provides an easy entry into the
world of reproducible research methods. It also provides
a good facility for students to create homework and reports
that include text, R code, R output, and graphics.

To create a new RMarkdown file, select File, then New
File, then RMarkdown. The file will be opened with a short
template document that illustrates the markup language.

The mosaic package includes two useful RMarkdown
templates for getting started: fancy includes bells and
whistles (and is intended to give an overview of features),
while plain is useful as a starting point for a new analy-
sis. These are accessed using the Template option when
creating a new RMarkdown file.

Click on the Knit button to convert to an HTML, PDF,
or Word file.

This will generate a formatted version of the docu-
ment.

There is a button (marked with a question mark)
which provides a brief description of the supported markup
commands. The RStudio web site includes more extensive
tutorials on using RMarkdown.

Caution! RMarkdown and knitr/LaTeX files do not have
access to the console environment, so the code in them
must be self-contained.

It is important to remember that unlike R scripts,
which are executed in the console and have access to
the console environment, RMarkdown and knitr/LaTeX
files do not have access to the console environment. This
is a good feature because it forces the files to be self-
contained, which makes them transferable and respects
good reproducible research practices. But beginners, es-
pecially if they adopt a strategy of trying things out in the
console and copying and pasting successful code from the
console to their file, will often create files that are …

simpleR – Using R for Introductory Statistics

John Verzani


Preface
These notes are an introduction to using the statistical software package R for an introductory statistics course.

They are meant to accompany an introductory statistics book such as Kitchens “Exploring Statistics”. The goals
are not to show all the features of R, or to replace a standard textbook, but rather to be used with a textbook to
illustrate the features of R that can be learned in a one-semester, introductory statistics course.

These notes were written to take advantage of R version 1.5.0 or later. For pedagogical reasons the equals sign,
=, is used as an assignment operator and not the traditional arrow combination <-. Support for = as an assignment
operator was added to R in version 1.4.0. If only an older version is available the reader will have to make the minor adjustment.

There are several references to data and functions in this text that need to be installed prior to their use. To
install the data is easy, but the instructions vary depending on your system. For Windows users, you need to
download the “zip” file, and then install from the “packages” menu. In UNIX, one uses the command R CMD
INSTALL packagename.tar.gz. Some of the datasets are borrowed from other authors, notably Kitchens. Credit is
given in the help files for the datasets. This material is available as an R package from:

http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.4.zip for Windows users.
http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.4.tar.gz for UNIX users.

If necessary, the file can be sent in an email. As well, the individual data sets can be found online in the directory

http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple.

This is version 0.4 of these notes, which were last generated on August 22, 2002. Before printing these notes, you
should check for the most recent version available from

the CSI Math department (http://www.math.csi.cuny.edu/Statistics/R/simpleR).

Copyright © John Verzani, 2001-2. All rights reserved.

Contents

Introduction
What is R
A note on notation

Data
Starting R
Entering data with c
Data is a vector
Problems

Univariate Data
Categorical data
Numerical data
Problems

Bivariate Data
Handling bivariate categorical data
Handling bivariate data: categorical vs. numerical
Bivariate data: numerical vs. numerical
Linear regression
Problems

Multivariate Data
Storing multivariate data in data frames
Accessing data in data frames
Manipulating data frames: stack and unstack
Using R’s model formula notation
Ways to view multivariate data
The lattice package
Problems

Random Data
Random number generators in R – the “r” functions
Problems

Simulations
The central limit theorem
Using simple.sim and functions
Problems

Exploratory Data Analysis
Our toolbox
Examples
Problems

Confidence Interval Estimation
Population proportion theory
Proportion test
The z-test
The t-test
Confidence interval for the median
Problems

Hypothesis Testing
Testing a population parameter
Testing a mean
Tests for the median
Problems

Two-sample tests
Two-sample tests of proportion
Two-sample t-tests
Resistant two-sample tests
Problems

Chi Square Tests
The chi-squared distribution
Chi-squared goodness of fit tests
Chi-squared tests of independence
Chi-squared tests for homogeneity
Problems

Regression Analysis
Simple linear regression model
Testing the assumptions of the model
Statistical inference
Problems

Multiple Linear Regression
The model
Problems

Analysis of Variance
One-way analysis of variance
Problems

Appendix: Installing R

Appendix: External Packages

Appendix: A sample R session
A sample session involving regression
t-tests
A simulation example

Appendix: What happens when R starts?

Appendix: Using Functions
The basic template
For loops
Conditional expressions

Appendix: Entering Data into R
Using c
Using scan
Using scan with a file
Editing your data
Reading in tables of data
Fixed-width fields
Spreadsheet data
XML, urls
“Foreign” formats

Appendix: Teaching Tricks

Appendix: Sources of help, documentation

Section 1: Introduction

What is R

These notes describe how to use R while learning introductory statistics. The purpose is to allow this fine software
to be used in “lower-level” courses where often MINITAB, SPSS, Excel, etc. are used. It is expected that the reader
has had at least a pre-calculus course. It is hoped that students shown how to use R at this early level will better
understand the statistical issues and will ultimately benefit from the more sophisticated program despite its steeper
“learning curve”.

The benefits of R for an introductory student are

• R is free. R is open-source and runs on UNIX, Windows and Macintosh.
• R has an excellent built-in help system.
• R has excellent graphing capabilities.
• Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.
• R’s language has a powerful, easy to learn syntax with many built-in statistical functions.
• The language is easy to extend with user-written functions.
• R is a computer programming language. For programmers it will feel more familiar than others and for new
computer users, the next leap to programming will not be so large.

What is R lacking compared to other software solutions?

• It has a limited graphical interface (S-Plus has a good one). This means it can be harder to learn at the outset.
• There is no commercial support. (Although one can argue the international mailing list is even better.)
• The command language is a programming language, so students must learn to appreciate syntax issues etc.

R is an open-source (GPL) statistical environment modeled after S and S-Plus (http://www.insightful.com).

The S language was developed in the late 1980s at AT&T labs. The R project was started by Robert Gentleman and
Ross Ihaka of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread
audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer
developers. The R project web page

http://www.r-project.org

is the main site for information on R. At this site are directions for obtaining the software, accompanying packages
and other sources of documentation.

A note on notation

A few typographical conventions are used in these notes. These include different fonts for urls, R commands,
dataset names and different typesetting for

longer sequences of R commands.

and for

Data sets.

Section 2: Data

Statistics is the study of data. After learning how to start R, the first thing we need to be able to do is learn how
to enter data into R and how to manipulate the data once there.

Starting R

R is most easily used in an interactive manner. You ask it a question and R gives you an answer. Questions are
asked and answered on the command line. To start up R’s command line you can do the following: in Windows, find
the R icon and double-click; on Unix, type R at the command line. Other operating systems may have different
ways. Once R is started, you should be greeted with a message similar to

R : Copyright 2001, The R Development Core Team

Version 1.4.0 (2001-12-19)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type ‘license()’ or ‘licence()’ for distribution details.

R is a collaborative project with many contributors.

Type ‘contributors()’ for more information.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or

‘help.start()’ for a HTML browser interface to help.

Type ‘q()’ to quit R.

[Previously saved workspace restored]

>

The > is called the prompt. In what follows below it is not typed, but is used to indicate where you are to type if
you follow the examples. If a command is too long to fit on a line, a + is used for the continuation prompt.

Entering data with c

The most useful R command for quickly entering in small data sets is the c function. This function combines, or
concatenates terms together. As an example, suppose we have the following count of the number of typos per page
of these notes:

2 3 0 3 1 0 0 1

To enter this into an R session we do so with

> typos = c(2,3,0,3,1,0,0,1)

> typos

[1] 2 3 0 3 1 0 0 1

Notice a few things

• We assigned the values to a variable called typos

• The assignment operator is a =. This is valid as of R version 1.4.0. Previously it was (and still can be) a <-.
Both will be used, although, you should learn one and stick with it.

• The value of the typos doesn’t automatically print out. It does when we type just the name though as the last
input line indicates

• The value of typos is prefaced with a funny looking [1]. This indicates that the value is a vector. More on
that later.
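Because c combines its arguments, it accepts existing vectors as well as single numbers. The following is a small sketch with made-up variable names (morning, evening, all.day) to show the idea.

```r
morning = c(2, 3, 0)                 # typos found in the morning
evening = c(3, 1)                    # typos found in the evening
all.day = c(morning, evening, 0)     # vectors and single values can be mixed

all.day           # the values in the order they were combined
length(all.day)   # how many values there are
```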

Typing less

For many implementations of R you can save yourself a lot of typing if you learn that the arrow keys can be used
to retrieve your previous commands. In particular, each command is stored in a history and the up arrow will traverse
backwards along this history and the down arrow forwards. The left and right arrow keys will work as expected. This
combined with a mouse can make it quite easy to do simple editing of your previous commands.

Applying a function

R comes with many built in functions that one can apply to data such as typos. One of them is the mean function
for finding the mean or average of the data. To use it is easy

> mean(typos)

[1] 1.25

As well, we could call the median, or var to find the median or sample variance. The syntax is the same – the
function name followed by parentheses to contain the argument(s):

> median(typos)

[1] 1

> var(typos)

[1] 1.642857
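To see what mean and var are computing, the same numbers can be found from the textbook definitions. This is only a sketch for checking the built-in functions; the names n, by.hand.mean, and by.hand.var are illustrative.

```r
typos = c(2,3,0,3,1,0,0,1)
n = length(typos)

# mean: the sum divided by the number of values
by.hand.mean = sum(typos) / n

# sample variance: squared deviations from the mean, divided by n - 1
by.hand.var = sum((typos - by.hand.mean)^2) / (n - 1)

by.hand.mean
by.hand.var
all.equal(by.hand.var, var(typos))   # agrees with the built-in function
```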

Data is a vector

The data is stored in R as a vector. This means simply that it keeps track of the order that the data is entered in.
In particular there is a first element, a second element up to a last element. This is a good thing for several reasons:

• Our simple data vector typos has a natural order – page 1, page 2 etc. We wouldn’t want to mix these up.

• We would like to be able to make changes to the data item by item instead of having to enter in the entire data
set again.

• Vectors are also a mathematical object. There are natural extensions of mathematical concepts such as addition
and multiplication that make it easy to work with data when they are vectors.

Let’s see how these apply to our typos example. First, suppose these are the typos for the first draft of section 1
of these notes. We might want to keep track of our various drafts as the typos change. This could be done by the
following:

> typos.draft1 = c(2,3,0,3,1,0,0,1)

> typos.draft2 = c(0,3,0,3,1,0,0,1)

That is, the two typos on the first page were fixed. Notice the two different variable names. Unlike in many other
languages, the period is only used as punctuation in a name. In the versions of R these notes were written for, you can’t
use an _ (underscore) to punctuate names as you might in other programming languages, so the period is quite useful
(later releases of R do allow underscores in names). 1

Now, you might say, that is a lot of work to type in the data a second time. Can’t I just tell R to change the first
page? The answer of course is “yes”. Here is how

> typos.draft1 = c(2,3,0,3,1,0,0,1)

> typos.draft2 = typos.draft1 # make a copy

> typos.draft2[1] = 0 # assign the first page 0 typos

Now notice a few things. First, the comment character, #, is used to make comments. Basically anything after the
comment character is ignored (by R, hopefully not the reader). More importantly, the assignment to the first entry
in the vector typos.draft2 is done by referencing the first entry in the vector. This is done with square brackets [].
It is important to keep this in mind: parentheses () are for functions, and square brackets [] are for vectors (and
later arrays and lists). In particular, we have the following values currently in typos.draft2

> typos.draft2 # print out the value

[1] 0 3 0 3 1 0 0 1

> typos.draft2[2] # print 2nd page’s value

[1] 3

> typos.draft2[4] # 4th page

[1] 3

> typos.draft2[-4] # all but the 4th page

[1] 0 3 0 1 0 0 1

> typos.draft2[c(1,2,3)] # fancy, print 1st, 2nd and 3rd.

[1] 0 3 0

Notice negative indices give everything except these indices. The last example is very important. You can take more
than one value at a time by using another vector of index numbers. This is called slicing.
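A few more variations on slicing may help make the pattern clear. This sketch reuses the typos.draft2 values from above; the names of the results are made up for illustration.

```r
typos.draft2 = c(0,3,0,3,1,0,0,1)

middle = typos.draft2[2:4]            # pages 2 through 4
no.start = typos.draft2[-c(1, 2)]     # everything except pages 1 and 2
ends = typos.draft2[c(8, 1)]          # index order is respected: page 8, then page 1

middle
no.start
ends
```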

Okay, we need to work these notes into shape, let’s find the real bad pages. By inspection, we can notice that
pages 2 and 4 are a problem. Can we do this with R in a more systematic manner?

1 The underscore was originally used as assignment, so a name such as The_Data would actually assign the value of Data to the variable
The. The underscore is being phased out and the equals sign is being phased in.

> max(typos.draft2) # what are worst pages?

[1] 3 # 3 typos per page

> typos.draft2 == 3 # Where are they?

[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE

Notice, the usage of double equals signs (==). This tests all the values of typos.draft2 to see if they are equal to 3.
The 2nd and 4th answer yes (TRUE) the others no.

Think of this as asking R a question. Is the value equal to 3? R answers all at once with a long vector of TRUEs
and FALSEs.

Now the question is – how can we get the indices (pages) corresponding to the TRUE values? Let’s rephrase, which
indices have 3 typos? If you guessed that the command which will work, you are on your way to R mastery:

> which(typos.draft2 == 3)

[1] 2 4

Now, what if you didn’t think of the command which? You are not out of luck – but you will need to work harder.
The basic idea is to create a new vector 1 2 3 … keeping track of the page numbers, and then slicing off just the
ones for which typos.draft2==3:

> n = length(typos.draft2) # how many pages

> pages = 1:n # how we get the page numbers

> pages # pages is simply 1 to number of pages

[1] 1 2 3 4 5 6 7 8

> pages[typos.draft2 == 3] # logical extraction. Very useful

[1] 2 4

To create the vector 1 2 3 … we used the simple : colon operator. We could have typed this in, but this is a
useful thing to know. The command a:b is simply a, a+1, a+2, …, b if a,b are integers and intuitively defined
if not. A more general R function is seq(), which is a bit more typing. Try ?seq to see its options. To produce the
above try seq(a,b,1).
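The colon operator and seq() can be compared directly. In this sketch the values of a and b are arbitrary, and the by and length.out arguments are two of the options listed under ?seq.

```r
a = 3; b = 7

colon.way = a:b                          # 3 4 5 6 7
seq.way = seq(a, b, 1)                   # the same values using seq()
by.twos = seq(a, b, by = 2)              # step by 2 instead of 1
five.points = seq(0, 1, length.out = 5)  # 5 evenly spaced values from 0 to 1
non.integer = 1.5:4                      # starts at 1.5 and steps by 1 while not exceeding 4

all(colon.way == seq.way)
by.twos
five.points
non.integer
```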

Extracting elements of a vector using another vector of the same size comprised of TRUEs and
FALSEs is referred to as extraction by a logical vector. Notice this is different from extracting by page numbers
by slicing as we did before. Knowing how to use slicing and logical vectors gives you the ability to easily access your
data as you desire.
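Any comparison that produces TRUEs and FALSEs can be used for logical extraction, not just ==. The sketch below asks which pages still have typos; the name has.typos and the result names are made up for illustration.

```r
typos.draft2 = c(0,3,0,3,1,0,0,1)
pages = 1:length(typos.draft2)

has.typos = typos.draft2 > 0             # one TRUE or FALSE per page

bad.pages = pages[has.typos]             # page numbers with at least one typo
bad.counts = typos.draft2[has.typos]     # the typo counts on those pages
clean.pages = pages[!has.typos]          # ! flips TRUE and FALSE

bad.pages
bad.counts
clean.pages
```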

Of course, we could have done all the above at once with this command (but why?)

> (1:length(typos.draft2))[typos.draft2 == max(typos.draft2)]

[1] 2 4

This looks awful and is prone to typos and confusion, but does illustrate how things can be combined into short
powerful statements. This is an important point. To appreciate the use of R you need to understand how one composes
the output of one function or operation with the input of another. In mathematics we call this composition.

Finally, we might want to know how many typos we have, or how many pages still have typos to fix or what the
difference is between drafts? These can all be answered with mathematical functions. For these three questions we
have

> sum(typos.draft2) # How many typos?

[1] 8

> sum(typos.draft2>0) # How many pages with typos?

[1] 4

> typos.draft1 - typos.draft2 # difference between the two

[1] 2 0 0 0 0 0 0 0

Example: Keeping track of a stock; adding to the data

Suppose the daily closing price of your favorite stock for two weeks is

45,43,46,48,51,46,50,47,46,45

We can again keep track of this with R using a vector:

> x = c(45,43,46,48,51,46,50,47,46,45)

> mean(x) # the mean

[1] 46.7

> median(x) # the median

[1] 46

> max(x) # the maximum or largest value

[1] 51

> min(x) # the minimum value

[1] 43

This illustrates that many interesting functions can be found easily. Let’s see how we can do some others. First, let’s
add the next two weeks’ worth of data to x. This was

48,49,51,50,49,41,40,38,35,40

We can add this several ways.

> x = c(x,48,49,51,50,49) # append values to x

> length(x) # how long is x now (it was 10)

[1] 15

> x[16] = 41 # add to a specified index

> x[17:20] = c(40,38,35,40) # add to many specified indices

Notice, we did three different things to add to a vector. All are useful, so let’s explain. First we used the c (combine)
function to combine the previous value of x with the next week’s numbers. Then we assigned directly to the 16th
index. At the time of the assignment, x had only 15 indices, so this automatically created another one. Finally, we
assigned to a slice of indices. The latter makes some things very simple to do.
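One thing to watch for when assigning to an index past the end of a vector is that R fills any skipped positions with NA (the missing value). The sketch below uses a shortened price vector for illustration and also shows append(), a base R function that does the same job as c here.

```r
x = c(45, 43, 46)            # a short vector of 3 prices
x = append(x, c(48, 51))     # same result as c(x, 48, 51)
length(x)                    # now 5

x[8] = 50                    # skip indices 6 and 7
x                            # positions 6 and 7 are filled with NA
sum(is.na(x))                # how many values are missing
```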

R Basics: Graphical Data Entry Interfaces

There are some other ways to edit data that use a spreadsheet interface. These may be preferable to some
students. Here are examples with annotations

> data.entry(x) # Pops up spreadsheet to edit data

> x = de(x) # same only, doesn’t save changes

> x = edit(x) # uses editor to edit x.

All are easy to use. The main confusion is that the variable x needs to be defined previously. For example

> data.entry(x) # fails. x not defined

Error in de(…, Modes = Modes, Names = Names) :

Object “x” not found

> data.entry(x=c(NA)) # works, x is defined as we go.

Other data entry methods are discussed in the appendix on entering data.
Before we leave this example, let’s see how we can apply some other functions to the data. Here are a few examples.
The moving average simply means to average over some previous number of days. Suppose we want the 5-day
moving average (50-day or 100-day is more often used). Here is one way to do so. We can do this for days 5 through
20, as the earlier days don’t have enough data.

> day = 5

> mean(x[(day-4):day])

[1] 46.6

The trick is that the slice takes out days 1,2,3,4,5

> (day-4):day

[1] 1 2 3 4 5

and the mean takes just those values of x.
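
To get the moving average for every day that has enough history, the same kind of slice can be applied to each day in turn. The following is a sketch using sapply. It assumes a trailing window (each day averaged with the four days before it) and the 20 closing prices entered above. The names k, days and moving.avg are introduced here for illustration.

```r
x = c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45,
      48, 49, 51, 50, 49, 41, 40, 38, 35, 40)

k = 5                            # window length in days
days = k:length(x)               # days 5 through 20
moving.avg = sapply(days, function(day) mean(x[(day - k + 1):day]))
names(moving.avg) = days         # label each average with its day
moving.avg
```

Changing k gives a moving average of a different length, provided the vector holds at least k values.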
What is the maximum value of the stock? This is easy to answer with max(x). However, you may be interested
in a running maximum, or the largest value to date. This too is easy, if you know that R has a built-in function to
handle this. It is called cummax, and it takes the cumulative maximum. Here is the result for our 4 weeks’ worth
of data, along with the similar cummin:

> cummax(x) # running maximum

[1] 45 45 46 48 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51

> cummin(x) # running minimum

[1] 45 43 43 43 43 43 43 43 43 43 43 43 43 43 43 41 40 38 35 35
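
One use of the running maximum is to measure how far the price has fallen from its best value so far. Here is a sketch, assuming the same 20 closing prices. The name fall is introduced here for illustration.

```r
x = c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45,
      48, 49, 51, 50, 49, 41, 40, 38, 35, 40)

fall = cummax(x) - x    # distance below the highest close so far
max(fall)               # largest fall from a previous high
which.max(fall)         # day on which that largest fall occurs
```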

Example: Working with mathematics

R makes it easy to translate mathematics in a natural way once your data is read in. For example, suppose the
yearly number of whales beached in Texas during the period 1990 to 1999 is

74 122 235 111 292 111 211 133 156 79

What is the mean, the variance, the standard deviation? Again, R makes these easy to answer:

> whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

> mean(whale)

[1] 152.4

> var(whale)

[1] 5113.378

> std(whale)

Error: couldn’t find function “std”

> sqrt(var(whale))

[1] 71.50789

> sqrt( sum( (whale - mean(whale))^2 /(length(whale)-1)))

[1] 71.50789

Well, almost! First, one needs to remember the names of the functions. In this case mean is easy to guess, var
is fairly obvious but less so, and std is also a natural guess, but guess what? It isn’t there! So some other things were
tried. First, we remember that the standard deviation is the square root of the variance. Finally, the last line illustrates
that R can almost exactly mimic the mathematical formula for the standard deviation:

$$\mathrm{SD}(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}.$$

Notice the sum is now sum, X̄ is mean(whale) and length(whale) is used instead of n.
Of course, it might be nice to have this available as a built-in function. Since this example is so easy, let’s see how
it is done:

> std = function(x) sqrt(var(x))

> std(whale)

[1] 71.50789

The ease of defining your own functions is a very appealing feature of R that we will return to.
Finally, if we had thought a little harder we might have found the actual built-in sd() command, which gives

> sd(whale)

[1] 71.50789
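
The formula for the standard deviation can also be unpacked one step at a time, which makes it easier to see how each piece of the mathematics maps onto R. Here is a sketch using the whale data. The intermediate names n, deviations and my.sd are introduced here for illustration.

```r
whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

n = length(whale)                   # n in the formula
deviations = whale - mean(whale)    # X_i minus the mean, for each year
my.sd = sqrt(sum(deviations^2) / (n - 1))

all.equal(my.sd, sd(whale))         # TRUE: agrees with the built-in
```

Note that the subtraction whale - mean(whale) is done element by element, so no explicit loop over i is needed.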

R Basics: Accessing Data

There are several ways to extract data from a vector. Here is a summary using both slicing and extraction by
a logical vector. Suppose x is the data vector, for example x=1:10.

how many elements?                      length(x)
ith element                             x[2]                          (i = 2)
all but ith element                     x[-2]                         (i = 2)
first k elements                        x[1:5]                        (k = 5)
last k elements                         x[(length(x)-4):length(x)]    (k = 5)
specific elements                       x[c(1,3,5)]                   (1st, 3rd and 5th)
all greater than some value             x[x>3]                        (the value is 3)
bigger than or less than some values    x[ x< -2 | x > 2]
which indices are largest               which(x == max(x))
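
The entries in this summary can be tried directly. Here is a sketch with x = 1:10, with the result of each line noted in its comment. The slice for the last k elements starts at length(x)-k+1 so that exactly k elements are returned. tail is a base R function not listed in the summary, shown here as an alternative.

```r
x = 1:10
length(x)                       # 10
x[2]                            # 2
x[-2]                           # 1 3 4 5 6 7 8 9 10
x[1:5]                          # 1 2 3 4 5
x[(length(x)-4):length(x)]      # 6 7 8 9 10
tail(x, 5)                      # 6 7 8 9 10
x[c(1,3,5)]                     # 1 3 5
x[x > 3]                        # 4 5 6 7 8 9 10
x[x < -2 | x > 2]               # 3 4 5 6 7 8 9 10
which(x == max(x))              # 10
```

The space in x < -2 matters: written without it, x<-2 would be read as an assignment of 2 to x.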

Problems

2.1 Suppose you keep track of your mileage each time you fill up. At your last 8 fill-ups the mileage was

65311 65624 65908 66219 66499 66821 67145 67447

Enter these numbers into R. Use the function diff on the data. What does it give?

> miles = c(65311, 65624, 65908, 66219, 66499, 66821, 67145, 67447)

> x = diff(miles)

You should see the number of miles between fill-ups. Use the max …