A Complete Guide to the Built-in Datasets in R

R is a widely used open-source programming language for statistical computing, data analytics, data visualization, and machine learning, and it is also used in fields such as data mining and bioinformatics. R's packages give users access to many additional functions and tools. Alongside these, R ships with a set of built-in datasets that cover a wide range of fields, from biology to social records. If you are new to R programming, you can use these datasets to practice the language and to try out various operations and visualizations.

Check the article on R Tutorial | Learn R Programming Language for a better understanding of R programming.

Built-in Datasets in R

There are several built-in datasets in R. These datasets are useful for beginners to practice model building, visualization, and other data analytic operations. To check the list of built-in datasets in R, run the following command in the R console.

data()

Output:

Data sets in package ‘datasets’:

AirPassengers        Monthly Airline Passenger Numbers
                     1949-1960
BJsales              Sales Data with Leading Indicator
BJsales.lead (BJsales)
                     Sales Data with Leading Indicator
BOD                  Biochemical Oxygen Demand
CO2                  Carbon Dioxide Uptake in Grass Plants
ChickWeight          Weight versus age of chicks on different
                     diets
DNase                Elisa assay of DNase
EuStockMarkets       Daily Closing Prices of Major European
                     Stock Indices, 1991-1998
Formaldehyde         Determination of Formaldehyde
HairEyeColor         Hair and Eye Color of Statistics Students
Harman23.cor         Harman Example 2.3
Harman74.cor         Harman Example 7.4
Indometh             Pharmacokinetics of Indomethacin
InsectSprays         Effectiveness of Insect Sprays
JohnsonJohnson       Quarterly Earnings per Johnson & Johnson
                     Share
LakeHuron            Level of Lake Huron 1875-1972
LifeCycleSavings     Intercountry Life-Cycle Savings Data
Loblolly             Growth of Loblolly pine trees
Nile                 Flow of the River Nile
Orange               Growth of Orange Trees
OrchardSprays        Potency of Orchard Sprays
PlantGrowth          Results from an Experiment on Plant Growth
Puromycin            Reaction Velocity of an Enzymatic Reaction
Seatbelts            Road Casualties in Great Britain 1969-84
Theoph               Pharmacokinetics of Theophylline
Titanic              Survival of passengers on the Titanic
ToothGrowth          The Effect of Vitamin C on Tooth Growth in
                     Guinea Pigs
UCBAdmissions        Student Admissions at UC Berkeley
UKDriverDeaths       Road Casualties in Great Britain 1969-84
UKgas                UK Quarterly Gas Consumption
USAccDeaths          Accidental Deaths in the US 1973-1978
USArrests            Violent Crime Rates by US State
USJudgeRatings       Lawyers' Ratings of State Judges in the US
                     Superior Court
USPersonalExpenditure
                     Personal Expenditure Data
UScitiesD            Distances Between European Cities and
                     Between US Cities
VADeaths             Death Rates in Virginia (1940)
WWWusage             Internet Usage per Minute
WorldPhones          The World's Telephones
ability.cov          Ability and Intelligence Tests
airmiles             Passenger Miles on Commercial US Airlines,
                     1937-1960
airquality           New York Air Quality Measurements
anscombe             Anscombe's Quartet of 'Identical' Simple
                     Linear Regressions
attenu               The Joyner-Boore Attenuation Data
attitude             The Chatterjee-Price Attitude Data
austres              Quarterly Time Series of the Number of
                     Australian Residents
beaver1 (beavers)    Body Temperature Series of Two Beavers
beaver2 (beavers)    Body Temperature Series of Two Beavers
cars                 Speed and Stopping Distances of Cars
chickwts             Chicken Weights by Feed Type
co2                  Mauna Loa Atmospheric CO2 Concentration
crimtab              Student's 3000 Criminals Data
discoveries          Yearly Numbers of Important Discoveries
esoph                Smoking, Alcohol and (O)esophageal Cancer
euro                 Conversion Rates of Euro Currencies
euro.cross (euro)    Conversion Rates of Euro Currencies
eurodist             Distances Between European Cities and
                     Between US Cities
faithful             Old Faithful Geyser Data
fdeaths (UKLungDeaths)
                     Monthly Deaths from Lung Diseases in the
                     UK
freeny               Freeny's Revenue Data
freeny.x (freeny)    Freeny's Revenue Data
freeny.y (freeny)    Freeny's Revenue Data
infert               Infertility after Spontaneous and Induced
                     Abortion
iris                 Edgar Anderson's Iris Data
iris3                Edgar Anderson's Iris Data
islands              Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
                     Monthly Deaths from Lung Diseases in the
                     UK
lh                   Luteinizing Hormone in Blood Samples
longley              Longley's Economic Regression Data
lynx                 Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
                     Monthly Deaths from Lung Diseases in the
                     UK
morley               Michelson Speed of Light Data
mtcars               Motor Trend Car Road Tests
nhtemp               Average Yearly Temperatures in New Haven
nottem               Average Monthly Temperatures at
                     Nottingham, 1920-1939
...

Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.

These datasets are provided by the datasets package and are what is commonly referred to as the built-in datasets in R. The package contains several popular datasets that we will discuss later. To list the datasets available in all the packages installed in your R environment, not just the datasets package, run the following command.

data(package = .packages(all.available = TRUE))

Output:

Data sets in package ‘ade4’:

abouheif.eg          Phylogenies and quantitative traits from
                     Abouheif
acacia               Spatial pattern analysis in plant
                     communities
aminoacyl            Codon usage
apis108              Allelic frequencies in ten honeybees
                     populations at eight microsatellites loci
aravo                Distribution of Alpine plants in Aravo
                     (Valloire, France)
ardeche              Fauna Table with double (row and column)
                     partitioning
arrival              Arrivals at an intensive care unit
atlas                Small Ecological Dataset
atya                 Genetic variability of Cacadors
avijons              Bird species distribution
avimedi              Fauna Table for Constrained Ordinations
aviurba              Ecological Tables Triplet
bacteria             Genomes of 43 Bacteria
banque               Table of Factors
baran95              African Estuary Fishes
bf88                 Cubic Ecological Data
bordeaux             Wine Tasting
bsetal97             Ecological and Biological Traits
buech                Buech basin
butterfly            Genetics-Ecology-Environment Triple
capitales            Road Distances
carni19              Phylogeny and quantative trait of
                     carnivora
carni70              Phylogeny and quantitative traits of
                     carnivora

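As you can see, we now get the datasets from all the packages installed in R. Which packages appear depends on what is installed; on the system used here they include ‘ade4’ (shown above), ‘ape’, ‘bit64’, ‘boot’, and more. The list also includes the datasets in the ‘datasets’ package.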
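The printed listing can get very long, so it may help to summarize it. The sketch below assumes the object returned by data() has a results matrix with the columns Package, LibPath, Item, and Title, and counts the entries per package. The variable names all_data, results, and per_package are illustrative, and the counts depend entirely on what is installed locally.

```r
# Query every installed package; some packages may emit warnings here,
# so they are suppressed to keep the output readable.
all_data <- suppressWarnings(data(package = .packages(all.available = TRUE)))

# `results` is a character matrix with one row per dataset entry.
results <- all_data$results

# Number of dataset entries contributed by each package, largest first.
per_package <- sort(table(results[, "Package"]), decreasing = TRUE)
print(head(per_package, 10))
```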

Count the Number of Datasets

There is no dedicated function that returns the number of datasets available in R. Instead of counting them by hand, we can do the following:

Get the list of datasets.
Store the list in a variable.
Get the length of the variable and print it.

Let’s check the number of datasets available under the datasets package.

# List the dataset entries provided by the 'datasets' package
listofdata <- data(package = "datasets")$results[, "Item"]

# Count the number of datasets
len <- length(listofdata)
print(len)

Output:

[1] 104

The output is 104, which means the datasets package lists 104 dataset entries in this R installation. The exact number can differ between R versions.
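The same three steps can be wrapped in a small function so the count can be repeated for any package. This is only a sketch: count_datasets is a hypothetical helper name, not a function that ships with R, and it counts listing entries, so items such as beaver1 and beaver2 are counted separately.

```r
# Hypothetical helper: count the dataset entries that a single package lists.
count_datasets <- function(pkg) {
  items <- data(package = pkg)$results[, "Item"]
  length(items)
}

# Count the entries of the 'datasets' package; the value depends on the R version.
n_datasets <- count_datasets("datasets")
print(n_datasets)
```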

Popular built-in Datasets in R

Several of the built-in datasets in R are well known among R programmers and are widely used for learning and testing. The following are a few of the most commonly used ones.

iris: This is probably the best-known built-in dataset in R. It is a classic dataset containing sepal and petal measurements for flowers from 3 species of iris. The measurements were collected by Edgar Anderson, and the dataset was made famous by Sir Ronald Fisher, one of the most influential statisticians and biologists of the 20th century. It is commonly used for data analysis and classification.
mtcars: This dataset contains data on car models from 1973-74. It has 32 rows, one for each of 32 different cars, and 11 characteristics, including the number of cylinders, horsepower, etc.
airquality: This dataset holds daily air quality records for New York City in 1973. It has 6 columns, including Ozone, Solar.R, Wind, and Temp, and 153 observations.
USArrests: This is another well-known dataset in R. It contains statistics on arrests for violent crimes in the United States in 1973, with one observation for each of the 50 states. It is commonly used for descriptive statistics.
AirPassengers: This is a classic time series dataset containing the monthly number of international airline passengers from 1949 to 1960. It is widely used for time series analysis, forecasting, and modeling.
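The sizes quoted above are easy to check in the console. The following is a minimal sketch that uses only base R functions and assumes the default datasets package is attached, as it is in a normal R session.

```r
# Rows and columns of the data-frame datasets described above.
print(dim(mtcars))       # one row per car
print(dim(airquality))   # one row per daily reading
print(dim(USArrests))    # one row per US state

# AirPassengers is a monthly time series (class "ts"), not a data frame.
print(class(AirPassengers))
print(start(AirPassengers))      # year and month of the first observation
print(end(AirPassengers))        # year and month of the last observation
print(frequency(AirPassengers))  # observations per year
```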

Let’s pick one of the datasets above and perform some basic data operations on it. In this example, we will see how to access a built-in dataset and run some analysis and visualizations.

Load the Dataset

For this example, let’s use the iris dataset. To load it, run the following commands:

data("iris")
head(iris)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Passing the dataset name to the data() function gives us access to the built-in dataset. The head() function takes the dataset and shows its first 6 rows. Similarly, we can use the tail() function to get the last few rows.

tail(iris)

Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

As you can see, we get the last 6 rows of the iris dataset.
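Both head() and tail() accept an n argument when six rows are not what you need. The sketch below shows a few variations; the variable names first_three and last_two are only illustrative.

```r
# iris is usable directly because the 'datasets' package is attached by default.
first_three <- head(iris, n = 3)   # first 3 rows instead of the default 6
last_two    <- tail(iris, n = 2)   # last 2 rows

print(first_three)
print(last_two)

# A negative n drops rows instead: everything except the last 145 rows.
print(head(iris, n = -145))
```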

Analyze the dataset

Let’s check the number of rows and columns available in the dataset.

dim(iris)

Output:

[1] 150   5

We can see from the output that the iris dataset contains 150 observations on 5 attributes.

Check the attribute names (column names) of the dataset:

names(iris)

Output:

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"

The output returns the names of the attributes in the iris dataset: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”. The “Species” column holds three different species.
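Besides the names, it is often useful to know the type of each column. A minimal sketch, using the base functions sapply(), class(), and str(), with col_classes as an illustrative variable name:

```r
# Class of every column: four numeric measurements and one factor.
col_classes <- sapply(iris, class)
print(col_classes)

# str() gives a compact overview of dimensions, types, and first values.
str(iris)
```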

To check the species names we can use

unique(iris$Species)

Output:

[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

This shows the 3 different species in the dataset which are setosa, versicolor, and virginica.

Let’s get a summary of the whole dataset

summary(iris)

Output:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

This summary gives a good overview of the iris dataset: for each numeric column we get the minimum, first quartile, median, mean, third quartile, and maximum. For the Species attribute, we can see that each species has 50 observations, so the three classes are balanced.
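summary() pools all three species together. To compare the species with each other, one option is a per-group summary with aggregate(); the sketch below computes the mean of every measurement within each species, and species_means is just an illustrative name.

```r
# Mean of every numeric column within each species.
# The "." on the left of the formula stands for all columns other than Species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Replacing mean with another function such as median or sd gives the corresponding per-species statistic.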

Visualize the Dataset

Visualize the dataset using a scatterplot, which displays individual data points on a 2D coordinate system. Let’s use R's plot() function to build a scatterplot to better understand the relationship between Sepal Length and Sepal Width.

plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal Length vs. Sepal Width",
     xlab = "Sepal Length", ylab = "Sepal Width", col = iris$Species)

Output:

[Figure: scatterplot of Sepal Length vs. Sepal Width, colored by species]

The plot above shows how the data points are distributed. The three species are shown in three different colors.
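The plot does not say which color belongs to which species. One way to add that information is a legend; the sketch below redraws the same scatterplot and assumes the colors come from the factor's integer codes (1, 2, 3), which is how col = iris$Species is interpreted. The variable name species_levels is illustrative.

```r
species_levels <- levels(iris$Species)

plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "Sepal Length vs. Sepal Width",
     xlab = "Sepal Length", ylab = "Sepal Width",
     col = iris$Species)

# col = iris$Species uses the factor's integer codes as palette indices,
# so the legend reuses the same indices in the same order.
legend("topright", legend = species_levels,
       col = seq_along(species_levels), pch = 1)
```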

Let’s create a histogram to see the distribution of the Petal Length values. We will use R's hist() function; inside the function we specify the data attribute, the plot title, the axis labels, and the bar color.

hist(iris$Petal.Length, 
     main = "Histogram of Petal Length", 
     xlab = "Petal Length", 
     ylab = "Frequency",
     col = "lightblue")

Output:

[Figure: histogram of Petal Length]
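The shape of a histogram depends on the bin width, so it can be worth trying a finer binning. The sketch below passes breaks = 20, which R treats as a suggestion rather than an exact bin count, and keeps the object that hist() returns so the bin edges and counts can be inspected; h is an illustrative variable name.

```r
# breaks = 20 is a suggestion; R may pick a nearby "pretty" number of bins.
h <- hist(iris$Petal.Length,
          breaks = 20,
          main = "Histogram of Petal Length (finer bins)",
          xlab = "Petal Length",
          col = "lightblue")

# The returned object holds the bin edges and the count in each bin.
print(h$breaks)
print(h$counts)
```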

Conclusion

The built-in datasets give beginners a convenient way to learn R programming and to try out different formulas and models on real data. In this article, you have seen some of the best-known built-in datasets available in R. We then learned how to access a dataset and perform various analyses, operations, and visualizations with it.

Referred: https://www.geeksforgeeks.org
