R is a widely used open-source programming language for statistical computing, data analytics, data visualization, and machine learning. It is also used in fields such as data mining and bioinformatics. R is extended through packages, which give users access to additional functions and tools. Along with these, R ships with a set of pre-built datasets that cover a wide range of fields, from biology to social records. If you are new to R programming, you can use these datasets to practice, performing various operations and visualizations on them.
Check the article on R Tutorial | Learn R Programming Language for a better understanding of R programming.
Built-in Datasets in R
There are several built-in datasets in R. These datasets are useful for beginners to practice model building, visualization, and other data analysis operations. To see the list of built-in datasets in R, run the following command in the R console.
R
data()
Output:
Data sets in package ‘datasets’:
AirPassengers            Monthly Airline Passenger Numbers 1949-1960
BJsales                  Sales Data with Leading Indicator
BJsales.lead (BJsales)   Sales Data with Leading Indicator
BOD                      Biochemical Oxygen Demand
CO2                      Carbon Dioxide Uptake in Grass Plants
ChickWeight              Weight versus age of chicks on different diets
DNase                    Elisa assay of DNase
EuStockMarkets           Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde             Determination of Formaldehyde
HairEyeColor             Hair and Eye Color of Statistics Students
Harman23.cor             Harman Example 2.3
Harman74.cor             Harman Example 7.4
Indometh                 Pharmacokinetics of Indomethacin
InsectSprays             Effectiveness of Insect Sprays
JohnsonJohnson           Quarterly Earnings per Johnson & Johnson Share
LakeHuron                Level of Lake Huron 1875-1972
LifeCycleSavings         Intercountry Life-Cycle Savings Data
Loblolly                 Growth of Loblolly pine trees
Nile                     Flow of the River Nile
Orange                   Growth of Orange Trees
OrchardSprays            Potency of Orchard Sprays
PlantGrowth              Results from an Experiment on Plant Growth
Puromycin                Reaction Velocity of an Enzymatic Reaction
Seatbelts                Road Casualties in Great Britain 1969-84
Theoph                   Pharmacokinetics of Theophylline
Titanic                  Survival of passengers on the Titanic
ToothGrowth              The Effect of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions            Student Admissions at UC Berkeley
UKDriverDeaths           Road Casualties in Great Britain 1969-84
UKgas                    UK Quarterly Gas Consumption
USAccDeaths              Accidental Deaths in the US 1973-1978
USArrests                Violent Crime Rates by US State
USJudgeRatings           Lawyers' Ratings of State Judges in the US Superior Court
USPersonalExpenditure    Personal Expenditure Data
UScitiesD                Distances Between European Cities and Between US Cities
VADeaths                 Death Rates in Virginia (1940)
WWWusage                 Internet Usage per Minute
WorldPhones              The World's Telephones
ability.cov              Ability and Intelligence Tests
airmiles                 Passenger Miles on Commercial US Airlines, 1937-1960
airquality               New York Air Quality Measurements
anscombe                 Anscombe's Quartet of 'Identical' Simple Linear Regressions
attenu                   The Joyner-Boore Attenuation Data
attitude                 The Chatterjee-Price Attitude Data
austres                  Quarterly Time Series of the Number of Australian Residents
beaver1 (beavers)        Body Temperature Series of Two Beavers
beaver2 (beavers)        Body Temperature Series of Two Beavers
cars                     Speed and Stopping Distances of Cars
chickwts                 Chicken Weights by Feed Type
co2                      Mauna Loa Atmospheric CO2 Concentration
crimtab                  Student's 3000 Criminals Data
discoveries              Yearly Numbers of Important Discoveries
esoph                    Smoking, Alcohol and (O)esophageal Cancer
euro                     Conversion Rates of Euro Currencies
euro.cross (euro)        Conversion Rates of Euro Currencies
eurodist                 Distances Between European Cities and Between US Cities
faithful                 Old Faithful Geyser Data
fdeaths (UKLungDeaths)   Monthly Deaths from Lung Diseases in the UK
freeny                   Freeny's Revenue Data
freeny.x (freeny)        Freeny's Revenue Data
freeny.y (freeny)        Freeny's Revenue Data
infert                   Infertility after Spontaneous and Induced Abortion
iris                     Edgar Anderson's Iris Data
iris3                    Edgar Anderson's Iris Data
islands                  Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)   Monthly Deaths from Lung Diseases in the UK
lh                       Luteinizing Hormone in Blood Samples
longley                  Longley's Economic Regression Data
lynx                     Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)   Monthly Deaths from Lung Diseases in the UK
morley                   Michelson Speed of Light Data
mtcars                   Motor Trend Car Road Tests
nhtemp                   Average Yearly Temperatures in New Haven
nottem                   Average Monthly Temperatures at Nottingham, 1920-1939
...
Use ‘data(package = .packages(all.available = TRUE))’ to list the data sets in all *available* packages.
These datasets belong to the datasets package and are what is commonly referred to as the built-in datasets in R. The package contains some popular datasets that we will discuss later. To list the datasets available in all the installed packages of your R environment, run the following command.
R
data(package = .packages(all.available = TRUE))
Output:
Data sets in package ‘ade4’:
abouheif.eg   Phylogenies and quantitative traits from Abouheif
acacia        Spatial pattern analysis in plant communities
aminoacyl     Codon usage
apis108       Allelic frequencies in ten honeybees populations at eight microsatellites loci
aravo         Distribution of Alpine plants in Aravo (Valloire, France)
ardeche       Fauna Table with double (row and column) partitioning
arrival       Arrivals at an intensive care unit
atlas         Small Ecological Dataset
atya          Genetic variability of Cacadors
avijons       Bird species distribution
avimedi       Fauna Table for Constrained Ordinations
aviurba       Ecological Tables Triplet
bacteria      Genomes of 43 Bacteria
banque        Table of Factors
baran95       African Estuary Fishes
bf88          Cubic Ecological Data
bordeaux      Wine Tasting
bsetal97      Ecological and Biological Traits
buech         Buech basin
butterfly     Genetics-Ecology-Environment Triple
capitales     Road Distances
carni19       Phylogeny and quantative trait of carnivora
carni70       Phylogeny and quantitative traits of carnivora
As you can see, the output now lists the built-in datasets from all installed packages that provide data, such as ‘ade4’, ‘ape’, ‘bit64’, ‘boot’, and more. The exact list depends on which packages are installed. It also includes the datasets in the ‘datasets’ package.
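To see only which packages contribute datasets, rather than every individual entry, the package names can be pulled out of the same index. This is a minimal sketch: the variable names `all_idx` and `pkgs_with_data` are illustrative, and the result depends entirely on which packages are installed on your machine.

```r
# Build the dataset index for every installed package.
# suppressWarnings() hides notices that R may emit while scanning
# packages (for example about datasets moved into 'datasets').
all_idx <- suppressWarnings(
  data(package = .packages(all.available = TRUE))$results
)

# The "Package" column records which package each dataset belongs to
pkgs_with_data <- sort(unique(all_idx[, "Package"]))
print(pkgs_with_data)
```

On a large library this scan can take a few seconds, because every installed package is inspected.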
Count the Number of Datasets
There is no dedicated function that returns the number of datasets available in R. We can either count the datasets manually or do the following:
- Get the list of datasets.
- Store the list in a variable.
- Get the length of the variable and print it.
Let’s check the number of datasets available under the datasets package.
R
# List the datasets in the 'datasets' package
listofdata <- data(package = "datasets")$results[, "Item"]
# Count the number of datasets
len <- length(listofdata)
print(len)
Output:
[1] 104
The output is 104, which means that 104 dataset entries are listed for the datasets package. The exact number can differ between R versions.
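The object returned by `data()` holds more than the item names. As a rough sketch, its `results` component is a character matrix that can be turned into a small searchable catalog; the names `idx` and `catalog` and the search keyword are illustrative choices.

```r
# The index for the 'datasets' package is a character matrix
idx <- data(package = "datasets")$results
print(colnames(idx))   # "Package" "LibPath" "Item" "Title"

# Keep the name and description of each dataset as a data frame
catalog <- as.data.frame(idx[, c("Item", "Title")], stringsAsFactors = FALSE)
print(head(catalog, 3))

# Search the descriptions by keyword, here for time series data
ts_sets <- catalog[grepl("Time Series", catalog$Title, ignore.case = TRUE), ]
print(ts_sets)
```

Some entries in the `Item` column have the form `name (file)`, such as `beaver1 (beavers)`, which means several objects are stored in one data file.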
Popular Built-in Datasets in R
Several of the built-in datasets are well known among R programmers and are widely used for learning and testing. The following are a few of the most commonly used ones.
- iris: This is the most famous built-in dataset in R. It is a classic dataset that contains four measurements (sepal length, sepal width, petal length, and petal width) for 50 flowers from each of 3 iris species. The data were collected by Edgar Anderson and made famous by the statistician Sir Ronald Fisher, who used them in his 1936 paper on discriminant analysis. The dataset is commonly used for data analysis and classification.
- mtcars: This dataset, taken from the 1974 Motor Trend magazine, describes 32 car models from 1973-74. It has 32 rows, one per car, and 11 columns of characteristics, including miles per gallon, number of cylinders, and horsepower.
- airquality: This dataset contains daily air quality measurements in New York from May to September 1973. It has 6 columns (Ozone, Solar.R, Wind, Temp, Month, and Day) and 153 observations.
- USArrests: This is another well-known dataset in R. It records the number of arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas. It is commonly used for descriptive statistics.
- AirPassengers: This is a classic time series dataset that contains the monthly totals of international airline passengers from 1949 to 1960. It is frequently used for time series analysis, forecasting, and modeling.
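The datasets listed above do not all share the same structure: most are data frames, while AirPassengers is a time series object. The following sketch builds a small overview table to check this; the helper variable names `popular` and `overview` are illustrative.

```r
# Names of the datasets described above
popular <- c("iris", "mtcars", "airquality", "USArrests", "AirPassengers")

# For each dataset, record its class and its size.
# NROW() and NCOL() also work for objects without dimensions, such as
# a univariate time series, which they treat as a single column.
overview <- data.frame(
  dataset = popular,
  class   = vapply(popular, function(n) class(get(n))[1], character(1)),
  rows    = vapply(popular, function(n) NROW(get(n)), integer(1)),
  cols    = vapply(popular, function(n) NCOL(get(n)), integer(1)),
  row.names = NULL
)
print(overview)
```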
Let’s pick one of the datasets above and perform some basic data operations on it. In this example, we will see how to access a built-in dataset and run some simple analysis and visualizations.
Load the Dataset
For this example, let’s use the iris dataset. To load it and display its first rows, run the following commands.
R
data(iris)
head(iris)
Output:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Passing the dataset name to the data() function gives us access to the built-in dataset. The head() function takes the dataset and shows its first 6 rows. Similarly, we can use the tail() function to get the last few rows.
R
tail(iris)
Output:
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
As you can see, we get the last 6 rows of the iris dataset.
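Both head() and tail() accept an `n` argument that controls how many rows are returned. A short sketch, with illustrative variable names:

```r
# Ask for a specific number of rows instead of the default 6
first3 <- head(iris, n = 3)
last3  <- tail(iris, n = 3)
print(first3)
print(last3)

# A negative n means "all rows except that many":
# here, everything except the last 145 rows, i.e. rows 1 to 5
first5 <- head(iris, n = -145)
print(first5)
```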
Analyze the Dataset
Let’s check the number of rows and columns in the dataset.
R
dim(iris)
Output:
[1] 150   5
The output shows that the iris dataset contains 150 observations of 5 attributes.
Check the attribute names (column names) of the dataset.
R
names(iris)
Output:
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
[5] "Species"
The output lists the attribute names of the iris dataset: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”. The “Species” column contains three different species.
To check the species names, we can use the following command.
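Besides the column names, it is often useful to know the type of each column. A minimal sketch using str() and class(); the variable name `col_types` is illustrative.

```r
# Compact overview: dimensions, column types, and the first few values
str(iris)

# Class of each column as a named character vector
col_types <- sapply(iris, class)
print(col_types)
```

The four measurement columns are numeric, while Species is a factor, which is why it is summarized by counts rather than by quantiles.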
R
unique(iris$Species)
Output:
[1] setosa     versicolor virginica
Levels: setosa versicolor virginica
This shows the 3 different species in the dataset, which are setosa, versicolor, and virginica.
Let’s get a summary of the whole dataset.
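Because Species is stored as a factor, its categories can also be read with levels(), and table() counts how many observations fall into each one. A short sketch with illustrative variable names:

```r
# Category names as a plain character vector
species_levels <- levels(iris$Species)
print(species_levels)

# Number of observations in each category
species_counts <- table(iris$Species)
print(species_counts)
```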
R
summary(iris)
Output:
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
The summary gives a quick overview of the iris dataset: for each numeric column we get the minimum, first quartile, median, mean, third quartile, and maximum. For the Species attribute, we can see that each species has 50 observations, so the three classes are equally represented.
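The summary above describes the dataset as a whole. To compare the species with each other, the same statistics can be computed per group. The following is a minimal sketch using aggregate() from base R; the variable name `species_means` is illustrative.

```r
# Mean of every numeric column, computed separately for each species.
# The formula ". ~ Species" means "all other columns, grouped by Species".
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Replacing `mean` with another function such as `median` or `sd` gives the corresponding per-species statistic.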
Visualize the Dataset
A scatterplot displays individual data points on a 2D coordinate system. Let’s use R’s plot() function to build a scatterplot to better understand the relationship between Sepal Length and Sepal Width.
R
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal Length vs. Sepal Width",
xlab = "Sepal Length", ylab = "Sepal Width", col = iris$Species)
Output:
The plot above shows how the data points are distributed. The three species are drawn in three different colors.
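The scatterplot does not say which color belongs to which species. One way to make this explicit is to add a legend, as in the sketch below. It assumes the default color palette: passing the Species factor as `col` uses its integer codes 1 to 3, so the legend uses the same codes. The variable name `species_levels` is illustrative.

```r
# Same scatterplot as before, colored by species
plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "Sepal Length vs. Sepal Width",
     xlab = "Sepal Length", ylab = "Sepal Width",
     col = iris$Species)

# Legend: one entry per factor level, colored by the matching code
species_levels <- levels(iris$Species)
legend("topright",
       legend = species_levels,
       col = seq_along(species_levels),
       pch = 1,
       title = "Species")
```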
Let’s create a histogram to see the distribution of Petal Length. We will use R’s hist() function. Inside the function, we specify the data attribute, the plot title, the axis labels, and the bar color.
R
hist(iris$Petal.Length,
main = "Histogram of Petal Length",
xlab = "Petal Length",
ylab = "Frequency",
col = "lightblue")
Output:
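The numbers behind a histogram can also be inspected directly. As a rough sketch, calling hist() with `plot = FALSE` returns the bin boundaries and counts without drawing anything; the `breaks = 20` value is an arbitrary choice for illustration, and R treats it as a suggestion rather than an exact bin count.

```r
# Compute the histogram without plotting it
h <- hist(iris$Petal.Length, breaks = 20, plot = FALSE)

print(h$breaks)       # bin boundaries
print(h$counts)       # number of observations in each bin
print(sum(h$counts))  # every observation falls into exactly one bin
```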
Conclusion
The built-in datasets give beginners a convenient way to learn R programming and to try out different formulas and models. In this article, you have seen some of the popular built-in datasets available in R, and learned how to access a dataset and perform various analyses, operations, and visualizations on it.