Title: | Datasets and Supplemental Functions from Bayes Rules! Book |
---|---|
Description: | Provides datasets and functions used for analysis and visualizations in the Bayes Rules! book (<https://www.bayesrulesbook.com>). The package contains a set of functions that summarize and plot Bayesian models from some conjugate families and another set of functions for evaluation of some Bayesian models. |
Authors: | Mine Dogucu [aut, cre], Alicia Johnson [aut], Miles Ott [aut] |
Maintainer: | Mine Dogucu |
License: | GPL (>= 3) |
Version: | 0.0.3.9000 |
Built: | 2024-11-03 05:38:08 UTC |
Source: | https://github.com/bayes-rules/bayesrules |
The AirBnB data was collated by Trinh and Ameri as part of a course project at St Olaf College, and distributed with "Broadening Your Statistical Horizons" by Legler and Roback. This data set includes the prices and features for 1561 AirBnB listings in Chicago, collected in 2016.
airbnb
A data frame with 1561 rows and 12 variables. Each row represents a single AirBnB listing.
the nightly price of the listing (in USD)
the listing's average rating, on a scale from 1 to 5
number of user reviews the listing has
the type of listing (eg: Shared room)
number of guests the listing accommodates
the number of bedrooms the listing has
the minimum number of nights to stay in the listing
the neighborhood in which the listing is located
the broader district in which the listing is located
the neighborhood's rating for walkability (0 - 100)
the neighborhood's rating for access to public transit (0 - 100)
the neighborhood's rating for bikeability (0 - 100)
Ly Trinh and Pony Ameri (2018). Airbnb Price Determinants: A Multilevel Modeling Approach. Project for Statistics 316-Advanced Statistical Modeling, St. Olaf College. Julie Legler and Paul Roback (2019). Broadening Your Statistical Horizons: Generalized Linear Models and Multilevel Models. https://bookdown.org/roback/bookdown-bysh/. https://github.com/proback/BeyondMLR/blob/master/data/airbnb.csv/
The AirBnB data was collated by Trinh and Ameri as part of a course project at St Olaf College, and distributed with "Broadening Your Statistical Horizons" by Legler and Roback. This data set, a subset of the airbnb data in the bayesrules package, includes the prices and features for 869 AirBnB listings in Chicago, collected in 2016.
airbnb_small
A data frame with 869 rows and 12 variables. Each row represents a single AirBnB listing.
the nightly price of the listing (in USD)
the listing's average rating, on a scale from 1 to 5
number of user reviews the listing has
the type of listing (eg: Shared room)
number of guests the listing accommodates
the number of bedrooms the listing has
the minimum number of nights to stay in the listing
the neighborhood in which the listing is located
the broader district in which the listing is located
the neighborhood's rating for walkability (0 - 100)
the neighborhood's rating for access to public transit (0 - 100)
the neighborhood's rating for bikeability (0 - 100)
Ly Trinh and Pony Ameri (2018). Airbnb Price Determinants: A Multilevel Modeling Approach. Project for Statistics 316-Advanced Statistical Modeling, St. Olaf College. Julie Legler and Paul Roback (2019). Broadening Your Statistical Horizons: Generalized Linear Models and Multilevel Models. https://bookdown.org/roback/bookdown-bysh/. https://github.com/proback/BeyondMLR/blob/master/data/airbnb.csv/
Bald Eagle count data collected from the year 1981 to 2017, in late December, by birdwatchers in the Ontario, Canada area. The data was made available by the Bird Studies Canada website and distributed through the R for Data Science TidyTuesday project. A more complete data set with a larger selection of birds can be found in the bird_counts data in the bayesrules package.
bald_eagles
A data frame with 37 rows and 5 variables. Each row represents Bald Eagle observations in the given year.
year of data collection
number of birds observed
total person-hours of observation period
count divided by hours
count_per_hour multiplied by 168 hours per week
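The two derived rate variables follow directly from the definitions above. The sketch below reproduces the arithmetic on made-up numbers; the values are illustrative rather than taken from bald_eagles, and the column names count and hours are assumed from the descriptions.

```r
# Hypothetical yearly tallies (illustrative values, not from bald_eagles)
toy <- data.frame(
  year  = c(2015, 2016, 2017),
  count = c(4, 6, 9),
  hours = c(200, 240, 300)
)

# Birds observed per person-hour of observation
toy$count_per_hour <- toy$count / toy$hours

# Rescale to a full week of observation (7 days * 24 hours = 168 hours)
toy$count_per_week <- toy$count_per_hour * 168

toy
```

Dividing by hours puts years with different observer effort on a common scale before they are compared.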
The WNBA Basketball Data was scraped from https://www.basketball-reference.com/wnba/players/ and contains information on basketball players from the 2019 season.
basketball
A data frame with 146 rows and 30 variables. Each row represents a single WNBA basketball player. The variables on each player are as follows.
first and last name
height in inches
weight in pounds
year of the WNBA season
team that the WNBA player is a member of
age in years
number of games played by the player in that season
number of games the player started in that season
average number of minutes played per game
average number of field goals per game played
average number of field goals attempted per game played
percent of field goals made throughout the season
average number of three pointers per game played
average number of three pointers attempted per game played
percent of three pointers made throughout the season
average number of two pointers made per game played
average number of two pointers attempted per game played
percent of two pointers made throughout the season
average number of free throws made per game played
average number of free throws attempted per game played
percent of free throws made throughout the season
average number of offensive rebounds per game played
average number of defensive rebounds per game played
average number of rebounds (both offensive and defensive) per game played
average number of assists per game played
average number of steals per game played
average number of blocks per game played
average number of turnovers per game played
average number of personal fouls per game played. Note: after 5 fouls the player is not allowed to play in that game anymore
average number of points made per game played
total number of minutes played throughout the season
whether or not the player started in more than half of the games they played
https://www.basketball-reference.com/
A dataset containing data behind the story "The Dollar-And-Cents Case Against Hollywood's Exclusion of Women" https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/.
bechdel
A data frame with 1794 rows and 3 variables:
The release year of the movie
The title of the movie
Bechdel test result (PASS, FAIL)
<https://github.com/fivethirtyeight/data/tree/master/bechdel/>
Data on the effectiveness of a digital learning program designed by the Abdul Latif Jameel Poverty Action Lab (J-PAL) to address disparities in vocabulary levels among children from households with different income levels.
big_word_club
A data frame with 818 student-level observations and 31 variables:
unique student id
control group (0) or treatment group (1)
age in months
whether student identifies as female
grade level, pre-school (0) or kindergarten (1)
unique teacher id
unique school id
whether school is private
whether school has Title 1 status
percent of school that receive free / reduced lunch
school location
whether student has ESL status
whether student has special education status
whether student enrolled after program began
student's distraction level during assessment 1 (0 = not distracted; 1 = mildly distracted; 2 = moderately distracted; 3 = extremely distracted)
same as distracted_a1 but during assessment 2
same as distracted_a1 but during standardized assessment
student score on BWC assessment 1
whether student's score on assessment 1 was invalid
student score on BWC assessment 2
whether student's score on assessment 2 was invalid
student score on standardized assessment
score_ppvt adjusted for age
whether student's score on standardized assessment was invalid
number of teacher logins onto BWC system in April
number of teacher logins onto BWC system during entire study
number of weeks of the BWC program that the classroom has completed
teacher response to the number of words students had learned through BWC (0 = almost none; 1 = 1 to 5; 2 = 6 to 10)
teacher response to the number of their students that have families that experience financial struggle
teacher response to frequency that student misbehavior interferes with teaching (0 = never; 1 = rarely; 2 = occasionally; 3 = frequently)
teacher's number of years of teaching experience
percent change in scores before and after the program
These data correspond to the following study: Ariel Kalil, Susan Mayer, Philip Oreopoulos (2020). Closing the word gap with Big Word Club: Evaluating the Impact of a Tech-Based Early Childhood Vocabulary Program. Data was obtained through the Inter-university Consortium for Political and Social Research (ICPSR) https://www.openicpsr.org/openicpsr/project/117330/version/V1/view/.
Data on ridership among registered members and casual users of the Capital Bikeshare service in Washington, D.C.
bike_users
A data frame with 534 daily observations, 267 each for registered riders and casual riders, and 13 variables:
date of observation
fall, spring, summer, or winter
the year of the date
the month of the date
the day of the week
whether or not the date falls on a weekend (TRUE or FALSE)
whether or not the date falls on a holiday (yes or no)
raw temperature (degrees Fahrenheit)
what the temperature feels like (degrees Fahrenheit)
humidity level (percentage)
wind speed (miles per hour)
weather category (categ1 = pleasant, categ2 = moderate, categ3 = severe)
rider type (casual or registered)
number of bikeshare rides
Fanaee-T, Hadi and Gama, Joao (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence. https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset/
Data on ridership among registered members of the Capital Bikeshare service in Washington, D.C.
bikes
A data frame with 500 daily observations and 13 variables:
date of observation
fall, spring, summer, or winter
the year of the date
the month of the date
the day of the week
whether or not the date falls on a weekend (TRUE or FALSE)
whether or not the date falls on a holiday (yes or no)
raw temperature (degrees Fahrenheit)
what the temperature feels like (degrees Fahrenheit)
humidity level (percentage)
wind speed (miles per hour)
weather category (categ1 = pleasant, categ2 = moderate, categ3 = severe)
number of bikeshare rides
Fanaee-T, Hadi and Gama, Joao (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence. https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Bird count data collected between the years 1921 and 2017, in late December, by birdwatchers in the Ontario, Canada area. The data was made available by the Bird Studies Canada website and distributed through the R for Data Science TidyTuesday project.
bird_counts
A data frame with 18706 rows and 7 variables. Each row represents observations for the given bird species in the given year.
year of data collection
scientific name of observed bird species
latin name of observed bird species
number of birds observed
total person-hours of observation period
count divided by hours
count_per_hour multiplied by 168 hours per week
https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-06-18/bird_counts.csv/.
The book banning data was collected by Fast and Hegland as part of a course project at St Olaf College, and distributed with "Broadening Your Statistical Horizons" by Legler and Roback. This data set includes the features and outcomes for 931 book challenges (i.e., requests to ban a book) made in the US between 2000 and 2010. Information on the books being challenged and the characteristics of these books was collected from the American Library Society. State-level demographic information and political leanings were obtained from the US Census Bureau and Cook Political Report, respectively. Because the state of Texas had an outlying large number of challenges, book challenges made there were omitted.
book_banning
A data frame with 931 rows and 17 variables. Each row represents a single book challenge within the given state and date.
title of book being challenged
identifier for the book
author of the book
date of the challenge
year of the challenge
whether or not the challenge was successful (the book was removed)
whether the book was challenged for sexually explicit material
whether the book was challenged for anti-family material
whether the book was challenged for occult material
whether the book was challenged for inappropriate language
whether the book was challenged for LGBTQ material
whether the book was challenged for violent material
US state in which the challenge was made
Political Value Index of the state (negative = leans Republican, 0 = neutral, positive = leans Democrat)
median income in the state, relative to the average state median income
high school graduation rate, in percent, relative to the average state high school graduation rate
college graduation rate, in percent, relative to the average state college graduation rate
Shannon Fast and Thomas Hegland (2011). Book Challenges: A Statistical Examination. Project for Statistics 316-Advanced Statistical Modeling, St. Olaf College. Julie Legler and Paul Roback (2019). Broadening Your Statistical Horizons: Generalized Linear Models and Multilevel Models. https://bookdown.org/roback/bookdown-bysh/. https://github.com/proback/BeyondMLR/blob/master/data/bookbanningNoTex.csv/
A sub-sample of outcomes for the annual Cherry Blossom Ten Mile race in Washington, D.C. This sub-sample was taken from the complete Cherry data in the mdsr package.
cherry_blossom_sample
A data frame with 252 Cherry Blossom outcomes and 7 variables:
a unique identifier for the runner
age of the runner
time to complete the race, from starting line to finish line (minutes)
time between the official start of the race and the finish line (minutes)
year of the race
the number of previous years in which the subject ran in the race
Data in the original Cherry data set were obtained from https://www.cherryblossom.org/post-race/race-results/.
Given a set of observed data including a binary response variable y and an rstanreg model of y, this function returns summaries of the model's posterior classification quality. These summaries include a confusion matrix as well as estimates of the model's sensitivity, specificity, and overall accuracy.
classification_summary(model, data, cutoff = 0.5)
model |
an rstanreg model object with binary y |
data |
data frame including the variables in the model, both response y and predictors x |
cutoff |
probability cutoff to classify a new case as positive (0.5 is the default) |
a list
x <- rnorm(20)
z <- 3*x
prob <- 1/(1+exp(-z))
y <- rbinom(20, 1, prob)
example_data <- data.frame(x = x, y = y)
example_model <- rstanarm::stan_glm(y ~ x, data = example_data, family = binomial)
classification_summary(model = example_model, data = example_data, cutoff = 0.5)
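To make the reported quantities concrete, the sketch below builds a confusion matrix and the three accuracy rates by hand in base R. It is an illustration of the definitions, not the package's implementation: the probabilities are hypothetical fixed values standing in for model-based predictions, and classifying as positive when the probability is at least the cutoff is an assumption.

```r
# Toy observed outcomes and predicted probabilities (hypothetical values)
y      <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)
cutoff <- 0.5

# Classify a case as positive when its probability reaches the cutoff
pred <- as.integer(prob >= cutoff)

# Confusion matrix: rows are observed y, columns are classifications
confusion <- table(
  y              = factor(y, levels = c(0, 1)),
  classification = factor(pred, levels = c(0, 1))
)

tp <- confusion["1", "1"]  # true positives
tn <- confusion["0", "0"]  # true negatives
fp <- confusion["0", "1"]  # false positives
fn <- confusion["1", "0"]  # false negatives

accuracy_rates <- c(
  sensitivity      = tp / (tp + fn),           # true positive rate
  specificity      = tn / (tn + fp),           # true negative rate
  overall_accuracy = (tp + tn) / sum(confusion)
)

confusion
accuracy_rates
```

Raising the cutoff trades sensitivity for specificity, which is why the cutoff is exposed as an argument.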
Given a set of observed data including a binary response variable y and an rstanreg model of y, this function returns cross validated estimates of the model's posterior classification quality: sensitivity, specificity, and overall accuracy. For hierarchical models of class lmerMod, the folds are composed of collections of groups, not individual observations.
classification_summary_cv(model, data, group, k, cutoff = 0.5)
model |
an rstanreg model object with binary y |
data |
data frame including the variables in the model, both response y (0 or 1) and predictors x |
group |
a character string representing the name of the factor grouping variable, i.e., the random effect (only used for hierarchical models) |
k |
the number of folds to use for cross validation |
cutoff |
probability cutoff to classify a new case as positive |
a list
x <- rnorm(20)
z <- 3*x
prob <- 1/(1+exp(-z))
y <- rbinom(20, 1, prob)
example_data <- data.frame(x = x, y = y)
example_model <- rstanarm::stan_glm(y ~ x, data = example_data, family = binomial)
classification_summary_cv(model = example_model, data = example_data, k = 2, cutoff = 0.5)
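The note about hierarchical models means that whole groups, rather than single rows, are assigned to folds. The base R sketch below shows one way such an assignment could be done; the data, the seed, and the helper names (toy, fold_of) are hypothetical, and the package may split folds differently.

```r
set.seed(84735)

# Hypothetical grouped data: 6 groups with 4 observations each
toy <- data.frame(
  group = rep(paste0("g", 1:6), each = 4),
  y     = rbinom(24, size = 1, prob = 0.5)
)
k <- 3

# Shuffle the group labels, then deal them into k folds in turn
groups  <- sample(unique(as.character(toy$group)))
fold_of <- setNames(rep_len(seq_len(k), length(groups)), groups)

# Every observation inherits the fold of its group
toy$fold <- unname(fold_of[as.character(toy$group)])

# Each group appears in exactly one fold
table(toy$group, toy$fold)
```

Keeping a group's rows together means each held-out fold contains groups the model did not see during fitting.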
A sub-sample of the Himalayan Database distributed through the R for Data Science TidyTuesday project. This dataset includes information on the results and conditions for various Himalayan climbing expeditions. Each row corresponds to a single member of a climbing expedition team.
climbers_sub
A data frame with 2076 observations (1 per climber) and 22 variables:
unique expedition identifier
unique climber identifier
unique identifier of the expedition's destination peak
name of the expedition's destination peak
year of expedition
season of expedition (Autumn, Spring, Summer, Winter)
climber gender identity which the database oversimplifies to a binary category
climber age
climber citizenship
climber's role in the expedition (eg: Co-Leader)
whether the climber was a hired member of the expedition
the destination peak's highpoint (metres)
whether the climber successfully reached the destination
whether the climber was on a solo expedition
whether the climber utilized supplemental oxygen
whether the climber died during the expedition
whether the climber was injured on the expedition
number of climbers in the expedition
height of the peak in meters
the year of the first recorded summit of the peak (though not necessarily the actual first summit!)
Original source: https://www.himalayandatabase.com/. Complete dataset distributed by: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-22/.
A sub-set of data on coffee bean ratings / quality originally collected by James LeDoux (jmzledoux) and distributed through the R for Data Science TidyTuesday project.
coffee_ratings
A data frame with 1339 batches of coffee beans and 27 variables on each batch.
farm owner
farm where beans were grown
country where farm is
where beans were processed
country of coffee partner
lowest altitude of the farm
highest altitude of the farm
average altitude of the farm
number of bags tested
weight of each tested bag
bean species
bean variety
how beans were processed
bean aroma grade
bean flavor grade
bean aftertaste grade
bean acidity grade
bean body grade
bean balance grade
bean uniformity grade
bean clean cup grade
bean sweetness grade
bean moisture grade
count of category one defects
count of category two defects
bean color
total bean rating (0 – 100)
A sub-set of data on coffee bean ratings / quality originally collected by James LeDoux (jmzledoux) and distributed through the R for Data Science TidyTuesday project. This is a simplified version of the coffee_ratings data.
coffee_ratings_small
A data frame with 636 batches of coffee beans and 11 variables on each batch.
farm where beans were grown
total bean rating (0 – 100)
bean aroma grade
bean flavor grade
bean aftertaste grade
bean acidity grade
bean body grade
bean balance grade
bean uniformity grade
bean sweetness grade
bean moisture grade
Data on the number of LGBTQ+ equality laws (as of 2019) and demographics in each U.S. state.
equality_index
A data frame with 50 observations, one per state, and 6 variables:
state name
region in which the state falls
percent of the 2016 presidential election vote earned by the Republican ("GOP") candidate
number of LGBTQ+ rights laws (as of 2019)
political leaning of the state over time (gop = Republican, dem = Democrat, swing = swing state)
percent of state's residents that live in urban areas (by the 2010 census)
Data on LGBTQ+ laws were obtained from Warbelow, Sarah, Courtnay Avant, and Colin Kutney (2020). 2019 State Equality Index. Washington, DC. Human Rights Campaign Foundation. https://assets2.hrc.org/files/assets/resources/HRC-SEI-2019-Report.pdf?_ga=2.148925686.1325740687.1594310864-1928808113.1594310864&_gac=1.213124768.1594312278.EAIaIQobChMI9dP2hMzA6gIVkcDACh21GgLEEAAYASAAEgJiJvD_BwE/. Data on urban residency obtained from https://www.icip.iastate.edu/tables/population/urban-pct-states/.
A dataset containing data behind the study "FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media" https://arxiv.org/abs/1809.01286. The news articles in this dataset were posted to Facebook in September 2016, in the run-up to the U.S. presidential election.
fake_news
A data frame with 150 rows and 30 variables:
The title of the news article
Text of the article
Hyperlink for the article
Authors of the article
Binary variable indicating whether the article presents fake or real news (fake, real)
Number of words in the title
Number of words in the text
Number of characters in the title
Number of characters in the text
Number of words that are all capital letters in the title
Number of words that are all capital letters in the text
Percent of words that are all capital letters in the title
Percent of words that are all capital letters in the text
Number of characters that are exclamation marks in the title
Number of characters that are exclamation marks in the text
Percent of characters that are exclamation marks in the title
Percent of characters that are exclamation marks in the text
Binary variable indicating whether the title of the article includes an exclamation point or not (TRUE, FALSE)
Percent of words that are associated with anger
Percent of words that are associated with anticipation
Percent of words that are associated with disgust
Percent of words that are associated with fear
Percent of words that are associated with joy
Percent of words that are associated with sadness
Percent of words that are associated with surprise
Percent of words that are associated with trust
Percent of words that have negative sentiment
Percent of words that have positive sentiment
Number of syllables in text
Number of syllables per word in text
Shu, K., Mahudeswaran, D., Wang, S., Lee, D. and Liu, H. (2018) FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media
Brain measurements for football and non-football players as provided in the Lock5 package
football
A data frame with 75 observations and 5 variables:
control = no football, fb_no_concuss = football player but no concussions, fb_concuss = football player with concussion history
Number of years a person played football
Total hippocampus volume, in cubic centimeters
Singh R, Meier T, Kuplicki R, Savitz J, et al., "Relationship of Collegiate Football Experience and Concussion With Hippocampal Volume and Cognitive Outcome," JAMA, 311(18), 2014
A random subset of the data on hotel bookings originally collected by Antonio, Almeida and Nunes (2019) and distributed through the R for Data Science TidyTuesday project.
hotel_bookings
A data frame with 1000 hotel bookings and 32 variables on each booking.
"Resort Hotel" or "City Hotel"
whether the booking was cancelled
number of days between booking and arrival
year of scheduled arrival
month of scheduled arrival
week of scheduled arrival
day of month of scheduled arrival
number of reserved weekend nights
number of reserved week nights
number of adults in booking
number of children
number of babies
whether the booking includes breakfast (BB = bed & breakfast), breakfast and dinner (HB = half board), or breakfast, lunch, and dinner (FB = full board)
guest's country of origin
market segment designation (eg: TA = travel agent, TO = tour operator)
booking distribution channel (eg: TA = travel agent, TO = tour operator)
whether or not booking was made by a repeated guest
guest's number of previous booking cancellations
guest's number of previous bookings that weren't cancelled
code for type of room reserved by guest
code for type of room assigned by hotel
number of changes made to the booking
No Deposit, Non Refund, Refundable
booking travel agency
booking company
number of days the guest waited for booking confirmation
Contract, Group, Transient, Transient-party (a transient booking tied to another transient booking)
average hotel cost per day
number of parking spaces the guest needed
number of guest special requests
Canceled, Check-Out, No-Show
when the guest cancelled or checked out
Nuno Antonio, Ana de Almeida, and Luis Nunes (2019). "Hotel booking demand datasets." Data in Brief (22): 41-49. https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/hotels.csv/.
Loon count data collected from the year 2000 to 2017, in late December, by birdwatchers in the Ontario, Canada area. The data was made available by the Bird Studies Canada website and distributed through the R for Data Science TidyTuesday project. A more complete data set with a larger selection of birds can be found in the bird_counts data in the bayesrules package.
loons
A data frame with 18 rows and 5 variables. Each row represents loon observations in the given year.
year of data collection
number of loons observed
total person-hours of observation period
count divided by hours
count_per_hour multiplied by 100 hours
https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-06-18/bird_counts.csv.
The Museum of Modern Art data includes information about the individual artists included in the collection of the Museum of Modern Art in New York City. It does not include information about works for artist collectives or companies. The data was made available by MoMA itself and downloaded in December 2020.
moma
A data frame with 10964 rows and 11 variables. Each row represents an individual artist in the MoMA collection.
name
country of origin
year of birth
year of death
whether or not the artist was living at the time of data collection (December 2020)
whether or not the artist is Gen X or younger, ie. born during 1965 or after
gender identity (as perceived by MoMA employees)
MoMA department in which the artist's works most frequently appear
number of the artist's works in the MoMA collection
first year MoMA acquired one of the artist's works
most recent year MoMA acquired one of the artist's works
https://github.com/MuseumofModernArt/collection/blob/master/Artworks.csv/.
A random sample of 100 artists represented in the Museum of Modern Art in New York City. The data was made available by MoMA itself and downloaded in December 2020. It does not include information about artist collectives or companies.
moma_sample
A data frame with 100 rows and 10 variables. Each row represents an individual artist in the MoMA collection.
name
country of origin
year of birth
year of death
whether or not the artist was living at the time of data collection (December 2020)
whether or not the artist is Gen X or younger, ie. born during 1965 or after
gender identity (as perceived by MoMA employees)
number of the artist's works in the MoMA collection
first year MoMA acquired one of the artist's works
most recent year MoMA acquired one of the artist's works
https://github.com/MuseumofModernArt/collection/blob/master/Artworks.csv/.
Given a set of observed data including a categorical response variable y and a naiveBayes model of y, this function returns summaries of the model's posterior classification quality. These summaries include a confusion matrix as well as an estimate of the model's overall accuracy.
naive_classification_summary(model, data, y)
model |
a naiveBayes model object with categorical y |
data |
data frame including the variables in the model |
y |
a character string indicating the y variable in data |
a list
data(penguins_bayes, package = "bayesrules")
example_model <- e1071::naiveBayes(species ~ bill_length_mm, data = penguins_bayes)
naive_classification_summary(model = example_model, data = penguins_bayes, y = "species")
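With more than two categories, the confusion matrix becomes a square table and overall accuracy is the share of cases on its diagonal. The sketch below shows that calculation in base R on hypothetical labels; it illustrates the definition and is not the package's code.

```r
# Hypothetical observed and predicted species labels
observed  <- c("Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Chinstrap")
predicted <- c("Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Adelie", "Chinstrap")
species_levels <- c("Adelie", "Chinstrap", "Gentoo")

# Rows are observed categories, columns are predicted categories
confusion <- table(
  observed  = factor(observed, levels = species_levels),
  predicted = factor(predicted, levels = species_levels)
)

# Overall accuracy: proportion of cases that fall on the diagonal
overall_accuracy <- sum(diag(confusion)) / sum(confusion)

confusion
overall_accuracy
```

Fixing the factor levels on both sides keeps the table square even when a category is never predicted.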
Given a set of observed data including a categorical response variable y and a naiveBayes model of y, this function returns a cross validated confusion matrix by which to assess the model's posterior classification quality.
naive_classification_summary_cv(model, data, y, k = 10)
model |
a naiveBayes model object with categorical y |
data |
data frame including the variables in the model |
y |
a character string indicating the y variable in data |
k |
the number of folds to use for cross validation |
a list
data(penguins_bayes, package = "bayesrules")
example_model <- e1071::naiveBayes(species ~ bill_length_mm, data = penguins_bayes)
naive_classification_summary_cv(model = example_model, data = penguins_bayes, y = "species", k = 2)
Data on penguins in the Palmer Archipelago, originally collected by Gorman et al. (2014) and distributed through the penguins data in the palmerpenguins package. In addition to the original penguins data, this data set includes a variable above_average_weight.
penguins_bayes
A data frame with 344 penguins and 9 variables on each.
species (Adelie, Chinstrap, Gentoo)
home island (Biscoe, Dream, Torgersen)
year of observation
length of bill (mm)
depth of bill (mm)
length of flipper (mm)
body mass (g)
whether or not the body mass exceeds 4200g (TRUE or FALSE)
male or female
Gorman KB, Williams TD, and Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of antarctic penguins (Genus Pygoscelis). PLoS ONE, 9(3).
Plots the probability density function (pdf) for a Beta(alpha, beta) model of variable π.
plot_beta(alpha, beta, mean = FALSE, mode = FALSE)
alpha, beta |
positive shape parameters of the Beta model |
mean, mode |
logical values indicating whether to display the model mean and mode, respectively |
A density plot for the Beta model.
plot_beta(alpha = 1, beta = 12, mean = TRUE, mode = TRUE)
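The mean and mode that the plot can mark have closed forms for the Beta model. The base R sketch below computes them for the example's parameters and draws a comparable density with base graphics; it is an approximation of what the function displays, not its ggplot code.

```r
alpha <- 1
beta  <- 12

# Mean of a Beta(alpha, beta) model
beta_mean <- alpha / (alpha + beta)

# Mode: (alpha - 1) / (alpha + beta - 2) holds when both shape parameters
# exceed 1. It also gives 0 here (alpha = 1, beta > 1), where the density
# decreases over the whole unit interval.
beta_mode <- (alpha - 1) / (alpha + beta - 2)

# Base-graphics density curve with the mean (solid) and mode (dashed) marked
curve(dbeta(x, alpha, beta), from = 0, to = 1, xlab = "pi", ylab = "f(pi)")
abline(v = c(beta_mean, beta_mode), lty = c(1, 2))

c(mean = beta_mean, mode = beta_mode)
```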
Consider a Beta-Binomial Bayesian model for parameter π with a Beta(alpha, beta) prior on π and Binomial likelihood with n trials and y successes. Given information on the prior (alpha and beta) and data (y and n), this function produces a plot of any combination of the corresponding prior pdf, scaled likelihood function, and posterior pdf. All three are included by default.
plot_beta_binomial(alpha, beta, y = NULL, n = NULL, prior = TRUE, likelihood = TRUE, posterior = TRUE)
alpha, beta |
positive shape parameters of the prior Beta model |
y |
observed number of successes |
n |
observed number of trials |
prior |
a logical value indicating whether the prior model should be plotted |
likelihood |
a logical value indicating whether the scaled likelihood should be plotted |
posterior |
a logical value indicating whether the posterior model should be plotted |
a ggplot
plot_beta_binomial(alpha = 1, beta = 13, y = 25, n = 50) plot_beta_binomial(alpha = 1, beta = 13, y = 25, n = 50, posterior = FALSE)
plot_beta_binomial(alpha = 1, beta = 13, y = 25, n = 50) plot_beta_binomial(alpha = 1, beta = 13, y = 25, n = 50, posterior = FALSE)
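The curves in such a plot follow from the standard conjugate Beta-Binomial update. The base R sketch below is not the package source; the grid size and the way the likelihood is rescaled (normalized to integrate to 1 over π) are assumptions made for illustration.

```r
# Conjugate Beta-Binomial update (standard result, not the package's own code)
alpha <- 1; beta <- 13   # prior Beta(alpha, beta)
y <- 25; n <- 50         # observed successes and trials

# Posterior is Beta(alpha + y, beta + n - y)
post_alpha <- alpha + y
post_beta  <- beta + n - y

# Evaluate the three curves on a grid of pi values (grid size is arbitrary)
pi_grid       <- seq(0, 1, length.out = 501)
prior_pdf     <- dbeta(pi_grid, alpha, beta)
posterior_pdf <- dbeta(pi_grid, post_alpha, post_beta)

# As a function of pi, the Binomial likelihood is proportional to a
# Beta(y + 1, n - y + 1) density; this is one way to scale it (assumption)
scaled_lik <- dbeta(pi_grid, y + 1, n - y + 1)

c(post_alpha = post_alpha, post_beta = post_beta)
```

The posterior mean, `post_alpha / (post_alpha + post_beta)`, falls between the prior mean and the observed proportion `y / n`.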
Plots the probability density function (pdf) for a Beta(alpha, beta) model of variable π, with markings indicating a credible interval for π.
plot_beta_ci(alpha, beta, ci_level = 0.95)
alpha, beta | positive shape parameters of the Beta model |
ci_level | credible interval level |
A density plot for the Beta model
plot_beta_ci(alpha = 7, beta = 12, ci_level = 0.80)
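The interval endpoints can be computed directly from Beta quantiles. This sketch assumes an equal-tailed (middle) interval; whether the package marks exactly this interval is an assumption.

```r
alpha <- 7; beta <- 12
ci_level <- 0.80

# Equal-tailed interval: cut (1 - ci_level) / 2 probability from each tail
tail_prob <- (1 - ci_level) / 2
ci <- qbeta(c(tail_prob, 1 - tail_prob), alpha, beta)
ci
```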
Plots the Binomial likelihood function for variable π given y observed successes in a series of n Binomial trials.
plot_binomial_likelihood(y, n, mle = FALSE)
y | number of successes |
n | number of trials |
mle | a logical value indicating whether the maximum likelihood estimate of π should be plotted |
a ggplot
plot_binomial_likelihood(y = 3, n = 10, mle = TRUE)
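A minimal base R sketch of the quantity being plotted: the Binomial likelihood evaluated over a grid of π values, together with the maximum likelihood estimate y / n. The grid resolution is an arbitrary choice for this illustration.

```r
y <- 3; n <- 10

# Likelihood of each candidate pi given y successes in n trials
pi_grid    <- seq(0, 1, length.out = 101)
likelihood <- dbinom(y, size = n, prob = pi_grid)

# The likelihood is maximized at the observed proportion
mle <- y / n
pi_grid[which.max(likelihood)]
```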
Plots the probability density function (pdf) for a Gamma(shape, rate) model of variable λ.
plot_gamma(shape, rate, mean = FALSE, mode = FALSE)
shape | positive shape parameter of the Gamma model |
rate | positive rate parameter of the Gamma model |
mean, mode | logical values indicating whether to display the model mean and mode |
A density plot for the Gamma model.
plot_gamma(shape = 2, rate = 11, mean = TRUE, mode = TRUE)
Consider a Gamma-Poisson Bayesian model for rate parameter λ with a Gamma(shape, rate) prior on λ and a Poisson likelihood for the data. Given information on the prior (shape and rate) and data (the sample size n and sum_y), this function produces a plot of any combination of the corresponding prior pdf, scaled likelihood function, and posterior pdf. All three are included by default.
plot_gamma_poisson(shape, rate, sum_y = NULL, n = NULL, prior = TRUE, likelihood = TRUE, posterior = TRUE)
shape | positive shape parameter of the Gamma prior |
rate | positive rate parameter of the Gamma prior |
sum_y | sum of observed data values for the Poisson likelihood |
n | number of observations for the Poisson likelihood |
prior | a logical value indicating whether the prior model should be plotted |
likelihood | a logical value indicating whether the scaled likelihood should be plotted |
posterior | a logical value indicating whether the posterior model should be plotted |
a ggplot
plot_gamma_poisson(shape = 100, rate = 20, sum_y = 39, n = 6)
plot_gamma_poisson(shape = 100, rate = 20, sum_y = 39, n = 6, posterior = FALSE)
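The posterior in this setting follows from the standard Gamma-Poisson conjugate update. The sketch below uses base R only and is not the package source; the grid range and the scaling of the likelihood are assumptions for illustration.

```r
# Conjugate Gamma-Poisson update (standard result)
shape <- 100; rate <- 20   # prior Gamma(shape, rate)
sum_y <- 39; n <- 6        # sum of counts and number of observations

# Posterior is Gamma(shape + sum_y, rate + n)
post_shape <- shape + sum_y
post_rate  <- rate + n

# Curves on a grid of lambda values (the range 0 to 10 is arbitrary)
lambda_grid   <- seq(0, 10, length.out = 1001)
prior_pdf     <- dgamma(lambda_grid, shape = shape, rate = rate)
posterior_pdf <- dgamma(lambda_grid, shape = post_shape, rate = post_rate)

# As a function of lambda, the Poisson likelihood is proportional to a
# Gamma(sum_y + 1, n) density; one possible scaling (assumption)
scaled_lik <- dgamma(lambda_grid, shape = sum_y + 1, rate = n)

c(post_shape = post_shape, post_rate = post_rate, post_mean = post_shape / post_rate)
```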
Plots the probability density function (pdf) for a Normal(mean, sd^2) model of variable μ.
plot_normal(mean, sd)
mean | mean parameter of the Normal model |
sd | standard deviation parameter of the Normal model |
a ggplot
plot_normal(mean = 3.5, sd = 0.5)
Plots the Normal likelihood function for variable μ given a vector of Normal data y.
plot_normal_likelihood(y, sigma = NULL)
y | vector of observed data |
sigma | optional value for the assumed standard deviation of y. By default, this is the sample standard deviation of y. |
a ggplot of Normal likelihood
plot_normal_likelihood(y = rnorm(50, mean = 10, sd = 2), sigma = 1.5)
Consider a Normal-Normal Bayesian model for mean parameter μ with a N(mean, sd^2) prior on μ and a Normal likelihood for the data. Given information on the prior (mean and sd) and data (the sample size n, mean y_bar, and standard deviation sigma), this function produces a plot of any combination of the corresponding prior pdf, scaled likelihood function, and posterior pdf. All three are included by default.
plot_normal_normal(mean, sd, sigma = NULL, y_bar = NULL, n = NULL, prior = TRUE, likelihood = TRUE, posterior = TRUE)
mean | mean of the Normal prior |
sd | standard deviation of the Normal prior |
sigma | standard deviation of the data, or likelihood standard deviation |
y_bar | sample mean of the data |
n | sample size of the data |
prior | a logical value indicating whether the prior model should be plotted |
likelihood | a logical value indicating whether the scaled likelihood should be plotted |
posterior | a logical value indicating whether the posterior model should be plotted |
a ggplot
plot_normal_normal(mean = 0, sd = 3, sigma = 4, y_bar = 5, n = 3)
plot_normal_normal(mean = 0, sd = 3, sigma = 4, y_bar = 5, n = 3, posterior = FALSE)
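For reference, the Normal-Normal posterior with known sigma has a closed form in which precisions (inverse variances) add. The base R sketch below illustrates that standard result; the variable names `prior_mean` and `prior_sd` are chosen here to avoid masking R's `mean()` and `sd()` and are not part of the package.

```r
prior_mean <- 0; prior_sd <- 3   # N(prior_mean, prior_sd^2) prior on mu
sigma <- 4; y_bar <- 5; n <- 3   # known data sd, sample mean, sample size

# Posterior precision is prior precision plus data precision
post_precision <- 1 / prior_sd^2 + n / sigma^2
post_var       <- 1 / post_precision

# Posterior mean is a precision-weighted average of prior mean and y_bar
post_mean <- post_var * (prior_mean / prior_sd^2 + n * y_bar / sigma^2)

c(post_mean = post_mean, post_sd = sqrt(post_var))
```

With these inputs the posterior mean is 135/43 (about 3.14), which lies between the prior mean of 0 and the sample mean of 5.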
Plots the Poisson likelihood function for variable λ given a vector of Poisson counts y.
plot_poisson_likelihood(y, lambda_upper_bound = 10)
y | vector of observed Poisson counts |
lambda_upper_bound | upper bound for lambda values to display on x-axis |
a ggplot of Poisson likelihood
plot_poisson_likelihood(y = c(4, 2, 7), lambda_upper_bound = 10)
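A rough base R sketch of the likelihood being plotted: for independent Poisson counts, the joint likelihood at each candidate λ is the product of the individual Poisson probabilities. The grid resolution is an assumption for this illustration.

```r
y <- c(4, 2, 7)
lambda_upper_bound <- 10

# Joint likelihood of the observed counts at each candidate lambda
lambda_grid <- seq(0, lambda_upper_bound, length.out = 1001)
likelihood  <- sapply(lambda_grid, function(l) prod(dpois(y, lambda = l)))

# The curve peaks near the sample mean of the counts
lambda_grid[which.max(likelihood)]
mean(y)
```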
Results of a volunteer survey on how people around the U.S. refer to fizzy cola drinks. The options are "pop", "soda", "coke", or "other".
pop_vs_soda
A data frame with 374250 observations, one per survey respondent, and 4 variables:
the U.S. state in which the respondent resides
region in which the state falls (as defined by the U.S. Census)
how the respondent refers to fizzy cola drinks
whether or not the respondent refers to fizzy cola drinks as "pop"
The survey responses were obtained at https://popvssoda.com/ which is maintained by Alan McConchie.
Given a set of observed data including a quantitative response variable y and an rstanreg model of y, this function returns 4 measures of the posterior prediction quality. Median absolute prediction error (mae) measures the typical difference between the observed y values and their posterior predictive medians (stable = TRUE) or means (stable = FALSE). Scaled mae (mae_scaled) measures the typical number of absolute deviations (stable = TRUE) or standard deviations (stable = FALSE) that observed y values fall from their predictive medians (stable = TRUE) or means (stable = FALSE). within_50 and within_90 report the proportion of observed y values that fall within their posterior prediction intervals, the probability levels of which are set by the user.
prediction_summary(model, data, prob_inner = 0.5, prob_outer = 0.95, stable = FALSE)
model | an rstanreg model object with quantitative y |
data | data frame including the variables in the model, both response y and predictors x |
prob_inner | probability level of the inner (narrower) posterior predictive interval (a value between 0 and 1) |
prob_outer | probability level of the outer (wider) posterior predictive interval (a value between 0 and 1) |
stable | if TRUE, errors are scaled by absolute deviations from the predictive medians; if FALSE, by standard deviations from the predictive means |
a tibble
example_data <- data.frame(x = sample(1:100, 20))
example_data$y <- example_data$x*3 + rnorm(20, 0, 5)
example_model <- rstanarm::stan_glm(y ~ x, data = example_data)
prediction_summary(example_model, example_data, prob_inner = 0.6, prob_outer = 0.80, stable = TRUE)
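To make the four measures concrete, the sketch below computes them by hand from a matrix of predictive draws, in the spirit of stable = FALSE. It is not the package's implementation: the draws are simulated stand-ins rather than output from a fitted model, the intervals are assumed to be equal-tailed, and the names `within_inner` and `within_outer` are hypothetical.

```r
set.seed(84735)

# Simulated data and hypothetical posterior predictive draws
# (a real analysis would take the draws from the fitted model)
x <- sample(1:100, 20)
y <- 3 * x + rnorm(20, 0, 5)
draws <- sapply(x, function(xi) rnorm(1000, mean = 3 * xi, sd = 5))  # 1000 x 20

prob_inner <- 0.5
prob_outer <- 0.95

# Center and scale of each observation's predictive distribution
pred_center <- colMeans(draws)
pred_scale  <- apply(draws, 2, sd)

# Equal-tailed predictive interval bounds (row 1 lower, row 2 upper)
interval_bounds <- function(prob) {
  apply(draws, 2, quantile, probs = c((1 - prob) / 2, 1 - (1 - prob) / 2))
}
inner_int <- interval_bounds(prob_inner)
outer_int <- interval_bounds(prob_outer)

abs_error <- abs(y - pred_center)
summary_stats <- data.frame(
  mae          = median(abs_error),
  mae_scaled   = median(abs_error / pred_scale),
  within_inner = mean(y >= inner_int[1, ] & y <= inner_int[2, ]),
  within_outer = mean(y >= outer_int[1, ] & y <= outer_int[2, ])
)
summary_stats
```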
Given a set of observed data including a quantitative response variable y and an rstanreg model of y, this function returns 4 cross-validated measures of the model's posterior prediction quality: Median absolute prediction error (mae) measures the typical difference between the observed y values and their posterior predictive medians (stable = TRUE) or means (stable = FALSE). Scaled mae (mae_scaled) measures the typical number of absolute deviations (stable = TRUE) or standard deviations (stable = FALSE) that observed y values fall from their predictive medians (stable = TRUE) or means (stable = FALSE). within_50 and within_90 report the proportion of observed y values that fall within their posterior prediction intervals, the probability levels of which are set by the user. For hierarchical models of class lmerMod, the folds are composed of collections of groups, not individual observations.
prediction_summary_cv(data, group, model, k, prob_inner = 0.5, prob_outer = 0.95)
data | data frame including the variables in the model, both response y and predictors x |
group | a character string representing the name of the factor grouping variable, i.e., the random effect (only used for hierarchical models) |
model | an rstanreg model object with quantitative y |
k | the number of folds to use for cross validation |
prob_inner | probability level of the inner (narrower) posterior predictive interval (a value between 0 and 1) |
prob_outer | probability level of the outer (wider) posterior predictive interval (a value between 0 and 1) |
list
example_data <- data.frame(x = sample(1:100, 20))
example_data$y <- example_data$x*3 + rnorm(20, 0, 5)
example_model <- rstanarm::stan_glm(y ~ x, data = example_data)
prediction_summary_cv(model = example_model, data = example_data, k = 2)
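The difference between observation-level and group-level folds can be sketched in base R as follows. This only illustrates the idea of assigning folds; the grouping variable `group` and its levels are hypothetical, and the package may assign folds differently.

```r
set.seed(84735)
n <- 20
k <- 2

# Observation-level folds: each row is assigned to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

# Group-level folds: whole groups are assigned to folds, so all rows
# from the same group land in the same fold (hypothetical grouping)
group      <- rep(c("a", "b", "c", "d"), each = 5)
groups     <- unique(group)
group_fold <- sample(rep(seq_len(k), length.out = length(groups)))
names(group_fold) <- groups
fold_id_grouped <- unname(group_fold[group])

table(fold_id)
table(group, fold_id_grouped)
```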
Cards Against Humanity's "Pulse of the Nation" project (https://thepulseofthenation.com/) conducted monthly polls into people's social and political views, as well as some silly things. This data includes responses to a subset of questions included in the poll conducted in September 2017.
pulse_of_the_nation
A data frame with observations on 1000 survey respondents with 15 variables:
income in $1000s
age in years
political party affiliation
approval level of Donald Trump's job performance
maximum education level completed
opinion of how likely their job is to be replaced by robots within 10 years
belief in climate change
the number of Transformers films the respondent has seen
opinion of whether scientists are generally honest and serve the public good
opinion of whether vaccines are safe and protect children from disease
number of books read in the past year
whether or not they believe in ghosts
respondent's estimate of the percentage of the federal budget that is spent on scientific research
belief about whether the earth is always farther away from the sun in winter than in summer (TRUE or FALSE)
whether the respondent would rather be wise but unhappy, or unwise but happy
https://thepulseofthenation.com/downloads/201709-CAH_PulseOfTheNation_Raw.csv
Calculate the sample mode of vector x.
sample_mode(x)
x |
vector of sample data |
sample mode
sample_mode(rbeta(100, 2, 7))
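For continuous data, one common way to estimate a sample mode is to locate the peak of a kernel density estimate. The sketch below shows that approach in base R; whether sample_mode() uses exactly this method, or these default bandwidth settings, is an assumption.

```r
set.seed(84735)
x <- rbeta(100, 2, 7)

# Kernel density estimate with R's default bandwidth
dens <- density(x)

# The estimated mode is the x value at which the density curve peaks
mode_est <- dens$x[which.max(dens$y)]
mode_est
```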
A sub-sample of the Spotify song data originally collected by Kaylin Pavlik (kaylinquest) and distributed through the R for Data Science TidyTuesday project.
spotify
A data frame with 350 songs (or tracks) and 23 variables:
unique song identifier
song name
song artist
song popularity from 0 (low) to 100 (high)
id of the album on which the song appears
name of the album on which the song appears
when the album was released
Spotify playlist on which the song appears
unique playlist identifier
genre of the playlist
subgenre of the playlist
a score from 0 (not danceable) to 100 (danceable) based on features such as tempo, rhythm, etc.
a score from 0 (low energy) to 100 (high energy) based on features such as loudness, timbre, entropy, etc.
song key
song loudness (dB)
0 (minor key) or 1 (major key)
a score from 0 (non-speechy tracks) to 100 (speechy tracks)
a score from 0 (not acoustic) to 100 (very acoustic)
a score from 0 (not instrumental) to 100 (very instrumental)
a score from 0 (no live audience presence on the song) to 100 (strong live audience presence on the song)
a score from 0 (the song is more negative, sad, angry) to 100 (the song is more positive, happy, euphoric)
song tempo (beats per minute)
song duration (ms)
https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/spotify_songs.csv/.
Summarizes the expected value, variance, and mode of a Beta(alpha, beta) model for variable π.
summarize_beta(alpha, beta)
alpha, beta | positive shape parameters of the Beta model |
a summary table
summarize_beta(alpha = 1, beta = 15)
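The summaries reported for a Beta model have simple closed forms. The helper below, `beta_summary()`, is a hypothetical base R function written for this illustration and is not part of the package; returning NA when there is no unique mode is a choice made for the sketch.

```r
beta_summary <- function(alpha, beta) {
  stopifnot(alpha > 0, beta > 0)
  total <- alpha + beta
  # The formula (alpha - 1) / (alpha + beta - 2) gives the mode when both
  # shape parameters are at least 1 and the model is not Uniform(0, 1)
  mode_val <- if (alpha >= 1 && beta >= 1 && total > 2) {
    (alpha - 1) / (total - 2)
  } else {
    NA_real_
  }
  var_val <- alpha * beta / (total^2 * (total + 1))
  data.frame(mean = alpha / total, mode = mode_val, var = var_val, sd = sqrt(var_val))
}

beta_summary(alpha = 1, beta = 15)
```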
Consider a Beta-Binomial Bayesian model for parameter π with a Beta(alpha, beta) prior on π and a Binomial likelihood with n trials and y successes. Given information on the prior (alpha and beta) and data (y and n), this function summarizes the mean, mode, and variance of the prior and posterior Beta models of π.
summarize_beta_binomial(alpha, beta, y = NULL, n = NULL)
alpha, beta | positive shape parameters of the prior Beta model |
y | number of successes |
n | number of trials |
a summary table
summarize_beta_binomial(alpha = 1, beta = 15, y = 25, n = 50)
Summarizes the expected value, variance, and mode of a Gamma(shape, rate) model for variable λ.
summarize_gamma(shape, rate)
shape | positive shape parameter of the Gamma model |
rate | positive rate parameter of the Gamma model |
a summary table
summarize_gamma(shape = 1, rate = 15)
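The Gamma summaries also have closed forms under the shape and rate parameterization. `gamma_summary()` below is a hypothetical helper for illustration only; reporting NA for the mode when shape is below 1 (where the density is unbounded near 0) is a choice made for this sketch.

```r
gamma_summary <- function(shape, rate) {
  stopifnot(shape > 0, rate > 0)
  # The mode is (shape - 1) / rate when shape is at least 1
  mode_val <- if (shape >= 1) (shape - 1) / rate else NA_real_
  data.frame(
    mean = shape / rate,
    mode = mode_val,
    var  = shape / rate^2,
    sd   = sqrt(shape) / rate
  )
}

gamma_summary(shape = 1, rate = 15)
```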
Consider a Gamma-Poisson Bayesian model for rate parameter λ with a Gamma(shape, rate) prior on λ and a Poisson likelihood for the data. Given information on the prior (shape and rate) and data (the sample size n and sum_y), this function summarizes the mean, mode, and variance of the prior and posterior Gamma models of λ.
summarize_gamma_poisson(shape, rate, sum_y = NULL, n = NULL)
shape | positive shape parameter of the Gamma prior |
rate | positive rate parameter of the Gamma prior |
sum_y | sum of observed data values for the Poisson likelihood |
n | number of observations for the Poisson likelihood |
data frame
summarize_gamma_poisson(shape = 3, rate = 4, sum_y = 7, n = 12)
Consider a Normal-Normal Bayesian model for mean parameter μ with a N(mean, sd^2) prior on μ and a Normal likelihood for the data. Given information on the prior (mean and sd) and data (the sample size n, mean y_bar, and standard deviation sigma), this function summarizes the mean, mode, and variance of the prior and posterior Normal models of μ.
summarize_normal_normal(mean, sd, sigma = NULL, y_bar = NULL, n = NULL)
mean | mean of the Normal prior |
sd | standard deviation of the Normal prior |
sigma | standard deviation of the data, or likelihood standard deviation |
y_bar | sample mean of the data |
n | sample size of the data |
data frame
summarize_normal_normal(mean = 2.3, sd = 0.3, sigma = 5.1, y_bar = 128.5, n = 20)
Voice pitch data collected by Winter and Grawunder (2012). In an experiment, subjects participated in role-playing dialog under various conditions, while researchers monitored voice pitch (Hz). The conditions spanned different scenarios (e.g., making an appointment, asking for a favor) and different attitudes to use in the scenario (polite or informal).
voices
A data frame with 84 rows and 4 variables. Each row represents a single observation for the given subject.
subject identifier
context of the dialog (encoded as A, B, ..., G)
whether the attitude to use in dialog was polite or informal
average voice pitch (Hz)
Winter, B., & Grawunder, S. (2012). The Phonetic Profile of Korean Formal and Informal Speech Registers. Journal of Phonetics, 40, 808-815. https://bodo-winter.net/data_and_scripts/POP.csv. https://bodo-winter.net/tutorial/bw_LME_tutorial2.pdf.
A sub-sample of daily weather information from the weatherAUS data in the rattle package for three Australian cities: Wollongong, Hobart, and Uluru.
weather_australia
A data frame with 300 daily observations and 22 variables from 3 Australian weather stations:
one of three weather stations
minimum temperature (degrees Celsius)
maximum temperature (degrees Celsius)
rainfall (mm)
direction of strongest wind gust
speed of strongest wind gust (km/h)
direction of wind gust at 9am
direction of wind gust at 3pm
wind speed at 9am (km/h)
wind speed at 3pm (km/h)
humidity level at 9am (percent)
humidity level at 3pm (percent)
atmospheric pressure at 9am (hPa)
atmospheric pressure at 3pm (hPa)
temperature at 9am (degrees Celsius)
temperature at 3pm (degrees Celsius)
whether or not it rained today (Yes or No)
the amount of rain today (mm)
whether or not it rained the next day (Yes or No)
the year of the date
the month of the date
the day of the year
Data in the original weatherAUS data set were obtained from https://www.bom.gov.au/climate/data/. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
A sub-sample of daily weather information on Perth, Australia from the weatherAUS data in the rattle package.
weather_perth
A data frame with 1000 daily observations and 21 variables:
minimum temperature (degrees Celsius)
maximum temperature (degrees Celsius)
rainfall (mm)
direction of strongest wind gust
speed of strongest wind gust (km/h)
direction of wind gust at 9am
direction of wind gust at 3pm
wind speed at 9am (km/h)
wind speed at 3pm (km/h)
humidity level at 9am (percent)
humidity level at 3pm (percent)
atmospheric pressure at 9am (hPa)
atmospheric pressure at 3pm (hPa)
temperature at 9am (degrees Celsius)
temperature at 3pm (degrees Celsius)
whether or not it rained today (Yes or No)
the amount of rain today (mm)
whether or not it rained the next day (Yes or No)
the year of the date
the month of the date
the day of the year
Data in the original weatherAUS data set were obtained from https://www.bom.gov.au/climate/data/. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
A sub-sample of daily weather information from the weatherAUS data in the rattle package for two Australian cities, Wollongong and Uluru. The weather_australia data in the bayesrules package combines this data with a third city, Hobart.
weather_WU
A data frame with 200 daily observations and 22 variables from 2 Australian weather stations:
one of two weather stations
minimum temperature (degrees Celsius)
maximum temperature (degrees Celsius)
rainfall (mm)
direction of strongest wind gust
speed of strongest wind gust (km/h)
direction of wind gust at 9am
direction of wind gust at 3pm
wind speed at 9am (km/h)
wind speed at 3pm (km/h)
humidity level at 9am (percent)
humidity level at 3pm (percent)
atmospheric pressure at 9am (hPa)
atmospheric pressure at 3pm (hPa)
temperature at 9am (degrees Celsius)
temperature at 3pm (degrees Celsius)
whether or not it rained today (Yes or No)
the amount of rain today (mm)
whether or not it rained the next day (Yes or No)
the year of the date
the month of the date
the day of the year
Data in the original weatherAUS data set were obtained from https://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.