Statistical Concepts – A Second Course, 5th Edition

About

Statistical Concepts—A Second Course presents the last 10 chapters from An Introduction to Statistical Concepts, Fourth Edition. Designed for second and upper-level statistics courses, this book highlights how statistics work and how best to utilize them to aid students in the analysis of their own data and the interpretation of research results.

In this new edition, Hahs-Vaughn and Lomax discuss sensitivity, specificity, and false positive and false negative errors. Coverage of effect sizes has been expanded, and more organizational features that summarize key concepts have been included. A final chapter on mediation and moderation has been added for a more complete presentation of regression models.

This book acts as a clear and accessible instructional tool to help readers fully understand statistical concepts and how to apply them to data. It is an invaluable resource for students undertaking a course in statistics in any number of social science and behavioral science disciplines.

Chapter Outline

One-Factor Analysis of Variance - Fixed Effects Model

11.1 What one-factor analysis of variance is and how it works

11.1.1 Characteristics

11.1.1.1 The Layout of the data

11.1.1.2 ANOVA theory

11.1.1.2.1 General theory and logic

11.1.1.2.2 General linear model

11.1.1.2.3 Partitioning the sums of squares

11.1.1.2.4 ANOVA summary table

11.1.1.3 The ANOVA model

11.1.1.3.1 The model

11.1.1.3.2 Estimation of the parameters of the model

11.1.1.3.3 Confidence intervals

11.1.1.3.4 An example

11.1.1.3.5 Expected mean squares

11.1.1.4 The unequal n's or unbalanced design

11.1.1.5 Alternative ANOVA procedures

11.1.1.5.1 Kruskal-Wallis test

11.1.1.5.2 Welch, Brown-Forsythe, and James procedures

11.1.2 Power

11.1.3 Effect Size

11.1.3.1 Eta squared

11.1.3.2 Omega squared and epsilon squared

11.1.3.3 Cohen's f

11.1.3.4 Interpretation of effect size values

11.1.3.5 An effect size example

11.1.3.6 Confidence intervals for effect size

11.1.3.7 Items to Consider

11.1.4 Assumptions

11.1.4.1 Independence

11.1.4.2 Homogeneity of variance

11.1.4.3 Normality

11.2 Computing parametric and nonparametric models using SPSS

11.2.1 One-factor analysis of variance

11.2.2 Nonparametric procedures

11.2.2.1 Kruskal-Wallis

11.2.2.1.1 Interpreting the output for Kruskal-Wallis

11.2.2.2 Welch and Brown-Forsythe

11.2.2.2.1 Interpreting the output for the Welch and Brown-Forsythe

11.3 Computing parametric and nonparametric models using R

11.3.1 Reading data into R

11.3.2 Generating the one-way ANOVA model

11.3.3 Generating the Welch test

11.3.4 Generating the Kruskal-Wallis test

11.4 Data screening

11.4.1 Normality

11.4.1.1 Interpreting normality evidence

11.4.2 Independence

11.4.2.1 Interpreting independence evidence

11.4.3 Homogeneity of variance

11.5 Power using G*Power

11.5.1 Post hoc power for the one-way ANOVA using G*Power

11.5.2 A priori power for the one-way ANOVA using G*Power

11.6 Research question template and example write-up

11.7 Additional resources

Multiple Comparison Procedures

12.1 What multiple comparison procedures are and how they work

12.1.1 Characteristics

12.1.1.1 Contrasts

12.1.1.2 Planned versus post hoc comparisons

12.1.1.3 The Type I error rate

12.1.1.4 Orthogonal contrasts

12.1.2 Selected multiple comparison procedures

12.1.2.1 Planned analysis of trend

12.1.2.2 Planned orthogonal contrasts

12.1.2.3 Planned contrasts with reference group: Dunnett method

12.1.2.4 Other planned contrasts: Dunn (or Bonferroni) and Dunn-Sidak methods

12.1.2.5 Complex post hoc contrasts: Scheffé and Kaiser-Bowden methods

12.1.2.6 Simple post hoc contrasts: Tukey HSD, Tukey-Kramer, Fisher LSD and Fisher-Hayter tests

12.1.2.7 Simple post hoc contrasts for unequal variances: Games-Howell, Dunnett T3, and C tests

12.1.2.8 Follow up tests to Kruskal-Wallis

12.1.3 Selecting the proper multiple comparison procedure

12.2 Computing multiple comparison procedures using SPSS

12.3 Computing multiple comparison procedures using R

12.3.1 Reading data into R

12.3.2 Generating the one-way ANOVA

12.3.3 Generating Tukey’s Multiple Comparison Procedure

12.3.4 Generating trend analysis

12.3.5 Generating other MCPs

12.4 Research question template and example write-up

Factorial Analysis of Variance - Fixed-Effects Model

13.1 What two-factor ANOVA is and how it works

13.1.1 Characteristics

13.1.1.1 The layout of the data

13.1.1.2 The ANOVA model

13.1.1.3 Main effects and interaction effects

13.1.1.4 Partitioning the sums of squares

13.1.1.5 The ANOVA summary table

13.1.1.6 Multiple comparison procedures

13.1.1.7 Expected mean squares

13.1.1.8 An example

13.1.2 Power

13.1.3 Effect size

13.1.3.1 Proportion of total variance effect size

13.1.3.2 Proportion of partial variance effect size

13.1.3.3 Interpreting effect size

13.1.3.4 Additional effect size considerations

13.1.3.5 Effect size example

13.1.3.6 Confidence intervals for effect size

13.1.4 Assumptions

13.2 What three-factor and higher-order ANOVA models are and how they work

13.2.1 Characteristics

13.2.2 The ANOVA model

13.2.3 The ANOVA summary table

13.2.4 The triple interaction

13.3 What the factorial ANOVA with unequal n's is and how it works

13.4 Computing factorial ANOVA using SPSS

13.4.1 Testing a statistically significant interaction

13.5 Computing factorial ANOVA using R

13.5.1 Reading data into R

13.5.2 Generating the factorial ANOVA

13.5.3 Generating tests for homogeneity of variance

13.5.4 Generating post hoc tests

13.5.5 Computing effect size

13.6 Data screening

13.6.1 Normality

13.6.1.1 Interpreting normality evidence

13.6.2 Independence

13.6.2.1 Interpreting independence evidence

13.6.3 Homogeneity of variance

13.7 Power using G*Power

13.7.1 Post hoc power for factorial ANOVA using G*Power

13.7.1.1 Power for interactions

13.7.2 A priori power for factorial ANOVA using G*Power

13.8 Research question template and example write-up

13.9 Additional resources

Introduction to Analysis of Covariance: The One-Factor Fixed-Effects Model with a Single Covariate

14.1 What ANCOVA is and how it works

14.1.1 Characteristics

14.1.1.1 The layout of the data

14.1.1.2 The ANCOVA model

14.1.1.3 The ANCOVA summary table

14.1.1.4 Partitioning the sums of squares

14.1.1.5 Adjusted means and related procedures

14.1.1.6 An example

14.1.1.7 ANCOVA without randomization

14.1.1.8 More complex ANCOVA models

14.1.1.9 Nonparametric ANCOVA procedures

14.1.2 Sample size

14.1.3 Power

14.1.4 Effect size

14.1.5 Assumptions

14.1.5.1 Independence

14.1.5.2 Homogeneity of variance

14.1.5.3 Normality

14.1.5.4 Linearity

14.1.5.5 Fixed independent variable

14.1.5.6 Independence of the covariate and the independent variable

14.1.5.7 Covariate measured without error

14.1.5.8 Homogeneity of regression slopes

14.2 Computing ANCOVA using SPSS

14.3 Computing ANCOVA using R

14.3.1 Reading data into R

14.3.2 Generating the ANCOVA model

14.4 Data screening

14.4.1 Independence

14.4.1.1 Interpreting independence evidence

14.4.2 Homogeneity of variance

14.4.3 Normality

14.4.3.1 Interpreting normality evidence

14.4.4 Linearity

14.4.4.1 Overall linearity evidence

14.4.4.1.1 Interpreting overall linearity evidence

14.4.4.2 Linearity evidence by group

14.4.4.2.1 Interpreting evidence of linearity by group

14.4.5 Independence of covariate and independent variable

14.4.5.1 Interpreting evidence of independence of covariate and independent variable

14.4.6 Homogeneity of regression slopes

14.4.6.1 Interpreting evidence of homogeneity of regression slopes

14.5 Power using G*Power

14.5.1 Post hoc power for ANCOVA using G*Power

14.5.2 A priori power for ANCOVA using G*Power

14.6 Research question template and example write-up

14.7 Additional resources

Random- and Mixed-Effects Analysis of Variance Models

15.1 The one-factor random-effects model

15.1.1 Characteristics of the model

15.1.2 The ANOVA model

15.1.3 ANOVA summary table and expected mean squares

15.1.4 Assumptions and violation of assumptions

15.1.5 Multiple comparison procedures

15.2 The two-factor random-effects model

15.2.1 Characteristics of the model

15.2.2 The ANOVA model

15.2.3 ANOVA summary table and expected mean squares

15.2.4 Assumptions and violation of assumptions

15.2.5 Multiple comparison procedures

15.3 The two-factor mixed-effects model

15.3.1 Characteristics of the model

15.3.2 The ANOVA model

15.3.3 ANOVA summary table and expected mean squares

15.3.4 Assumptions and violation of assumptions

15.3.5 Multiple comparison procedures

15.4 The one-factor repeated measures design

15.4.1 Characteristics of the model

15.4.2 The layout of the data

15.4.3 The ANOVA model

15.4.4 Assumptions and violation of assumptions

15.4.5 ANOVA summary table and expected mean squares

15.4.6 Multiple comparison procedures

15.4.7 Alternative ANOVA procedures

15.4.8 An example

15.5 The two-factor split plot or mixed design

15.5.1 Characteristics of the model

15.5.2 The layout of the data

15.5.3 The ANOVA model

15.5.4 Assumptions and violation of assumptions

15.5.5 ANOVA summary table and expected mean squares

15.5.6 Multiple comparison procedures

15.5.7 An example

15.6 Computing ANOVA Models using SPSS

15.6.1 One-factor random-effects ANOVA

15.6.2 Two-factor random-effects ANOVA

15.6.3 Two-factor mixed-effects ANOVA

15.6.4 One-factor repeated measures ANOVA

15.6.5 Friedman’s Test: Nonparametric One-factor repeated measures ANOVA

15.6.6 Two-factor split-plot ANOVA

15.7 Computing ANOVA Models using R

15.7.1 The one-factor repeated measures design

15.7.2 Restructuring data for the one-factor repeated measures ANOVA model

15.7.3 Generating the one-factor repeated measures ANOVA model

15.7.4 Computing Friedman’s Test in R: Nonparametric one-factor repeated measures ANOVA

15.7.5 Computing the two-factor split-plot or mixed design in R

15.7.5.1 Reading data into R

15.7.5.2 Generating the two-factor split-plot ANOVA

15.8 Data screening for the two-factor split-plot ANOVA

15.8.1 Normality

15.8.1.1 Generating normality evidence

15.8.1.2 Interpreting normality evidence

15.8.2 Independence

15.8.2.1 Generating the scatterplot

15.8.2.2 Interpreting independence evidence

15.9 Power using G*Power

15.9.1 Post hoc power for two-factor split-plot ANOVA

15.9.2 A priori power for two-factor split-plot ANOVA

15.10 Research question template and example write-up

15.11 Additional resources

Hierarchical and Randomized Block Analysis of Variance Models

16.1 What hierarchical and randomized block ANOVA models are and how they work

16.1.1 Characteristics of the two-factor hierarchical model

16.1.1.1 The layout of the data for the two-factor hierarchical model

16.1.1.2 The two-factor hierarchical ANOVA model

16.1.1.3 ANOVA summary table and expected mean squares for the two-factor hierarchical model

16.1.1.4 Multiple comparison procedures for the two-factor hierarchical model

16.1.1.5 An example of the two-factor hierarchical model

16.1.2 Characteristics of the two-factor randomized block design for n = 1

16.1.2.1 The layout of the data for the two-factor randomized block design for n = 1

16.1.2.2 The two-factor randomized block design for n = 1 ANOVA model

16.1.2.3 ANOVA summary table and expected mean squares

16.1.2.4 Multiple comparison procedures

16.1.2.5 Methods of block formation

16.1.2.6 An example

16.1.3 Characteristics of the two-factor randomized block design for n > 1

16.1.4 Characteristics of the Friedman test

16.1.5 Comparison of various ANOVA models

16.1.6 Sample size

16.1.6.1 Hierarchical ANOVA model sample size

16.1.6.2 Randomized block ANOVA sample size

16.1.7 Power

16.1.8 Effect Size

16.1.8.1 Hierarchical ANOVA effect size

16.1.8.2 Two-factor randomized block effect size

16.1.9 Assumptions

16.1.9.1 Assumptions of hierarchical models

16.1.9.2 Assumptions of the two-factor randomized block ANOVA

16.2 Mathematical introduction snapshot

16.3 Computing hierarchical and randomized block ANOVA Models using SPSS

16.3.1 Computing the two-factor hierarchical ANOVA Using SPSS

16.3.2 Computing the two-factor fixed-effects randomized block ANOVA for n = 1 using SPSS

16.3.2.1 Interpreting the output

16.3.3 Computing the two-factor fixed-effects randomized block ANOVA for n > 1 using SPSS

16.3.4 Computing the Friedman Test Using SPSS

16.4 Computing hierarchical and randomized block analysis of variance models using R

16.4.1 Two-factor hierarchical ANOVA in R

16.4.1.1 Reading data into R

16.4.1.2 Generating the two-factor nested ANOVA model

16.4.1.3 Generating a post hoc test

16.4.2 Two-factor fixed-effects randomized block ANOVA in R

16.4.2.1 Reading data into R

16.4.2.2 Generating the two-factor fixed-effects randomized block ANOVA

16.5 Data screening

16.5.1 Examining Assumptions for the Two-Factor Hierarchical ANOVA

16.5.1.1 Normality

16.5.1.1.1 Interpreting normality evidence

16.5.1.2 Independence

16.5.1.3 Homogeneity of variance

16.5.2 Examining assumptions for the two-factor fixed-effects randomized block ANOVA for n = 1

16.5.2.1 Normality

16.5.2.1.1 Interpreting normality evidence

16.5.2.2 Independence

16.5.2.2.1 Generating the scatterplot

16.5.2.2.2 Interpreting independence evidence

16.5.2.3 Homogeneity of variance

16.6 Power using G*Power

16.7 Research question template and example write-up

16.8 Additional resources

Simple Linear Regression

17.1 What simple linear regression is and how it works

17.1.1 Characteristics

17.1.1.1 The Population Simple Linear Regression Model

17.1.1.2 The Sample Simple Linear Regression Model

17.1.1.2.1 Unstandardized regression model

17.1.1.2.2 Standardized regression model

17.1.1.2.3 Prediction errors

17.1.1.2.4 Least squares criterion

17.1.1.2.5 Proportion of predictable variation (coefficient of determination)

17.1.1.2.6 Significance tests and confidence intervals

17.1.2 Sample size

17.1.3 Power

17.1.4 Effect Size

17.1.4.1 Coefficient of determination

17.1.4.2 f²

17.1.4.3 Confidence intervals for effect size

17.1.5 Assumptions

17.1.5.1 Independence

17.1.5.2 Homoscedasticity

17.1.5.3 Normality

17.1.5.4 Linearity

17.1.5.5 Fixed X

17.1.5.6 Summary

17.2 Mathematical introduction snapshot

17.3 Computing simple linear regression using SPSS

17.4 Computing simple linear regression using R

17.4.1 Reading data into R

17.4.2 Generating the simple linear regression model

17.4.3 Generating correlation coefficients

17.4.4 Generating confidence intervals of coefficient estimates

17.5 Data screening

17.5.1 Independence

17.5.2 Homoscedasticity

17.5.3 Linearity

17.5.3.1 Hypothesis tests to examine linearity using SPSS

17.5.3.1.1 Interpreting hypothesis tests to examine linearity

17.5.4 Normality

17.5.4.1 Generating normality evidence

17.5.4.2 Interpreting normality evidence

17.5.5 Screening data for influential points

17.5.5.1 Casewise diagnostics

17.5.5.2 Cook's distance

17.5.5.3 Mahalanobis distances

17.5.5.4 DfBeta

17.6 Power using G*Power

17.6.1 Post hoc power

17.6.2 A priori power

17.7 Research question template and example write-up

17.8 Additional resources

Multiple Linear Regression

18.1 What multiple linear regression is and how it works

18.1.1 Characteristics

18.1.1.1 Partial correlation

18.1.1.2 Semipartial (part) correlation

18.1.1.3 Unstandardized regression model

18.1.1.4 Standardized regression model

18.1.1.5 Coefficient of multiple determination and multiple correlation

18.1.1.6 Significance tests

18.1.1.6.1 Test of significance of the overall regression model

18.1.1.6.2 Test of significance of b_k

18.1.1.6.3 Other tests

18.1.1.7 Methods of entering predictors

18.1.1.7.1 Simultaneous regression

18.1.1.7.2 Backward elimination

18.1.1.7.3 Forward selection

18.1.1.7.4 Stepwise selection

18.1.1.7.5 All possible subsets regression

18.1.1.7.6 Hierarchical regression

18.1.1.7.7 Commentary on sequential regression procedures

18.1.1.8 Nonlinear relationships

18.1.1.9 Interactions

18.1.1.10 Categorical predictors

18.1.2 Sample size

18.1.3 Power

18.1.4 Effect size

18.1.4.1 Coefficient of multiple determination, R²

18.1.4.2 Multiple partial R²

18.1.4.3 f²

18.1.4.4 Partial f²

18.1.4.5 Additional effect size considerations

18.1.5 Assumptions

18.1.5.1 Independence

18.1.5.2 Homoscedasticity

18.1.5.3 Normality

18.1.5.4 Linearity

18.1.5.5 Fixed X

18.1.5.6 Noncollinearity

18.1.5.7 Summary of assumptions

18.2 Mathematical introduction snapshot

18.3 Computing multiple linear regression using SPSS

18.4 Computing multiple linear regression using R

18.4.1 Reading data into R

18.4.2 Generating the multiple regression model and saving values

18.4.3 Generating correlation coefficients

18.4.4 Generating confidence intervals of coefficient estimates

18.5 Data screening

18.5.1 Independence

18.5.2 Homoscedasticity

18.5.3 Linearity

18.5.4 Normality

18.5.4.1 Interpreting normality evidence

18.5.5 Screening data for influential points

18.5.5.1 Casewise diagnostics

18.5.5.2 Cook's distance

18.5.5.3 Mahalanobis distance

18.5.5.4 Centered leverage values

18.5.5.5 DfBeta

18.5.5.6 Diagnostic plots

18.5.6 Noncollinearity

18.6 Power using G*Power

18.6.1 Post Hoc power

18.6.2 A priori power

18.7 Research question template and example write-up

18.8 Additional resources

Logistic Regression

19.1 What logistic regression is and how it works

19.1.1 Characteristics

19.1.1.1 Logistic regression equation

19.1.1.2 Probability

19.1.1.3 Odds and logit (or log odds)

19.1.1.4 Estimation and model fit

19.1.1.5 Significance tests

19.1.1.5.1 Test of significance of the overall regression model

19.1.1.5.1.1 Change in log likelihood

19.1.1.5.1.2 Hosmer-Lemeshow goodness of fit test

19.1.1.5.1.3 Pseudo-variance explained

19.1.1.5.1.4 Predicted group membership

19.1.1.5.1.5 Cross-validation

19.1.1.6 Test of significance of the logistic regression coefficients

19.1.1.7 Methods of predictor entry

19.1.1.7.1 Simultaneous logistic regression

19.1.1.7.2 Stepwise logistic regression

19.1.1.7.3 Hierarchical regression

19.1.2 Sample size

19.1.3 Power

19.1.4 Effect size

19.1.5 Assumptions

19.1.5.1 Noncollinearity

19.1.5.2 Linearity

19.1.5.3 Independence of errors

19.1.5.4 Fixed X

19.1.5.5 Conditions

19.1.5.5.1 Nonzero cell counts

19.1.5.5.2 Nonseparation of data

19.1.5.5.3 Lack of influential points

19.2 Mathematical introduction snapshot

19.3 Computing logistic regression using SPSS

19.4 Computing logistic regression using R

19.4.1 Reading data into R

19.4.2 Generating the logistic regression model and saving values

19.4.3 Generating confidence intervals of coefficient estimates

19.4.4 Exponentiating coefficients

19.4.5 Producing odds ratios and their confidence intervals

19.5 Data screening

19.5.1 Noncollinearity

19.5.2 Linearity

19.5.3 Independence

19.5.4 Absence of outliers

19.5.4.1 Cook's distance

19.5.4.2 Leverage values

19.5.4.3 DfBeta

19.5.5 Assessing classification accuracy

19.5.5.1 ROC curves and AUC

19.6 Power using G*Power

19.6.1 Post hoc power

19.6.2 A priori power

19.7 Research question template and example write-up

19.8 Additional resources

Mediation and Moderation

20.1 What mediation is and how it works

20.1.1 Characteristics

20.1.1.1 Additional mediation models

20.1.2 Sample size

20.1.3 Power

20.1.4 Effect size

20.1.4.1 Partially standardized effect

20.1.4.2 Completely standardized effect

20.1.4.3 Other effect size indices for mediation models

20.1.5 Assumptions

20.2 What moderation is and how it works

20.2.1 Characteristics

20.2.1.1 Probing an interaction

20.2.1.2 Centering

20.2.2 Sample size

20.2.3 Power

20.2.4 Effect size

20.2.5 Assumptions

20.3 Computing mediation and moderation using SPSS

20.3.1 Installing the PROCESS macro

20.3.2 Computing mediation analysis using SPSS

20.3.2.1 Interpreting mediation output

20.3.3 Computing moderation analysis using SPSS

20.3.3.1 Interpreting moderation output

20.4 Computing mediation and moderation using R

20.4.1 Reading data into R

20.4.2 Generating a mediation model using R

20.4.3 Generating a moderation model using R

20.5 Additional resources

Secondary Data Source

The links below point to sources of secondary data. Many are publicly available; some require registration, and some grant access to restricted-use data only through an application process.

Bureau of Justice Statistics.  Data is collected on a number of topics such as corrections, courts, crime type, law enforcement, and victims.

https://www.bjs.gov/

Census Bureau.  The U.S. Census Bureau’s American FactFinder provides access to data about the United States, Puerto Rico, and the island areas.  Data from surveys and censuses include:  American Community Survey, Commodity Flow Survey, Economic Census, Population Estimates Program, and more.

https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Centers for Disease Control and Prevention.  Data and statistics are available on a wide variety of health related areas, such as alcohol use, birth defects, cancer, deaths and mortality, environmental health, healthy aging, life expectancy, physical activity, smoking and tobacco, and more.  Survey data accessible includes the National Health Interview Survey and National Survey of Family Growth, among others.

https://www.cdc.gov/DataStatistics/

Central Intelligence Agency (CIA) World Factbook.  For over 260 countries and other world entities, information is available on history, government, people, economy, geography, transportation, military, and more.

https://www.cia.gov/library/publications/resources/the-world-factbook/

Child Care and Early Education Research Connections. Research Connections promotes high quality research in child care and early education and the use of that research in policy making.

https://www.researchconnections.org/childcare/search/studies

Common Core of Data (CCD).  The CCD is the Department of Education’s primary database on public elementary and secondary education in the U.S.  The CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.

https://nces.ed.gov/ccd/

Equity in Athletics.  Provided by the Office of Postsecondary Education of the U.S. Department of Education, the data are drawn from the OPE Equity in Athletics Disclosure website database, which consists of athletics data submitted annually by all co-educational postsecondary institutions that receive Title IV funding (i.e., those that participate in federal student aid programs) and that have an intercollegiate athletics program as required by the Equity in Athletics Disclosure Act.

https://ope.ed.gov/athletics/#/

European Union statistics.  A wide variety of statistics on countries in Europe is available from this site.  Data topics include, for example, economic trends, trade, transportation, environment and energy, science, technology, digital society, and more.

https://ec.europa.eu/eurostat/

Geospatial and Statistical Data Center. Hosted by the University of Virginia, this site provides access to Census data, maps, and more.

http://www.worldcat.org/identities/lccn-no99-40549/

Inter-University Consortium for Political & Social Research (ICPSR).  Through the University of Michigan’s ICPSR, access is provided to a large number (over 10,000) and wide variety of datasets (e.g., National Longitudinal Study of Adolescent to Adult Health, Add Health, 1994-2008; National Health and Nutrition Examination Survey, NHANES, 2007–2008; India Human Development Survey-II, 2011–2012).  Most ICPSR data holdings are public use with no access restrictions.

https://www.icpsr.umich.edu/icpsrweb/ICPSR/

Integrated Postsecondary Education Data System (IPEDS).  IPEDS provides information on U.S. colleges, universities, and technical and vocational institutions.

https://nces.ed.gov/ipeds/

National Center for Education Statistics (NCES).  NCES provides access to a wide variety of data related to education such as the Early Childhood Longitudinal Studies (ECLS) program, IPEDS, Schools and Staffing, and more.

https://nces.ed.gov/surveys/

National Institutes of Health Supported Data Repositories.  “This table lists NIH-supported data repositories that make data accessible for reuse. Most accept submissions of appropriate data from NIH-funded investigators (and others), but some restrict data submission to only those researchers involved in a specific research network. Also included are resources that serve as a portal for information about biomedical data and information sharing systems.”

https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

National Science Foundation Scientists and Engineers Statistical Data System (SESTAT). These public use data files are related to the science and engineering workforce and STEM graduates, including the National Survey of College Graduates, Recent College Graduates, and Survey of Doctorate Recipients.

https://sestat.nsf.gov/datadownload/

Office of Population Research (OPR).  The OPR’s data archive includes access to the Mexican Migration Project, Fragile Families and Child Wellbeing Study, New Immigrant Survey, The Game of Contacts (a behavioral surveillance study of heavy drug users in Curitiba, Brazil), National Longitudinal Survey of Freshmen, and much more.

https://opr.princeton.edu/archive/

Organization for Economic Cooperation and Development.  Country level data is available on multiple indicators such as agriculture, development, population, education, GDP, tax, income equality, debt, unemployment, and more.

https://data.oecd.org/

Pew Research Center.  The Pew Center collects data on topics related to U.S. politics, media and news, social and demographic trends, religion and public life, Internet and technology, science, Hispanic trends, and more.

https://www.people-press.org/datasets/

Program for International Student Assessment (PISA). “The Program for International Student Assessment (PISA) is an international assessment that measures 15-year-old students' reading, mathematics, and science literacy every three years. First conducted in 2000, the major domain of study rotates between reading, mathematics, and science in each cycle. PISA also includes measures of general or cross-curricular competencies, such as collaborative problem solving. By design, PISA emphasizes functional skills that students have acquired as they near the end of compulsory schooling. PISA is coordinated by the Organization for Economic Cooperation and Development (OECD), an intergovernmental organization of industrialized countries, and is conducted in the United States by NCES.”

https://nces.ed.gov/surveys/pisa/datafiles.asp

State Profiles.  Search for statewide information on elementary and secondary education, postsecondary education, and selected demographics for all states in the U.S., based on data collected and maintained by the National Center for Education Statistics. Data are also available for the U.S. average, and the resource can graph the results.

https://nces.ed.gov/pubs2000/stateprofiles/state_profiles/index.asp

Statistics Canada. This is the national statistical office for Canada.  A variety of public use microdata is available such as the Canadian Community Health Survey, Employment Insurance Coverage Survey, Travel Survey of Residents of Canada, National Graduates Survey, Canadian Internet Use Survey, and more.

https://www.statcan.gc.ca/

Study of Instructional Improvement.  “The Study of Instructional Improvement (SII) was a large scale quasi-experiment that sought to understand the impact of three widely-disseminated comprehensive school reform (CSR) programs on instruction and student achievement in high-poverty elementary schools. Over a four-year period, researchers at the University of Michigan followed schools working with one of three CSR programs—Accelerated Schools Project, America's Choice, and Success for All. The study also followed a set of closely matched comparison schools. The purpose of the study was to track implementation of the CSR programs in elementary schools and to investigate the impact of participation in these programs on teachers, students, and schools.” This website provides readers with an online report that describes the SII research program and provides a narrative account highlighting selected findings. The website also allows readers to gain familiarity with and/or to download SII data (note that SPSS data files are made available).

http://www.sii.soe.umich.edu/

Survey of Adult Skills (PIAAC).  The Survey of Adult Skills is an international survey conducted in multiple countries as part of the Programme for the International Assessment of Adult Competencies (PIAAC). It measures the key cognitive and workplace skills needed for individuals to participate in society and for economies to prosper.

http://www.oecd.org/skills/piaac/publicdataandanalysis/

United Nations.  Country-level data is available on population, education, labor market, international merchandise trade, energy, crime, nutrition and health, science and technology, finance, environment, tourism, and more.

http://data.un.org/

World Bank. Through the World Bank, data are available on the economy, health, education, and much more for countries throughout the world.

https://data.worldbank.org/

World Values Survey.  “The World Values Survey (www.worldvaluessurvey.org) is a global network of social scientists studying changing values and their impact on social and political life, led by an international team of scholars, with the WVS association and secretariat headquartered in Stockholm, Sweden. The WVS seeks to help scientists and policy makers understand changes in the beliefs, values and motivations of people throughout the world. Thousands of political scientists, sociologists, social psychologists, anthropologists and economists have used these data to analyze such topics as economic development, democratization, religion, gender equality, social capital, and subjective well-being. These data have also been widely used by government officials, journalists and students, and groups at the World Bank have analyzed the linkages between cultural factors and economic development.”

http://www.worldvaluessurvey.org/wvs.jsp

OTHER COLLECTIONS

Data Repositories.  A list of data repositories where datasets for articles published in Scientific Data may be hosted.

https://www.nature.com/sdata/policies/repositories

Economics-related.  Much of the data published in articles by Dr. Joshua Angrist (MIT) can be accessed here.

https://economics.mit.edu/faculty/angrist/data1/data

Journal of Statistics Education data archive.  Data archived for publications from the journal.  Note that reading the article in which the data were published is important for understanding where the data come from.

http://jse.amstat.org/jse_data_archive.htm

Politically-related.  A collection of links to politically related datasets, compiled by Professor Dale Story at the University of Texas-Arlington.

http://www.uta.edu/faculty/story/DataSets.htm

Social science related.  Maintained by the University of Amsterdam, the site provides links to a number of worldwide and country-specific entities for data and statistics related to social science topics.

http://www.sociosite.net/databases.php

OTHER RESOURCES ON FINDING AND ACCESSING SECONDARY DATA

If your institution has access to lynda.com, you may want to access the video on “Learning Public Data Sets.”  This video shows how to find free, public sources of data on a variety of business, education, and health issues and download the data for your own analysis. Author Curt Frye introduces resources from the US government (from Census to trademark data), international agencies such as the World Bank and United Nations, search engines, web services, and even language resources like the Ngram Viewer for Google Books. He also shows how to import the data into an Excel spreadsheet for visualization and analysis. Topics addressed in the video include:

  • Working with US census data
  • Using data from the Securities and Exchange Commission
  • Accessing data from other US agencies
  • Finding international sources of data
  • Gathering data from web-based search engines and data portals
  • Visualizing and analyzing public data sets in Excel