# Exploring Data using Stata: Descriptive Statistics

This tutorial provides instructions on exploring the basic features of data and conducting preliminary analysis using Stata.

## 1. Introduction

This guide discusses techniques for exploring data using Stata. To explore data, we usually need to know the format of the variables, their summary statistics, frequencies, and cross-tabulations. We will provide the Stata commands for each of these tasks. We will use a built-in Stata dataset throughout this guide, which we can load by typing the following command in the Stata command window:

sysuse nlsw88, clear

Note: To practice the commands using your data, you have to open your data from your working directory. You can do it using the point and click technique. For instance, to open a Stata dataset, which is stored as a .dta file, click on

File → Open → your .dta file

## 2. Descriptive Statistics (describe)

Descriptive statistics is vital to understanding the nature of your data. It provides a basic description of your data and allows you to explore the formats ("display format") of the variables. We will use the describe command to get descriptive statistics.

We will explore the nlsw88 dataset, which ships with Stata.

First, get the dataset by typing:

sysuse nlsw88, clear

Note: for your data, open it from your working directory by clicking File → Open → your .dta file

In the command window, type:

describe

Stata will give us the following description table.

Contains data from C:\Program Files\Stata18\ado\base/n/nlsw88.dta

Observations: 2,246 (NLSW, 1988 extract)

Variables: 17 (1 May 2022 22:52; _dta has notes)

| Variable name | Storage type | Display format | Value label | Variable label |
|---|---|---|---|---|
| idcode | int | %8.0g | | NLS ID |
| age | byte | %8.0g | | Age in current year |
| race | byte | %8.0g | racelbl | Race |
| married | byte | %8.0g | marlbl | Married |
| never_married | byte | %16.0g | nev_mar | Never married |
| grade | byte | %8.0g | | Current grade completed |
| collgrad | byte | %16.0g | gradlbl | College graduate |
| south | byte | %9.0g | southlbl | Lives in the south |
| smsa | byte | %9.0g | smsalbl | Lives in SMSA |
| c_city | byte | %16.0g | ccitylbl | Lives in a central city |
| industry | byte | %23.0g | indlbl | Industry |
| occupation | byte | %22.0g | occlbl | Occupation |
| union | byte | %8.0g | unionlbl | Union worker |
| wage | float | %9.0g | | Hourly wage |
| hours | byte | %8.0g | | Usual hours worked |
| ttl_exp | float | %9.0g | | Total work experience (years) |
| tenure | float | %9.0g | | Job tenure (years) |

Sorted by: idcode

Interpretation

• The above table provides a summary of the entire dataset. For instance, Variables: 17 indicates that there are 17 variables in the dataset.
• Storage type tells us how each variable is stored. Numeric variables can be byte, int, long, float, or double; text is stored as str. byte and int hold whole numbers with no decimal or fractional component (byte covers -127 to 100 and int covers -32,767 to 32,740), so byte is typical for categorical or ordinal variables with a small number of distinct values. float stores numbers with a decimal part at about 7 digits of precision, and double does the same at about 16 digits. str indicates that the variable is stored as a string (text).
• Display format refers to the format in which numeric values are displayed when we view or print the data in Stata; it does not change the stored values. For instance, %9.0g for the wage variable indicates that "wage" values are displayed with a total width of 9 characters. The "g" indicates the general format, a flexible format that lets Stata choose the number of decimal places so that values appear compactly and legibly.
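To see that a display format affects only how values are shown, the following minimal sketch changes the format of wage and inspects it again. The choice of %9.2f (fixed format, two decimals) and the use of list on the first five observations are illustrative assumptions, not part of the original guide.

```stata
* Load the example data
sysuse nlsw88, clear

* Show the storage type and display format of wage only
describe wage

* Assumption for illustration: display wage with two fixed decimals
format wage %9.2f
list idcode wage in 1/5

* The storage type is still float; only the display format has changed
describe wage
```

Because the data are reloaded with sysuse each time, this change does not persist unless the dataset is saved.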

## 3. Summary Statistics (summarize)

We will use the summarize command in Stata to get the basic summary statistics of the variables. The command can be abbreviated as su. It reports basic statistics and helps us understand the essential characteristics of the variables.

We will check the summary statistics for the nlsw88 dataset that ships with Stata.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

Note: for your data, open it from your working directory by clicking File → Open → your .dta file

In the command window, type:

su

Stata will give us the following summary table.

| Variable | Obs | Mean | Std. dev. | Min | Max |
|---|---|---|---|---|---|
| idcode | 2,246 | 2612.654 | 1480.864 | 1 | 5159 |
| age | 2,246 | 39.15316 | 3.060002 | 34 | 46 |
| race | 2,246 | 1.282725 | .4754413 | 1 | 3 |
| married | 2,246 | .6420303 | .4795099 | 0 | 1 |
| never_marr~d | 2,246 | .1041852 | .3055687 | 0 | 1 |
| grade | 2,244 | 13.09893 | 2.521246 | 0 | 18 |
| collgrad | 2,246 | .2368655 | .4252538 | 0 | 1 |
| south | 2,246 | .4194123 | .4935728 | 0 | 1 |
| smsa | 2,246 | .7039181 | .4566292 | 0 | 1 |
| c_city | 2,246 | .2916296 | .4546139 | 0 | 1 |
| industry | 2,232 | 8.189516 | 3.010875 | 1 | 12 |
| occupation | 2,237 | 4.642825 | 3.408897 | 1 | 13 |
| union | 1,878 | .2454739 | .4304825 | 0 | 1 |
| wage | 2,246 | 7.766949 | 5.755523 | 1.004952 | 40.74659 |
| hours | 2,242 | 37.21811 | 10.50914 | 1 | 80 |
| ttl_exp | 2,246 | 12.53498 | 4.610208 | .1153846 | 28.88461 |
| tenure | 2,231 | 5.97785 | 5.510331 | 0 | 25.91667 |

Interpretation

• The above table reports number of observations, mean value, standard deviation, the minimum value, and the maximum value of each of the variables in the dataset.
• Let's interpret the summary stat of the second variable age. The mean age of the individuals in the dataset is 39.2 and the standard deviation of the variable is 3.06. The minimum age of the individuals in the dataset is 34 and the maximum age is 46.
• married looks like a dummy variable as the minimum value of it is 0 and the maximum value is 1. If 1 is defined as married in the codebook, we can say 64% of the people are married as the mean value of the variable is 0.6420303.
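The reasoning about married (the mean of a 0/1 variable is the share of ones) can be checked directly, because summarize leaves its results in r(). The sketch below is a minimal illustration; the display formatting is an arbitrary choice.

```stata
* Load the example data
sysuse nlsw88, clear

* Run summarize without printing the table; results are stored in r()
quietly summarize married

* The mean of a 0/1 dummy is the share of observations equal to 1
display "Share married: " %5.1f 100*r(mean) "%"

* List everything summarize stored (r(N), r(mean), r(sd), r(min), r(max), ...)
return list
```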

The "summarize, detail" command is useful for getting a comprehensive overview of the statistical properties and distribution of each variable in the dataset. To get the detailed set of summary statistics for each of the variables in the dataset, type:

su, detail

Stata will give us the following detailed summary statistics.

Note: we reported the detailed summary stat for the first three variables in the dataset below.

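| Percentile | Value | Largest | | |
|---|---|---|---|---|
| 50% | 1 | | Mean | 1.282725 |
| | | | Std. dev. | .4754413 |
| 75% | 2 | 3 | | |
| 90% | 2 | 3 | Variance | .2260444 |
| 95% | 2 | 3 | Skewness | 1.284394 |
| 99% | 3 | 3 | Kurtosis | 3.409155 |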

Interpretation

• Let's interpret the detailed summary statistics for the age variable. The percentile column displays several percentiles (e.g., 10%, 25%, 50%, 75%, 90%), which are points that divide the sorted data into the specified percentages. The 50% percentile is the median, the middle value of 'age' when the data are sorted, which is 39. Mean, Variance, and Std. dev. report the mean, variance, and standard deviation of the variable, which are 39.2, 9.4, and 3.1 respectively for 'age'. The skewness is .20, which suggests that the distribution has a slight positive (right) skew.
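The figures quoted for age can be retrieved one at a time from the results that summarize, detail stores. This is a minimal sketch for checking the interpretation; it prints the values unformatted.

```stata
* Load the example data
sysuse nlsw88, clear

* The detail option stores percentiles and moments in r()
quietly summarize age, detail

display "Median (p50): " r(p50)
display "Variance:     " r(Var)
display "Std. dev.:    " r(sd)
display "Skewness:     " r(skewness)
```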

We can get the summary statistics for a particular variable with a condition. For instance, if we want to get the summary statistics of the variable 'age' if the person is 'married', type

su age if married == 1

Stata will give us the following table.

| Variable | Obs | Mean | Std. dev. | Min | Max |
|---|---|---|---|---|---|
| age | 1,442 | 39.1165 | 3.066058 | 34 | 45 |

The above results indicate that the average age of the married people in the dataset is 39.12, and the minimum age of a married person in the database is 34.
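Conditions can be combined, and variables with missing values need some care. The sketch below is illustrative only: the particular conditions are assumptions chosen for the example. It relies on the fact that Stata treats a numeric missing value as larger than any number, so a condition such as union != 0 also selects observations where union is missing.

```stata
* Load the example data
sysuse nlsw88, clear

* Combine conditions with & (and) or | (or)
summarize age if married == 1 & collgrad == 1

* union is missing for some observations, so these counts differ:
count if union == 1                        // union members only
count if union != 0                        // union members plus missing values
count if union != 0 & !missing(union)      // excludes the missing values again
```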

If we want the summary statistics for a set of variables, we type su followed by the variable names. For instance, to get the summary statistics for the variables 'age', 'race', 'occupation', 'union', and 'wage', type:

su age race occupation union wage

Stata will give us the following table:

| Variable | Obs | Mean | Std. dev. | Min | Max |
|---|---|---|---|---|---|
| age | 2,246 | 39.15316 | 3.060002 | 34 | 46 |
| race | 2,246 | 1.282725 | .4754413 | 1 | 3 |
| occupation | 2,237 | 4.642825 | 3.408897 | 1 | 13 |
| union | 1,878 | .2454739 | .4304825 | 0 | 1 |
| wage | 2,246 | 7.766949 | 5.755523 | 1.004952 | 40.74659 |

If we want the summary statistics for a range of variables, such as 'idcode' to 'married', type:

su idcode-married

Stata will give us the following table

| Variable | Obs | Mean | Std. dev. | Min | Max |
|---|---|---|---|---|---|
| idcode | 2,246 | 2612.654 | 1480.864 | 1 | 5159 |
| age | 2,246 | 39.15316 | 3.060002 | 34 | 46 |
| race | 2,246 | 1.282725 | .4754413 | 1 | 3 |
| married | 2,246 | .6420303 | .4795099 | 0 | 1 |
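A range such as idcode-married follows the order of the variables in the dataset, so it can help to check which variables a range covers before summarizing. The sketch below uses ds for that check; the wildcard example (t*) is an additional illustration that is not in the original guide.

```stata
* Load the example data
sysuse nlsw88, clear

* List the variables that the range idcode-married expands to
ds idcode-married

* Illustration: a wildcard selects every variable whose name starts with t
summarize t*
```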

Grouped summary statistics: We can get the summary statistics separately for each group defined by a variable. For instance, if we want the summary statistics of the 'grade' and 'wage' variables for each category of the 'occupation' variable, type:

by occupation, sort: summarize grade wage

Stata will report a separate summary table of 'grade' and 'wage' for each occupation category.

## 4. Codebook of the Variables (codebook)

The codebook command in Stata is a valuable tool for getting detailed information about the variables in a dataset. It provides information on variable names, value labels, data types, summary statistics, and other relevant details. Use the following steps to apply the codebook command.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

To generate a codebook for a specific variable "wage," type:

codebook wage

Stata will give us the following outputs

| Percentiles | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| | 3.22061 | 4.25926 | 6.27227 | 9.59742 | 12.7778 |

Notice that the codebook output gives us the basic summary statistics (e.g., mean, std. dev., range) of the variable "wage" along with the number of missing values and the percentiles.

Conditional codebook: To generate a codebook for the variable "wage" if the person is from "south", type:

codebook wage if south == 1

Stata will give us the following outputs

| Percentiles | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| | 2.89855 | 3.94525 | 5.5475 | 8.05153 | 11.4171 |

Compact codebook: the compact option provides a compact set of summary statistics for the variables in a dataset. To get it, type:

codebook, compact

Stata will report one line per variable.

## 5. Frequency Tables (tab, tab1, crosstab)

We use the tab command to produce frequency tables in Stata. We will show different ways to create frequency tables in Stata.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

To create a frequency table for a single categorical variable "married" type:

tab married

Stata will give us the following frequency table.

| Married | Freq. | Percent | Cum. |
|---|---|---|---|
| Single | 804 | 35.80 | 35.80 |
| Married | 1,442 | 64.20 | 100.00 |
| Total | 2,246 | 100.00 | |

Interpretation

• The above table indicates that 804 people in the dataset are single, which is 35.80% of all observations. Similarly, 1,442 people in the dataset are married, which is 64.20% of all observations.

To get the numerical values of the above categorical variable rather than value labels, type:

tab married, nolabel

Stata will give us the following table:

| Married | Freq. | Percent | Cum. |
|---|---|---|---|
| 0 | 804 | 35.80 | 35.80 |
| 1 | 1,442 | 64.20 | 100.00 |
| Total | 2,246 | 100.00 | |

In the above table, 0 indicates single and 1 indicates married.
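The mapping between the numeric codes and their labels can also be read straight from the value label attached to the variable. The describe output earlier shows that married uses the value label marlbl; the sketch below simply lists it.

```stata
* Load the example data
sysuse nlsw88, clear

* Show which numeric code carries which label for married
label list marlbl

* describe confirms which value label is attached to the variable
describe married
```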

To create a frequency table with one-way bar plots for a single categorical variable "occupation" type:

tab occupation, plot sort

Stata will give us the following output table:

| Occupation | Freq. | Plot |
|---|---|---|
| Sales | 726 | ****************************************** |
| Professional/Technical | 317 | ****************** |
| Laborers | 286 | ***************** |
| Managers/Admin | 264 | *************** |
| Operatives | 246 | ************** |
| Other | 187 | *********** |
| Clerical/Unskilled | 102 | ****** |
| Craftsmen | 53 | *** |
| Transport | 28 | ** |
| Service | 16 | * |
| Farm laborers | 9 | * |
| Household workers | 2 | |
| Farmers | 1 | |
| Total | 2,237 | |

tab1: If we want to get frequencies for more than one variable (e.g., "married", "south", and "race") at the same time, we run the following command:

tab1 married south race

Stata will give us one frequency table per variable. The table for "race" is shown below.

| Race | Freq. | Percent | Cum. |
|---|---|---|---|
| White | 1,637 | 72.89 | 72.89 |
| Black | 583 | 25.96 | 98.84 |
| Other | 26 | 1.16 | 100.00 |
| Total | 2,246 | 100.00 | |

The output provides a frequency table for each of the variables listed in the command.

Crosstab: Crosstabulation is useful if we want to get the joint distribution of two variables in a dataset. To get the crosstabulation of the categorical variables "race" and "collgrad", type:

tab collgrad race

Stata will give us the following table.

| College graduate | White | Black | Other | Total |
|---|---|---|---|---|
| Not college grad | 1,217 | 480 | 17 | 1,714 |
| College grad | 420 | 103 | 9 | 532 |
| Total | 1,637 | 583 | 26 | 2,246 |

The above table indicates that out of the 532 college graduates, 420 are White, 103 are Black, and 9 are from other races.

If we want to get row percentages instead of counts, type:

tab collgrad race, row nofreq

Stata will give us the following table.

| College graduate | White | Black | Other | Total |
|---|---|---|---|---|
| Not college grad | 71.00 | 28.00 | 0.99 | 100.00 |
| College grad | 78.95 | 19.36 | 1.69 | 100.00 |
| Total | 72.89 | 25.96 | 1.16 | 100.00 |

From the above table we can say that 78.95% of college graduates are White, 19.36% are Black, and 1.69% are from other races.

If we want row percentages in addition to the counts in the above table, type:

tab collgrad race, row

Stata will give us the following table.

| College graduate | Entry | White | Black | Other | Total |
|---|---|---|---|---|---|
| Not college grad | Frequency | 1,217 | 480 | 17 | 1,714 |
| | Row % | 71.00 | 28.00 | 0.99 | 100.00 |
| College grad | Frequency | 420 | 103 | 9 | 532 |
| | Row % | 78.95 | 19.36 | 1.69 | 100.00 |
| Total | Frequency | 1,637 | 583 | 26 | 2,246 |
| | Row % | 72.89 | 25.96 | 1.16 | 100.00 |

Interpretation

• The above table displays both counts and row percentages. For instance, among the 1,714 people who are not college graduates, 1,217 are White, which is 71% of all non-graduates.

Notice that the above table reports percentages for rows. If we want the column percentages instead, type:

tab collgrad race, column

Stata will give us the following table:

| College graduate | Entry | White | Black | Other | Total |
|---|---|---|---|---|---|
| Not college grad | Frequency | 1,217 | 480 | 17 | 1,714 |
| | Column % | 74.34 | 82.33 | 65.38 | 76.31 |
| College grad | Frequency | 420 | 103 | 9 | 532 |
| | Column % | 25.66 | 17.67 | 34.62 | 23.69 |
| Total | Frequency | 1,637 | 583 | 26 | 2,246 |
| | Column % | 100.00 | 100.00 | 100.00 | 100.00 |

Interpretation

• The above table indicates that 74.34% of White people are not college graduates, while 25.66% are. Similarly, 82.33% of Black people are not college graduates, while 17.67% are.

If we want to get both row and column percentages in the same table, type:

tab collgrad race, row column

Stata will give us the following table:

| College graduate | Entry | White | Black | Other | Total |
|---|---|---|---|---|---|
| Not college grad | Frequency | 1,217 | 480 | 17 | 1,714 |
| | Row % | 71.00 | 28.00 | 0.99 | 100.00 |
| | Column % | 74.34 | 82.33 | 65.38 | 76.31 |
| College grad | Frequency | 420 | 103 | 9 | 532 |
| | Row % | 78.95 | 19.36 | 1.69 | 100.00 |
| | Column % | 25.66 | 17.67 | 34.62 | 23.69 |
| Total | Frequency | 1,637 | 583 | 26 | 2,246 |
| | Row % | 72.89 | 25.96 | 1.16 | 100.00 |
| | Column % | 100.00 | 100.00 | 100.00 | 100.00 |

Interpretation

• In the above table, the first entry in each cell is the frequency, the second is the row percentage, and the third is the column percentage.
• For example, 1,217 indicates that we have 1,217 White individuals who are not college graduates in the dataset. 71.00 indicates that 71% of all non-graduates are White. 74.34 indicates that 74.34% of all White people in the dataset are not college graduates.
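The row and column percentages for a cell can be reproduced by hand from the counts, which makes the difference between the two explicit. The sketch below is one possible way to do this, using the matcell() option of tabulate to save the counts in a matrix; the matrix name freq is an arbitrary choice for this illustration.

```stata
* Load the example data
sysuse nlsw88, clear

* Save the cell counts in a matrix (rows: collgrad 0/1, columns: race 1/2/3)
quietly tabulate collgrad race, matcell(freq)
matrix list freq

* White non-graduates as a share of all non-graduates (row percentage)
display "Row %:    " %5.2f 100*freq[1,1]/(freq[1,1] + freq[1,2] + freq[1,3])

* White non-graduates as a share of all White people (column percentage)
display "Column %: " %5.2f 100*freq[1,1]/(freq[1,1] + freq[2,1])
```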

## 6. Customized Tables (tabstat)

tabstat is another command that provides summary statistics. Let's see how we can use this command to explore a dataset using our nlsw88 data.

First, get the data by typing:

sysuse nlsw88, clear

To use tabstat, type the command name (tabstat) followed by the variable names and the statistics() option, abbreviated s(), which specifies the statistics we want to see. For instance, if we want the summary statistics for the variables "age," "married," "collgrad," "south," "c_city," "union," and "wage", type:

tabstat age married collgrad south c_city union wage, s(mean semean median sd var skew k count sum range min max)

Stata will give us the following table:

| Stats | age | married | collgrad | south | c_city | union | wage |
|---|---|---|---|---|---|---|---|
| Mean | 39.15316 | .6420303 | .2368655 | .4194123 | .2916296 | .2454739 | 7.766949 |
| se(mean) | .0645679 | .010118 | .0089731 | .0104147 | .0095926 | .0099336 | .1214451 |
| p50 | 39 | 1 | 0 | 0 | 0 | 0 | 6.27227 |
| SD | 3.060002 | .4795099 | .4252538 | .4935728 | .4546139 | .4304825 | 5.755523 |
| Variance | 9.363614 | .2299298 | .1808408 | .2436141 | .2066738 | .1853151 | 33.12604 |
| Skewness | .2003234 | -.5925296 | 1.237816 | .3266212 | .9168961 | 1.18283 | 3.096199 |
| Kurtosis | 1.932389 | 1.351091 | 2.53219 | 1.106681 | 1.840698 | 2.399088 | 15.85446 |
| N | 2246 | 2246 | 2246 | 2246 | 2246 | 1878 | 2246 |
| Sum | 87938 | 1442 | 532 | 942 | 655 | 461 | 17444.57 |
| Range | 12 | 1 | 1 | 1 | 1 | 1 | 39.74164 |
| Min | 34 | 0 | 0 | 0 | 0 | 0 | 1.004952 |
| Max | 46 | 1 | 1 | 1 | 1 | 1 | 40.74659 |

Interpretation

• The above table displays the basic summary statistics (mean, median, standard deviation, variance, skewness, etc.) for the variables we listed in the command.
• For instance, the Mean value of 39.2 for "age" indicates that the average age of the individuals in the dataset is 39.2 years.
• The se(mean) row reports the standard error of the mean, an estimate of how much the sample mean is likely to vary from the true population mean. The se(mean) value of 0.0645679 for "age" is small relative to the mean of 39.2, which indicates that the sample mean is a precise estimate of the population mean.
• The p50 value of 39 for "age" indicates that the median age of the individuals in the dataset is 39. SD = 3.06 is the standard deviation of the age variable. Range = 12 is the difference between the maximum and minimum age of the individuals in the dataset.
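Two of these statistics follow directly from others in the table: the standard error of the mean is the standard deviation divided by the square root of the number of observations, and the range is the maximum minus the minimum. The sketch below recomputes both for age from the results stored by summarize, as a check on the tabstat output.

```stata
* Load the example data
sysuse nlsw88, clear

* summarize stores N, sd, min, and max in r()
quietly summarize age

* se(mean) = sd / sqrt(N)
display "se(mean) = " r(sd)/sqrt(r(N))

* range = max - min
display "range    = " r(max) - r(min)
```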

If you are interested in getting the above statistics by "race", add the option by(race) after the comma. For instance,

tabstat age married collgrad south c_city union wage, by(race) s(mean semean median sd var skew k count sum range min max)

Stata will report the same statistics separately for each race category, followed by the totals.

## 7. Customized Tables (table)

table is another useful command that helps us build tables in various dimensions and layouts. The syntax shown in this section, including the statistic() option, applies to Stata 17 and later.

One-way table

First, get the data by typing:

sysuse nlsw88, clear

To create a simple frequency table for a single categorical variable "race", type:

table race

Stata will give us the following table:

| Race | Frequency |
|---|---|
| White | 1,637 |
| Black | 583 |
| Other | 26 |
| Total | 2,246 |

Two-way table

To create a two-way contingency table examining the relationship between two categorical variables "collgrad" and "race", type:

table collgrad race

Stata will give us the following table:

| College graduate | White | Black | Other | Total |
|---|---|---|---|---|
| Not college grad | 1,217 | 480 | 17 | 1,714 |
| College grad | 420 | 103 | 9 | 532 |
| Total | 1,637 | 583 | 26 | 2,246 |

To create a table of means for the numeric variables "wage" and "hours" across the levels of the categorical variable "collgrad", type:

table collgrad, statistic(mean wage hours)

Stata will give us the following table:

| College graduate | Hourly wage | Usual hours worked |
|---|---|---|
| Not college grad | 6.910561 | 36.71888 |
| College grad | 10.52606 | 38.82674 |
| Total | 7.766949 | 37.21811 |

Multi-way table

To create a three-dimensional table for the three categorical variables "collgrad", "race", and "union", with "race" nested within "collgrad" in the rows and "union" in the columns, type:

table (collgrad race) (union)

Stata will give us the following table:

| College graduate | Race | Nonunion | Union | Total |
|---|---|---|---|---|
| Not college grad | White | 792 | 194 | 986 |
| | Black | 298 | 114 | 412 |
| | Other | 11 | 5 | 16 |
| | Total | 1,101 | 313 | 1,414 |
| College grad | White | 259 | 108 | 367 |
| | Black | 52 | 37 | 89 |
| | Other | 5 | 3 | 8 |
| | Total | 316 | 148 | 464 |
| Total | White | 1,051 | 302 | 1,353 |
| | Black | 350 | 151 | 501 |
| | Other | 16 | 8 | 24 |
| | Total | 1,417 | 461 | 1,878 |

To create a table containing mean values for several numerical variables ("age", "hours", "wage", and "tenure") across the levels of a categorical variable ("race"), type:

table race, statistic(mean age hours wage tenure)

Stata will give us the following table:

| Race | Age in current year | Usual hours worked | Hourly wage | Job tenure (years) |
|---|---|---|---|---|
| White | 39.27245 | 36.90398 | 8.082999 | 5.808236 |
| Black | 38.81132 | 38.12048 | 6.844558 | 6.501586 |
| Other | 39.30769 | 36.80769 | 8.550781 | 4.948718 |
| Total | 39.15316 | 37.21811 | 7.766949 | 5.97785 |

To create a table containing mean values for several numerical variables ("age", "hours", and "wage") across the levels of multiple categorical variables ("race", "collgrad"), type:

table race collgrad, statistic(mean age hours wage)

Stata will report the three means for each combination of race and college graduation status.

## Useful Resources

DSS Data Analysis Guides. Available at https://library.princeton.edu/dss/training

Introduction to econometrics / James H. Stock, Mark W. Watson. 4th ed., Boston: Pearson Addison Wesley, 2019.

Introductory econometrics: A modern approach / Jeffrey M. Wooldridge. 6th ed., Cengage learning, 2015.
