Research Guides: Exploring Data using Stata: Descriptive Statistics

1. Introduction

This guide discusses techniques for exploring data in Stata. Exploring a dataset usually means checking the format of its variables, summary statistics, frequencies, cross-tabulations, and so on. We provide a Stata command for each of these tasks. We use a dataset that ships with Stata throughout this guide; load it by typing the following command in the Stata command window:

sysuse nlsw88, clear

Note: To practice the commands on your own data, first open the data from your working directory. You can do this by pointing and clicking. For instance, to open a Stata dataset, which is stored as a .dta file, click on

File → Open → your .dta file
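The same thing can be done from the command window. The following is a minimal sketch; the folder path and the file name mydata.dta are hypothetical placeholders, so substitute your own:

```stata
* Change the working directory (hypothetical path; replace with your own)
cd "C:/Users/yourname/Documents/project"

* Load a Stata dataset from that folder (hypothetical file name);
* clear discards any data currently in memory
use "mydata.dta", clear
```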

2. Descriptive Statistics (describe)

Describing your data is a vital first step toward understanding its nature. The describe command provides a basic description of the dataset: the number of observations and variables, and each variable's name, storage type, display format, and labels.

We will describe the nlsw88 dataset that ships with Stata.

First, get the dataset by typing:

sysuse nlsw88, clear

Note: for your data, open it from your working directory by clicking File → Open → your .dta file

In the command window, type:

describe

Stata will give us the following description table.

Contains data from C:\Program Files\Stata18\ado\base/n/nlsw88.dta
 Observations:         2,246                  NLSW, 1988 extract
    Variables:            17                  1 May 2022 22:52
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
married         byte    %8.0g      marlbl     Married
never_married   byte    %16.0g     nev_mar    Never married
grade           byte    %8.0g                 Current grade completed
collgrad        byte    %16.0g     gradlbl    College graduate
south           byte    %9.0g      southlbl   Lives in the south
smsa            byte    %9.0g      smsalbl    Lives in SMSA
c_city          byte    %16.0g     ccitylbl   Lives in a central city
industry        byte    %23.0g     indlbl     Industry
occupation      byte    %22.0g     occlbl     Occupation
union           byte    %8.0g      unionlbl   Union worker
wage            float   %9.0g                 Hourly wage
hours           byte    %8.0g                 Usual hours worked
ttl_exp         float   %9.0g                 Total work experience (years)
tenure          float   %9.0g                 Job tenure (years)
-------------------------------------------------------------------------------
Sorted by: idcode

Interpretation

The above table provides a summary of the entire dataset. For instance, Variables: 17 indicates that there are 17 variables in the dataset.
Storage type tells us the data type of a variable. Variables in Stata can have storage types such as byte, int, long, float, double, and str. byte, int, and long store integers (whole numbers with no fractional component) and differ only in range: byte holds values from -127 to 100, so it suits categorical or ordinal variables with a small number of distinct values, while int holds values from -32,767 to 32,740. float stores floating-point numbers (numbers with decimals) with about 7 digits of accuracy. double also stores floating-point numbers, but with about 16 digits of accuracy. str indicates that the variable is stored as a string (text).
Display format controls how values appear when we view or print the data in Stata; it does not change the stored values. For instance, %9.0g for the wage variable means that "wage" values are displayed with a total width of 9 characters. The g indicates the general format, which lets Stata choose the number of decimal places so that values are shown in a compact and easy-to-read manner.
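As an illustration of display formats, the sketch below changes how wage is shown without changing the stored values. The choice of %9.2f (fixed format with two decimals) is only an example:

```stata
sysuse nlsw88, clear

* Show wage with exactly two decimal places (display only; data are unchanged)
format wage %9.2f
list wage in 1/5

* Confirm the new display format; the storage type is still float
describe wage
```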

3. Summary Statistics (summarize)

We will use the summarize command in Stata to get basic summary statistics of the variables. summarize can be abbreviated to su. It reports basic statistics and helps us understand the essential characteristics of the variables.

We will check the summary statistics for the nlsw88 dataset that ships with Stata.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

Note: for your data, open it from your working directory by clicking File → Open → your .dta file

In the command window, type:

su

Stata will give us the following summary table.

. su

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
idcode | 2,246 2612.654 1480.864 1 5159
age | 2,246 39.15316 3.060002 34 46
race | 2,246 1.282725 .4754413 1 3
married | 2,246 .6420303 .4795099 0 1
never_marr~d | 2,246 .1041852 .3055687 0 1
-------------+---------------------------------------------------------
grade | 2,244 13.09893 2.521246 0 18
collgrad | 2,246 .2368655 .4252538 0 1
south | 2,246 .4194123 .4935728 0 1
smsa | 2,246 .7039181 .4566292 0 1
c_city | 2,246 .2916296 .4546139 0 1
-------------+---------------------------------------------------------
industry | 2,232 8.189516 3.010875 1 12
occupation | 2,237 4.642825 3.408897 1 13
union | 1,878 .2454739 .4304825 0 1
wage | 2,246 7.766949 5.755523 1.004952 40.74659
hours | 2,242 37.21811 10.50914 1 80
-------------+---------------------------------------------------------
ttl_exp | 2,246 12.53498 4.610208 .1153846 28.88461
tenure | 2,231 5.97785 5.510331 0 25.91667

Interpretation

The above table reports the number of observations, mean, standard deviation, minimum, and maximum of each variable in the dataset.
Let's interpret the summary statistics of the second variable, age. The mean age of the individuals in the dataset is 39.2 and the standard deviation is 3.06. The minimum age is 34 and the maximum age is 46.
married looks like a dummy variable, as its minimum value is 0 and its maximum value is 1. If 1 is defined as married in the codebook, we can say that 64% of the people are married, because the mean of the variable is 0.6420303.
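summarize also stores its results so they can be reused. A small sketch, assuming (as above) that married is coded 1 for married, turns the mean into a percentage:

```stata
sysuse nlsw88, clear

* quietly suppresses the table; results are saved in r()
quietly summarize married

* r(mean) is the share of observations with married == 1
display "Percent married: " %4.1f 100*r(mean)
```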

The "summarize, detail" command is useful for getting a comprehensive overview of the statistical properties and distribution of each variable in the dataset. To get the detailed set of summary statistics for each variable, type:

su, detail

Stata will give us the following detailed summary statistics.

Note: we reported the detailed summary stat for the first three variables in the dataset below.

. su, detail

NLS ID
-------------------------------------------------------------
Percentiles Smallest
1% 45 1
5% 269 2
10% 521 3 Obs 2,246
25% 1366 4 Sum of wgt. 2,246

50% 2614 Mean 2612.654
Largest Std. dev. 1480.864
75% 3903 5154
90% 4651 5156 Variance 2192957
95% 4925 5157 Skewness -.0232308
99% 5104 5159 Kurtosis 1.828053

Age in current year
-------------------------------------------------------------
Percentiles Smallest
1% 34 34
5% 35 34
10% 35 34 Obs 2,246
25% 36 34 Sum of wgt. 2,246

50% 39 Mean 39.15316
Largest Std. dev. 3.060002
75% 42 45
90% 44 45 Variance 9.363614
95% 44 46 Skewness .2003234
99% 45 46 Kurtosis 1.932389

Race
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 1 1
10% 1 1 Obs 2,246
25% 1 1 Sum of wgt. 2,246

50% 1 Mean 1.282725
Largest Std. dev. .4754413
75% 2 3
90% 2 3 Variance .2260444
95% 2 3 Skewness 1.284394
99% 3 3 Kurtosis 3.409155

Interpretation

Let's interpret the detailed summary statistics for the age variable from the above table. The Percentiles column displays several percentiles (1%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, and 99%), which are points that divide the sorted data into the specified percentages. The 50% percentile is the median, the middle value of 'age' when the data are sorted, which is 39. Mean, Variance, and Std. dev. report the mean, variance, and standard deviation of the variable, which are 39.2, 9.4, and 3.1 respectively for 'age'. The Skewness value of .20 suggests that the data have a slight positive skew.
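The percentiles shown by the detail option are also stored after the command runs. The sketch below uses them to compute the interquartile range of age; computing the IQR is an illustration added here, not part of the output above:

```stata
sysuse nlsw88, clear
quietly summarize age, detail

* Median, and the difference between the 75th and 25th percentiles
display "Median: " r(p50)
display "IQR:    " r(p75) - r(p25)

* Skewness and kurtosis are stored too
display "Skewness: " r(skewness) "  Kurtosis: " r(kurtosis)
```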

We can get the summary statistics for a particular variable with a condition. For instance, if we want to get the summary statistics of the variable 'age' if the person is 'married', type

su age if married == 1

Stata will give us the following table.

. su age if married == 1

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 1,442 39.1165 3.066058 34 45

The above results indicate that the average age of the married people in the dataset is 39.12, and that the minimum age of a married person in the dataset is 34.
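Conditions can be combined with & (and) and | (or). One caveat worth keeping in mind: Stata treats missing numeric values as larger than any number, so a condition such as hours > 40 also selects observations where hours is missing unless they are excluded explicitly. The variable choices below are only illustrative:

```stata
sysuse nlsw88, clear

* Wage of union workers who live in the south
summarize wage if union == 1 & south == 1

* Wage of people working more than 40 hours, excluding missing hours
summarize wage if hours > 40 & !missing(hours)
```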

If we want the summary statistics for a set of variables, we type su followed by the names of the variables. For instance, to get the summary statistics for the variables 'age', 'race', 'occupation', 'union', and 'wage', type:

su age race occupation union wage

Stata will give us the following table:

. su age race occupation union wage

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 2,246 39.15316 3.060002 34 46
race | 2,246 1.282725 .4754413 1 3
occupation | 2,237 4.642825 3.408897 1 13
union | 1,878 .2454739 .4304825 0 1
wage | 2,246 7.766949 5.755523 1.004952 40.74659

If we want the summary statistics for a range of variables, such as 'idcode' to 'married', type:

su idcode-married

Stata will give us the following table

. su idcode-married

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
idcode | 2,246 2612.654 1480.864 1 5159
age | 2,246 39.15316 3.060002 34 46
race | 2,246 1.282725 .4754413 1 3
married | 2,246 .6420303 .4795099 0 1

Grouped summary statistics: We can get summary statistics separately for each group defined by a variable. For instance, if we want the summary statistics of the 'grade' and 'wage' variables for each group of the 'occupation' variable, type:

by occupation , sort: summarize grade wage

Stata will give us the following table: (note: we reported the partial table below)

. by occupation , sort: summarize grade wage

-----------------------------------------------------------------------------------------------------------
-> occupation = Professional/Technical

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
grade | 317 14.77603 2.132552 8 18
wage | 317 10.72362 6.351074 1.032247 40.19808

-----------------------------------------------------------------------------------------------------------
-> occupation = Managers/Admin

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
grade | 263 13.4943 2.283146 7 18
wage | 264 10.89978 7.521588 2.415459 40.19808

-----------------------------------------------------------------------------------------------------------
-> occupation = Sales

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
grade | 726 12.78237 1.777101 0 18
wage | 726 7.154489 5.042757 1.571983 40.74659
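The by prefix with the sort option can also be written as bysort, which sorts and groups in one step. This sketch should give the same grouped tables as above:

```stata
sysuse nlsw88, clear

* Equivalent to: by occupation, sort: summarize grade wage
bysort occupation: summarize grade wage
```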

4. Codebook of the Variables (codebook)

The codebook command in Stata is a valuable tool for getting detailed information about the variables in a dataset. It provides information on variable names, value labels, data types, summary statistics, and other relevant details. Use the following steps to apply the codebook command.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

To generate a codebook for a specific variable "wage," type:

codebook wage

Stata will give us the following outputs

. codebook wage

--------------------------------------------------------------------------------
wage Hourly wage
--------------------------------------------------------------------------------

Type: Numeric (float)

Range: [1.0049518,40.74659] Units: 1.000e-07
Unique values: 967 Missing .: 0/2,246

Mean: 7.76695
Std. dev.: 5.75552

Percentiles: 10% 25% 50% 75% 90%
3.22061 4.25926 6.27227 9.59742 12.7778

Notice that the above output gives us the basic summary statistics (e.g., mean, std. dev., range) of the variable "wage" along with the number of missing values and selected percentiles.

Conditional codebook: To generate a codebook for the variable "wage" if the person is from "south", type:

codebook wage if south == 1

Stata will give us the following outputs

. codebook wage if south == 1

--------------------------------------------------------------------------------
wage Hourly wage
--------------------------------------------------------------------------------

Type: Numeric (float)

Range: [1.1513683,40.74659] Units: 1.000e-07
Unique values: 535 Missing .: 0/942

Mean: 6.88748
Std. dev.: 5.28311

Percentiles: 10% 25% 50% 75% 90%
2.89855 3.94525 5.5475 8.05153 11.4171

Compact codebook: the compact option gives a one-line summary of each variable in the dataset. To get it, type:

codebook, compact

Stata will give us the following outputs:

. codebook, compact

Variable Obs Unique Mean Min Max Label
--------------------------------------------------------------------------------
idcode 2246 2246 2612.654 1 5159 NLS ID
age 2246 13 39.15316 34 46 Age in current year
race 2246 3 1.282725 1 3 Race
married 2246 2 .6420303 0 1 Married
never_marr~d 2246 2 .1041852 0 1 Never married
grade 2244 16 13.09893 0 18 Current grade completed
collgrad 2246 2 .2368655 0 1 College graduate
south 2246 2 .4194123 0 1 Lives in the south
smsa 2246 2 .7039181 0 1 Lives in SMSA
c_city 2246 2 .2916296 0 1 Lives in a central city
industry 2232 12 8.189516 1 12 Industry
occupation 2237 13 4.642825 1 13 Occupation
union 1878 2 .2454739 0 1 Union worker
wage 2246 967 7.766949 1.004952 40.74659 Hourly wage
hours 2242 62 37.21811 1 80 Usual hours worked
ttl_exp 2246 1546 12.53498 .1153846 28.88461 Total work experience (years)
tenure 2231 259 5.97785 0 25.91667 Job tenure (years)
--------------------------------------------------------------------------------
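The compact table shows that some variables have fewer than 2,246 observations, meaning they contain missing values. As a sketch of how one might follow up on this, misstable summarize lists only the variables with missing values, and codebook on a single variable reports its missing count:

```stata
sysuse nlsw88, clear

* Report the variables that contain missing values and how many
misstable summarize

* Look at one of them in detail (union has 1,878 nonmissing observations)
codebook union
```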

5. Frequency Tables (tab, tab1, crosstab)

We use the tab command (short for tabulate) to produce frequency tables in Stata. We will show different ways to create frequency tables.

First, get the dataset by typing the following command in the command window:

sysuse nlsw88, clear

To create a frequency table for a single categorical variable "married" type:

tab married

Stata will give us the following frequency table.

. tab married

Married | Freq. Percent Cum.
------------+-----------------------------------
Single | 804 35.80 35.80
Married | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00

Interpretation

The above table indicates that 804 people in the dataset are single, which is 35.80% of all observations. Similarly, 1,442 people in the dataset are married, which is 64.20% of all observations.

To get the numerical values of the above categorical variable rather than value labels, type:

tab married, nolabel

Stata will give us the following table:

. tab married, nolabel

Married | Freq. Percent Cum.
------------+-----------------------------------
0 | 804 35.80 35.80
1 | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00

In the above table, 0 indicates single and 1 indicates married.
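To see which numeric codes correspond to which labels, we can list the value label attached to the variable. The describe output earlier in this guide shows that married uses the value label marlbl:

```stata
sysuse nlsw88, clear

* Show the code-to-label mapping stored in the value label marlbl
label list marlbl
```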

To create a frequency table with a simple bar plot for a single categorical variable "occupation", sorted in descending order of frequency, type:

tab occupation, plot sort

Stata will give us the following output table:

. tab occupation , plot sort

Occupation | Freq.
-----------------------+------------+------------------------------------------
Sales | 726 |********************************
Professional/Technical | 317 |
Laborers | 286 |*
Managers/Admin | 264 |*
Operatives | 246 |**
Other | 187 |*
Clerical/Unskilled | 102 |
Craftsmen | 53 |*
Transport | 28 |**
Service | 16 |*
Farm laborers | 9 |*
Household workers | 2 |
Farmers | 1 |
-----------------------+------------+------------------------------------------
Total | 2,237

tab1: If we want frequency tables for more than one variable (e.g., "married", "south", and "race") at the same time, we run the following command:

tab1 married south race

Stata will give us the following frequency outputs.

. tab1 married south race

-> tabulation of married

Married | Freq. Percent Cum.
------------+-----------------------------------
Single | 804 35.80 35.80
Married | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00

-> tabulation of south

Lives in |
the south | Freq. Percent Cum.
------------+-----------------------------------
Not south | 1,304 58.06 58.06
South | 942 41.94 100.00
------------+-----------------------------------
Total | 2,246 100.00

-> tabulation of race

Race | Freq. Percent Cum.
------------+-----------------------------------
White | 1,637 72.89 72.89
Black | 583 25.96 98.84
Other | 26 1.16 100.00
------------+-----------------------------------
Total | 2,246 100.00

The above output provides a separate frequency table for each variable we listed in the command.
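By default, tab and tab1 leave out observations with missing values. A minimal sketch, using union (which has missing values in this dataset), adds the missing option to include them as their own row:

```stata
sysuse nlsw88, clear

* Include missing values as a separate category
tab union, missing
```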

Crosstab: Crosstabulation is useful if we want the joint distribution of two variables in a dataset. To get the crosstabulation of the categorical variables "race" and "collgrad", type:

tab collgrad race

Stata will give us the following table.

. tab collgrad race

| Race
College graduate | White Black Other | Total
-----------------+---------------------------------+----------
Not college grad | 1,217 480 17 | 1,714
College grad | 420 103 9 | 532
-----------------+---------------------------------+----------
Total | 1,637 583 26 | 2,246

The above table indicates that, out of a total of 532 college graduates, 420 are White, 103 are Black, and 9 are from other races.

If we want row percentages instead of counts, type:

tab collgrad race, row nofreq

Stata will give us the following table.

. tab collgrad race, row nofreq

| Race
College graduate | White Black Other | Total
-----------------+---------------------------------+----------
Not college grad | 71.00 28.00 0.99 | 100.00
College grad | 78.95 19.36 1.69 | 100.00
-----------------+---------------------------------+----------
Total | 72.89 25.96 1.16 | 100.00

From the above table we can say that 78.95% of all college graduates are White, 19.36% are Black, and 1.69% are from other races.

If we want row percentage in addition to the counts in the above table, type:

tab collgrad race, row

Stata will give us the following table.

. tab collgrad race, row

+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+

| Race
College graduate | White Black Other | Total
-----------------+---------------------------------+----------
Not college grad | 1,217 480 17 | 1,714
| 71.00 28.00 0.99 | 100.00
-----------------+---------------------------------+----------
College grad | 420 103 9 | 532
| 78.95 19.36 1.69 | 100.00
-----------------+---------------------------------+----------
Total | 1,637 583 26 | 2,246
| 72.89 25.96 1.16 | 100.00

Interpretation

The above table displays both counts and row percentages for the two categorical variables. For instance, the table shows that 1,217 of the non-college graduates are White, which is 71% of all non-college graduates.

Notice that the above table reports row percentages. If we want column percentages instead, type:

tab collgrad race, column

Stata will give us the following table:

. tab collgrad race, col

+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+

| Race
College graduate | White Black Other | Total
-----------------+---------------------------------+----------
Not college grad | 1,217 480 17 | 1,714
| 74.34 82.33 65.38 | 76.31
-----------------+---------------------------------+----------
College grad | 420 103 9 | 532
| 25.66 17.67 34.62 | 23.69
-----------------+---------------------------------+----------
Total | 1,637 583 26 | 2,246
| 100.00 100.00 100.00 | 100.00

Interpretation

The above table indicates that 74.34% of all White people are not college graduates, while 25.66% are college graduates. Similarly, 82.33% of all Black people are not college graduates, while 17.67% are college graduates.

If we want to get both column and row percentage in the same table, type:

tab collgrad race, col row

Stata will give us the following table:

. tab collgrad race, col row

+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
+-------------------+

| Race
College graduate | White Black Other | Total
-----------------+---------------------------------+----------
Not college grad | 1,217 480 17 | 1,714
| 71.00 28.00 0.99 | 100.00
| 74.34 82.33 65.38 | 76.31
-----------------+---------------------------------+----------
College grad | 420 103 9 | 532
| 78.95 19.36 1.69 | 100.00
| 25.66 17.67 34.62 | 23.69
-----------------+---------------------------------+----------
Total | 1,637 583 26 | 2,246
| 72.89 25.96 1.16 | 100.00
| 100.00 100.00 100.00 | 100.00

Interpretation

In the above table, the first numeric entry in each cell is the frequency, the second is the row percentage, and the third is the column percentage.
For example, 1,217 indicates that we have 1,217 White non-college graduates in the dataset. 71.00 indicates that 71% of all non-college graduates are White. 74.34 indicates that 74.34% of all White people in the dataset are not college graduates.
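The percentages can be checked by hand from the counts in the table, and the cell option gives each cell as a percentage of the grand total. The arithmetic below is only a check on the numbers reported above:

```stata
sysuse nlsw88, clear

* Row percentage: White non-graduates as a share of all non-graduates
display %5.2f 100*1217/1714

* Column percentage: White non-graduates as a share of all White people
display %5.2f 100*1217/1637

* Cell percentages: each cell as a share of all 2,246 observations
tab collgrad race, cell nofreq
```

The two display commands should reproduce the 71.00 and 74.34 shown in the table.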

6. Customized Tables (tabstat)

tabstat is another command that provides summary statistics. Let's see how we can use this command to explore a dataset using our nlsw88 data.

First, get the data by typing:

sysuse nlsw88, clear

To use tabstat, type the command name (tabstat) followed by the variable names and the statistics() option, abbreviated s(), listing the statistics we want. For instance, if we want the summary statistics for a list of variables - "age," "married," "collgrad," "south," "c_city," "union," and "wage" - type:

tabstat age married collgrad south c_city union wage, s(mean semean median sd var skew k count sum range min max)

Stata will give us the following table:

Stats | age married collgrad south c_city union wage
---------+----------------------------------------------------------------------
Mean | 39.15316 .6420303 .2368655 .4194123 .2916296 .2454739 7.766949
se(mean) | .0645679 .010118 .0089731 .0104147 .0095926 .0099336 .1214451
p50 | 39 1 0 0 0 0 6.27227
SD | 3.060002 .4795099 .4252538 .4935728 .4546139 .4304825 5.755523
Variance | 9.363614 .2299298 .1808408 .2436141 .2066738 .1853151 33.12604
Skewness | .2003234 -.5925296 1.237816 .3266212 .9168961 1.18283 3.096199
Kurtosis | 1.932389 1.351091 2.53219 1.106681 1.840698 2.399088 15.85446
N | 2246 2246 2246 2246 2246 1878 2246
Sum | 87938 1442 532 942 655 461 17444.57
Range | 12 1 1 1 1 1 39.74164
Min | 34 0 0 0 0 0 1.004952
Max | 46 1 1 1 1 1 40.74659
--------------------------------------------------------------------------------

Interpretation

The above table displays the basic summary statistics (mean, median, standard deviation, variance, skewness, etc.) for the variables we listed in the command.
For instance, the Mean value for "age" indicates that the average age of the individuals in the dataset is about 39.2 years.
The se(mean) row reports the standard error of the mean, an estimate of how much the sample mean is likely to vary from the true population mean. The se(mean) value of 0.0645679 for "age" is small relative to the mean, which indicates that the sample mean of age is estimated precisely.
The p50 value of 39 for "age" indicates that the median age of the individuals in the dataset is 39. SD = 3.06 is the standard deviation of the age variable. Range = 12 is the difference between the maximum and minimum age in the dataset.
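The standard error of the mean is the standard deviation divided by the square root of the number of observations. The sketch below recomputes it for age from the stored results of summarize:

```stata
sysuse nlsw88, clear
quietly summarize age

* se(mean) = SD / sqrt(N)
display r(sd)/sqrt(r(N))
```

The result should agree with the se(mean) row for age in the table above (about 0.0646).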

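If you are interested in getting the above statistics by "race", add the option by(race) after the comma. For instance,

tabstat age married collgrad south c_city union wage, by(race) s(mean semean median sd var skew k count sum range min max)

Stata will give us the following table. Within each group, the rows follow the order in which the statistics were requested: mean, standard error of the mean, median, standard deviation, variance, skewness, kurtosis, count, sum, range, minimum, and maximum.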

race | age married collgrad south c_city union wage
-------+----------------------------------------------------------------------
White | 39.27245 .7025046 .2565669 .3457544 .2211362 .2232077 8.082999
| .0760678 .0113025 .0107977 .0117588 .0102605 .0113245 .1471846
| 39 1 0 0 0 0 6.545891
| 3.077691 .457296 .4368717 .4757589 .4151389 .4165504 5.955069
| 9.472181 .2091196 .1908569 .2263466 .1723403 .1735143 35.46285
| .1514898 -.8859315 1.114778 .6486171 1.343883 1.329465 3.00474
| 1.919862 1.784875 2.24273 1.420704 2.806021 2.767478 14.74577
| 1637 1637 1637 1637 1637 1353 1637
| 64289 1150 420 566 362 302 13231.87
| 12 1 1 1 1 1 39.19313
| 34 0 0 0 0 0 1.004952
| 46 1 1 1 1 1 40.19808
-------+----------------------------------------------------------------------
Black | 38.81132 .4699828 .1766724 .6397942 .490566 .3013972 6.844558
| .1234292 .0206883 .0158092 .0198991 .020722 .0205211 .2102342
| 38 0 0 1 0 0 5.434783
| 2.980246 .4995268 .3817187 .4804722 .5003403 .4593235 5.076187
| 8.881865 .249527 .1457092 .2308536 .2503404 .210978 25.76767
| .3449691 .1202856 1.695517 -.5824029 .0377426 .8656266 3.516731
| 2.029956 1.014469 3.874778 1.339193 1.001425 1.749309 21.15914
| 583 583 583 583 583 501 583
| 22627 274 103 373 286 151 3990.377
| 11 1 1 1 1 1 39.59522
| 34 0 0 0 0 0 1.151368
| 45 1 1 1 1 1 40.74659
-------+----------------------------------------------------------------------
Other | 39.30769 .6923077 .3461538 .1153846 .2692308 .3333333 8.550781
| .6367447 .0923077 .0951486 .0638971 .088712 .0982946 1.021653
| 39 1 0 0 0 0 7.560383
| 3.246774 .4706787 .4851645 .3258126 .4523443 .4815434 5.20943
| 10.54154 .2215385 .2353846 .1061538 .2046154 .2318841 27.13816
| .0047392 -.8333333 .6467617 2.407717 1.040532 .7071068 1.428553
| 1.622899 1.694444 1.418301 6.797101 2.082707 1.5 5.799663
| 26 26 26 26 26 24 26
| 1022 18 9 3 7 8 222.3203
| 10 1 1 1 1 1 23.99913
| 34 0 0 0 0 0 1.80602
| 44 1 1 1 1 1 25.80515
-------+----------------------------------------------------------------------
Total | 39.15316 .6420303 .2368655 .4194123 .2916296 .2454739 7.766949
| .0645679 .010118 .0089731 .0104147 .0095926 .0099336 .1214451
| 39 1 0 0 0 0 6.27227
| 3.060002 .4795099 .4252538 .4935728 .4546139 .4304825 5.755523
| 9.363614 .2299298 .1808408 .2436141 .2066738 .1853151 33.12604
| .2003234 -.5925296 1.237816 .3266212 .9168961 1.18283 3.096199
| 1.932389 1.351091 2.53219 1.106681 1.840698 2.399088 15.85446
| 2246 2246 2246 2246 2246 1878 2246
| 87938 1442 532 942 655 461 17444.57
| 12 1 1 1 1 1 39.74164
| 34 0 0 0 0 0 1.004952
| 46 1 1 1 1 1 40.74659
------------------------------------------------------------------------------
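With many statistics the grouped table is hard to read because the rows are not labeled. Two tabstat options can help; the shorter list of statistics here is chosen only to keep the example small:

```stata
sysuse nlsw88, clear

* longstub widens the left stub so each row shows the statistic name
tabstat age wage, by(race) s(mean sd count) longstub

* columns(statistics) puts the statistics in columns, one row per group
tabstat wage, by(race) s(mean sd count) columns(statistics)
```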

For more help on tabstat, type help tabstat in the command window.

7. Customized Tables (table)

table is another useful command that helps us build tables in various dimensions and layouts.

One-way table

First, get the data by typing:

sysuse nlsw88, clear

To create a simple frequency table for a single categorical variable "race", type:

table race

Stata will give us the following table:

. table race

--------------------
| Frequency
--------+-----------
Race |
White | 1,637
Black | 583
Other | 26
Total | 2,246
--------------------

Two-way table

To create a two-way contingency table examining the relationship between two categorical variables "collgrad" and "race", type:

table collgrad race

Stata will give us the following table:

. table collgrad race

---------------------------------------------------
| Race
| White Black Other Total
-------------------+-------------------------------
College graduate |
Not college grad | 1,217 480 17 1,714
College grad | 420 103 9 532
Total | 1,637 583 26 2,246
---------------------------------------------------

To create a table of means for numeric variables "wage" and "hours" across the levels of categorical variable "collgrad", type:

table collgrad, statistic(mean wage hours)

Stata will give us the following table:

------------------------------------------------------
| Hourly wage Usual hours worked
-------------------+----------------------------------
College graduate |
Not college grad | 6.910561 36.71888
College grad | 10.52606 38.82674
Total | 7.766949 37.21811
------------------------------------------------------
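The statistic() option can be repeated to show several statistics in one table. The following sketch assumes Stata 17 or later, where this table syntax is available; the nformat() option controls the number of decimals displayed:

```stata
sysuse nlsw88, clear

* Count, mean, and standard deviation of wage by college graduation
table collgrad, statistic(frequency) statistic(mean wage) statistic(sd wage) nformat(%9.2f mean sd)
```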

Multi-way table

To create a three-way table for three categorical variables "collgrad", "race", and "union", type:

table collgrad race union

Stata will give us the following table:

----------------------------------------------
| Union worker
| Nonunion Union Total
-------------------+--------------------------
College graduate |
Not college grad |
Race |
White | 792 194 986
Black | 298 114 412
Other | 11 5 16
Total | 1,101 313 1,414
College grad |
Race |
White | 259 108 367
Black | 52 37 89
Other | 5 3 8
Total | 316 148 464
Total |
Race |
White | 1,051 302 1,353
Black | 350 151 501
Other | 16 8 24
Total | 1,417 461 1,878
----------------------------------------------

To create a table containing mean values for several numerical variables ("age", "hours", "wage", and "tenure") by the levels of a categorical variable ("race"), type:

table race, statistic(mean age hours wage tenure)

Stata will give us the following table:

--------------------------------------------------------------------------------------
| Age in current year Usual hours worked Hourly wage Job tenure (years)
--------+-----------------------------------------------------------------------------
Race |
White | 39.27245 36.90398 8.082999 5.808236
Black | 38.81132 38.12048 6.844558 6.501586
Other | 39.30769 36.80769 8.550781 4.948718
Total | 39.15316 37.21811 7.766949 5.97785
--------------------------------------------------------------------------------------

To create a table containing mean values for various numerical variables ("age", "hours", and "wage") with respect to multiple categorical variables ("race", "collgrad"), type:

table race collgrad , statistic(mean age hours wage)

Stata will give us the following table:

---------------------------------------------------------------------
| College graduate
| Not college grad College grad Total
------------------------+--------------------------------------------
Race |
White |
Age in current year | 39.27034 39.27857 39.27245
Usual hours worked | 36.3347 38.55609 36.90398
Hourly wage | 7.318251 10.29895 8.082999
Black |
Age in current year | 38.90417 38.37864 38.81132
Usual hours worked | 37.77406 39.72816 38.12048
Hourly wage | 5.875918 11.35861 6.844558
Other |
Age in current year | 39.05882 39.77778 39.30769
Usual hours worked | 34.52941 41.11111 36.80769
Hourly wage | 6.938195 11.59678 8.550781
Total |
Age in current year | 39.16569 39.11278 39.15316
Usual hours worked | 36.71888 38.82674 37.21811
Hourly wage | 6.910561 10.52606 7.766949
---------------------------------------------------------------------
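Row and column variables can also be given explicitly in parentheses, which makes the layout easier to control. A sketch under the same Stata 17 or later assumption, with nototals used to drop the Total rows and columns:

```stata
sysuse nlsw88, clear

* race in rows, collgrad in columns, mean wage in the cells
table (race) (collgrad), statistic(mean wage) nototals
```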

Useful Resources

DSS Data Analysis Guides. Available at https://libguides.princeton.edu/c.php?g=1415215

Introduction to econometrics / James H. Stock, Mark W. Watson. 4th ed., Boston: Pearson Addison Wesley, 2019.

Introductory econometrics: A modern approach / Jeffrey M. Wooldridge. 6th ed., Cengage learning, 2015.

Data Consultant

Muhammad Al Amin

He/Him/His

Contact:

Firestone Library, A-12-F.1

609-258-6051

Data Consultant

Yufei Qin

Contact:

Firestone Library, A.12F.2

609-258-2519
