This guide discusses techniques to explore data using Stata. To explore data, we usually need to know the format of the variables, summary statistics, cross-tabulations, frequencies, and so on. We will provide Stata commands for each of these tasks. We will use a dataset built into Stata throughout this guide, which we can load by typing the following command in the Stata command window:
sysuse nlsw88, clear
Note: To practice the commands using your data, you have to open your data from your working directory. You can do it using the point and click technique. For instance, to open a Stata dataset, which is stored as a .dta file, click on
File → Open → your .dta file
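The same step can also be done from the command window with the use command. The lines below are a minimal sketch: the file name my_nlsw88.dta is hypothetical, and the example first saves a copy of the built-in data under that name so that there is a file to open. It assumes you can write to the current working directory.

```stata
* Show the current working directory
pwd

* Load the built-in data and save a copy under a hypothetical file name
sysuse nlsw88, clear
save "my_nlsw88.dta", replace

* Re-open the saved copy; clear drops whatever data is in memory first
use "my_nlsw88.dta", clear
```

For your own data, replace the file name with the path to your .dta file.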
A basic description of your data is vital to understanding its nature. The describe command reports the number of observations and variables in the dataset and, for each variable, its name, storage type, format ("display format"), value label, and variable label.
We will describe the nlsw88 dataset that ships with Stata.
First, get the dataset by typing:
sysuse nlsw88, clear
Note: for your data, open it from your working directory by clicking File → Open → your .dta file
In the command window, type:
describe
Stata will give us the following description table.
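describe also accepts a variable list and a few options. The following is a small sketch, using the same nlsw88 data, of describing only selected variables and of the short option, which prints just the dataset-level summary.

```stata
sysuse nlsw88, clear

* Describe only selected variables
describe age wage occupation

* Dataset-level overview only: observations, variables, and sort order
describe, short
```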
Interpretation
We will use the summarize command in Stata to get the basic summary statistics of the variables. You can abbreviate summarize to su. The command reports the number of observations, mean, standard deviation, minimum, and maximum of each variable, which helps us understand the essential characteristics of the variables.
We will check the summary statistics for the nlsw88 dataset that ships with Stata.
First, get the dataset by typing the following command in the command window:
sysuse nlsw88, clear
Note: for your data, open it from your working directory by clicking File → Open → your .dta file
In the command window, type:
su
Stata will give us the following summary table.
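After summarize runs, Stata keeps the results in memory as stored results such as r(mean) and r(sd), so they can be reused in later commands. A minimal sketch for the variable wage (the choice of variable here is only for illustration):

```stata
sysuse nlsw88, clear

* Summarize one variable
summarize wage

* List everything summarize stored in r()
return list

* Reuse individual stored results
display "Mean wage: " %6.2f r(mean)
display "Std. dev.: " %6.2f r(sd)
```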
Interpretation
The "summarize, detail" command is useful for getting a comprehensive overview of the statistical properties and distribution of each variable in the dataset. In addition to the basic statistics, it reports percentiles, the smallest and largest values, variance, skewness, and kurtosis. To get the detailed set of summary statistics for each variable in the dataset, type:
su, detail
Stata will give us the following detailed summary statistics.
Note: below we report the detailed summary statistics for only the first three variables in the dataset.
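With the detail option, the percentiles and shape statistics are also available as stored results. The sketch below, again using wage only as an example, pulls out the median, the skewness, and the interquartile range.

```stata
sysuse nlsw88, clear

* Detailed summary for a single variable
summarize wage, detail

* Percentiles and shape statistics are stored in r()
display "Median:   " %6.2f r(p50)
display "Skewness: " %6.2f r(skewness)
display "IQR:      " %6.2f r(p75) - r(p25)
```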
Interpretation
We can get the summary statistics for a particular variable subject to a condition. For instance, to get the summary statistics of the variable 'age' for people who are married, type:
su age if married == 1
Stata will give us the following table.
The above results indicate that the average age of married people in the dataset is 39.12, and the minimum age of a married person in the dataset is 34.
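Conditions can be combined with & (and) and | (or). One point worth illustrating: Stata treats a missing numeric value as larger than any number, so a condition such as tenure > 10 also selects observations where tenure is missing unless they are excluded explicitly. The variables used below are chosen only for illustration.

```stata
sysuse nlsw88, clear

* Both conditions must hold
summarize age if married == 1 & collgrad == 1

* Either condition may hold
summarize wage if south == 1 | c_city == 1

* Exclude missing values of tenure explicitly
summarize wage if tenure > 10 & !missing(tenure)
```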
If we want the summary statistics for a set of variables, we type su followed by the names of the variables. For instance, to get the summary statistics for the variables 'age', 'race', 'occupation', 'union', and 'wage', type:
su age race occupation union wage
Stata will give us the following table:
If we want the summary statistics for a range of variables, such as 'idcode' to 'married', type:
su idcode-married
Stata will give us the following table
Grouped summary statistics: We can get the summary statistics separately for each group of a variable. For instance, if we want the summary statistics of the 'grade' and 'wage' variables for each group of the 'occupation' variable, type:
by occupation, sort: summarize grade wage
Stata will give us the following table: (note: we reported the partial table below)
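The by prefix with the sort option can also be written as bysort, or as an explicit sort followed by by. The sketch below shows these equivalent forms, using race as the grouping variable only to keep the output short.

```stata
sysuse nlsw88, clear

* Shorthand for "by race, sort:"
bysort race: summarize grade wage

* The same thing in two steps: sort first, then use by
sort race
by race: summarize grade wage
```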
The codebook command in Stata is a valuable tool for getting detailed information about the variables in a dataset. It provides information on variable names, value labels, data types, summary statistics, and other relevant details. Use the following steps to apply the codebook command.
First, get the dataset by typing the following command in the command window:
sysuse nlsw88, clear
To generate a codebook for a specific variable "wage," type:
codebook wage
Stata will give us the following outputs
Notice that the above table gives us the basic summary statistics (e.g., mean, std. dev., range) of the variable "wage" along with the number of missing values and percentiles.
Conditional codebook: To generate a codebook for the variable "wage" for people who live in the South (south == 1), type:
codebook wage if south == 1
Stata will give us the following outputs
Compact codebook: the compact option gives a one-line summary for each variable in the dataset. To get it, type:
codebook, compact
Stata will give us the following outputs:
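codebook can also be applied to a list of variables, and for a labeled categorical variable it shows the value labels with their frequencies. The following sketch uses variables from nlsw88 picked for illustration; the problems option asks codebook to flag potential issues in the dataset.

```stata
sysuse nlsw88, clear

* Compact one-line summaries for selected variables
codebook age wage union, compact

* Labeled categorical variable: values, labels, and frequencies
codebook race

* Report potential problems in the dataset
codebook, problems
```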
We use the tab command (short for tabulate) to produce frequency tables in Stata. We will show different ways to create frequency tables in Stata.
First, get the dataset by typing the following command in the command window:
sysuse nlsw88, clear
To create a frequency table for a single categorical variable "married", type:
tab married
Stata will give us the following frequency table.
Interpretation
To get the numerical values of the above categorical variable rather than value labels, type:
tab married, nolabel
Stata will give us the following table:
In the above table, 0 indicates single and 1 indicates married.
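By default, tab leaves observations with missing values out of the frequency table. As a small sketch, the missing option adds them as a separate category; union is used here on the assumption that it contains some missing values.

```stata
sysuse nlsw88, clear

* Missing values are excluded by default
tab union

* Count missing values as their own category
tab union, missing
```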
To create a frequency table with a one-way bar plot for a single categorical variable "occupation", sorted by frequency, type:
tab occupation, plot sort
Stata will give us the following output table:
tab1: If we want frequency tables for more than one variable (e.g., "married", "south", and "race") at the same time, we run the following command:
tab1 married south race
Stata will give us the following frequency outputs.
The above output provides a separate frequency table for each of the variables we listed in the command.
Crosstab: Cross-tabulation is useful if we want the joint distribution of two variables in a dataset. To get the cross-tabulation of the categorical variables "race" and "collgrad", type:
tab collgrad race
Stata will give us the following table.
The above table indicates that out of a total of 532 college graduates, 420 are White, 103 are Black, and 9 are from other races.
If we want row percentages (the racial composition within each level of "collgrad") instead of counts, type:
tab collgrad race, row nofreq
Stata will give us the following table.
From the above table we can say that 78.95% of all college graduates are White, 19.36% are Black, and 1.69% are from other races.
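A row percentage is simply the cell count divided by the row total. The sketch below recomputes the share of White respondents among college graduates from counts. It assumes race == 1 is the code for White in nlsw88; label list can be used to confirm the coding.

```stata
sysuse nlsw88, clear

* Check how race is coded
label list

* Row total: all college graduates
count if collgrad == 1
local grads = r(N)

* Cell count: college graduates with race == 1 (assumed to be White)
count if collgrad == 1 & race == 1

* Row percentage = 100 * cell count / row total
display %5.2f 100 * r(N) / `grads'
```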
If we want row percentage in addition to the counts in the above table, type:
tab collgrad race, row
Stata will give us the following table.
Interpretation
Notice that the above table reports row percentages. If we want column percentages instead, type:
tab collgrad race, column
Stata will give us the following table:
Interpretation
If we want both column and row percentages in the same table, type:
tab collgrad race, col row
Stata will give us the following table:
Interpretation
tabstat is another command that provides summary statistics. Let's see how we can use this command to explore a dataset using our nlsw88 data.
First, get the data by typing:
sysuse nlsw88, clear
To use tabstat, type the command name (tabstat) followed by the variable names and the statistics() option, abbreviated s(), listing the statistics we want. For instance, if we want the summary statistics for a list of variables ("age", "married", "collgrad", "south", "c_city", "union", and "wage"), type:
tabstat age married collgrad south c_city union wage, s(mean semean median sd var skew k count sum range min max)
Stata will give us the following table:
Interpretation
If you are interested in getting the above statistics by "race", just add the option by(race) after the comma. For instance:
tabstat age married collgrad south c_city union wage, by(race) s(mean semean median sd var skew k count sum range min max)
Stata will give us the following table:
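When many statistics are requested, the output can be easier to read with the statistics in columns and a fixed number format. The following is a sketch with a shorter list of variables and statistics, chosen only for illustration, using the columns() and format() options of tabstat.

```stata
sysuse nlsw88, clear

* Statistics as columns, variables as rows, two decimal places
tabstat age wage hours, s(mean sd min max) columns(statistics) format(%9.2f)

* The same layout, separately for each race group
tabstat age wage hours, by(race) s(mean sd min max) columns(statistics) format(%9.2f)
```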
For more help on tabstat, type help tabstat in the command window.
table is another useful command that helps us build tables of frequencies and summary statistics in one, two, or more dimensions.
One-way table
First, get the data by typing:
sysuse nlsw88, clear
To create a simple frequency table for a single categorical variable "race", type:
table race
Stata will give us the following table:
Two-way table
To create a two-way contingency table examining the relationship between two categorical variables "collgrad" and "race", type:
table collgrad race
Stata will give us the following table:
To create a table of means for numeric variables "wage" and "hours" across the levels of categorical variable "collgrad", type:
table collgrad, statistic(mean wage hours)
Stata will give us the following table:
Multi-way table
To create a three-way table for three categorical variables "collgrad", "race", and "union", type:
table collgrad race union
Stata will give us the following table:
To create a table containing mean values for several numerical variables ("age", "hours", "wage", and "tenure") across the levels of a categorical variable ("race"), type:
table race, statistic(mean age hours wage tenure)
Stata will give us the following table:
To create a table containing mean values for several numerical variables ("age", "hours", and "wage") across the levels of multiple categorical variables ("race" and "collgrad"), type:
table race collgrad, statistic(mean age hours wage)
Stata will give us the following table:
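Several statistic() options can be combined in one table. The sketch below assumes Stata 17 or newer, where table accepts the statistic() and nformat() options; the combination of frequency, mean, and standard deviation of "wage" is chosen only as an illustration.

```stata
sysuse nlsw88, clear

* Frequency, mean, and standard deviation of wage by race,
* with the mean and standard deviation shown to two decimals
table race, statistic(frequency) statistic(mean wage) statistic(sd wage) nformat(%9.2f mean sd)
```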
If you have questions or comments about this guide or method, please email data@Princeton.edu.