This guide provides instructions to generate basic figures/graphs using Stata that are useful for exploratory data analysis.
You can type commands in the Stata Command window or run them from a do-file.
If you use a do-file, set your working directory by typing the following:
cd "C:\YourDirectoryPath"
After setting the working directory, open a do-file by clicking the "New Do-file Editor" icon in the Stata window.
Save the do-file for later use.
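As a rough sketch, a do-file for this guide might start like the one below. The directory path and the log file name (graphs_session.log) are placeholders, not part of the original instructions; change them to suit your setup.
* Minimal do-file skeleton (path and log name are placeholders)
clear all
cd "C:\YourDirectoryPath"
capture log close
log using "graphs_session.log", replace text
* ... commands from this guide go here ...
log close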
First, open a Stata data file by typing the following command:
use "https://dss.princeton.edu/training/wdipol.dta", clear
Inspect the data to get a sense of its structure and contents. Type:
browse
describe
summarize
Generate a line graph for the variables unemp, unempf, and unempm (total, female, and male unemployment rates) for the United States. Type:
line unemp unempf unempm year if country=="United States"
Stata will give us the following line graph:
The graph does not look good. Let's check the variables more carefully. Type:
summarize unemp unempf unempm
We see that the minimum of each variable is 0. An unemployment rate of exactly 0% is implausible, so these zeros are most likely missing values coded as 0. Let us recode them as missing to get a cleaner graph. Type:
replace unemp=. if unemp==0
replace unempf=. if unempf==0
replace unempm=. if unempm==0
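The three replace commands above can also be written more compactly. The sketch below, offered as one possible alternative, first counts the zeros in each variable with a foreach loop and then uses mvdecode to turn them into missing values. If the zeros have already been removed, the counts will simply be 0 and mvdecode will change nothing.
* Count how many zeros each variable contains
foreach v of varlist unemp unempf unempm {
    count if `v' == 0
}
* Recode 0 to missing (.) in all three variables at once
mvdecode unemp unempf unempm, mv(0)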
Check the summary statistics again. Type:
summarize unemp unempf unempm
Let us generate the line graph again using the cleaner data. Type:
line unemp unempf unempm year if country=="United States"
Stata now gives us a better-looking graph.
Let us add a graph title, legend labels, line patterns, and a y-axis title to make the graph more readable. Type the following commands:
line unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
lpattern(solid dash dash_dot) ///
ytitle("Percentage")
Stata will give us the following graph.
We can also draw the same graph with marker symbols (circle, diamond, square, etc.) on the connected lines. Type:
twoway connected unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
msymbol(circle diamond square) ///
ytitle("Percentage")
Stata will give us the following graph.
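To reuse a graph outside Stata, you can save or export the graph currently displayed in the Graph window. The file names below are hypothetical, and the files are written to the current working directory.
* Save the current graph in Stata's own format (it can be reopened with graph use)
graph save "unemp_us.gph", replace
* Export the current graph as a PNG image
graph export "unemp_us.png", replace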
Line Graphs by Country Names
We can use Stata's twoway connected command to create separate line graphs for a selected set of countries. To do this, select the countries with an if condition and add the by(country) option. Use the following commands:
twoway connected unemp year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("Unemployment Rate")) ///
msymbol(circle_hollow)
Stata will give us the following graph.
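The chain of country=="..." conditions can be shortened with the inlist() function, which is true when the first argument equals any of the others. The following is a sketch of the same panel graph written that way; note that inlist() accepts only a limited number of string arguments, so very long country lists still need another approach.
* Same panel graph, with the country filter written using inlist()
twoway connected unemp year ///
if inlist(country, "United States", "United Kingdom", "Australia", "Qatar"), ///
by(country, title("Unemployment Rate")) ///
msymbol(circle_hollow)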
We can also plot the lines for all four countries in a single graph. Type the following commands:
twoway (connected unemp year if country=="United States", msymbol(diamond_hollow)) ///
(connected unemp year if country=="United Kingdom", msymbol(triangle_hollow)) ///
(connected unemp year if country=="Australia", msymbol(square_hollow)) ///
(connected unemp year if country=="Qatar", msymbol(circle_hollow)), ///
title("Unemployment Rate") ///
legend(label(1 "USA") label(2 "UK") label(3 "Australia") label(4 "Qatar"))
Stata will give us the following graph.
Let us now generate a panel of line graphs for the variable gdppc, restricted to observations with per capita GDP above 40,000. Type:
twoway connected gdppc year if gdppc>40000, by(country) msymbol(diamond)
Stata will give us the following graph.
We can plot more than one line in each panel of a panel graph. Let us create two new variables, gdppc_mean and gdppc_median, which hold the mean and median of gdppc across all countries in each year. Type:
bysort year: egen gdppc_mean=mean(gdppc)
bysort year: egen gdppc_median=median(gdppc)
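Because gdppc_mean and gdppc_median are computed within each year, they take the same value for every country in a given year. A quick way to check this is to list one row per year. The helper variable name year_tag is made up for this sketch and is dropped at the end.
* Mark one observation per year
egen year_tag = tag(year)
* Show the yearly cross-country mean and median of gdppc
sort year
list year gdppc_mean gdppc_median if year_tag, noobs
drop year_tag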
Let us now generate line graphs comparing each selected country's gdppc with the yearly cross-country mean, gdppc_mean. Type:
twoway connected gdppc gdppc_mean year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("GDP pc (PPP, 2005=100)")) ///
legend(label(1 "GDP-PC") label(2 "Mean GDP-PC")) ///
msymbol(circle_hollow)
Stata will give us the following graph.
Line Graphs by Country Names in Panel Data Setting
To declare the dataset as panel data, type:
xtset country year
Running this command produces an error message because country is a string variable. To create a numeric version of country with value labels, type:
encode country, gen(country1)
Now declare the dataset as panel data using the new numeric variable. Type:
xtset country1 year
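To confirm that the panel declaration worked, you can display the current settings and a summary of the panel structure. This is an optional check, not a required step.
* Show the panel and time variables currently declared
xtset
* Summarize the participation pattern of the panels over time
xtdescribe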
Now, let us create a line graph of per capita GDP for the country-years in which it exceeds $35,000. Type:
xtline gdppc if gdppc>35000, overlay ///
title("Per Capita GDP for the Richest Countries")
Stata will give us the following graph.
NOTE: To see the available marker symbols, line patterns, and colors, type:
palette symbolpalette
palette linepalette
palette color green
help palette
This section describes how to generate bar graphs.
First, get the data. Type:
use "https://dss.princeton.edu/training/wdipol.dta", clear
Let us create a horizontal bar graph for the variable gdppc for each country in the dataset. Type:
graph hbar (mean) gdppc, over(country, sort(1) descending label(labsize(*0.50)))
Stata will give us the following graph.
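Country names in the graph are hard to read. To make the graph clearer, we can restrict it to observations with per capita GDP greater than $18,000. Note that the if condition filters country-year observations, so each bar shows a country's mean over the years in which its gdppc exceeded 18,000. To do this, type: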
graph hbar (mean) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.7))) ///
bar(1, color(ebblue))
Stata will give us the following graph.
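The if condition above filters individual observations (country-years), not countries. If you would rather select countries by their mean over all years, one possible approach is sketched below. The variable name gdppc_cmean is made up for this example. Because Stata treats missing values as larger than any number, the condition also excludes missing means explicitly.
* Mean per capita GDP of each country over all years
bysort country: egen gdppc_cmean = mean(gdppc)
* Keep only countries whose overall mean exceeds 18,000
graph hbar (mean) gdppc if gdppc_cmean>18000 & !missing(gdppc_cmean), ///
over(country, sort(1) descending label(labsize(*0.7))) ///
bar(1, color(ebblue))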
For the observations with per capita GDP less than $1,500, type:
graph hbar (mean) gdppc if gdppc<1500, ///
over(country, sort(1) descending label(labsize(*0.6))) ///
bar(1, color(ebblue))
Stata will give us the following graph.
We can compare the mean per capita GDP with the median per capita GDP of each country, again using the observations with gdppc>18000. Type:
graph hbar (mean) gdppc (median) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.8))) ///
legend(label(1 "GDPpc (mean)") label(2 "GDPpc (median)")) ///
bar(1, color(blue)) ///
bar(2, color(brown))
Stata will give us the following graph.
For more information about bar graphs, type:
help graph bar
A boxplot is a valuable tool for detecting outliers in a dataset. This section provides instructions on creating basic boxplots using Stata.
First, open a Stata data file. Type:
use "https://dss.princeton.edu/training/wdipol.dta", clear
Let us create a basic boxplot for the variable gdppc. Type:
graph hbox gdppc
Stata will give us the following graph.
In the above graph, we see many outliers where gdppc is greater than 40,000. We can restrict the plot to values below 40,000 to get a better view of the median, quartiles, and range. To do this, type:
graph hbox gdppc if gdppc <40000
Stata will give us the following graph.
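The quantities that the boxplot displays can also be reported numerically. As a cross-check, the following commands print the extremes and quartiles for the same restricted sample:
* Minimum, quartiles, and maximum of gdppc below 40,000
tabstat gdppc if gdppc<40000, statistics(min p25 p50 p75 max)
* Fuller summary, including percentiles and the smallest and largest values
summarize gdppc if gdppc<40000, detail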
We will now create a boxplot for the variable gdppc with respect to a categorical variable. Let us recode the polity2 variable into a categorical variable, regime. Currently, polity2 ranges from -10 to 10. We will create the regime variable with three categories: Autocracy for scores from -10 to -6, Anocracy for scores from -5 to 6, and Democracy for scores from 7 to 10. Use the following commands:
tab polity2
recode polity2 (-10/-6=1 "Autocracy") ///
(-5/6=2 "Anocracy") ///
(7/10=3 "Democracy") ///
(else=.), ///
gen(regime) label(polity_rec)
To inspect the newly created regime variable, type:
tab regime
tab regime, nolabel
tab country regime
tab country regime, row
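Another way to verify the recode, shown here as an optional check, is to report the range of polity2 within each regime category. The minimum and maximum in each row should match the cut-offs used in the recode command.
* Range of polity2 and number of observations within each regime category
tabstat polity2, by(regime) statistics(min max n)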
To generate a boxplot for gdppc with respect to the categorical variable regime, type:
graph box gdppc, over(regime) yline(9482.966) ///
title("Regime Type and Per capita GDP")
Stata will give us the following graph.
Note: the reference line drawn by yline() is placed at the mean of gdppc.
We can also draw the above graph as a horizontal boxplot. Type:
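Rather than typing the mean by hand, you can let Stata compute it and pass it to yline(). The sketch below stores the mean in a local macro (the macro name gdppc_mean is chosen for this example). Local macros exist only while the code runs, so run these lines together from a do-file.
* Compute the mean of gdppc and store it in a local macro
summarize gdppc
local gdppc_mean = r(mean)
* Use the stored mean as the reference line
graph box gdppc, over(regime) yline(`gdppc_mean') ///
title("Regime Type and Per capita GDP")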
graph box gdppc, over(regime) horizontal yline(9482.966) ///
title("Regime Type and Per capita GDP")
Stata will give us the following graph.
We can create a boxplot for two numerical variables (gdppc and trade) with respect to a categorical variable (regime). First, rescale gdppc and trade by taking logs, which gives a more readable boxplot. Type:
gen log_gdppc = log(gdppc)
gen log_trade = log(trade)
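The log of zero or of a negative number is undefined, so Stata sets the result to missing for such values. As an optional check, the following commands count observations that were non-missing before the transformation but are missing afterwards:
* Observations lost because gdppc was zero or negative
count if missing(log_gdppc) & !missing(gdppc)
* Observations lost because trade was zero or negative
count if missing(log_trade) & !missing(trade)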
To make the boxplot, type:
graph box log_gdppc log_trade, over(regime) ///
title("Regime Type, Per capita GDP, and International Trade")
Stata will give us the following graph.
For more information about boxplots, type:
help graph box
First, get the Stata data file by typing:
use "https://dss.princeton.edu/training/wdipol.dta", clear
Let us create a basic scatterplot for the variables export and import. Type:
scatter import export
Stata will give us the following graph.
Now, we will add a fitted line (lfit) for the export and import values. We also add some options to make the graph more informative. Type:
scatter import export, title("Export and Import") ytitle("Import (const 2005 USD)") xtitle("Export (const 2005 USD)") mcolor(navy) || lfit import export, legend(off)
Stata will give us the following graph.
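The line drawn by lfit comes from a linear regression of the y variable on the x variable. To see the numbers behind the line, you can run the regression and the correlation directly. This is a supplementary check rather than part of the graph command.
* Slope and intercept of the fitted line
regress import export
* Correlation between the two variables
correlate import export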
To label the outlier countries with their names, type:
twoway (scatter import export, title("Export and Import") ytitle("Imports") xtitle("Exports")) ///
(scatter import export if export>1000000, mlabel(country) mcolor(blue) legend(off)) ///
(lfit import export, note("Constant values, 2005, millions US$"))
Stata will give us the following graph.
For more help, type:
help scatter
help twoway scatter
Scatterplots with Linear Fit and Confidence Intervals
First, get the data. Type:
use "http://dss.princeton.edu/training/students.dta", clear
Now, to create a graph with scatterplot, line, and confidence interval, type:
twoway (lfitci sat newspaperreadershiptimeswk) ///
(scatter sat newspaperreadershiptimeswk, mlabel("")), ///
title("SAT Scores by Newspaper Readership") ytitle("Sat")
Stata will give us the following graph.
First, get the Stata data file. Type:
use "https://dss.princeton.edu/training/wdipol.dta", clear
Let us start with a simple histogram showing the probability density of the continuous variable unemp (unemployment rate). Type:
hist unemp
Stata will give us the following graph.
Let us check the histogram for another variable, gdppc. Type:
hist gdppc
Stata will give us the following graph.
Notice that the density values on the y-axis are very small, which is expected. Each bar's height is the fraction of observations in the bin divided by the bin width, and the gdppc bins are wide. The total area of the bars (height times width, summed over all bins) equals 1.
You will get an easier-to-read histogram for gdppc if you report fractions instead of densities on the y-axis. Type:
hist gdppc, fraction
Stata will give us the following graph.
We can also report frequencies on the y-axis. Type:
hist gdppc, frequency
Stata will give us the following graph.
To overlay a kernel density curve on the histogram, type:
hist gdppc, kdensity
Stata will give us the following graph.
To add a normal density curve alongside the kernel density curve, type:
hist gdppc, kdensity normal
Stata will give us the following graph.
You can add a title and set the bin width and the bar color by typing:
hist gdppc, fraction kdensity normal width(10000) ///
title("Per Capita GDP") ///
color(teal)
Stata will give us the following graph.
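Instead of setting the bin width, you can set the number of bins with bin(). The two options are alternatives and cannot be combined. The sketch below also reports percentages rather than fractions; the choice of 20 bins is arbitrary.
* Histogram with 20 bins and percentages on the y-axis
hist gdppc, percent bin(20) ///
title("Per Capita GDP") ///
color(teal)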
We can use the histogram's discrete option to treat a variable as discrete. To create a histogram for the discrete variable polity2, type:
hist polity2, discrete ///
color(teal) ///
title("Polity2 Scores of the Sample Countries")
Stata will give us the following graph.
We can get histograms for gdppc for a group of countries in a single graph. Type:
hist gdppc if country=="United States" | country=="United Kingdom" | country=="Germany" | country=="France", bin(10) by(country) ///
color(teal)
Stata will give us the following graph.
We can create a twoway histogram, which is helpful for comparing the distributions of two groups or countries. Type:
twoway hist gdppc if country=="United States", bin(10) || ///
hist gdppc if country=="United Kingdom", bin(10) ///
fcolor(none) lcolor(red) legend(label(1 "USA") label(2 "UK"))
Stata will give us the following graph.
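With bin(10), each histogram chooses its bins from its own country's range of values, so the bars of the two countries need not line up. If you want both histograms on a common grid, one option is to give them the same start() and width(). The values below (start at 0, width of 5,000) are assumptions for illustration; start() must not exceed the smallest gdppc value in either group.
* Two overlaid histograms drawn on a common set of bins
twoway (hist gdppc if country=="United States", start(0) width(5000)) ///
(hist gdppc if country=="United Kingdom", start(0) width(5000) ///
fcolor(none) lcolor(red)), ///
legend(label(1 "USA") label(2 "UK"))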
For more help, type:
help hist
We can plot regression coefficients in a graph using the coefplot command.
First, install the coefplot package in Stata. Type:
ssc install coefplot
Let us use Stata's built-in dataset nlsw88 to generate a coefficient plot. To load the data, type:
sysuse nlsw88.dta, clear
Run a simple OLS model with a number of independent variables:
reg wage grade age race married c_city south union, robust
To plot the regression coefficients, type:
coefplot, drop(_cons) xline(0)
Stata will give us the following graph.
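In nlsw88, race is a categorical variable (white, black, other), and the model above enters it as a single numeric regressor. A common alternative, shown here as a sketch rather than as this guide's own specification, is to use factor-variable notation so that each race category gets its own coefficient:
* Same model, with race entered as a set of indicator variables
reg wage grade age i.race married c_city south union, robust
* Plot the coefficients, including one for each non-base race category
coefplot, drop(_cons) xline(0)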
Plotting Coefficients for Multiple Models
We can plot coefficients for multiple models in a single graph using the coefplot command.
Let us run two separate models, one for college graduates and one for non-college graduates. To see how the two groups are coded in the dataset, use the codebook command. Type:
codebook collgrad
Stata will give us the following table.
From the table, it is evident that "College grad" was assigned the value of 1 and "Not college grad" was assigned the value of 0 in the dataset.
Let us now run separate regressions for College grad and Not college grad and store the estimates. When storing the estimates, we can name the models as we like. In this case, we assign the name "College" to the first model and "NonCollege" to the second model. Type the following commands:
reg wage grade age race married c_city south union if collgrad ==1, robust
estimates store College
reg wage grade age race married c_city south union if collgrad ==0, robust
estimates store NonCollege
coefplot College NonCollege, drop(_cons) xline(0)
Stata will give us the following graph, which contains coefficients for the models "College" and "NonCollege".
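The stored estimates can be reused in other ways. The sketch below gives the two models more descriptive legend labels in coefplot and prints the coefficients side by side with estimates table. The label text and number formats are choices made for this example.
* Plot both models with descriptive legend labels
coefplot (College, label("College grad")) (NonCollege, label("Not college grad")), ///
drop(_cons) xline(0)
* Side-by-side table of coefficients and standard errors
estimates table College NonCollege, b(%9.3f) se(%9.3f) stats(N r2)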
For more help, type:
help coefplot
If you have questions or comments about this guide or method, please email data@Princeton.edu.