# Data Visualization in Stata: Generating Basic Graphs/Figures

This tutorial provides instructions to generate basic graphs/figures using Stata.

## 1. Introduction

This guide provides instructions to generate basic figures/graphs using Stata that are useful for exploratory data analysis.

You can type commands in the Stata Command window or run them from a do-file.

If you use a do-file, set your working directory by typing the following:

cd "C:\YourDirectoryPath"

After setting the working directory, open a do-file by clicking the "New Do-file Editor" icon in the Stata window.

Save the do-file for later use.
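A do-file often opens with a few housekeeping commands before any analysis. The following is a minimal sketch of one possible layout, not a required structure; the directory path and the log file name `graphs_tutorial.log` are placeholders.

```stata
* Minimal do-file skeleton (path and file name are placeholders)
clear all                                        // start from an empty session
capture log close                                // close a log left open by an earlier run, if any
cd "C:\YourDirectoryPath"                        // set the working directory
log using "graphs_tutorial.log", replace text    // record the output in a plain-text log

* ... commands from the sections below go here ...

log close
```

Keeping the housekeeping at the top means the whole file can be rerun from start to finish without leftover data or an open log interfering.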

## 2. Line Graphs

First, open a Stata data file by typing the following command:

use "https://dss.princeton.edu/training/wdipol.dta", clear

Inspect the data to get a sense of its contents. Type:

browse
describe
summarize

Generate a line graph for the variables unemp unempf unempm for the United States. Type:

line unemp unempf unempm year if country=="United States"

Stata will give us the following line graph:

The graph does not look good. Let's check the variables more carefully. Type:

summarize unemp unempf unempm

The summary shows that each variable contains values of 0. A 0% unemployment rate is implausible, so these zeros probably stand in for missing data. To replace them with missing values (.) and get a cleaner graph, type:

replace unemp=. if unemp==0
replace unempf=. if unempf==0
replace unempm=. if unempm==0
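The same recoding can be done in one line with `mvdecode`, which converts the listed numeric values to missing for every variable in the list. This is an alternative sketch, run here on a freshly loaded copy of the data; the loop afterwards only confirms that no zeros remain.

```stata
* Alternative to the three replace commands: recode 0 to missing in one step
use "https://dss.princeton.edu/training/wdipol.dta", clear
mvdecode unemp unempf unempm, mv(0)

* Each count should now report 0 observations equal to zero
foreach v of varlist unemp unempf unempm {
    count if `v' == 0
}
```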

Check the summary statistics again. Type:

summarize unemp unempf unempm

Let us generate the line graph again using the cleaner data. Type:

line unemp unempf unempm year if country=="United States"

Stata now gives us a better-looking graph.

Let us add a legend, line patterns, a y-axis title, and a graph title to make the graph easier to read. Type the following commands:

line unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
lpattern(solid dash dash_dot) ///
ytitle("Percentage")

Stata will give us the following graph.

We can present the above graph by connecting the lines and adding symbols (circle, diamond, square, etc.) to the lines. Type:

twoway connected unemp unempf unempm year if country=="United States", ///
title("Unemployment rate in the US, 1980-2012") ///
legend(label(1 "Total") label(2 "Females") label(3 "Males")) ///
msymbol(circle diamond square) ///
ytitle("Percentage")

Stata will give us the following graph.

### Line Graphs by Country Names

We can use Stata's twoway connected command to create a separate line graph for each country in a selected set. To do so, list the countries in the if condition and add the by(country) option. Use the following commands:

twoway connected unemp year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("Unemployment Rate")) ///
msymbol(circle_hollow)

Stata will give us the following graph.

We can present lines for each country above in a single graph. Type the following codes:

twoway (connected unemp year if country=="United States", msymbol(diamond_hollow)) ///
(connected unemp year if country=="United Kingdom", msymbol(triangle_hollow)) ///
(connected unemp year if country=="Australia", msymbol(square_hollow)) ///
(connected unemp year if country=="Qatar", msymbol(circle_hollow)), ///
title("Unemployment Rate") ///
legend(label(1 "USA") label(2 "UK") label(3 "Australia") label(4 "Qatar"))

Stata will give us the following graph.

Let us now generate a similar graph for the variable gdppc. Type:

twoway connected gdppc year if gdppc>40000, by(country) msymbol(diamond)

Stata will give us the following graph.
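One caveat about conditions such as `gdppc>40000`: Stata treats numeric missing values as larger than any number, so the condition is also true where gdppc is missing. The sketch below, which assumes wdipol.dta is still in memory, compares the two counts and then repeats the graph with missing values explicitly excluded. Whether the counts differ depends on how many missing values gdppc has in this dataset.

```stata
* Assumes wdipol.dta is in memory
count if gdppc > 40000                        // also counts observations where gdppc is missing
count if gdppc > 40000 & !missing(gdppc)      // counts only observed values above 40,000

* Same graph as above, restricted to non-missing values
twoway connected gdppc year if gdppc > 40000 & !missing(gdppc), ///
    by(country) msymbol(diamond)
```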

We can add more than one line in each of the graphs of a panel graph. Let us create two new variables, gdppc_mean and gdppc_median. Type:

bysort year: egen gdppc_mean=mean(gdppc)
bysort year: egen gdppc_median=median(gdppc)

Let us now generate line graphs that plot gdppc together with gdppc_mean for selected countries. Type:

twoway connected gdppc gdppc_mean year if country=="United States" | ///
country=="United Kingdom" | ///
country=="Australia" | ///
country=="Qatar", ///
by(country, title("GDP pc (PPP, 2005=100)")) ///
legend(label(1 "GDP-PC") label(2 "Mean GDP-PC")) ///
msymbol(circle_hollow)

Stata will give us the following graph.

### Line Graphs by Country Names in Panel Data Setting

To declare the dataset as panel data, type:

xtset country year

Running this command returns an error because country is a string variable, and xtset requires a numeric panel identifier. To create a numeric version of country, type:

encode country, gen(country1)

To declare the dataset as a panel again, type:

xtset country1 year
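To see what `encode` produced, it can help to look at the value label it created and at the panel structure that `xtset` declared. This is an optional check, assuming the two commands above have been run; by default `encode` numbers the strings in alphabetical order and stores the names in a value label named after the new variable.

```stata
* Assumes country1 was created by encode and the data were xtset as above
label list country1     // integer codes and the country names attached to them
xtdescribe              // number of panels, time span, and participation pattern
```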

Now, let us create a line graph for the countries with per capita GDP greater than \$35,000. Type:

xtline gdppc if gdppc>35000, overlay ///
title(Per Capita GDP for the Richest Countries)

Stata will give us the following graph.

NOTE: To see the available marker symbols, line patterns, and colors, type:

palette symbolpalette
palette linepalette
palette color green
help palette

## 3. Bar Graphs

This section describes how to generate bar graphs.

First, get the data. Type:

use "https://dss.princeton.edu/training/wdipol.dta", clear

Let us create a horizontal bar graph for the variable gdppc for each country in the dataset. Type:

graph hbar (mean) gdppc, over(country, sort(1) descending label(labsize(*0.50)))

Stata will give us the following graph.

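Country names in the graph are hard to read. To make the graph clearer, we can restrict it to observations with per capita GDP greater than \$18,000. Note that the if condition filters individual country-year observations, so each bar shows the mean of only those years in which a country's gdppc exceeded 18,000. To do this, type: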

graph hbar (mean) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.7))) ///
bar(1, color(ebblue))

Stata will give us the following graph.
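If the goal is instead to select countries by their overall mean, one option is to compute the country-level mean first and filter on it. The following is a sketch under that assumption; `gdppc_cmean` is a helper variable introduced here for illustration and is not part of the dataset.

```stata
* Assumes wdipol.dta is in memory; gdppc_cmean is a hypothetical helper variable
bysort country: egen gdppc_cmean = mean(gdppc)     // one mean per country, repeated on every row

* Keep countries whose mean over all available years exceeds 18,000
graph hbar (mean) gdppc if gdppc_cmean > 18000 & !missing(gdppc_cmean), ///
    over(country, sort(1) descending label(labsize(*0.7))) ///
    bar(1, color(ebblue))
```

With this version each bar uses all of a country's observations, so the bar lengths can differ from those in the graph above.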

For the countries with per capita GDP less than \$1500, type:

graph hbar (mean) gdppc if gdppc<1500, ///
over(country, sort(1) descending label(labsize(*0.6))) ///
bar(1, color(ebblue))

Stata will give us the following graph.

We can compare mean per capita GDP with the median per capita GDP for the countries with gdppc>18000. Type:

graph hbar (mean) gdppc (median) gdppc if gdppc>18000, ///
over(country, sort(1) descending label(labsize(*0.8))) ///
legend(label(1 "GDPpc (mean)") label(2 "GDPpc (median)")) ///
bar(1, color(blue)) ///
bar(2, color(brown))

Stata will give us the following graph.

For more help, type:

help graph bar

## 4. Boxplots

A boxplot is a valuable tool for detecting outliers in a dataset. This section provides instructions on creating basic boxplots using Stata.

First, open a Stata data file. Type:

use "https://dss.princeton.edu/training/wdipol.dta", clear

Let us create a basic boxplot for the variable gdppc. Type:

graph hbox gdppc

Stata will give us the following graph.

In the above graph, we see many outliers where gdppc is greater than 40,000. We can cap gdppc at that value to get a better view of the minimum, maximum, median, and quartile values. To do this, type:

graph hbox gdppc if gdppc <40000

Stata will give us the following graph.

We will now create a boxplot for the variable gdppc with respect to a categorical variable. Let us recode the polity2 variable into a categorical variable, regime. Currently, polity2 ranges from -10 to 10. We will create regime with three categories: Autocracy for scores from -10 to -6, Anocracy for scores from -5 to 6, and Democracy for scores from 7 to 10. Use the following commands:

tab polity2
recode polity2 (-10/-6=1 "Autocracy") ///
(-5/6=2 "Anocracy") ///
(7/10=3 "Democracy") ///
(else=.), ///
gen(regime) label(polity_rec)

To inspect the newly created regime variable, type:

tab regime
tab regime, nolabel
tab country regime
tab country regime, row

To generate a boxplot for gdppc with respect to the categorical variable regime, type:

graph box gdppc, over(regime) yline(9482.966) ///
title("Regime Type and Per capita GDP")

Stata will give us the following graph.

Note: we used mean gdppc to plot the dotted y line.
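Rather than typing the mean by hand, the value can be taken from the results that `summarize` leaves behind. The sketch below assumes wdipol.dta is in memory and that regime was created by the recode step above; the local macro name `gdppc_mean` is chosen here for illustration.

```stata
* Assumes wdipol.dta is in memory and regime exists
quietly summarize gdppc
display "Mean of gdppc: " r(mean)
local gdppc_mean = r(mean)                 // store the mean in a local macro

graph box gdppc, over(regime) yline(`gdppc_mean') ///
    title("Regime Type and Per capita GDP")
```

Because local macros exist only for the duration of one run, these lines should be executed together from the do-file rather than one at a time.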

We can create the above graph with the horizontal boxplot. Type:

graph box gdppc, over(regime) horizontal yline(9482.966) ///
title("Regime Type and Per capita GDP")

Stata will give us the following graph.

We can create a boxplot for two numerical variables (gdppc and trade) with respect to a categorical variable (regime). Taking logs of gdppc and trade compresses their scales and gives a more readable boxplot. Type:

gen log_gdppc = log(gdppc)
gen log_trade = log(trade)

To make the boxplot, type:

graph box log_gdppc log_trade, over(regime)  ///
title("Regime Type, Per capita GDP, and International Trade")

Stata will give us the following graph.

For more help, type:

help graph box

## 5. Scatterplots

First, get the Stata data file by typing:

use "https://dss.princeton.edu/training/wdipol.dta", clear

Let us create a basic scatterplot for the variables export and import. Type:

scatter import export

Stata will give us the following graph.

Now, we will add a fitted line through the export and import values. We also add a few options to make the graph more informative. Type:

scatter import export, ///
title("Export and Import") ///
ytitle("Import (const 2005 USD)") ///
xtitle("Export (const 2005 USD)") ///
mcolor(navy) || ///
lfit import export, legend(off)

Stata will give us the following graph.
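The line drawn by `lfit` is the prediction from a linear regression of the y variable on the x variable. To see the numbers behind it, the same regression can be run directly. This is a small sketch, assuming wdipol.dta is in memory.

```stata
* Assumes wdipol.dta is in memory
regress import export
display "Slope of the fitted line: " _b[export]
display "Intercept:                " _b[_cons]
```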

To label the outlier points with their country names, type:

twoway (scatter import export, title("Export and Import") ytitle("Imports") xtitle("Exports")) ///
(scatter import export if export>1000000, mlabel(country) mcolor(blue) legend(off)) ///
(lfit import export, note("Constant values, 2005, millions US\$"))

Stata will give us the following graph.

For more help, type:

help scatter

help twoway scatter

### Scatterplots with Linear Fit and Confidence Intervals

First, get the data. Type:

use "http://dss.princeton.edu/training/students.dta", clear

Now, to create a graph with a scatterplot, a fitted line, and a confidence interval, type:

twoway (lfitci sat newspaperreadershiptimeswk) ///
(scatter sat newspaperreadershiptimeswk), ///
title("SAT Scores by Newspaper Readership") ytitle("SAT")

Stata will give us the following graph.

## 6. Histograms

First, get the Stata data file. Type:

use "https://dss.princeton.edu/training/wdipol.dta", clear

Let us start with a simple histogram showing the probability density of the continuous variable unemp. Type:

hist unemp

Stata will give us the following graph.

Let us check the histogram for another variable, gdppc. Type:

hist gdppc

Stata will give us the following graph.

Notice that the density values on the y-axis are very small. This is expected: on the density scale, the areas of the bars (height times bin width) sum to 1, and because gdppc spans tens of thousands of dollars, each bin is wide and its height must be correspondingly small.
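The area claim can be checked from the results that `histogram` leaves in memory. This sketch assumes wdipol.dta is loaded and that `histogram` stores the bin width in `r(width)` and the total bar area in `r(area)`, as its stored-results documentation lists; `return list` after the command shows what is actually available in your version.

```stata
* Assumes wdipol.dta is in memory
histogram gdppc                             // density scale is the default
display "Bin width      = " r(width)
display "Total bar area = " r(area)         // expected to be 1 on the density scale
```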

You will get a better-looking histogram for the variable gdppc if you report fraction instead of density in the y-axis. Type:

hist gdppc, fraction

Stata will give us the following graph.

We can also report frequency in the y-axis. Type:

hist gdppc, frequency

Stata will give us the following graph.

To add a density curve in the histogram, type:

hist gdppc, kdensity

Stata will give us the following graph.

To add a normal curve with the density curve, type:

hist gdppc, kdensity normal

Stata will give us the following graph.

You can add a title and set the bin width and bar color by typing:

hist gdppc, fraction kdensity normal width(10000) ///
title("Per Capita GDP") ///
color(teal)

Stata will give us the following graph.

We can use the histogram's discrete option to treat a variable as discrete. To create a histogram for the discrete variable polity2, type:

hist polity2, discrete ///
color(teal) ///
title("Polity2 Scores of the Sample Countries")

Stata will give us the following graph.

We can get histograms for gdppc for a group of countries in a single graph. Type:

hist gdppc if country=="United States" | country=="United Kingdom" | country=="Germany" | country=="France", bin(10) by(country) ///
color(teal)

Stata will give us the following graph.

We can create a twoway histogram, which is helpful to compare statistics for two groups or countries. Type:

twoway hist gdppc if country=="United States", bin(10) || ///
hist gdppc if country=="United Kingdom", bin(10) ///
fcolor(none) lcolor(red) legend(label(1 "USA") label(2 "UK"))

Stata will give us the following graph.

For more help, type:

help hist

## 7. Plotting Regression Coefficients

We can plot regression coefficients in a graph using the coefplot command.

First, install the coefplot package in Stata. Type:

ssc install coefplot

Let us use nlsw88, one of Stata's built-in datasets, to generate a coefplot graph. To get the data, type:

sysuse nlsw88.dta, clear

Run a simple OLS model with a number of independent variables:

reg wage grade age race married c_city south union, robust

To plot the regression coefficients, type:

coefplot, drop(_cons) xline(0)

Stata will give us the following graph.
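In the model above, race enters as a single numeric regressor, although in nlsw88 it is a categorical variable with three codes. One possible refinement, shown here as a sketch rather than as the tutorial's own specification, is to use factor-variable notation (`i.race`) so that each category gets its own coefficient relative to the base category. It assumes coefplot is already installed.

```stata
* Variation on the model above with race treated as categorical
sysuse nlsw88.dta, clear
tabulate race                                              // inspect the categories
reg wage grade age i.race married c_city south union, robust
coefplot, drop(_cons) xline(0)
```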

### Plotting Coefficients for Multiple Models

We can plot coefficients for multiple models in a single graph using the coefplot command.

Let us run two separate models, one for college graduates and one for non-college graduates. To see how the two groups are coded in the dataset, use the codebook command. Type:

codebook collgrad

Stata will give us the following table.

Tabulation:

| Freq. | Numeric | Label |
|---|---|---|
| 1,714 | 0 | Not college grad |
| 532 | 1 | College grad |

From the table, it is evident that "College grad" was assigned the value of 1 and "Not college grad" was assigned the value of 0 in the dataset.

Let us now run separate regressions for College grad and Not college grad and store the estimated coefficients. When storing the estimates, we can name the models as we like. In this case, we assign the name "College" to the first model and "NonCollege" to the second model. Type the following commands:

reg wage grade age race married c_city south union if collgrad ==1, robust

estimates store College

reg wage grade age race married c_city south union if collgrad ==0, robust

estimates store NonCollege

coefplot College NonCollege, drop(_cons) xline(0)

Stata will give us the following graph, which contains coefficients for the models "College" and "NonCollege".

For more help, type:

help coefplot

## Useful Resources

DSS Data Analysis Guides. Available at: https://library.princeton.edu/dss/training

Getting started: how to use coefplot. Available at: https://repec.sowi.unibe.ch/stata/coefplot/getting-started.html

Kane, J. V. (2023). Combined “marginsplots” for Regression Analysis in Stata. Available at: https://medium.com/the-stata-gallery/combined-marginsplots-for-regression-analysis-in-stata-b107b5f237fc

Mirza, M. (2022). Top 25 Stata Visualizations — With Full Code. Available at: https://medium.com/the-stata-gallery/top-25-stata-visualizations-with-full-code-668b5df114b6

Mitchell, M. N. (2022). A Visual Guide to Stata Graphics. 4th ed. Stata Press.

Stata. Publication-quality Graphics. Available at: https://www.stata.com/features/publication-quality-graphics/

Stata. Visual overview for creating graphs. Available at: https://www.stata.com/support/faqs/graphics/gph/stata-graphs/

The World Bank. Stata Coding Practices: Visualization. Available at: https://dimewiki.worldbank.org/Stata_Coding_Practices:_Visualization
