Class 1: From Excel to Stata

A first approximation: Stata as an enhanced excel

Open Stata. Four windows organized as a consolidated workspace will automatically pop-up. Nothing seems too interesting, there is only a welcome message in the Results window with some information about our version and license.

To have some action, we need to load a database. All of the windows in this workspace are instances that either:

In most cases, the things we want to do are only meaningful when there is a database loaded. So lets lead our first database to have some action!

First, download and unzip the materials for this class. Select file > Import > Text Data from the top menu and browse for the file data/original/gasData.csv in the folder of this class. Click ok.

Now Stata has loaded a database. Double click the file to inspect it in Excel (do not save afterwards!). As is generally the case in Excel, Stata follows the convention that each row represents an observation and each column a variable . In this case, observations are all the census block groups of Boston and Cambridge, and variables are some demographics from the 2010 Census and data on the status of gas leaks reported by utility companies in 2016 (more on the data later).

After loading the database, some of our windows have been populated with information:

The Bare Stata Window that shows up when opening Stata

A bare Stata window.

We have already indirectly introduced our first command when we imported the data (well, Stata did it for us). Now, type the following command in the command window:

browse

There we go! We can inspect our database. Try ordering the data by different columns (right click on a column).

Exit the Data Editor Window. Now you can see that your Review window is populated with commands! Click in one of the commands that order the database. The command is now written into your Command window. Try editing it to sort by another variable and type browse again to see what happened. The database should now be ordered by that variable!

We will now give Stata some commands to obtain useful information about our database, this will be our first in-class exercise.

Exercise 1: Preparing a meeting on a hurry

Assignment

Suppose you work as a policy expert in an environmental NGO that advocates for fixing gas leaks in the Great Boston Area (as for example HEET). You have received the GasLeaks.csv database from a colleague at 11:00 AM, and you have an important meeting with members of the Boston Public Health Commission at noon. Although in this very first meeting you did not expect to have data to share, showing them some aggregate figures about the extent of the gas leak problem in Boston and Cambridge would definitely be helpful to make a future cooperation with their organization more likely. By answering this questions you would make your point clear:

  1. What is the total number of leaks in each Neighborhood?
  2. What is the average, standard deviation, minimum, and maximum number of leaks in block groups per town, when we only consider those with at least a hundred inhabitants?
  3. Is there a clear relation at the neighborhoods level between the fraction of latino population and the number of leaks per 10,000 inhabitants?

This is the email your received from your colleague:

Dear Colluegue:

After a week putting together the data from the Census, the American Community Survey and the reports from the utility companies, I am happy to share with you this first version of this dataset.

You can check the definition of each variable in the file variables_description.csv, which is located in the same folder than the database. The gas leak data deserves some further explanation though. As you know, utility companies classify leaks in three grades, that go from one (requires immediate reparation because of safety concerns) to three (only requires monitoring). During 2016, some leaks were repaired and others were not. Hence, we have six types of leaks whose totals are reported in different columns in the database: yes1 (grade-1 leaks that were repaired), no1 (grade-1 leaks that were not repaired), yes2 (grade-2 leaks that were repaired), no2 (grade-2 leaks that were not repaired), yes3 (grade-3 leaks that were repaired), and no3 (grade-3 leaks that were not repaired).

I hope this information is useful for your meeting!

All the best,

Solution

Before starting, it is a good practice to explore the database. Try the following commands:

describe

This command provides a brief overview of each variable, telling us the name and the type of data (string, integer, float,...)

summarize

This command gives us more information for the variables that are numeric: The total number of non-missing observations and some summary statistics.

table neighborhood_name

This command gives us a frequency table for a variable.

Both describe and summarize can be followed by one or several variable names to restrict the output to those variables. If you want to know more about a command, the very first place you should ask is... Stata itself! Try:

help summarize

Stata will prompt a window with very useful information. It starts with the syntax of the command, something like:

summarize [varlist] [if] [in] [weight] [, options]

Not all commands follow this syntax, but it is pretty common. The command starts with the name of the command, followed by some arguments which may or may not be optional (they are optional if they are in between squared brackets). You can restrict attention to a subset of your database using the if and in, and specify options after a comma that will slightly modify the behavior of the command. Try:

summarize no3, d

Now Stata is giving us more detailed information about un-repaired Grade-3 leaks (d is short for detail).

What if we are interested in knowing this summary only for Boston? We can use then the [if] part of the command as follows:

summarize no3 if town_name=="Boston", d

Note the structure of what follows to the if, it is a logical test. We are asking if town_name (which is a string variable) is equal to a specific value. Boston is in between quotes because that is how Stata nows you are talking about a specific value of a string and not something else (as for example a variable in your dataset).

Ok, now we will start to prepare our meeting. We will open a log file. This file will save all the results prompted into the Results window into a file. To to this, go to File > Log > Begin..., and create a log file with the name "exercise1" into the results folder of our class. As you might have expected at this point, Stata just printed the relevant command into the Command line and pressed enter for us.

Now, type these commands one by one in the Command line to achieve our first goal:

display "The results that follow show what is the total number of leaks in each Neighborhood in Boston and Cambridge"

This command will simply print to the Result window (and hence to the log file) what you write between quotes. You can use it to make your log file more readable.

gen total_leaks = no1 + no2 + no3 + yes1 + yes2 + yes3

This command generates a new variable and makes it equal to the sum of all leaks. Once a variable has been generated, you can not generate another one with the same name. If you want to replace the value stored in a variable that already exists, you should use the replace command. Try to give to your variables names that are short to type but descriptive of what they are. Variable names cannot start with numbers, are case sensitive, and cannot have spaces of spacial characters such as hyphens, spaces or parenthesis (they can have underscores).

preserve

This command tells Stata to save a snapshot of our database. It is useful when we want to make a transformation of our database to reach a transitory goal but would like to come back to where we were later on.

collapse (sum) total_leaks, by (neighborhood_name)

This command tells Stata to create a new database where each row is the sum of the variable total over each neighborhood.

list

This command prints the database to the console.

restore

This command restores our database to the state it was when we entered the preserve command.

log close

This command closes our log file (stop recording). We could have done this through the top menu.

Ok, we are done with the first question. Open your log file to verify that all this effort is safely saved on a file that we can print for our meeting. We are ready to start with the second part. Type this commands:

log using path_to_log_file,append

We are telling Stata to open the log file we saved and append the new results at the end. You have to replace path_to_log_file with the path to the log file we saved.

display "The results that follow show summary stats on the number of leaks in block groups in Boston and Cambridge, separately for each town."

bysort town: summarize(total_leaks) if tot_pop>100

This is a very useful combination of commands. The part before the colon says that whatever follows the colon should be done for each group define in the variable that follows bysort. The part that follows the colon asks to provide a simple summary of the total variable only considering observations with more than a hundred inhabitants.

log close

We are left with the third part. There is a myriad of ways of assessing if there is a meaningful relation. We could calculate correlations, run regressions, build a nice table, or create a plot. Of all of them, a plot stands as one of the most useful, as almost everyone can understand what it represents. You can create very fancy plots using Stata (see some inspiring examples here), but now we will explore the most simple plot in Stata (don't worry, we will get to those fancy plots in class 4). You will realize that even if not very beautiful, the simple plot is super helpful to explore your data.

It is time for you to put into practice what we have learned so far! Append to our log file a serious of commands that print to the log file a simple plot with the share of latinos in the x-axis an the per-capita number of leaks in the y-axis. To guide you, I'll provide you with a list of instructions you need to translate to Stata:

  1. Append you work to the log file.
  2. Create a variable that stores the number of latino per block-group from tot_pop and latino_frac.
  3. If you have not done it for the previous exercise, create a variable that count the total number of leaks per block-group.
  4. Preserve and collapse your database to obtain a new database where the unit of observation is a neighborhood and you have three variables: the number of latino, the total population and the number of leaks.
  5. Create two variables: The fraction of latino and the number of leaks per 10,000 inhabitants.
  6. Plot the relation between both variables using plot. Type help plot to find out how to use this command.
  7. Restore your database and close the log file.

log using path_to_log_file, append

display "The results that follow show a plot that relates the fraction of latino in a neighborhood with the number of leaks per 10,000 inhabitants"

gen latino = latino_frac*tot_pop

cap gen total_leaks = no1+no2+no3+yes1+yes2+yes3

cap is a nice command that captures errors. If total_leaks already exists it will not overwrite it, and will not report an error.

preserve

collapse (sum) latino total_leaks tot_pop, by(neighborhood_name)

gen latino_frac = latino/tot_pop

Note that we can use the name latino_frac again because in the new database created after collapse that name does not exist.

gen gas_leaks_per_10000 = 10000*total_leaks/tot_pop

plot gas_leaks_per_10000 latino_frac

restore

log close

Taking reproducibility seriously: Your first do file

We will create a do file now. To a first approximation, a do file is simply a file where you stack some Stata commands line after line. Although simple, using do files efficiently will have a huge impact on your productivity, the quality of your empirical work, and will make you accountable for your work.

We will write our first do file through a second in-class exercise.

Exercise 2: Sharing your work with your colleague

Assignment

Congratulations! The members of the Boston Public Health Commission were astonished by how well prepared were you for your meeting and are looking forward for future cooperation with your NGO! After having a deserved nap, you check you inbox an see a new email from your colleague:

Dear Colluegue:

I heard you did excellent on the meeting, congratulations! Would you mind to share your do file in case I have to present similar results in the future?

All the best,

In the rush of preparing the meeting you forgot to write your commands in a do file as you should always do, but don't panic, it is never late. We will go step by step:

  1. Open a new do file by clicking on the Do File Editor button on the top of the Stata workspace.
  2. Type clear all on the first line and save the dofile in the scripts folder located in the folder for this class. Name the file exercise2.do. The clear all command tells Stata to remove whatever database is loaded into memory.
  3. Write a new line in the do file that declares the working directory:
  4. cd path_to_class1_folder

    After entering this command, all the paths to files that follow in the do file are relative to the declared working directory. For example, if my working directory is /Users/Colluegue/class1 and I want to open a file located at /Users/Colluegue/class1/data/dataset.csv, I only have to type data/dataset.csv when calling a file. What if the file I want is not a child of my working directory? You can either enter an absolute path or use the double dot notation to access a parent directory. For example, if you want to access a file located in /Users/Colluegue/other_project/a_file.csv you can call it either directly or using ../other_project/a_file.csv.

  5. Write a new line in the do file that imports the database as follows:
  6. import delimited suing "/data/original/gasData.csv"

    In the solution you will see that I used the command insheet using "/data/original/gasData.csv", c instead. This command also works, but is depreciated. It is important to know what it does though, because a lot of Stata users still use it.

  7. Write a new line that opens a log file:
  8. log using "/data/results/exercise1.smcl", replace

    Note that we are using the replace option to overwrite the file.

  9. Copy and paste to the do file, line by line, each one of the commands you used to answer the three questions. Skip those that open and close log files, but remember to finish your do file with log close to close the log file.

Solution

You can download the solution to this exercise here (please try to solve the exercise by yourself before looking at the solution).

Congratulations! You have written your first do file. Now you can share your work with anyone that has Stata by giving them your Class 1 folder. All the work the other person will have to do to reproduce your analysis is to change the working directory at the very beginning of the do file. That is why we prefer relative path to files after the working directory has been declared, because that makes reproducibility easier.

Now, run the do file by clicking on the upper right icon that says "run". There are four other ways to run a do file:

From now on, all our work will be done in do files. In practice, you will use the command line in your work to make some experimentation and explore your database, but most of your work will be in do files. Before introducing the homework for this class, let me briefly comment on some points I want to stress:

Homework: A disparity analysis

You did such a great job in your meeting that a couple of weeks later the Boston Public Health Commission hired a report from your organization. The name of the project is "A disparity analysis of exposure to gas leaks and access to gas leaks reparation in the Great Boston Area" and you have been chosen as the most capable employee for this task. You have received the following email from the director of your organization:

Dear Colluegue:

I have decided to assign you the report for the Boston Public Health Commission. As a first step, I want you to give me a table that shows for each quintile of income the average number of leaks per 10,000 inhabitants and the average fraction of grade-1, grade-2, and grade-3 leaks that were repaired during 2016. The table should look something like this:

Quintile Average Number of Leaks per 10,000 inhabitants Fraction of Grade-1 Leaks Repaired Fraction of Grade-2 Leaks Fixed Fraction of Grade-3 Leaks Fixed
Bottom 20%
20-40%
40-60%
60-80%
Top 20%

Please send me the table as a .dat file and as an excel sheet. Also attach your do file and log file.

All the best,

To complete this homework, you will need to make use of three new commands:

Use the help command to learn how to use these commands.

You can download the solution to this homework here (please try to solve the homework by yourself before looking at the solution).