Class 2: Manipulating your Data

In this class we will learn how to manipulate the data. Although this is clearly the less glamorous part of an empirical analysis, it is as important as the analysis of the data itself, because there can be no analysis with no data. In your professional life, you have probably been exposed to raw data that could have been really useful to inform decision-makers if someone had gone through the process of standardizing the data and putting the different sources together to build a useful dataset. Feeling comfortable in that process will open you the door to an almost infinity set of exciting data sources expecting to be analyzed, without needing an army of RA's crunching the data in Excel through an irreproducible process. You are on the right track to master this process, as Stata has powerful tools to go through it without loosing your mind on the way.

You will be introduced to the main tools for manipulating data through three interconnected short in-class exercise and a homework. All these exercises will build on top of each other to complete the processing of real data on bills payed by chilean hospitals to their suppliers, with which we will obtain relevant information to assess the extent of the problem of late payment in this context.

Set up

Imagine your are a young professional at the Chilean office of budget planning. You work at a recently funded task-force devoted to promote prompt payment throughout the public sector. You turn on your computer on a Monday morning after a refreshing weekend and receive the following email from your counterpart in the Minister of Health:

Dear Colluegue:

The minister of Health read your last report on the relative performance of different agencies on prompt payment to suppliers with a lot of interest. As you know, public hospitals have had a hard type paying their supplier on time, and he would like to have more information from your task-force to understand what are the potential consequences of late payment and how to improve hospitals' performance on this dimension. Could you please send us a database that shows for each hospital and available year what is the average number of days they take to pay their suppliers, and the fraction of bills payed after the limit stablished in the law (60 days for EAR hospitals and 30 days for the rest)? Please also provide the result for small, medium and large suppliers, as defined by the tertiles of the distribution of total sales to hospitals. Finally, please include a final column that shows for each hospital-year what is its relative rank as compared to other hospitals on that year, taking the fraction of bills payed late as the standard (the smaller, the better), and whether the hospital improved, maintained, or dropped its position with respect to the previous year.

The table should look like this:

Hospital Year EAR Rank Change in Rank Average Payment (Days) Late Payments (Fraction) Average Payment, small suppliers Average Payment, medium suppliers Average Payment, large suppliers Late Payments, small suppliers Late Payments, medium suppliers Late Payments, large suppliers
Hospital A 2010
Hospital A 2011

Thank you. We really appreciate your effort to improve prompt payment throughout the government.

All the best,

P.S: EAR stands for Establecimiento Autogestionado en Red, which translates to in-network self-managed establishment.

Road Map

Download the data for this class. You have all the data needed to achieve your goal in the data/original directory of the folder for this class, that contains data for a sample of hospitals. In broad terms, this is what we need to do to achieve our goal.

  1. Stack related databases:
    • Vertically combine the information of bills from different hospitals.
    • Reshape the CPI data to resemble the structure of the bill data.
    • Obtain month and year from dates in both databases.
  2. Horizontally combine bills data with:
    • The CPI series constructed above.
    • Information on hospitals to identify which are EAR.
  3. Build the final database:
    • Build a new database that identifies small, medium and large suppliers.
    • Horizontally combine the bill database with the above database that identifies small, medium and large suppliers.
    • Identify bills that were payed late.
    • Collapse and reshape the database to obtain the desired result.
    • Rank hospitals in each year and note whether its position improved, remain the same, or went down.

Wow! This looks like a lot. Do not feel overwhelm by this task, we will chop it into small pieces and see that with Stata this is not difficult at all!

In the scripts folder of the folder for this class there are four do files. The files exercise1-template.do , exercise2-template.do and exercise3-template.do provide templates where you can code the answer to each one of the exercises of this class, which correspond to the three main steps indicated above.

Exercise 1

Open the file exercise1-template.do and pay attention to your instructor.

Exercise 2

Open the file exercise2-template.do and pay attention to your instructor.

Exercise 3

Open the file exercise3-template.do and pay attention to your instructor.

Bonus

Run the three do in consecutive order using the files master-template.do. You will have to edit this file putting the paths to your do files.

You can download the solutions to these exercises here (please try to solve the exercises by yourself before looking at the solutions).

Homework 1

You have had an intense afternoon of Stata coding and feel very satisfied with your work. Just a second before sending the email to your counterpart in the Minister of Health, you receive a new email from him:

Dear Colluegue:

I hope you haven't started building the database yet! I forgot to attach the information from three hospitals we received yesterday (you can download it here). Could you please include these hospitals?

Also, I would like to try another ranking. Hospitals with bad performance may feel discourage by the ranking I proposed, so I discussed another option with my team that we liked better. The idea is to calculate the change (in percentage points) in the average days of payment for each hospital-year. So, for example, if a hospital payed its average bill in 2011 and 2012 in 30 and 27 days respectively, its score in 2012 will be -0.1 ([27-20]/30). The lower the score, the better the ranking (lower numbers in the ranking are better). Please add this ranking and whether the position of an hospital improved, remain the same or worsened to the database, without dropping the other ranking.

I hope this does not add too much work to your evening.

All the best,

Other people would have their dinner plans ruined by this new email, but not you. As an experienced professional that works with data, you have all the procedure coded on do files. Adding the additional hospitals is just repeating some lines of code, and the new ranking will require only a few new lines. This shows one of the main advantage of doing your work in do files: if you need to make a similar analysis you don't need to start from scratch. Go ahead and send the database before your counterpart changes his mind and sends another email!

To add the new datasets you should add the following lines after the database for the fifth hospital has been saved in the do file for exercise 1 (you should move the new database to the data/original/bills folder):

* Sixth hospital

insheet using "data/original/bills/Hospital F.csv", clear

save "data/processed/F.dta", replace

* Seventh hospital

insheet using "data/original/bills/Hospital G.csv", clear

save "data/processed/G.dta", replace

* Eigth hospital

insheet using "data/original/bills/Hospital H.csv", clear

save "data/processed/H.dta", replace

And add the following line after the data for hospital A has been appended in the do file for exercise 1:

append using "data/processed/E.dta"

append using "data/processed/G.dta"

append using "data/processed/F.dta"

To add the new ranking you should add the following lines of code after declaring the dataset as a longitudinal dataset (xtset) in the do file for exercise 3:

gen score = (payment_delay-l1.payment_delay)/l1.payment_delay

egen rank2 = rank(score) if score!=., by(year)

gen rank2_improved= rank2 < l1.rank2 if l1.rank2!=.

egen replace rank2_improved=-1 if rank2 > l1.rank2 & l1.rank2!=.

egen replace rank2_improved=-1 if rank2 > l1.rank2 & l1.rank2!=.

Finally, you have to create labels for these new variables and include them in the order command at the end

Homework 2

A couple of weeks after finishing the database, you receive a call from the director of Cenabast, a Chilean public agency that acts as a retailer for drug purchases. They are worried that hospitals might be paying them particularly late because they do not have good legal mechanisms to enforce prompt payment. You have been given this data, that shows what are the suppliers that have been selected in Cenabast's auctions between 2008 and 2015. You should combine this information with the information you already have to make a table that shows for each hospital-year what is the average time hospitals took to pay to Cenabast, the fraction of bills payed late to Cenabast, and the same figures for the rest of their suppliers. The group of suppliers considered for each year should be restricted to those that were selected in a Cenabast auction for that year. So, for example, if in 2011 Cenabast selected only suppliers X and Y, we want to include bills to suppliers X and Y when calculating the average days of payment and the fraction of late payments to the rest of suppliers in 2011. This will ensure that we are comparing the same suppliers (and hence hopefully similar products).

The final table should look like this:

Hospital Year EAR Average Payment to Cenabast (Days) Late Payments to Cenabast (Fraction) Average Payment to Other Suppliers (Days) Late Payments to Other Suppliers (Fraction)
Hospital A 2010
Hospital A 2011

Consider the following important points to complete this homework:

You can download the solution to this exercise here (please try to solve the exercise by yourself before looking at the solutions).