This was an early project I had to opportunity to work on to improve the internal reporting tool our team used for cleaning timesheet data. Our team wanted to start pointing our time to revenue generating projects before the company had approved, vetted, and paid for a licensed time tracking service. We got our survey team to create something our team could use as a defacto timesheet tracker - something where we could outline what client projects we were working on, how much per week we put toward those client projects, and how we were utilizing the rest of our time to benefit the company.
The problem was the created survey/timesheet output was a complete mess. The team had created something that worked, but the output data wasn’t formatted. The variables were stretched out in a wide format rather than a long/tidy format. The result was 839 variables which needed to be organized, cleaned and relabeled so the team of directors could use the data.
I’ve had to exclude a lot of private information from the code snippets below, but the process is outlined from start to finish. I include this as a portfolio piece because of its complexity and to illustrate how I use the resources available to me to benefit the team and larger organization. Once I understood the structure, It turned into a puzzle that could be solved instead of the mess it looked like.
Data Cleaning
The survey team used default naming conventions on their side. So instead of “Project Info” the variable was named “QTx2r07c98”. I’m sure it made sense to someone, somehwere but it wasn’t useful to our team. On top of that, due the way the survey was coded, it automatically included all possible entries regardless of whether there was data or not. The result was 50 different variables that may or may not have data. The first step I took to untanlge this mess was reorganizing the data into a structure I could use through advanced pivot functions.
Then I needed to find and replace all the varible groups with regex strings corresponding to all possible data columns for that group. This made renaming variables easier through their default naming patterns.
Then it was a matter of reformatting the data according to the correct labels, removing any NA values, and coalescing data to remove unnecissary columns.
Finaly, I used the included datamap from the API pull to relabel the data values so names and internal category labels were shown instead of a string of characters.
The final data was then sent to our team of directors for weekly reporting to senior leadership. I had to do this for 2 additional teams at the end of each month and abstracting it to the level I did made things much easier. Sure, you could do all this in Excel, but I enjoy the process of writing data cleaning code. Plus, the upfront work here made subsequent data pulls effortless.