Timesheet Data Cleaning

August 13, 2024 5 minute read

This was an early project I had to opportunity to work on to improve the internal reporting tool our team used for cleaning timesheet data. Our team wanted to start pointing our time to revenue generating projects before the company had approved, vetted, and paid for a licensed time tracking service. We got our survey team to create something our team could use as a defacto timesheet tracker - something where we could outline what client projects we were working on, how much per week we put toward those client projects, and how we were utilizing the rest of our time to benefit the company.

The problem was the created survey/timesheet output was a complete mess. The team had created something that worked, but the output data wasn’t formatted. The variables were stretched out in a wide format rather than a long/tidy format. The result was 839 variables which needed to be organized, cleaned and relabeled so the team of directors could use the data.

I’ve had to exclude a lot of private information from the code snippets below, but the process is outlined from start to finish. I include this as a portfolio piece because of its complexity and to illustrate how I use the resources available to me to benefit the team and larger organization. Once I understood the structure, It turned into a puzzle that could be solved instead of the mess it looked like.

Data Cleaning

The survey team used default naming conventions on their side. So instead of “Project Info” the variable was named “QTx2r07c98”. I’m sure it made sense to someone, somehwere but it wasn’t useful to our team. On top of that, due the way the survey was coded, it automatically included all possible entries regardless of whether there was data or not. The result was 50 different variables that may or may not have data. The first step I took to untanlge this mess was reorganizing the data into a structure I could use through advanced pivot functions.

# import -------------------------------------------------------------------
key <- Sys.getenv("KEY")

# get data straight from API
get_survey <- function(path) {
  url <- modify_url("https://COMPANY_NAME.com",
                    path = paste0("api/v1/surveys/", path, "/data"),
                    query = list(format = "csv", cond = "qualified"))
  resp <- GET(url, add_headers(`x-apikey` = key))

  if (status_code(resp) != 200) {
    status_msg <- paste0("Response error code ", status_code(resp), ". Check url.\n\n", )
    cat(status_msg)
  } else {
    status_msg <- paste0("Data successfully retrieved (status code: ", status_code(resp), ")\n\n")
    cat(status_msg)
    readr::read_csv(resp$content)
  }
}

data <- get_survey(path = 'SOME_URL_PATH', save = FALSE)

# clean -------------------------------------------------------------------

# match columns by regex
cols <- names(data)
patterns <- c("QTx1r\\d+cTask",
              "QTxDetailr\\d+cDetail",
              "QTxdeliveryTyper\\d+cDT",
              "QTxBUr\\d+cBU",
              "QTx2r\\d+c9\\d",
              "dxQTx6r\\d+c10",
              "QTx[32]r\\d+c\\d+")

match_cols <- str_subset(cols, paste0(patterns, collapse = "|"))

# select report columns, remove columns with only NA's, convert project
# name/number info to string
report <- data %>%
  select(date, hxWeekDater1, hxWeekDater7, qEmployeeName, q2, q1, all_of(match_cols), HidMonth) %>% 
  mutate(across(ends_with("c98"), ~as.character(.x)))

# pivot by "line item"
# this will add nrow(x)*50 due to how the survey is constructed
# filter any task rows with all NA's 
# remove any additional all NA columns created from pivot
report_piv <- report %>%
  mutate(across(matches("QTx2r\\d+c98"), as.character)) %>% 
  pivot_longer(cols = !c(date, hxWeekDater1, hxWeekDater7, qEmployeeName, q2, q1, HidMonth), 
               names_to = c(".value", "line_item"),
               names_pattern = "(^.*QTx.*(r\\d+).*$)",
               values_drop_na = TRUE) %>% 
  filter(if_any(matches("QTx1r\\d+cTask"), ~ !is.na(.))) %>% 
  select(where(~!all(is.na(.x))))

# this pivot structure staggers values across column groups
# next steps will create final columns and grab first value for column groups

# convert date cols to date type
report_piv <- report_piv %>% 
  mutate(date = mdy_hm(date)) %>%
  mutate(hxWeekDater1 = dmy(hxWeekDater1)) %>% 
  mutate(hxWeekDater7 = dmy(hxWeekDater7)) %>%
  rename(week_start = hxWeekDater1,
         week_end = hxWeekDater7)

Then I needed to find and replace all the varible groups with regex strings corresponding to all possible data columns for that group. This made renaming variables easier through their default naming patterns.

# create string of column group regex patterns
group_patterns <- c("QTx1r\\d+cTask",
                    "QTxDetailr\\d+cDetail",
                    "QTxdeliveryTyper\\d+cDT",
                    "QTxBUr\\d+cBU",
                    "QTx2r\\d+c98",
                    "dxQTx6r\\d+c10",
                    "QTx3r\\d+c1", "QTx3r\\d+c2", "QTx3r\\d+c3", "QTx3r\\d+c4", "QTx3r\\d+c5", "QTx3r\\d+c6", "QTx3r\\d+c7",
                    "QTx2r\\d+c99")

# create string of corresponding names for column groups
group_names <- c("task", 
                 "task_detail", 
                 "deliv", 
                 "bu",
                 "proj_info", 
                 "auto", 
                 "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", 
                 "week")

# list of column group variables from regex pattern where present in report_piv
group_cols <- 
  map(group_patterns, ~ grep(.x, names(report), value = TRUE)) %>% 
  map(., ~ intersect(.x, names(report_piv)))

# label list for 
names(group_cols) <- group_names

Then it was a matter of reformatting the data according to the correct labels, removing any NA values, and coalescing data to remove unnecissary columns.

# get first non-NA value for each question group in group_cols
for (i in seq_along(group_cols)) {
  nm <- names(group_cols[i])
  if (length(group_cols[[i]] < 1)) {
    report_piv <- report_piv %>% 
      mutate( := coalesce(!!! select(., any_of(group_cols[[i]]))), 
             .before = line_item)
  }
}

# select final vars for report
report_clean <- report_piv %>% 
  mutate(Hours = rowSums(across(ends_with("day")), na.rm = TRUE),
         Quarter = lubridate::quarter(HidMonth),
         `Employee Number` = qEmployeeName) %>%
  select(date, qEmployeeName,`Employee Number`, q2, q1, 
         any_of(names(group_cols)), Hours, week_start, week_end, HidMonth, Quarter)

Finaly, I used the included datamap from the API pull to relabel the data values so names and internal category labels were shown instead of a string of characters.

# key-value pairs ---------------------------------------------------------

value_list <-
  list(datamap_labels$qEmployeeName,
       datamap_labels$q2,
       datamap_labels$q1,
       datamap_labels$QTx1,
       datamap_labels$QTxDetail,
       datamap_labels$QTxdeliveryType,
       datamap_labels$QTxBU)

recode_cols <-
  c("qEmployeeName",
    "q2",
    "q1",
    "task",
    "task_detail",
    "deliv",
    "bu")

# recode values -----------------------------------------------------------

# recode values for easy reporting
for (i in seq_along(value_list)) {
  nm <- recode_cols[[i]]
  if (is.element(recode_cols[[i]], names(report_clean))) {
    report_clean <- report_clean %>% 
      mutate( := recode(.data[[recode_cols[[i]]]], !!!value_list[[i]]))
  }
}

# rename cols consistent with notes from Luis
report_final <- report_clean %>% 
  rename("Input Date" = date,
         "Employee Name" = qEmployeeName,
         "Role" = q2,
         "Team" = q1,
         "Task Type" = task,
         "Detailed" = task_detail,
         "Deliverable Type" = deliv,
         "BU" = bu,
         "Project Name/Number" = proj_info,
         "Automation/Streamline Solution" = auto,
         "Week Start" = week_start,
         "Week End" = week_end,
         "Month" = HidMonth) %>% 
  rename_with(., str_to_title, c(ends_with("day"), week)) %>% 
  relocate(Hours, .before = Monday) %>% 
  # mutate(Year = year(`Week Start`)) %>% 
  relocate(Week, .after = `Week End`) %>% 
  arrange(`Input Date`)


# export ------------------------------------------------------------------

## xlsx
report_final %>% 
  filter(`Week End` > '2023-04-01') %>% 
  openxlsx::write.xlsx(file = "FILE_NAME.xlsx", 
                       rowNames = F)

## csv 
report_final %>%
  write.csv(., file = paste("FILE_NAME", str_trim(Sys.Date()), ".csv"),
            row.names = FALSE,
            na = "")

The final data was then sent to our team of directors for weekly reporting to senior leadership. I had to do this for 2 additional teams at the end of each month and abstracting it to the level I did made things much easier. Sure, you could do all this in Excel, but I enjoy the process of writing data cleaning code. Plus, the upfront work here made subsequent data pulls effortless.

Jacob Copple

Data Cleaning