Capstone: Google Data Analytics Professional Certificate

I went through this case study to conclude my Google Data Analytics Professional Certificate. Even though I was trained in grad school using tools like SAS and continue to use advanced Excel functions daily and I am currently pursuing Machine Learning using Python, I thought it best to have that Certificate to show my specialties before pursuing more advanced ones in the Data Science field. I chose this case because as someone who lives in one of the nicest college towns in the country, I see the potential success of a bikeshare in places investing in their infrastructure to be more bike friendly.

I could have accomplished these tasks in a fraction of the time using Python and its vast arsenal but the goal was to use specifics tools used in that program, namely, spreadsheets, SQL, and R.

The dataset used in this case is provided by Motivate International Inc. under this license. The data has been anonymized by simulate real life cases. I am using the past 12 months data, covering the 2021 calendar year.

This case study required me to act as a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago, IL. Like many other organizations nowadays, a part of Cyclistic’s leadership is hypothesizing that that future success will closely be corelated to the number of annual memberships given the finance analysts conclusion that annual memberships are more profitable casual riders (single-ride passes, full-day passes). To provide an educated assessment of the business task, I need to analyze how annual members and casual riders use our service differently. If the analysis supports the leadership’s hypothesis and its recommendations approved, insights gained from my analysis will help design a new marketing strategy to convert casual riders into annual members.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations throughout the city. We set ourselves apart from the competition by offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. Though most riders opt for traditional bikes; about 8% of riders use the assistive options. Thirty percent (30%) of our users use our service to commute to work each day, making most of our users more likely to ride for leisure.

Since its launch in 2016, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments though the use of flexible pricing plans: single-ride passes, full-day passes, and annual memberships.

The stakeholders are the marketing analytics team I am a member of, Lily Moreno, my boss and Director of the Marketing department, and finally the company’s Executive team that will have a final say on the recommendations I will make.

Prepare

I downloaded the data from Motivate International Inc. and imported them in R using the syntax below (after setting the working directory). I could have easily automated that part of the process using Python loops, but went the manual way in R. However, I am planning to use R loops whenever my work schedule allow me to in the future.

d01_2021 <- read_csv("202101-divvy-tripdata.csv")
d02_2021 <- read_csv("202102-divvy-tripdata.csv")
d03_2021 <- read_csv("202103-divvy-tripdata.csv")
d04_2021 <- read_csv("202104-divvy-tripdata.csv")
d05_2021 <- read_csv("202105-divvy-tripdata.csv")
d06_2021 <- read_csv("202106-divvy-tripdata.csv")
d07_2021 <- read_csv("202107-divvy-tripdata.csv")
d08_2021 <- read_csv("202108-divvy-tripdata.csv")
d09_2021 <- read_csv("202109-divvy-tripdata.csv")
d10_2021 <- read_csv("202110-divvy-tripdata.csv")
d11_2021 <- read_csv("202111-divvy-tripdata.csv")
d12_2021 <- read_csv("202112-divvy-tripdata.csv")

Then, I converted the ride_id, and rideable_type columns into character data types for better stacking. I could have also waited to make this change of data types until all files where bound into a single one.

d01_2021 <-  mutate(d01_2021, ride_id = as.character(ride_id)
                    ,rideable_type = as.character(rideable_type)) 
d02_2021 <-  mutate(d02_2021, ride_id = as.character(ride_id)
                    ,rideable_type = as.character(rideable_type)) 
d03_2021 <-  mutate(d03_2021, ride_id = as.character(ride_id)

And finally, I have combined all 12 data frames into a single annual data frame. It had a 5,595,063 rows and 13 columns.

annual_df <- bind_rows(d01_2021,d02_2021,d03_2021,d04_2021,d05_2021,d06_2021,d07_2021,d08_2021,d09_2021,d10_2021,d11_2021,d12_2021)

Like usual, when I have a data frame, I like to have a quick overview of its content using the functions below.

colnames(annual_df)
nrow(annual_df)
dim(annual_df)
head(annual_df, n =9)
tail(annual_df, n = 7)
str(annual_df)
summary(annual_df)

There are some additional steps we need to take to be able to complete analysis later on. Namely, we can currently aggregate our data only at the ride level. We need to create additional columns to be group our data by date, and month. We will also need to set a calculated field to return the length of the ride. Finally, we will need to make sure no ride length is negative or last less than a minute or more than 24 hours per company policy.

Process

annual_df$date <- as.Date(annual_df$started_at) #The default format is yyyy-mm-dd
annual_df$month <- format(as.Date(annual_df$date), "%m")
annual_df$day <- format(as.Date(annual_df$date), "%d")
annual_df$year <- format(as.Date(annual_df$date), "%Y")
annual_df$day_of_week <- format(as.Date(annual_df$date), "%A")
annual_df$ride_length <- difftime(annual_df$ended_at,annual_df$started_at)

annual_df$ride_length <- as.numeric(as.character(annual_df$ride_length))

Then, I removed the incomplete and problematic data and assigned the value to a new data frame.

annual_df2 <- annual_df2[!(annual_df2$start_station_name == "HQ QR" | annual_df2$ride_length<60 | annual_df2$ride_length>86400),]

annual_df2 <- distinct(annual_df2, ride_id, .keep_all=TRUE)

annual_df2 <- annual_df2[!(is.na(annual_df3$start_station_id) | 
                               is.na(annual_df2$end_station_id) | 
                               is.na(annual_df$ride_id) | 
                               is.na(annual_df2$rideable_type) | 
                               is.na(annual_df2$started_at) | 
                               is.na(annual_df2$ended_at) | 
                               is.na(annual_df2$end_lat) | 
                               is.na(annual_df2$end_lng)),]

We then remove records that started or ended at administrative stations DIVVY CASSETTE REPAIR MOBILE STATION or HUBBARD ST BIKE CHECKING (LBS-WH-TEST).

annual_df2<- annual_df2[!(annual_df2$start_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | 
                            annual_df2$start_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | 
                            annual_df2$start_station_name == "WATSON TESTING DIVVY" | 
                            annual_df2$end_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | 
                            annual_df2$end_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | 
                            annual_df2$end_station_name == "WATSON TESTING DIVVY"),]

In case we would like to group rides by month and export them, we could use

monthly_rides <- annual_df2 %>%
  group_by(month) %>%
  summarise(count = length(ride_id),
            '%' = (length(ride_id) / nrow(annual_df2)) * 100,
            'pct_members' = (sum(member_casual == "member") / length(ride_id)) * 100,
            'pct_casual' = (sum(member_casual == "casual") / length(ride_id)) * 100,
            'diff_pct_memb_cas' = pct_members - pct_casual)

monthly_rides %>%
  write.csv("monthly_ride_count.csv")

Descriptive Analysis

Let's see the ride length per day for each membership type

annual_df2$day_of_week <- ordered(annual_df2$day_of_week, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
aggregate(annual_df2$ride_length ~ annual_df2$member_casual + annual_df2$day_of_week, FUN = mean)

Let's analyze the ridership type and day

annual_df2 %>% 
  mutate(day = wday(started_at, label = TRUE)) %>%      #creates weekday field using wday()
  group_by(member_casual, day) %>%         #groups by usertype and weekday
  summarise(number_of_rides = n()						   	#calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>% 	    	# calculates the average duration
  arrange(member_casual, day)								       # sorts

Let's visualized this result using ggplot2.

annual_df2 %>% 
  mutate(day = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, day) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day)  %>% 
  ggplot(aes(x = day, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = 'Weekly Rides by Type', subtitle = "Spike in Casual ridership over weekends",
       caption = "Data analyzed by Ashek Ag Mohamed")

Let's now visualize the average duration.

# Let's create a visualization for average duration
annual_df2 %>% 
  mutate(day = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, day) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day)  %>% 
  ggplot(aes(x = day, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = 'Ride Duration per Day', subtitle = "Casual ridership 2+ times longer",
       caption = "Data analyzed by Ashek Ag Mohamed")

We see here that casual riders use the services over two times longer than members and there is even spike over the weekends.

Finally, I exported that data frame into a CSV file for further visualization in Tableau.

write.csv(annual_df2, file = "annual_df2.csv")

Capstone: Google Data Analytics Professional Certificate

Prepare

Process

Descriptive Analysis

Related Articles

R: Movie Data Analysis