This is a case study of a fictional company, “Cyclistic”: a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
The tidyverse package must be installed to work with the data.
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
The goal is to determine marketing strategies aimed at converting casual riders into annual members.
Insights from this analysis can be used to design strategies that increase annual memberships.
Find the differences between casual riders and annual members, and how digital media could influence them.
The data used in this project is open-source.
The data is located in Amazon AWS S3 bucket. The dataset link is: https://divvy-tripdata.s3.amazonaws.com/index.html
The data is organized on the basis of months.
The data is ROCCC: Reliable, Original, Comprehensive, Current, and Cited, since it is published by the company itself.
The data does not contain any personally identifiable information. It is licensed by the company itself.
Data is consistent and has proper data types.
This data can provide insights about the riders’ usage of services provided by Cyclistic.
The nomenclature of the data files is inconsistent.
A collection of data from July 2021 to June 2022, provided by Cyclistic.
A pre-processing step of data analysis to make sure the data is ready for analysis operations.
csv_files <- list.files(path = "./data", recursive = TRUE, full.names = TRUE)
data_merged <- do.call(rbind, lapply(csv_files, read.csv))
head(data_merged)
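One thing worth noting about the merge step: rbind() stops with an error if any monthly file has a different set of columns, so it can help to verify the headers before binding. A minimal base-R sketch, using toy data frames standing in for the parsed CSV files (the values are hypothetical):

```r
# Toy data frames standing in for two parsed monthly CSV files.
files_as_dfs <- list(
  data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
  data.frame(ride_id = "b1", member_casual = "member")
)

# rbind() requires identical column sets, so verify the schemas match first.
stopifnot(length(unique(lapply(files_as_dfs, names))) == 1)

merged <- do.call(rbind, files_as_dfs)
nrow(merged)  # 3
```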
data_merged_nn <- data_merged[!apply(data_merged == "", 1, all),]
print(paste("Deleted ", nrow(data_merged) - nrow(data_merged_nn), " rows."))
## [1] "Deleted 0 rows."
data_merged <- data_merged_nn
data_merged_rd <- data_merged[!duplicated(data_merged$ride_id), ]
print(paste("Deleted ", nrow(data_merged) - nrow(data_merged_rd), " rows."))
## [1] "Deleted 0 rows."
data_merged <- data_merged_rd
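The duplicated() filter above keeps the first occurrence of each ride_id and drops every later repeat. A small base-R illustration with toy data:

```r
df <- data.frame(ride_id = c("r1", "r2", "r2", "r3"), value = 1:4)

# duplicated() marks every repeat after the first occurrence.
deduped <- df[!duplicated(df$ride_id), ]

nrow(deduped)  # 3 -- the second "r2" row is dropped
```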
data_merged$started_at <- as.POSIXct(data_merged$started_at, format = "%Y-%m-%d %H:%M:%S")
data_merged$ended_at <- as.POSIXct(data_merged$ended_at, format = "%Y-%m-%d %H:%M:%S")
It represents the duration of the ride in minutes.
data_merged <- mutate(data_merged, ride_time_m = as.numeric(difftime(data_merged$ended_at, data_merged$started_at, units = "mins")))
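A subtlety worth noting here: subtracting two POSIXct vectors yields a difftime whose unit R chooses automatically, so dividing the raw difference by 60 is only correct when that unit happens to be seconds. Requesting minutes explicitly via difftime() avoids the ambiguity. A minimal sketch:

```r
start <- as.POSIXct("2022-01-01 10:00:00", tz = "UTC")
end   <- as.POSIXct("2022-01-01 10:30:00", tz = "UTC")

# `end - start` picks its own unit automatically; asking for minutes
# explicitly is always unambiguous.
mins <- as.numeric(difftime(end, start, units = "mins"))
mins  # 30
```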
summary(data_merged)
## ride_id rideable_type started_at
## Length:5900385 Length:5900385 Min. :2021-07-01 00:00:22.00
## Class :character Class :character 1st Qu.:2021-08-26 07:57:58.00
## Mode :character Mode :character Median :2021-10-27 17:35:55.00
## Mean :2021-12-12 00:11:36.51
## 3rd Qu.:2022-04-25 13:41:23.00
## Max. :2022-06-30 23:59:58.00
##
## ended_at start_station_name start_station_id
## Min. :2021-07-01 00:04:51.00 Length:5900385 Length:5900385
## 1st Qu.:2021-08-26 08:11:00.00 Class :character Class :character
## Median :2021-10-27 17:49:46.00 Mode :character Mode :character
## Mean :2021-12-12 00:31:53.47
## 3rd Qu.:2022-04-25 13:57:17.00
## Max. :2022-07-13 04:21:06.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5900385 Length:5900385 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual ride_time_m
## Min. :41.39 Min. :-88.97 Length:5900385 Min. : -137.42
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 6.28
## Median :41.90 Median :-87.64 Mode :character Median : 11.17
## Mean :41.90 Mean :-87.65 Mean : 20.28
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 20.20
## Max. :42.17 Max. :-87.49 Max. :49107.15
## NA's :5374 NA's :5374
A new attribute combining the year and month of the ride.
data_merged <- mutate(data_merged,
year_month = paste(strftime(data_merged$started_at, "%Y"), "-",
strftime(data_merged$started_at, "%m"),
paste("(",strftime(data_merged$started_at, "%b"), ")", sep="")))
unique(data_merged$year_month)
## [1] "2021 - 07 (Jul)" "2021 - 08 (Aug)" "2021 - 09 (Sep)" "2021 - 10 (Oct)"
## [5] "2021 - 11 (Nov)" "2021 - 12 (Dec)" "2022 - 01 (Jan)" "2022 - 02 (Feb)"
## [9] "2022 - 03 (Mar)" "2022 - 04 (Apr)" "2022 - 05 (May)" "2022 - 06 (Jun)"
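The same label can also be produced with a single strftime() call, since one format string can carry the whole pattern (the month abbreviation depends on the locale):

```r
ts <- as.POSIXct("2021-07-15 08:00:00", tz = "UTC")

# One format string instead of three paste()d pieces.
strftime(ts, "%Y - %m (%b)")  # "2021 - 07 (Jul)" in an English locale
```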
Represents the day of the week on which the ride started.
data_merged <- mutate(data_merged, weekday = paste(strftime(data_merged$started_at, "%u"),
"-", strftime(data_merged$started_at, "%a")))
unique(data_merged$weekday)
## [1] "5 - Fri" "3 - Wed" "7 - Sun" "4 - Thu" "6 - Sat" "1 - Mon" "2 - Tue"
Represents the hour of the day at which the ride started.
data_merged <- mutate(data_merged, start_hour = strftime(data_merged$started_at, "%H"))
unique(data_merged$start_hour)
## [1] "15" "17" "11" "22" "16" "12" "18" "20" "07" "08" "19" "09" "10" "13" "05"
## [16] "01" "21" "23" "00" "04" "14" "02" "06" "03"
write.csv(data_merged, "data_cleaned.csv")
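One side effect worth knowing: write.csv() keeps row names by default, and read.csv() reads them back as an extra column named X (which is why an X column shows up in the summary of the cleaned file later in this notebook). Passing row.names = FALSE avoids it. A minimal sketch:

```r
df <- data.frame(a = 1:2)
f  <- tempfile(fileext = ".csv")

write.csv(df, f)             # default: row names written as an unnamed first column
back <- read.csv(f)
names(back)[1]               # "X" -- the round-tripped row names

write.csv(df, f, row.names = FALSE)  # no extra column on re-read
```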
We will be using the R programming language, since it handles large datasets well and makes it easy to visualise data.
Data integrity has been ensured.
Duplicate values were removed, column names were correctly formatted.
The data is cleaned by removing duplicates and rows with missing values. Erroneous values can be corrected, and proper data types can be assigned to the attributes.
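For example, the summary above shows negative ride times (minimum -137.42 minutes), which a cleaning pass could drop. A base-R sketch with toy durations (the values are made up for illustration):

```r
# Toy ride durations in minutes; negative values indicate bad timestamps.
rides <- data.frame(ride_time_m = c(-137.4, 3.2, 11.2, 49107.2))

# Keep only rides with a positive duration.
rides_ok <- rides[rides$ride_time_m > 0, , drop = FALSE]

nrow(rides_ok)  # 3
```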
All the steps of this project are documented in this notebook.
A documentation of data cleaning and verification process.
Analysing data for insights.
data_cleaned <- read.csv("data_cleaned.csv")
head(data_cleaned)
Summary of the cleaned data.
summary(data_cleaned)
## X ride_id rideable_type started_at
## Min. : 1 Length:5900385 Length:5900385 Length:5900385
## 1st Qu.:1475097 Class :character Class :character Class :character
## Median :2950193 Mode :character Mode :character Mode :character
## Mean :2950193
## 3rd Qu.:4425289
## Max. :5900385
##
## ended_at start_station_name start_station_id end_station_name
## Length:5900385 Length:5900385 Length:5900385 Length:5900385
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## end_station_id start_lat start_lng end_lat
## Length:5900385 Min. :41.64 Min. :-87.84 Min. :41.39
## Class :character 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88
## Mode :character Median :41.90 Median :-87.64 Median :41.90
## Mean :41.90 Mean :-87.65 Mean :41.90
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93
## Max. :45.64 Max. :-73.80 Max. :42.17
## NA's :5374
## end_lng member_casual ride_time_m year_month
## Min. :-88.97 Length:5900385 Min. : -137.42 Length:5900385
## 1st Qu.:-87.66 Class :character 1st Qu.: 6.28 Class :character
## Median :-87.64 Mode :character Median : 11.17 Mode :character
## Mean :-87.65 Mean : 20.28
## 3rd Qu.:-87.63 3rd Qu.: 20.20
## Max. :-87.49 Max. :49107.15
## NA's :5374
## weekday start_hour
## Length:5900385 Min. : 0.00
## Class :character 1st Qu.:11.00
## Mode :character Median :15.00
## Mean :14.36
## 3rd Qu.:18.00
## Max. :23.00
##
data_cleaned %>% group_by(member_casual) %>% summarise(count = length(ride_id), '%' = (length(ride_id) / nrow(data_cleaned)) * 100)
options(repr.plot.width = 16, repr.plot.height = 8)
ggplot(data_cleaned, aes(member_casual, fill=member_casual)) + geom_bar() + labs(x="Casual vs Annual members", title="Chart 1 - Casual vs Annual members distribution")
data_cleaned %>% group_by(year_month) %>% summarise(count = length(ride_id),
"%" = (length(ride_id) / nrow(data_cleaned)) * 100,
"annual" = (sum(member_casual == "member") / length(ride_id)) * 100,
"casual" = (sum(member_casual == "casual") / length(ride_id)) * 100,
"Annual vs Casual Percent Difference" = annual - casual)
data_cleaned %>%
ggplot(aes(year_month, fill=member_casual)) + geom_bar() + labs(x="Month", title="Chart 2 - Distribution by Month") + coord_flip()
data_cleaned %>% group_by(weekday) %>% summarise(count = length(ride_id),
"%" = (length(ride_id) / nrow(data_cleaned)) * 100,
"annual" = (sum(member_casual == "member") / length(ride_id)) * 100,
"casual" = (sum(member_casual == "casual") / length(ride_id)) * 100,
"Annual vs Casual Percent Difference" = annual - casual)
data_cleaned %>%
ggplot(aes(weekday, fill=member_casual)) + geom_bar() + labs(x="Weekday", title="Chart 3 - Distribution by Weekday")
data_cleaned %>% group_by(start_hour) %>% summarise(count = length(ride_id),
"%" = (length(ride_id) / nrow(data_cleaned)) * 100,
"annual" = (sum(member_casual == "member") / length(ride_id)) * 100,
"casual" = (sum(member_casual == "casual") / length(ride_id)) * 100,
"Annual vs Casual Percent Difference" = annual - casual)
data_cleaned %>%
ggplot(aes(start_hour, fill=member_casual)) + geom_bar() + labs(x="Hour", title="Chart 4 - Distribution by hour of the day")
data_cleaned %>% group_by(rideable_type) %>% summarise(count = length(ride_id),
"%" = (length(ride_id) / nrow(data_cleaned)) * 100,
"annual" = (sum(member_casual == "member") / length(ride_id)) * 100,
"casual" = (sum(member_casual == "casual") / length(ride_id)) * 100,
"Annual vs Casual Percent Difference" = annual - casual)
data_cleaned %>%
ggplot(aes(rideable_type, fill=member_casual)) + geom_bar() + labs(x="Bike type", title="Chart 5 - Distribution by type of Rides")
data_cleaned %>%
group_by(member_casual) %>%
summarise(mean = mean(ride_time_m),
'first_quarter' = as.numeric(quantile(ride_time_m, .25)),
'median' = median(ride_time_m),
'third_quarter' = as.numeric(quantile(ride_time_m, .75)),
'IQR' = third_quarter - first_quarter)
data_cleaned %>% ggplot(aes(x=member_casual, y=ride_time_m, fill=member_casual)) +
labs(x="Member type", y="Riding time", title="Chart 6 - Distribution of Riding time") +
geom_boxplot()
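With a maximum of roughly 49,107 minutes, a handful of extreme rides will squash this boxplot. One common remedy is trimming values beyond the Tukey upper fence (Q3 + 1.5 * IQR, the same rule geom_boxplot uses to flag outliers) before plotting. A base-R sketch with toy durations standing in for ride_time_m:

```r
set.seed(42)
# Toy durations: mostly short rides plus one extreme value, in minutes.
ride_time_m <- c(rexp(1000, rate = 1 / 15), 49107)

q        <- quantile(ride_time_m, c(0.25, 0.75))
fence_hi <- q[[2]] + 1.5 * (q[[2]] - q[[1]])  # Tukey upper fence

trimmed <- ride_time_m[ride_time_m <= fence_hi]
```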
The data is merged from the different CSV files and cleaned along the way; the cleaned data is then stored in a separate CSV file.
The data is correctly formatted.
There are more classic bike riders than riders of any other bike type, which comes as a surprise.
The insights gained from the data can be used to determine the usage of bikes by casual members and convert them to annual members.
The Act phase happens when the insights are actually used to drive the decisions taken by the company.
Annual and Casual members use the bikes differently.
The insights can be used to devise marketing strategies to convert casual riders to annual members of the company.
Stakeholders can further analyse the insights and validate the proposed strategies against the data.
We analysed a case study of a fictional company called “Cyclistic” and gained some insights from the data provided to us. We used the R programming language as our primary tool in preparing, processing and analysing the data, and an R Notebook to share the insights with our stakeholders.
This marks the end of this notebook.