Gathering Data and Identifying Key Variables
If you’ve been watching the 2020 Olympic Games from Tokyo, you’ve likely come to appreciate how much effort goes into achieving athletic excellence. Athletes are closely coached and monitored, and they receive guidance on each performance. Their progress is tracked and recorded, and data plays a big part in their training.
I worked for a fitness-training startup that offered personalized recommendations to help people reach their fitness goals. The company wanted to know if applying data analytics and artificial intelligence/machine learning (AI/ML) techniques could answer some of their business questions and enhance trainees’ performance. I told them that indeed it was a doable task. But with limited resources, a small budget, and no data whatsoever, I soon realized the big challenge ahead of me.
How could I deliver actionable insight to add value to this startup? It was time to think out of the box.
To design and implement an effective low-cost solution, I combined the powerful cloud-based collaboration tools in Google Workspace (Forms, Sheets, Docs, etc.) with the Trifacta Data Engineering Cloud and the open-source R language and its comprehensive package ecosystem, serving up the results as fully interactive tables and easy-to-digest visualizations in Google Data Studio.
This low-cost analytics solution allowed me to generate the relevant data, reshape and refine it, and visually discover and extract actionable insights. These insights could be used immediately to address the fitness-training startup’s issues, and to help its trainers and domain experts deliver recommendations and customized workout routines designed to enhance trainees’ health and athletic performance.
This use case demonstrates two key points: 1) it’s possible to deliver analytics solutions for an organization of any size, and 2) you don’t need large data volumes to harness the power of AI/ML and achieve reliable results. Even in this almost-no-data situation, it was possible to design and implement an end-to-end analytics solution that answered the startup’s business questions.
I’ll be exploring this use case in a 3-part blog series.
- Blog 1: Gathering Data and Identifying Key Variables
- Blog 2: Graph Analysis and Weighted Association Rules Mining (WARM)
- Blog 3: Implementing a Content-Based Recommendation Engine (CBRE)
Creating Online Surveys to Collect Data
The toughest challenge I had to tackle in this use case was the absence of reliable data. The easiest and cheapest solution was to use a free tool, Google Forms, to build a few online surveys and ask the trainees to complete them. I took advantage of free online tutorials and videos to learn how to improve survey results. These taught me how to build Google Forms surveys using different question types, including single- and multiple-choice, short-answer, open-ended narrative, checkbox, and dropdown questions.
The first survey was designed to explore the trainees’ experience and satisfaction with the app. Using Google Forms’ visualization capabilities, I easily identified a few issues that had thus far gone undetected, and they were quickly addressed. Talk about immediate ROI!
Figure 1 shows the survey used to collect trainees’ general data, such as age, sex, and email address (used as a unique identifier).
Figure 1: Survey to Collect Trainees’ General Data
Figure 2 shows the general survey responses saved as a Google Sheet. This free, easy-to-use, cloud-based spreadsheet tool makes it simple to share sheets and configure access and roles for multiple users.
Figure 2: Survey Results Saved as a Google Sheet
Preparing data for analysis is the most important step in any data science workflow. Considering the extremely limited data I had to work with for this use case, I had to apply advanced data preparation techniques to extract every last drop of information from the survey responses. (This handy tutorial shows how to structure survey data in an easy-to-leverage format.) As shown in Figure 3, I uploaded the survey response data to the Trifacta Data Engineering Cloud.
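To give a concrete (if simplified) sense of the kind of reshaping involved, the sketch below unpivots multi-select checkbox answers into one indicator column per option — the wide format that downstream analysis expects. The column names and values here are hypothetical stand-ins, not the startup’s actual survey fields.

```python
# Sketch: turn multi-select checkbox answers ("Yoga, Running") into
# one 0/1 indicator column per option. Data is made up for illustration.
responses = [
    {"email": "a@x.com", "activities": "Yoga, Running"},
    {"email": "b@x.com", "activities": "Running"},
    {"email": "c@x.com", "activities": "Yoga, Swimming"},
]

# Collect the full set of options seen across all responses.
options = sorted({opt.strip()
                  for row in responses
                  for opt in row["activities"].split(",")})

# Build one indicator column per option for each respondent.
wide = []
for row in responses:
    chosen = {opt.strip() for opt in row["activities"].split(",")}
    wide.append({"email": row["email"],
                 **{opt: int(opt in chosen) for opt in options}})
```

In a tool like Trifacta this same unpivot/one-hot step is built with visual recipe steps rather than code, but the transformation is the same.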
Figure 3: Data Uploaded in the Trifacta Data Engineering Cloud
Using Principal Component Analysis to Explore Data
I used a well-known dimensionality reduction technique called Principal Component Analysis (PCA) to explore the survey responses. PCA projects the data onto a small number of new axes — the principal components — that capture most of the variance, and the variables that contribute most to those components are the most relevant and informative for analysis. Using the PCA results, I selected a reduced set of key variables from the survey responses.
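The PCA in this post was run in R (see the packages below); as a minimal, language-agnostic illustration of what PCA measures, the Python sketch here computes the share of variance captured by the first principal component for two standardized toy variables. The numbers are made up; for two standardized variables with correlation c, the covariance matrix is [[1, c], [c, 1]] and its eigenvalues are 1 + c and 1 − c.

```python
import statistics

# Two made-up, strongly related variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

def standardize(v):
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(u - m) / s for u in v]

xs, ys = standardize(x), standardize(y)
n = len(xs)
c = sum(a * b for a, b in zip(xs, ys)) / (n - 1)  # Pearson correlation

# Eigenvalues of the 2x2 correlation matrix = variances of the two PCs.
lam1, lam2 = 1 + abs(c), 1 - abs(c)
explained = lam1 / (lam1 + lam2)  # share of total variance on PC1
```

Because the two toy variables move almost in lockstep, one component carries nearly all the variance — which is exactly why PCA can compress 27 survey variables into a much smaller informative set.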
Figure 4 shows the data in the Trifacta Data Engineering Cloud and the recipe implemented to reshape the data and ready it for PCA. Additional recipes were implemented to reshape the data for advanced visualization, graph analysis, and the overall data modeling requirements. I’ll discuss these in more detail in the remaining blogs in this series.
Figure 4: Data Anonymized for PCA in the Trifacta Data Engineering Cloud
I relied on the R packages FactoMineR and factoextra to carry out the PCA and build some useful visualizations. The packages’ ability to handle quantitative and categorical variables together was key to obtaining reliable results quickly. Statistical Tools for High-Throughput Data Analysis (STHDA) is an outstanding source for tutorials and examples of applications of the R packages already mentioned.
Figure 5 shows one of the graphical tools used to visualize the PCA results. The plot can be interpreted as follows: variables located far from the intersection of the Dim1 (PC1) and Dim2 (PC2) axes account for the greatest variability in the data — that is, they are the most informative or impactful variables. Taking into account the suggestions of the trainers and the startup’s domain experts, 16 of the 27 variables (highlighted inside the dotted blue curve) were selected.
Figure 5: PCA Results – Selection of Key Variables
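The visual selection in Figure 5 can also be expressed programmatically: once a variable’s coordinates (loadings) on the first two dimensions are known, its distance from the origin of the Dim1/Dim2 plane measures how much of its variability those components capture. The variable names, loadings, and the 0.5 cutoff below are all hypothetical examples, not the values used in the post.

```python
import math

# Hypothetical loadings of a few made-up survey variables on (Dim1, Dim2).
loadings = {
    "weekly_sessions": (0.82, 0.31),
    "sleep_hours":     (0.05, 0.08),
    "age":             (-0.41, 0.67),
    "app_rating":      (0.10, -0.12),
}

# Distance from the axis intersection: larger = more informative variable.
dist = {v: math.hypot(d1, d2) for v, (d1, d2) in loadings.items()}

# Keep variables above an (arbitrary, illustrative) cutoff.
selected = sorted(v for v, d in dist.items() if d > 0.5)
```

In practice the cutoff is a judgment call — here it was made together with the trainers and domain experts rather than fixed by a formula.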
Armed with the PCA results, the next step was to return to the Trifacta Data Engineering Cloud to reformat the available data and generate the inputs for graph analysis and Weighted Association Rules Mining (WARM). I’ll explain these results in Blog #2 of this series. Stay tuned!