How I Used Data Engineering to Build the Data-driven Culture in a Student Organization — Part 1: Catching the Problems
Let me start with a story.
In 2020, I became one of the staff of a student organization at my college. It was a newborn organization, and my year was its first cabinet. Early in the year, we had to define a ‘Grand Design’ of the events we planned to hold throughout the year. The Grand Design had to include parameters for each event to measure whether it was held successfully or not. The results also needed to be written into an accountability report to be published at the end of our management period.
To make sure our events met those parameters, we decided to distribute a feedback survey via Google Forms to the participants. We mostly asked, on a Likert scale, how satisfied the participants were with:
- The speaker
- The committee’s service
- The overall event
It also asked for criticism and recommendations so the committee members could get better at organizing the next event. Besides that, we asked for some demographic information, such as their current job and their college or workplace. Furthermore, we decided that only participants who filled in the survey could get the certificate of attendance/completion, which encouraged them to respond to the feedback survey (and so we also gathered their full names and emails). The image below shows a sample response for one event.
This pattern of holding an event and then sharing a feedback survey ran smoothly. We could also work out the accountability report for each event.
But, as time went by, we saw opportunities to improve how the collected feedback is managed and analyzed.
Possible Data Analysis
For single-event feedback analysis, the Google Sheet containing the Form responses is enough to work out the satisfaction rate. But that’s all. A satisfaction rate. In a single sheet.
What if we want to compare two or more events? How about tracking the improvement of the rates over time? That’s hard to do in a spreadsheet.
It also becomes complicated to analyze the other data, especially the college/workplace field. The Google Form asked for it in a free-text field, so the answers are highly varied and hard to analyze. We need to clean them.
And once we have cleaner data, how about building a dashboard that visualizes it better?
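For instance, a minimal cleaning sketch in Python with pandas could look like the following. The file name, column names, and mapping here are hypothetical, and a real mapping would cover far more spellings:

```python
import pandas as pd

# Hypothetical export of a single event's Form responses.
responses = pd.read_csv("event_a_responses.csv")

# Free-text answers for the same institution arrive in many spellings.
institution_map = {
    "univ of x": "University of X",
    "university of x": "University of X",
    "uni x": "University of X",
}

responses["institution_clean"] = (
    responses["institution"]
    .str.strip()
    .str.lower()
    .map(institution_map)
    .fillna(responses["institution"].str.strip())  # keep unmapped values as-is
)
```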
Multi-dimensional Analysis
Okay, imagine we already have cleaner data. Another possibility is multi-dimensional analysis, which also means we will be able to answer more questions. Let’s look at some examples:
- What is the speaker satisfaction rate for Event A, as rated by participants who are students at the University of X?
- How many participants attended the events from June to August?
- Do participants from the University of X tend to give higher ratings? How about their criticism: what sentiment does it carry?
- And many more…
Cleaner and better-prepared data opens up room for more analysis and more insights.
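To make this concrete, here is a rough pandas sketch of how those questions could be answered once the data is cleaned and combined into one table. The file name, column names, and score fields are assumptions for illustration:

```python
import pandas as pd

# Hypothetical combined, cleaned dataset: one row per survey response.
responses = pd.read_csv("all_responses_clean.csv", parse_dates=["event_date"])

# Speaker satisfaction for Event A, rated by students from the University of X.
mask = (
    (responses["event_name"] == "Event A")
    & (responses["job"] == "Student")
    & (responses["institution_clean"] == "University of X")
)
print(responses.loc[mask, "speaker_score"].mean())

# Number of participants per event from June to August.
summer = responses[responses["event_date"].dt.month.between(6, 8)]
print(summer.groupby("event_name")["email"].nunique())

# Do participants from the University of X give higher overall ratings?
is_univ_x = responses["institution_clean"] == "University of X"
print(responses.groupby(is_univ_x)["overall_score"].mean())
```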
Historical Data Gathering (Data Warehouse)
I’ve mentioned that my year was the first cabinet of the organization. At the time this article was published, the second cabinet had already started holding events. And just as the first cabinet did, they also distribute feedback surveys via Google Forms.
As more sheets are produced, we need to manage how we store them. It could be a business rule, such as preparing a Google Drive folder to hold all feedback sheets and asking each committee to upload its sheet there. Since a sheet is structured data, an alternative is to load it into a data warehouse platform, say Google BigQuery.
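As a sketch, and assuming a BigQuery project and dataset already exist and credentials are configured, loading a cleaned sheet export could look roughly like this (the project, dataset, and table names are hypothetical):

```python
import pandas as pd
from google.cloud import bigquery

# Assumes application-default credentials; requires the pyarrow package.
client = bigquery.Client(project="my-org-project")

df = pd.read_csv("event_a_responses_clean.csv")

job = client.load_table_from_dataframe(
    df,
    "my-org-project.feedback.responses",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
)
job.result()  # wait for the load job to finish
```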
The advantage of storing data from previous events is that we can analyze it along the time dimension. Moreover, some events are the “next version” of a previous year’s event. In that case, once again, some analytical questions come up.
Is there a decrease in the satisfaction rate or the number of participants? What improvement have we achieved compared to the previous year? Did we hold the event better?
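With all responses in one warehouse table, questions like these boil down to a simple query. The sketch below runs against the hypothetical table from the previous snippet and assumes year and score columns exist:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-org-project")

# Compare participation and average overall score per edition of the event.
query = """
SELECT
  event_year,
  COUNT(*) AS participants,
  AVG(overall_score) AS avg_overall_score
FROM `my-org-project.feedback.responses`
WHERE event_name = 'Event A'
GROUP BY event_year
ORDER BY event_year
"""

for row in client.query(query).result():
    print(row.event_year, row.participants, round(row.avg_overall_score, 2))
```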
And this is only the second cabinet. Imagine the next cabinet, and the one after that, needing to do this kind of analysis. It’s better to prepare the infrastructure now, isn’t it?
Machine Learning Application and Language-based Analysis
The feedback survey also asked for criticism and recommendations in a paragraph field. Here, it’s possible to apply machine learning to the analysis.
Besides the popular “sentiment analysis”, we could also analyze which aspects the participants criticize most often. Is it the committee’s service? Is the sound unclear (we held many virtual events during the pandemic)? Or maybe they dislike the Master of Ceremony?
By answering those questions, the committee will be able to make targeted improvements to the next event.
This kind of analysis has to be done programmatically, let’s say in Python. We certainly couldn’t do it in a spreadsheet.
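As a small illustration, a first sentiment pass over the criticism column could start from an off-the-shelf model. This is only a sketch: the default model below is English-only, so feedback written in another language would need a language-appropriate model, and the example texts are made up:

```python
from transformers import pipeline

# Off-the-shelf sentiment model (downloads weights on first run).
sentiment = pipeline("sentiment-analysis")

critics = [
    "The speaker was great, but the sound kept cutting out.",
    "The MC talked too much between sessions.",
]

for text, result in zip(critics, sentiment(critics)):
    print(f"{result['label']:>8} ({result['score']:.2f})  {text}")
```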
What’s Next?
So far, we have caught the problems in the current feedback analysis method. There is definitely room to improve. We have also identified some technologies we could use to build our solution.
- Dashboard for better visualization.
- Google Drive to store all the feedback spreadsheets, or, as an alternative, Google BigQuery.
- Python for machine learning.
But, wait!
There is a more fundamental requirement: preparing cleaner data. We need a data engineering process.
For now, this is the main objective.
In the next part of this article, I will talk about developing the data pipeline used to serve better-prepared data for analysis.
See ya in the next part!