Cleaning and domain knowledge in data analysis

03 Nov 2021

There’s a wealth of information out there on data, data analysis, and tips and tricks of the trade. The R community in particular is extremely active, and there are many free, open-source resources to help people along the way. One resource I came across recently is Data Management in Large Scale Educational Research by Crystal Lewis.

In particular, the section ‘Data Cleaning Plan’ gives an excellent overview of an often-overlooked aspect of data analysis: data cleaning, or processing. Many people who work with data will tell you that something like 80% of their time is spent cleaning and transforming data before any real analysis can begin. Yet in many of the data analysis programs and online courses I’ve seen, most of the time goes to functions, code, and visualization. Good data analysis starts with clean, usable data. Here’s a good infographic of the data analysis cycle, taken from this Medium article by Durgesh Anand.

Cross Industry Standard Process for Data Mining

Although all aspects of this cycle are important, that sliver of Data Preparation is instrumental in getting the best output from your analysis.

In a similar vein, Data Understanding, or domain knowledge, is also critical to framing your research question and to transforming and cleaning your data in a way that reveals useful, actionable insights. A great resource for learning techniques for working with data in the educational sector is Data Science in Education Using R, another great open resource. Each domain has its own challenges, and the authors take great care to lay out the issues in educational data research: ethical and legal concerns, the lack of processes and guidelines that plagues many institutions, and other considerations.

While I am a firm believer that if you are great at data analysis you can be successful in most industries, there is much to be said for people with domain knowledge and what they can bring to an organization seeking insights from its data. That kind of knowledge is built up over years of experience in an industry. And while all good data analysts will seek to stay on top of their game by sharpening existing skills and gaining new ones, domain knowledge is just as critical.