What makes a good Data Scientist? A good data scientist is a software engineer with a solid background in statistics, or a statistician who likes to code. I am a software engineer with a solid background in statistics, and in this blog I want to share that statistical knowledge, focusing on the foundational tasks every software engineer and statistician should know.
In my last post, I introduced the very basics: Classification and Regression. In this post, I want to cover two methods that are statistical in nature and can also be used in Data Quality exercises: Similarity Matching and Clustering. Both can help Data Quality and Data Governance teams who are looking to reduce data duplication, and also to predict correct attribute values in the absence of authoritative data.
Similarity Matching is a foundational task that can support classification and regression work later on. Here, we try to identify similar data members based on the known attributes of those members. For example, a company may use similarity matching to find new customers who closely resemble its very best customers, so they can be targeted with special offers or other retention strategies. Or a company may look for similarities across raw-materials data from its vendors to optimize costs.
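To make the idea concrete, here is a minimal sketch of similarity matching: given a "best" customer, rank the others by how close their attribute vectors are. The customer attributes and values below are invented for illustration, and I use plain Euclidean distance; in practice you would scale each attribute first (otherwise spend dominates) and likely use a richer similarity measure.

```python
import numpy as np

# Hypothetical customer feature matrix: each row is a customer,
# columns are illustrative attributes (annual spend, visits/month, tenure).
customers = np.array([
    [1200.0, 8.0, 5.0],   # customer 0: the "best" customer we match against
    [1150.0, 7.0, 4.0],   # customer 1: close on every attribute
    [ 300.0, 1.0, 1.0],   # customer 2: a very different profile
    [ 700.0, 5.0, 3.0],   # customer 3: somewhere in between
])

def most_similar(target_idx, data):
    """Return the index of the row nearest to data[target_idx] (Euclidean)."""
    target = data[target_idx]
    dists = np.linalg.norm(data - target, axis=1)
    dists[target_idx] = np.inf          # exclude the target itself
    return int(np.argmin(dists))

print(most_similar(0, customers))  # → 1, the customer most like customer 0
```

The same nearest-neighbour idea, applied across records rather than customers, is what powers duplicate detection in data quality work: two rows whose attribute vectors sit unusually close together are candidates for being the same real-world entity.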
Clustering is another foundational task, in that it can be preliminary to further exercises. Clustering attempts to find natural groupings of data entities, without necessarily being driven by a particular purpose. The results can then feed decision-making and further machine learning: What products or services should we offer these customers? Is the population large enough to market to specifically?
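A classic way to find those natural groupings is k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below uses invented 2-D customer data with two obvious groups; for simplicity it seeds the centroids with evenly spaced rows, where a real implementation would use random or k-means++ style seeding.

```python
import numpy as np

# Hypothetical 2-D customer data with two natural groupings;
# the points are invented purely to illustrate the idea.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],    # a group around (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.3],    # a group around (8, 8)
])

def kmeans(data, k, iters=20):
    """A minimal k-means: alternate assigning points to their nearest
    centroid and moving each centroid to the mean of its points."""
    # deterministic seeding for this sketch: k evenly spaced rows
    seeds = np.linspace(0, len(data) - 1, k).astype(int)
    centroids = data[seeds].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels

labels = kmeans(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```

Note that nothing told the algorithm what the groups mean; it only found that the data falls naturally into two clumps. Interpreting those clumps, and deciding whether each is worth acting on, is the follow-up exercise the post describes.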
In my next post, I’ll continue differentiating data science tasks by character and purpose. Many tasks are related, so we’ll talk about some that complement the ones already under discussion.