Monday, September 5, 2016

Data Depth Primer

My current research revolves around the notion of data depth, so I thought an introductory primer on data depth would be a good topic to kick off my research-related posts. This first post introduces data depth and a taxonomy that classifies all of the data depth formulations that I've come across so far.



What is Data Depth? Given an ensemble of data objects, typically drawn from an underlying probability distribution, data depth is a method to quantify how central or deep any particular data object is with regard to that distribution. For example, given an ensemble of points drawn from a bivariate normal distribution, data depth can help identify the most central point as well as quantify the centrality of any point in the plane with regard to that particular distribution. 
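To make the bivariate normal example concrete, here is a minimal sketch using the Mahalanobis depth, one common distance-based formulation, defined as 1 / (1 + d²) where d is the Mahalanobis distance of a point from the sample mean. The function name `mahalanobis_depth_2d`, the sample size, and the correlation used to generate the data are my own illustrative choices, not part of any particular library.

```python
import random

def mahalanobis_depth_2d(point, sample):
    """Mahalanobis depth of `point` w.r.t. a 2-D sample: 1 / (1 + d^2)."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    # Entries of the sample covariance matrix (unbiased, divides by n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)

random.seed(0)
# Correlated bivariate normal sample built from two independent standard normals.
sample = []
for _ in range(500):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    sample.append((u, 0.6 * u + 0.8 * v))

# Depth of every ensemble member, and the most central (deepest) member.
depths = [mahalanobis_depth_2d(p, sample) for p in sample]
deepest = sample[depths.index(max(depths))]

# Depth is defined for any point in the plane, not only for ensemble members.
sample_mean = (sum(x for x, _ in sample) / len(sample),
               sum(y for _, y in sample) / len(sample))
far_point = (10.0, -10.0)
print(deepest, mahalanobis_depth_2d(sample_mean, sample),
      mahalanobis_depth_2d(far_point, sample))
```

The sample mean attains the maximum depth of 1, and depth decays toward 0 as a point moves away from the bulk of the data, which is exactly the center-outward ordering that data depth is meant to provide.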



As can be guessed from the example, data depth methods are interesting for working with ensembles of complex data types where a method to order the data members is not obvious or trivial, as it is, for instance, for points on a real number line. For data types more complex than points on a real number line, several kinds of data depth formulations have been developed. The figure below shows some examples of these formulations and how they can be classified into various (overlapping) classes based on the characteristics of the data types that they operate on as well as the characteristics of the formulation itself.
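For the easy case of points on a real number line, the center-outward ordering can be sketched with the univariate halfspace (Tukey) depth: the depth of a value is the smaller of the fractions of the sample lying at or below it and at or above it. The function name `halfspace_depth_1d` and the toy values are illustrative assumptions on my part.

```python
def halfspace_depth_1d(x, sample):
    """Univariate halfspace depth: smaller fraction of the sample on either side of x."""
    n = len(sample)
    at_or_below = sum(1 for s in sample if s <= x)
    at_or_above = sum(1 for s in sample if s >= x)
    return min(at_or_below, at_or_above) / n

values = [2.0, 9.0, 4.0, 7.0, 5.0]

# Rank the ensemble from most central to most outlying.
ranked = sorted(values, key=lambda v: halfspace_depth_1d(v, values), reverse=True)
print(ranked)  # the median, 5.0, comes first; the extremes 2.0 and 9.0 come last
```

On the real line this simply recovers the median as the deepest point. The formulations in the taxonomy generalize this kind of ordering to data types where "at or below" has no obvious meaning.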



A taxonomy of data depth formulations.

In follow-up posts, I plan to dig into the various classes of data depth and their respective properties in more depth (pun intended). As seen in the figure above, these classes are:

  • Distance based formulations
  • Weighted mean formulations
  • Band based formulations
  • Data depth formulations for Euclidean data
  • Data depth formulations for Non-Euclidean data
  • Kernelized formulations

A useful reference: Mosler, Karl. "Depth statistics." Robustness and Complex Data Structures. Springer Berlin Heidelberg, 2013. 17-34. (An arXiv version is also available.)

Also, my next blog post on data depth enumerates the important desirable properties expected of data depth formulations.
