Saturday, July 29, 2017

Weighted Mean Formulations of Data Depth

Data depth is a family of nonparametric methods that provide a measure of centrality by which multivariate data can be ordered. My previous post on data depth was an overview of distance based formulations. Another type of data depth method is based on weighted mean (WM) regions [1]. Weighted mean regions are nested convex regions that are centered around the geometric center of a distribution. These convex regions are composed of weighted means of the data members, with a general set of restrictions on the weights that ensure their nested arrangement. This arrangement of nested convex (WM) regions is then used to determine the data depth value of each data member. Various strategies for assigning the weights lead to different notions of weighted mean depth. One example of a weighted mean depth is the zonoid depth [2], which can be stated as follows.


Let $x, x_1, \ldots , x_n \in \mathbb{R}^d$ and write $X = \{x_1, \ldots , x_n\}$. Then the zonoid depth of $x$ with respect to $x_1, \ldots , x_n$ is:

$$D_{\textrm{zonoid}}(x|X) = \sup \{ \alpha : x \in D_{\alpha}(x_1, \ldots, x_n) \}$$
where
$$D_{\alpha}(x_1, \ldots, x_n) = \bigg\{ \sum_{i=1}^n \lambda_i x_i: \sum_{i=1}^n \lambda_i=1, 0\leq\lambda_i, \alpha\lambda_i \leq \frac{1}{n} \; \textrm{for all } i\bigg\}$$
Here $D_{\alpha}(\cdot)$ denotes the WM region containing the points with depth at least $\alpha$; it is also known as the $\alpha$-trimmed region. Note that when $\alpha=1$, the WM region collapses to the mean of the data, while $\alpha \leq \frac{1}{n}$ leads to a WM region that is the convex hull of the data. Other examples of weighted mean depths include expected convex hull depth and geometrical depth.
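
The definition above can be turned into a small linear program. The constraint $\alpha\lambda_i \leq \frac{1}{n}$ is the same as $\lambda_i \leq \frac{1}{n\alpha}$, so the largest feasible $\alpha$ corresponds to the smallest possible value $t$ of $\max_i \lambda_i$, and the depth is then $\frac{1}{nt}$. The following is only a minimal sketch of that idea, not the algorithm from [2]: the function name `zonoid_depth` is my own, it assumes NumPy and SciPy are available, and it assigns depth 0 to points for which the program is infeasible (points outside the convex hull).

```python
import numpy as np
from scipy.optimize import linprog


def zonoid_depth(x, points):
    """Zonoid depth of x with respect to the rows of `points` (an n x d array).

    Illustrative sketch: minimizes t = max_i lambda_i subject to
    sum_i lambda_i x_i = x, sum_i lambda_i = 1, lambda_i >= 0.
    """
    X = np.asarray(points, dtype=float)
    x = np.asarray(x, dtype=float)
    n, d = X.shape

    # Decision variables: lambda_1, ..., lambda_n, t. Objective: minimize t.
    c = np.zeros(n + 1)
    c[-1] = 1.0

    # Inequalities lambda_i - t <= 0, so t bounds every weight from above.
    A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])
    b_ub = np.zeros(n)

    # Equalities: the weighted mean equals x (d rows), weights sum to 1 (1 row).
    A_eq = np.zeros((d + 1, n + 1))
    A_eq[:d, :n] = X.T
    A_eq[d, :n] = 1.0
    b_eq = np.append(x, 1.0)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1), method="highs")
    if not res.success:
        # Assumption: an infeasible program means x is outside the convex hull.
        return 0.0
    # Clip to 1 to guard against tiny numerical overshoot.
    return min(1.0, 1.0 / (n * res.x[-1]))


corners = [[0, 0], [2, 0], [0, 2], [2, 2]]
for query in ([1, 1], [0, 0], [3, 3]):
    print(query, zonoid_depth(query, corners))
```

For the four corners of a square, the mean of the data should come out with depth close to 1, a corner with depth close to $\frac{1}{n}$, and a point outside the square with depth 0, matching the two limiting cases noted above.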


Weighted mean based formulations of depth, in comparison to the distance based formulations, are more effective in capturing the shape of the distribution. However, weighted mean formulations are more susceptible to outliers in the data, because the shape of the WM regions, and consequently the data depth, can be strongly influenced by pathological outliers. They are also more computationally expensive, since evaluating the depth often involves solving an optimization problem.

References:

[1] Mosler, Karl. "Depth statistics." Robustness and complex data structures. Springer Berlin Heidelberg, 2013.

[2] Dyckerhoff, Rainer, Karl Mosler, and Gleb Koshevoy. "Zonoid data depth: Theory and computation." COMPSTAT. Physica-Verlag HD, 1996.

Thursday, January 26, 2017

Plotting in Python using Matplotlib

Matplotlib is a great plotting library when working in Python. Despite having used it several times, I usually end up searching online for a template to draw the first figure and for the associated customization details, especially when coming back to it after a hiatus. So, in order to save time, I prepared a custom matplotlib boilerplate template (linked below) that contains my most frequently used functionality, with parameters filled in with typical default values.


There are two interfaces (APIs) to use matplotlib: a procedural (MATLAB-like) interface and an object oriented interface. An overview of matplotlib and its different interfaces can be found here. The template provided below is of the latter type, as it is more powerful and also happens to be my matplotlib interface of choice. At the end of the post is an example figure and the boilerplate code to draw it. It includes code for the following: drawing functions, scatterplots, heatmaps, contours, quiver plots, labels, legends, adjusting the axis ratio, setting the background color, updating a figure without blocking for input, and writing to disk. Of course, this is still only a fraction of all the functionality provided by matplotlib!
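
To make the difference between the two interfaces concrete, here is a minimal sketch that draws the same curve both ways. It is separate from the boilerplate template and only illustrative: the data, labels and styling values are placeholders I picked, and the `Agg` backend and in-memory buffer are used only so that the snippet runs without a display or a file on disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# Procedural (MATLAB-like) interface: pyplot tracks the "current" figure and axes.
plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.legend()
plt.close()

# Object oriented interface: hold explicit Figure and Axes objects and call their methods.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_aspect("equal")            # adjust the axis ratio
ax.set_facecolor("whitesmoke")    # set the background color of the axes
ax.legend()

# Write the figure as PNG; a file path can be passed instead of the buffer.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```

With the object oriented interface every call names the axes it acts on, which tends to matter once a figure has several subplots.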

Lastly, if you are looking for more than just static images, a new generation of visualization tools is emerging, some of which build on matplotlib, that offer built-in support for advanced interactivity features such as linking, brushing and highlighting, among others. Some examples of such tools are plotly, seaborn and mpld3.


Example figure generated by the following Matplotlib code.