Saturday, July 29, 2017

Weighted Mean Formulations of Data Depth

Data depth is a family of nonparametric methods that provide a measure of centrality by which multivariate data can be ordered. My previous post on data depth was an overview of distance based formulations. Another type of data depth method is based on weighted mean (WM) regions [1]. Weighted mean regions are nested convex regions centered around the geometric center of a distribution. These convex regions are composed of weighted means of the data members, with a general set of restrictions on the weights that ensures their nested arrangement. This arrangement of nested convex (WM) regions is then used to determine the data depth value of each data member. Different strategies for assigning the weights lead to different notions of weighted mean depth. One example, the zonoid depth [2], can be stated as follows.


Let $x, x_1, \ldots , x_n \in \mathbb{R}^d$. Then the zonoid depth of $x$ with respect to $x_1, \ldots , x_n$ is:

$$D_{\textrm{zonoid}}(x|X) = \sup \{ \alpha : x \in D_{\alpha}(x_1, \ldots, x_n) \}$$
where
$$D_{\alpha}(x_1, \ldots, x_n) = \bigg\{ \sum_{i=1}^n \lambda_i x_i: \sum_{i=1}^n \lambda_i=1, 0\leq\lambda_i, \alpha\lambda_i \leq \frac{1}{n} \; \textrm{for all } i\bigg\}$$
Here $D_{\alpha}(\cdot)$ denotes the WM region containing all points with depth at least $\alpha$, also known as the $\alpha$-trimmed region. Note that when $\alpha=1$, the WM region collapses to the mean of the data, while $\alpha \leq \frac{1}{n}$ leads to a WM region that is the convex hull of the data. Other examples of weighted mean depths include the expected convex hull depth and the geometrical depth.
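The supremum in the zonoid depth definition can be computed by solving a linear program: among all ways to write $x$ as a convex combination of the data, minimize the largest weight; the depth is then $1/(n \lambda_{\max})$. Below is a minimal sketch of this idea in Python using scipy.optimize.linprog; the function name and the LP encoding are my own illustration, not code from the referenced papers.

```python
import numpy as np
from scipy.optimize import linprog

def zonoid_depth(x, X):
    """Zonoid depth of point x (length-d array) w.r.t. the rows of X (n x d),
    obtained by minimizing the largest convex-combination weight t."""
    n, d = X.shape
    # Decision variables: [lambda_1, ..., lambda_n, t]; minimize t.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # lambda_i - t <= 0 for all i.
    A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])
    b_ub = np.zeros(n)
    # sum_i lambda_i = 1  and  sum_i lambda_i x_i = x.
    A_eq = np.hstack([np.vstack([np.ones((1, n)), X.T]), np.zeros((d + 1, 1))])
    b_eq = np.concatenate([[1.0], x])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1))
    if not res.success:
        return 0.0          # x lies outside the convex hull of the data
    return 1.0 / (n * res.x[-1])
```

At the mean of the data the minimal maximum weight is $1/n$ and the depth is 1, while at a vertex of the convex hull it is $1/n$, matching the limits noted above.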


Weighted mean based formulations of depth, in comparison to the distance based formulations, are more effective in capturing the shape of the distribution. However, weighted mean formulations are more susceptible to outliers: the shape of the WM regions, and consequently the data depth values, can be strongly influenced by pathological outliers. They are also more computationally expensive and often involve solving an optimization problem.

References:

[1] Mosler, Karl. "Depth statistics." Robustness and complex data structures. Springer Berlin Heidelberg, 2013. 17-34.

[2] Dyckerhoff, Rainer, Karl Mosler, and Gleb Koshevoy. "Zonoid data depth: Theory and computation." COMPSTAT. Physica-Verlag HD, 1996.

Thursday, January 26, 2017

Plotting in Python using Matplotlib

Matplotlib is a great plotting library for working in Python. Despite having used it several times, I usually end up searching online for a template to draw the first figure and for the associated customization details, especially when returning to it after a hiatus. So, in order to save time, I prepared a custom matplotlib boilerplate template (linked below) that contains my most frequently used functionality, with parameters filled in with typical default values.


There are two interfaces (APIs) to matplotlib: a procedural (MATLAB-like) interface and an object oriented interface. An overview of matplotlib and its different interfaces can be found here. The template provided below uses the latter, as it is more powerful and also happens to be my matplotlib interface of choice. At the end of the post is an example figure and the boilerplate code to draw it. As evident, this includes code for the following: drawing functions, scatterplots, heatmaps, contours, quiver plots, labels, legends, adjusting the axis ratio, setting the background color, updating a figure without blocking for input, and writing to disk. Of course, this is still only a fraction of all the functionality provided by matplotlib!
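Since the full template is linked rather than embedded here, the snippet below is only a minimal, hypothetical sketch of the object oriented style, with placeholder data, showing the explicit Figure/Axes pattern that the template builds on:

```python
import numpy as np
import matplotlib.pyplot as plt

# Object-oriented interface: create Figure and Axes objects explicitly.
fig, ax = plt.subplots(figsize=(6, 4))

# A line plot of a function and a scatter plot of noisy samples.
x = np.linspace(0, 2 * np.pi, 200)
ax.plot(x, np.sin(x), color="C0", label="sin(x)")
rng = np.random.default_rng(0)
xs = rng.uniform(0, 2 * np.pi, size=40)
ax.scatter(xs, np.sin(xs) + rng.normal(scale=0.1, size=40),
           s=15, color="C1", label="noisy samples")

# Labels, legend, background color.
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Object-oriented matplotlib sketch")
ax.legend(loc="upper right")
ax.set_facecolor("0.95")

fig.tight_layout()
fig.savefig("example.png", dpi=150)  # write to disk
plt.show()
```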

Lastly, if you are looking for more than just static images, a new generation of visualization tools is emerging, some of which build on matplotlib, offering built-in support for advanced interactivity features such as linking, brushing, and highlighting. Some examples of such tools are plotly, seaborn and mpld3.


Example figure generated by the following Matplotlib code.


Wednesday, December 14, 2016

Distance based Formulations of Data Depth

In an earlier post, I described a taxonomy of the various types of depth formulations. In this post, the focus will be on distance based formulations of data depth. As the name suggests, these are a class of depth formulations that have something (or a lot) to do with the choice of the distance metric in the space of the data. In many of these formulations, for example, the depth of a data object within an ensemble turns out to be related to the inverse of the sum of distances from all the other objects in the ensemble. Since we can choose from among several valid distance metrics, each choice leads to different properties in the corresponding depth formulation. Let us look at some examples:


$L_2$ depth: The $L_2$ depth of a point $x$ with respect to an ensemble $X$ is defined as the inverse of one plus the expected distance of $x$ from a point in the ensemble.

$$D^{L_2} (x|X) = (1+ E||x-X||)^{-1}$$
While being easy to grasp, and also to compute, the $L_2$ depth is unable to properly capture anisotropy in the distribution. For example, in a high-dimensional ellipsoid-shaped point cloud, a point at a given distance from the center along the minor axis (relatively far out in the distribution) is assigned roughly the same depth as a point at the same distance along the major axis (well inside the distribution).
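For concreteness, here is a minimal sketch of the sample version in Python (the function name is my own; the depth is computed against the rows of a data matrix):

```python
import numpy as np

def l2_depth(x, X):
    """Sample L2 depth of point x w.r.t. the rows of X: the inverse of
    one plus the mean Euclidean distance from x to the ensemble members."""
    return 1.0 / (1.0 + np.linalg.norm(X - x, axis=1).mean())
```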


Mahalanobis depth: The Mahalanobis depth is able to capture the anisotropy in the structure of a distribution up to its second moment. It follows the same pattern as the $L_2$ depth, but uses the (squared) Mahalanobis distance of $x$ from the mean of the distribution in place of the expected Euclidean distance.



$$D^{\textrm{Mah}}(x|X) = \big( 1 + (x-E[X])^\top \, \Sigma_X^{-1} \, (x-E[X])\big)^{-1}$$
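A corresponding sketch, again with names of my own choosing, using the sample mean and covariance:

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Sample Mahalanobis depth of x w.r.t. the rows of X, using the
    sample mean and sample covariance matrix."""
    diff = x - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return 1.0 / (1.0 + diff @ cov_inv @ diff)
```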

Projection Depth: In projection depth, the idea is that if we project the points onto a unit direction, the central points tend to lie close to the median of the projections while outlying points deviate from it more strongly. Taking the worst-case (supremum) deviation over all directions, and inverting it, gives depth values that decrease monotonically from the center.


$$ D^{\textrm{proj}}(x|X) = \bigg(1+\sup_{p \in S^{d-1}} \frac{| \langle p,x \rangle - \textrm{med} ( \langle p,X \rangle ) |}{ \textrm{Dmed}(\langle p,X \rangle)} \bigg)^{-1}$$


where $S^{d-1}$ denotes the unit sphere in $\mathbb{R}^d$, $\langle \cdot,\cdot \rangle$ is the inner product, $\textrm{med}(U)$ denotes the univariate median of a random variable $U$, and $\textrm{Dmed}(U)=\textrm{med}(|U-\textrm{med}(U)|)$.
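Because of the supremum over all directions, the exact projection depth is harder to compute. A common workaround, sketched below under the assumption that a Monte Carlo approximation is acceptable, is to sample random unit directions and take the worst observed outlyingness (so the result is an upper bound on the true depth):

```python
import numpy as np

def projection_depth(x, X, n_dirs=1000, seed=0):
    """Monte Carlo approximation of the projection depth of x w.r.t. the
    rows of X: sample unit directions, compute the median-based
    outlyingness of x along each, and invert the largest one found."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_X = X @ dirs.T                    # projections of the ensemble
    proj_x = dirs @ x                      # projections of the query point
    med = np.median(proj_X, axis=0)
    dmed = np.median(np.abs(proj_X - med), axis=0)
    dmed = np.where(dmed > 0, dmed, np.finfo(float).eps)  # degenerate guard
    outlyingness = np.abs(proj_x - med) / dmed
    return 1.0 / (1.0 + outlyingness.max())
```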


Oja Depth: The Oja depth is based on the average volume of simplices (convex hulls of $d+1$ points) whose vertices are drawn from the data. The idea is that the simplices having a central point as one fixed vertex tend, on average, to have a smaller volume than the simplices having an outlying point as the fixed vertex. Again, we can invert the average volume of the simplices associated with a particular point (as the fixed vertex) to get depth values that decrease monotonically from the center.



$$ D^{\textrm{Oja}}(x|X) = \bigg(1+\frac{E \big(\textrm{vol}_{d}(\textrm{co}\{x, X_1,\ldots,X_d\})\big)}{\sqrt{\det\Sigma_{X}}} \bigg)^{-1}$$


where $\textrm{co}$ denotes the convex hull, $\textrm{vol}_d$ is the $d$-dimensional volume, and $\Sigma_X$ is the covariance matrix of $X$.
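The expectation over simplices can likewise be approximated by sampling. The following sketch (function name and sample count are my own choices) averages the volumes of randomly chosen simplices anchored at $x$:

```python
import numpy as np
from math import factorial

def oja_depth(x, X, n_samples=2000, seed=0):
    """Monte Carlo approximation of the Oja depth of x w.r.t. the rows of X:
    average the volumes of simplices with x as a fixed vertex and d vertices
    drawn from the data, normalized by sqrt(det(cov))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    vols = []
    for _ in range(n_samples):
        idx = rng.choice(n, size=d, replace=False)
        edges = X[idx] - x                       # d x d matrix of edge vectors
        vols.append(abs(np.linalg.det(edges)) / factorial(d))
    norm = np.sqrt(np.linalg.det(np.cov(X, rowvar=False)))
    return 1.0 / (1.0 + np.mean(vols) / norm)
```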

While we don't explicitly see a distance metric in the formulations of the projection and Oja depths, they are still considered "distance based" because of the close relationship of the distance metric to the inner product and to volume, respectively. Here is a reference with more information on the ability of the above formulations to discriminate between different distribution structures:


Reference:
 Mosler, Karl. "Depth statistics." Robustness and complex data structures. Springer Berlin Heidelberg, 2013. 17-34. 

Wednesday, November 16, 2016

Visweek 2016 and Baltimore Trip

At the end of last month, I travelled to Baltimore to attend the annual IEEE Visualization conference (a.k.a. Visweek). As in the previous year, the conference also included a session for selected papers published in the IEEE Computer Graphics & Applications (CG&A) journal. I presented the work in my CG&A paper on "Evaluating Shape Alignment via Ensemble Visualization".


This also happened to be my first Visweek conference, and I was excited to be part of it. Apart from the interesting lineup of research presentations, I liked the meetups that were organized on the side during the breaks. These were more informal gatherings for open discussion on specific topics (e.g. the blogging and podcasting meetup!), and I thought they were particularly helpful for newer attendees. Another tradition I found out about, although too late for me this year, was the Velo Club de VIS, a post-vis bicycling club. I hope to be able to make it back next year and join the bike ride in Phoenix. It was also nice to see and meet researchers whom I knew from their work.



This year the conference introduced the "Test of Time" awards to recognize decades-old research that has stood the test of time. It made me think of the incremental nature of research, and of all the other pieces of work that are key to the conceptualization of the most successful research ideas, past and future. I also enjoyed talking to fellow student presenters about their conference experience, often eliciting interesting responses that made me smile, and think. It was a great experience overall; one thing I'd do differently in the future would be to read the abstracts of interesting papers before landing at the conference. This blog post has turned out to be just like my trip, consisting mostly of the conference, and very little of Baltimore!

Saturday, October 15, 2016

Desirable Properties for Data Depth Formulations

In a previous blog post here, I described data depth as a way to quantify how central or deep a member is within a distribution. While this description intuitively makes sense when we think of distributions of points in a Euclidean space, data depth formulations often deal with more abstract objects for which it is not clear what could be considered "deep within a distribution". A clear set of desirable properties helps in such cases to evaluate the utility of a depth formulation. In fact, such properties, in addition to being used to characterize existing formulations of data depth, can also act as an aid for developing new ones.

Zuo and Serfling proposed the following basic properties desired in any depth function. Typically, depth formulations are shown to satisfy these properties under certain assumptions, such as the distribution being continuous and angularly symmetric. Angular symmetry about the origin roughly means that Prob[$x$] = Prob[$-x$], and by this definition an angularly symmetric distribution has its center at $x=0$.

Zuo and Serfling's properties for depth formulations:

1. Null at infinity: Depth of a member falls to zero as its distance from the center of the distribution tends to infinity.
2. Maximum at center: Depth is maximum at the center of angular symmetry of the distribution.
3. Monotonicity: The depth falls off monotonically along any arbitrarily chosen ray pointing outward from the center.
4. Affine invariance: The depth is invariant if the same affine transformation is performed for all members of the population.
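These properties can also be checked numerically for a concrete formulation. The sketch below is my own illustration: it uses the sample Mahalanobis depth to spot-check the maximum-at-center and affine invariance properties on synthetic data.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Sample Mahalanobis depth of x w.r.t. the rows of X."""
    diff = x - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return 1.0 / (1.0 + diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.0], [1.0, 1.0]], size=500)

# Property 2 (maximum at center): depth is 1 at the sample mean and smaller
# for a point away from the center.
print(mahalanobis_depth(X.mean(axis=0), X))        # 1.0
print(mahalanobis_depth(np.array([3.0, 3.0]), X))  # noticeably smaller

# Property 4 (affine invariance): applying the same affine map to the query
# point and to every ensemble member leaves the depth unchanged.
A = np.array([[2.0, 0.5], [0.0, 1.5]])
b = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
print(np.isclose(mahalanobis_depth(x, X),
                 mahalanobis_depth(A @ x + b, X @ A.T + b)))  # True
```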

There are also other properties associated with depth functions in addition to those mentioned above, such as upper semicontinuity, which requires the upper level sets of the depth function to be closed. However, I don't think these are as critical as the ones above. For example, even the popular simplicial depth formulation does not satisfy upper semicontinuity. Furthermore, the absence of upper semicontinuity is not necessarily a drawback; depending on the application, it can enable simplicial depth to capture the structure of the distribution better than formulations that do satisfy it!

Finally, here is the link to Zuo and Serfling's paper for the formal descriptions of these properties: Zuo, Yijun; Serfling, Robert. General notions of statistical depth function. Ann. Statist. 28 (2000)

Wednesday, September 21, 2016

Reference Managers: Moving from Mendeley to Zotero

A reference manager can be a real force multiplier in dealing with the multitude of manuscripts that one has to go through in any kind of research endeavor. There are several reference managers to choose from, many of which even allow you to get started for free.


I have been using Mendeley since the fall of 2013, when I started out with my PhD research, and have found it tremendously helpful for keeping track of the papers I go through. However, I'm currently in the process of making the switch to Zotero, mainly because I wanted the following functionality, which is quite easy to set up in Zotero but not available in Mendeley (yet!):


  • Searching inline annotations in PDFs.
  • Opening the annotated PDFs in multiple windows.

As a caveat, while Zotero seems to be able to do everything that Mendeley supports, much of the functionality needs to be set up by the user. For example, the highly useful "recently read" filter comes off the shelf in Mendeley, but has to be set up by the user as a saved search in Zotero. Talking about searches, I do think that Zotero has a more powerful search capability in general.


While there seem to be options to help move files and metadata (notes and organization structure) from Mendeley to Zotero, I decided to move manually(!) as part of my reorganization. I did end up leaving behind a whole bunch of references which I thought were unlikely to be needed again in the foreseeable future. This worked out because the core corpus of files that I routinely work with is (surprisingly) still of a manageable size. While I'm currently under Zotero's 300MB free storage quota, I do plan to upgrade to the 2GB plan soon. There are also ways to set up alternate online file storage for the so inclined.



One final thing to mention would be the Zotfile add-on (yup, in addition to being open source, Zotero allows its functionality to be extended via add-ons!). Zotfile automates some essential tasks such as adding new files to Zotero items, batch renaming/moving attachments and, perhaps most important for me personally, extracting inline annotations from PDF files. Zotfile extracts the inline annotations from the PDFs and converts them into Zotero item attachments---which are searchable! This workflow does lead to duplicate Zotero attachments when updating a PDF's inline notes, as Zotfile will then create a new attachment corresponding to the updated annotations. A simple solution to remove the duplicates is to create a Zotero saved search to filter (based on timestamps) and delete the attachments corresponding to the older inline annotations.

Monday, September 5, 2016

Data Depth Primer

My current research revolves around the notion of data depth, so I thought that an introductory primer on data depth would be a good topic to kick off research-related posts. This first post introduces data depth and a taxonomy that is able to classify all of the different types of data depth formulations that I've come across so far.



What is Data Depth? Given an ensemble of data objects, typically drawn from an underlying probability distribution, data depth is a method to quantify how central or deep any particular data object is with regard to that distribution. For example, given an ensemble of points drawn from a bivariate normal distribution, data depth can help identify the most central point as well as quantify the centrality of any point in the plane with regard to that particular distribution. 
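To make the bivariate example concrete, here is a small, hypothetical Python sketch that orders the members of a normal sample by a simple distance based notion of depth (the inverse of one plus the mean distance to the ensemble) and picks out the most central one:

```python
import numpy as np

# Ensemble of points drawn from a bivariate normal distribution.
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]], size=200)

def mean_distance_depth(x, X):
    """Toy depth: inverse of one plus the mean distance from x to the ensemble."""
    return 1.0 / (1.0 + np.linalg.norm(X - x, axis=1).mean())

# Depth of every member induces a center-outward ordering of the ensemble.
depths = np.array([mean_distance_depth(p, X) for p in X])
order = depths.argsort()[::-1]             # deepest (most central) member first
print("most central member:", X[order[0]])
print("depth of an off-center point:", mean_distance_depth(np.array([3.0, 3.0]), X))
```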



As can be guessed from the example, data depth methods are interesting for working with ensembles of complex data types where a method to order the data members is not obvious or trivial, as it is, for instance, in the case of points on the real number line. For data types more complex than points on the real number line, several kinds of formulations of data depth have been developed. The figure below shows some examples of these formulations, and how they can be classified into various (overlapping) classes based on the characteristics of the data types that they operate on as well as characteristics of the formulations themselves.



A taxonomy of data depth formulations.

In follow up posts, I plan to dig into the various classes of data depth and their respective properties in more depth (pun intended). As seen in the above figure, these classes are:

  • Distance based formulations
  • Weighted mean formulations
  • Band based formulations
  • Data depth formulations for Euclidean data
  • Data depth formulations for Non-Euclidean data
  • Kernelized formulations

A useful reference: Mosler, Karl. "Depth statistics." Robustness and complex data structures. Springer Berlin Heidelberg, 2013. 17-34. 
get arXiv version

Also, here is the link to my next blog post on data depth where I enumerate the important desirable properties expected in data depth formulations.