Wednesday, December 14, 2016

Distance based Formulations of Data Depth

In an earlier post, I described a taxonomy of the various types of depth formulations. In this post, the focus is on distance based formulations of data depth. As the name suggests, these are a class of depth formulations whose behavior depends on the choice of distance metric in the space of the data. In many of these formulations, for example, the depth of a data object within an ensemble turns out to be related to the inverse of its average distance from the other objects in the ensemble. Since we can choose from among several valid distance metrics, each choice leads to different properties in the corresponding depth formulation. Let us look at some examples:


$L_2$ depth: The $L_2$ depth of a point $x$ in an ensemble $X$ is defined as the inverse of one plus the expected distance of $x$ from a point drawn from the ensemble.

$$D^{L_2} (x|X) = (1+ E||x-X||)^{-1}$$
While easy to grasp and to compute, the $L_2$ depth is unable to properly capture anisotropy in the distribution. For example, in a high-dimensional ellipsoid-shaped point cloud, a point at a given distance from the center along the minor axis tends to be assigned a depth similar to that of a point at the same distance along the major axis, even though the former is far more outlying relative to the shape of the distribution.
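To make this concrete, here is a minimal numpy sketch of the empirical $L_2$ depth, with the expectation replaced by a sample mean; the function name and the toy ellipsoidal sample are my own choices for illustration.

```python
import numpy as np

def l2_depth(x, X):
    """Empirical L2 depth of point x with respect to sample X (n x d array)."""
    dists = np.linalg.norm(X - x, axis=1)   # ||x - X_i|| for every ensemble member
    return 1.0 / (1.0 + dists.mean())       # (1 + E||x - X||)^{-1}, expectation -> sample mean

# Toy anisotropic (ellipsoidal) cloud: two points at the same Euclidean distance
# from the center get nearly the same L2 depth, even though one lies along the
# minor axis and the other along the major axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) * np.array([5.0, 1.0])   # major axis along x, minor along y
print(l2_depth(np.array([0.0, 2.0]), X))   # along the minor axis
print(l2_depth(np.array([2.0, 0.0]), X))   # along the major axis
```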


Mahalanobis depth: The Mahalanobis depth captures the anisotropy in the structure of a distribution up to its second moment. We can compute it by simply replacing the distance metric in the definition of the $L_2$ depth with the Mahalanobis distance.



$$D^{\textrm{Mah}}(x|X) = \big( 1 + (x-E[X])^{\top} \, \Sigma_X^{-1} \, (x-E[X])\big)^{-1}$$

where $\Sigma_X$ is the covariance matrix of $X$.
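A matching sketch of the empirical Mahalanobis depth, with the mean and covariance replaced by their sample estimates (again, the function name and example points are my own), shows how the anisotropy is now accounted for.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Empirical Mahalanobis depth of x with respect to sample X (n x d array)."""
    mu = X.mean(axis=0)                                # E[X] -> sample mean
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # Sigma_X^{-1} -> inverse sample covariance
    d = x - mu
    return 1.0 / (1.0 + d @ cov_inv @ d)               # (1 + (x-E[X])^T Sigma_X^{-1} (x-E[X]))^{-1}

# With the same ellipsoidal cloud as before, the point along the minor axis now
# gets a noticeably lower depth than the point along the major axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) * np.array([5.0, 1.0])
print(mahalanobis_depth(np.array([0.0, 2.0]), X))
print(mahalanobis_depth(np.array([2.0, 0.0]), X))
```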

Projection Depth: The idea behind projection depth is that if we project the points onto a unit vector, the projections of central points lie close to the median projection while the projections of outlying points deviate from it by larger amounts. Taking the worst-case (supremum) scaled deviation over all directions and inverting it gives depth values that decrease monotonically from the center.


$$ D^{\textrm{proj}}(x|X) = \bigg(1+\sup_{p \in S^{d-1}} \frac{| \langle p,x \rangle - \textrm{med} ( \langle p,X \rangle ) |}{ \textrm{Dmed}(\langle p,X \rangle)} \bigg)^{-1}$$


where $ S^{d-1} $ denotes the unit sphere in $ \mathbb{R}^d $, $ \langle \cdot,\cdot \rangle $ is the inner product, $\textrm{med}(U)$ denotes the univariate median of a random variable $U$, and $ \textrm{Dmed}(U)=\textrm{med}(|U-\textrm{med}(U)|) $ is the median absolute deviation.
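Since the supremum over all directions cannot be evaluated directly, a common practical shortcut is to approximate it over a finite set of random unit vectors. The sketch below does exactly that (and therefore slightly over-estimates the depth); the function name and parameters are my own choices.

```python
import numpy as np

def projection_depth(x, X, n_dirs=500, rng=None):
    """Empirical projection depth, approximating the supremum over the unit
    sphere by a finite sample of random directions."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    P = rng.normal(size=(n_dirs, d))
    P /= np.linalg.norm(P, axis=1, keepdims=True)   # random unit vectors p in S^{d-1}
    proj_X = X @ P.T                                # <p, X_i> for every direction (n x n_dirs)
    proj_x = P @ x                                  # <p, x>  for every direction
    med = np.median(proj_X, axis=0)                 # med(<p, X>)
    dmed = np.median(np.abs(proj_X - med), axis=0)  # Dmed(<p, X>) = med(|<p,X> - med(<p,X>)|)
    outlyingness = np.max(np.abs(proj_x - med) / dmed)
    return 1.0 / (1.0 + outlyingness)
```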


Oja Depth: The Oja depth is based on the average volume of simplices whose vertices are drawn from the data. The idea is that simplices having a central point as a fixed vertex tend, on average, to have smaller volume than simplices having an outlying point as a fixed vertex. Again, we can invert the average volume of the simplices associated with a particular point (as the fixed vertex) to get depth values that decrease monotonically from the center.



$$ D^{\textrm{Oja}}(x|X) = \bigg(1+\frac{E \big(\textrm{vol}_{d}(\textrm{co}\{x, X_1,\ldots,X_d\})\big)}{\sqrt{\det \Sigma_{X}}} \bigg)^{-1}$$


where $\textrm{co}$ denotes the convex hull, $\textrm{vol}_d$ is the $d$-dimensional volume, $X_1,\ldots,X_d$ are independent draws from $X$, and $\Sigma_X$ is the covariance matrix.
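Here is a hedged Monte-Carlo sketch of the empirical Oja depth: it averages the volumes of randomly chosen simplices that keep $x$ as a fixed vertex, using the determinant formula for the volume of a simplex. The function name, the number of sampled simplices, and the use of random $d$-tuples rather than all possible ones are my own simplifications.

```python
import math
import numpy as np

def oja_depth(x, X, n_simplices=2000, rng=None):
    """Monte-Carlo estimate of the empirical Oja depth: average the volume of
    simplices co{x, X_{i_1}, ..., X_{i_d}} over random d-tuples drawn from X."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    det_cov = np.linalg.det(np.cov(X, rowvar=False))
    vols = []
    for _ in range(n_simplices):
        idx = rng.choice(n, size=d, replace=False)              # pick d sample points as vertices
        V = X[idx] - x                                          # edge vectors from the fixed vertex x
        vols.append(abs(np.linalg.det(V)) / math.factorial(d))  # volume of the resulting simplex
    return 1.0 / (1.0 + np.mean(vols) / math.sqrt(det_cov))
```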

While we don't explicitly see a distance metric in the formulations of the projection and Oja depths, they are still considered "distance based" because of the close relationship of the distance metric to the inner product and to volume, respectively. The reference below provides more information on how well the above formulations discriminate between different distribution structures.


Reference:
 Mosler, Karl. "Depth statistics." Robustness and complex data structures. Springer Berlin Heidelberg, 2013. 17-34. 

Wednesday, November 16, 2016

Visweek 2016 and Baltimore Trip

At the end of last month, I travelled to Baltimore to attend the annual IEEE Visualization conference (a.k.a. Visweek). As in the previous year, the conference also included a session for selected papers published in the IEEE Computer Graphics & Applications (CG&A) journal. I presented the work from my CG&A paper, "Evaluating Shape Alignment via Ensemble Visualization".


This also happened to be my first Visweek conference, and I was excited to be part of it. Apart from the interesting lineup of research presentations, I liked the meetups that were organized on the side during the breaks. These were informal gatherings for open discussion on specific topics (e.g., the blogging and podcasting meetup!). I thought these were particularly helpful for newer attendees. Another tradition I found out about, although too late for me this year, was the Velo Club de VIS, a post-vis bicycling club. I hope to be able to make it back next year and join the bike ride in Phoenix. It was also nice to see and meet researchers whom I knew from their work.



This year the conference introduced the "Test of Time" awards to recognize decades-old research that has stood the test of time. It made me think about the incremental nature of research, and about all the other pieces of work that are key to the conceptualization of the most successful research ideas, past and future. I also enjoyed talking to fellow student presenters about their conference experience, often eliciting interesting responses that made me smile, and think. It was a great experience overall; one thing I'd do differently in the future would be to read the abstracts of interesting papers before landing at the conference. This blog post has turned out to be just like my trip, consisting mostly of the conference and very little of Baltimore!

Saturday, October 15, 2016

Desirable Properties for Data Depth Formulations

In a previous blog post here, I described data depth as a way to quantify how central or deep a member is within a distribution. While this description makes intuitive sense when we think of distributions of points in a Euclidean space, data depth formulations often also deal with more abstract objects, where it is not clear what could be considered "deep within a distribution". A clear set of desirable properties helps in such cases to evaluate the utility of a depth formulation. In fact, such properties, in addition to being used to characterize existing formulations of data depth, can also act as an aid for developing new ones.

Zuo and Serfling proposed the following basic properties desired of any depth function. Typically, depth formulations are shown to satisfy these properties under certain assumptions, such as the distribution being continuous and angularly symmetric. Angular symmetry about a center roughly means that the distribution looks the same in opposite directions from that center; in the simplest case, Prob[x] = Prob[-x], which places the center at x = 0.

Zuo and Serfling's properties for depth formulations:

1. Null at infinity: Depth of a member falls to zero as its distance from the center of the distribution tends to infinity.
2. Maximum at center: Depth is maximum at the center of angular symmetry of the distribution.
3. Monotonicity: The depth falls off monotonically along any ray pointing outward from the center.
4. Affine invariance: The depth is invariant if the same affine transformation is applied to all members of the population (see the sketch below for a quick numerical check).
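As a quick numerical illustration of property 4, the sketch below applies a random affine transformation to a sample and checks that the Mahalanobis depth (a formulation known to be affine invariant) assigns the same value to a test point before and after the transformation; the helper function and the toy data are my own.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Empirical Mahalanobis depth, used here only as an affine-invariant example."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = x - mu
    return 1.0 / (1.0 + d @ cov_inv @ d)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
x = np.array([0.5, -1.0, 2.0])

A = rng.normal(size=(3, 3))          # a random (almost surely invertible) linear map ...
b = rng.normal(size=3)               # ... plus a translation: together an affine transformation
X_t = X @ A.T + b
x_t = A @ x + b

print(mahalanobis_depth(x, X))       # the two values agree (up to floating point),
print(mahalanobis_depth(x_t, X_t))   # illustrating affine invariance
```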

There are also other properties associated with depth functions beyond those mentioned above, such as upper semicontinuity, which requires the upper level sets of the depth function to be closed. However, I don't think these are as critical as the ones above. For example, even the popular simplicial depth formulation does not satisfy upper semicontinuity. Furthermore, the absence of upper semicontinuity is not necessarily a drawback; depending on the application, it can enable the simplicial depth to capture the structure of the distribution better than formulations that do satisfy it!

Finally, here is the link to Zuo and Serfling's paper for the formal descriptions of these properties: Zuo, Yijun, and Robert Serfling. "General notions of statistical depth function." Annals of Statistics 28 (2000).

Wednesday, September 21, 2016

Reference Managers: Moving from Mendeley to Zotero

A reference manager can be a real force multiplier when dealing with the multitude of manuscripts one has to go through in any kind of research endeavor. There are several reference managers to choose from, many of which let you get started for free.


I have been using Mendeley since the fall of 2013, when I started my PhD research, and have found it tremendously helpful for keeping track of the papers I go through. However, I'm currently in the process of switching to Zotero, mainly because I wanted the following functionality, which is quite easy to set up in Zotero but not available in Mendeley (yet!):


  • Searching inline annotations in PDFs.
  • Opening the annotated PDFs in multiple windows.

As a caveat, while Zotero seems to be able to do everything that Mendeley supports, much of the functionality needs to be set up by the user. For example, the highly useful "recently read" filter comes off the shelf in Mendeley, but has to be set up by the user as a saved search in Zotero. Speaking of searches, I do think that Zotero has the more powerful search capability in general.


While there seem to be options to help move files and metadata (notes and organization structure) from Mendeley to Zotero, I decided to move manually(!) as part of my reorganization. I did end up leaving behind a whole bunch of references that I thought were unlikely to be needed again in the foreseeable future. This was feasible because the core corpus of files that I routinely work with turned out (surprisingly) to still be of a manageable size. While I'm currently under Zotero's 300 MB free storage quota, I do plan to upgrade to the 2 GB plan soon. There are also ways to set up alternate online file storage for the so inclined.



One final thing to mention is the Zotfile add-on (yup, in addition to being open source, Zotero allows functionality to be extended via add-ons!). Zotfile automates some essential tasks such as adding new files to Zotero items, batch renaming/moving attachments, and, perhaps most important for me personally, extracting inline annotations from PDF files. Zotfile extracts the inline annotations from the PDFs and converts them into Zotero item attachments---which are searchable! This workflow does lead to duplicate attachments when a PDF's inline notes are updated, since Zotfile then creates a new attachment for the updated annotations. A simple way to remove the duplicates is to create a Zotero saved search that filters (based on timestamps) and deletes the attachments corresponding to the older inline annotations.

Monday, September 5, 2016

Data Depth Primer

My current research revolves around the notion of data depth, so I thought an introductory primer on data depth would be a good topic to kick off the research-related posts. This first post introduces data depth and a taxonomy that classifies all of the different types of data depth formulations I've come across so far.



What is Data Depth? Given an ensemble of data objects, typically drawn from an underlying probability distribution, data depth is a method to quantify how central or deep any particular data object is with regard to that distribution. For example, given an ensemble of points drawn from a bivariate normal distribution, data depth can help identify the most central point as well as quantify the centrality of any point in the plane with regard to that particular distribution. 
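As a small illustration of this idea, the sketch below ranks the members of a bivariate normal sample by a depth function and reads off the deepest member as a multivariate "median". The plug-in depth, function names, and toy sample are my own choices; any other depth formulation could be swapped in.

```python
import numpy as np

def l2_depth(x, X):
    """A simple plug-in depth; any other depth formulation could be used instead."""
    return 1.0 / (1.0 + np.linalg.norm(X - x, axis=1).mean())

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.6], [0.6, 2]], size=500)

depths = np.array([l2_depth(x, X) for x in X])
order = np.argsort(-depths)            # center-outward ordering of the ensemble members
deepest = X[order[0]]                  # a depth-based multivariate "median"
print(deepest)
```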



As can be guessed from the example, data depth methods are interesting for working with ensembles of complex data types where a method to order the data members is not obvious or trivial, as it is, for instance, for points on the real number line. For data types more complex than points on a real number line, several kinds of data depth formulations have been developed. The figure below shows some examples of these formulations and how they can be classified into various (overlapping) classes based on the characteristics of the data types they operate on as well as characteristics of the formulation itself.



A taxonomy of data depth formulations.

In follow-up posts, I plan to dig into the various classes of data depth and their respective properties in more depth (pun intended). As seen in the figure above, these classes are:

  • Distance based formulations
  • Weighted mean formulations
  • Band based formulations
  • Data depth formulations for Euclidean data
  • Data depth formulations for Non-Euclidean data
  • Kernelized formulations

A useful reference: Mosler, Karl. "Depth statistics." Robustness and Complex Data Structures. Springer Berlin Heidelberg, 2013. 17-34 (an arXiv version is also available).

Also, here is the link to my next blog post on data depth where I enumerate the important desirable properties expected in data depth formulations.

Sunday, August 28, 2016

Cyber Seniors Pilot



At the beginning of this past summer (of 2016), I came across a call for volunteer coordinator(s) to help with a "Cyber Seniors" pilot project being organized by UServe Utah. It turns out that the current generation of teenagers, being digital natives, are very proficient with the latest digital devices such as computers, tablets, and smartphones. The idea of the Cyber Seniors project was to help transfer some of those skills from teenagers to seniors, who had lived a larger part of their lives before the digital revolution and were likely to benefit greatly from an introduction to recent developments such as the Internet. The idea originally began as a high school project in Toronto, Canada before becoming quite well known, thanks in part to the popularity of this documentary. Having seen the positive impact of technology awareness within my own family, I thought it was a fabulous idea and signed up.



The program started with the volunteer coordinators meeting to review and polish the curriculum for the seniors, followed by a training session for the youth (middle and high school kids from the Salt Lake City area). Later, the youth were paired up with seniors at a local senior center for one hour a week to go over various items in the curriculum, such as setting up email and social media accounts for the seniors. It was clear from the start that this was going to be an interesting experience for all involved. There always seemed to be considerable excitement about the sessions at the senior center. The seniors seemed impressed by what is possible with technology as well as by how comfortable the youth were in navigating the digital world and answering their questions. The youth were upbeat about being able to help the seniors with their technology-related concerns.



Personally, I liked that I was able to work toward something that 1) had a positive impact on the broader community and 2) had an immediate impact. The second part is interesting as it contrasts sharply with my day (and night!) job doing research, where it's not always clear how far into the future, if at all, it's really going to make a difference.



Now, along with the summer, the pilot phase of the project is over and the youth are back in school. I think the pilot was a success, and it looks like others share this opinion, as there seem to be plans to expand it to more locations across Utah starting spring 2017.


Saturday, August 20, 2016

Setting Up a Personal Backup System



Backups are critical. If I could save only one (inanimate) thing from a fire or similar disaster, it would be my laptop, or more specifically, the data on it. While some bits and pieces of my work are usually backed up in remote code repositories, it's hard to believe that until recently I did not have a comprehensive personal backup system in place. While I've heard good things about the Apple Time Capsule, here I describe how I set up a backup system using a couple of regular Seagate external drives that I already owned.



My goal for this system: to set up a 'backup' command that runs from the terminal and performs a backup of all my important data.



Step 1: Make a list of all important files and folders to back up. This also helped me organize my data and get rid of the clutter. Some of these for me were:


  • All the settings files for the IDEs I'm using. These settings files are usually stored in different places, but I think it's important to back up settings that have been fine-tuned over the years.
  • Work files. I ended up organizing all my work related files inside a single work folder. 
  • My Documents folder.
  • Pictures and other media.


Step 2: Write a shell script. This is the main backup script. As an example, here is a link to my backup shell script. It lets me pick from a list of my personal external drives and select between incremental and full backups. In the script, I use the "rsync" command (as opposed to "cp") as it affords the incremental backup option. The list of files and folders to back up is stored in another file, "include.txt". Finally, the script records the time and other metadata for the backup in a text file, "latest.txt".
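I won't reproduce the actual shell script here, but its core boils down to a single rsync invocation driven by "include.txt", plus a log entry in "latest.txt". The sketch below is a simplified Python rendering of that same idea, not the script itself; the source and destination paths are placeholders.

```python
#!/usr/bin/env python3
"""Simplified sketch of the backup idea: one rsync call driven by include.txt,
plus a note of when the backup ran. Paths here are placeholders."""
import datetime
import subprocess
import sys

SOURCE = "/Users/me"                      # placeholder: root that include.txt paths are relative to
DEST = "/Volumes/BackupDrive/backup/"     # placeholder: mount point of the external drive

def run_backup(full=False):
    cmd = ["rsync", "-av", "--files-from=include.txt", SOURCE, DEST]
    if full:
        cmd.insert(2, "--ignore-times")   # re-copy everything instead of the incremental default
    subprocess.run(cmd, check=True)
    with open("latest.txt", "a") as log:  # record when (and how) the backup ran
        log.write(f"{datetime.datetime.now().isoformat()} full={full}\n")

if __name__ == "__main__":
    run_backup(full="--full" in sys.argv)
```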


Step 3: Add an alias to your "~/.bash_profile". This makes performing a backup as simple as typing a single command in the terminal. We only need to add the following line to "~/.bash_profile":



alias backup="path_to_backup_script/backup.sh"



Step 4: Add a recurring backup reminder to the calendar application. This is a simple but important step. I prefer not to fully automate the backup, to prevent it from annoyingly starting up while I'm working on a compute-intensive task. A calendar reminder makes sure that I don't go too long without running the backup script.