Wednesday, September 21, 2016

Reference Managers: Moving from Mendeley to Zotero

A reference manager can be a real force multiplier in dealing with the multitude of manuscripts one has to go through in any kind of research endeavor. There are several reference managers to choose from, many of which even let you get started for free.


I have been using Mendeley since the fall of 2013, when I started out on my PhD research, and found it tremendously helpful for keeping track of the papers I go through. However, I'm currently in the process of making the switch to Zotero, mainly because I wanted the following functionality, which is quite easy to set up in Zotero but not available in Mendeley (yet!):


  • Searching inline annotations in PDFs.
  • Opening the annotated PDFs in multiple windows.

As a caveat, while Zotero seems to be able to do everything that Mendeley supports, much of the functionality needs to be set up by the user. For example, the highly useful "recently read" filter comes off the shelf in Mendeley, but has to be set up by the user as a saved search in Zotero. Speaking of searches, I do think that Zotero has a more powerful search capability in general.


While there seem to be options to help move files and metadata (notes and organization structure) from Mendeley to Zotero, I decided to move manually(!) as part of my reorganization. I did end up leaving behind a whole bunch of references which I thought were unlikely to be needed again in the foreseeable future. This was feasible because, as it turned out, the core corpus of files that I routinely work with is (surprisingly) still of a manageable size. While I'm currently under Zotero's 300 MB free storage quota, I do plan to upgrade to the 2 GB storage plan soon. There are also ways to set up alternative online file storage for those so inclined.



One final thing to mention would be the Zotfile add-on (Yup, in addition to being open source, Zotero allows functionality to be extended via add-ons!). Zotfile automates some essential tasks such as adding new files to Zotero items, batch renaming/moving attachments and, perhaps most important for me personally, extracting inline annotations from PDF files. Zotfile extracts the inline annotations from the PDFs and converts them into Zotero item attachments, which are searchable! This workflow does lead to duplicate Zotero attachments when a PDF's inline notes are updated, as Zotfile then creates a new attachment corresponding to the updated annotations. A simple way to remove the duplicates is to create a Zotero saved search that filters (based on timestamps) the attachments corresponding to the older inline annotations, and then delete them.

Monday, September 5, 2016

Data Depth Primer

My current research revolves around the notion of data depth, so I thought that an introductory primer on data depth would be a good topic to kick off research-related posts. This first post introduces data depth along with a taxonomy that classifies the different types of data depth formulations I've come across so far.



What is Data Depth? Given an ensemble of data objects, typically drawn from an underlying probability distribution, data depth is a method to quantify how central or deep any particular data object is with regard to that distribution. For example, given an ensemble of points drawn from a bivariate normal distribution, data depth can help identify the most central point as well as quantify the centrality of any point in the plane with regard to that particular distribution. 
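To make the bivariate normal example concrete, here is a minimal sketch of one simple depth formulation, Mahalanobis depth, which scores a point by its Mahalanobis distance from the ensemble mean: D(x) = 1 / (1 + (x - mu)^T S^{-1} (x - mu)). The function name and the specific sample sizes here are illustrative, not from the post itself.

```python
import numpy as np

def mahalanobis_depth(points, ensemble):
    """Mahalanobis depth of each query point w.r.t. an ensemble of samples.

    D(x) = 1 / (1 + (x - mu)^T S^{-1} (x - mu)),
    where mu and S are the ensemble mean and covariance.
    Depth is 1 at the mean and decays toward 0 for outlying points.
    """
    mu = ensemble.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(ensemble, rowvar=False))
    diff = points - mu
    # Quadratic form (x - mu)^T S^{-1} (x - mu) for each row of `points`.
    sq_dist = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return 1.0 / (1.0 + sq_dist)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=500)
queries = np.array([[0.0, 0.0], [2.0, 2.0]])
depths = mahalanobis_depth(queries, samples)
# The central point receives a higher depth than the outlying one.
```

Mahalanobis depth is just one member of the family of formulations discussed below; other depths (e.g. halfspace or band depth) induce different center-outward orderings.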



As can be guessed from the example, data depth methods are interesting for working with ensembles of complex data types, where a method to order the data members is not as obvious or trivial as it is, for instance, in the case of points on a real number line. For such data types, several formulations of data depth have been developed. The figure below shows some examples of these formulations, and how they can be classified into various (overlapping) classes based on the characteristics of the data types they operate on as well as characteristics of the formulation itself.



A taxonomy of data depth formulations.

In follow-up posts, I plan to dig into the various classes of data depth and their respective properties in more depth (pun intended). As seen in the figure above, these classes are:

  • Distance based formulations
  • Weighted mean formulations
  • Band based formulations
  • Data depth formulations for Euclidean data
  • Data depth formulations for Non-Euclidean data
  • Kernelized formulations

A useful reference: Mosler, Karl. "Depth statistics." Robustness and Complex Data Structures. Springer Berlin Heidelberg, 2013. 17-34.
get arXiv version

Also, here is the link to my next blog post on data depth, where I enumerate the important properties desirable in data depth formulations.