Jagged Thoughts | Dr. John Linwood Griffin

July 9, 2008

PDL visit day

Filed under: Reviews — JLG @ 11:17 PM

In May 2008 I attended the PDL Spring Industry Visit Day in Pittsburgh, a workshop of sorts where students display their work in poster and demo form, industry visitors catch up with their old storage acquaintances, and everybody gets together for German food and beer afterward. (What’s not to like?)

Here are some of the larger tidbits I took away from the event:

1. Filesystems statistics survey

Garth Gibson organized a 5-year DoE institute, the Petascale Data Storage Institute, to explore issues of interest to folks like the national labs. A nifty thing they’re doing is putting together public repositories of useful data for storage researchers. For example, the Computer Failure Data Repository contains the data Garth and Bianca used for the MTTF FAST paper.

So, the latest one is the “filesystem statistics survey.” There is a tool that anyone can run and a respository for folks to upload their results. The type of results that they’ve generated so far are:

  • In archival file systems (at the national labs), most space is consumed by a small number of large files: 90% of space is consumed by files 32MB or greater in size, whereas 90% of files are smaller than 32MB.
  • In 75% of the archival file systems, 80%-90% of the files consume less than 2KB apiece.

This is available at:

2. Hadoop

I hadn’t heard about Hadoop before today (do I live under a rock? does everyone know what this is?) Hadoop is an open-source implementation of MapReduce — i.e., a toolset to help a user easily fire off map() and reduce() functions on his or her own cluster of heterogeneous boxes. An example from my favorite online encyclopedia: “The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours at a computation cost of just $240.”

So I guess distributed computing is just getting easier and easier. One of my colleagues was setting up a Condor cluster just as I was leaving CMU so I didn’t get to learn a lot about it or see it in action. If you have experience with Condor or Hadoop I’d appreciate your giving me an overview sometime.

My favorite Hadoop-related project was applying the “fingerpointing” techinque (from Priya Narasimhan and her students) to identify in real time which nodes are the source of performance slowdowns in a Hadoop-based system. Fingerpointing is their take on failure detection and root-cause analysis in distributed systems, described here:

One of the topics I care about (related to #1) is using what auditable information has been collected about a system to actually do some useful auditing, which is why I’m interested in this particular work.

3. Home media storage

My favorite of the projects is “Perspective”, described here:

They are looking at information stored in home media environments and asking questions about how real users want to interact with their storage: how easy is it to accomplish tasks such as “make sure a movie is on Randal’s ipod before he leaves for his upcoming trip” or “make sure this set of files in Zach’s JPEG archive can’t be viewed by anyone else in his household.”

User studies in computer science is an underdeveloped field. I got really interested in this after I saw some interesting work at IBM (the Sparcle project, linked below) that did a user study to see how well computer-literate people were able to specify access control policies. A lot of CS work suffers from a lack of user-centric design, so I’m happy to see any work that tries to address the problem. Sparcle is here: http://domino.research.ibm.com/comm/research_projects.nsf/pages/sparcle.index.html

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.