Monday, August 18, 2008

New Phone Number

My phone bricked. I'll expand this post with details later, but for now, what's important is that my new number is 217-369-6980 until further notice.

Update: My number is back to 217-898-9662. Been that way for a while, actually. However, what with major life events (marriage, honeymoon, visiting family in Brazil, and moving to Seattle), I haven't had time to post updates. I'll try to be better about it in the upcoming weeks.

Friday, July 18, 2008

Restricting method access in Perl objects

This is a geeks-only entry in the "Perl: Handy, but Ugly" series...

I often want to restrict access to certain methods in a class. One classic example is public and private methods. As another example, I've written a class for data storage with both read and write methods, and sometimes I want an instance to be read-only, and other times write-only. I could implement this with an internal read/write flag. However, while I want that flag to be flippable, I don't want just anyone flipping it. That sort of thing is hard to do in Perl because it doesn't believe in enforced privacy.

Fortunately, Perl does believe in being powerful and flexible. So I've found a neat way of wrapping object instances in what I call adapters, which expose only a subset of the object's methods.

The basic desiderata are as follows:

  1. The adapter should be an object wrapping another object.
  2. It should only define the methods it exposes, so that the wrapped object's unexposed methods aren't even there.
  3. There should be no way of getting to the wrapped object through the adapter (otherwise, you can get to the unexposed methods).
  4. Finally, I don't want to write a new adapter for every class I want to wrap, or every subset of methods I want to expose.

Wait a second, you say. I want adapters to be classes defining a custom set of methods, but I don't want to write a new adapter each time? Yes. And because Perl is "Handy, but Ugly", I can do it.

The trick is that Perl gives you direct access to the symbol table: that magical hash that knows what reference you mean when you use a variable or subroutine name in your code. And since a class is just a set of symbols, it's possible to create a class entirely on the fly just by inserting the proper subroutine references into the symbol table.

With that, I present my AdapterFactory Perl module. It's fairly well commented, so I'll leave grokking it as an exercise for the reader. A couple of hints:

  • With no strict, a string can be dereferenced as if it were a reference to the variable whose name is the string's value. This works only for non-lexical variables (i.e., those not defined with "my"). For instance, $h = "hash"; %$h is equivalent to $h = \%hash; %$h, or %hash
  • For some reason, even with use strict, strings on either side of the arrow operator can be dereferenced to the package or method whose name is the string value. For instance, $p = "Package"; $m = "new"; $p->$m() is equivalent to Package->new()
  • The symbols for a package are kept in a hash with the name of the package plus "::". Thus, symbols for package "foo" are kept in hash %foo::
  • The * sigil is used to set values in the symbol table
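
Before the module itself, here's a minimal standalone sketch of those tricks; the Greeter package and its hello() method are made up purely for illustration:

#!/usr/bin/perl
use strict;
use warnings;

{
    # Build a tiny class on the fly by writing code refs into its symbol table.
    no strict 'refs';    # allow strings to be used as symbolic references
    *{"Greeter::new"}   = sub { return bless {}, shift };
    *{"Greeter::hello"} = sub { print "Hello from Greeter\n"; };
}

# Even under strict, strings work on both sides of the arrow operator.
my $class  = 'Greeter';
my $method = 'hello';
my $obj    = $class->new();
$obj->$method();    # prints "Hello from Greeter"
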
AdapterFactory.pm
###
# Author: Pedro DeRose
# Creates adapters, or objects that wrap another object, but expose only a
# subset of its methods. Useful for separating public/private methods, or
# restricting functionality. Does not provide any handle to the object itself.
#
# Usage example:
#
#     use AdapterFactory qw(defineAdapter adapt);
#
#     defineAdapter('Foo::Public', [ qw(get set print) ]);
#     my $fooAdapter = AdapterFactory::Foo::Public->new($fooObj);
#     my $barAdapter = adapt('Foo::Public', $barObj);
#     
#     defineAdapter('Foo::Private', { secret => [ 'default' ] });
#   
#   Defines the AdapterFactory::Foo::Public adapter exposing the get(), set(),
#   and print() methods, then creates adapters wrapping $fooObj and $barObj.
#   Finally, defines the AdapterFactory::Foo::Private adapter exposing the
#   secret() method, and specifies that "default" should always be passed to it.
###   
package AdapterFactory;
use strict;

use Exporter 'import';
our @EXPORT_OK = qw(defineAdapter adapterDefined adapt);

# Keep map of adapter to object as a lexical variable so that adapter objects
# don't store the object themselves, where other code can get to it.
my %adapterToObj;

###
# Defines a new adapter class whose name is the name of this class, plus "::"
# then the given name appended (e.g., given name "Foo::Bar", the name is
# "AdapterFactory::Foo::Bar"). It wraps the object passed to its new()
# constructor, exposing the specified methods. Methods can be specified in two
# ways. When an array reference of method names, they are called directly. When
# a map from method name to an array reference of arguments, the adapter's
# methods call the wrapped object's methods with the given arguments always
# appended. See the usage example above for how to use the adapter.
#   name: the name of the adapter class
#   methods_r: reference to methods to expose
#   returns true if the definition was successful, false otherwise
###
sub defineAdapter {
    my ($name, $methods_r) = @_;
    $name or die "Missing name";
    ref($methods_r) eq 'HASH' or ref($methods_r) eq 'ARRAY' or die "Bad methods";

    if(adapterDefined($name)) {
        warn "Adapter $name already exists.";
        return undef;
    }

    # Lots of symbol table manipulation, so stop yer whining
    no strict;

    # Compose the adapter class name
    my $class = __PACKAGE__."\::$name";

    # Turn method array ref into method hash ref with no method arguments
    if(ref($methods_r) eq 'ARRAY') { $methods_r = { map { ($_ => []) } @$methods_r }; }

    # Directly create symbol table entry for each exposed method.
    foreach my $method (keys %$methods_r) {
        my @args = defined($methods_r->{$method})? @{$methods_r->{$method}} : ();
        *{"$class\::$method"} = sub {
            # Look up object using adapter's reference, then call the method
            my $self = shift;
            return $adapterToObj{$self}->$method(@_, @args)
        };
    }

    # Create the constructor last, so it clobbers any "new" method in methods_r
    *{"$class\::new"} = sub {
        my ($class, $obj_r) = @_;

        # Map the given obj to this adapter
        my $self = {};
        bless($self, $class);
        $adapterToObj{$self} = $obj_r;

        return $self;
    };

    return 1;
}

###
# Returns whether an adapter with the given name is already defined
#   name: the name of the adapter class
#   returns true if an adapter with the name is defined, false otherwise
###
sub adapterDefined {
    my ($name) = @_;
    no strict;
    return scalar(%{__PACKAGE__."\::$name\::"});
}

###
# Creates and returns an adapter for a given object. Equivalent to calling the
# new() constructor on the adapter created with the given name, and passing the
# given object.
#   name: the name of the adapter class
#   obj_r: reference to the object being wrapped
###
sub adapt {
    my ($name, $obj_r) = @_;
    $name or die "Missing name";
    UNIVERSAL::isa($obj_r, 'UNIVERSAL') or die "Object must be a blessed reference";

    # Create and return the adapter
    my $class = __PACKAGE__."\::$name";
    return $class->new($obj_r);
}


1;
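
To make the restriction concrete, here's a small usage sketch (Counter is a made-up class; only its inc() method is exposed through the adapter):

use strict;
use warnings;
use AdapterFactory qw(defineAdapter adapt);   # assumes AdapterFactory.pm is in @INC

package Counter;
sub new   { return bless { n => 0 }, shift }
sub inc   { my $self = shift; return ++$self->{n} }
sub reset { my $self = shift; $self->{n} = 0; return 0 }

package main;
defineAdapter('Counter::IncOnly', [ 'inc' ]);

my $counter = Counter->new();
my $adapter = adapt('Counter::IncOnly', $counter);

print $adapter->inc(), "\n";    # 1, forwarded to $counter
print $adapter->inc(), "\n";    # 2
$adapter->reset();              # dies: reset() isn't defined on the adapter
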

Thursday, July 3, 2008

Perl: Handy, but Ugly

In what will probably be a many-part series, here's an oddity of Perl that had me tearing out my hair for a couple of hours...

If you know Perl well, feel free to skip this paragraph. Perl has a handy but ugly notion of context. Specifically, code executes in either a scalar or a list context: if a single value is expected, the code executes in scalar context; if a list of values is expected, it executes in list context (that's vague, but good enough for now). Then, code behaves differently depending on the context.

One example of context is getting the length of a list. Given a list @foo = ('a', 'b', 'c'), then @foo in scalar context is the length of @foo. Thus, $x = @foo sets the single value $x to 3 (the code executes in scalar context because $x is a single value, so Perl expects a single value assigned to it).

Now for a pop quiz. If @foo = ('a', 'b', 'c'); $x = @foo sets $x to 3, what does $x = ('a', 'b', 'c') do? Turns out it sets $x to c. Fascinating, isn't it?

The reason is that the comma does different things in list and scalar contexts. In a list context, comma is the list building operator. Thus, ('a', 'b', 'c') in list context (such as when assigned to the list variable @foo) returns a list with three items. However, in scalar context, comma is like C's comma: it executes both its left and right operands, then returns the result of the right. For instance, 'a', 'b' returns b, and 'a', 'b', 'c' returns c. Thus, when we assign ('a', 'b', 'c') to a single value, the code executes in scalar context, returning c.

Of course, I wasn't lucky enough to have this bite me in such a simple form. Instead, consider this (still heavily simplified) example:

sub foo {
    $a = "hello";
    $b = "world";
    return ($a, $b);
}

print join(" ", foo()) . "\n";
print scalar(foo()) . "\n";

I naively thought this would print hello world then 2. Instead, we get hello world then world. Today's lesson, then: when returning lists from functions, assign them to a list variable first.
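
For instance, a quick sketch of the fix, using the same foo():

my @result = foo();
print join(" ", @result) . "\n";   # hello world
print scalar(@result) . "\n";      # 2: an array in scalar context is its length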

Thursday, June 26, 2008

Project Note Taking System

A while ago, I went looking for a good note taking system. Notes, as in on paper. I work on a lot of projects, and since I grok things better when I write them down, I needed a way to organize ideas, meeting minutes, tasks, and progress.

I found several hacks to turn my preferred notebook, a Moleskine, into a full-fledged PDA replacement using GTD. However, I didn't want a PDA replacement. I wanted a simple way to organize project ideas.

I also found a lot of good note-taking systems. Of these, the Cornell system was closest to what I wanted. I liked the idea of taking notes, then adding higher-level comments off to one side. Unfortunately, page division doesn't work well in a small notebook, and the system isn't very project-oriented.

Thus, after some trial and error, I've mostly settled on something that works well for me. I begin with a large, graph paper Moleskine, though any notebook should work. Next, I take notes on the right-hand page, then write higher-level comments on the left page. That's the gist. The fun part is the details.

On the right-hand page, I always first write the date in the upper-right-hand corner. This makes finding old notes a lot easier. After that, I take notes however I like — outlines, drawings, mindmaps, whatever.

Then, both while writing notes and when reviewing them, I write higher-level comments on the left page. I find it useful to vertically align them with the part of the notes they comment on. Each comment is labeled, in the form Label: comment, so that I can immediately tell what kind of comment it is. I use five labels:

Topic
Every left-hand page has a single Topic comment first thing on the page. It's a short phrase or a few keywords to remind me what the notes are about. Since there's only one per page, right at the top, it's easy to flip through the notebook and find notes about particular topics.
Thought
These are interesting thoughts about the notes, such as summarizations, ideas, etc.
Tip
Often my notes include good lessons, so Tips are things I want to do differently in the future.
Task
These are things that I need to do based on the notes. As I do them (or move them to a better task management system), I check them off.
Tack/Tank
This pair of labels keeps track of tacks we've taken in the project, and why we've decided to tank them. I use them because I found that projects often cycle back to old ideas without remembering the very good reasons they were killed in the first place. To illustrate their use, suppose we have a project meeting on Monday and decide to use MySQL. My notes on the right-hand page contain our reasoning, and I add "Tack: use MySQL" to the left-hand page, leaving some space underneath. On Tuesday, we change our minds, and decide to use SQLite instead. So now I add "Tack: use SQLite" to Tuesday's left-hand page. Then, I go back to Monday's page, and under the "Tack: use MySQL" comment, I add a Tank comment explaining why we're no longer using MySQL.

That's it. Fairly easy to use and well organized, and relatively easy to find information later. Of course, it's not ideal. What I really want is a lightweight tablet PC, about the size of my Moleskine but all screen, with a swivel keyboard, and nice note software with tags, tree-structure organization, and handwriting search. But before saying such things exist, I also want it to be affordable. Good luck to me. Until then, I'll keep buying Moleskines.

Thursday, May 29, 2008

Industry vs. Research

As a graduate student looking for jobs, I often heard the question "Industry or research?" Industry jobs include developers, technical managers, and even applied researchers to a large degree. Fundamental research jobs are professors and research scientists at industrial labs, such as Microsoft Research and Yahoo! Research. The definitions are somewhat fuzzy (applied research is industry? industrial labs are research?), but a generally distinguishing characteristic is whether publishing papers is a primary aspect of the job.

It was this characteristic that made me realize the fundamental difference between industry and research. It's one I wish I had known when I started graduate school. In short,

Industry is primarily about selling products, while research is primarily about selling stories

This is why publishing papers is telling: papers are a medium for selling stories. Of course, a good story helps sell a product, and a working product helps sell a story. So there is definite overlap. However, it's telling how well the pros and cons of industry and research derive from this basic difference.

To illustrate, consider some classic pros and cons. In industry, since you sell products, your work has direct impact on people that use the product. Since people will typically pay for this impact, the product itself is the source of funding. And if your product is sufficiently impactful, it is the source of a lot of funding, and you get rich. However, this means it's critical to quickly and consistently create marketable products. The result is a dampening effect on the problems targeted by industry: they are dictated by the market, and typically have shorter-term visions with fewer (or at least more calculated) risks.

In contrast, research has significantly more freedom in the problems it tackles. They are often longer-term, riskier visions. Research can do this because it only has to sell stories describing core ideas, not fully working products. Thus, it can focus on interesting technical problems. However, "selling" a story does not usually mean for money, but rather convincing people that it describes a good idea (e.g., getting a paper accepted to a conference). Since neither the story nor the idea generates money directly, researchers must seek out external funding such as grants, or, in industrial labs, income from products (which, to be fair, often contain the final fruits of research).

Given such pros and cons, the distinction of product vs. story seems obvious in hindsight. However, what made me first realize it was a more subtle situation. My advisor asked me to devise a data model for the system we're building. I came back with two options: a very common model, and a novel model that was simpler and more expressive. I favored the novel model, but my advisor said we should use the common one. His reason was that the data model was not our primary contribution, and papers with too many innovations can confuse readers. And he was right. Even though the novel model would make for a better system, the common model makes for a better story — and I'm currently in the business of selling stories. At some later date, after we sell our current story, we may sell another story that focuses on a new data model.

To conclude, I want to say that this isn't meant to promote either industry or research. In my particular case, I've found that I lean more towards selling products than stories. However, I've spoken with both developers and researchers, and both agree with the product vs. story differentiation, and each prefers their side. Of course, I'd love to hear from anyone else on the topic. I just think that understanding this difference is vital to making an informed decision about graduate school, and life afterwards.

Saturday, May 24, 2008

Job Decision

It's finally official. After nearly two months of applications, interviews, travel, negotiation, introspection, and extremely hard thinking, I've made a job decision. Come January next year, I'll start as a Program Manager in Microsoft's SQL Server team.

This was a very difficult decision, as I had to choose between five compelling offers. In the end, there were two primary considerations: location, and how I want to contribute to my field.

My offers spanned two locations: Microsoft in Seattle, and the others in Silicon Valley. I characterized my options as better quality of life in Seattle vs. proximity to networking and friends in Silicon Valley. Seattle's quality of life is better due to lower cost of living, much cheaper housing (I can actually afford a nice house my first year), and significantly nearer mountains. It also feels more laid-back. On the other hand, Silicon Valley hosts constant interaction between innumerable tech companies, providing excellent networking opportunities and mobility. Also, several of my friends live there.

For me, Seattle and Silicon Valley were effectively tied. However, this was a two-person decision, so Sarah joined me in visiting both places. She met and loved my Silicon Valley friends, and received a great tour of Seattle courtesy of Microsoft. Sarah sees locations differently than I do. I pick a job, and that decides the location; Sarah picks a location, then finds a job. Location is part of how she defines where she wants her life headed. As it happens, before we were engaged, she was already looking to move to the Pacific Northwest. Thus, though she liked California, and especially my friends, Washington is closer to where she wants to be. This was one consideration.

The other strong consideration clarified after many conversations with mentors. The key question is how I want to contribute to my field. One path is as a technical luminary, with primarily technical contributions. This path includes god-like developers, researchers, and other deeply technical people. My offers at IBM, Oracle, and Yahoo! followed this path. Another path is as a technical manager, with primarily leadership and strategic contributions. This path includes general managers, CEOs, and other big-picture people. My offers at Google and Microsoft followed this path. I've spent most of my life as a deep techie. However, due to some eye-opening experiences and a lot of introspection, I've decided that, at least currently, my calling is management and leadership.

Neither of these considerations alone decided me. But due to both together, plus several secondary ones, I've accepted the Microsoft offer. A couple of things in particular really impressed me about the position. First, I got to meet several team members, including my future boss, and they're all amazing. Second, Microsoft is very serious about investing in people and building careers, so the opportunities for mentorship and advancement are fantastic. I'm extremely excited, and really looking forward to starting. All that's left is to finish my doctorate!

Finally, to wrap up, I want to very sincerely thank everyone who helped me throughout this process. All of my mentors for their advice; all of my friends for their time, love, connections, and support; and all of my family for putting up with weeks of waffling (individuals may fall into more than one category). I know not everyone will be happy with my decision, but I hope you will all be happy for me. Of course, feel free to send along any particularly strong variations on "You fool!". I promise no hard feelings.

Sunday, March 30, 2008

Interviews and Conference and Travel, Oh My!

The last few weeks have been insane, and the next few weeks promise to be just as crazy. I've started my job search, which includes a lot of interviews. I've already interviewed on-site at Microsoft for two positions, and received offers from both. On Monday, I have two phone interviews with Google, again for two positions. Then, in the upcoming weeks, I have on-site interviews at Rapleaf, Oracle, IBM Almaden, and Yahoo! Research.

Just interviewing involves a lot of travelling. However, as added fun, I'll also be in Cancun for ICDE 2008, where I'll be presenting one of my papers.

So between interviews, the conference, and associated travelling, there's basically no time for posting. However, once all is over, I'll share interview resources, conference tidbits, and maybe details about the work I'm presenting. At any rate, just didn't want everyone to think I'd given up on blogging.

Wednesday, March 12, 2008

Tasty Favorites: Pedro Pasta

At the urging of a couple persuasive friends, I'd like to start sharing the results of one of my pastimes: cooking. Though I enjoy making somewhat involved meals, I also like coming up with tasty foods that encourage me to eat at home. This means quick, cheap, and with minimal clean-up.

One favorite is what Sarah calls "Pedro Pasta". It's simple, delicious, and (not counting the time to boil water) takes about five minutes.

Photos courtesy of Sarah

First, an aside: I'm a huge fan of Michael Chu's Cooking for Engineers, especially his tabular recipe notation. It's an extremely elegant and concise way of displaying recipes, so I'm borrowing it for these posts.

For this recipe, I'm not too picky about quantities. The recipe is simple enough that it's easy to pick appropriate quantities based on how many people you're serving, and taste.

Pedro Pasta (tabular recipe, flattened here into a list)

  • Water + pasta: boil, cook, drain.
  • Cherry tomatoes: cut half in half (squeeze the cut ones); combine with olive oil, minced garlic, Italian seasoning, and salt; stir.
  • Toss the drained pasta with the tomato mixture and let sit 5 min.

Steps:

  1. Boil water for pasta. I generally add salt to the water to flavor the pasta and help cook it (salt water boils hotter).
  2. While the water boils, prepare the other ingredients: cut about half of the cherry tomatoes in half, and mince the garlic. If you're using fresh herbs instead of the dried "Italian Seasoning" mix (basil, oregano, rosemary, etc.), chop up the herbs as well.
  3. Add pasta to the boiling water. I like spaghetti or angel hair, but any pasta works fine.
  4. While the pasta cooks, prepare the sauce: put olive oil in a large bowl, and squeeze into it the cherry tomatoes you cut in half. Then stir in the tomatoes (both squeezed and whole), garlic, herbs, and salt to taste. You'll want enough olive oil/tomato mixture to coat the pasta, but this is also largely a matter of taste.
  5. Once the pasta has finished cooking, put it into the bowl with the sauce, and toss it.
  6. Finally, let sit for five minutes or so. The heat from the pasta will cook the garlic, and generally help spread the flavors around.

And that's it. Serve with some grated Parmesan cheese, and you have a quick and tasty meal. Of course, some may complain that it's lacking in protein. To address this, I make an accompaniment:

Pedro Pasta Accompaniment (tabular recipe, flattened here into a list)

  • Garlic olive oil: heat (med-low).
  • Extra firm tofu (cubed) + garlic (minced): lightly sautee in the oil.
  • Salt, Italian seasoning, lemon pepper: add, then sautee.
  • Red wine: add a splash, then cook (med).

Steps:

  1. Heat some garlic olive oil over medium-low heat (regular olive oil works fine if you don't have the garlicky variety).
  2. Add cubed extra firm tofu and minced garlic, then sautee lightly, just until the tofu starts to get a bit golden.
  3. Add salt, Italian seasoning, and lemon pepper to taste (you can buy lemon pepper at just about any grocery store). Continue sauteeing until the tofu is nice and golden.
  4. Add a splash of red wine, then turn the heat up to medium and cook until the tofu browns (or however you like it).

Once finished, just throw it in with the Pedro Pasta.

If you're not vegetarian (Sarah is, but I'm not), you can use chicken instead of tofu. The steps are the same, except replace tofu with chicken, and red wine with white. Also, make sure the chicken cooks thoroughly, which may require a slightly higher temperature in the early steps.

Thursday, February 28, 2008

Google Sites, MS SharePoint...Creating Communities

Google recently announced a new product, called Google Sites. The basic idea is that it lets you gather and share data (e.g., Google Apps documents, files, free-form wiki-style pages) pertaining to a particular purpose (e.g., business, team, project). Inevitably, Sites is being compared to Microsoft SharePoint, which addresses a similar need.

What's fascinating to me about Sites and SharePoint is how they relate to my research on community information management. Briefly, a community has a shared topic or purpose about which it has data, such as web sites, mailing lists, and documents. This data is often unstructured. I research systems that process this unstructured community data to extract structured information about entities and relationships, then provide structured services beyond keyword search (e.g., querying, browsing, monitoring). I've built a very alpha prototype system, DBLife, for the database research community, and also published some papers on the topic.

How Sites and SharePoint relate is very exciting: they build communities. Currently, my research helps builders select web pages and other data sources. However, with a Microsoft SharePoint installation or a Google Site, the data is already there. And the benefit to users is palpable. Instead of only keyword search, users would have powerful structured access methods, making the application much more useful. Truly, it'd be very exciting to see something like that happen.

Wednesday, February 27, 2008

Bash Command-line Programming: Flow Control

Time for another post on handy techniques for command-line bash programming. This post covers some useful command-line techniques for flow control.

Even when writing quick programs on the command-line, I often need to branch or loop. Especially loop, as I often need to do something over every file in a directory, or every line in a file. Below are some techniques I commonly use. For more neat bash-isms, check out the Bash FAQ.

  • cmd && trueCmd || falseCmd: if cmd executes successfully, run trueCmd, else run falseCmd. This is a pithy version of
    if cmd; then trueCmd; else falseCmd; fi
    (with one caveat: falseCmd also runs if trueCmd itself fails, which is usually fine for quick one-liners).
  • while cmd; do stuff; done: execute stuff while cmd executes successfully. Use
    while true; do stuff; done
    for an infinite loop.
  • for W in words; do stuff; done: sets the variable $W to each word in words, then executes stuff. For instance, to run foo on every text file in a directory tree, use
    for W in $(find . -name '*.txt'); do foo "$W"; done
    Note that words are split automatically based on whitespace. This means that filenames with spaces will be split into multiple words (I know find has the -exec option, but it can be cumbersome, and this is just an example). To avoid splitting on whitespace, see the next tip.
    Edit: Originally, my example used /bin/ls *.txt rather than find. However, as HorsePunchKid points out,
    for W in *.txt; do foo "$W"; done
    works on filenames with whitespace (and is also cleaner). This is an excellent point, but the only expansion done after word splitting is pathname expansion, so it applies only to file globs. If you're processing the output of a command, or the contents of an environment variable, then you'll still have a word splitting problem (see the short sketch after this list).
  • while read L; do stuff; done: sets the variable $L to each line in stdin, then executes stuff. Use this to handle input with spaces. For example, to run foo on every text file in a directory tree, including those with spaces, use
    find . -name '*.txt' | while read L; do foo "$L"; done
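
As promised above, a short sketch of the word splitting problem with variable contents (FILES and the filenames are made up; "b c.txt" is meant to be a single filename containing a space):

FILES='a.txt b c.txt'
for W in $FILES; do echo "[$W]"; done    # prints [a.txt], [b], [c.txt]: three words
for W in *.txt; do echo "[$W]"; done     # globbing (if these files exist) keeps "b c.txt" as one word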

The last tip has a caveat: a piped command executes in a subshell with its own scope. Thus, if I use cmd | while read L; do stuff; done, variables set in stuff are not available outside of the loop. For example, if I want to run foo on every text file, then print how many times foo succeeded, I could try this:

I=0
find . -name '*.txt' | while read L; do foo "$L" && let I++; done
echo $I

However, this prints 0. The reason is that $I outside the pipe is a different variable than $I inside the pipe. To fix this, avoid a pipe using a trick from my earlier post:

I=0
while read L; do foo "$L" && let I++; done < <(find . -name '*.txt')
echo $I

Friday, February 22, 2008

Managing papers with GMail

As a graduate student, I read a lot of papers. Then, I often want to write notes about these papers, categorize them, find them quickly, etc. However, despite being a common problem for graduate students (or anyone else keeping track of documents), there are few free solutions that are any good. Thus, I rolled my own using GMail.

Available Solutions are Limited

Unfortunately, there aren't many free solutions for managing papers. In fact, the only decent one I've found is Richard Cameron's CiteULike. CiteULike provides all the necessities: online storage, tagging, metadata search, and note taking. It also has two other draws: one-click paper bookmarking from supported sites, and social features for sharing and collaboration.

However, CiteULike has a deal-breaker for me: its search capabilities are very limited. It provides keyword search only over paper titles, author last names, venues, and a part of the abstract (to the best of my knowledge, since it doesn't list what it searches). It does not search the paper's full text, or even your notes. This can make finding papers based on vaguely remembered information very difficult.

Using GMail to Manage Papers

To address CiteULike's limited search, I decided to manage papers with GMail. The basic idea is that I keep each paper and its notes in an email thread. Then, further notes are replies to the thread. This supports writing richly formatted notes, as well as GMail's search over each paper's full text and any notes I've written.

Below, I describe the steps to set up the solution, add a new paper, take notes on a paper, and find a paper I've read. Finally, I compare the advantages of using this solution to using CiteULike.

Setup

Setup is trivial, consisting of creating a new gmail account for storing papers. I'll refer to this account as papers@gmail.com.

Adding a New Paper

After creating the account, I add new papers by sending email to papers@gmail.com. To ease finding the paper later, I use the following steps, which take only a minute or so:

  1. Start a new email to papers@gmail.com. Then, fill in the paper information. The key is to put the paper title as the subject, and include the author name, venue, and any other metadata you may want to search for later.
  2. Attach the PDF or PS file of the paper to the email.
  3. Send the email. Since it's to me, it will appear in my inbox.
  4. Respond to the email with the full text of the paper (if necessary, delete any other text first). To get the text from the PS or PDF file, I use the pstotext or ps2ascii Linux programs. The xclip program is handy for putting the text in the clipboard, from which I paste it into the response.

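As a sketch of step 4 on Linux (paper.pdf is a placeholder filename; ps2ascii ships with ghostscript, and xclip is in most distributions' repositories):

ps2ascii paper.pdf | xclip -selection clipboard
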
These steps accomplish three things. First, they store the PDF or PS of the paper in GMail. Second, they make the paper's full text searchable. Finally, they put the paper's author, venue, year, and other important data in an email with an attachment. This last is important because it lets me search over just this information by restricting the search to emails with attachments (see the section on finding papers below).

Taking Notes

Papers I have added but not finished reading are in my inbox. As I read a paper, I add notes by replying to the conversation from within GMail (first deleting the quoted text). Thus, my notes can use GMail's rich text features, such as lists and bolding.

Once I finish reading a paper, I tag the conversation with appropriate tags. Finally, I archive the conversation.

Finding a Paper

To find a paper, I use GMail's search functionality. This searches the full paper text and all notes, and supports searching on tags and dates. Furthermore, due to how I add papers, I can find paper titles by restricting the search to email subjects, or restrict it to emails with attachments to find author names, venues, and other information in the first email of each paper.
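
For example, searches along these lines work well (the query terms and labels are made up; the operators are standard GMail ones):

    subject:(data integration)         (searches only paper titles)
    has:attachment smith vldb          (searches the first email's author/venue information)
    label:to-read before:2008/01/01    (combines tags and dates)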

Comparing Solutions

Given the above procedure, GMail can compete with CiteULike as a system for managing papers. However, though better in some ways, it is also limited in others.

Specifically, my solution has these limitations:

  • No one-click adding of papers from supported sites.
  • No automatic BibTex generation. However, though not quite as good, BibTex entries from Citeseer, Google Scholar, or other sites can still be saved as notes.
  • Can't easily edit existing notes. Instead, must copy and paste the old note into a new note, then delete the original.
  • No social or community features, such as sharing papers.

However, my solution has these advantages:

  • Can search the full text of papers and notes.
  • Supports more sophisticated searches, including dates.
  • Richly formatted notes, and a nice interface for writing and reading them.
  • Can easily print or forward one or all notes about a paper (tip: before printing/forwarding all notes, delete the note containing the full paper text, then restore it afterwards).

Depending on what advantages are more important to you, it may be worth giving this a try.

Monday, February 11, 2008

Bash Command-line Programming: Redirection

Besides being a shell, bash is also a pretty handy programming language. One way to use it is writing scripts. However, another use is writing ad-hoc, one-time-use programs, for very specific tasks, right on the command line. I do this a lot, and find myself using the same techniques over and over.

In this post, I'll share some useful command-line techniques for redirection.

There are many ways other than pipes for redirecting stdin and stdout:

  • cmd &>file: send both stdout and stderr of cmd to file. Equivalent to cmd >file 2>&1.
  • cmd <file: pipes the contents of file into cmd. Similar to cat file | cmd, except that while pipes execute in a subshell with their own scope, this keeps everything in the same scope.
  • cmd <<<word: expands word and pipes it into cmd. word can be anything you'd type as a program argument. For example, cmd <<<$VAR pipes the value of $VAR into cmd.

Also, sometimes programs need arguments on the command line, rather than through stdin:

  • cmd $(<file): expands the contents of file as arguments to cmd. For example, if the file toRemove contains a list of files, rm $(<toRemove) removes those files.
  • cmd1 <(cmd2): creates a temporary file containing the output of cmd2, then puts the name of that file as an argument to cmd1. This is handy when cmd1 expects filename arguments. For example, to see the difference between the contents of directories dir1 and dir2, use diff <(ls dir1) <(ls dir2). This is conceptually equivalent to
    ls dir1 >/tmp/contentsDir1
    ls dir2 >/tmp/contentsDir2
    diff /tmp/contentsDir1 /tmp/contentsDir2
    rm /tmp/contentsDir1 /tmp/contentsDir2
    
    (only conceptually, though, since it actually uses fifos). For another handy command for this, check out comm.

Finally, you sometimes want to redirect to and from multiple programs at once:

  • { cmd1; cmd2; cmd3; } | cmd: pipes the output of cmd1, cmd2, and cmd3 to cmd (note the space after the opening brace; bash requires it).
  • cmd | tee >(cmd1) >(cmd2) >(cmd3) >/dev/null: pipes the output of cmd to cmd1, cmd2, and cmd3 in parallel (a concrete sketch follows below). This trick is a tweak on one I ran across elsewhere. In the same way <(cmd) is replaced with a file containing the stdout of cmd, >(cmd) is replaced with a file that becomes the stdin of cmd. Since tee writes its stdin to each given file, you can combine it with >(cmd) to send the output of one command to the stdin of many. The final >/dev/null discards the stdout of tee, which we no longer need. Doesn't come up too often, but it's certainly neat.
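
For instance, a small sketch of the tee trick (notes.txt is a placeholder file):

    cat notes.txt | tee >(wc -l > lines.txt) >(wc -w > words.txt) >(wc -c > chars.txt) >/dev/null

This reads notes.txt once, but computes the line, word, and character counts in parallel, leaving the results in three separate files.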

Monday, January 28, 2008

A new lifestyle: losing 85 lbs in 7 months

Back in late June/early July, I found that I weighed 300 lbs. For the first time, I had to buy a pair of pants at a "big men" store: size 46 waist. I'm a big guy. I'm almost 6'3", and have broad shoulders. But this was ridiculous. I decided to do something.

I had already tried Body for Life, with which I lost (then regained) 30lbs a couple of times. I also ran into some blog posts by Tim Ferris. These shared much with Body for Life, but were new enough that I tried them for a few weeks. Unfortunately, they didn't work well for me. The problem was that I was on a diet, and what I really needed was a new lifestyle.

7 months: -85lbs, -10 inches, same jeans

For me, the difference between a diet and a lifestyle is that a diet is temporary, while a lifestyle is something you can continue forever. A diet is something you can't wait to be done with, while a lifestyle is your daily routine.

I learned two key lessons while finding a good lifestyle:

  1. You must want to be fit more than you want to binge or not sweat. A healthy lifestyle will involve some moderation in eating, and some exercise. So, if binging and not sweating are more important to you than being fit, you'll have a hard time even starting.
  2. You can't resent your lifestyle. This is crucial, even if it means a lifestyle that's somewhat less effective. Simply put, if you resent it, eventually you won't do it.

Since I quite love stuffing myself, and I hate exercise, one would think I'd resent any lifestyle that demanded otherwise. However, I found that if I keep such sacrifices reasonable, the resulting weight loss makes up for them. In other words, I don't resent moderate exercise so much when the pounds are coming off.

Now I'll describe the lifestyle I settled on after a few weeks of experimentation. I've maintained it with little effort for 7 months, during which I've gone from 300 lbs and size 46 pants to 215 lbs and size 36 (the pictures above show me then and now, wearing the same jeans). Most importantly, it's easy enough to maintain that I see no reason to ever stop. It comprises two aspects: eating right and exercising.

Eating Right

My goal is to keep my metabolism high, and avoid foods that quickly become fat. Thus, I follow these guidelines:

  • Eat four to six small meals a day, every 3 to 4 hours (I try to leave feeling slightly hungry; I'll eat again soon enough)
  • Eat more protein than carbohydrates.
  • Drink lots of water (I drink multiple cups with each meal, and between meals).
  • Avoid carbohydrates after lunch.
  • Avoid fatty foods, such as dairy, certain cuts of meat, etc.
  • No empty calories, such as desserts or non-diet soda.

Then, there are the two most important guidelines of all:

  • Once a week, eat whatever and however much you want. Indian buffets, pizza, ice cream, pasta...anything. Partly, this actually helps you lose weight because it convinces your body there's no famine. But more importantly, it lets you address any cravings that build during the week. My free day is Saturday, and I look forward to it all week.
  • If a rare opportunity for food comes along, it's okay to take it. Opportunities like parties, or travel (I feasted for a whole week while in Vienna). And don't feel bad, or give up your weekly binge for it. Sure, it's sub-optimal. But what's even less optimal is feeling so restricted you give up on your lifestyle entirely. Just make sure it happens rarely.

These last two are the most important guidelines because they keep away that fatal resentment.

Given these guidelines, my typical meals are as follows:

Breakfast: 9:00am

  • 1/4 cup low fat cottage cheese, or
  • 1 egg, fried in olive oil or hard-boiled

I prefer quick breakfasts, and I'm not hungry in the morning. Thus, a typical breakfast is 1/4 cup of low fat cottage cheese (0.75g saturated fat, 2.5g carbs, 6g protein), with some freshly ground black pepper for seasoning. Another option is an egg fried in olive oil, or hard-boiled.

Lunch: 12:30pm

  • Moderate portion at restaurant, with vegetables and as much protein as carbs

Since it's still early in the day and my metabolism is going, I eat more for lunch than other meals, including more carbs. This makes lunch my most free meal. Anything with moderate portions, vegetables, and low-fat carbs and protein is good. The key is to eat as much protein as carbs, and to leave a bit hungry.

Typical lunch, from a Peruvian restaurant

There are several restaurants near my workplace that I frequent for lunch. I like almost any special at a nearby Chinese restaurant, though I eat only half the rice. There are Peruvian, Korean, and African restaurants with perfect meals (Madison is great for restaurants). There's also a sports bar with a good chicken salad, and a burger joint with lean bison and ostrich burgers. Again, the key is low fat, as much protein as carbs, and leaving a bit hungry. Given this, it's easy to find good lunches almost anywhere.

Snack: 4:00pm

  • 3 or 4 pieces of beef jerky, or
  • 1/3 of a high protein meal replacement bar
1/3 of a high-protein meal replacement bar for a snack

Three or four hours later, I have a snack. This is a very light meal, about 100 calories, and protein-heavy. One good snack is three or four pieces of beef jerky (0g saturated fat, 5g carbs, 11g protein). Since I get tired of jerky, I've started alternating with high-protein meal replacement bars from nutrition stores like GNC. I look for low fat and few net carbs, then eat about a third of a bar (1/3 of a Myoplex Carb Control bar is 90 calories, 1.5g saturated fat, 1.3g net carbs, 8.3g protein). The bars aren't as ideal as jerky, but they're nice variety and sweet, which helps to avoid resentment.

Dinner: 7:30pm

  • 1 cup of mixed vegetables and protein

I eat dinner after getting home and exercising. Dinner isn't as light as my breakfast or snack, but it's much lighter than lunch. I typically have 1 cup of mixed vegetables and protein. For example, my last few dinners were tofu stir-fry. Other common dinners are seasoned ground beef with peas or broccoli, or leftovers from a Mexican restaurant (no carbs, so a meat dish in sauce, like pollo ranchero or steak in green chile sauce)

Light dinner: tofu stir fry

Since dinner is small, I typically cook a large portion, then have a little each evening. This applies equally well to take-out: a dinner portion at a Mexican restaurant serves for three or four dinners.

Dinner is usually my last meal. However, if I'm going to be up past midnight, I sometimes have another snack around 11:00pm.

Exercising

I detest exercising. I hate it beforehand, during, and afterwards. Thus, to keep resentment low, I minimize exercise time: 20-30 minutes per day, 5-6 days a week. I don't exercise on Saturday, my free day, and I only exercise on Sunday if I feel inspired. However, when I exercise, it's with high intensity.

I do both cardio and resistance training: on Mondays, Wednesdays, and Fridays, I use a stationary bike or run; on Tuesdays and Thursdays, I lift weights. Since I hate exercising, if I have an excuse, I'll eventually take it. To combat this, I bought a stationary bicycle, a weight bench, and a dumbbell set. This minimizes time overhead, since I can exercise at home.

Below I describe my typical workouts. They're all short, but high-intensity.

Cardio: Stationary Bike

I use a stationary bicycle for cardio, largely because they're relatively inexpensive and fit easily in my apartment. However, my workout works well with any aerobic exercise with variable intensity. It is a variation on the Body for Life cardio workout.

The workout comprises four parts:

  • Low intensity warm-up: 1 minute
  • 3 sets of
    • Mid intensity: 3 minutes
    • High intensity: 3 minutes
  • Very high intensity sprint: 1 minute
  • Low intensity cool-down: 1 minute

This is a tougher workout than it sounds. I push so that I don't think I'll be able to finish the last set. Then, when I do, I push one more minute of all-out sprinting. I am definitely sweaty and exhausted by the end. However, since it only lasts 21 minutes, it's over quickly.

Resistance: Bench and Dumbbells

Muscle burns fat, so more muscle burns more fat. It also makes you stronger and look better. Thus, resistance training is just as important as cardio. It doesn't take much equipment, either: my whole workout uses only one pair of dumbbells with changeable weights, and a weight bench.

For each muscle group I exercise, I do three sets of six repetitions, with 30-60 seconds of rest between sets. Then, without rest after the third set, I do a fourth set of six reps, but of a different exercise for the same muscle. For instance, for pectorals, I might do a set of fly, rest, another fly set, rest, a third fly set, and immediately after a set of bench press. The key is to exhaust the muscle; if I can finish the fourth set a couple workouts in a row, then I increase the weight.

I concentrate on my upper body. I know this is sub-optimal, since it ignores the big muscles in my legs. However, I really don't like lower body exercises, and they were making me dread my workout. Thus, to minimize resentment, I left my lower body to the stationary bicycle.

I exercise four muscle groups: chest, biceps, triceps, and shoulders. For each group, I pick a primary and secondary exercise, then swap every few months. These are the exercises I like:

Results

So far, I've stuck to my new lifestyle for 7 months, without much effort. I've lost 85 pounds, and 10 inches off my waist. I've also gained noticeable muscle. I feel much healthier: I've got more energy, and can participate in many more activities. I've been told I even move much differently. Given all this, I'd say my new lifestyle is a success, and I plan to continue it indefinitely.

Wednesday, January 23, 2008

Dotfile replication with subversion

Whenever I got a new *nix account, the first thing I did was copy over dotfiles (.bashrc, .vimrc, .screenrc, etc.). Managing these copies was a pain, since I had to manually replicate changes. For example, if I thought of something clever to add to .vimrc, I had to manually add it to .vimrc on every account. Of course, this meant my dotfiles got hopelessly out of sync. This in turn made copying dotfiles to a new account even more painful, since I had to remember which copies were the latest.

To address this annoyance, I now keep all my dotfiles in a subversion repository. I assume by now everyone knows of subversion. If not, in short, it's a version control system, like CVS. If you don't know what that is either, chances are you're not worried about managing dotfiles.

The process was actually pretty simple, comprising only four steps (I'll assume basic familiarity with subversion, or at least a willingness to read the subversion handbook):

1. Create the subversion repository

The subversion handbook covers both creating subversion repositories and serving them. I called my repository "dotfiles".

2. Create the .dotfiles/ directory

I'll assume your most recent dotfiles are all on one account. First, on this account, create a .dotfiles/ directory in your home directory. Then, copy the following shell script to .dotfiles/linkToDotFiles.sh:

~/.dotfiles/linkToDotFiles.sh
#!/bin/sh

SCRIPT_DIR=$(echo $0 | sed -e 's/\(.*\)\/.*/\1/')
SCRIPT_FILE=$(echo $0 | sed -e 's/.*\///')
[ "$SCRIPT_DIR" = "." ] && SCRIPT_DIR=$(pwd)

for F in $(/bin/ls $SCRIPT_DIR | grep -v $SCRIPT_FILE)
do
   echo -n "$SCRIPT_DIR/$F: "
   if [ -e "$HOME/.$F" ]; then
       echo "~/.$F already exists, skipping."
   else
       if ln -s "$SCRIPT_DIR/$F" "$HOME/.$F"; then
           echo "linked to ~/.$F."
       fi
   fi
done

We'll use this script shortly to create symlinks (i.e., shortcuts) from your home directory to dotfiles in the .dotfiles/.

Finally, import .dotfiles/ into your subversion repository.
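
For example, assuming the repository is served at svn://myserver/dotfiles (a made-up URL):

svn import ~/.dotfiles svn://myserver/dotfiles -m "Initial import of dotfiles"

Note that svn import does not turn ~/.dotfiles into a working copy; the simplest fix is to move it aside and check the repository back out in its place.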

3. Add dotfiles to .dotfiles/ and the repository

First, copy any dotfiles you want to replicate to other accounts into .dotfiles/, renaming them to remove the leading dot. For example, copy .vimrc to .dotfiles/vimrc. You can do the same for configuration directories, like .vim/. Then, add and commit these additions to the repository.
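
For instance, continuing the vimrc example (assuming ~/.dotfiles is now a working copy):

cd ~/.dotfiles
cp ~/.vimrc vimrc
svn add vimrc
svn commit -m "Add vimrc"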

4. Replicate dotfiles to a new account

First, in the home directory of a new account (not the one where you created the .dotfiles/), checkout your repository. This will create a .dotfiles/ directory in the new account.
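
For example, again using the made-up URL from above:

cd ~
svn checkout svn://myserver/dotfiles .dotfiles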

Then, run the linkToDotFiles.sh script by using this command from your home directory: bash .dotfiles/linkToDotFiles.sh. This creates symlinks in your home directory to files in .dotfiles/. For instance, it will create a symlink from ~/.vimrc to ~/.dotfiles/vimrc.

That's pretty much it. You can add new dotfiles by repeating step 3, and you can modify existing dotfiles and commit them from within .dotfiles/. If you commit changes, you can update .dotfiles/ in each account, which will preserve local, uncommitted modifications.

Friday, January 18, 2008

mediactrl: controlling laptop media buttons

My laptop media buttons don't do quite what I want. Specifically, I want

  • Volume changes to affect multiple channels.
  • A sound to play when I change volume.
  • Play/pause, forward, rewind, and stop to control whatever application I'm using, not just always the same one (e.g., pause the video if I'm watching a video, but the music if I'm listening to music).

So I first wrote a script for volume changing and media control, then told KDE to call it when I press media buttons.

Modifying multiple channels and playing a sound on volume changes is trivial. Controlling multiple media players is trickier. A heuristic that works well is to have a list of players that are controlled in order. In my case, I first list my video players, kaffeine and kplayer, then my music player, amarok. Thus, since amarok is always running, the media buttons control it by default. However, when I watch a video with kplayer, the media buttons control it instead because I listed it before amarok.

The player list, and other parameters, are in a config file:

/etc/mediactrl.conf
################################################################################
# Volume settings
################################################################################

# What percent to step on volume up/down
VOL_STEP_PERCENT=5

# The sound to play when volume is changed
VOL_NOTIFICATION_SOUND="/usr/share/sounds/KDE_Click.wav"

# The player to use when volume is changed (default is /usr/bin/aplay)
#VOL_NOTIFICATION_PLAYER="/usr/bin/artsplay"

# Comma-separated list of channels to alter (default is "PCM")
VOL_CHANNELS="Master, Master Mono, Headphone, PCM"


################################################################################
# Media player settings
################################################################################

# Comma-separated list of media players in the order in which they should be
# checked. (these need to be the actual command name). For each of these, the
# command to be run when play/pause, stop, prev, and next are pressed should be
# specified. Examples are below.
MEDIA_PLAYERS="kaffeine, kplayer, amarok"

kaffeine_playpause='dcop kaffeine KaffeineIface pause'
kaffeine_stop="dcop kaffeine KaffeineIface stop"
kaffeine_prev="dcop kaffeine KaffeineIface posMinus"
kaffeine_next="dcop kaffeine KaffeineIface posPlus"

kplayer_playpause="dcop kplayer kplayer-mainwindow#1 activateAction player_pause"
kplayer_stop="dcop kplayer kplayer-mainwindow#1 activateAction player_stop"
kplayer_prev="dcop kplayer kplayer-mainwindow#1 activateAction player_backward"
kplayer_next="dcop kplayer kplayer-mainwindow#1 activateAction player_forward"

amarok_playpause="dcop amarok player playPause"
amarok_stop="dcop amarok player stop"
amarok_prev="dcop amarok player prev"
amarok_next="dcop amarok player next"

For each player, the config specifies commands to execute for each media button. Since I happen to use all KDE apps, these commands are calls to dcop. The config should be placed in /etc/mediactrl.conf.

The script itself is below:

mediactrl
#!/bin/bash

function usage() {
  echo "Usage: mediactrl "
  echo "  playpause    - toggles play/pause on active media player"
  echo "  stop         - stops play on active media player"
  echo "  prev         - goes to previous on active media player"
  echo "  next         - goes to next on active media player"
  echo "  volume  - alters volume.  is up, down, mute, unmute, or toggle"
  exit
}

if [ ! -f /etc/mediactrl.conf -o ! -r /etc/mediactrl.conf ]; then
   echo "Cannot read /etc/mediactrl.conf!" >&2
   exit 1
fi
. /etc/mediactrl.conf

case "$1" in
   playpause|stop|prev|next)
       echo $MEDIA_PLAYERS | perl -ne 's/,\s*/\n/g; print' | while read PLAYER; do
           if pgrep -u "$USER" "$PLAYER" > /dev/null; then
               eval CMD="\${${PLAYER}_${1}}"
               [ -n "$CMD" ] && eval "$CMD"
               break
           fi
       done
       ;;
   volume)
       case "$2" in
           up|down)
               echo ${VOL_CHANNELS:=PCM} | perl -ne 's/,/\n/g; print' | while read CHANNEL; do
                   if [ -z "$VOL" ]; then
                       VOL=$(amixer get "$CHANNEL" | grep % | head -n 1 | sed -e 's/.*\[\(.*\)%\].*/\1/')
                       PLUS_MINUS=$(if [ "$2" = "up" ]; then echo "+"; else echo "-"; fi)
                       VOL=$((VOL $PLUS_MINUS ${VOL_STEP_PERCENT:=5}))
                       [ $VOL -lt 0 ] && VOL=0
                       amixer set "$CHANNEL" "$VOL%" > /dev/null
                       dcop kded kmilod displayProgress "Volume " "$VOL"
                   else
                       amixer set "$CHANNEL" "$VOL%" > /dev/null
                   fi
               done
               ;;
           mute|unmute|toggle)
               echo ${VOL_CHANNELS:=PCM} | perl -ne 's/,/\n/g; print' | while read CHANNEL; do
                   amixer set "$CHANNEL" "$2" > /dev/null
               done

               CHANNEL=$(echo $VOL_CHANNELS | sed -e 's/,.*$//')
               MUTE=$(amixer get "$CHANNEL" | grep % | head -n 1 | sed -e 's/.*\[\(.*\)\]/\1/')
               dcop kded kmilod displayText "Sound $MUTE"
               ;;
           *) usage;;
       esac

       [ -n "$VOL_NOTIFICATION_SOUND" ] && eval "${VOL_NOTIFICATION_PLAYER:=/usr/bin/aplay}" "$VOL_NOTIFICATION_SOUND" 1>/dev/null 2>&1
       ;;
   *) usage ;;
esac

The bit of craziness with setting volumes ensures all channels are truly set to the same level.

Note that this script was written for a laptop running Kubuntu, and may need tweaking for other settings. For example, it uses kmilo to display a volume bar, which will not work unless the kmilo service is running.

Once this script is executable (chmod u+x mediactrl), I map it to the laptop media buttons through the KDE Accessibility settings (K Menu -> System Settings -> Accessibility -> Input Actions). Voila! More intelligent media buttons.

Monday, January 14, 2008

Gentoo to Ubuntu, or bottom-up to top-down

After many years of using Gentoo and proselytizing it to others, last week I converted to Ubuntu (Kubuntu, to be precise).

This transition has made me think about how I use computers. Thanks to years of *nixes before user-friendly distros like Ubuntu, my understanding is bottom-up: I start with low-level functionality, then build higher-level abstractions. This is in contrast to most users' top-down understanding: start with provided high-level abstractions, then dig deeper if necessary.

To illustrate the difference, consider connecting to a wireless network. I know that I first bring up the network interface with ifconfig, then assign an ESSID and WEP key with iwconfig, then finally get an IP address with dhcpcd. I can execute these low-level steps by hand. However, to make life easier, I create a script to execute them. I can then extend this script to pick one of a set of networks, run a VPN program after connecting to a particular network, etc. This script is a higher-level abstraction that hides low-level details.
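
For instance, a minimal sketch of such a script (the interface name, ESSID, and WEP key are placeholders):

#!/bin/sh
# Bring up the interface, associate with the network, then request an address.
IFACE=eth1
ifconfig "$IFACE" up
iwconfig "$IFACE" essid "mynetwork" key "s:mysecretkey"
dhcpcd "$IFACE"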

In contrast, a typical user has a top-down understanding; they start with the high-level abstraction. For example, in Kubuntu, the knetworkmanager program puts a nice icon in the system tray from which users manage their networks. The user never needs to know about ifconfig, iwconfig, or dhcpcd. Until something breaks and they have to dig deeper, anyway.

The top-down approach is vital to making computers usable to laymen. For years, Linux effectively required a bottom-up approach, which is one reason it was so inaccessible. This is why it's good to see new distros, like Ubuntu, moving toward a top-down approach.

In keeping with this spirit, I'm trying to use Ubuntu in a top-down manner. In other words, I'm trying to use only the high-level abstractions it provides, without digging into low-level programs and configuration files to build my own. This has worked well enough that I'm recommending Ubuntu to my non-Linux friends. Still, I miss the power and configurability of the bottom-up approach. Which is why you'll periodically see some of my nicer abstractions on this blog.

And so it begins

A conversation at Del's with friends Steve and Jay convinced me to find a place to collect and share random thoughts, ideas, and things. Here it is.

I've got a few things saved up, which I'll post by and by.

Also, as a side note: don't expect this blog to look pretty, barring someone else's template. I'm very left-brained, and though I can recognize prettiness, I have a hard time creating it.