Thursday, February 28, 2008

Google Sites, MS SharePoint...Creating Communities

Google recently announced a new product, called Google Sites. The basic idea is that it lets you gather and share data (e.g., Google Apps documents, files, free-form wiki-style pages) pertaining to a particular purpose (e.g., business, team, project). Inevitably, Sites is being compared to Microsoft SharePoint, which addresses a similar need.

What's fascinating to me about Sites and SharePoint is how they relate to my research on community information management. Briefly, a community has a shared topic or purpose about which it has data, such as web sites, mailing lists, and documents. This data is often unstructured. I research systems that process this unstructured community data to extract structured information about entities and relationships, then provide structured services beyond keyword search (e.g., querying, browsing, monitoring). I've built a very alpha prototype system, DBLife, for the database research community, and also published some papers on the topic.

Here is why Sites and SharePoint are so exciting: they build communities. Currently, my research relies on builders to select web pages and other data sources. By integrating with a Microsoft SharePoint installation or a Google Site, however, the data would already be there. And the benefit to users is palpable: instead of only keyword search, users would have powerful structured access methods, making the application much more useful. Truly, it'd be very exciting to see something like that happen.

Wednesday, February 27, 2008

Bash Command-line Programming: Flow Control

Time for another post on handy techniques for command-line bash programming. This one covers flow control.

Even when writing quick programs on the command-line, I often need to branch or loop. Especially loop, as I often need to do something over every file in a directory, or every line in a file. Below are some techniques I commonly use. For more neat bash-isms, check out the Bash FAQ.

  • cmd && trueCmd || falseCmd: if cmd executes successfully, run trueCmd, else run falseCmd. This is a pithy version of
    if cmd; then trueCmd; else falseCmd; fi
    with one caveat: if trueCmd itself fails, falseCmd also runs.
  • while cmd; do stuff; done: execute stuff while cmd executes successfully. Use
    while true; do stuff; done
    for an infinite loop.
  • for W in words; do stuff; done: sets the variable $W to each word in words, then executes stuff. For instance, to run foo on every text file in a directory tree, use
    for W in $(find . -name '*.txt'); do foo "$W"; done
    Note that words are split automatically based on whitespace. This means that filenames with spaces will be split into multiple words (I know find has the -exec option, but it can be cumbersome, and this is just an example). To avoid splitting on whitespace, see the next tip.
    Edit: Originally, my example used /bin/ls *.txt rather than find. However, as HorsePunchKid points out,
    for W in *.txt; do foo "$W"; done
    works on filenames with whitespace (and is also cleaner). This is an excellent point, but the only expansion done after word splitting is pathname expansion, so it applies only to file globs. If you're processing the output of a command, or the contents of an environment variable, then you'll still have a word splitting problem.
  • while read L; do stuff; done: sets the variable $L to each line of stdin, then executes stuff. Use this to handle input with spaces. For example, to run foo on every text file in a directory tree, including those with spaces, use
    find . -name '*.txt' | while read L; do foo "$L"; done
    (see the more defensive variant just after this list).
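
One side note on read: by default it strips leading whitespace and treats backslashes as escape characters. Here's a more defensive sketch of the last example, assuming your find supports -print0 (GNU find does); it handles any legal filename, even one containing newlines:

    find . -name '*.txt' -print0 | while IFS= read -r -d '' L; do foo "$L"; done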

The last tip has a caveat: a piped command executes in a subshell with its own scope. Thus, if I use cmd | while read L; do stuff; done, variables set in stuff are not available outside of the loop. For example, if I want to run foo on every text file, then print how many times foo succeeded, I could try this:

I=0
find . -name '*.txt' | while read L; do foo "$L" && let I++; done
echo $I

However, this prints 0. The reason is that the $I outside the pipe is a different variable from the $I inside the pipe. To fix this, avoid the pipe using a trick from my earlier post:

I=0
while read L; do foo "$L" && let I++; done < <(find . -name '*.txt')
echo $I
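
Alternatively, when I only need the count for a final command or two, I can group those commands into the same subshell with braces (note the space after the opening brace; bash requires it):

I=0
find . -name '*.txt' | { while read L; do foo "$L" && let I++; done; echo $I; }

Here the echo runs in the same subshell as the loop, so it sees the updated $I.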

Friday, February 22, 2008

Managing papers with GMail

As a graduate student, I read a lot of papers. I often want to write notes about these papers, categorize them, find them quickly, and so on. Despite this being a common problem for graduate students (or anyone else keeping track of documents), there are few free solutions that are any good. Thus, I rolled my own using GMail.

Available Solutions Are Limited

Unfortunately, there aren't many free solutions for managing papers. In fact, the only decent one I've found is Richard Cameron's CiteULike. CiteULike provides all the necessities: online storage, tagging, metadata search, and note taking. It also has two other draws: one-click paper bookmarking from supported sites, and social features for sharing and collaboration.

However, CiteULike has a deal-breaker for me: its search capabilities are very limited. It provides keyword search only over paper titles, author last names, venues, and a part of the abstract (to the best of my knowledge, since it doesn't list what it searches). It does not search the paper's full text, or even your notes. This can make finding papers based on vaguely remembered information very difficult.

Using GMail to Manage Papers

To address CiteULike's limited search, I decided to manage papers with GMail. The basic idea is that I keep each paper in its own email thread, with further notes added as replies to that thread. This lets me write richly formatted notes, and it puts each paper's full text and all my notes within reach of GMail's search.

Below, I describe the steps to set up the solution, add a new paper, take notes on a paper, and find a paper I've read. Finally, I compare the advantages of using this solution to using CiteULike.

Setup

Setup is trivial: just create a new GMail account for storing papers. I'll refer to this account as papers@gmail.com.

Adding a New Paper

After creating the account, I add new papers by sending email to papers@gmail.com. To ease finding the paper later, I use the following steps, which take only a minute or so:

  1. Start a new email to papers@gmail.com. Then, fill in the paper information. The key is to put the paper title as the subject, and include the author name, venue, and any other metadata you may want to search for later.
  2. Attach the PDF or PS file of the paper to the email.
  3. Send the email. Since it's to me, it will appear in my inbox.
  4. Respond to the email with the full text of the paper (if necessary, delete any other text first). To get the text from the PS or PDF file, I use the pstotext or ps2ascii Linux programs. The xclip program is handy for putting the text in the clipboard, from which I paste it into the response (see the one-liner after these steps).
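
For instance, here's a minimal pipeline for step 4 (the filename paper.ps is just a placeholder):

    ps2ascii paper.ps | xclip -selection clipboard

After that, the paper's full text is one paste away in the reply.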

These steps accomplish three things. First, they store the PDF or PS of the paper in GMail. Second, they make the paper's full text searchable. Finally, they put the paper's author, venue, year, and other important metadata in an email with an attachment. This last point is important because it lets me search over just this metadata by restricting the search to emails with attachments (see the section on finding papers below).

Taking Notes

Papers I have added but not finished reading are in my inbox. As I read a paper, I add notes by replying to the conversation from within GMail (first deleting the quoted text). Thus, my notes can use GMail's rich text features, such as lists and bolding.

Once I finish reading a paper, I tag the conversation with appropriate tags. Finally, I archive the conversation.

Finding a Paper

To find a paper, I use GMail's search functionality. This searches the full paper text and all notes, and supports searching on tags and dates. Furthermore, due to how I add papers, I can find paper titles by restricting the search to email subjects, or find author names, venues, and other metadata by restricting the search to emails with attachments (which matches only the first email of each paper).
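
For example, here are a few hypothetical queries (the search terms are made up; the operators are GMail's standard ones):

    subject:dataspaces (a word from a paper's title)
    has:attachment halevy (an author name, matched only in each paper's first email)
    label:databases after:2008/01/01 (papers tagged "databases" and added this year)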

Comparing Solutions

Given the above procedure, GMail can compete with CiteULike as a system for managing papers. However, though better in some ways, it is also limited in others.

Specifically, my solution has these limitations:

  • No one-click adding of papers from supported sites.
  • No automatic BibTeX generation. However, BibTeX entries from CiteSeer, Google Scholar, or other sites can still be saved as notes, though that's not quite as convenient.
  • Can't easily edit existing notes. Instead, must copy and paste the old note into a new note, then delete the original.
  • No social or community features, such as sharing papers.

However, my solution has these advantages:

  • Can search the full text of papers and notes.
  • Supports more sophisticated searches, including dates.
  • Richly formatted notes, and a nice interface for writing and reading them.
  • Can easily print or forward one or all notes about a paper (tip: before printing/forwarding all notes, delete the note containing the full paper text, then restore it afterwards).

Depending on which advantages are more important to you, it may be worth giving this a try.

Monday, February 11, 2008

Bash Command-line Programming: Redirection

The bash shell is also a pretty handy programming language. One way to use it is to write scripts. Another is to write ad-hoc, one-time programs for very specific tasks, right on the command line. I do this a lot, and I find myself using the same techniques over and over.

In this post, I'll share some useful command-line techniques for redirection.

There are many ways other than pipes for redirecting stdin and stdout:

  • cmd &>file: send both stdout and stderr of cmd to file. Equivalent to cmd >file 2>&1.
  • cmd <file: pipes the contents of file into cmd. Similar to cat file | cmd, except that while pipes execute in a subshell with their own scope, this keeps everything in the same scope.
  • cmd <<<word: expands word and pipes it into cmd. word can be anything you'd type as a program argument. For example, cmd <<<$VAR pipes the value of $VAR into cmd.
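
A few quick illustrations of these (the filenames and variable are just placeholders):

    make &>build.log                # capture a build's stdout and stderr together
    wc -l <data.txt                 # count lines; prints just the number, no filename
    grep -c error <<<"$LOG"         # count matching lines in a variable's contents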

Also, sometimes programs need arguments on the command line, rather than through stdin:

  • cmd $(<file): expands the contents of file as arguments to cmd. For example, if the file toRemove contains a list of files, rm $(<toRemove) removes those files. (Note that the contents are split on whitespace, so this works best for whitespace-separated lists.)
  • cmd1 <(cmd2): creates a temporary file containing the output of cmd2, then puts the name of that file as an argument to cmd1. This is handy when cmd1 expects filename arguments. For example, to see the difference between the contents of directories dir1 and dir2, use diff <(ls dir1) <(ls dir2). This is conceptually equivalent to
    ls dir1 >/tmp/contentsDir1
    ls dir2 >/tmp/contentsDir2
    diff /tmp/contentsDir1 /tmp/contentsDir2
    rm /tmp/contentsDir1 /tmp/contentsDir2
    
    (only conceptually, though, since it actually uses fifos). For another handy command to combine with this trick, check out comm.
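
For example, here's a quick sketch of comm with process substitution (the filenames are made up; comm requires sorted input, hence the sorts):

    comm -12 <(sort list1.txt) <(sort list2.txt)

This prints only the lines common to both files; -1 and -2 suppress the lines unique to each.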

Finally, you sometimes want to redirect to and from multiple programs at once:

  • { cmd1; cmd2; cmd3; } | cmd: pipes the combined output of cmd1, cmd2, and cmd3 to cmd. (Note the space after the opening brace; bash requires it to parse the group.)
  • cmd | tee >(cmd1) >(cmd2) >(cmd3) >/dev/null: pipes the output of cmd to cmd1, cmd2, and cmd3 in parallel. This trick is a tweak on one I found elsewhere. In the same way <(cmd) is replaced with a file containing the stdout of cmd, >(cmd) is replaced with a file that becomes the stdin of cmd. Since tee writes its stdin to each given file, you can combine it with >(cmd) to send the output of one command to the stdin of many. The final >/dev/null discards the stdout of tee, which we no longer need. This doesn't come up too often, but it's certainly neat.
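
As a concrete sketch (the output filenames are made up), this saves a sorted copy of a directory listing and its line count in a single pass:

    ls | tee >(sort >sorted.txt) >(wc -l >count.txt) >/dev/null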