'Why, I'm Posterity -- and so are you.'

Who’s afraid of the Wolfram search?

Posted: May 5th, 2009 | Author: Mark Phillipson | Filed under: Library musings | 2 Comments »

I might be.

The Wolfram|Alpha “computational knowledge engine” has been generating buzz for some time, especially since Stephen Wolfram, its eccentric progenitor, announced that it would be going live in mid-May. Expect the twittering to reach a crescendo.

Since the Wolfram|Alpha (WA, let’s say) promises to answer questions typed into a simple text box, it’s being described in the press as a Google-killer. The idea, in an alpha nutshell, is that WA interprets a natural language query and then combs through a gigantic pile of databases, both public and licensed, in order to respond with an answer — rather than Google’s list of web pages that may or may not contain an answer.

Wolfram recently gave a demonstration of WA at Harvard’s Berkman Center. The whole presentation is posted, but you can get a quicker sense of what WA aims to do in this surprisingly murky collection of screenshots:

From this demo and other the-Wolfram-is-coming reviews blooming like tremulous flowers in the rain, WA looks to be a fancy calculator, an atlas on steroids, a deft collator of visualized data.

But is it more than that? Beyond looking up and presenting information, will it give us genuine and new answers? Will it represent a significant push beyond Google’s suddenly modest ambition to “organize the world’s information and make it universally accessible and useful”?

Wolfram himself seems to think so:

…what about all the actual knowledge that we as humans have accumulated?

A lot of it is now on the web—in billions of pages of text. And with search engines, we can very efficiently search for specific terms and phrases in that text.

But we can’t compute from that. And in effect, we can only answer questions that have been literally asked before. We can look things up, but we can’t figure anything new out.

So how can we deal with that? Well, some people have thought the way forward must be to somehow automatically understand the natural language that exists on the web. Perhaps getting the web semantically tagged to make that easier.

… I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.

Wolfram is known for making audacious claims about the power of computation; his massive boiling down of all complexity into relatively simple mathematical rules, A New Kind of Science, was a ‘surprise best seller’ on Amazon even though Wolfram posts all of it for free. The promise of a simple handle on an immensely complex world–frothing up into a good dose of post-religious hype–is irresistible. It’s quite congruent, when you think about it, with Google’s keyword-search doorway to the infinite.

But Google is best used to locate information, not to solve problems. Sure, if you type into its search field “square root of 81” it will offer you a quick answer atop the usual pagerank results. Google has dabbled, in fact, with calculator functions. This slippage between search and calculation, though, is what alarms me.

A pernicious information illiteracy takes root — the world of clear ascription of responsibility suffers another blow — anytime someone starts assigning oracular power to the Google search algorithm. “It says [fill in information claim here].” I’ve seen college students actually cite a Google search in research–not research on Google search, mind you, but research on a subject informed by something that the search dug up one night. Who wrote and published the data is unimportant: in the middle of that dreary night, “It says….”

At an extreme point, we reach the absurdity of Carol Beer in Little Britain, overriding every thought and instinct as she dabbles on the keyboard and announces, after desultory searches, “Computer says no…”

Of course any decent web calculator will draw on good data, and won’t be nearly as mechanistic or useless or funny as Carol. But even an amazing one — and WA promises to be amazing — shouldn’t be confused with actual intelligence; assembling and synthesizing only gets you so far. One of WA’s biggest cheerleaders, Twine founder Nova Spivack, makes a similar point:

Wolfram Alpha, at its heart is quite different from a brute force statistical search engine like Google. And it is not going to replace Google — it is not a general search engine: You would probably not use Wolfram Alpha to shop for a new car, find blog posts about a topic, or to choose a resort for your honeymoon. It is not a system that will understand the nuances of what you consider to be the perfect romantic getaway, for example — there is still no substitute for manual human-guided search for that. Where it appears to excel is when you want facts about something, or when you need to compute a factual answer to some set of questions about factual data.

Spivack’s distinction between (WA’s) computation and (Google’s) look-up is helpful, as is his concession that WA, as elegantly structured as it may be, will only be useful in presenting and recombining known facts. Wolfram himself, no stranger to hyperbole, may wish to characterize WA as generating new knowledge. But until it develops algorithms for context, nuance, interpretation, influence, critique, seriousness, incoherence–until it embraces all of human expression, in all of its messiness–it will never offer sufficient answers to questions more debatable than “What was the average rainfall in Boston last year?”–just as Wikipedia cannot extend beyond professed neutrality.

So my fear of WA, knowing little about how it actually will work and feel, is that it will offer a fancy dashboard of pseudo-expertise, subtly diverting human inquiry into what’s pre-known. This seems an old fear, a fear of robots, and maybe, like many old human fears, it will melt away in the light of new threats.

In any case, WA seems poised to offer a counterpoint to the semantic web, a different model of bringing structure to information to make search more responsive to the questions we ask. The road is strewn with various ‘natural language’ search disappointments (Ask Jeeves was deaf, Powerset seems blind to all but Wikipedia), and there’s reason to hope that Wolfram’s interpretation of natural language will be smarter, that it will process our questions and deliver them to large and various datasets. If it then answers authoritatively, though: caveat emptor.


Xciting connections

Posted: March 31st, 2009 | Author: Mark Phillipson | Filed under: Academia, Libraryworld, Metawriting, ^ | No Comments »

In the perfect world we never seem to live in, migration of scholarship to the web would mean endlessly networked citations. It would mean new metrics for gauging the impact of any given publication, substantiating tenure/promotion and grant proposals with hard evidence. It would give us new tools to map the interplay of research in an interdisciplinary age. Machines would be prosthetic connectors of our truest thoughts.

Citation mapping is a step towards this promise. Academics have been diligently appending footnotes and endnotes of attribution to their research all along; the hooks are there, and all we need to do is link them up. Easier said than done, of course, as the Tower of Babel still smolders. Citation formats and database structures vary; the semantic web is under construction; too often software used to generate citations (MS Office, EndNote, Zotero & the like) is disconnected from the end version of an article, meaning that the article has to be OCR’d and citations re-interpreted. For these and other reasons, as this recent D-Lib article enumerating problems with citation counts points out, “the rates of citation data accuracy and completeness are not precise enough to make fair assessments.”

That’s not stopping efforts to corral citations into paths of discovery, and as usual the science data managers are out in front. Thomson Reuters’ Web of Science, in particular, has been innovating bibliometric analysis and visualization; its Citation Mapping Tool debuted last summer. The tool ‘maps’ articles into generations, allowing you to travel back and forth between cited and citing. Here’s a visualization of how one article cites others:

As this review notes, the tool is far from exhaustive, thanks to database quirks and variation of records across journals. Exporting a citation map is underwhelming at present: you can download it as a flat image, but there is no way to harvest the underlying data into data management software. The tool presents some color coding options, so you can sort out ‘types’ of references, but designation of these codes again relies on consistency across fields that cannot be taken for granted.

But perhaps the biggest drawback to this or any version of simple citation mapping is its inability to reflect conceptual relationships. Citations, after all, are made to a variety of sources for a variety of reasons, not all of them equally germane to what an article is about. An article may cite something it’s refuting, or may be cluttered with window-dressing references, or may go out of its way to cite the work of mentors or colleagues more out of a sense of politesse than necessity. Until this variation of citation quality is somehow addressed, along with improved metadata standardization and database interoperation, it seems doubtful that citation mapping can, in the words of the WOS mapping reviewer, “represent, and make access to, the historical progress of human inquiry, including its interdisciplinary aspects.”

***

Time to take another tack? As a recent NYT summary noted, data scientists at Los Alamos have come up with a new mapping of the connections between various disciplines. These connections are charted by tracking logs of click-throughs by researchers moving between journals. The project, detailed in PLoS, is seeking a more accurate way to measure and represent research interconnections than the more traditional citation mapping.

The PLoS report lists advantages of clickstream data: it is immediate information (versus the years that citation data can take to fall into place), and it is based on private and actual navigation activity (versus the various motives for citation mentioned above). The report also notes a drawback to relying on clickstreams: “User interactions with scholarly web portals are shaped by many constraints, including citation links, search engine results, and user interface features.” It’s the same infrastructure problem haunting citation mapping.
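For the curious, the basic idea of charting connections from click-through logs can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not the Los Alamos method: the sessions and journal names are invented, the function name journal_transitions is hypothetical, and a real project would have to clean, anonymize, and weight its logs in ways not shown here.

```python
from collections import Counter


def journal_transitions(sessions):
    """Count how often readers move from one journal to another.

    `sessions` is a list of click sequences; each sequence lists the
    journals one (anonymous) reader visited, in order. This input
    format is an assumption made for illustration.
    """
    counts = Counter()
    for clicks in sessions:
        # Pair each click with the next click in the same session.
        for src, dst in zip(clicks, clicks[1:]):
            if src != dst:  # ignore clicks that stay within one journal
                counts[(src, dst)] += 1
    return counts


# Invented example sessions (not real log data).
sessions = [
    ["Journal A", "Journal B", "Journal C"],
    ["Journal A", "Journal B"],
    ["Journal B", "Journal B", "Journal C"],
]

transitions = journal_transitions(sessions)
for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n}")
```

Each counted pair is an edge between two journals, weighted by how many readers made that hop; a map like the one in the report would presumably be drawn from a very large table of such edges.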

In any case, the map of click-through connections is quite fun to look at – it’s color-coded by discipline. Humanities sort out to the middle, which is good and proper. Behold what the PLoS authors call a “first-ever glimpse of this terra incognita”:


Life in the taggregate

Posted: November 23rd, 2007 | Author: Mark Phillipson | Filed under: Libraryworld, Tagging, ^ | 1 Comment »

From its earliest days, the promise of the Semantic Web has been to bring networked computers closer to the forms and priorities of human inquiry. This promise depends on mark-up language that gives data some structure, and frameworks that bring such structure into recognizable relationships. As a May 2001 Scientific American piece by Tim Berners-Lee and colleagues put it, “for the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.”

Automated reasoning! This dream may be coming to life in e-science, with its highly structured and interoperable datasets, but in many other contexts the idea of a Semantic Web sits uneasily with the younger and more popular kid on the block, the Participatory Web. Web 2.0 environments amass a lot of data and, more importantly, a lot of information about this data, in the form of tags, generated by humans downright impervious to the need of machines for identifiable and consistent structure. Such tags are generally free-form, non-hierarchical, not expressing relationships in a predictable and consistent way; they dance to “folksonomy” not “taxonomy”; they are blithely untethered to “ontologies,” to any URI-based language standards.

Nevertheless there is intriguing thought out there about the potential interplay of the Semantic Web and Web 2.0. The Tagcommons site lays out Use Cases that envision sharing tags across databases, and sketches out some functional requirements to make that interoperability happen. Tom Gruber, in particular, has argued energetically for “collective intelligence systems” built from syntheses of structured data and social software; his travel-review site RealTravel uses a “snap-to-grid” model to disambiguate and structure user-supplied tags.

And now in Yahoo! Research Berkeley labs, algorithms are starting to take into account aggregate patterns in order to sift out meaning from vast oceans of community-generated tags despite all their unstructured messiness, or, as computer scientists like to say, despite all their “noise.” It’s a matter of inference and cluster analysis. Case in point: the photo-sharing site Flickr’s new experiments in extracting “practical information about the world” from the snapshots and tags poured into it by the great unwashed. The report “How flickr helps us make sense of the world: context and content in community-contributed media collections” describes a layered process of tag and image analysis–one that can be conducted entirely by machines–that identifies representational tags as well as place and event semantics.

What does all this do for us? For one thing, it can improve a search through piles of community-contributed materials; my search for “Harlem” stands a better chance of coming up with the most representative picture of the neighborhood, or a set of iteratively varied views of the neighborhood, or even a conglomeration of views for a composite view. I could determine the most visited place in the neighborhood, or the scenes of important events. Yahoo!’s researchers are even thinking about automatic tagging of photos, or suggestions for tags, that are generated by visual content abetted by contextual and geographical cues.

Here are a couple of spins of Yahoo! Labs’ TagMaps:

Flickr World Browser Harlem

^ TagMap’s World Browser analyzes Flickr tags to locate “Harlem” on a map and offer a set of representative photos (on the right). Harlem seems pushed to the west, and the chicken picture is a little odd, but this machine-generated guess seems viable enough.

TagMap World Browser Paris

^ A search for ‘Paris’ in TagMap’s World Browser whisks us to a city in the middle of France, not Texas, and avoids any pictures of over-photographed heiresses. See: machines have taste too.

Teasing meaning out of cacophony, evaluating ‘where what & when’ through dumb processing of inconsistent human traces: it’s not hard to sense an artificial intelligence awakening here with its own priorities, despite the human decision (conscious or not) to ignore machine-oriented information conventions. What is the ultimate effect of algorithms trained to crunch through the idiosyncratic and identify the representational? Could such aggregate processing of unstructured data fuel a general regression to the mean, as alchemist Jonah Bossewitch muses? As a Trekkie (or is it Trekker?) might say, streaming into yet another convention, resistance is futile.

The fear of human conglomeration coming into sudden sentience is nothing new, of course. I just re-read Frankenstein with a set of fresh young readers, and alarmist correlations of that good old story to an improbably persistent, flexible, and collective-mashed form of AI doubtlessly come too easily to me now. But I do sometimes wonder whether we too will wake up from our most logocentric tagging idylls to sense senseless and unblinking eyes, watching us in the dark and hungry for more.