I’m referring, of course, to users’ identity habits, which as many people have noted (including myself, in a W3C position paper) are far more promiscuous than we might wish. How can we work towards more robust privacy and security if people simply don’t care? What does it take to get people to shut up about themselves?
An article in New Scientist reports that the NSA is researching “mass harvesting of the information that people post about themselves on social networks.” One example given seems benign and even useful:
The research ARDA [the Advanced Research and Development Activity] funded was designed to see if the semantic web could be easily used to connect people. The research team chose to address a subject close to their academic hearts: detecting conflicts of interest in scientific peer review. Friends cannot peer review each other’s research papers, nor can people who have previously co-authored work together.
So the team developed software that combined data from the RDF tags of online social network Friend of a Friend (www.foaf-project.org), where people simply outline who is in their circle of friends, and a semantically tagged commercial bibliographic database called DBLP, which lists the authors of computer science papers.
Joshi [one of the team’s leaders] says their system found conflicts between potential reviewers and authors pitching papers for an internet conference. “It certainly made relationship finding between people much easier,” Joshi says. “It picked up softer [non-obvious] conflicts we would not have seen before.”
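To make that concrete, here’s a toy sketch of the kind of check being described. It’s my own illustration in Python with rdflib, not the team’s actual software, and every name and URI in it is invented:

    # Toy sketch only: flag a reviewer/author pair as a potential conflict if
    # the reviewer appears in the author's FOAF "knows" list or in a co-author
    # list harvested from a bibliographic database such as DBLP.
    from rdflib import Graph, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    # In a real system this graph would be crawled from published FOAF files.
    foaf_graph = Graph()
    foaf_graph.parse(data="""
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        <http://example.org/alice#me> foaf:knows <http://example.org/bob#me> .
    """, format="turtle")

    def foaf_friends(graph, person):
        """Everyone this person claims to know in their FOAF data."""
        return set(graph.objects(person, FOAF.knows))

    def has_conflict(reviewer, author, graph, coauthors):
        """coauthors: dict mapping an author to the set of people they have co-published with."""
        return (reviewer in foaf_friends(graph, author)
                or reviewer in coauthors.get(author, set()))

    alice = URIRef("http://example.org/alice#me")
    bob = URIRef("http://example.org/bob#me")
    print(has_conflict(bob, alice, foaf_graph, {}))  # -> True, via the FOAF link

Nothing exotic there, which is rather the point: once the data is published in a well-known shape, the “relationship finding” is a few lines of code.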
The article places more emphasis on RDF and the formal semantic web than I think is warranted; arbitrary (well-documented) XML formats, microformats, and even HTML used in a regularized manner can be harvested or at least screen-scraped. And it’s actually very hard to do precise equivalence mapping between RDF (or any!) schemas in practice, just because taxonomies in the real world are so messy (are “given names” and “first names” and “Christian names” the same thing?). So it’s likely that well-known attribute schemas of whatever type will be just as effective targets for harvesting as RDF schemas. But the point remains: greater data portability and more accessible semantics for personal information add up to easier harvesting by other parties, whether they wear black hats or white.
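To illustrate how little the harvesting part depends on RDF specifically, here’s a made-up hCard snippet and the few lines of scraping (BeautifulSoup) it takes to pull the personal data back out; any regularly marked-up format gives itself away just as readily:

    # Illustrative only: the HTML below is invented, but the class names of the
    # hCard microformat hand the semantics to any scraper that wants them.
    from bs4 import BeautifulSoup

    html = """
    <div class="vcard">
      <span class="fn">Alice Example</span>
      <a class="email" href="mailto:alice@example.org">alice@example.org</a>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    for card in soup.find_all(class_="vcard"):
        name = card.find(class_="fn")
        email = card.find(class_="email")
        print(name.get_text(strip=True), email.get_text(strip=True) if email else None)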
Even if users have the opportunity to give informed consent, in many cases they may choose not to spend time thinking hard about the consequences of allowing access — possibly a form of rational ignorance if they never pay those consequences. An example from a more general context appears in Bill Cheswick’s talk from the inaugural SOUPS conference:
To most attendees, it came as no surprise that Cheswick found his father’s Windows machine chock-full of adware and spyware. Also unsurprising was the fact that even after a full cleanup, the machine was infected again within weeks (when the speaker visited his father next). Here’s the punch-line: the father was adamant that none of the security “fixes” or “solutions” break his machine. After all, explicit and annoying pop-up ads notwithstanding, he was still getting his work done, wasn’t he? Why fix something that ain’t broke?
(SOUPS is the “Symposium on Usable Privacy and Security”; its 2006 program looks incredibly meaty — soup-to-nutsy? — and I sure wish I could go.)
For those who do want to exercise more care, or if the consequences begin to be felt (Tag Boy points me to this example of googling-before-hiring), applying strong human-computer interaction principles in identity UIs should help in reducing misunderstanding and fatigue. And we could allow users to set up policies for avoiding annoying interactions involving identity exchange — reserving synchronous interaction for garnering point-of-“sale” consent for areas with a large potential for loss (of privacy, money, or whatever). Identity Rights Agreements could be a useful tactic, if users can get to know the options and if the interfaces for managing them are value-add rather than value-subtract.
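As a strawman of what such a stored policy might look like (the attribute names, risk categories, and decision rules below are all invented, and no real identity system works exactly this way):

    # Strawman: release low-risk attributes automatically per a stored
    # preference, and reserve a synchronous consent prompt for requests that
    # touch high-loss territory.
    HIGH_RISK = {"home address", "credit card", "full birth date"}

    def decide(requested, policy):
        """Return 'release', 'prompt', or 'deny' for a relying party's request."""
        if requested & HIGH_RISK:
            return "prompt"                       # point-of-"sale" consent: ask the user now
        if requested <= policy.get("auto_release", set()):
            return "release"                      # covered by previously stored consent
        return "deny"

    policy = {"auto_release": {"nickname", "public blog URL"}}
    print(decide({"nickname"}, policy))                  # -> release
    print(decide({"nickname", "home address"}, policy))  # -> prompt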
The article concludes, in part:
… Tim Finin, a colleague of Joshi’s, thinks the spread of such technology is unstoppable. “Information is getting easier to merge, fuse and draw inferences from. There is money to be made and control to be gained in doing so. And I don’t see much that will stop it,” he says.
I’ve mentioned some forces that could potentially “stop it”, but people still have to want them. Let’s say that the perfect interfaces have been developed and people use them to set up policy-based bounds on identity sharing. But a really awesome new social networking program is all the rage and it requires fairly wide access in order to provide, say, genealogy linkages. A user has clicked all the right buttons to prove that they have given consent, or they’ve selected the desired identity “card” and sent it along. Their information gets used in some cool new way that was accounted for by the consent they gave, but embarrasses them or gets them into hot water. Has a confidence been breached? Are they just SOL?
Is it possible to come up with a “do what I mean” button for identity info exchange?
Hmm, if someone freely releases personal information on the Web, and some group uses it for evil purposes, does the fault lie with the person *sharing* their data, or with the bad people?
But it’s undeniable that current systems are painfully crude when it comes to privacy, and even if they were a lot more granular, it’s hard to see how the end user could be shielded from any unexpected, unpleasant future use. Antiserendipity?
I agree with you that there is plenty more identity data out there than RDF (must confess it was that acronym that caught my eye ;-), but presumably the article emphasized that because the research in question was using it. I’d also quibble on this point: “hard to do precise equivalence mapping between RDF … schemas”. Such mappings are very easy to express in RDF, and what’s more, you may not want precise equivalence mappings, e.g.
x:christianName rdfs:subPropertyOf foaf:firstName .
Because of this, no matter what the original source of the data (scraped HTML, microformats, etc.), RDF does offer a good way of expressing and using the mappings, even if it doesn’t solve the human-taxonomy problem.
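Here’s a quick toy example of what I mean, using rdflib (the x: namespace and the people are made up): with the mapping triple loaded, one query over foaf:firstName also finds data recorded under the other property.

    # Toy example: follow rdfs:subPropertyOf links at query time, so data using
    # either property is found without asserting strict equivalence.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        @prefix x:    <http://example.org/schema#> .

        x:christianName rdfs:subPropertyOf foaf:firstName .
        <http://example.org/alice> x:christianName "Alice" .
        <http://example.org/bob>   foaf:firstName  "Bob" .
    """, format="turtle")

    q = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?person ?name WHERE {
          ?prop rdfs:subPropertyOf* foaf:firstName .
          ?person ?prop ?name .
        }
    """
    for person, name in g.query(q):
        print(person, name)   # both Alice and Bob come back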
Incidentally, the RDF approach should be useful at the other end of the system. For the person doing google-before-hiring, something like timbl’s “Oh Yeah” button (to give a logical paper trail of where the info came from) would make nice symmetry with your “do what I mean” button.
Informed consent implies ‘informing’ the user not only of what uses are planned for their identity data but also of what the consequences (positive or not) of releasing it will be. I blogged an interesting application of this idea at http://connectid.blogspot.com/2006/05/identity-selector-sequence.html
Users are explicitly told how the level of service they will receive at an SP depends on the identity information they choose to release to it.
Wrt a ‘Do what I mean’ button, this might work if it were ‘Do what I’ve said before’, i.e. an indication from the user that they wanted the current identity transaction to be governed by a distant policy point (a PIP or PDP?) at which they had previously stored privacy preferences. As it stands, the user ends up with ‘privacy policy silos’.
Hi Danny– Your point about the mechanism of mapping within RDF is well-taken (and I do recognize that the article was using RDF as a “hook” for the privacy discussion). My point was just that deciding whether one concept is actually synonymous with, encompasses, or is encompassed by another is tricky and often context-dependent. A first name is often a given (personal) name but not in some cultures, and a Christian name is often a given name but not for people of other religions, etc. If you have three existing databases, each of which uses one of the choices, you’re *probably* safe in mapping them as equivalent but maybe not. Etc. YMMV. It’s just a classic semantics problem — nothing to do with RDF!
And hi Paul– The mockup you point to is indeed interesting! It definitely gives a feel, *at the point of collection*, for how the information will be shared — which is precisely when the user is paying the most attention.