Originally published as: Andre Oboler, Kristopher Welsh, Lito Cruz, The danger of big data: Social media as computational social science, First Monday, Volume 17, Number 7 – 2 July 2012
Social networking Web sites are amassing vast quantities of data and computational social science is providing tools to process this data. The combination of these two factors has significant implications for individuals and society. With announcements of growing data aggregation by both Google and Facebook, the need for consideration of these issues is becoming urgent. Just as Web 2.0 platforms put publishing in the hands of the masses, without adequate safeguards, computational social science may make surveillance, profiling, and targeting overly accessible.
The academic study of computational social science explains the field as an interdisciplinary investigation of the social dynamics of society with the aid of advanced computational systems. Such investigation can operate at the macro level of global attitudes and trends, down to the personal level of an individual’s psychology. This paper uses the lens of computational social science to consider the uses and dangers that may result from the data aggregation social media companies are pursuing. We also consider the role ethics and regulation may play in protecting the public.
Computational social science is the interdisciplinary investigation of the social dynamics of society, conducted from an information perspective, through the medium of advanced computational systems (Cioffi–Revilla, 2010). Computational social science can span all five traditional social science disciplines: social psychology, anthropology, economics, political science and sociology. It can operate at various levels of analysis: individual cognition, decision making, and behaviour; group dynamics, organization and management; and societal behaviour in local communities, nation states and the world system.
Computational social science is, like microbiology, radio astronomy, or nanoscience, an instrument based discipline (Cioffi–Revilla, 2010). Through a key instrument, an instrument based discipline enables the observation and empirical study of phenomena. Whether the instrument is a microscope, radar, electron microscope or some other tool, the instrument serves as a lens making an otherwise invisible subject matter visible to the observer. In computational social science, the instrument takes the shape of computer systems and datasets; their availability and sophistication drive the development of theory, understanding and practical advances.
As the reach of computational social science grows, questions of both methodology and ethics, drawn from the underlying fields of computational and social sciences, need to be considered. These considerations apply not only to the research context, but also, and more importantly, to the worlds of government and commerce where philosophical concerns are less likely to rebuff immediate practical benefits. Most significantly, these concerns need to be considered in the context of social media platforms which have become computational social science tools that sit in easy reach of businesses, governments, private citizens, and the platform operators themselves.
Without a resolution to outstanding ethical issues on data storage, access and use by actors in a variety of different roles, advancements in computational social science may put the public at increased risk. To date, research in this area has been limited. We aim to provoke thought and discussion on how the use of social media as a computational social science tool should be constrained, both legally and ethically, to protect society. Prior to the public listing of Facebook on the Nasdaq, CEO Mark Zuckerberg announced his core values to potential investors; Zuckerberg’s promotion of risk taking, of the need to ‘move fast and break things’, highlights the need for external constraints so society is not left bearing the cost of mistakes by social media innovators (Oboler, 2012).
This paper begins with a consideration of the nature and risks of computational social science, followed by a focus on social media platforms as social science tools. We then discuss the aggregation of data and the expansion of computational social science along both horizontal and vertical axes. We consider the problems aggregation has raised in past social science research, as well as the potential problems raised by the use of social media as computational social science by business customers, government, platform providers and platform users; this discussion includes consideration of consumer protection, ethical codes, and civil liberty impacts. We end by highlighting the richness of social media data for computational social science research and the need to ensure this data is used ethically and the public is protected from abuse. The danger today is that computational social science is being used opaquely and near ubiquitously, without recognition or regard for the past debate on ethical social science experimentation.
Computational social science involves the collection, retention, use and disclosure of information to answer enquiries from the social sciences. As an instrument based discipline, the scope of investigation is largely controlled by the parameters of the computer system involved. These parameters can include: the type of information people will make available, data retention policies, the ability to collect and link additional information to subjects in the study, and the processing ability of the system. The capacity to collect and analyze data sets on a vast scale provides leverage to reveal patterns of individual and group behaviour (Lazer, et al., 2009).
The revelation of these patterns can be a concern when they are made available to business and government. It is, however, precisely business and government who today control the vast quantities of data used for computational social science analysis.
Some data should not be readily available: this is why we have laws restricting the use of wiretaps, and protecting medical records. The potential damage from inappropriate disclosure of information is sometimes obvious. However, the potential damage of multiple individually benign pieces of information being combined to infer, or a large dataset being analysed to reveal, sensitive information (or information which may later be considered sensitive) is much harder to foresee. A lack of transparency in the way data is analysed and aggregated, combined with a difficulty in predicting which pieces of information may later prove damaging, means that many individuals have little perception of potential adverse effects of the expansion in computational social science.
Both the analysis of general trends and the profiling of individuals can be investigated through social sciences. Applications of computational social science in the areas of social anthropology and political science can aid in the subversion of democracy. More than ever before, groups or individuals can be profiled, and the results used to better manipulate them. This may be as harmless as advertising for a particular product, or as damaging as political brainwashing. At the intersection of these examples, computational social science can be used to guide political advertising; people can be sold messages they will support and can be sheltered from messages with which they may disagree. Access to data may rest with the incumbent government, with those able to pay, or with those favoured by powerful data–rich companies.
Under its new terms of service, Google could for instance significantly influence an election by predicting messages that would engage an individual voter (positively or negatively) and then filtering content to influence that user’s vote. The predictions could be highly accurate, making use of a user’s e–mail in their Google provided Gmail account, their search history, their Google+ updates and social network connections, their online purchasing history through Google Wallet, and data in their photograph collection. The filtering of information could include “recommended” videos in YouTube; videos selectively chosen to highlight where one political party agrees with the user’s views and where another disagrees with them. In Google News, articles could be given higher or lower visibility to help steer voters into making “the right choice”.
Such manipulation may not be immediately obvious; a semblance of balance can be given with an equal number of positive and negative points made against each party. What computational social science adds is the ability to predict the effectiveness of different messages for different people. A message with no resonance for a particular voter may seem objectively to provide balance, while in reality making little impact. Such services could not only be sold, but could be used by companies themselves to block the election of officials whose agenda runs contrary to their interests.
The ability to create such detailed profiles of individuals extends beyond the democratic process. The ubiquity of computational social science tools, combined with an ever–increasing corpus of data and the absence of the ethical restrictions placed on researchers, raises serious questions about the impact that those who control the data and the tools can have on society as a whole. Traditionally, concerns about potential abuses of power focus on government and how its power can be limited to protect individuals; that focus needs to widen.
Computational social science, for good or ill, is limited by the availability of data. Issues surrounding the acquisition of data that can feed computational social science, and issues of control over access to that data are key areas of public policy. By limiting data acquisition, sharing, and use, and by raising public awareness of the implications of its availability, there is a chance the ethical implications may be considered before the kind of privacy horror stories that are today relatively rare become more commonplace. Computational social science can be a great benefit in our search for knowledge, but like all scientific advances, we must be aware of its risks.
If an employer looks at an employee’s Facebook wall, is that an application of computational social science? Is Facebook itself a computational social science tool? Is ad–targeting based on browsing habits or personal information from other applications a form of computational social science? We see these examples as everyday uses of social media–based computational social science.
Social media systems contain particularly valuable information. This data derives its value from its detail, personal nature, and accuracy. The semi–public nature of the data means it is exposed to scrutiny within a user’s network; this increases the likelihood of accuracy when compared to data from other sources. The social media data stores are owned and controlled by private companies. Applications such as Facebook, LinkedIn, and the Google suite of products (including Google search, YouTube, DoubleClick and others) are driven by information sharing, but monetized through internal analysis of the gathered data, which is itself a form of computational social science. The data is used by four classes of users: business clients, government, other users within the social media platform, and the platform provider itself.
Business clients draw on this computational social science when they seek to target their advertisements. Facebook, for example, allows advertisers to target users based on variables that range from standard demographics such as age, gender, and geographical location to more personal information such as sexual preferences. Users can also be targeted based on interests, associations, education level and employer. The Facebook platform makes this data (in aggregated form) available to advertisers for a specific purpose, yet Facebook’s standard user interface can also be used as a general computational social science tool for other purposes.
To take an example, the Australian Bureau of Statistics (ABS) estimates the current population of Australia at 22.5 million (Australian Bureau of Statistics, 2010a). The Facebook advertising platform gives an Australian population (on Facebook) of 9.3 million, over 41 percent of the national population. There is less coverage at the tails: Facebook has only 0.29 million people over 64, while the ABS says there are 3.06 million Australians over 65 (Australian Bureau of Statistics, 2010b). The sample for some age ranges must therefore be approaching the entire population and may provide a very good model as a computational social science tool. For example, research shows that about two percent of the Australian population is not heterosexual (Wilson, 2004). From the Facebook advertising platform, we can readily select a population of Australians, aged 18 to 21, who are male, and whose sexual preference is for men. The platform immediately tells us the population size is 11,580 people. By comparing this to the total size of the Australian male Facebook population who expressed a sexual preference, we can see this accounts for 2.89 percent of this population, indicating that the data available to Facebook is of similar utility to that available to social scientists for research.
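The coverage figures in this example can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are our own, and the comparison base for the 18 to 21 segment is back-calculated from the reported 2.89 percent because the text does not state it directly.

```python
# Figures quoted in the text, in millions of people.
abs_total = 22.5      # ABS estimate of the Australian population
fb_total = 9.3        # Australians reachable via the Facebook advertising platform
abs_over_65 = 3.06    # ABS count of Australians over 65
fb_over_64 = 0.29     # Facebook count of Australian users over 64


def coverage(sample, population):
    """Return the sample as a percentage of the population."""
    return 100.0 * sample / population


overall = coverage(fb_total, abs_total)
# Note: the two age brackets differ slightly (over 64 vs over 65),
# so this ratio is only an approximation of coverage at the older tail.
older = coverage(fb_over_64, abs_over_65)

# Assumption: the comparison base is inferred from the reported share,
# since the text gives the segment size and percentage but not the base.
segment = 11_580
implied_base = segment / 0.0289

print(f"Overall coverage: {overall:.1f}%")
print(f"Older-age coverage: {older:.1f}%")
print(f"Implied comparison base: {implied_base:,.0f} users")
```

Because overall coverage is above 41 percent while coverage at the older tail is under 10 percent, coverage in the younger age ranges must be well above the overall figure.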
The second class of users of social media as computational social science tools is governmental. This is demonstrated by the U.S. government’s demands to Twitter (via court orders) for data on Wikileaks founder Julian Assange and those connected to him. The court order was only revealed after Twitter took legal action to lift a court imposed censorship order relating to the requests (Dugan, 2011). The Wikileaks affair demonstrates how government can act when it sees social media as acting against its interests.
The very existence of social media can also promote government’s agenda. During the Iranian elections, for example, Twitter was asked not to take their service off–line for scheduled maintenance (Musgrove, 2009). In another example, the U.S. State Department provided training ‘using the Internet to effect social change’ to Egyptian dissidents between 2008 and 2010, then sought (unsuccessfully) to keep social media access available during the January 2011 Egyptian anti–government protests (Morrow, 2011). The Egyptian effort was defeated after Egypt responded by taking the entire country off the Internet, a move perhaps more in response to the U.S. than the protestors. While social media might enable activism, computational social science favours the state or at least those with power. Computational social science tools combined with social media data can be used to reconstruct the movements of activists, to locate dissidents, and to map their networks. Governments and their security services have a strong interest in this activity.
The third class of actors consists of other social media platform users. Journalist Ada Calhoun has described as an epiphany that left her “freaked out” the realisation that anyone could research her just as she researched others while writing their obituaries. In her article, Calhoun reflected that some amateur experts on the anarchic message board 4chan, or professional experts working for government agencies, could likely find out far more than she could (Calhoun, 2011). The everyday danger that can result when anyone can research anyone else can be demonstrated through two scenarios:
Scenario one involves Mary, who has been a Facebook user for some years. Through Facebook Mary reconnected with an old friend, Fred. As time went on, Mary and Fred grew closer and became a couple. One day Mary logged into her Facebook account and noticed that Fred had still not updated his details to say he was in a relationship with her. This made Mary feel very insecure, and caused her to begin doubting Fred’s intentions. Due to this discovery, Mary broke off her relationship with Fred.
Scenario two involves Joe, who applied to a company as a Human Resource team leader. The hiring manager, Bob, found Joe’s resume appealing and considered him a good candidate. Bob decided to check Joe’s Facebook information. On Joe’s publicly viewable wall, Bob saw several pictures of Joe in what Bob considered to be “questionable settings”. The company never called Joe for an interview. Joe has been given no opportunity to explain, nor any explanation of why his application was rejected.
Both Mary and Bob used Facebook as a computational tool to extract selected information as part of an investigation into the social dynamics of society, or in these cases, a particular individual’s interactions with society. In this sense, Facebook could be considered a computational social science tool. Mary’s inference may be based on a wider realisation that Fred’s interactions with her are all in private and not part of his wider representation of himself. Bob may have drawn his conclusions from a combination of text, pictures, and social interactions.
These situations are far from hypothetical. Research released in November 2011 by Telstra, Australia’s largest telecommunications company, revealed that over a quarter of Australian bosses were screening job candidates based on social media (Telstra, 2011). At the start of 2012 the Australian Federal Police began an advertising campaign designed to warn the public of the need to protect their reputation online. The advertisement featured a job interview where the interviewer consults a paper resume then proceeds to note various positive attributes about the candidate; all seems to be going very well. The interviewer then turns to his computer screen and adds “and I see from your recent online activity you enjoy planking from high rise buildings, binge drinking, and posting embarrassing photos of your friends online” (Australian Federal Police, 2012). The advertisement is an accurate picture of the current approach, which takes place at the level of one user examining another. Computational social science may soon lead to software programs that automatically complete pre–selection and filtering of candidates for employment.
The final class of actor we consider is the social media platform providers themselves. While Facebook provides numerous metrics to profile users for advertisers, far more data and scope for analysis is available to a platform provider like Facebook itself. Internet advertisements are often sold on a “cost per–click” (CPC) or “cost per–impression” basis (CPM, with M indicating that costs are typically conveyed per–thousand impressions). Thus, Facebook may maximise advertising revenue by targeting advertisements to achieve the greatest possible number of clicks for a given number of impressions. This maximisation of the click–through rate (CTR) can be achieved using a wealth of hidden information to model which users are most likely to respond to a particular advertisement. Computational social science can help a company like Facebook correctly profile its users, showing the right advertisements to the right people so as to maximize revenue. But what else can a company like Facebook or Google do? This depends on the data they hold.
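The relationship between CPC, CPM and CTR can be made concrete with a short calculation. The following is a minimal sketch, not a description of any platform's actual billing system: the function names, the campaign figures and the 0.50 cost per click are all hypothetical values chosen for illustration.

```python
def click_through_rate(clicks, impressions):
    """CTR: the fraction of impressions that resulted in a click."""
    return clicks / impressions if impressions else 0.0


def revenue_cpc(clicks, cost_per_click):
    """Revenue when the advertiser pays for each click."""
    return clicks * cost_per_click


def revenue_cpm(impressions, cost_per_thousand):
    """Revenue when the advertiser pays per thousand impressions."""
    return impressions / 1000 * cost_per_thousand


def effective_cpm(ctr, cost_per_click):
    """Revenue per thousand impressions earned by a CPC-priced advertisement."""
    return ctr * cost_per_click * 1000


# Hypothetical campaign: the same 100,000 impressions, shown either to
# arbitrary users or to users a profiling model predicts will respond.
impressions = 100_000
cost_per_click = 0.50

untargeted_ctr = click_through_rate(200, impressions)
targeted_ctr = click_through_rate(1_000, impressions)

print(f"Untargeted: CTR {untargeted_ctr:.3%}, "
      f"revenue {revenue_cpc(200, cost_per_click):.2f}")
print(f"Targeted:   CTR {targeted_ctr:.3%}, "
      f"revenue {revenue_cpc(1_000, cost_per_click):.2f}")
```

Under these assumed figures, the number of impressions is unchanged but revenue rises in direct proportion to the CTR, which is why better profiling of users translates into income for the platform.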
While horizontal expansion of computational social science allows greater access to selected aggregate data, vertical expansion allows larger operators to add depth to their models. This depth is a result of triangulation, a method originally from land surveying. Triangulation gives a confirmation benefit by using additional data points to increase the accuracy and confidence in a measurement. In a research context triangulation allows for information from multiple sources to be combined in a way that can expose underlying truths and increase the certainty of conclusions (Patton, 1990).
Social media platforms have added to their data either by acquiring other technology companies, as Google did when acquiring DoubleClick and YouTube, or by moving into new fields, as Facebook did when it created “Facebook Places”: a foursquare–like geolocation service (McCarthy, 2010). From a computational social science perspective, geolocation services in particular add high value information. Maximising the value of information requires a primary key that connects this data with existing information; a Facebook user ID, or a Google account name provides just such a key.
The breadth of an account measures how many types of online interaction the one account connects. It lets the company providing the account know about a wider slice of a user’s life. Three situations are possible. The first involves distinct accounts on multiple sites and allows no overlap of data: what occurs on one site stays on that site. The second situation is where there is a single traceable login, for example your e–mail address, which is used on multiple sites but where the sites are independent. Someone, or some computational social science tool, with access to the datasets could aggregate the data. The third possibility is a single login with complete data sharing between sites. All the data is immediately related and available to any query the underlying company devises. It is this last scenario that forms the Holy Grail for companies like Facebook and Google, and causes the most concern for users.
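The second and third situations differ only in who is able to perform the join. The sketch below illustrates how independent datasets that share a traceable login can be merged into a single profile; the datasets, field names and e–mail addresses are invented for illustration and do not reflect any real service's data model.

```python
from collections import defaultdict

# Hypothetical, independent datasets that happen to share a login field.
search_log = [
    {"login": "alice@example.com", "query": "flu symptoms"},
    {"login": "bob@example.com", "query": "cheap flights"},
]
video_log = [
    {"login": "alice@example.com", "video": "party political broadcast"},
]
location_log = [
    {"login": "alice@example.com", "place": "city hospital"},
]


def aggregate_by_key(key, **datasets):
    """Merge records from several named datasets into one profile per key value."""
    profiles = defaultdict(dict)
    for name, records in datasets.items():
        for record in records:
            # Keep every field except the key itself, grouped by source dataset.
            fields = {k: v for k, v in record.items() if k != key}
            profiles[record[key]].setdefault(name, []).append(fields)
    return dict(profiles)


profiles = aggregate_by_key(
    "login", search=search_log, video=video_log, location=location_log
)

for login, profile in profiles.items():
    print(login, profile)
```

Each dataset is unremarkable on its own; it is the shared key that lets health, political and location information be read together as one record.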
The announcement by Alma Whitten, Google’s Director of Privacy, Product and Engineering in January 2012 that Google would aggregate its data and “treat you as a single user across all our products” (Whitten, 2012), has led to a sharp response from critics. Jeffrey Chester, executive director of the Center for Digital Democracy told the Washington Post: “There is no way a user can comprehend the implication of Google collecting across platforms for information about your health, political opinions and financial concerns” (Kang, 2012). In the same article, Common Sense Media chief executive James Steyer states bluntly that “Google’s new privacy announcement is frustrating and a little frightening”.
The depth of an account measures the amount of data an account connects. There are three possible situations. The first is an anonymous login with no connection to personal details; the virtual profile is complete in and of itself, and it may or may not truthfully represent the real world. The second situation is an account where user details are verified, for example a university login that is only provided once a student registers and identification papers have been checked. A number of online services and virtual communities are now using this model and checking government issued identification to verify age (Duranske, 2007). The third situation involves an account that has a verified identity aggregated with other data collected from additional sources; for example, a credit card provider knows who its customers are, as well as where they have been and what they have bought. The temporal nature of the data is also a matter of depth; your current relationship status has less depth than your complete relationship history.
Facebook’s Timeline feature signifies as large a change to depth as Google’s policy change does to breadth. Timeline lets users quickly slide to a previous point in time, unearthing social interactions that had long been buried. A Facebook announcement on 24 January 2012 informed the world that Timeline was not optional and would in a matter of weeks be rolled out across all Facebook profiles (McDonald, 2012).
As Sarah Jacobsson Purewal noted in PC World, with Timeline it takes only a few clicks to see data that previously required around 500 clicks on the link labelled “older posts”, each click separated by a few seconds’ delay while the next batch of data loads (Purewal, 2012). Purewal (2012) provides a step–by–step guide to reasserting privacy under the new Timeline regime; the steps are numerous and the ultimate conclusion is that “you may want to just consider getting rid of your Facebook account and starting from scratch”. Though admittedly not scientific, a poll by Sophos, an IT security and data protection company, showed that over half those polled were worried about Timeline (Cluley, 2012a). The survey included over 4,000 Facebook users from a population that is likely both more concerned and more knowledgeable about privacy and security than the average user. If that wasn’t telling enough, the author of the announcement, Sophos’ senior technology consultant, Graham Cluley, announced in the same article that he had shut down his Facebook account. Cluley’s reasoning was a response to realising exactly how much of his personal data Facebook was holding, and fatigue at Facebook’s ever changing and non–consultative privacy regime (Cluley, 2012a; 2012b).
All accounts have both a breadth and a depth. Accounts that are identity–verified, frequently updated, and used across multiple aspects of a person’s life present the richest data and pose the greatest risk. The concept of a government–issued national identity card has created fierce debate in many countries, yet that debate has been muted when the data is collected and held by non–government actors. Google’s new ubiquitous account and Facebook’s single platform for all forms of social communication should raise similar concerns for individuals as both consumers and citizens.
The rise of social media, with social science capabilities, has placed technology professionals in a decision making role over new ethical dilemmas. Ethical controversies are well known in both the technology field and the social sciences; the nature of the issues can, however, be different. In addition to a greater understanding of the ethical codes that apply to their own discipline, today’s technology professionals in the social media space need an appreciation of the ethics of social science.
In 1969 a doctoral candidate at Harvard, Laud Humphreys, created one of the largest ethical controversies in social science. Constance Holden, writing in Science on ethics in social science research, described Humphreys as having “deceived his subjects, failed to get anything remotely resembling informed consent from them, lied to the Bureau of Motor Vehicles, and risked doing grave damage to the psyches and reputations of his subjects.” Humphreys had chosen subjects, without their consent, and then collected and aggregated data about them. His data was collected multiple times, in multiple different guises, and without informing them of his true purpose (Holden, 1979). His experiment, which examined the behaviour of homosexuals, led to a book entitled Tearoom trade (Humphreys, 1970) which aimed to demonstrate that homosexuals were regular people and not a danger to society.
Today, research like Humphreys’s would by necessity include an element of computational social science. Indeed, Calhoun (2011) details how she engaged in just such research when writing the story of Tyler Clementi, a gifted teenage violinist who committed suicide after a sexual encounter with another man in his dorm room was allegedly streamed over the Internet.
In discussing the ethics of social science research, Holden noted two schools of thought: utilitarianism (also known as consequentialism) holds that an act can only be judged on its consequences; deontologicalism (also known as non–consequentialism) is predominantly about absolute moral ethics. In the 1960s utilitarianism was dominant, along with moral relativism; in the late 1970s deontologicalism began to hold sway (Holden, 1979). In computational social science, the debate seems to be academic with little regard given to ethics. Conditions of use are typically one–sided without user input, although Wikipedia is a notable exception (Konieczny, 2010). Companies expand their services and data sets with little regard for ethical considerations, and market forces in the form of user backlashes form the first, and often only, line of resistance.
One such backlash occurred over Facebook’s Beacon software, which was eventually cancelled as part of an out of court settlement. Beacon connected people’s purchases to their Facebook account; it advertised to their friends what a user had purchased, where they got it, and whether they got a discount. In one instance, a wife found out about a surprise Christmas gift of jewellery after her husband’s purchase was broadcast to all his friends, including his wife (Nakashima, 2007). Others found their video rentals widely shared, raising concerns it might out people’s sexual preferences and other details of their private life (Nakashima, 2007). In addition to closing down Beacon, the settlement involved the establishment of a fund to better study privacy issues, an indication that progress was stepping well ahead of ethical considerations (Kravets, 2010).
The caveat emptor view of responsibility for disclosure of personal data by social networking sites is arguably unsustainable. Through Beacon, retailers shared purchasing information with Facebook based on terms and conditions purchasers either failed to notice, or failed to fully appreciate. Beacon took transactions outside consumers’ reasonable expectations. While Facebook was forced to discontinue the service, appropriate ethical consideration by technology professionals could have highlighted the problems at a much earlier stage.
Privacy is not the only interest users have in social media platforms. Some platforms, such as Wikipedia, are created to share non–personal information. In these platforms computational social science can play a different role, exposing, predicting, and helping to eliminate disruptive behaviour. In Wikipedia, computational social science has been used to analyse patterns of editing which reduce quality in order to promote particular agendas, and to build profiles of different types of problem users who work at manipulating the encyclopaedia (Oboler, et al., 2010). Computational social science can, therefore, serve a positive role in promoting the interests of the community in a social media platform. Computational social science must be seen not only as a risk, but also as a potential benefit, and the ethical considerations need to be weighed up on a case by case basis by those able to influence the technology.
In 2010 Andrew Lewis posted on MetaFilter: “If you are not paying for it, you’re not the customer; you’re the product being sold” (blue_beetle, 2010). The post was quickly adopted as a new adage of the social media age (Mai, 2012). Commercially successful social media companies are driven by online advertising revenue; their business model places the individual’s interest in privacy at war with the advertisers’ interest in greater customer profiling (Newman, 2011). Just like magazines in niche demographics that command higher advertising rates for access to a target market, Web 2.0 sites command higher advertising rates by allowing advertisers to target select demographics. The Web 2.0 approach is, however, more exact; where a trade magazine may allow advertisers to target textile industry insiders, a social media Web site can target gay males, aged 20–30, who live in Paris and like to go clubbing.
The dramatic increase in advertisers’ targeting precision has had unintended and unexpected consequences. In a non–Internet example, U.S. retailer Target analysed purchasing patterns to identify potential customers of baby paraphernalia. The analysis, based on purchasing history of unrelated items, highlighted potential pregnancies with a high degree of accuracy. Target sent advertising material to its target market, triggering an angry backlash from one father whose teenage daughter received the advertising. Weeks later it was the father who was apologising after his daughter confirmed she was actually pregnant. Marketing had revealed the daughter’s pregnancy before she was ready to tell her family. This sort of analysis is quite legal, though a statistician involved has highlighted the ethical concern saying this sort of analysis might make people “queasy” (Hill, 2012).
Social media Web sites accmulate a great deal of data. They could use users’ access times and locations to identify insomniacs to advertise sleeping pills, or could analyse data on users’ age and sexual preference to target advertisements for adult products. Clearly, this kind of analysis has the potential to cause distress or unauthorised disclosure of sensitive personal information. This poses both technical and ethical problems: it may be possible to identify pregnant users, but impossible to say if the users wish the fact to be shared. What constitutes highly sensitive data can vary significantly between people. The underlying question is: “Is any technically possible use of personal data ethically acceptable?” The social networking industry — and the companies that advertise through them — both need to address this fundamental question.
Some progress towards a responsible system, in which users can exert a degree of control over their data, is evident when looking at Amazon’s product recommendations. Amazon uses statistical analysis of other users’ purchase histories to suggest products that are often bought by those with intersecting purchases. Crucially, however, Amazon allows users to see why a recommendation has been made, in the form of: “You bought this tea, so we recommend these biscuits”. Furthermore, Amazon allows users to flag purchases as not to be used to form recommendations. This is useful both in cases of gift purchases, and in cases where the product is of a sensitive nature. We suggest that this fine–grained ability, which allows users to instruct a Web site not to use a specific piece of personal data for a specific purpose, represents a significant improvement on the current personal data free–for–all model used by both social networking companies and their corporate customers.
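The behaviour described above can be illustrated with a minimal sketch. It is not Amazon’s actual method, which is not documented here; it assumes a simple co–purchase count, and the function name `recommend`, the `excluded` parameter and the sample data are all hypothetical. The sketch shows the two properties the argument relies on: each suggestion carries the purchase that explains it, and a purchase the user has flagged plays no part in forming suggestions.

```python
from collections import Counter, defaultdict


def recommend(histories, user, excluded=frozenset(), top_n=3):
    """Return (item, because) pairs for `user`.

    `histories` maps a user name to the list of items that user bought.
    `excluded` holds purchases the user has flagged as not to be used.
    `because` is the user's own purchase that best explains the suggestion.
    """
    own = set(histories[user])
    usable = own - set(excluded)        # flagged purchases are ignored
    scores = Counter()                  # candidate item -> co-purchase count
    reasons = defaultdict(Counter)      # candidate item -> explaining purchases

    for other, items in histories.items():
        if other == user:
            continue
        basket = set(items)
        shared = usable & basket        # intersecting purchases
        for candidate in basket - own:  # only suggest items not yet bought
            for source in shared:
                scores[candidate] += 1
                reasons[candidate][source] += 1

    # Rank by score, breaking ties alphabetically so output is deterministic.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return [
        (item, min(reasons[item], key=lambda s: (-reasons[item][s], s)))
        for item in ranked[:top_n]
    ]


# Hypothetical purchase histories, for illustration only.
histories = {
    "alice": ["tea", "pregnancy test"],
    "bob": ["tea", "biscuits"],
    "carol": ["tea", "biscuits", "mug"],
    "dave": ["pregnancy test", "cot"],
}
```

With no flags set, `recommend(histories, "alice")` suggests the cot because of the sensitive purchase; calling it with `excluded={"pregnancy test"}` leaves only the suggestions explained by the tea.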
With the increasing amount of personal information shared by users of social networking platforms, and a tendency for the data to be stored indefinitely, there is a strong need for consumer protection. The rights of users need to be protected while minimizing the impact on both platform providers and other companies who are seeking to provide new innovative services.
Ethics, as an academic discipline, suggests two very different approaches to address the balance between the protection of users and the freedom of companies. The deontological approach judges actions based on their adherence to rules. The rules may be seemingly arbitrary; a contravention, which may do no real harm, might be treated very seriously. This is the approach adopted when roads are given speed limits. The rule is enshrined in law and has strict liability. The alternative ethical approach is consequentialism, an outcome–oriented approach which judges actions based on their ultimate impact. This is the approach adopted in negligence cases, where the wrong must cause damage before it becomes actionable.
The consequentialist approach would give companies more freedom, but also greater liability. It does little to protect consumers from preventable harm; as it is impossible to predict the future use of personal data, or the consequences that may result, that harm may be significant. The deontological approach would place the burden on social media sites to restrict the storage, retrieval and manipulation of data in ways that limit its usefulness. This would prevent abuse but would also limit innovation. The introduction of such regulation could have a particularly stifling effect on new market entrants.
A consequentialist approach has held sway as the social media industry has developed, but the public is increasingly looking to government to safeguard personal rights and freedoms. Regulation requires a deontological approach. This section explores some of the difficulties faced by regulators seeking to protect users’ privacy without placing an undue burden on the nascent industry.
The fundamental premise of a social networking site’s business model is that users provide personal information and content, such as pictures, that they have created. In return for facilitating the sharing of content between users, the platform displays targeted advertising. The targeting of adverts is based on data mining of the information the platform holds. The system allows advertisers to target advertisements at users most likely to be interested in them, and allows the social networking Web site to charge a premium per advertisement view or click. Agreement to this operating model is the basis of a social networking Web site’s end user license agreement (EULA), which every user agrees to upon joining. However, what happens when a user decides they no longer wish to participate? What if the user would prefer the social networking site removed the personal information previously handed over? What if the user lied about their age, or was otherwise unable to enter into the EULA? What about people whose data was uploaded by third parties and who have no privity of contract with the platform provider? How is a user supposed to make an informed decision as to whether they would like their data removed if it is collected and used opaquely, as is the case with Google’s data collection? These are the questions regulators need to consider before drafting legislation, and so far they appear to have been largely overlooked.
In late 2010, the European Commission announced a public consultation on personal data protection in the European Union (European Commission, 2011). Several mainstream news outlets reported that the Commission was considering a “right to be forgotten” (BBC News, 2010; Warman, 2010). Such a right would require social networking sites to provide a mechanism by which users may remove their profile and related information. Most existing social networking Web sites currently provide this facility either of their own volition, or as a result of past public pressure over the issue. Facebook famously tried to avoid providing this facility, instead implementing what commentators called the ‘Hotel California’ policy (Williams, 2007; M.G., 2010), whereby users could deactivate, but not remove, their profiles. This was reversed in 2008 after significant public pressure (Williams, 2008). Even so, the effort required to delete an account is significant. Facebook has implemented multiple strategies to push users into ‘deactivating’ their account rather than deleting it (Cluley, 2012b). Deactivation stops data being publicly shared, but allows Facebook to continue to hold it until the user takes some action to interact with Facebook, at which point the account is revived.
The ability to be forgotten is a blunt and indiscriminate tool. It does little to allow users control over their personal information: it merely grants users a right to end the agreement into which they entered on joining the social network. Users’ information is removed, at the price of them being unable to continue using the social networking service. Furthermore, with social networking Web sites offering authorisation services to independent sites (OAuth, 2011), a “forgotten” user loses the ability to access these third–party sites with the possibility of yet more personal information on the third–party sites becoming orphaned in the process.
We argue that a “right to be forgotten” as proposed by the European Commission is too coarse–grained, leaving users with a Hobson’s choice between allowing the continued retention of all their personal data, or losing access to what may potentially be a large proportion of their online experience. More fine–grained controls, allowing users to remove specific pieces or clusters of personal information, without affecting their ability to use social networking sites (save for any inevitable consequence of the information’s unavailability), are essential if users are to be given genuine control over their personal data. This fundamental requirement cannot be fulfilled without significant implementation difficulty, but without it regulators’ initiatives will be ineffective.
Personal information is also held by social networking sites on individuals who do not use, and have never used, the service or agreed to a EULA. Facebook, for example, allows people to be identified (‘tagged’) by name in images even if they have no Facebook account. These pictures are typically taken and uploaded by those known to the individual, and the social networking site will also contain information on when and where the picture was taken, what event was being held, and which other Facebook members attended. Facebook developers have previously shown interest in developing a tag–based image search (“Facebook Photo tagged searches,” n.d.), and facial–recognition based image searching is also becoming available (Face.com, 2012). Both features allow a profile of an individual’s past movements, activities and social interactions to be built despite the individual having never granted any consent for any of this information to be held or used. Clearly, any “right to be forgotten” must extend to all people on whom data is held by a social networking site, and not just to registered users.
Even if regulators arrive at an effective regulatory framework that affords individuals the right to remove data held on them by social networking sites, unilateral lawmaking on the part of a single jurisdiction has rarely proven effective on the Internet (Goldsmith, 2000; Benkler, 2000). The very nature of the Internet means that Web sites may choose the physical location of their hosting infrastructure freely. They are free to practise jurisdiction shopping and to choose to operate from a jurisdiction of least regulation. To enforce effective control on social networking sites, lawmakers are faced with the difficult task of legislating multilaterally.
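The contrast between the two levels of control can be sketched as a small data structure. This is an illustration under stated assumptions, not a description of any platform’s implementation: the class `PersonalDataStore`, its method names, and the purpose labels such as `"advertising"` are hypothetical. It shows per–item removal and per–purpose restriction sitting alongside the all–or–nothing operation that a “right to be forgotten” would provide.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalDataStore:
    """Hypothetical per-user store of personal data items."""

    items: dict = field(default_factory=dict)    # item id -> stored value
    blocked: dict = field(default_factory=dict)  # item id -> forbidden purposes

    def add(self, item_id, value):
        self.items[item_id] = value

    def forbid(self, item_id, purpose):
        """Fine-grained: keep the item, but bar one specific use of it."""
        self.blocked.setdefault(item_id, set()).add(purpose)

    def remove(self, item_id):
        """Fine-grained: delete one item; the account remains usable."""
        self.items.pop(item_id, None)
        self.blocked.pop(item_id, None)

    def usable_for(self, purpose):
        """Items the platform may draw on for the given purpose."""
        return {
            item_id: value
            for item_id, value in self.items.items()
            if purpose not in self.blocked.get(item_id, set())
        }

    def forget_everything(self):
        """Coarse-grained: the all-or-nothing 'right to be forgotten'."""
        self.items.clear()
        self.blocked.clear()
```

A user could, for example, call `forbid("hometown", "advertising")` so that the item still appears on their profile but is excluded from `usable_for("advertising")`, or call `remove("photo_2011")` without losing anything else.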
Governments are not just the guardians of the rights of citizens. They are also heavy users of data, with their own strong interests in computational social science. Not only do they have vast amounts of government data, but any data held by a company operating within a government’s jurisdiction must be disclosed when the government issues a lawful request.
Most countries balance the state’s desire to access information, for example for law enforcement and national security reasons, with the citizens’ right to privacy. Law enforcement agencies must usually show reasonable cause to a judge before they can obtain a warrant to search premises, or demand a company disclose customer information. While the precise rules and procedures vary, safeguards to prevent abuse of government power are usually strong in western democracies. A rare exception, and the resulting public backlash, can be seen in recent moves in the U.K. to allow warrantless real–time access by police to all Internet connection information (Whitehead, 2012). The proposal has seen Britain compared to China and Iran.
The ability of government to demand personal information held within its jurisdiction is not new: the problem lies with the increase in the amount of data that is available, and the ability of computational social science to process and make sense of it. Advanced computational social science tools can be used by oppressive regimes to increase surveillance, target dissidents, and erode civil rights. The questionable use by liberal democracies of similar powers legitimizes the problematic use by these regimes.
The U.S. has also made questionable demands for personal data. Worse still, the demands have been made in secret. For example, Twitter was ordered to hand over details of the users associated with the whistle–blowing site Wikileaks, as well as details related to accounts that followed Wikileaks (BBC News, 2011). The move was particularly unusual given that the action Twitter was involved with, the act of publication, is protected under the First Amendment (Elsea, 2010). Twitter had the data, and the government wanted it.
Dutch hacker Rop Gonggrijp noted that, “it appears that Twitter, as a matter of policy, does the right thing in wanting to inform their users” when access to their data is demanded (Dugan, 2011). This was no small thing as Twitter itself had to fight a suppression order through the courts. Gonggrijp, whose information has been provided to the U.S. government in relation to Wikileaks, wondered how many other social media platforms received similar subpoenas and handed over data without efforts to enable disclosure to their users (Dugan, 2011). The lack of response from Google and Facebook to multiple media enquiries on the topic raises serious questions.
Once gathered by a technology company, personal information can be obtained by any government with sufficient jurisdiction. The government use of such information is distinct from the data’s intended use by other classes of users. For government, the availability of computational social science tools, and the ability to potentially access vast amounts of private data, is a potent mix. This is of particular concern to civil rights activists and others whom government may deem subversive. The very government seeking to silence criticism also gets to determine the regulation governing the social networking provider’s duty to disclose information. Even sites operating outside of a government’s jurisdiction may be forced to cooperate under threat of government sanctions such as a denial of access to their market or the risk of an intergovernmental incident.
The lack of importance given to ethical consideration is largely a result of the interdisciplinary nature of computational social science. Traditionally, computer science research involves minimal collection of data on individuals. The ethical barriers to experimentation are low. The social sciences, by contrast, are primarily focused on personal information. Ethical considerations play a larger role in the social sciences and data sets tend to be more limited in size and must be explicitly collected. Social scientists are not more ethical than computer scientists or engineers, but they are likely to have more relevant training and experience when it comes to managing personal data.
Professional engineering and computing bodies have relevant codes of ethics. The IEEE Code of Ethics (IEEE, 2012a) commits members to ‘improve the understanding of technology’ as well as ‘its appropriate application, and potential consequences’. The ACM Code of Ethics and Professional Conduct (ACM, 1992) commits members to ‘avoid harm to others’, ‘respect the privacy of others’, and ‘improve public understanding of computing and its consequences’. A joint IEEE Computer Society/ACM Code of Ethics for Software Engineers, created in 1999 (IEEE, 2012b), states that, ‘software engineers shall act consistently with the public interest’, and that, ‘Software engineers shall act in a manner that is in the best interests of their client and employer, consistent with the public interest’ (Lazer, et al., 2009).
The problem then is not a lack of ethical guidelines, but rather a lack of relevant application of these principles when it comes to social media. Relevant application begins with an understanding that social media platforms are a form of computational social science. Once that point is realised, social media technology professionals need to be informed by social science’s ethical concerns, and specifically by the concerns raised by computational social science. Cross disciplinary teams can be used not only to assist development, but also to assist in the ethical considerations of new projects. Both computer scientists and social scientists need to contribute to the discussion; together, new angles can be considered and problems can be avoided. Good design will not solve all the problems, but it can reduce the opportunity for the abuse of a very powerful tool.
Users of social media have an ethical responsibility to one another; both education and cultural change are needed. An information producer code of ethics could promote the required change in online society. Such a code could highlight the issues to be considered when publishing information. For example, when a Facebook user uploads photographs, their action may reveal information about others in their network; the impact on those other people should be considered under a producer’s code of ethics. A consumer code of ethics is also needed; such a code would cover users viewing information posted by others through a social media platform. A consumer code could raise questions of when it is appropriate to further share information, for example by reposting it. Producer and consumer are the most basic roles in a social media platform; more specialised types of role, with their own considerations both as producers and consumers, can be derived from a wide variety of real–world relationships.
Specialised roles range from the everyday, such as a parent and child, or employer and employee, to the exceptional, for example credit and insurance companies and their clients. In establishing the boundaries of what is ethically permissible, regard should be had for the existing boundaries in society and to existing levels of regulation. People in some relationships should be prevented from using computational social science at the individual level altogether; health insurance companies are an example.
Platform providers need to supplement existing terms of service by providing clearer guidelines to help users determine what they are publishing, to avoid a Beacon–like situation, and to alert them to the potential impact of publishing information. A “Principles of Engagement” (POE) document could be developed to provide guidance, and the power of social media itself can be used to let those who can see content warn the owner when the content may pose a risk. General recommendations can also be useful, for example, that users limit themselves to “fair comment” when access is open to more than personal friends. Another general suggestion is limiting photographs to those that each of the subjects in the photograph would be happy for their parents or employers to see. Such guidelines can help users avoid potential future issues for themselves or others.
Ethics on the consumer side relate to how information is accessed and how the information gained is used. We need a cultural mind shift to become more forgiving of information exposed through social media; or an acceptance that social media profiles are private and must be locked down with ever more complex filters; or an acceptance that even if available, information from social media should not be used in certain settings. These approaches would each change the nature of social media as a computational social science tool; they would move some aspects in and others out of the tool’s field of observation. As an instrument based discipline, the way the field is understood can be changed either by changing the nature of the tool, or by changing the way we allow it to be used.
Social media platforms are collecting vast amounts of data and are functioning as tools for computational social science. The most visible example, Facebook, provides relatively open access to potentially sensitive personal information about individual users; in less visible examples, such as the dataset created by Google, data is held and used opaquely. The large scale collection, retention, aggregation, use and disclosure of detailed, triangulated personal information offers the possibility of incredibly powerful computational social science tools, but brings with it the potential for abuse by governments and private entities.
Social networking has made available a rich and wide ranging dataset covering large sections of the population. Even at this current nascent stage, the social networking industry offers users, researchers and governments access to a powerful ability to identify trends in behaviour amongst a large population, and to find vast quantities of information on an individual user. As the industry develops, it would be reasonable to expect that these abilities will increase in scope, accuracy and usefulness. Future development may be driven by technological progress developing better tools, or by natural expansion in the underlying datasets. Data expansion will inevitably include greater aggregation of data as company acquisitions merge previously discrete datasets. Society is only just beginning to consider the possible privacy and ethical implications of this amount of personal information being so readily available, and at present there are few ethical or regulatory barriers restricting the collection, retention and use of personal information.
We have shown that the data collected by social networking sites can be useful to social scientists, and that the sites themselves can be viewed as computational social science tools. As a research tool, social networking data offers considerable breadth and depth, but offers limited coverage of some groups, e.g., the elderly.
We have discussed the legal and ethical implications of social networking as a computational social science tool. We have argued that computational social science, as an interdisciplinary approach, must apply the broad ethical considerations adopted by computer and engineering professional bodies in a manner consistent with, and informed by, the ethics of traditional social science research. As part of this, guidelines and regulation that set limits on the collection, retention, use and disclosure of personal information are needed to protect end users.
We have argued that the type of regulation currently being considered will fall well short of providing an acceptable level of protection for individuals, and that despite the additional burden placed on social networking site operators, a more fine–grained approach must be developed. Users should be able to remove individual pieces of data they no longer wish a social media company, or its users, to be able to access. Recent developments by the titans in the social media landscape, Facebook and Google, suggest the risk to the public is going to rise until regulation intervenes.
About the authors
Andre Oboler is CEO of the Online Hate Prevention Institute and a postgraduate law student at Monash University. He holds a Ph.D. in computer science from Lancaster University and completed a post–doctoral fellowship in political science at Bar–Ilan University.
E–mail: andre [at] Oboler [dot] com
Kristopher Welsh is a lecturer in the School of Computing at the University of Kent. He holds a Ph.D. in computer science from Lancaster University.
E–mail: K [dot] Welsh [at] kent [dot] ac [dot] uk
Lito Cruz is a teaching associate at Monash University and a part–time lecturer at Charles Sturt University. He holds a Ph.D. in computer science from Monash University.
E–mail: lcruz [at] sleekersoft [dot] com
1. Privity of contract is a rule of contract law which holds that only the parties to a contract are bound by the contract and only they can enforce the contract. The rule prevents the burden of a contract falling on someone who had no part in the act of agreeing to accept the contract; Tweddle v Atkinson 121 ER 762 (see also http://en.wikipedia.org/wiki/Tweddle_v_Atkinson).
Association for Computing Machinery (ACM), 1992. “ACM code of ethics and professional conduct,” at http://www.acm.org/about/code-of-ethics, accessed 30 January 2011.
Australian Bureau of Statistics, 2010a, “3101.0 — Australian Demographic Statistics Jun, 2010,” at http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202010, accessed 30 January 2011.
Australian Bureau of Statistics, 2010b, “3201.0 — Population by Age and Sex, Australian States and Territories Jun 2010,” at http://www.abs.gov.au/Ausstats/abs@.nsf/mf/3201.0, accessed 30 January 2011.
Australian Federal Police, 2012. “Protecting your reputation online,” at http://www.afp.gov.au/AFP-Homepage/policing/cybercrime/crime-prevention.aspx, accessed 10 January 2012.
BBC News, 2011. “US wants Twitter details of Wikileaks activists” (8 January), at http://www.bbc.co.uk/news/world-us-canada-12141530, accessed 30 January 2011.
BBC News, 2010. “EU wants ‘right to be forgotten’ online” (4 November), at http://www.bbc.co.uk/news/business-11693026, accessed 28 January 2012.
Yochai Benkler, 2000. “Internet regulation: A case study in the problem of unilateralism,” European Journal of International Law, volume 11, number 1, pp. 171–185.
blue_beetle, 2010. “User–driven discontent,” MetaFilter (26 August), at http://www.metafilter.com/95152/Userdriven-discontent#3256046, accessed 20 May 2012.
Ada Calhoun, 2011. “I can find out so much about you,” Salon (18 January), at http://www.salon.com/life/internet_culture/?story=/mwt/feature/2011/01/18/what_i_can_find_online, accessed 30 January 2011.
Claudio Cioffi–Revilla, 2010. “Computational social science,” Wiley Interdisciplinary Reviews: Computational Statistics, volume 2, issue 3, pp. 259–271.
Graham Cluley, 2012a. “Poll reveals widespread concern over Facebook Timeline,” Naked Security (27 January), at http://nakedsecurity.sophos.com/2012/01/27/poll-reveals-widespread-concern-over-facebook-timeline/, accessed 10 February 2012.
Graham Cluley, 2012b. “Why I left Facebook,” BBC College of Journalism blog (10 January), at http://www.bbc.co.uk/journalism/blog/2012/01/why-i-left-facebook.shtml, accessed 10 February 2012.
Emily Dugan, 2011. “US demands Twitter release Assange details,” Independent (9 January), at http://www.independent.co.uk/news/world/americas/us-demands-twitter-release-assange-details-2179740.html, accessed 30 January 2011.
Benjamin Duranske, 2007. “IMVU deploys third–party age verification solution” (24 September), at http://virtuallyblind.com/2007/09/24/imvu-age-verifiction/, accessed 28 May 2012.
Jennifer K. Elsea, 2010. “Criminal prohibitions on the publication of classified defense information,” Congressional Research Service (10 September), at http://fpc.state.gov/documents/organization/148793.pdf, accessed 30 January 2012.
European Commission, 2011. “Consultation on the Commission’s comprehensive approach on personal data protection in the European Union,” at http://ec.europa.eu/justice/news/consulting_public/news_consulting_0006_en.htm, accessed 10 February 2012.
Face.com, 2012, at http://face.com/, accessed 10 March 2012.
“Facebook Photo tagged searches,” n.d., at http://www.catchmeifyouknowhow.com/index.php/tutorials/social-media/item/facebook-photo-tagged-searches, accessed 30 January 2012.
Jack Goldsmith, 2000. “Unilateral regulation of the Internet: A modest defence,” European Journal of International Law, volume 11, number 1, pp. 135–148.
Kashmir Hill, 2012. “How Target figured out a teen girl was pregnant before her father did,” Forbes (16 February), at http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/, accessed 20 May 2012.
Constance Holden, 1979. “Ethics in social science research,” Science, volume 206, number 4418 (2 November), pp. 537–538.
Laud Humphreys, 1970. Tearoom trade; Impersonal sex in public places. Chicago: Aldine.
Institute of Electrical and Electronics Engineers (IEEE), 2012a. “IEEE code of ethics,” at https://www.ieee.org/about/corporate/governance/p7-8.html, accessed 30 January 2011.
Institute of Electrical and Electronics Engineers (IEEE), 2012b. “FAQ and resources: Software engineering code of ethics and professional practice,” at http://www.computer.org/portal/web/certification/resources/code_of_ethics, accessed 17 June 2012.
Cecilia Kang, 2012. “Google announces privacy changes across products; users can’t opt out,” Washington Post (24 January), at http://www.washingtonpost.com/business/economy/google-tracks-consumers-across-products-users-cant-opt-out/2012/01/24/gIQArgJHOQ_story.html, accessed 28 January 2012.
Piotr Konieczny, 2010. “Adhocratic governance in the Internet age: A case of Wikipedia,” Journal of Information Technology & Politics, volume 7, number 4, pp. 263–283.
David Kravets, 2010. “Judge approves $9.5 million Facebook ‘Beacon’ accord,” Wired (17 March), at http://www.wired.com/threatlevel/2010/03/facebook-beacon-2/, accessed 30 January 2011.
David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert–László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne, 2009. “Computational social science,” Science, volume 323, number 5915 (6 February), pp. 721–723.
M.G., 2010. “Facebook and transparency: Facebook and the Hotel California,” Economist (6 October), at http://www.economist.com/node/21011590, accessed 30 January 2011.
Philip Mai, 2012. “If you’re not paying for it, you’re the product: What is the $value of social data?” Social Media Lab (9 April), at http://socialmedialab.ca/?p=6076, accessed 20 May 2012.
Caroline McCarthy, 2010. “Facebook granted geolocation patent,” CNet News (6 October), at http://news.cnet.com/8301-13577_3-20018783-36.html, accessed 30 January 2011.
Paul McDonald, 2012. “Timeline: Now available worldwide” (24 January), at http://blog.facebook.com/blog.php?post=10150408488962131, accessed 28 January 2012.
Adrian Morrow, 2011. “U.S. officials backed rebels planning Egyptian uprising in 2008: WikiLeaks,” Globe and Mail (28 January), at http://www.theglobeandmail.com/news/world/africa-mideast/us-officials-backed-rebels-planning-egyptian-uprising-in-2008-wikileaks/article1887439/, accessed 30 January 2011.
Mike Musgrove, 2009. “Twitter is a player in Iran’s drama,” Washington Post (17 June), at http://www.washingtonpost.com/wp-dyn/content/article/2009/06/16/AR2009061603391.html, accessed 30 January 2011.
Ellen Nakashima, 2007. “Feeling betrayed, Facebook users force site to honor their privacy,” Washington Post (30 November), at http://www.washingtonpost.com/wp-dyn/content/article/2007/11/29/AR2007112902503.html, accessed 28 April 2012.
Nathan Newman, 2011. “You’re not Google’s customer — You’re the product: Antitrust in a Web 2.0 world,” Huffington Post (29 March), at http://www.huffingtonpost.com/nathan-newman/youre-not-googles-custome_b_841599.html, accessed 1 May 2012.
OAuth, 2011. “OAuth 2.0,” at http://oauth.net/2/, accessed 30 January 2011.
Andre Oboler, 2012. “What market forces mean for Facebook,” Jerusalem Post (24 May), at http://blogs.jpost.com/content/what-market-forces-mean-facebook, accessed 25 May 2012.
Andre Oboler, Gerald Steinberg and Rephael Stern, 2010. “The framing of political NGOs in Wikipedia through criticism elimination,” Journal of Information Technology & Politics, volume 7, number 4, pp. 284–299.
Michael Patton, 1990. Qualitative evaluation and research methods. Second edition. Newbury Park, Calif.: Sage.
Sarah Jacobsson Purewal, 2012. “Facebook Timeline privacy tips: Lock down your profile,” PC World (31 January), at http://www.pcworld.com/article/249019/facebook_timeline_privacy_tips_lock_down_your_profile.html, accessed 10 February 2012.
Telstra, 2011. “Aussies urged to consider their cyber CVs as bosses head online” (28 November), at http://www.telstra.com.au/abouttelstra/media-centre/announcements/aussies-urged-to-consider-their-cyber-cvs-as-bosses-head-online.xml, accessed 5 December 2011.
Matt Warman, 2010. “EU proposes ‘right to be forgotten’,” Telegraph (5 November), at http://www.telegraph.co.uk/technology/internet/8112702/EU-proposes-online-right-to-be-forgotten.html, accessed 10 February 2012.
Tom Whitehead, 2012. “New powers to record every phone call and email makes surveillance ‘60m times worse’,” Telegraph (2 April), at http://www.telegraph.co.uk/technology/news/9180191/New-powers-to-record-every-phone-call-and-email-makes-surveillance-60m-times-worse.html, accessed 28 April 2012.
Alma Whitten, 2012. “The Official Google Blog: Updating our privacy policies and terms of service” (24 January), at http://googleblog.blogspot.com.au/2012/01/updating-our-privacy-policies-and-terms.html, accessed 24 January 2012.
Christopher Williams, 2008. “UK data watchdogs drop Facebook probe,” Register (26 February), at http://www.theregister.co.uk/2008/02/26/ico_facebook_investigation_complete/, accessed 30 January 2011.
Christopher Williams, 2007. “Microsoft–Facebook: Welcome to the Hotel California,” Register (25 October), at http://www.theregister.co.uk/2007/10/25/microsoft_facebook_comment/, accessed 30 January 2011.
Shaun Wilson, 2004. “Gay, lesbian, bisexual and transgender identification and attitudes to same–sex relationships in Australia and the United States,” People and Place, volume 12, number 4, pp. 12–21.
Received 10 March 2012; revised 31 May 2012; accepted 13 June 2012.
This paper is licensed under a Creative Commons Attribution–NonCommercial–NoDerivs 3.0 Unported License.