The gathering of records on more than 50 million Facebook users has underscored the dangers of online data mining, and the claims of the company that collected the information, Cambridge Analytica, highlighted the possibilities of what could be done with the data.
Cambridge Analytica used the data to create profiles of 50 million users, then used the information to support Republican political candidates in the 2016 election, most notably Donald Trump’s presidential campaign, according to reports.
Yet, only 270,000 users took the online quiz created by a Cambridge Analytica contractor to collect the data. The ability to leverage that relatively small number of users into a massive database of 50 million profiles by collecting information on all the quiz takers’ friends underscores the power of social networks.
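The arithmetic behind that leverage can be sketched roughly. The snippet below is only an illustration: the friend lists are invented sample data, and the simple average ignores the overlap between friend lists. It suggests that about 185 distinct friends per quiz taker would be enough to turn 270,000 participants into 50 million profiles.

```python
# Rough illustration only; the friend graph below is invented sample data.
QUIZ_TAKERS = 270_000
PROFILES = 50_000_000

# Average number of distinct friends each quiz taker would need to contribute.
friends_per_taker = PROFILES / QUIZ_TAKERS
print(round(friends_per_taker))  # 185


def reachable_profiles(seed_users, friends_of):
    """Return the seed users plus everyone on their friend lists."""
    reached = set(seed_users)
    for user in seed_users:
        reached.update(friends_of.get(user, ()))
    return reached


# Hypothetical toy graph: two quiz takers expose five other people.
toy_graph = {
    "alice": ["bob", "carol", "dave"],
    "erin": ["carol", "frank", "grace"],
}
print(len(reachable_profiles(["alice", "erin"], toy_graph)))  # 7
```

In the toy graph only two people took part, yet seven profiles end up in the collected set, and five of them belong to people who never saw the quiz.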
Unlike much of the information collected online, which is volunteered by users, this data came from victims, the vast majority of whom did not consent to having their data collected, Gennie Gebhart, a researcher with the Electronic Frontier Foundation, told eWEEK.
“This is information that was very much taken from us,” she said. “We did not mean to share it with any third party, especially one that no one had heard of before this round of [media] coverage.”
The depth and breadth of the personal information that Cambridge Analytica fooled people into parting with shows the danger of the data-collection ecosystem. Yet, businesses and political activists are only starting to explore what can be done with this data. While direct inferences can be made about political views, health issues and lifestyle, Cambridge Analytica claimed that such interests could be used to change viewpoints as well.
“I think the interesting thing about the case with Cambridge Analytica [is] we tend to be dismissive about what seems just like advertising,” Kirsten E. Martin, associate professor of strategic management and public policy at George Washington University, told eWEEK. But it’s what they are feeding people in return that’s even more important, she noted. “It’s more than just Coca Cola versus Pepsi advertising. It skews your perception on what is going on in the world.”
Here is what data-collection and analytics companies can find out about you online.
1. You are not anonymous
Anonymity is almost impossible to attain on the internet. Even people who are careful about posting information online will find that through data collection and data publishing, large-scale analysis can often link together seemingly unconnected and anonymous activities.
In a 2008 paper, for example, researchers at the University of Texas at Austin found that people who posted a handful of movie recommendations on IMDb could be positively matched to a much larger database of anonymized movie recommendations published by Netflix for research purposes.
Such leakage makes a difference. A person who rates popular movies could find themselves identified in a much larger dataset that connects them to dozens or hundreds of other films that they rated privately.
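The general idea of this kind of linkage can be sketched in a few lines. The code below is a deliberately simplified illustration, not the researchers' algorithm: it counts exact rating matches, and the movie titles, record IDs, and the `min_overlap` threshold are all made up for the example.

```python
def best_match(public_ratings, anonymized_records, min_overlap=3):
    """Return the anonymized ID whose ratings agree most with the public ones.

    public_ratings: {movie: stars} posted under a real name.
    anonymized_records: {record_id: {movie: stars}} from a released dataset.
    Returns None when no record agrees on at least min_overlap movies.
    """
    best_id, best_score = None, 0
    for record_id, ratings in anonymized_records.items():
        # Count the movies that carry the same rating in both sources.
        score = sum(
            1 for movie, stars in public_ratings.items()
            if ratings.get(movie) == stars
        )
        if score > best_score:
            best_id, best_score = record_id, score
    return best_id if best_score >= min_overlap else None


# Invented sample data: three public ratings, two "anonymous" records.
public = {"Movie A": 5, "Movie B": 2, "Movie C": 4}
anonymized = {
    "user_1001": {"Movie A": 5, "Movie B": 2, "Movie C": 4,
                  "Movie D": 1, "Movie E": 5},
    "user_1002": {"Movie A": 3, "Movie C": 4, "Movie F": 2},
}

match = best_match(public, anonymized)
print(match)  # user_1001

# Ratings that were never posted publicly but are now tied to the person.
print(sorted(set(anonymized[match]) - set(public)))  # ['Movie D', 'Movie E']
```

Once the match is made, every other rating in the matched record is attached to a real name, which is the leakage described above.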
Film ratings could reveal characteristics of the critics, such as sexual preference, political leanings, and health issues. “Even though one should not make inferences solely from someone’s movie preferences, in many workplaces and social settings opinions about movies with predominantly gay themes such as ‘Bent’ and ‘Queer as Folk’—both present and rated in [one] person’s Netflix record—would be considered sensitive,” the researchers stated.
Similar techniques have been used with data from social networks, geolocation data, and online reading preferences.
2. Discovering your browsing habits
You can tell a lot about a person from their browsing history, and interested companies and data brokers have a variety of ways of collecting the information. In 2016, an investigative journalist working for German public radio and television broadcaster Norddeutscher Rundfunk (NDR) and a data scientist revealed that a browser plugin, known as Web of Trust, had been collecting the browser history from 3 million German users.
Because many social media sites include a user identifier in their links, de-anonymizing the owner of the browser history is often very simple. In other cases, just knowing some of the sites a person uses is enough to find them in the database of web links.
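A minimal sketch of how an identifier embedded in a link can expose the owner of a browsing history follows. The domain names and the `/profile/<name>/` URL layout are hypothetical, chosen only to illustrate the point; real sites use their own URL schemes.

```python
from urllib.parse import urlparse


def account_names(history):
    """Collect account names from URLs shaped like /profile/<name>/...

    The URL layout is an assumption made for this example.
    """
    names = set()
    for url in history:
        parts = urlparse(url).path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] == "profile":
            names.add(parts[1])
    return names


# Invented "anonymous" browsing history for a single browser.
history = [
    "https://social.example/profile/jane.doe/settings",
    "https://news.example/politics/2018/03/story",
    "https://social.example/profile/jane.doe/analytics",
]
print(account_names(history))  # {'jane.doe'}
```

A single account name recurring in pages that only the account owner would normally visit, such as settings pages, is enough to tie the rest of the history to that person.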
Eschewing all browser plugins is not enough. In some cases, vulnerabilities have allowed unethical web sites to discover whether a visitor has also visited a list of other sites. A variety of methods make such “history sniffing” possible, and the information can be found as easily as detecting whether a link has been visited.
Finally, advertising networks collect information on any browser that visits a site in which their ads are displayed, installing cookies or other tracking data to register users as they browse from site to site. Consumer concern over such techniques is one of the reasons for the steady increase in the use of ad blockers—expected to hit 31 percent this year, according to advertising intelligence firm eMarketer.
3. Determining political affiliations
Cambridge Analytica has come under fire for fraudulently collecting data from users to build models for political campaigns. Yet, the techniques are not always accurate, depending significantly on the data used. In 2013, for example, two researchers from McGill University found that other research papers were overly optimistic about the ability to detect political leaning through machine learning. We have “some unfortunate news to deliver: while past work has been sound and often methodologically novel, we have discovered that reported accuracies have been systemically overoptimistic due to the way in which validation datasets have been collected,” the researchers stated in the 2013 paper.
However, machine learning techniques and natural language processing have become progressively better. Lithium, a social networking provider, analyzed Twitter users’ feeds to determine political leaning, finding a much higher accuracy if the tweets mention other users to help create a social graph. “A training data set that includes only tweets with no mentions under-performs by almost 20 percent in accuracy compared to a data set that includes mentions,” the company wrote.
Facebook users can see what interests, and political leanings, the social network associates with them.
4. Determining sexual orientation
A variety of online data could be used to guess at your sexual orientation, whether your movie ratings or your browser history. Yet, other techniques exist that could attempt to infer your orientation with less data. A photograph, for example.
In a controversial 2017 paper, a pair of Stanford researchers found that neural networks could detect links between facial features and sexual orientation. Some people criticized the research as reinforcing stereotypes, and other research found that the recognition engine was sensitive to factors such as smiling and head poses. A 2018 critique of the paper’s findings by three Google researchers found that asking yes/no questions about specific habits—such as wearing glasses or having facial hair—could achieve comparable results.
5. Companies know your health
Consumer buying habits reveal a lot about what is going on in their lives. Shopping habits are enough to determine health issues, such as pregnancies. In its efforts to improve its detection of customers who might be new mothers, for example, Target crawled through massive amounts of purchase data and found two dozen products that correlated strongly with pregnancies. The company even detected one high school student’s condition before her father knew, according to a 2012 article in the New York Times.
“We regularly give information to one place to another,” said GWU’s Martin. “They know if you drink too much. They know if you looked up bipolar disorder. They know all these things.”
In addition, many Web sites that pop up in search results are collecting and selling data on their visitors, either themselves or through third-party advertisers. A researcher from the University of Pennsylvania searched for 2,000 common diseases and found that 90 percent of resulting Web sites and advertising networks were tracking what topics interested visitors.
6. Detecting emotions: Apple, Google, Facebook, Affectiva
Technology giants Apple, Facebook and Google—along with specialized startups, such as Affectiva—have already started analyzing your pictures and social media posts to gauge your emotions at the time that you published them. In a controversial 2014 study, Facebook used machine learning to classify social media posts based on their emotional content and found that positive and negative posts are contagious and allow emotions to essentially spread through social media.
Apple and Google are finding ways to detect and use emotion. Apple, which acquired the firm Emotient in 2016, uses emotion tracking for its Animoji and Face ID technologies to capture and classify facial expressions. Google uses emotion recognition to classify images and offers the technology to developers through its Cloud Vision API.
Marketers are salivating over the potential of automatically detecting the emotional state of consumers checking out products, while some technologists argue that emotionally-aware machines—such as a car that can detect road rage in a driver—are the future. MIT-incubated startup Affectiva, for example, has analyzed 6.5 million faces to detect emotion for a variety of applications.
7. Tracking your location: mobile phones, license plates, electronic toll devices
Your location throughout the day can easily be tracked through the device that most people carry at all times: your smartphone. When smartphones connect to the network of base stations, the information is registered with the cellular provider. In 2011, a politician in Germany obtained the tracking data from his provider and mapped out six months of his movement.
Other apps may also collect information on your location, whether it is needed or not.
Yet, companies collect location information in other ways as well. Automated license plate readers (ALPRs), for example, are used by law enforcement in investigations and by firms to hunt down repossessed vehicles.
“Taken in the aggregate, ALPR data can paint an intimate portrait of a driver’s life and even chill First Amendment protected activity,” the EFF stated in an analysis of the issue. Since license plates are required, “it’s particularly disturbing that automatic license plate readers are used to track and record the movements of millions of ordinary people, even though the overwhelming majority are not connected to a crime.”
Users of EZPass and other automated toll devices are also giving up their locations. In October, an investigation discovered that the New York City Department of Transportation had used EZPass transponders to track traffic in Manhattan.
The current data economy and lack of consumer-focused privacy legislation have led to a free-for-all in the market, where companies create services that can act as lures to attract consumers and convince them to allow companies to use their data, often without their realizing it. There needs to be a new covenant between consumers and Internet companies, said EFF’s Gebhart.
“Defaults should serve user privacy, not the advertisers,” she said.