“Computer science is a field which hasn't yet encountered consequences,” ex-Googler Yonatan Zunger wrote on Sunday. Physics made the atomic bomb, chemistry created chemical weapons, but the cutting-edge sciences of today — fields like machine learning, data science, and computational linguistics — have yet to face their A-bomb moment.
That moment is happening right now, and it’s putting researchers like me in a strange position. Six years ago, when I started graduate school studying corpus linguistics — the branch of linguistics that uses huge text databases (called corpora) to understand natural language — bad people using my work to do terrible things wasn’t at the top of my mind.
Today, the burgeoning science of teaching computers to understand natural language has taken sinister directions. It’s a tool that can use your online presence to profile you, predict your behavior, and manipulate you, and while Cambridge Analytica’s claims to have mastered this are currently making headlines across the world, the company is just the tip of the iceberg.
Every day, scholars like me produce new knowledge, computational tools, and data that are increasingly available to anyone. Data science thrives through open-source licenses for each new computational tool, and there are more text databases easily available, for free, than ever before. Just one example: Anyone can now freely download an archive of 3.3 billion public Reddit comments. And all you need to analyze all that text is the power of a serious desktop computer, not an advanced data center.
The body that funds my research, Canada’s Social Sciences and Humanities Research Council, strongly encourages and sometimes requires academics to publish their work in open-access journals, free to the public. There’s also an increasing push to publish data and the code used for statistical analyses alongside our papers.
This is probably a good thing for science, but it also means that anyone, anywhere, can get hold of my research, my code, and any data I open-source, and use them for any purpose.
In all the years I’ve been a researcher, I’ve never once had a class or seminar on the ethics specific to our field. There is a good reason for this: Protecting humanity from the fallout of the knowledge I generate isn’t part of my job description, and university ethics boards exist to protect the individuals who participate in research.
Academics like me take care that the data we scrape from the web, and the research we publish, cannot harm or identify any specific individual. A researcher scraping a million blog posts to make a text database isn’t fundamentally different from a researcher reading and analyzing blog posts one by one, save for the scale, and neither would require clearance from an ethics board. There would be even fewer restrictions or standards for consent if I were doing this work outside of academia. And there are no licensing bodies or professional codes I would need to adhere to.
I have no way of knowing how insights from my research could be used to advance the work of propagandists and other bad actors. But I can take a guess.
The holy grail in computational linguistics is natural language understanding and generation: having a computer comprehend and produce language in a fully humanlike way. We’re still far from this, but there has been remarkable progress in the many sub-problems that need to be solved to get there. For example, in just the last few years, automatic speech recognition has become a mostly solved problem: Spoken language can now be quantified and subjected to analysis as easily as written text.
Fully fledged natural language understanding and generation, in the hands of a hostile propagandist, is a terrifying thought. The internet has already been turned toxic by spam bots and trolls of both the amateur and professional variety. Now imagine having an army of trolls, indistinguishable from real people and just as interactive, that is as large as the computer processing power you can devote to it. This is a long way off, but not as long as we might think.
Currently, each step toward true natural language understanding increases the precision with which computers can profile you through the things you do online. Each incremental improvement, such as detecting sarcasm or interpreting words with multiple meanings, makes it easier for machines to infer your traits and behavior.
Do you say sorry a lot on Facebook? You might be higher than average in neuroticism (or just Canadian). In the same way, your socioeconomic status, personality, gender, age, aspects of your mental health, and much more are all fairly predictable from your language use on social media, and these predictions get more accurate with each improvement in our tools. Sentiment analysis can work out what you feel good or bad about, whether it’s a product on Amazon or your senator. In a capitalist democracy, these tools are most valuable for selling things and influencing votes, but in repressive regimes they could be useful for figuring out who needs to be silenced, amplified, frightened, or blackmailed.
The same tools can also do much good: Some researchers are plying their trade to better detect hate speech online, recognize bot accounts on social media, or understand how pervasive gender and racial biases are in everyday speech. One widely used technology, word embeddings, which represent the meanings of words as numerical vectors, has been used to reveal how artificial intelligence systems can inherit the racial and gender biases of their training data. In response, methods of “de-biasing” these systems have already been developed and deployed.
As computer scientists deal with our A-bomb moment, we need to think more about how our work can be used by good actors like these, and not merely focus on big-money topics like machine translation or speech recognition. I hope my research does good. But once it is published, and academics need to publish, I have no control over what happens next.
Bryor Snefjella is a PhD candidate in the cognitive science of language at McMaster University, and a graduate resident of the Lewis and Ruth Sherman Centre for Digital Scholarship. More information on his work can be found on his ResearchGate page.