Here's what I think of as the dear #American journalists thread 2.0. It concerns Facebook and #lka, hate speech detection, and language complexities of the Global South, where @LIRNEasia (the think tank I work at) operates.
It seems a perfectly natural reaction, given the phenomenal amount of hate speech, to urge @facebook and @twitter to get their shit together and take it off the platform. Post March 2018, we - as in Sri Lankan civil society - tried.
In fact, one of the longest running arguments @sanjanah, @NalakaG and myself have had with the @facebook India team is that they consistently fail to live up to their own damn Community Standards, and only seem to pay attention as long as the @nytimes or @WIRED is slamming them.
Not only that, but the analysis tools they chose to share, such as CrowdTangle, seem for now to be the exclusive domain of a few privileged researchers in their legit-American-university circles. People like us are down to sending emails, hitting refresh, and seeing nothing.
So we at @LIRNEasia decided to better understand hate speech detection ourselves. Topic modelling algorithms, after all, should work. From Blei to Dooley and Corman, there's a whole lot of work out there.
But here we come to the fundamental problem: language. Sinhala and Tamil, to be precise. You can feed a machine learning model anything, but for proper, explainable decisions, the kind we need for policy work . . . the algorithms throw out all manner of quirky shit.
The problem here lies in language. English, for which most of this stuff makes *sense*, belongs to the West Germanic tree. Ours are not: Sinhala is Indo-Aryan and Tamil is Dravidian (gross oversimplification here). Language structures are different. Word orders are different.
Even running the same topic modelling algorithm on similar languages with good translated parallel texts (see the EuroParl corpora) yields different results.
And don't even get me started on morphological richness; talk to a friendly neighborhood linguist. We have languages that can inflect a word hundreds of ways. Want to pick up on that accurately? Good luck.
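To make the inflection point concrete, here is a minimal sketch in Python. It uses a toy English sentence, and the `LEMMAS` table is a hypothetical stand-in for a real lemmatizer, not anything from our whitepaper. It only shows how counts for one word fragment across surface forms when no lemmatizer is available; a language with hundreds of inflections per word fragments far worse.

```python
from collections import Counter

# Hypothetical lemma table standing in for a real lemmatizer (assumption for illustration).
LEMMAS = {"walks": "walk", "walked": "walk", "walking": "walk"}

def count_surface(tokens):
    """Count tokens exactly as they appear in the text."""
    return Counter(tokens)

def count_lemmas(tokens, lemmas=LEMMAS):
    """Count tokens after mapping each known inflected form to its base form."""
    return Counter(lemmas.get(token, token) for token in tokens)

tokens = "she walks he walked they walk we are walking".split()
surface = count_surface(tokens)
lemma = count_lemmas(tokens)

# Without lemmatization the four forms are counted as four unrelated words.
print(surface["walk"], lemma["walk"])
```

A topic model or keyword filter that sees only surface forms treats each inflection as a separate, rarer word, which is one reason results degrade without language-specific tooling.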
The underlying problem is what we call resource poverty in languages. The computational analysis of language requires good corpora, tokenizers, lemmatizers, and many other layers of analysis and software.
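A small Python sketch of why even tokenization needs script-aware tooling. It takes the word "Sinhala" written in Sinhala script and separates base letters from combining signs using the standard `unicodedata` module. The helper name `split_marks` is hypothetical, and this is an illustration of the issue, not a tokenizer.

```python
import unicodedata

# "Sinhala" in Sinhala script, written with escapes so the code points are explicit.
word = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD"

def split_marks(text):
    """Return (letters, marks), grouped by Unicode general category."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    marks = [ch for ch in text if unicodedata.category(ch).startswith("M")]
    return letters, marks

letters, marks = split_marks(word)

# Code points, base letters, and dependent signs that attach to a letter.
print(len(word), len(letters), len(marks))
```

Anything that slices this string character by character, or strips "non-letter" characters as English-oriented cleaning code often does, can detach or delete the dependent signs and change the word.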
We don't have that stuff. A lot of universities are trying, but we're a few years away, complicated by different dialects for formal written language and colloquial speaking language. Only a fraction of the research done for English is available for languages in the South.
I keep telling people that what a programmer can do with a few Python libraries on English and what I can do with Sinhala are completely different. In countries like #Myanmar (where we do large amounts of research), the pop. language isn't even Unicode compatible.
Any R programmers among you can stop screaming now. Call it the capitalism of languages. More investment has been put into English than anything else; rest of us are nowhere close; Wittgenstein was right.
If you really want to understand this problem, here's a whitepaper we (@NisansaDdS, Yasho and I) have been chewing on for a little too long. lirneasia.net/2019/04/natura…
If you really want to go after @facebook and @twitter and @google, your Big Tech, here's a scalpel instead of a hammer: one way around this is to build machine translation good enough that we can translate to English and analyze with only small margins of error.
No translation is ever fully accurate (see what Clive James says about his new Inferno translation for a better explanation). But with enough language data, and vast amounts of processing power, we may be able to hit GOOD ENOUGH.
And @facebook and @google are probably the largest repositories of multilingual text data in the world. Yes, they do translation now, but don't tell me it works.
Get a Sri Lankan colleague to translate an English text for you into Sinhala or Tamil. Something longer than seven words. Political speech, perhaps. Put it on Facebook. Hit the translate button. Try not to laugh.
So urge them to get their translation sorted. Push them to hire more linguists, work closer with local universities, to invest in ling/comp.sci work, the development of parallel corpora, and MAKE THAT STUFF OPEN TO US INSTEAD OF JUST TO WESTERN ACADEMICS.
And this isn't a #srilanka or #myanmar problem. You want a real problem? Try India. 20+ official govt languages. 447 living languages listed by Ethnologue (ethnologue.com/country/IN).
I know this isn't easy to explain. I know I'm being a bit too geeky here. I know we are Homo Narrator, and not Homo Sapiens, and you have word limits, and we gravitate towards simple stories.
But this is a complex world. And from the number of honest journalists that have reached out to myself and colleagues after that first tweet thread of mine, I know at least some of you can and will process.
So, if you want material, have at it. And please talk to linguists - even my whitepaper draft above barely skims the surface. That paper too needs to be simplified a bit, reordered, has one too many T.S. Eliot references, but it's the problem outlined to a general audience.
Tagging @meghara of @BuzzFeedNews and @lmatsakis of @WIRED who have proven themselves willing to approach subjects with nuance and precision. Trust you can get this to those who need to see it.
Tagging @helanigalpaya @Carmenable @adrianshahbaz and @meenakshirv as well. Would tag others, but among a nightmare of local news notifications my phone is slipping gently into that good night.
Subscribe to Yudhanjaya Wijeratne 🌀