PSG@Dravidian-CodeMix-HASOC2021: Pretrained Transformers for Offensive Language Identification in Tanglish
by
Sean Benhur, Kanchana Sivanraju
2021
Abstract
This paper describes the system submitted to Dravidian-CodeMix-HASOC2021:
Hate Speech and Offensive Language Identification in Dravidian Languages
(Tamil-English and Malayalam-English). The task aims to identify offensive
content in code-mixed comments and posts in Dravidian languages collected from
social media. Our approach pools the last layers of the pretrained multilingual
BERT transformer. It placed ninth on the leaderboard, with a weighted average
score of 0.61 on the Tamil-English dataset in subtask B. After the task
deadline, we sampled the dataset uniformly and used the MuRIL pretrained model,
which raised the weighted average score to 0.67, the top score on the
leaderboard. Furthermore, because the approach builds on pretrained models, our
models can be reused for the same task on a different dataset. Our code and
models are available at
https://github.com/seanbenhur/tanglish-offensive-language-identification
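The abstract does not say how the last layers are pooled, so the following is only a minimal sketch of one common variant: average the last few hidden layers, then take the mean over non-padding tokens. The function name `pool_last_layers`, the default of four layers, and the toy values are illustrative assumptions, not the authors' implementation; a real system would apply the same idea to the hidden states returned by the transformer.

```python
from typing import List

Vector = List[float]
Layer = List[Vector]  # one hidden vector per token


def pool_last_layers(hidden_states: List[Layer],
                     attention_mask: List[int],
                     num_layers: int = 4) -> Vector:
    """Average the last `num_layers` layers, then mean-pool over real tokens.

    Hypothetical helper: the pooling variant and the default of four layers
    are assumptions made for illustration only.
    """
    if not 1 <= num_layers <= len(hidden_states):
        raise ValueError("num_layers must be between 1 and the number of layers")

    selected = hidden_states[-num_layers:]
    hidden_size = len(selected[0][0])
    pooled = [0.0] * hidden_size
    real_tokens = 0

    for tok_idx, keep in enumerate(attention_mask):
        if not keep:
            continue  # skip padding positions
        real_tokens += 1
        for dim in range(hidden_size):
            # Mean of this token's value across the selected layers.
            layer_mean = sum(layer[tok_idx][dim] for layer in selected) / num_layers
            pooled[dim] += layer_mean

    if real_tokens == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / real_tokens for value in pooled]


# Toy input: 2 layers, 3 tokens (the last one is padding), hidden size 2.
toy_hidden_states = [
    [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],
    [[3.0, 2.0], [5.0, 4.0], [9.0, 9.0]],
]
toy_mask = [1, 1, 0]
sentence_vector = pool_last_layers(toy_hidden_states, toy_mask, num_layers=2)
```

The resulting fixed-size sentence vector would then be passed to a classification layer. Averaging several layers rather than using only the final one is a design choice that can be varied through `num_layers`.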
arXiv:2110.02852v3