Detecting racism using Machine Learning Models

A while ago I posted on Twitter that I could create a social media racism detector in a day. This followed the racial abuse targeted at Rashford, Saka and Sancho after the Euro 2020 final. Like everyone else, I was saddened by the reactions, especially considering it really is only a game. The model did, however, take longer than a day - purely because it took me forever to get TensorFlow working on my M1 Mac, as I had to recompile it for the chipset. M1 issues are something for another day.

With the challenge accepted, I set about creating a machine learning model/algorithm that could be used to support racism detection. Like everything machine learning related it isn't perfect and took a bit of trial and error, but I now have a working prototype. At this stage the model uses very simple Python dictionaries and a race recognition model built on TensorFlow, along with an Instagram API from RapidAPI to pull the posts for a particular individual (marcusrashford) - allowing testing against Instagram. For the purposes of testing the model/algorithm I did not include any real racist terms and instead substituted more commonly used emojis / linguistic tokens.

The script source code is below and contains two core functions and three public variables. The public variables are the headers used to call the APIs (I have removed my access token), a language map that should contain the array of racist terms, and the username of the individual whose posts and comments you want to pull and check for racism. If you imagine this kind of model running asynchronously against posts to social media, flagging comments that appear to be inappropriate / racist, then you can see the value.
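To make the request plumbing concrete, here is a minimal sketch of how those public variables drive the API calls: the RapidAPI key/host header pair, plus a small helper that builds the paginated feed URL. The host and paths mirror the script below; `feed_url` is my own illustrative helper, not part of the script, and the key is left blank deliberately.

```python
# RapidAPI authenticates every request via these two headers.
headers = {
    'x-rapidapi-key': "",  # your RapidAPI access token goes here
    'x-rapidapi-host': "instagram85.p.rapidapi.com",
}

def feed_url(username, page_id=None):
    # The first call has no pageId; later calls append the cursor the
    # API returned in meta['next_page'].
    url = f"https://instagram85.p.rapidapi.com/account/{username}/feed"
    return f"{url}?pageId={page_id}" if page_id else url

print(feed_url("marcusrashford"))
# https://instagram85.p.rapidapi.com/account/marcusrashford/feed
```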

Put simply, the model first calls the Instagram API for the first page of posts on the user's Instagram account. The script then downloads the image for each post and runs the MTCNN (https://github.com/ipazc/mtcnn) face detection model against it. DeepFace supports several detector backends, and MTCNN is not the fastest of them, but it seemed to be the one with the best chance of picking up all ethnic groups in images. Uneven accuracy across ethnic groups is a known issue with facial recognition and is well recorded in the Netflix documentary Coded Bias (https://www.netflix.com/title/81328723).
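To show what the detection step actually yields, here is a minimal sketch of interpreting a DeepFace-style analysis result. The `dominant_race` helper is my own name, and the sample scores are invented for illustration; the dict shape assumes DeepFace's per-race confidence scores plus a winning label.

```python
# A hypothetical helper for reading a DeepFace-style "race" analysis result.
def dominant_race(analysis):
    # DeepFace reports a confidence score per race plus the winning label;
    # recompute the winner from the scores if the label is missing.
    if 'dominant_race' in analysis:
        return analysis['dominant_race']
    return max(analysis['race'], key=analysis['race'].get)

# Invented sample scores, shaped like a DeepFace race result.
sample = {'race': {'black': 91.2, 'white': 3.1, 'asian': 5.7}}
print(dominant_race(sample))  # black
```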

Once the image has been downloaded and the predominant race in it detected, the model downloads the comments against the post and begins detecting any racist language tokens/phrases within each comment. For the purpose of demonstration the language map consists of the heart (❤) emoji for black groups and the fire (🔥) emoji for white groups. This enables enough comments to be detected and flagged as inappropriate - you will need to imagine less appropriate emojis and terms / phrases being mapped. As the comments are downloaded they are checked against the dictionary / array of inappropriate terms and flagged. This is a very crude approach but could be combined with sentiment analysis if further anger / emotional input is needed. At this stage the script is rather slow due to the number of requests needed to download the comments - it would be far quicker if embedded in the social media infrastructure itself. Improvements could be made by using a database for the racist terms, sentiment analysis to determine the tone of each message, and a faster / more reliable detection model.
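The comment-checking logic above can be sketched in isolation. The `flag_comments` helper and the sample comments are my own illustration, using the same emoji stand-ins as the script's language map.

```python
# Emoji stand-ins for genuinely offensive terms, as in the post.
languagemaps = {'black': ['❤'], 'white': ['🔥']}

def flag_comments(comments, race):
    # Return (term, comment) pairs for every comment containing a term
    # mapped to the detected race; unknown races flag nothing.
    flagged = []
    for comment in comments:
        for term in languagemaps.get(race, []):
            if term in comment:
                flagged.append((term, comment))
    return flagged

hits = flag_comments(['great goal ❤', 'well played'], 'black')
print(hits)  # [('❤', 'great goal ❤')]
```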

Those of you who can read code can look at the script below. I hope this approach will be built into all social platforms, using both profile picture and post picture detection, to start taking online abuse more seriously. Adding sentiment analysis would make the model a bit more robust, and further work is required for multiple faces / races within posts. The challenge gets more difficult with videos - although we have to start somewhere. In summary, I think this is a pretty good start (for a weekend project); more work is required, but it was a fun exercise and a clear demonstration that more can and should be done to tackle this important issue.

from deepface import DeepFace 
import pickle
import json 
import requests 
import uuid
import os 


headers = {
    'x-rapidapi-key': "",
    'x-rapidapi-host': "instagram85.p.rapidapi.com"
}

# Map of DeepFace race labels to flagged tokens. Note the keys must match
# DeepFace's labels exactly ('black', not 'black ').
languagemaps = {'black': ['❤'], 'white': ['🔥']}
username = 'marcusrashford'

def read_api_comments(y, pageId, raceInfo):

    querystring_bycode = {"by": "code"}
    race_info = raceInfo

    if pageId:
        # Subsequent pages: the race has already been detected, just page
        # through the remaining comments.
        short_code = y['short_code']
        url_comment = f"https://instagram85.p.rapidapi.com/media/{short_code}/comments?pageId={pageId}"
    else:
        # First call for this post: download the thumbnail and detect the
        # dominant race before reading any comments.
        picture_url = y['images']['thumbnail']
        filename = str(uuid.uuid4()) + '.jpg'
        img_data = requests.get(picture_url).content

        with open(filename, 'wb') as handler:
            handler.write(img_data)

        backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface']
        result = DeepFace.analyze(filename, detector_backend=backends[3], enforce_detection=False)
        race_info = result['dominant_race']

        os.remove(filename)

        short_code = y['short_code']
        url_comment = f"https://instagram85.p.rapidapi.com/media/{short_code}/comments"
        print(race_info)

    r = requests.request("GET", url_comment, headers=headers, params=querystring_bycode)
    feed = r.content
    comments = json.loads(feed)

    comm_array = comments['data']
    has_more = comments['meta']['has_next']

    if race_info in languagemaps:
        badterms = languagemaps[race_info]

        # Flag any comment containing a term mapped to the detected race.
        for c in comm_array:

            comment = c['text']

            for term in badterms:
                if term in str(comment):
                    print('race: ' + race_info + ', found in comment ' + term + ' , in ' + comment)

    # Recurse while the API reports more pages of comments.
    if has_more:
        pageid = comments['meta']['next_page']
        read_api_comments(y, pageid, race_info)


def read_insta_posts(username, pageId):
    # The first call passes pageId=None; later calls append the cursor.
    if pageId:
        url = f"https://instagram85.p.rapidapi.com/account/{username}/feed?pageId={pageId}"
    else:
        url = f"https://instagram85.p.rapidapi.com/account/{username}/feed"

    querystring = {"by":"username"}
    r = requests.request("GET", url, headers=headers, params=querystring)
    feed = r.content
    items = json.loads(feed)

    user = items['data']
    has_more_posts = items['meta']['has_next']
    pageId = items['meta'].get('next_page')
    print(f'downloaded feed for {username}, pageId = {pageId}')

    for y in user:
        print('post title: ' + y['caption'])
        read_api_comments(y, None, None)

    # Recurse to the next page rather than looping, which would never
    # update has_more_posts and spin forever.
    if has_more_posts:
        read_insta_posts(username, pageId)


read_insta_posts(username, None)
