Projects:2021s1-13291 Automated Content Moderation
Open-source intelligence refers to information or data that is legally derived from publicly available sources. On social media, a great deal of such open-source data exists, including images, videos, and text. The use of this media, however, carries a level of risk: material uploaded by unknown individuals can expose viewers to content that is not safe for work. Platforms must therefore employ algorithms that detect explicit media so that posts violating the law or the standards and expectations of the community can be removed. For the investigator, automatic classification often has the opposite requirement. Rather than removing extreme content, we wish to isolate it for further study and possible action on behalf of law enforcement. In this project, current machine learning and signal processing methods are analysed in order to create an effective system capable of extracting, filtering, flagging, and storing content from the social media site Snapchat.
Introduction
Every day, people all around the world share their daily lives with others through social media. A majority of the world’s population now has access to the internet and social media. As a result, many people are exposed to unfiltered content, which can go viral in a matter of minutes whether it is legal or not. Society needs a way to automate the regulation of this content. Taking this one step further, this project researches how to collect this data, store it and use it for law enforcement purposes. This leads to the following main questions:
- Can open-source media provide a constant surveillance mechanism?
- How can social media be used to prevent or reprimand the actions of individuals?
- How can large quantities of data be stored for long periods of time while preventing degradation of the information?
Project team
Project students
- Sanjana Tanuku
- Siyu Wang
- Linyu Xu
Supervisors
- Matthew Sorell
- Richard Matthews
Objectives
The objective of the project is to create an effective system capable of extracting, filtering, flagging and storing unsafe content from social media sites such as Snapchat, TikTok, Facebook and Reddit.
Background
Machine Learning
Machine learning can be defined as the science of creating computer systems that can process data about a changing world and draw logical conclusions from it [2]. These systems use algorithms, which are sets of processes performed step by step, to respond to changing data sets [2]. They are able to discover irregularities and relationships between data sets. In machine learning, it is not the scientist's job to hard code rules for every situation a system can encounter; the aim is to give the system enough information that it can learn what to look for and evolve [2]. A system should be able to recognise repeating patterns, trends and anomalies in a data set and categorize the initial (training) data provided [2]. It then applies what it learned in training to new sets of data [2]. Machine learning can be split into two streams, called supervised and unsupervised training [2]. In supervised training the programmer gives the system a set of initial data that specifies what the system must identify, and the system is then given general datasets to which it must apply what it has learned [2]. Scientists can also use unsupervised training, in which the system is provided with an unlabeled data set and is programmed to identify the strongest categories or structuring principles (often called clusters) within it [2]. For classifiers to work well on varied data, the training data must contain the same variations [2]. Ideally, machine learning systems should be trained on diverse material to ensure that they are not overspecialized for a single context [1]. This project uses supervised learning.
Computer Vision
Computer vision is the analysis of digital visual images to understand both the objects that are depicted within them and the scenes from which they were constructed. Computer vision can involve detection of specific objects, segmentation (separation) of multiple objects within one scene, tracking these objects over time and space, three-dimensional reconstruction of the objects and/or scenes, and determination of the placement of the camera in the scene over time and/or space. Numerous methods are used to carry out these tasks, including color analysis, shadow and illumination analysis, geometrical analysis of curves and edges, and photogrammetry [2].
Deep Learning
Deep learning is a relatively new method of machine learning. It is popular because it can learn and capture the features of the target image on its own, without manual feature extraction, which saves a lot of labour [3]. Deep learning has achieved great success in image recognition and computer vision [3].
Convolutional neural network
A convolutional neural network (CNN) is a neural network with a deep structure that performs multiple convolution calculations, so it belongs to deep learning, which is a branch of machine learning [4]. CNNs were first created around the 1980s [4]. After continuous development, they have been widely used in computer vision and artificial intelligence [4]. Many CNN models have in turn been the best-performing algorithm in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4]. The large number of hidden layers in a CNN model also gives it a more powerful ability to learn features, so CNN models perform well in the classification of large-scale images. At present, Google, Microsoft and Facebook are developing and applying CNNs on a larger scale [4]. The working principle of a CNN is convolution between features and images [4]. Because the natural features of an image are consistent across the picture, a CNN can randomly take a small patch of local data from the image as a training sample and use the features learned from this patch as a filter over the entire image [4]. Convolving the image with these features gives the response of each feature at each position of the image [4].
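The sliding-filter idea described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the AlexNet implementation used in the project: the function name `convolve2d`, the 4x4 image and the hand-picked 2x2 kernel are all assumptions made for the example, and a real CNN learns its kernel values during training.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2D image (no padding, stride 1).

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Weighted sum of the image patch under the kernel
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Illustrative 4x4 image: dark left half, bright right half
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked kernel that responds to a left-to-right increase in brightness
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
```

The resulting feature map is non-zero only in the column where the window straddles the edge, which is the sense in which one small filter reports a feature at every position of the image.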
Database storage
A NoSQL database is a non-relational database: it provides a mechanism for data storage and retrieval that is not modeled on the table relationships used in relational databases [7]. NoSQL databases are widely used in big data and real-time network applications [7]. They are usually classified by data model: key-value stores, column stores, document stores and graph databases.
The database chosen in this project is MongoDB, which is a document store [8]. In a document store, records are stored in the form of documents. Generally, the structure and form of the data do not need to be defined in advance [7]. The field types and structure of each document may be different. Documents stored in the database can be updated by their unique id or by tagged information, which means the stored data is semi-structured [7]. Because there is no separation between data and schema, the data can be changed dynamically, which makes it easier to create and retrieve documents. Document storage provides a lot of convenience [7]. There is no need to normalize the data or to worry about structures that the database system may or may not support [7].
In the MongoDB database system, BSON is a binary-encoded serialization of JSON-like documents [8]. It stores key-value pairs as binary bytes: keys are stored as strings, and values can be of any type, including arrays and documents [8].
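As a rough illustration of the document model described above, the sketch below builds two JSON-like documents as Python dictionaries. Every field name and value here is hypothetical and is not taken from the project's actual database, and JSON text from the standard library stands in for BSON, which is the binary form MongoDB would store.

```python
import json

# Hypothetical classification record; all field names are illustrative.
record = {
    "video_id": "example-001",              # string value
    "label": "fire",                        # string value
    "certainty": {"votes": 5, "nodes": 7},  # embedded document
    "node_outputs": [                       # array value
        "fire", "fire", "not fire", "fire", "candle", "fire", "fire",
    ],
}

# A second document in the same collection may carry different fields,
# because no schema has to be declared in advance.
other_record = {
    "video_id": "example-002",
    "label": "candle",
    "notes": "manually reviewed",
}

# Keys are always strings; values may be strings, numbers, arrays or documents.
keys_are_strings = all(isinstance(key, str) for key in record)

# Round trip through JSON text (a stand-in for the binary BSON encoding).
serialized = json.dumps(record, sort_keys=True)
restored = json.loads(serialized)
```

The two records share only some of their fields, which is what makes the stored data semi-structured rather than tabular.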
Method
Materials: The datasets used for training the models include 3 types of images: fire, not fire and candle. Part of the training dataset came from 100 fire videos of the Minneapolis riot [published datasets]. One frame was captured from each video every 10 frames, and the images were manually classified into fire and non-fire categories. The rest of the images, taken from the internet, were grouped into a single candle category in the training datasets, giving the algorithm the ability to learn the characteristics of different types of fire. The training datasets contain 2671 images in total. The final data to be classified in this project are videos collected from Snapchat, and the material to be stored is the classification results. Methods for data collection, video classification and data storage are explained below.
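The frame capture step can be sketched as below. This is a minimal sketch under stated assumptions: the helper name `sample_frame_indices` is hypothetical, and starting the capture at frame 0 is an assumption, since the source only gives the interval of 10 frames.

```python
def sample_frame_indices(total_frames, step=10):
    """Return the indices of the frames to capture from one video.

    Assumes capture starts at frame 0 and then takes every `step`-th frame.
    """
    return list(range(0, total_frames, step))


# Illustrative 35-frame clip
indices = sample_frame_indices(35)
```

Each selected index would then be decoded to an image and labelled by hand before being added to the training datasets.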
Algorithm Training: The approach used in this project is to retrain 7 pre-trained AlexNet models; AlexNet is one kind of CNN. The training datasets are passed through seven filters: a red colour filter, a green colour filter, a blue colour filter, a middle frequency filter, a high frequency filter, a luminance filter and a blank filter. Each filtered dataset is used to train one model separately. These 7 models are treated as 7 nodes that judge the classification. The learning rate and number of epochs used for training are 0.002 and 13 respectively. 70% of the datasets are used to train the models and 30% are used for validation during training. All training is supervised, and all training datasets are pre-processed before being fed into each model.
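The 70/30 split and the simpler colour filters can be sketched as follows. This is an illustration under assumptions, not the project's pre-processing code: the function names are hypothetical, the shuffle before splitting is assumed, and the luminance weights are the common Rec. 601 values, which the source does not specify. The middle and high frequency filters are omitted because the source gives no detail about them.

```python
import random


def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def red_filter(pixel):
    """Keep only the red channel of an (R, G, B) pixel."""
    r, _, _ = pixel
    return (r, 0, 0)


def green_filter(pixel):
    """Keep only the green channel of an (R, G, B) pixel."""
    _, g, _ = pixel
    return (0, g, 0)


def blue_filter(pixel):
    """Keep only the blue channel of an (R, G, B) pixel."""
    _, _, b = pixel
    return (0, 0, b)


def luminance_filter(pixel):
    """Replace a pixel by its grey level (assumed Rec. 601 weights)."""
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def blank_filter(pixel):
    """Leave the pixel unchanged."""
    return pixel


# Stand-in for the 2671 training images: one integer id per image
train_set, validation_set = split_dataset(range(2671))
```

Each filter would be applied to every pixel of every image to produce the dataset for one of the seven models.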
Testing: The classification result of a test is based on the 7 nodes. For each node, if more than ⅓ of the frames captured from a video are classified as fire, the output of that node is fire. The same logic applies to the candle and not fire labels. The label that occurs most often among the seven node outputs is taken as the final result. The classification is tested with different videos and images.
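The two-stage decision described above can be sketched as follows. The source does not say what happens when more than one label passes the ⅓ threshold, when none does, or when the node votes tie, so the tie-breaking in this sketch (most frequent label first, then the order of `LABELS`) is an assumption, as are all function and variable names.

```python
from collections import Counter

LABELS = ("fire", "candle", "not fire")


def node_output(frame_labels):
    """Decide one node's label for a video from its per-frame labels."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    # Labels assigned to more than 1/3 of the frames (integer arithmetic)
    candidates = [label for label in LABELS if 3 * counts[label] > total]
    # Assumption: if several labels qualify, or none does, take the most
    # frequent one; remaining ties fall back to the order of LABELS.
    return max(candidates or LABELS, key=lambda label: counts[label])


def video_label(per_node_frame_labels):
    """Majority vote over the node outputs; returns (label, certainty)."""
    outputs = [node_output(frames) for frames in per_node_frame_labels]
    # Assumption: most_common keeps first-seen order on tied vote counts
    label, votes = Counter(outputs).most_common(1)[0]
    return label, f"{votes}/{len(outputs)}"


# Illustrative per-frame labels for one video: 7 nodes, 6 frames each
mostly_fire = ["fire", "fire", "fire", "not fire", "not fire", "fire"]
mostly_not_fire = ["not fire"] * 5 + ["fire"]
nodes = [mostly_fire] * 5 + [mostly_not_fire] * 2

result = video_label(nodes)
```

In this example five nodes vote fire and two vote not fire, so the video is labelled fire with 5/7 certainty, in the same format as the certainties reported in the results.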
Results
Algorithm: Figure 1 shows the image testing results. Six images were tested, and each is shown with its classified label and the certainty of that label at the top. The testing images include all 3 types we trained on, which are fire, not fire and candle. From Figure 1, we can see that every testing image was classified correctly based on the outputs of the 7 nodes, even though some correct labels have only 5/7 or 6/7 certainty. For example, for the fire image with 5/7 certainty, 2 nodes output a wrong label. Several bright-light images were also used for testing. Bright-light images have features quite similar to fire, but they still received the correct classification, with 7/7 and 5/7 certainty.
Table 1 shows the testing results for 20 videos (red represents a wrong classification result). Only one video was classified incorrectly. The certainty of its label is 4/7, meaning that 4 nodes gave the wrong output. This suggests that the more obvious the features of a video are, the more nodes give the correct label and the higher the certainty of the final label is. Most videos that were classified correctly have a 7/7 certainty, but some received a relatively low certainty even though their labels are correct; for example, video 13 and video 16 in Table 1 have ‘not fire’ labels with a 4/7 certainty. With 19 of the 20 test videos classified correctly, the testing accuracy for this set is 95%.
Table 2 shows the testing results for 20 different videos. There are 2 incorrect labels, given with 4/7 and 5/7 certainty, which indicates that 4 nodes gave a wrong output for video 6 and 5 nodes gave a wrong output for video 11; the rest of the classification results are correct. Again, Table 2 shows some videos that were classified correctly with a relatively low certainty. With 18 of the 20 test videos classified correctly, the testing accuracy here is 90%.
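The two accuracy figures follow directly from the counts of misclassified videos in each table. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def accuracy_percent(correct, total):
    """Share of correctly classified videos, as a percentage."""
    return 100 * correct / total


# Table 1: 1 of 20 videos misclassified; Table 2: 2 of 20 misclassified
table_1_accuracy = accuracy_percent(correct=20 - 1, total=20)
table_2_accuracy = accuracy_percent(correct=20 - 2, total=20)
```

With only 20 videos per table, each misclassified video changes the accuracy by 5 percentage points, so the gap between the two tables corresponds to a single video.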
Conclusion
The current results shown above are not sufficient to show that AI models can provide evidence for law enforcement. The accuracy of the classification based on 7 nodes ranges from 90% to 95%, which shows that the accuracy is not consistent and varies with the testing videos. By changing the training datasets, the tool can be retrained to identify other unsafe targets on social networking sites, such as drugs, firearms and child exploitation material. Future work is to improve the accuracy of the algorithm and make the tool suitable for wider applications.
References
[1] M. Lowenthal and M. Clark, "The Five Disciplines of Intelligence Collection", CQ Press, pp. 5-43, 2016.
[2] R. Szeliski, Computer Vision: Algorithms and Applications, 2011 ed. Guildford, England: Springer, 2010.
[3] L. Zhu, Z. Li, C. Li, J. Wu and J. Yue, "High performance vegetable classification from images based on AlexNet deep learning model", Ijabe.org, vol. 11, no. 4, pp. 217-223, 2018.
[4] A. Mohammed, H. Tao and M. Talab, "Review of Deep Convolution Neural Network in Image Classification", IEEE, 2017.