Concept Design
Data Preprocess
Data collection
BeautifulSoup was used to automatically download policy files.
76* unlabeled cybersecurity policy files were crawled from:
https://www.itu.int/en/ITU-D/Cybersecurity/Pages/National-Strategies-repository.aspx
193 human-labeled cybersecurity policy files were crawled from:
https://www.itu.int/en/ITU-D/Cybersecurity/Pages/Country_Profiles.aspx
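A minimal sketch of the link-collection step, for illustration only. It uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without extra packages, and it assumes the policy files appear as ordinary links ending in ".pdf". The sample HTML, the example.org URLs, and the names PdfLinkParser and find_pdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-insensitive check so that ".PDF" links are also collected.
        if href and href.lower().endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Hypothetical page fragment; the real repository page is not reproduced here.
sample_html = """
<ul>
  <li><a href="/docs/strategy_a.pdf">Strategy A</a></li>
  <li><a href="about.html">About</a></li>
  <li><a href="https://example.org/files/Strategy_B.PDF">Strategy B</a></li>
</ul>
"""
links = find_pdf_links(sample_html, "https://example.org/repository/index.html")
```

Each collected URL could then be downloaded and saved to disk; that part is omitted here.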
Extract text from PDF files
Before extracting text from the PDF files, we checked all files and rotated them to an upright orientation.
Readable PDF files
If the text in a policy file was machine-readable, PDFMiner was used to extract it.
Slides and Images
There were 10 outliers: 9 of them (Belgium, Brunei, Korea, Latvia, Malawi, Mauritius, Panama, South Africa and Uruguay) were in image format, and the other one (Italy) was locked. We converted those files to JPEG format and used pytesseract to recognize the text.
Initial cleaning
Unreadable ASCII characters were removed from the data set.
The text encoding was set to 'UTF-8'.
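The cleaning step can be sketched as follows. This assumes that "unreadable" means ASCII control characters other than tab, newline and carriage return, and that undecodable bytes are simply dropped; both choices, and the name clean_text, are assumptions rather than details from the project.

```python
import re

# Assumption: "unreadable" characters are ASCII control characters,
# excluding tab (\x09), newline (\x0a) and carriage return (\x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(raw_bytes):
    # Decode as UTF-8 and silently drop byte sequences that are not valid.
    text = raw_bytes.decode("utf-8", errors="ignore")
    # Strip the remaining control characters.
    return CONTROL_CHARS.sub("", text)


cleaned = clean_text(b"National\x00 Cyber\x07 Security\n Strat\xffegy")
```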
Special page splitter
When extracting text from the PDF files, a special page-splitter marker was added to the end of each page as a signal for recognizing page boundaries.
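As an illustration of the page-splitter idea, the sketch below appends a marker to every page and later uses it to recover the pages. The marker string "<<<PAGE_BREAK>>>" and the function names are hypothetical; the text does not say which marker was actually used.

```python
# Hypothetical marker; the original marker string is not given in the text.
PAGE_MARKER = "<<<PAGE_BREAK>>>"


def join_pages(pages):
    # Append the marker to the end of every page, then concatenate.
    return "".join(page + PAGE_MARKER for page in pages)


def split_pages(document):
    # The marker terminates each page, so the final split piece is empty.
    parts = document.split(PAGE_MARKER)
    return parts[:-1] if parts and parts[-1] == "" else parts


pages = ["First page text.", "Second page text."]
document = join_pages(pages)
recovered = split_pages(document)
```

The marker should be a string that cannot occur in the policy text itself, otherwise pages would be split in the wrong place.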
Translate
The Google translator* package was used to translate non-English files into English.
Sentence Extraction
Filter sentences
Titles and title-like sentences were eliminated first. We kept only sentences of normal length.
Predict period
Sentence tokenization was a major issue, so we used an LSTM model implemented in Theano with GPU acceleration to predict the periods. The approach was to take a pre-trained model and apply it to our data set. For example, we used roughly the first 80% of lines from the Europarl v7 monolingual English corpus as training data, the next 10% as development data, and the last 10% as test data. The training set contained about 40 million words. The corpus was obtained from the IWSLT 2012 TED task web page. The accuracy of our model can reach 87%.
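The 80/10/10 split by line order described above can be sketched like this. The function name split_corpus and the toy corpus are hypothetical; the sketch only shows the slicing, not the model training.

```python
def split_corpus(lines, train_frac=0.8, dev_frac=0.1):
    # Split in file order (no shuffling): the first block is for training,
    # the next for development, and whatever remains is the test set.
    n = len(lines)
    train_end = int(n * train_frac)
    dev_end = train_end + int(n * dev_frac)
    return lines[:train_end], lines[train_end:dev_end], lines[dev_end:]


# Toy stand-in for a corpus with one sentence per line.
corpus_lines = [f"sentence {i}" for i in range(1000)]
train_lines, dev_lines, test_lines = split_corpus(corpus_lines)
```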
Split sentences
We split the text into sentences at the predicted periods. We also removed the page-break markers so that the correct sentences were extracted.
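A minimal sketch of this splitting step, assuming the period predictor returns text with periods already inserted and that pages are separated by a hypothetical "<<<PAGE_BREAK>>>" marker. It splits on any period followed by whitespace, so abbreviations such as "e.g." would need extra handling.

```python
import re

# Hypothetical marker; the original marker string is not given in the text.
PAGE_MARKER = "<<<PAGE_BREAK>>>"


def split_sentences(punctuated_text):
    # Replace page markers with a space so that a sentence running across
    # a page boundary stays in one piece.
    text = punctuated_text.replace(PAGE_MARKER, " ")
    # Split after every period that is followed by whitespace.
    pieces = re.split(r"(?<=\.)\s+", text)
    # Normalize inner whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in pieces if p.strip()]


sentences = split_sentences(
    "The strategy covers five pillars. Each ministry<<<PAGE_BREAK>>>reports annually. "
)
```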
Data cleaning
First, unnecessary symbols, including punctuation, numbers, and garbled text at the start of each sentence, were removed using regular expressions.
Second, the Porter stemming algorithm was used to remove the more common morphological and inflectional endings from English words, and stop words were removed to filter out the most common words, such as "the", "is", "at", "which", and "on".
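The regular-expression cleaning and stop-word removal can be sketched as below. The stop-word list is a small illustrative one, the pattern is an assumption about what counts as an unnecessary symbol, and Porter stemming is left out because it needs a third-party library.

```python
import re

# Small illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "of", "and", "to", "in"}


def clean_sentence(sentence):
    # Keep letters and whitespace only: drops punctuation, digits and symbols.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", sentence)
    tokens = letters_only.lower().split()
    # Remove stop words; stemming would be applied to the tokens that remain.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean_sentence(
    "3.2 ## The agency is responsible for incident response, which is on-call."
)
```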
Topic Generator
More data cleaning
Country names were removed.
Words whose term frequency ranked in the top 5, and words whose frequency was smaller than 10, were also removed.
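One possible reading of these filters, sketched on tokenized sentences: drop country names, the top_k most frequent words, and words that occur fewer than min_count times in the whole corpus. The function name filter_vocabulary and the toy data are hypothetical, and the toy example uses smaller thresholds than the 5 and 10 mentioned above.

```python
from collections import Counter


def filter_vocabulary(tokenized_sentences, country_names, top_k=5, min_count=10):
    # Corpus-wide term frequencies.
    counts = Counter(t for sent in tokenized_sentences for t in sent)
    most_common = {w for w, _ in counts.most_common(top_k)}
    rare = {w for w, c in counts.items() if c < min_count}
    banned = most_common | rare | set(country_names)
    return [[t for t in sent if t not in banned] for sent in tokenized_sentences]


toy_sentences = [
    ["cyber", "security", "latvia"],
    ["cyber", "policy", "security"],
    ["cyber", "policy", "incident"],
]
# Toy thresholds so the effect is visible on three sentences.
filtered = filter_vocabulary(toy_sentences, {"latvia"}, top_k=1, min_count=2)
```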
Essential words selection
Essential words were selected by TF-IDF. The words ranked in the top three of each sentence were selected to build topics.
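A minimal sketch of top-three selection by TF-IDF, treating each sentence as a document. The weighting used here (relative term frequency times the natural log of inverse sentence frequency) is one common variant and an assumption, as the text does not state the exact formula; the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def top_tfidf_words(tokenized_sentences, k=3):
    n = len(tokenized_sentences)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for sent in tokenized_sentences for w in set(sent))
    selected = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        scores = {w: (c / len(sent)) * math.log(n / df[w]) for w, c in tf.items()}
        # Sort by score (descending), then alphabetically to break ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        selected.append(ranked[:k])
    return selected


toy_sentences = [
    ["national", "cert", "incident", "response", "national"],
    ["national", "strategy", "policy"],
    ["incident", "reporting", "policy", "law"],
]
essential = top_tfidf_words(toy_sentences)
```

Words that appear in only one sentence score highest, which is why "cert" outranks the repeated but widespread "national" in the first toy sentence.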
LDA modeling
Ten topics were generated by LDA from the selected words.
Here are the ten meaningful topics:
Topic tagging
Each sentence was tagged with its most similar topic. The similarity between a sentence and a topic was measured from the LDA results.
Similarity matrix
A similarity matrix of all sentences was generated from their degree of relevance to the ten topics. The matrix can be used as a reference for category labeling, since sentences with high similarity scores tend to belong to the same category.
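The tagging and similarity-matrix steps can be sketched as follows, assuming each sentence already has a vector of per-topic weights from LDA. Cosine similarity is used as the pairwise measure; that choice is an assumption, since the text does not name the measure. The weights below are made up and use 3 topics instead of 10.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_topics(topic_weights):
    # Index of the highest-weight topic for each sentence.
    return [max(range(len(row)), key=row.__getitem__) for row in topic_weights]


def similarity_matrix(topic_weights):
    # Pairwise cosine similarity between the sentences' topic vectors.
    return [[cosine(u, v) for v in topic_weights] for u in topic_weights]


# Hypothetical per-sentence topic weights (3 topics shown instead of 10).
topic_weights = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
tags = tag_topics(topic_weights)
matrix = similarity_matrix(topic_weights)
```

In this toy data the first two sentences share a dominant topic, so their entry in the matrix is much higher than either sentence's entry with the third.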
Category Modeling
Category Definition
Before we could train our model, we had to make sure the categories were defined correctly, since our task is to predict the categories (labels). The exact definition of each category (including the sub-categories) can be found in the Global Cybersecurity Index & Cyberwellness Profiles:
https://www.itu.int/dms_pub/itu-d/opb/str/D-STR-SECU-2015-PDF-E.pdf
The category definitions of all 192 countries can be found in this book. Also, we collected more definitions from some papers and reports.
Labeled category
Each category, including each sub-category, was labeled with a specific definition. The labeled file was then saved for vectorization.
Word2vec
The machine learning model could only deal with numeric data, so we used Word2vec to produce word embeddings, which are vector representations of words.
Sentence Classification
Sentence modeling
Doc2vec was deployed to transform our sentences into numeric vectors.
Classify by categories
First, sentences were classified by calculating their similarity to each category. A category label was assigned when its statistical significance was more than 95%.
Classify by sub-categories
Once classification by category was finished, sentences were tagged with the sub-categories whose similarity was of more than 95% significance.
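The two-stage tagging can be illustrated with the sketch below. It is not the significance test described above: a plain cosine-similarity cutoff of 0.95 is used as a stand-in, and the two-dimensional vectors, the category and sub-category names, and the function classify are all hypothetical. In practice the vectors would come from Doc2vec.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(sentence_vec, category_vecs, subcategory_vecs, threshold=0.95):
    # Stage 1: keep every category whose similarity clears the cutoff.
    categories = [
        c for c, v in category_vecs.items() if cosine(sentence_vec, v) > threshold
    ]
    # Stage 2: only sub-categories of accepted categories are considered.
    subcategories = [
        s
        for c in categories
        for s, v in subcategory_vecs.get(c, {}).items()
        if cosine(sentence_vec, v) > threshold
    ]
    return categories, subcategories


# Hypothetical 2-D vectors for illustration only.
category_vecs = {"legal": [1.0, 0.0], "technical": [0.0, 1.0]}
subcategory_vecs = {
    "legal": {"criminal_legislation": [0.9, 0.1], "regulation": [0.5, 0.5]},
    "technical": {"cirt": [0.1, 0.9]},
}
categories, subcategories = classify([0.95, 0.05], category_vecs, subcategory_vecs)
```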
Tools/Platforms
Related Python Packages
Leo Lu/Hanson Dong/Erica Xu/Clark Chen
https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team
https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team/blob/master/README.md