#CyberSecurityNLP : Machine-based Text Analytics of National Cybersecurity Strategies.

Cybersecurity Classification by Machine Learning

by Hao Sun 02/27/2018 06:49 AM GMT

Description

Concept Design

Data Preprocess

Data collection

BeautifulSoup was used to automatically download policy files.

76* unlabeled Cybersecurity policy files were crawled from:

https://www.itu.int/en/ITU-D/Cybersecurity/Pages/National-Strategies-repository.aspx

193 human-labeled cybersecurity policy files were crawled from:

https://www.itu.int/en/ITU-D/Cybersecurity/Pages/Country_Profiles.aspx
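
A minimal sketch of the link-collection step, under stated assumptions. It uses Python's built-in html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The sample HTML, the file names in it, and the helper name collect_pdf_links are hypothetical, and the real repository pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a PDF file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_pdf_links(html, base_url):
    """Hypothetical helper: return absolute URLs of all PDF links in html."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment; a real page would first be fetched over HTTP.
sample_html = '<a href="/docs/strategy_a.pdf">A</a> <a href="about.aspx">About</a>'
pdf_links = collect_pdf_links(
    sample_html, "https://www.itu.int/en/ITU-D/Cybersecurity/Pages/"
)
```

Each collected URL could then be downloaded and saved to disk, for example with the requests package listed under the related packages.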

Extract text from pdf files

Before extracting text from the pdf files, we checked every file and rotated its pages to an upright orientation.

Readable pdf files

If the text in a policy file was machine-readable, pdfminer was used to extract it.

Slides and Images

Ten files were outliers: nine of them (Belgium, Brunei, Korea, Latvia, Malawi, Mauritius, Panama, South Africa and Uruguay) were in image format, and the remaining one (Italy) was locked. We converted these files to JPEG format and used pytesseract to recognize the text.

Initial cleaning

Unreadable characters were removed from the data set.

The text encoding was set to UTF-8.
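
One way this cleaning could look is sketched below. It assumes that "unreadable" means control characters and the Unicode replacement character; the write-up does not define the term, so the rule and the function name remove_unreadable are illustrative.

```python
import unicodedata


def remove_unreadable(text):
    """Drop control/unassigned characters, keeping newlines and tabs."""
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(ch)
        elif ch == "\ufffd":
            # Replacement character left behind by a failed decode.
            continue
        elif unicodedata.category(ch)[0] != "C":
            # Categories starting with "C" are control, format,
            # surrogate, private-use and unassigned code points.
            kept.append(ch)
    return "".join(kept)


raw = "Cyber\x00security\x0c policy\ufffd\n"
clean = remove_unreadable(raw)
# Encode explicitly as UTF-8 when writing the text to disk.
encoded = clean.encode("utf-8")
```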

Special page splitter

When extracting text from the pdf files, a special page splitter was added to the end of each page as a signal for recognizing page boundaries.
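
The page-splitter idea can be sketched as follows. The marker string and the two helper names are hypothetical; the write-up does not say which marker was actually used.

```python
# Hypothetical marker; any string that cannot occur in the policy text works.
PAGE_MARK = "<<<PAGE_BREAK>>>"


def join_pages(pages):
    """Concatenate page texts, appending the marker after every page."""
    return "".join(page.rstrip("\n") + "\n" + PAGE_MARK + "\n" for page in pages)


def split_pages(text):
    """Recover the individual pages from marked-up text."""
    return [part.strip("\n") for part in text.split(PAGE_MARK) if part.strip()]


pages = ["First page text.", "Second page text."]
joined = join_pages(pages)
recovered = split_pages(joined)
```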

Translate

The googletrans package was used to translate non-English files into English.

Sentence Extraction

Filter sentences

Titles and title-like sentences were eliminated first. We kept only sentences of normal length.
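
A rough sketch of such a filter is shown below. The heuristics (no end punctuation plus mostly capitalized words counts as a title) and the length bounds of 5 and 60 words are assumptions made for illustration; the write-up does not state the actual rules or thresholds.

```python
def looks_like_title(line):
    """Heuristic: no end punctuation and mostly capitalized words."""
    words = line.split()
    if not words:
        return True
    capitalized = sum(1 for w in words if w[0].isupper())
    ends_like_sentence = line.rstrip().endswith((".", "!", "?"))
    return not ends_like_sentence and capitalized / len(words) > 0.7


def keep_sentence(line, min_words=5, max_words=60):
    """Keep lines of normal length that do not look like titles."""
    n_words = len(line.split())
    return min_words <= n_words <= max_words and not looks_like_title(line)


lines = [
    "Introduction",
    "Chapter Two Legal Measures And Capacity Building",
    "The government will establish a national computer emergency response team.",
]
kept = [line for line in lines if keep_sentence(line)]
```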

Predict period

Sentence tokenization was a major issue, so we used an LSTM model, implemented in Theano with GPU acceleration, to predict where periods belong. The approach was to take a pre-trained model and apply it to our data set. For example, we used roughly the first 80% of the lines of the Europarl v7 monolingual English corpus as training data, the next 10% as development data and the last 10% as test data. The training set contained about 40 million words. The corpus was obtained from the IWSLT 2012 TED task web page. The accuracy of our model reaches up to 87%.

Split sentences

We then split the text into sentences at the predicted periods. We also removed the page breaks so that the right sentences were extracted.
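
The splitting step might look like the sketch below, assuming the periods have already been restored and that pages are separated by a marker string. The marker value is hypothetical, and the simple rule of splitting after every period would also split abbreviations such as "e.g.".

```python
import re

# Hypothetical page marker inserted during text extraction.
PAGE_MARK = "<<<PAGE_BREAK>>>"


def split_sentences(text):
    """Remove page markers, then split after each period."""
    # Dropping the marker rejoins a sentence that spans two pages.
    text = text.replace(PAGE_MARK, " ")
    text = re.sub(r"\s+", " ", text).strip()
    # Split on whitespace that directly follows a period.
    return [part for part in re.split(r"(?<=\.)\s+", text) if part]


text = (
    "The strategy defines roles. Agencies must\n"
    "<<<PAGE_BREAK>>>\n"
    "report incidents. Training is funded."
)
sentences = split_sentences(text)
```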

Data cleaning

First, unnecessary symbols, including punctuation, numbers and garbled text at the start of each sentence, were removed using regular expressions.

Second, the Porter stemming algorithm was used to remove the commoner morphological and inflexional endings from English words, and stop words were removed to filter out the most common words such as the, is, at, which, and on.
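
The two cleaning steps can be sketched together as follows. The regular expression and the short stop-word list are illustrative assumptions, not the ones used in the project. The stemmer is NLTK's PorterStemmer; the sketch falls back to unstemmed tokens if NLTK is not installed.

```python
import re

try:
    # Porter stemmer from NLTK, which is listed among the related packages.
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Fallback for environments without NLTK: leave tokens unstemmed.
    def _stem(word):
        return word

# Small illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "to", "in"}


def clean_sentence(sentence):
    """Strip non-letters, drop stop words, then stem what is left."""
    # Keeping letters only removes punctuation, digits and stray symbols.
    letters_only = re.sub(r"[^A-Za-z]+", " ", sentence).lower()
    tokens = [t for t in letters_only.split() if t not in STOP_WORDS]
    return [_stem(t) for t in tokens]


tokens = clean_sentence(
    "## 3.2 The agency is responsible for the protection of networks."
)
```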

Topic Generator

More data cleaning

Country names were removed.

The five words with the highest term frequency, and words that occurred fewer than 10 times, were also removed.
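
A sketch of this frequency filter, assuming the counts are taken over the whole corpus of tokenized sentences. The function name is hypothetical, and the toy call uses smaller thresholds than the 5 and 10 mentioned above so that the effect is visible on three short sentences.

```python
from collections import Counter


def filter_by_frequency(sentences, top_n=5, min_count=10):
    """Drop the top_n most frequent words and words rarer than min_count."""
    counts = Counter(token for sent in sentences for token in sent)
    most_common = {word for word, _ in counts.most_common(top_n)}
    keep = {
        word
        for word, count in counts.items()
        if count >= min_count and word not in most_common
    }
    return [[token for token in sent if token in keep] for sent in sentences]


toy = [
    ["cyber", "policy", "law"],
    ["cyber", "policy", "rare"],
    ["cyber", "law"],
]
# Toy thresholds: remove the single most frequent word and words seen once.
filtered = filter_by_frequency(toy, top_n=1, min_count=2)
```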

Essential words selection

Essential words were selected by TF-IDF. The three highest-ranked words of each sentence were selected to build topics.
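
The selection can be sketched in plain Python as below. It assumes each sentence is treated as one document and uses the basic weighting tf * log(N / df); the project may have used a different TF-IDF variant, for example the one in scikit-learn, and the function name is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_words(sentences, k=3):
    """Return the k words with the highest TF-IDF score in each sentence."""
    n_sentences = len(sentences)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for sent in sentences for word in set(sent))
    selected = []
    for sent in sentences:
        tf = Counter(sent)
        scores = {
            word: (tf[word] / len(sent)) * math.log(n_sentences / df[word])
            for word in tf
        }
        # Sort by score (descending), then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        selected.append(ranked[:k])
    return selected


toy = [
    ["cyber", "incident", "response", "team"],
    ["cyber", "law", "enforcement", "agency", "cyber"],
    ["cyber", "training", "awareness"],
]
selected = top_tfidf_words(toy)
```

In the toy data the word "cyber" appears in every sentence, so its inverse document frequency is zero and it ranks last wherever more than three distinct words are available.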

LDA modeling

Ten topics were generated by LDA from the selected words.

Here are the ten meaningful topics:

Topic tagging

Each sentence was tagged with the topic it was most similar to. The similarity between a sentence and a topic was measured using the LDA results.
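
As a sketch, tagging reduces to taking the highest-weighted topic for each sentence. The weights below are made-up numbers for three topics; in practice they would be the per-sentence topic distribution produced by the fitted LDA model.

```python
def tag_sentences(topic_weights):
    """Return the index of the highest-weighted topic for each sentence."""
    return [
        max(range(len(weights)), key=weights.__getitem__)
        for weights in topic_weights
    ]


# Hypothetical topic weights: one row per sentence, one column per topic.
topic_weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
tags = tag_sentences(topic_weights)
```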

Similarity matrix

A similarity matrix of all sentences was generated according to their degree of relevance to the ten topics. The matrix can be used as a reference for category labeling, as related sentences tend to have high similarity scores.
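
One possible reading of this step is a pairwise matrix of cosine similarities between the sentences' topic-weight vectors. The sketch below follows that reading; the choice of cosine similarity and the toy weights are assumptions, since the write-up does not name the measure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical topic weights for three sentences over two topics.
topic_weights = [
    [0.8, 0.2],
    [0.2, 0.8],
    [0.8, 0.2],
]
matrix = similarity_matrix(topic_weights)
```

Sentences 0 and 2 have identical topic weights and so score close to 1, while sentence 1 scores lower against both.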

Category Modeling

Category Definition

Before training our model, we had to make sure the categories were defined correctly, since our task is to predict the categories (labels). The exact definition of each category (including the sub-categories) can be found in the Global Cybersecurity Index & Cyberwellness Profiles:

https://www.itu.int/dms_pub/itu-d/opb/str/D-STR-SECU-2015-PDF-E.pdf

The category definitions for all 192 countries can be found in this book. We also collected more definitions from papers and reports.

Labeled category

Each category and sub-category was labeled with its specific definition. The labeled file was then saved for vectorization.

Word2vec

Machine learning models can only deal with numeric data, so we used Word2vec to produce word embeddings, which are vector representations of words.

Sentence Classification

Sentence modeling

Doc2vec was deployed to transform our sentences into numeric vectors.

Classify by categories

First, sentences were classified by calculating their similarity to each category. A category label was assigned when its statistical significance exceeded 95%.

Classify by sub-categories

Once classification by category was finished, sentences were tagged with the sub-categories whose similarity exceeded 95% significance.
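
The two-stage tagging can be sketched as below. Several things here are assumptions: the 95% criterion is read as a cosine-similarity threshold of 0.95, the 3-dimensional vectors are made up (real ones would come from Doc2vec), and the label names are illustrative rather than the full category list.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(sentence_vec, label_vecs, threshold=0.95):
    """Return every label whose vector is at least threshold-similar."""
    return [
        label
        for label, vec in label_vecs.items()
        if cosine(sentence_vec, vec) >= threshold
    ]


# Hypothetical vectors for two categories and their sub-categories.
categories = {"Legal": [1.0, 0.0, 0.1], "Technical": [0.0, 1.0, 0.1]}
sub_categories = {
    "Legal": {
        "Criminal legislation": [1.0, 0.1, 0.1],
        "Regulation and compliance": [0.5, 0.0, 1.0],
    },
    "Technical": {"Standards": [0.1, 1.0, 0.0]},
}


def classify_two_stage(sentence_vec):
    """Tag categories first, then sub-categories within each matched category."""
    return {
        category: classify(sentence_vec, sub_categories.get(category, {}))
        for category in classify(sentence_vec, categories)
    }


labels = classify_two_stage([0.9, 0.05, 0.1])
```

Restricting the second stage to the sub-categories of already matched categories keeps the two levels consistent: a sentence cannot receive a sub-category label without its parent category.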

Tools/Platforms

Anaconda (Jupyter)

Related Python Packages

pyPDF
pdfminer
googletrans
Beautiful soup
requests
urllib
re
numpy
pandas
gensim
tensorflow
nltk
scikit-learn
pillow
collections
argparse
pytesseract

Co-authors to your solution

Leo Lu/Hanson Dong/Erica Xu/Clark Chen

Link to your concept design and documentation (Required by the final day of the Submission & Collaboration phase)

https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team

Link to an online working solution or prototype (Required by the final day of the Submission & Collaboration phase):

https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team

Link to a video or screencast of your solution or prototype (Required by the final day of the Submission & Collaboration phase):

https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team/blob/master/README.md

Link to source code of your solution or prototype above. (If you submitted a link to an online solution or prototype, or to a video of your solution of prototype, you must provide a link to the source code. This item is required by the final day of the submission phase):

Tags: Cybersecurity, Clustering, Classification, Tokenization
