
    Cybersecurity Classification by Machine Learning

    by Hao Sun 02/27/2018 06:49 AM GMT


      Description

      Concept Design

      Data Preprocessing

      Data collection

      BeautifulSoup was used to automatically download the policy files (see the crawler sketch after the links below).

      76 unlabeled cybersecurity policy files were crawled from:

      https://www.itu.int/en/ITU-D/Cybersecurity/Pages/National-Strategies-repository.aspx

      193 human-labeled cybersecurity policy files were crawled from:

      https://www.itu.int/en/ITU-D/Cybersecurity/Pages/Country_Profiles.aspx
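      A minimal sketch of this crawling step with requests and BeautifulSoup; the link filter and the output folder are illustrative rather than our exact script.

          import os
          from urllib.parse import urljoin

          import requests
          from bs4 import BeautifulSoup

          REPO_URL = ("https://www.itu.int/en/ITU-D/Cybersecurity/"
                      "Pages/National-Strategies-repository.aspx")

          def download_policy_pdfs(page_url, out_dir="policies"):
              # Fetch the page, find every link ending in .pdf, and save each file.
              os.makedirs(out_dir, exist_ok=True)
              soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
              for a in soup.find_all("a", href=True):
                  if a["href"].lower().endswith(".pdf"):
                      pdf_url = urljoin(page_url, a["href"])
                      fname = os.path.join(out_dir, os.path.basename(a["href"]))
                      with open(fname, "wb") as f:
                          f.write(requests.get(pdf_url, timeout=60).content)

          download_policy_pdfs(REPO_URL)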

      Extract text from PDF files

      Before extracting text, we checked all the files and rotated any skewed pages to the correct upright orientation.

      Readable PDF files

      If the text in a policy file was machine-readable, PDFMiner was used to extract it.
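      For example, with the high-level API of pdfminer.six (the original pdfminer exposes a lower-level interface, so the exact call in our scripts may differ):

          from pdfminer.high_level import extract_text  # pdfminer.six high-level API

          def pdf_to_text(path):
              # Pull the embedded text layer out of a machine-readable PDF.
              return extract_text(path)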

      Slides and Images

      There were 10 outliers: 9 of them (Belgium, Brunei, Korea, Latvia, Malawi, Mauritius, Panama, South Africa and Uruguay) were image-only PDFs, and the remaining one (Italy) was locked. For these files we converted each page to JPEG and used pytesseract to recognize the text.
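      A sketch of this OCR fallback; pdf2image is assumed here for the PDF-to-JPEG conversion (it is not in our package list, so treat the converter as a stand-in):

          import pytesseract
          from pdf2image import convert_from_path  # assumed converter; requires poppler

          def ocr_pdf(path):
              # Render each page to a JPEG image, then recognize its text with Tesseract.
              pages = convert_from_path(path, dpi=300, fmt="jpeg")
              return "\n".join(pytesseract.image_to_string(page) for page in pages)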

      Initial cleaning

      Unreadable characters were removed from the data set, and the text encoding was set to UTF-8.
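      One way to implement this step, assuming "unreadable characters" means undecodable bytes and non-printable control characters:

          import re

          def initial_clean(text):
              # Force a clean UTF-8 round trip, dropping bytes that cannot be decoded,
              # then strip non-printable control characters.
              text = text.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
              return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)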

      Special page splitter

      When extracting text from the PDF files, a special page-splitter token was appended to the end of each page; it later serves as a signal for recognizing page boundaries.
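      A sketch of per-page extraction with an illustrative splitter token, combining PyPDF2 (pre-3.0 API) for the page count with pdfminer for the text; the token string itself is arbitrary:

          from PyPDF2 import PdfFileReader
          from pdfminer.high_level import extract_text

          PAGE_SPLITTER = "\n<<<PAGE_BREAK>>>\n"  # illustrative marker token

          def pdf_to_text_with_splitters(path):
              # Extract each page separately, then join the pages with the splitter.
              with open(path, "rb") as f:
                  n_pages = PdfFileReader(f).getNumPages()
              pages = [extract_text(path, page_numbers=[i]) for i in range(n_pages)]
              return PAGE_SPLITTER.join(pages)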

      Translate

      The googletrans package was used to translate non-English files into English.
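      A minimal sketch with googletrans; very long documents have to be translated in chunks because the API limits request size:

          from googletrans import Translator

          translator = Translator()

          def to_english(text):
              # Detect the language on a sample first; translate only non-English text.
              if translator.detect(text[:500]).lang != "en":
                  return translator.translate(text, dest="en").text
              return text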

      Sentence Extraction

      Filter sentences

      Title-like lines and headings were eliminated first; only sentences of normal length were kept.
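      A sketch of the filter; the word-count thresholds and the all-caps heuristic are illustrative choices, not our exact rules:

          def keep_normal_sentences(lines, min_words=5, max_words=60):
              # Drop lines that look like titles or headings (very short or all caps)
              # and lines that are implausibly long for a single sentence.
              kept = []
              for line in lines:
                  n = len(line.split())
                  if n < min_words or n > max_words or line.isupper():
                      continue
                  kept.append(line.strip())
              return kept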

      Predict period

      Sentence tokenization was a major issue because the extracted text often lacked reliable punctuation, so we used an LSTM model, implemented in Theano with GPU acceleration, to predict sentence-ending periods. The methodology was to take a pre-trained model and apply it to our data set. For training, we used roughly the first 80% of lines from the Europarl v7 monolingual English corpus, the next 10% as development data, and the last 10% as test data (preprocessing script here). The training set contained about 40 million words; the corpus was obtained from the IWSLT 2012 TED task web page. The accuracy of our model reached 87%.

      Split sentences

      Using the predicted periods, we split the text into sentences. We also removed the page-break markers so that sentences running across pages were recovered correctly.
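      A compact sketch of both operations, reusing the illustrative splitter token from above:

          import re

          def split_sentences(text, splitter="<<<PAGE_BREAK>>>"):
              # Remove the page-splitter tokens so sentences spanning a page break
              # are rejoined, then split on the (predicted) periods.
              text = text.replace(splitter, " ")
              return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]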

      Data cleaning

      First, unnecessary symbols, including punctuation, numbers, and garbled text at the start of each sentence, were removed using regular expressions.

      Second, the Porter stemming algorithm was used to strip common morphological and inflectional endings from English words, and stop words were removed to filter out the most frequent words such as "the", "is", "at", "which", and "on".
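      A sketch of both cleaning passes with NLTK:

          import re
          from nltk.corpus import stopwords
          from nltk.stem import PorterStemmer

          STOP = set(stopwords.words("english"))  # requires nltk.download("stopwords")
          stemmer = PorterStemmer()

          def clean_tokens(sentence):
              # Drop the garbled non-letter prefix, lowercase, keep alphabetic tokens,
              # remove stop words, and stem what is left.
              sentence = re.sub(r"^[^A-Za-z]+", "", sentence)
              tokens = re.findall(r"[a-z]+", sentence.lower())
              return [stemmer.stem(t) for t in tokens if t not in STOP]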

      Topic Generator

      More data cleaning

      Country names were removed.

      Words ranking in the top 5 by term frequency, as well as words occurring fewer than 10 times, were also removed.
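      A sketch of this frequency filter with collections.Counter; top_k and min_count mirror the numbers above:

          from collections import Counter

          def frequency_filter(token_lists, top_k=5, min_count=10):
              # Ban the top_k most frequent words and any word seen fewer
              # than min_count times across the whole corpus.
              counts = Counter(tok for toks in token_lists for tok in toks)
              banned = {w for w, _ in counts.most_common(top_k)}
              return [[t for t in toks if t not in banned and counts[t] >= min_count]
                      for toks in token_lists]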

      Essential words selection

        Essential words were selected by TF-IDF: the three words with the highest TF-IDF scores in each sentence were kept for building topics.
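        A sketch with scikit-learn's TfidfVectorizer:

            import numpy as np
            from sklearn.feature_extraction.text import TfidfVectorizer

            def top_tfidf_words(sentences, k=3):
                # Score every word with TF-IDF and keep the k best per sentence.
                vec = TfidfVectorizer()
                X = vec.fit_transform(sentences)
                vocab = np.array(vec.get_feature_names())  # get_feature_names_out() in newer scikit-learn
                return [vocab[row.toarray().ravel().argsort()[::-1][:k]].tolist()
                        for row in X]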

        LDA modeling

        Ten meaningful topics were generated by LDA from the selected words.
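        A minimal gensim sketch of the LDA step; passes and random_state are illustrative:

            from gensim import corpora, models

            def build_lda(token_lists, num_topics=10):
                # Build the dictionary and bag-of-words corpus, then fit LDA.
                dictionary = corpora.Dictionary(token_lists)
                corpus = [dictionary.doc2bow(toks) for toks in token_lists]
                lda = models.LdaModel(corpus, num_topics=num_topics,
                                      id2word=dictionary, passes=10, random_state=42)
                return lda, dictionary, corpus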

        Topic tagging

        Each sentence was tagged with its most similar topic; the similarity between a sentence and a topic was measured from the LDA output (the sentence's topic distribution).

        Similarity matrix

        A similarity matrix over all sentences was generated from their degrees of relevance to the ten topics. The matrix can serve as a reference for category labeling, since sentences with similar content tend to have high similarity scores.
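        A sketch covering both the tagging and the similarity matrix: each sentence's topic distribution from the LDA model serves as its feature vector, and the matrix holds the cosine similarities between those vectors (one reading of "measured by LDA results"):

            import numpy as np

            def topic_vectors(lda, corpus, num_topics=10):
                # One dense topic-distribution vector per sentence.
                V = np.zeros((len(corpus), num_topics))
                for i, bow in enumerate(corpus):
                    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
                        V[i, topic] = prob
                return V

            V = topic_vectors(lda, corpus)
            tags = V.argmax(axis=1)                        # most similar topic per sentence
            Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
            similarity_matrix = Vn @ Vn.T                  # cosine similarity between sentences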

        Category Modeling

        Category Definition

        Before training our model, we had to make sure the categories were defined correctly, since our task is to predict these categories (labels). The exact definition of each category (including its sub-categories) can be found in the Global Cybersecurity Index & Cyberwellness Profiles:

        https://www.itu.int/dms_pub/itu-d/opb/str/D-STR-SECU-2015-PDF-E.pdf

        Category definitions for all 192 countries can be found in this report. We also collected additional definitions from related papers and reports.

        Labeled category

        Each category, including its sub-categories, was labeled with its specific definition text. The labeled file was then saved for vectorization.

        Word2vec

        The machine learning model can only deal with numeric data, so we used Word2vec to produce word embeddings, i.e., vector representations of words.
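        A minimal gensim sketch; the hyperparameters and the probe token are illustrative:

            from gensim.models import Word2Vec

            # token_lists: the stemmed token lists produced by the cleaning steps above.
            w2v = Word2Vec(token_lists, size=100, window=5, min_count=2, workers=4)
            # `size` is the gensim 3.x argument name; gensim 4 renamed it to `vector_size`.
            vec = w2v.wv["secur"]  # embedding of the (hypothetical) stemmed token "secur"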

        Sentence Classification

        Sentence modeling

        Doc2vec was deployed to transform our sentences into numeric vectors.
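        A minimal gensim Doc2vec sketch; the hyperparameters are illustrative:

            from gensim.models.doc2vec import Doc2Vec, TaggedDocument

            docs = [TaggedDocument(words=toks, tags=[i])
                    for i, toks in enumerate(token_lists)]
            d2v = Doc2Vec(docs, vector_size=100, window=5, min_count=2, epochs=20)
            sentence_vec = d2v.infer_vector(token_lists[0])  # numeric vector for one sentence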

        Classify by categories

        First, sentences were classified by computing their similarity to each category; a category label was assigned only when the similarity was significant at the 95% level.

        Classify by sub-categories

        Once the category classification was finished, sentences were tagged with the sub-categories whose similarity was significant at the 95% level.
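        A sketch of the thresholded tagging; the "95% significance" rule is modeled here as a plain cosine-similarity cutoff, which is an interpretation rather than the exact statistic we computed:

            import numpy as np

            def classify(sentence_vecs, category_vecs, names, cutoff=0.95):
                # Cosine similarity of every sentence vector against every
                # category-definition vector; keep the best label only when
                # it clears the cutoff.
                S = (sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)) @ \
                    (category_vecs / np.linalg.norm(category_vecs, axis=1, keepdims=True)).T
                best = S.argmax(axis=1)
                return [names[b] if S[i, b] >= cutoff else None
                        for i, b in enumerate(best)]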

        Tools/Platforms

        • Anaconda (Jupyter)

        Related Python Packages

        • pyPDF, pdfminer
        • googletrans
        • Beautiful Soup
        • requests
        • urllib
        • re
        • numpy
        • pandas
        • gensim
        • tensorflow
        • nltk
        • scikit-learn
        • pillow
        • collections
        • argparse
        • pytesseract
        Co-authors to your solution

        Leo Lu/Hanson Dong/Erica Xu/Clark Chen

        Link to your concept design and documentation (Required by the final day of the Submission & Collaboration phase)

        https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team

        Link to an online working solution or prototype (Required by the final day of the Submission & Collaboration phase):

        https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team

        Link to a video or screencast of your solution or prototype (Required by the final day of the Submission & Collaboration phase):

        https://github.com/AFinalExam/UN-Cybersecurity-Fordham-Core-Team/blob/master/README.md

        Link to source code of your solution or prototype above. (If you submitted a link to an online solution or prototype, or to a video of your solution of prototype, you must provide a link to the source code. This item is required by the final day of the submission phase):

        Cybersecurity, Clustering, Classification, Tokenization
