Description:
We will focus on the following improvements:
- Data collection
Transfer data from PDF to text files.
- Text extraction
Extract text contents from the documents.
- Sentence tokenizing
Remove stop words, group sentences with similar meanings, and mark groups with certain labels.
- Classification into categories and subcategories
Classify each group based on the meaning of its sentences.
- Interface for easily indicating the location of files
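The sentence tokenizing step above (split into sentences, remove stop words, group similar sentences) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the stop-word list, the regex sentence splitter, the Jaccard word-overlap similarity, and the 0.3 threshold are all assumptions made for the example. In the real pipeline, NLTK's sentence tokenizer and stop-word corpus would likely replace the hand-written parts.

```python
import re
from typing import List, Set

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK's stop-word corpus would be the natural replacement.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and",
              "in", "for", "on", "be", "must", "by"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence: str) -> Set[str]:
    """Lowercase the sentence, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_sentences(sentences: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy grouping: a sentence joins the first group whose first
    sentence is similar enough; otherwise it starts a new group."""
    groups: List[List[str]] = []
    for sentence in sentences:
        words = content_words(sentence)
        for group in groups:
            if jaccard(words, content_words(group[0])) >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups
```

Word overlap is only a rough proxy for "similar meaning"; it is used here because it keeps the sketch dependency-free, and a vector-based similarity from scikit-learn may work better in practice.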
Implementation Method:
Tools/Platforms: Anaconda (Jupyter), SPSS Modeler
Python Packages: NLTK, pdfminer3k, scikit-learn, etc.
Main challenges:
Building a proper dictionary
Classifying the policies
Making a user-friendly interface
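To make the first two challenges concrete, here is one possible shape for a dictionary-driven classifier, given as a sketch under stated assumptions. The category names, subcategory names, and keywords in `POLICY_DICTIONARY` are invented for illustration, and simple keyword counting stands in for whatever classification method the project finally adopts.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical two-level dictionary: category -> subcategory -> keywords.
# Every entry here is a made-up example, not project data.
POLICY_DICTIONARY: Dict[str, Dict[str, List[str]]] = {
    "finance": {
        "travel": ["travel", "expense", "reimbursement"],
        "procurement": ["purchase", "vendor", "invoice"],
    },
    "security": {
        "access": ["password", "login", "account"],
    },
}


def classify(text: str,
             dictionary: Dict[str, Dict[str, List[str]]] = POLICY_DICTIONARY
             ) -> Tuple[str, str]:
    """Return the (category, subcategory) whose keywords occur most often
    in the text, or ("unknown", "unknown") when no keyword matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    best = ("unknown", "unknown")
    best_score = 0
    for category, subcategories in dictionary.items():
        for subcategory, keywords in subcategories.items():
            keyword_set = set(keywords)
            score = sum(1 for token in tokens if token in keyword_set)
            if score > best_score:  # ties keep the earlier entry
                best, best_score = (category, subcategory), score
    return best
```

Matching is exact, so "expenses" would not match the keyword "expense". Stemming the tokens first (for example with NLTK's `PorterStemmer`) is one way to reduce how many word forms the dictionary must list.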
Timeline:
- Before February 28th: Idea
- Before March 10th: Data collection & Methodology & Tools choosing
- Before April 25th: Realization & Results
- Before May 3rd: Improvements
- Final results & presentation
Expected Outcomes:
Building a user-friendly dictionary
Co-authors: Zijing Yu, Qinruo Wu, Zihao Wang