Little work has been done on this topic for the Persian language, so the resulting tool should be useful for Persian Q&A websites. Machine learning techniques are used here to determine the type and category of questions so that they can be tagged and classified more easily. Identifying the question type can also help in downstream NLP tasks such as summarization.
The dataset used for our experiments is a set of 2800 Persian questions randomly selected by crawling 140 different social question-and-answer forums or FAQ pages. To define the annotation scheme for the question topic classification, we used the most frequent tags of questions in the main international CQA sites. For the annotation scheme of question types, we integrated the available models mentioned in Table 3 to achieve a more general scheme for this goal. In total, 23 different topics and 12 types were defined for our task.
For both question topics and types, the data were annotated by three annotators who are graduate students and native speakers of Persian.
For each question, the annotators can select up to 3 category labels. The order of the labels matters: the first label has a higher priority than the second one. If none of the available labels is appropriate, they can suggest a new label for the question. The interface also provides a check box for annotator uncertainty, which they should tick if they are not sure about their selected label(s).
combinator.py contains the code for combining the tags of these annotators and evaluating some of the statistics of their tags. A further analysis of the statistics is done in analyser_pro.py.
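To illustrate the kind of merging described above, here is a minimal sketch that combines the ordered labels of several annotators by priority-weighted voting. The weights (3, 2, 1 for the first, second, and third label), the tie-breaking rule, and the name `combine_labels` are assumptions made for this example; they are not taken from combinator.py.

```python
from collections import defaultdict

# Hypothetical weights: the first label counts more than the second, and so on.
PRIORITY_WEIGHTS = (3, 2, 1)


def combine_labels(annotations, max_labels=3):
    """Merge ordered label lists from several annotators into one ranked list."""
    scores = defaultdict(int)
    for labels in annotations:
        for rank, label in enumerate(labels[:len(PRIORITY_WEIGHTS)]):
            scores[label] += PRIORITY_WEIGHTS[rank]
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:max_labels]


# Toy labels from three annotators for one question (made up for illustration).
annotations = [
    ["health", "sports"],
    ["health"],
    ["sports", "health", "education"],
]
print(combine_labels(annotations))
```

Weighting by position keeps the priority information that the annotation interface collects, instead of treating all selected labels as equal votes.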
This dataset is available in Primary_data/result_filtered.csv.
We use a bag-of-words representation as the input for our learning methods. In word_vector_builder.py, we find the most frequent words in the questions of our dataset, excluding stop words. Then, in training_data_builder.py, we create the feature vector and the vectors of types and topics for each question.
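The bag-of-words step can be sketched as follows. This is a simplified illustration, not the code in word_vector_builder.py or training_data_builder.py: the function names, the whitespace tokenisation, and the English toy questions are all assumptions.

```python
from collections import Counter


def build_vocabulary(questions, stop_words, size):
    """Return the `size` most frequent tokens that are not stop words."""
    counts = Counter(
        token
        for question in questions
        for token in question.split()  # assumption: whitespace tokenisation
        if token not in stop_words
    )
    return [word for word, _ in counts.most_common(size)]


def to_feature_vector(question, vocabulary):
    """Count how often each vocabulary word occurs in the question."""
    token_counts = Counter(question.split())
    return [token_counts[word] for word in vocabulary]


# Toy data, made up for illustration.
questions = [
    "how do I install python",
    "how do I learn python quickly",
    "what is the best python book",
]
stop_words = {"how", "do", "I", "what", "is", "the"}

vocabulary = build_vocabulary(questions, stop_words, size=3)
print(vocabulary)
print(to_feature_vector("how do I install python", vocabulary))
```

Each question becomes a fixed-length vector of counts over the chosen vocabulary, which is the form most classical classifiers expect.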
In model_evaluator.py, the different vector representations and training algorithms are evaluated. Then, in fast_learner.py, we use the best algorithms from the previous step to learn models from the training data and dump them as pickle files.
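The evaluate-then-dump workflow can be sketched with the standard library alone. The two toy models below (a majority-label baseline and a nearest-centroid classifier), the data, and the file name `best_model.pkl` are assumptions chosen to keep the example self-contained; they are not the algorithms or paths used in model_evaluator.py and fast_learner.py.

```python
import pickle
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def train_majority(X, y):
    """Baseline model: remembers only the most common label."""
    return {"kind": "majority", "label": Counter(y).most_common(1)[0][0]}


def train_centroid(X, y):
    """Nearest-centroid model: stores the mean vector of each label."""
    grouped = defaultdict(list)
    for vector, label in zip(X, y):
        grouped[label].append(vector)
    centroids = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }
    return {"kind": "centroid", "centroids": centroids}


def predict(model, vector):
    """Predict a label for one feature vector."""
    if model["kind"] == "majority":
        return model["label"]

    def squared_distance(label):
        centroid = model["centroids"][label]
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    # Pick the label whose centroid is closest to the vector.
    return min(model["centroids"], key=squared_distance)


def accuracy(model, X, y):
    """Fraction of vectors whose predicted label matches the true label."""
    return sum(predict(model, v) == t for v, t in zip(X, y)) / len(y)


# Toy bag-of-words vectors and labels, made up for illustration.
X_train = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 0]]
y_train = ["tech", "tech", "health", "health", "tech"]
X_test = [[1, 0, 0], [0, 0, 1]]
y_test = ["tech", "health"]

# Train every candidate, score it on held-out data, and keep the best one.
trainers = {"majority": train_majority, "centroid": train_centroid}
models = {name: train(X_train, y_train) for name, train in trainers.items()}
scores = {name: accuracy(model, X_test, y_test) for name, model in models.items()}
best_name = max(scores, key=scores.get)

# Dump the best model to a pickle file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(models[best_name], f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(scores, best_name, predict(restored, [0, 1, 1]))
```

Separating evaluation from the final training run means the slow comparison only has to be done once, while the pickled model can be loaded quickly at serving time.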
The web_interface directory contains the Flask-based web app, and the classifier API is provided by question_classifier.py. It can be used as follows:
    from question_classifier import QuestionClassifier

    # Initialising the class loads the pre-trained models.
    classifier = QuestionClassifier()

    # For each question, you can use either the bag-of-words or the Word2vec pre-trained model.
    # These methods build the question vector, feed it to the model,
    # and return the resulting list of tags.
    # input_question is the Persian question string to classify.
    topics_df, types_df = classifier.bow_classify(input_question)
    topics_df, types_df = classifier.w2v_classify(input_question)

    # Both outputs are pandas DataFrames listing all tags along with
    # the likelihood of their assignment to the input question.
    # Assumption: each row holds the tag first and its likelihood second.
    for item in topics_df.values:
        print('tag:', item[0])
        print('likelihood:', item[1])
A more detailed description of the project will be added soon...