FIT5149 2021 S2 Assessment 2
Scientific Document Classification
Sep 2021
Marks 35% of all marks for the unit
Due Date 23:55 Friday 22 Oct 2021
Extension
An extension may be granted in certain circumstances.
Please refer to the university webpage on special
consideration. A special consideration application form
must be submitted. Please note that ALL special
consideration, including within the semester, is now to
be submitted centrally. All students MUST submit an
online special consideration form via Monash Connect.
Lateness
For all assessment items handed in after the official
due date, and without an agreed extension, a 10%
penalty applies to the student’s mark for each day
after the due date (including weekends, and public
holidays) for up to 5 days. Assessment items handed in
after 5 days will not be considered/marked.
Authorship
This assignment is a group assignment and the final
submission must be identifiably your group’s own work.
Breaches of this requirement will result in an
assignment not being accepted for assessment and
may result in disciplinary action.
Submission
Each group is required to submit two files: a PDF
file containing the report, and a ZIP file
containing the implementation and the other required
files. The two files must be submitted via Moodle. All
the group members are required to log in to Moodle to
accept the terms and conditions on the Moodle
submission page. A draft submission won’t be marked.
Programming language
Either R or Python
Note: Please read the description from the start to the end carefully before you start your work!
Given that it is a group assessment, each group should evenly distribute the work among
all the group members.
1 Introduction
Scientific document classification is a key step for managing research articles and papers in forums
like arXiv, Google Scholar and Microsoft Academic. In this assessment, you are given some abstracts
crawled from the American Chemical Society. The task is to develop classification models which can
make predictions and return the corresponding scientific fields of the source documents. Different
from coarse-grained classification tasks like sentiment analysis, this is a fine-grained classification
task with 19 field classes in total. There are many machine learning methods that can
be used in the classification task. They can be categorised into supervised methods (like SVM) and
unsupervised methods (like clustering). Figure 1 shows a typical framework used in supervised
classification.1
Figure 1: A general framework for the supervised classification.
As shown in the figure, there are three major steps, including generating features, developing a
proper classifier, and applying the classifier to the unseen data. The feature extractor is shared by
both training and prediction, which tells us that data used in training and prediction should share
the same feature space.
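The shared feature space can be illustrated with a minimal sketch. The class name BagOfWordsExtractor and the toy documents are hypothetical and not part of the assignment; the point is only that the vocabulary is learned from the training documents and then reused unchanged for the unseen documents.

```python
from collections import Counter


class BagOfWordsExtractor:
    """Toy feature extractor: the vocabulary is learned from training text only."""

    def fit(self, documents):
        # Build the vocabulary (the feature space) from the training documents.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Map any documents into the feature space learned in fit().
        # Words never seen in training are ignored.
        vectors = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            vector = [0] * len(self.vocabulary_)
            for word, count in counts.items():
                if word in self.vocabulary_:
                    vector[self.vocabulary_[word]] = count
            vectors.append(vector)
        return vectors


# Hypothetical toy documents standing in for the abstracts.
train_docs = ["polymer membrane synthesis", "protein folding kinetics"]
test_docs = ["membrane protein transport"]

extractor = BagOfWordsExtractor().fit(train_docs)
train_features = extractor.transform(train_docs)
test_features = extractor.transform(test_docs)
```

Because the extractor is fitted once and only on the training documents, the training and test vectors have the same length and the same column meaning, and no information from the test data enters the feature space.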
The aim of this challenge is to develop a classifier that can assign a set of scientific abstracts
to their corresponding labels as correctly as possible.
2 Dataset
Data Source Data Type classes num. training examples num. testing examples
ACS Material abstracts 19 90,000 10,000
Table 1: Authorship Profiling data set.
We provide the following data sets (Table 1):
train data labels.csv: contains training ids, abstracts and labels. It contains abstracts from
90,000 articles and acts as the training data.
test data.csv: contains only testing ids and abstracts. It contains the abstracts from
10,000 articles.
1The figure is downloaded from https://www.nltk.org/book/ch0...
Warning: Reverse engineering on the provided dataset is not allowed! Any information about
the test data cannot be used in training the classifiers.
3 Data Preparation & Feature Extraction
Selecting relevant features and deciding how to encode them for a classification algorithm is crucial
for learning a good model. Free language text cannot be used directly as input to classification
algorithms. It must be pre-processed and transformed into a set of features represented in a
numerical form. In this section, we will discuss the basic text pre-processing steps and the common
features used in text classification.
The most common and basic pre-processing steps include
Case normalization: Text can contain upper- or lowercase letters. It is a good idea to convert
all text to a single case, either uppercase or lowercase.
Tokenization is the process of splitting a stream of text into individual words.
Stopwords are words that are extremely common and carry little lexical content. The list
of English stop words can be downloaded from the Internet. For example, a comprehensive
stop-word list can be found on Kevin Bouge’s website2.
Removing the most/least frequent words: Besides the stopwords, we usually remove words
appearing in more than 95% of the documents or in less than 5% of the documents as well.
The percentages can vary from corpus to corpus.
Those are only the common steps used in pre-processing text. Please note that the steps are of
your choice and there is no limitation on the pre-processing steps you can use in the
task.
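The pre-processing steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the stop-word set is a tiny hypothetical sample, the tokenizer keeps runs of ASCII letters only (digits, hyphens and chemical formulas are dropped, which may be too aggressive for real abstracts), and the function names are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOPWORDS = {"the", "of", "a", "is", "in", "and", "on"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def filter_by_document_frequency(token_lists, min_df=0.05, max_df=0.95):
    """Keep tokens whose document frequency lies in [min_df, max_df]."""
    n_docs = len(token_lists)
    # Count in how many documents each token appears (not how many times).
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    keep = {t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    return [[t for t in tokens if t in keep] for tokens in token_lists]


# Hypothetical toy abstracts.
abstracts = [
    "The synthesis of a polymer membrane",
    "Polymer degradation in the membrane",
    "Catalysis on a polymer surface",
]
tokens = [preprocess(a) for a in abstracts]
# "polymer" occurs in every abstract (100% > 95%), so it is removed.
filtered = filter_by_document_frequency(tokens)
```

With only three toy documents the 5% lower bound removes nothing; on a corpus of 90,000 abstracts the same thresholds would discard a large share of rare words, so the percentages are worth tuning.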
Next, what kind of features can one extract from free language text for document
classification? There are some common features often considered in document classification, which
include
N-gram feature3: N-grams are sets of co-occurring words within a given window.
For example, for the sentence “The cow jumps over the moon”, if N = 2 (known as bigrams),
then the n-grams would be “the cow”, “cow jumps”, “jumps over”, “over the”, “the moon”.
If N = 3 (known as trigrams), the n-grams would be “the cow jumps”, “cow jumps over”,
“jumps over the”, “over the moon”.
Unigram feature: the special case of N-grams where N = 1. Given the above sentence, the unigrams are
“the”, “cow”, “jumps”, “over”, “the”, “moon”.
POS tags4: part-of-speech annotation.
TF-IDF5 (Term Frequency-Inverse Document Frequency): It is a measure of how important
a word/n-gram is to a document in a collection.
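As a small illustration of the N-gram feature, the sketch below reproduces the bigrams and trigrams of the example sentence. The helper name ngrams is hypothetical, and the sentence is lowercased and split on whitespace, which is a simplification of real tokenization.

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as strings) of a lowercased sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The cow jumps over the moon"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```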
You can choose to use either an individual feature or the combination of multiple features. The
features listed above are candidate features that you could consider in the task. However, you
can go beyond those features and try to find the set of features that can give you the best possible
classification accuracy.
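To make the TF-IDF feature concrete, here is a minimal sketch of one common variant. The exact weighting is an assumption: term frequency is taken as count divided by document length and inverse document frequency as the natural log of N / df. Libraries often apply different smoothing and normalisation, so their numbers will not match this sketch exactly. The function name tf_idf and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(token_lists):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-tokenized toy documents.
docs = [
    ["polymer", "membrane", "synthesis"],
    ["polymer", "catalyst", "surface"],
    ["polymer", "protein", "folding"],
]
scores = tf_idf(docs)
```

In this toy corpus "polymer" appears in every document, so its weight is zero, while a word that appears in only one document receives the largest weight in that document.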
There are many useful online tutorials on text preprocessing in either R or Python, for example,
Feature extraction in Scikit-learn6
Working with text data7
2https://sites.google.com/site...
3https://www.tidytextmining.co...
4martinschweinberger.de/docs/articles/PosTagR.pdf
5https://www.tidytextmining.co...
6https://scikit-learn.org/stab...
7https://scikit-learn.org/stab...
R code: reading, pre-processing and counting text8
“Text Mining with R”9, a tutorial that discusses how to deal with text in R. It provides
compelling examples of real text mining problems.
4 Classifier
The task is to develop a classifier that can give you the most accurate prediction in the scientific
document classification task. The algorithms that you can use are not limited to the algorithms
covered in the lectures/tutorials. The goal at this stage is to find the most accurate classifier. In
order to find the most accurate classifier, each group should empirically compare at least 3 different
types of classification methods in the context of scientific document classification, and then submit
the one that performs best in your comparison. Please note that an algorithm with different input features
will only count as one type of classifier. For example, logistic regression will be counted as one type
of classifier, no matter what features you use.
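One possible way to set up such a comparison in Python is sketched below, assuming scikit-learn is installed. The three classifiers, the TF-IDF features, the 3-fold cross-validation and the toy data are illustrative choices only, not a recommendation of which methods to compare.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy stand-in for the training abstracts and their labels.
texts = [
    "polymer membrane synthesis", "polymer film coating",
    "polymer chain degradation", "polymer blend membrane",
    "polymer resin synthesis", "polymer fiber coating",
    "protein folding kinetics", "protein enzyme binding",
    "protein structure folding", "protein enzyme kinetics",
    "protein binding structure", "protein sequence folding",
]
labels = ["materials"] * 6 + ["biochemistry"] * 6

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
}

mean_accuracy = {}
for name, classifier in candidates.items():
    # The vectorizer sits inside the pipeline so that it is refit on each
    # training fold and never sees the held-out fold.
    model = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

# Name of the candidate with the highest mean cross-validated accuracy.
best = max(mean_accuracy, key=mean_accuracy.get)
```

Keeping the feature extractor inside the pipeline means every candidate is evaluated under the same conditions, which makes the mean accuracies directly comparable in the report.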
5 Evaluation
The evaluation method used in testing is the accuracy score, which is defined as the proportion of
correct predictions among all of the predictions.
\[ \text{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of all predictions}} \]
You can use existing Python/R code to compute the accuracy score, for example
Accuracy score in Python10
Accuracy score in R11
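For reference, the definition above can also be computed directly. The following is a minimal sketch with a hypothetical helper name and made-up labels; the library functions linked above should give the same value.

```python
def accuracy(true_labels, predicted_labels):
    """Proportion of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: three of the four predictions are correct.
y_true = ["A", "B", "A", "C"]
y_pred = ["A", "B", "C", "C"]
score = accuracy(y_true, y_pred)
```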
6 Submission
To finish this data analysis challenge, all the groups are required to submit the following files:
“pred labels.csv”, where the label prediction on the testing documents is stored.
– In your “pred labels.csv”, there must be two columns: the first one is the test id column,
and the second one is the label column. Remember the first row of your “pred labels.csv”
file should be “test id” and “label”.
– The “pred labels.csv” must be reproducible by the assessor with your submitted R/Python
code.
The R/Python implementation of your final classifier with a README file that tells
the assessor how to set up and run your code. The output of your implementation must
include the label prediction for all the testing documents. The use of Jupyter notebook or R
Markdown is not required. All the files that are required for running your implementation
must be compressed into a zip file named “groupName ass2 impl.zip”. Please note
that unnecessary code must be excluded from your final submission. For example, if you
tried three different types of models, say multinomial regression, LDA and classification tree,
and your group decides to submit LDA as the final model, you should remove the code for
the other two models from the submission. The discussion of the comparison should
be included in your report. However, you should keep a copy of the implementation used
for comparison for the purpose of the interview.
8http://www.katrinerk.com/cour...
schedule-words-in-a-haystack/r-code-the-text-mining-package
9https://www.tidytextmining.co...
10https://scikit-learn.org/stab...
11https://www.rdocumentation.or...
A PDF report, where you should document in detail the development of the submitted
classifier. The maximum number of pages allowed is 8. The report must be in
PDF format, named “groupName ass2 report.pdf”. The report must include (but is
not limited to)
– The discussion of how the data preprocessing/features selection has been done.
– The development of the submitted classifier: To choose an optimal classifier for a task,
we often carry out empirical comparisons of multiple candidate models with different
feature sets. In your report, you should include a comprehensive analysis of how the
comparisons are done. For example, the report can include (but is not limited to)
A description of the classifier(s) considered in your comparison.
The detailed experimental settings, which could include, for example, the discussion
of how the cross-validation is set up, how the parameters for the model considered
(if applicable) are chosen, or the setting of semi-supervised learning (if applicable).
The semi-supervised learning process.
Classification accuracy with comprehensive discussion.
The justification of the final model submitted.
Warning: If a report exceeds the page limit, the assessment will only be based on the first
8 pages.
A signed group assignment cover sheet, which must also be included in your zip file.
Warning: a typed name does not count as a signature on the cover sheet.
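The two-column layout required for the prediction file can be sketched with Python's csv module. The ids and labels below are made up, and the header names are copied from this document as written; the exact spelling (for example, whether an underscore is expected) should be confirmed before use. The sketch writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io

# Hypothetical ids and predictions; real values come from the classifier.
test_ids = [1, 2, 3]
predicted = ["fieldA", "fieldB", "fieldA"]

# Header names copied from the text as written; confirm the exact
# spelling (for example, whether an underscore is expected) before use.
header = ["test id", "label"]

# Swap the buffer for open(path, "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)                     # first row: column names
writer.writerows(zip(test_ids, predicted))  # one row per test document
csv_text = buffer.getvalue()
```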
7 How to submit the files?
The Moodle setup allows you to upload only two files:
“groupName ass2 report.pdf”: a PDF report file, which will be submitted to Turnitin.
“groupName ass2 impl.zip”: a zip file that includes
– the implementation of the final submitted model
– “pred labels.csv”, where the label prediction on the testing documents is stored.
– the signed group assignment cover sheet
While submitting your assignment, you can ignore the Turnitin warning message generated for
the ZIP file.
Please note that
Only one group member needs to upload the two files. But all the group members
have to log in to their own Moodle page and click the submit button in order to
make the final submission. If any one member does not click the submit button, the
uploaded files will remain as a draft submission. A draft submission won’t be marked!
The two files must be uploaded separately.