School of Computing and Mathematical Sciences
Natural Language Processing (COIY064H7)
Question: | 1 | 2 | Total |
Points: | 50 | 50 | 100 |
Bonus Points: | 0 | 0 | 0 |
Score: | | | |
- For this module, 60% of the overall marks are for your coursework (this programming assignment) and 40% are for an online quiz.
- The coursework consists of a programming assignment in two parts, with equal weight.
- The coursework is marked out of a total of 100.
- You must complete the coursework using the Python 3 programming language.
- In some parts, the questions specify that you must implement something yourself or use a particular Python library. Otherwise, you may use any widely available Python libraries.
- Most of the question parts only require code to answer. For a few parts you are asked additional questions requiring brief text answers — you should edit the answers.txt file in the coursework template to provide these.
- You should submit your code and answers in a .zip file via the Moodle assignment box – see the full instructions at the end of this document.
Part One — Syntax and Style
In the first part of your coursework, your task is to explore the syntax and style of a set of 19th Century novels using the methods and tools that you learned in class.
The texts you need for this part are in the novels subdirectory of the texts directory in the coursework Moodle template. The texts are in plain text files, and the filenames include the title, author, and year of publication, separated by hyphens. The template code provided in PartOne.py includes function headers for some sub-parts of this question. The main method of your finished script should call each of these functions in order. To complete your coursework, complete these functions so that they perform the tasks specified in the questions below. You may (and in some cases should) define additional functions.
(a) read_novels: Each file in the novels directory contains the text of a novel, and the name of the file is the title, author, and year of publication of the novel, separated by hyphens. Complete the Python function read_novels to do the following:
- create a pandas dataframe with the following columns: text, title, author, year
- sort the dataframe by the year column before returning it, resetting or ignoring the dataframe index.
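The filename handling in (a) can be sketched as follows. This is only an illustration in plain Python, not the template's code: the helper name collect_novel_rows is hypothetical, and it assumes the files end in .txt, are UTF-8 encoded, and use underscores in place of spaces within titles.

```python
from pathlib import Path


def collect_novel_rows(novels_dir):
    """Return one dict per .txt file, sorted by year of publication."""
    rows = []
    for path in Path(novels_dir).glob("*.txt"):
        # Split from the right so hyphens inside a title do not break parsing.
        title, author, year = path.stem.rsplit("-", 2)
        rows.append({
            "text": path.read_text(encoding="utf-8"),
            # Assumption: underscores in the filename stand for spaces.
            "title": title.replace("_", " "),
            "author": author,
            "year": int(year),
        })
    return sorted(rows, key=lambda row: row["year"])
```

Passing the returned list to pandas.DataFrame gives the four required columns with a fresh index, already ordered by year.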
(b) nltk_ttr: This function should return a dictionary mapping the title of each novel to its type-token ratio. Tokenize the text using the NLTK library only. Do not include punctuation as tokens.
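As a rough sketch of the ratio itself, the helper below takes a list of tokens that is assumed to have been produced already by an NLTK tokenizer such as nltk.word_tokenize. The name type_token_ratio is hypothetical, and lowercasing before counting types is one possible choice rather than a stated requirement.

```python
def type_token_ratio(tokens):
    """Ratio of distinct word types to word tokens, ignoring punctuation."""
    # Keep tokens with at least one alphanumeric character; lowercase so that
    # "The" and "the" count as one type (an assumption about case-folding).
    words = [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]
    return len(set(words)) / len(words) if words else 0.0
```

For the tokens ["The", "cat", "saw", "the", "dog", "."] this gives 4 types over 5 word tokens, that is 0.8.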
(c) flesch_kincaid: This function should return a dictionary mapping the title of each novel to the Flesch-Kincaid reading grade level score of the text. Use the NLTK library for tokenization and the CMU pronouncing dictionary for estimating syllable counts.
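The score in (c) uses the standard Flesch-Kincaid grade level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below shows one way to combine it with a pronouncing dictionary. It assumes pron_dict has the shape returned by nltk.corpus.cmudict.dict(), where each word maps to a list of pronunciations and each pronunciation is a list of phoneme strings. The function names and the vowel-run fallback for words missing from the dictionary are illustrative assumptions.

```python
import re


def count_syllables(word, pron_dict):
    """Syllables in `word`: dictionary lookup first, vowel-run estimate otherwise."""
    prons = pron_dict.get(word.lower())
    if prons:
        # CMU vowel phonemes end in a stress digit (0, 1 or 2), e.g. "AH0".
        return sum(1 for phoneme in prons[0] if phoneme[-1].isdigit())
    # Fallback (an assumption, not part of the CMU dictionary): count vowel runs.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(n_sentences, n_words, n_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
```

The sentence and word counts would come from NLTK tokenization of each novel; only the first listed pronunciation of a word is used here.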
(d) When is the Flesch-Kincaid score *not* a valid, robust or reliable estimator of text difficulty? Give two conditions. (Text answer, 200 words maximum).
(e) parse: The goal of this function is to process the texts with spaCy’s tokenizer and parser, and store the processed texts. Your completed function should:
- Use the spaCy nlp method to add a new column to the dataframe that contains parsed and tokenized Doc objects for each text.
- Serialise the resulting dataframe (i.e., write it out to disk) using the pickle format.
- Return the dataframe.
- Load the dataframe from the pickle file and use it for the remainder of this coursework part. Note: one or more of the texts may exceed the default maximum length for spaCy’s model. You will need to either increase this length or parse the text in sections.
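For the length limit mentioned in the note, spaCy exposes the maximum as the nlp.max_length attribute, which can be raised directly. The alternative, parsing in sections, might look like the sketch below. It is plain Python and makes its own assumptions: the helper name split_into_chunks is hypothetical, paragraphs are taken to be separated by blank lines, and a default limit of 1,000,000 characters is used.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Split `text` into pieces of at most `max_chars`, breaking on blank lines."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        piece = para + "\n\n"
        # Hard-split any single paragraph that is itself too long.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        # Start a new chunk when adding this paragraph would exceed the limit.
        if len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be parsed separately. How the per-chunk results are recombined into one entry per novel is a design choice left open here.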
(f) Working with parses: the final lines of the code template contain three for loops. Write the functions needed to complete these loops so that they print:
- The title of each novel and a list of the ten most common syntactic subjects overall in the text. (5 marks)
- The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to say’ (in any tense) in the text, ordered by their frequency. (5 marks)
- The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to say’ (in any tense) in the text, ordered by their Pointwise Mutual Information. (7 marks)
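The Pointwise Mutual Information ranking in the last item can be sketched as below. The input is assumed to be a list of (subject, verb lemma) pairs gathered from the spaCy parse, for example from tokens whose dep_ is "nsubj" paired with token.head.lemma_. The function name subjects_by_pmi is hypothetical, and estimating probabilities over subject-verb pairs only is one possible choice of event space.

```python
import math
from collections import Counter


def subjects_by_pmi(pairs, verb_lemma="say", top_n=10):
    """Rank subjects of `verb_lemma` by PMI over (subject, verb lemma) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    subj_counts = Counter(subj for subj, _ in pairs)
    verb_counts = Counter(verb for _, verb in pairs)
    joint_counts = Counter(subj for subj, verb in pairs if verb == verb_lemma)

    scores = {}
    for subj, joint in joint_counts.items():
        # PMI = log2( P(subj, verb) / (P(subj) * P(verb)) )
        scores[subj] = math.log2(
            (joint * total) / (subj_counts[subj] * verb_counts[verb_lemma])
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Note that PMI tends to favour rare subjects that occur almost exclusively with the verb, so this ordering can differ considerably from the frequency ordering.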
Part One total marks: 50
Part Two — Feature Extraction and Classification
In the second part of the coursework, your task is to train and test machine learning classifiers on a dataset of political speeches. The objective is to learn to predict the political party from the text of the speech. The texts you need for this part are in the speeches sub-directory of the texts directory of the coursework Moodle template. For this part, you can structure your python functions in any way that you like, but pay attention to exactly what information (if any) you are asked to print out in each part. Your final script should print out the answers to each part where required, and nothing else.
(a) Read the hansard40000.csv dataset in the texts directory into a dataframe. Subset and rename the dataframe as follows:
- rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’, and then:
- remove any rows where the value of the ‘party’ column is not one of the four most common values (excluding the ‘Speaker’ value).
- remove any rows where the value in the ‘speech_class’ column is not ‘Speech’.
- remove any rows where the text in the ‘speech’ column is fewer than 1500 characters long.
Print the dimensions of the resulting dataframe using the shape attribute.
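The filtering steps in (a) could be sketched with pandas as below. The function name subset_speeches and its parameters are hypothetical, and the column name speech_class (with an underscore) is an assumption about how the column is spelled in the CSV.

```python
import pandas as pd


def subset_speeches(df, n_parties=4, min_chars=1500):
    """Apply the rename and the three row filters, returning a new dataframe."""
    df = df.copy()
    df["party"] = df["party"].replace({"Labour (Co-op)": "Labour"})
    # Most common parties, counted after the rename and ignoring 'Speaker'.
    counts = df.loc[df["party"] != "Speaker", "party"].value_counts()
    keep = counts.nlargest(n_parties).index
    df = df[df["party"].isin(keep)]
    # Column name assumed to be 'speech_class'.
    df = df[df["speech_class"] == "Speech"]
    df = df[df["speech"].str.len() >= min_chars]
    return df.reset_index(drop=True)
```

Calling the function on the dataframe returned by pd.read_csv and printing the .shape of the result would then give the required output.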
(b) Vectorise the speeches using TfidfVectorizer from scikit-learn. Use the default parameters, except for omitting English stopwords and setting max_features to 4000. Split the data into a train and test set, using stratified sampling, with a random seed of 99.
(c) Train RandomForest (with n_estimators=400) and SVM with linear kernel classifiers on the training set, and print the scikit-learn macro-average F1 score and classification report for each classifier on the test set. The label that you are trying to predict is the ‘party’ value. (5 marks)
(d) Adjust the parameters of the TfidfVectorizer so that unigrams, bi-grams and tri-grams will be considered as features, limiting the total number of features to 4000. Print the classification report as in 2(c) again using these parameters.
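Parts (b) to (d) can be combined in one sketch using scikit-learn. The function name evaluate and the small n_estimators override used for quick experiments are hypothetical. The vectoriser is fitted before the split because that is the order in which the steps are described above, and the default test-set size of train_test_split is assumed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate(texts, labels, ngram_range=(1, 1), n_estimators=400):
    """Vectorise, split, train both classifiers and return macro-F1 per model."""
    vectorizer = TfidfVectorizer(
        stop_words="english", max_features=4000, ngram_range=ngram_range
    )
    X = vectorizer.fit_transform(texts)
    # Stratified split with the random seed given in (b).
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, stratify=labels, random_state=99
    )
    models = {
        "RandomForest": RandomForestClassifier(n_estimators=n_estimators),
        "SVM (linear)": SVC(kernel="linear"),
    }
    scores = {}
    for name, model in models.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        scores[name] = f1_score(y_test, predictions, average="macro")
        print(name, "macro-F1:", scores[name])
        print(classification_report(y_test, predictions, zero_division=0))
    return scores
```

With the default ngram_range=(1, 1) this corresponds to (b) and (c); passing ngram_range=(1, 3) corresponds to (d).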
(e) Implement a new custom tokenizer and pass it to the tokenizer argument of TfidfVectorizer. You can use this function in any way you like to try to achieve the best classification performance while keeping the number of features to no more than 4000, and using the same classifiers as above. Print the classification report for the best performing classifier using your tokenizer. (10 marks)
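To show the expected shape of such a function, here is a deliberately simple sketch: a callable that takes a string and returns a list of token strings, which is what the tokenizer argument of TfidfVectorizer accepts. The regular expression, the length threshold and the tiny stopword set are arbitrary illustrative choices, not a recommended design.

```python
import re

_TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
# A tiny illustrative stopword set; a real tokenizer would likely use a fuller list.
_STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it", "for", "on", "be", "this"}
)


def custom_tokenizer(text):
    """Lowercase, keep alphabetic tokens, drop very short tokens and stopwords."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if len(tok) > 2 and tok not in _STOPWORDS]
```

It would be passed as TfidfVectorizer(tokenizer=custom_tokenizer, max_features=4000). Numbers and punctuation are discarded by the pattern above, which may or may not help classification.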
(f) Explain your tokenizer function and discuss its performance.
Part Two total marks: 50
Submission
Upload your submission via the Moodle coursework upload box on the assessment tile. This is the only way that we can accept your work. Your submission upload should be a zip file containing your Python code and the completed answers.txt and declaration.txt files as in the template provided on the Moodle assessment tile. If you prefer, you can submit Jupyter notebooks (.ipynb files) instead of plain Python scripts. Upload the pickle file with your submission unless it is too large. Do not upload the texts.
You must add the following Academic Declaration to name-and-declaration.txt (or README.md) in your submission:
“I have read and understood the sections of plagiarism in the College Policy on assessment offences and confirm that the work is my own, with the work of others clearly acknowledged. I give my permission to submit my report to the plagiarism testing database that the College is using and test it using plagiarism detection software, search engines or meta-searching software.”