Due: Tuesday 5/7/2019 at 10:00pm
This is a large project and it incorporates many of the concepts we have studied this semester.
Here is the original project description (.doc). We will not implement the entire project; for example, we will skip the graphical interface. The project is our vehicle for doing research on hashing and for using other containers.
Here are the files that you will use for data.
Milestone I
You need to be able to process a set of documents in a directory and produce all possible n-word sequences. You should be able to change n relatively easily. Proof of this milestone consists of demonstrating you can print all n-word sequences to the console for a given n. You will not have to turn in the milestone separately, but it should be your first step on this assignment.
Here is a program that gives some help with getting the file names from a directory.
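As a rough illustration of Milestone I, here is a minimal sketch in Python (chosen only for brevity; it is not the required language or design). It assumes that an n-word sequence means n consecutive words and that words are lowercased and stripped of punctuation before comparison; both rules are assumptions, not requirements stated above. The function names are hypothetical.

```python
import os

def list_documents(directory):
    """Return the sorted names of the regular files in `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

def normalize(word):
    """Lowercase a word and keep only letters and digits (assumed cleanup rule)."""
    return "".join(ch for ch in word.lower() if ch.isalnum())

def n_word_sequences(text, n):
    """Return every run of n consecutive words in `text`, in document order."""
    words = [w for w in (normalize(tok) for tok in text.split()) if w]
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def print_sequences(directory, n):
    """Print all n-word sequences of every document in `directory`."""
    for name in list_documents(directory):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as handle:
            for sequence in n_word_sequences(handle.read(), n):
                print(sequence)
```

Because n is an ordinary parameter, changing it means changing one argument, for example print_sequences("path/to/files", 6).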
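Milestone II
The finished program takes the directory, n, and the threshold on the command line, for example:
./plagiarismCatcher path/to/files 6 200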
which would churn and then produce a list (in order) of all the pairs of files in path/to/files that share more than 200 6-word sequences.
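One possible way to read those three arguments is sketched below in Python. The function name parse_args and the validation rules (n at least 1, threshold not negative) are assumptions made for illustration; the assignment does not prescribe them.

```python
def parse_args(argv):
    """Turn [program, directory, n, threshold] into (directory, n, threshold)."""
    if len(argv) != 4:
        raise ValueError("usage: ./plagiarismCatcher <directory> <n> <threshold>")
    directory = argv[1]
    n = int(argv[2])          # int() raises ValueError on non-numeric text
    threshold = int(argv[3])
    if n < 1 or threshold < 0:
        raise ValueError("n must be at least 1 and the threshold must not be negative")
    return directory, n, threshold
```

In a real script this would be called as parse_args(sys.argv) after importing sys.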
Your final program should be able to produce meaningful output for at least the small set of documents (25 or so).
Lastly, with Milestone II you will submit a short document (the project README) describing what your program does, how to use it, what works, what doesn’t work, and any other features or bugs I should know about when I’m looking at your code.
Hint: Hashing the six-word sequences and then looking for collisions is a good strategy for finding improper collaboration.
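To make the hint concrete, here is a minimal sketch in Python of one way it could work. Everything in it is an assumption for illustration: the table size, the polynomial hash with multiplier 31, and the function names. Each sequence is hashed to a table slot, each slot remembers which files landed there, and every pair of files that meet in a slot gets one point.

```python
from itertools import combinations

TABLE_SIZE = 100_003  # assumed table size (a prime); tune it for the data set

def hash_sequence(sequence, table_size=TABLE_SIZE):
    """Polynomial hash of a string, reduced to a table index."""
    value = 0
    for ch in sequence:
        value = (value * 31 + ord(ch)) % table_size
    return value

def count_shared_sequences(sequences_by_file, table_size=TABLE_SIZE):
    """Map each pair of file names to the number of table slots both files hit.

    `sequences_by_file` maps a file name to the list of its n-word sequences.
    """
    table = {}  # slot index -> set of file names that hashed a sequence there
    for name, sequences in sequences_by_file.items():
        for sequence in sequences:
            slot = hash_sequence(sequence, table_size)
            table.setdefault(slot, set()).add(name)

    counts = {}
    for names in table.values():
        # Every pair of files sharing this slot counts as one collision.
        for pair in combinations(sorted(names), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

This version counts slots, not exact matches: two different sequences that happen to hash to the same slot are counted as shared, and a sequence repeated within one file is counted once. A larger table makes the first effect rarer.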
Notes:
The output could look like this:
700: filenameA, filenameD
350: filenameA, filenameC
350: filenameC, filenameD
205: filenameB, filenameE
204: filenameA, filenameB
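A listing like the one above could be produced from a table of pair counts roughly as follows. This is a sketch in Python under two assumptions: "in order" means largest count first, and ties are broken alphabetically by file name. The function name format_report is hypothetical.

```python
def format_report(counts, threshold):
    """Return report lines for pairs sharing more than `threshold` sequences.

    `counts` maps a (file, file) pair to its number of shared sequences.
    Lines are ordered by count, largest first; ties are ordered by file names.
    """
    flagged = [(count, pair) for pair, count in counts.items() if count > threshold]
    flagged.sort(key=lambda item: (-item[0], item[1]))
    return [f"{count}: {first}, {second}" for count, (first, second) in flagged]
```

Printing each returned line with a threshold of 200 gives output in the same shape as the sample, with pairs at or below the threshold left out.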
Turn in: Each partner should submit to Canvas a zipped file named cheaters_xxxxxx.zip, where xxxxxx is your UTEID.
Last Updated: 4/4/19