This program demonstrates how statistical natural language processing can be used to identify the author of a text of unknown authorship. There are many ways of doing this; the one used here is a Bayesian method.
Two things are required for this program:
• The test corpus, i.e. the text whose author is to be determined
• Sample works of two or more authors (at least four per author), one of whom should be the true author
This program builds a unigram language model for each author. The method of solving the problem is as follows:
Let A1, A2, A3, ..., An be n authors, one of whom is the true author.
We need to calculate P(T | Ai), the probability of the test corpus T given author Ai, for each i ∈ [1, n].
Let w1, w2, w3, ..., wm be the words which occur only once in the test corpus, and let P(w1), P(w2), P(w3), ..., P(wm) be the probabilities of these words occurring in the sample works of author Ai. We have:
$$P(T \mid A_i) = \left( \prod_{j=1}^{m} P(w_j) \right)^{1/m}$$
In plain language, the procedure for each author Ai is as follows:
• Build the unigram language model from the sample corpus of Ai.
• Multiply together the probabilities, under that model, of all the unigrams which occur only once in the test corpus.
• Take the geometric mean of these probabilities, i.e. the m-th root of the product, where m is the number of such unigrams.
The same process is applied to every author. The author with the highest P(T | Ai) is taken to be the most likely author of the test corpus.
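The procedure above can be sketched in plain Python on toy data, without NLTK. This is only an illustration under stated assumptions: the names `relative_frequencies` and `attribute` and the three short token lists are hypothetical and are not part of the program. Only words that occur once in the test text and at least once in every author's sample are scored, so that every author is compared on the same words.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Unigram model: relative frequency of each lower-cased token."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {word: float(count) / total for word, count in counts.items()}

def attribute(test_tokens, author_samples):
    """Return (best_author, scores) for a dict of author name -> token list.

    Hypothetical helper: scores each author by the geometric mean of the
    unigram probabilities of words that occur exactly once in the test text.
    """
    models = {a: relative_frequencies(toks) for a, toks in author_samples.items()}
    test_counts = Counter(t.lower() for t in test_tokens)
    # Keep words seen once in the test text and known to every author model.
    once = [w for w, c in test_counts.items()
            if c == 1 and all(w in m for m in models.values())]
    if not once:
        raise ValueError("no once-occurring words shared by all authors")
    scores = {}
    for author, model in models.items():
        product = 1.0
        for w in once:
            product *= model[w]
        scores[author] = product ** (1.0 / len(once))
    return max(scores, key=scores.get), scores

# Toy data, invented for illustration only.
author_samples = {
    "A": "the sea was calm and the sky was grey".split(),
    "B": "the calm calm calm sea and the sky sky sky".split(),
}
test_tokens = "calm sky and the the".split()
best, scores = attribute(test_tokens, author_samples)
print(best, scores)
```

With these toy samples the scored words are "calm", "sky" and "and"; author B uses them more often relative to the size of the sample, so B receives the higher geometric mean.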
This is the first version of our Authorship Attribution program. It does not include statistical enhancement techniques such as smoothing or discounting, so it may not be highly accurate. Future versions will include these features to improve accuracy and efficiency.
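For readers unfamiliar with smoothing, the sketch below shows add-one (Laplace) smoothing, one of the simplest such techniques. It is not part of this program and is not necessarily what a future version would use; the function name `laplace_probability`, the toy sentence and the extra vocabulary word "dog" are all assumptions made for illustration. The idea is that a word never seen in an author's sample still receives a small non-zero probability.

```python
from collections import Counter

def laplace_probability(word, counts, total_tokens, vocabulary_size):
    """Add-one estimate: (count(word) + 1) / (N + V). Hypothetical helper."""
    return (counts.get(word, 0) + 1.0) / (total_tokens + vocabulary_size)

# Toy sample, invented for illustration only.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
# Assumed shared vocabulary: the sample's words plus one unseen word.
vocabulary = set(tokens) | {"dog"}

p_the = laplace_probability("the", counts, len(tokens), len(vocabulary))
p_dog = laplace_probability("dog", counts, len(tokens), len(vocabulary))
total = sum(laplace_probability(w, counts, len(tokens), len(vocabulary))
            for w in vocabulary)
print(p_the, p_dog, total)
```

The smoothed probabilities still sum to one over the assumed vocabulary, and the unseen word "dog" no longer has probability zero.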
Code:
Import the required modules from NLTK and the Python standard library.
- from nltk.corpus import PlaintextCorpusReader
- from nltk import FreqDist
- from operator import itemgetter
- from math import log10
- import os
Here we read textual data of all corpora
- def readCorpus(filePathList):
-     corpusList = []
-     for filePath in filePathList:
-         corpusList.append(PlaintextCorpusReader(filePath, '.*'))
-     return corpusList
This method counts the number of words in each corpus and returns a list of values.
- def corporaWordCount(corpusList):
-     corporaWordCountList=[]
-     for corpus in corpusList:
-         corporaWordCountList.append(len(corpus.words()))
-     return corporaWordCountList
This method uses the FreqDist class of NLTK to calculate the frequency distribution of words in each corpus. It returns a list of FreqDist objects.
- def calculateFreqDist(corpusList):
-     corporaFreqDistList=[]
-     for corpus in corpusList:
-         corporaFreqDistList.append(FreqDist(w.lower() for w in corpus.words()))
-     return corporaFreqDistList
This method returns a list of the words which occur exactly once in the test corpus and at least once in every author's corpus. Note that corpusList[-1] refers to the unknown corpus, which is the last element in the corpus list.
- def getOnceOccuringWords(corpusList,corporaFreqDistList):
-     OnceOccuringWords = []
-     for word in corpusList[-1].words():
-         # The frequency distributions are keyed on lower-cased words
-         word = word.lower()
-         if corporaFreqDistList[-1][word] == 1:
-             NoOccurenceInCorpusFlag=0
-             for corpusFreqDist in corporaFreqDistList[:-1]:
-                 if corpusFreqDist[word] ==0:
-                     NoOccurenceInCorpusFlag=1
-             if not NoOccurenceInCorpusFlag:
-                 OnceOccuringWords.append(word)
-     return OnceOccuringWords
This method accumulates, for each author's corpus, the log-probability of every word which occurs once in the unknown corpus. Logarithms are summed instead of multiplying the probabilities directly, so that the result does not underflow.
- def getProbabilityList(OnceOccuringWords,corporaFreqDistList,corporaWordCountList):
-     probabilityList = []
-     for corpusWordCount in corporaWordCountList[:-1]:
-         probabilityList.append(0.0)
-     for word in OnceOccuringWords:
-         for index in range(len(corporaWordCountList)-1):
-             # log10 P(word) = log10(count of word / total words in the author's corpus)
-             probabilityList[index] = probabilityList[index] + log10(float(corporaFreqDistList[index][word])/corporaWordCountList[index])
-     return probabilityList
This method calculates the geometric mean of the probabilities of the once-occurring words for each author, from the summed log-probabilities.
- def getGeometricMeanList(probabilityList,CountOfOnceOccuringWords):
-     # 10 ** (sum of logs / m) is the m-th root of the product of the probabilities
-     return [10 ** (logProbability/CountOfOnceOccuringWords) for logProbability in probabilityList]
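A geometric mean over several hundred word probabilities is numerically delicate: the direct product of many small numbers underflows to zero in floating point. The sketch below compares the direct product with the equivalent computation in log space. The function names and the artificial list of 200 identical probabilities are assumptions chosen only to make the effect visible.

```python
import math

def geometric_mean_direct(probabilities):
    """m-th root of the plain product; underflows for long lists."""
    product = 1.0
    for p in probabilities:
        product *= p
    return product ** (1.0 / len(probabilities))

def geometric_mean_log(probabilities):
    """Same quantity computed as 10 ** (mean of log10 p)."""
    log_sum = sum(math.log10(p) for p in probabilities)
    return 10 ** (log_sum / len(probabilities))

# Artificial data: the true product is 1e-800, far below the smallest float.
probabilities = [1e-4] * 200
direct = geometric_mean_direct(probabilities)
stable = geometric_mean_log(probabilities)
print(direct, stable)
```

Here the direct version collapses to zero, while the log-space version recovers a value of about 1e-4, which is the true geometric mean of the list.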
This method determines the most likely author by finding the author with the highest geometric mean, and returns that author's name.
- def getMostLikelyAuthor(authorList,GeometricMeanList):
-     maxMean = max(GeometricMeanList)
-     authorIndex=GeometricMeanList.index(maxMean)
-     maxProbableAuthor = authorList[authorIndex]
-     return maxProbableAuthor
The main block. Execution begins here. Each name in authorList is a directory under the current working directory; the last one, 'Unknown', holds the test corpus.
- if __name__ == "__main__":
-     authorList = ['Austen','Twain','Doyle','Unknown']
-     presentWorkingDirectory = os.getcwd()
-     fileList = [presentWorkingDirectory + os.sep + author for author in authorList]
-     print("Reading Corpora...")
-     corpusList = readCorpus(fileList)
-     print("Counting Words in Corpora...")
-     corporaWordCountList = corporaWordCount(corpusList)
-     print("Calculating Frequency Distributions...")
-     corporaFreqDistList = calculateFreqDist(corpusList)
-     print("Finding Words Which Occur Once...")
-     OnceOccuringWords = getOnceOccuringWords(corpusList,corporaFreqDistList)
-     CountOfOnceOccuringWords = len(OnceOccuringWords)
-     print("Calculating Probabilities of Words which Occur Once...")
-     probabilityList = getProbabilityList(OnceOccuringWords,corporaFreqDistList,corporaWordCountList)
-     print("Finalizing Results...")
-     GeometricMeanList = getGeometricMeanList(probabilityList,CountOfOnceOccuringWords)
-     maxProbableAuthor = getMostLikelyAuthor(authorList,GeometricMeanList)
-     print("Most probable author is: " + maxProbableAuthor)
| Attachment | Size |
|---|---|
| AuthAttr.zip | 1.8 MB |