yet more cleaning

Alejandro Moreo Fernandez 2019-06-04 09:21:07 +02:00
parent c2a020e34e
commit 436550cf91
23 changed files with 125 additions and 102 deletions


@@ -1 +0,0 @@
Illustrissimo domino Henricho, inclyto Romanorum imperatori et semper augusto, C. capitaneus ueronensis deuotione fidelitatis continua semper insistere uotis suis. Cum serena pacis tranquillitas, decora genitrix artium et alumna, multiplicet et dilatet quam plurimum commoda populorum, cura uigili procurare tenetur cuiuslibet principatus intenctio que sonoro laudis preconio desiderat predicari, ut inuiolatus permaneat status pacificus subiectorum. Nam, ut lectio testatur diuina, illud imperium, illud regnum quod diuersis uoluntatibus intercisum in se non continet unionem, desolationem incurrit, nec in illo corpore sospitatis ilaritas perseuerat, cuius partes uel membra passionibus aliquibus singulariter affliguntur. Quippe recenter uobis hoc notifico euenisse, quod quidam iniquitatis alumni, uasa scelerum et putei uitiorum, quorum propositum est clandestinum et nefandum, sub cuius effectus specie imperiale decus corruere moliuntur, quod absit, inter uirum magnificum, dominum P., inclytum principem Achaie, et hominem excelse potentie, dominum G., comitem, quos in istis partibus prefeceratis in presides et rectores, malignis afflatibus seminauerunt de nouo semen et materiam uictiorum, ita quod uterque ipsorum, cum suorum comitiua sequacium, contentionum ardoribus concitatus, ad prouintiam alterius prorumpere iam presunsit multotiens. Itaque fere iam partis cuiuslibet acies concurrissent, conquassatis capitibus plurimorum, nisi forent quorundam magnatum fidelium imperii suadele que ad salutem et robur imperialis diadematis aspirantes, pro uiribus studuerunt extinguere iracundiam iam conceptam, quod nondum tamen efficaciter potuerunt, malignante diabolo, bonorum operum subuersore, propter quod prouincia Lombardorum tota concutitur tremebunda timore, ne causa huius scandali lanietur, quassantibus inimicis propter casum huiusmodi, dum ex hoc cogitant euenire quod iampridem attentius desideratis affectibus cupiuerunt. Studeat igitur imperatoria celsitudo sui maturitate consilii has radices amarissimas et pericula submouere. Nam si membra talia uestri gubernaculi tam excelsi sic inter se iam ceperint debaccari, quin contra se ipsa alia non insurgant non debet fore dubitabile menti uestre.

Binary file not shown.

Binary file not shown.

Binary file not shown.

File diff suppressed because one or more lines are too long

README.md Normal file → Executable file

Binary file not shown (before: 101 KiB).

Binary file not shown (before: 400 KiB).


@@ -1,13 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Epistola I</title>
</head>
<body>
<h1>Epistola I</h1>
<p><a style="background-color:rgb(171,219,109);">Magnifico atque uictorioso domino, domino Cani Grandi de la Scala, sacratissimi cesarei principatus in urbe Uerona et ciuitate Uicentie uicario generali, deuotissimus suus Dantes Alagherii, Florentinus natione non moribus, uitam orat per tempora diuturna felicem, et gloriosi nominis perpetuum incrementum. Inclita uestre magnificentie laus, quam fama uigil uolitando disseminat, sic distrahit in diuersa diuersos, ut hos in spem sue prosperitatis attollat, hos exterminii deiciat in terrorem. Huius quidem preconium, facta modernorum exsuperans, tanquam ueri existentia latius, arbitrabar aliquando superfluum. </a><a style="background-color:rgb(250,150,86);">Uerum, ne diuturna me nimis incertitudo suspenderet, uelut Austri regina Ierusalem petiit, uelut Pallas petiit Elicona, Ueronam petii fidis oculis discussurus audita, ibique magnalia uestra uidi, uidi beneficia simul et tetigi; et quemadmodum prius dictorum ex parte suspicabar excessum, sic posterius ipsa facta excessiua cognoui. Quo factum est ut ex auditu solo cum quadam animi subiectione beniuolus prius exstiterim; sed ex uisu postmodum deuotissimus et amicus. Nec reor amici nomen assumens, ut nonnulli forsitan obiectarent, reatum presumptionis incurrere, cum non minus dispares connectantur quam pares amicitie sacramento. </a><a style="background-color:rgb(253,173,96);">Nam si delectabiles et utiles amicitias inspicere libeat, illis persepius inspicienti patebit, preheminentes inferioribus coniugari personas. Et si ad ueram ac per se amicitiam torqueatur intuitus, nonne illustrium summorumque principum plerunque uiros fortuna obscuros, honestate preclaros, amicos fuisse constabit? Quidni, cum etiam Dei et hominis amicitia nequaquam impediatur excessu? </a><a style="background-color:rgb(255,253,188);">Quod si cuiquam, quod asseritur, nunc uideretur indignum, Spiritum Sanctum audiat, amicitie sue participes quosdam homines profitentem. Nam in Sapientia de sapientia legitur, quoniam . Sed habet imperitia uulgi sine discretione iudicium; et quemadmodum solem pedalis magnitudinis arbitratur, sic et circa mores uana credulitate decipitur. Nos autem, quibus optimum quod est in nobis noscere datum est, gregum uestigia sectari non decet, quin ymo suis erroribus obuiare tenemur. </a><a style="background-color:rgb(242,104,65);">Nam intellectu ac ratione degentes, diuina quadam libertate dotati, nullis consuetudinibus astringuntur; nec mirum, cum non ipsi legibus, sed ipsis leges potius dirigantur. Liquet igitur, quod superius dixi, me scilicet esse deuotissimum et amicum, nullatenus esse presumptum. Preferens ergo amicitiam uestram quasi thesaurum carissimum, prouidentia diligenti et accurata solicitudine illam seruare desidero. </a><a style="background-color:rgb(249,145,83);">Itaque, cum in dogmatibus moralis negotii amicitiam adequari et saluari analogo doceatur, ad retribuendum pro collatis beneficiis plus quam semel analogiam sequi michi uotiuum est; et propter hoc munuscula mea sepe multum conspexi et ab inuicem segregaui, nec non segregata percensui, dignius gratiusque uobis inquirens. Neque ipsi preheminentie uestre congruum magis comperi magis quam Comedie sublimem canticam, que decoratur titulo Paradisi; et illam sub presenti epistola, tanquam sub epigrammate proprio dedicatam, uobis ascribo, uobis offero, uobis denique recommendo. 
Illud quoque preterire silentio simpliciter inardescens non sinit affectus, quod in hac donatione plus dono quam domino et honoris et fame conferri potest uideri. Quidni cum eius titulum iam presagiam de gloria uestri nominis ampliandum? </a><a style="background-color:rgb(244,112,68);">Satis actenus uidebar expressisse quod de proposito fuit; sed zelus gratie uestre, quam sitio quasi uitam paruipendens, a primordio metam prefixam urget ulterius. Itaque, formula consumata epistole, ad introductionem oblati operis aliquid sub lectoris officio compendiose aggrediar. </a></p>
</body>
</html>

File diff suppressed because one or more lines are too long


@@ -104,7 +104,7 @@ for epistola in [1]:
'RyccardusDeSanctoGermano', 'ZonoDeMagnalis']
paragraphs = range(14, 91)
assert len(authors)==20, f'unexpected number of authors ({len(authors)})'
path+='_tutti'
discarded = 0
f1_scores = []


@@ -98,7 +98,6 @@ for epistola in [1]:
authors = ['Dante'] + authors1
elif epistola == 2:
authors = ['Dante'] + authors2
path += '_tutti'
else:
authors = ['Dante'] + authors3

src/author_identification.py Normal file → Executable file

@@ -13,7 +13,7 @@ from util.color_visualization import color
# TODO: sentence length (Mendenhall-style) ?
for epistola in [2]:
for epistola in [1]:
if epistola==1:
authors = ['Dante','ClaraAssisiensis', 'GiovanniBoccaccio', 'GuidoFaba','PierDellaVigna']
else:
@@ -50,7 +50,8 @@ for epistola in [2]:
features_sentenceLengths=True,
tfidf_feat_selection_ratio=0.1,
wordngrams=True, n_wordngrams=(1, 2),
charngrams=True, n_charngrams=(3, 4, 5), preserve_punctuation=False,
charngrams=True, n_charngrams=(3, 4, 5),
preserve_punctuation=False,
split_documents=True, split_policy=split_by_sentences, window_size=3,
normalize_features=True)

src/author_verification.py Normal file → Executable file

@@ -4,6 +4,8 @@ from data.features import *
from model import AuthorshipVerificator, f1_from_counters
from sklearn.svm import LinearSVC, SVC
from util.color_visualization import color
import pickle
import os
# DONE: ngrams should contain punctuation marks according to Sapkota et al. [39] in the PAN 2015 overview
# (More recently, it was shown that character
@@ -13,49 +15,69 @@ from util.color_visualization import color
# TODO: sentence length (Mendenhall-style) ?
for epistola in [2]:
for epistola in [1]:
print('Epistola {}'.format(epistola))
print('='*80)
path = '../testi_{}'.format(epistola)
if epistola==1:
paragraphs = range(1, 14)
if epistola==2:
path+='_interaEpistola'
paragraphs = range(14, 91)
positive, negative, ep_text = load_texts(path, positive_author='Dante', unknown_target='EpistolaXIII_{}.txt'.format(epistola))
n_full_docs = len(positive) + len(negative)
target = [f'EpistolaXIII_{epistola}_{paragraph}.txt' for paragraph in paragraphs]
positive, negative, ep_texts = load_texts(path, positive_author='Dante', unknown_target=target)
feature_extractor = FeatureExtractor(function_words_freq='latin',
conjugations_freq='latin',
features_Mendenhall=True,
features_sentenceLengths=True,
tfidf_feat_selection_ratio=0.1,
wordngrams=True, n_wordngrams=(1, 2),
charngrams=True, n_charngrams=(3, 4, 5), preserve_punctuation=False,
split_documents=True, split_policy=split_by_sentences, window_size=3,
normalize_features=True)
pickle_file = f'../dante_color/epistola{epistola}.pkl'
if os.path.exists(pickle_file):
print(f'loading pickle file {pickle_file}')
probabilities = pickle.load(open(pickle_file, 'rb'))
else:
print(f'generating pickle file')
n_full_docs = len(positive) + len(negative)
Xtr,ytr,groups = feature_extractor.fit_transform(positive, negative)
print(ytr)
feature_extractor = FeatureExtractor(function_words_freq='latin',
conjugations_freq='latin',
features_Mendenhall=True,
features_sentenceLengths=True,
tfidf_feat_selection_ratio=0.1,
wordngrams=True, n_wordngrams=(1, 2),
charngrams=True, n_charngrams=(3, 4, 5),
preserve_punctuation=False,
split_documents=True, split_policy=split_by_sentences, window_size=3,
normalize_features=True)
ep, ep_fragments = feature_extractor.transform(ep_text, return_fragments=True, window_size=3)
Xtr,ytr,groups = feature_extractor.fit_transform(positive, negative)
print(ytr)
print('Fitting the Verificator')
av = AuthorshipVerificator(nfolds=10, estimator=LogisticRegression)
av.fit(Xtr,ytr,groups)
print('Fitting the Verificator')
av = AuthorshipVerificator(nfolds=10, estimator=LogisticRegression, author_name='Dante')
av.fit(Xtr,ytr,groups)
print('Predicting the Epistola {}'.format(epistola))
title = 'Epistola {}'.format('I' if epistola==1 else 'II')
av.predict(ep, title)
fulldoc_prob, fragment_probs = av.predict_proba(ep, title)
color(path='../dante_color/epistola{}.html'.format(epistola), texts=ep_fragments, probabilities=fragment_probs, title=title)
probabilities = []
for i, target_text in enumerate(ep_texts):
ep = feature_extractor.transform(target_text, avoid_splitting=True)
prob, _ = av.predict_proba(ep, epistola_name=target[i])
probabilities.append(prob)
pickle.dump(probabilities, open(pickle_file, 'wb'), pickle.HIGHEST_PROTOCOL)
color(path=f'../dante_color/epistola{epistola}.html', texts=ep_texts, probabilities=probabilities, title=f'Epistola {("I" if epistola==1 else "II")}', paragraph_offset=paragraphs[0])
# print('Predicting the Epistola {}'.format(epistola))
# title = 'Epistola {}'.format('I' if epistola==1 else 'II')
# av.predict(ep, title)
# fulldoc_prob, fragment_probs = av.predict_proba(ep, title)
# color(path='../dante_color/epistola{}.html'.format(epistola), texts=ep_fragments, probabilities=fragment_probs, title=title)
# score_ave, score_std = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=False)
# print('LOO[full-and-fragments]={:.3f} +-{:.5f}'.format(score_ave, score_std))
score_ave, score_std, tp, fp, fn, tn = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=True, counters=True)
# score_ave, score_std, tp, fp, fn, tn = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=True, counters=True)
# print('LOO[full-docs]={:.3f} +-{:.5f}'.format(score_ave, score_std))
f1_ = f1_from_counters(tp, fp, fn, tn)
print('F1 = {:.3f}'.format(f1_))
# f1_ = f1_from_counters(tp, fp, fn, tn)
# print('F1 = {:.3f}'.format(f1_))
# score_ave, score_std = av.leave_one_out(Xtr, ytr, None)
# print('LOO[w/o groups]={:.3f} +-{:.5f}'.format(score_ave, score_std))
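The rewritten loop caches the per-paragraph probabilities on disk, so the expensive fit-and-predict pass runs only once per epistola. A minimal sketch of the same load-or-compute pattern (the cached helper and compute callback are illustrative, not part of the repository):

import os
import pickle

def cached(pickle_file, compute):
    # Reuse previously computed results when the cache file exists;
    # otherwise compute them once and persist them for later runs.
    if os.path.exists(pickle_file):
        print(f'loading pickle file {pickle_file}')
        with open(pickle_file, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(pickle_file, 'wb') as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result

# usage (hypothetical): probabilities = cached(f'../dante_color/epistola{epistola}.pkl', run_verification)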


@@ -13,8 +13,6 @@ for epistola in [1,2,3]: #3 means "both Ep1 and Ep2 corpora"
print('='*80)
path = '../testiXIV_{}'.format(epistola)
paragraphs = range(1, 6)
if epistola==2:
path+='_tutti'
target = [f'Epistola_ArigoVII.txt'] + [f'Epistola_ArigoVII_{paragraph}.txt' for paragraph in paragraphs]
positive, negative, ep_texts = load_texts(path, positive_author='Dante', unknown_target=target, train_skip_prefix='Epistola_ArigoVII')

src/data/dante_loader.py Normal file → Executable file

@@ -28,13 +28,13 @@ def remove_citations(doc):
doc = remove_pattern(doc, start_symbol='{', end_symbol='}', counter=counter)
return doc
def load_texts(path, positive_author='Dante', unknown_target=None):
def load_texts(path, positive_author='Dante', unknown_target=None, train_skip_prefix='EpistolaXIII_'):
# load the training data (all documents but Epistolas 1 and 2)
positive,negative = [],[]
authors = []
ndocs=0
for file in os.listdir(path):
if file.startswith('EpistolaXIII_'): continue
if file.startswith(train_skip_prefix): continue
file_clean = file.replace('.txt','')
author, textname = file_clean.split('_')[0],file_clean.split('_')[1]
text = open(join(path,file), encoding= "utf8").read()
@@ -63,7 +63,7 @@ def load_texts(path, positive_author='Dante', unknown_target=None):
return positive, negative
def list_texts(path):
def ___list_texts(path):
authors = {}
for file in os.listdir(path):
if file.startswith('EpistolaXIII_'): continue
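The hard-coded 'EpistolaXIII_' filter becomes a train_skip_prefix parameter, so callers such as the Arrigo VII script can exclude their own target files from training. A simplified sketch of the filtering loop inside load_texts (the helper name is hypothetical):

import os
from os.path import join

def iter_training_files(path, train_skip_prefix='EpistolaXIII_'):
    # Yield (author, text) for each .txt file in `path`, skipping the
    # unknown/target documents identified by the prefix.
    for file in os.listdir(path):
        if file.startswith(train_skip_prefix):
            continue
        author = file.replace('.txt', '').split('_')[0]
        with open(join(path, file), encoding='utf8') as f:
            yield author, f.read()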

src/data/features.py Normal file → Executable file

@@ -26,7 +26,7 @@ latin_conjugations = ['o', 'eo', 'io', 'as', 'es', 'is', 'at', 'et', 'it', 'amus
'abis', 'ebis', 'ies', 'abit', 'ebit', 'iet', 'abimus', 'ebimus', 'emus', 'iemus', 'abitis',
'ebitis', 'ietis', 'abunt', 'ebunt', 'ient', 'abor', 'ebor', 'ar', 'iar', 'aberis', 'eberis',
'ieris', 'abitur', 'ebitur', 'ietur', 'abimur', 'ebimur', 'iemur', 'abimini', 'ebimini', 'iemini',
'abuntur', 'ebuntur', 'ientur', 'i', 'isti', 'it', 'imus', 'istis', 'erunt', 'em', 'eam', 'eas',
'abuntur', 'ebuntur', 'ientur', 'i', 'isti', 'it', 'istis', 'erunt', 'em', 'eam', 'eas',
'ias', 'eat', 'iat', 'eamus', 'iamus', 'eatis', 'iatis', 'eant', 'iant', 'er', 'ear', 'earis',
'iaris', 'eatur', 'iatur', 'eamur', 'iamur', 'eamini', 'iamini', 'eantur', 'iantur', 'rem', 'res',
'ret', 'remus', 'retis', 'rent', 'rer', 'reris', 'retur', 'remur', 'remini', 'rentur', 'erim',
@@ -34,7 +34,7 @@ latin_conjugations = ['o', 'eo', 'io', 'as', 'es', 'is', 'at', 'et', 'it', 'amus
'ere', 'ire', 'ato', 'eto', 'ito', 'atote', 'etote', 'itote', 'anto', 'ento', 'unto', 'iunto',
'ator', 'etor', 'itor', 'aminor', 'eminor', 'iminor', 'antor', 'entor', 'untor', 'iuntor', 'ari',
'eri', 'iri', 'andi', 'ando', 'andum', 'andus', 'ande', 'ans', 'antis', 'anti', 'antem', 'antes',
'antium', 'antibus', 'antia', 'esse', 'sum', 'es', 'est', 'sumus', 'estis', 'sunt', 'eram', 'eras',
'antium', 'antibus', 'antia', 'esse', 'sum', 'est', 'sumus', 'estis', 'sunt', 'eram', 'eras',
'erat', 'eramus', 'eratis', 'erant', 'ero', 'eris', 'erit', 'erimus', 'eritis', 'erint', 'sim',
'sis', 'sit', 'simus', 'sitis', 'sint', 'essem', 'esses', 'esset', 'essemus', 'essetis', 'essent',
'fui', 'fuisti', 'fuit', 'fuimus', 'fuistis', 'fuerunt', 'este', 'esto', 'estote', 'sunto']
@@ -139,7 +139,9 @@ def _features_function_words_freq(documents, lang):
funct_words_freq = [1000. * freqs[function_word] / nwords for function_word in function_words]
features.append(funct_words_freq)
return np.array(features)
f_names = [f'funcw::{f}' for f in function_words]
return np.array(features), f_names
def _features_conjugations_freq(documents, lang):
@@ -156,7 +158,9 @@ def _features_conjugations_freq(documents, lang):
conjugation_freq = [1000. * freqs[conjugation] / nwords for conjugation in conjugations]
features.append(conjugation_freq)
return np.array(features)
f_names = [f'conj::{f}' for f in conjugations]
return np.array(features), f_names
def _features_Mendenhall(documents, upto=23):
@@ -176,7 +180,10 @@ def _features_Mendenhall(documents, upto=23):
for i in range(1, upto):
tokens_count.append(1000.*(sum(j>= i for j in tokens_len))/nwords)
features.append(tokens_count)
return np.array(features)
f_names = [f'mendenhall::{c}' for c in range(1,upto)]
return np.array(features), f_names
def _features_sentenceLengths(documents, downto=3, upto=70):
@@ -200,9 +207,10 @@ def _features_sentenceLengths(documents, downto=3, upto=70):
for i in range(downto, upto):
sent_count.append(1000.*(sum(j>= i for j in sent_len))/nsent)
features.append(sent_count)
return np.array(features)
f_names = [f'sentlength::{c}' for c in range(downto, upto)]
return np.array(features), f_names
def _features_tfidf(documents, tfidf_vectorizer=None, min_df = 1, ngrams=(1,1)):
@@ -306,6 +314,7 @@ class FeatureExtractor:
self.normalize_features=normalize_features
self.window_size = window_size
self.verbose = verbose
self.feature_names = None
def fit_transform(self, positives, negatives):
@@ -313,6 +322,7 @@ class FeatureExtractor:
authors = [1]*len(positives) + [0]*len(negatives)
n_original_docs = len(documents)
groups = list(range(n_original_docs))
self.feature_names = []
if self.split_documents:
doc_fragments, authors_fragments, groups_fragments = splitter(documents, authors,
@@ -332,47 +342,71 @@ class FeatureExtractor:
# dense feature extraction functions
if self.function_words_freq:
X = self._addfeatures(X, _features_function_words_freq(documents, self.function_words_freq))
F, f_names = _features_function_words_freq(documents, self.function_words_freq)
X = self._addfeatures(X, F)
self.feature_names.extend(f_names)
self._print('adding function words features: {} features'.format(X.shape[1]))
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
if self.conjugations_freq:
X = self._addfeatures(X, _features_conjugations_freq(documents, self.conjugations_freq))
F, f_names = _features_conjugations_freq(documents, self.conjugations_freq)
X = self._addfeatures(X, F)
self.feature_names.extend(f_names)
self._print('adding conjugation features: {} features'.format(X.shape[1]))
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
if self.features_Mendenhall:
X = self._addfeatures(X, _features_Mendenhall(documents))
F, f_names = _features_Mendenhall(documents)
X = self._addfeatures(X, F)
self.feature_names.extend(f_names)
self._print('adding Mendenhall words features: {} features'.format(X.shape[1]))
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
if self.features_sentenceLengths:
X = self._addfeatures(X, _features_sentenceLengths(documents))
self._print('adding sentence lengths features: {} features'.format(X.shape[1]))
F, f_names = _features_sentenceLengths(documents)
X = self._addfeatures(X, F)
self.feature_names.extend(f_names)
self._print('adding sentence lengths features: {} features'.format(X.shape[1]))
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
# sparse feature extraction functions
if self.tfidf:
X_features, vectorizer = _features_tfidf(documents, ngrams=self.wordngrams)
self.tfidf_vectorizer = vectorizer
index2word = {i: w for w, i in vectorizer.vocabulary_.items()}
f_names = [f'tfidf::{index2word[i]}' for i in range(len(index2word))]
if self.tfidf_feat_selection_ratio < 1.:
if self.verbose: print('feature selection')
X_features, feat_sel = _feature_selection(X_features, y, self.tfidf_feat_selection_ratio)
self.feat_sel_tfidf = feat_sel
f_names = [f_names[i] for i in feat_sel.get_support(indices=True)]
X = self._addfeatures(_tocsr(X), X_features)
self.feature_names.extend(f_names)
self._print('adding tfidf words features: {} features'.format(X.shape[1]))
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
if self.ngrams:
X_features, vectorizer = _features_ngrams(documents, self.ns,
preserve_punctuation=self.preserve_punctuation)
self.ngrams_vectorizer = vectorizer
index2word = {i: w for w, i in vectorizer.vocabulary_.items()}
f_names = [f'ngram::{index2word[i]}' for i in range(len(index2word))]
if self.tfidf_feat_selection_ratio < 1.:
if self.verbose: print('feature selection')
X_features, feat_sel = _feature_selection(X_features, y, self.tfidf_feat_selection_ratio)
self.feat_sel_ngrams = feat_sel
f_names = [f_names[i] for i in feat_sel.get_support(indices=True)]
X = self._addfeatures(_tocsr(X), X_features)
self.feature_names.extend(f_names)
self._print('adding ngrams character features: {} features'.format(X.shape[1]))
self.feature_names = np.asarray(self.feature_names)
assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
# print summary
if self.verbose:
print(
@@ -401,19 +435,23 @@ class FeatureExtractor:
# dense feature extraction functions
if self.function_words_freq:
TEST = self._addfeatures(TEST, _features_function_words_freq(test, self.function_words_freq))
F,_=_features_function_words_freq(test, self.function_words_freq)
TEST = self._addfeatures(TEST, F)
self._print('adding function words features: {} features'.format(TEST.shape[1]))
if self.conjugations_freq:
TEST = self._addfeatures(TEST, _features_conjugations_freq(test, self.conjugations_freq))
F,_=_features_conjugations_freq(test, self.conjugations_freq)
TEST = self._addfeatures(TEST, F)
self._print('adding conjugation features: {} features'.format(TEST.shape[1]))
if self.features_Mendenhall:
TEST = self._addfeatures(TEST, _features_Mendenhall(test))
F,_ = _features_Mendenhall(test)
TEST = self._addfeatures(TEST, F)
self._print('adding Mendenhall words features: {} features'.format(TEST.shape[1]))
if self.features_sentenceLengths:
TEST = self._addfeatures(TEST, _features_sentenceLengths(test))
F, _ = _features_sentenceLengths(test)
TEST = self._addfeatures(TEST, F)
self._print('adding sentence lengths features: {} features'.format(TEST.shape[1]))
# sparse feature extraction functions
@@ -454,8 +492,6 @@ class FeatureExtractor:
def _addfeatures(self, X, F):
# plt.matshow(F[:25])
# plt.show()
if self.normalize_features:
normalize(F, axis=1, copy=False)
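Each _features_* helper now returns a (matrix, names) pair, and fit_transform grows self.feature_names in lockstep with the feature columns, asserting the two stay aligned after every block. A toy version of that bookkeeping (the add_block helper is illustrative, not the repository's API):

from scipy.sparse import csr_matrix, hstack

def add_block(X, names, F, f_names):
    # Append a feature block and its column names, keeping both aligned.
    X = F if X is None else hstack([csr_matrix(X), csr_matrix(F)]).tocsr()
    names.extend(f_names)
    assert X.shape[1] == len(names), \
        f'wrong number of feature names, expected {X.shape[1]} found {len(names)}'
    return X, names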

src/data/pan2015.py Normal file → Executable file

src/model.py Normal file → Executable file

@@ -47,26 +47,26 @@ class AuthorshipVerificator:
if estimator is SVC:
self.params['kernel'] = ['linear', 'rbf']
self.probability = True
self.svm = estimator(probability=self.probability)
self.classifier = estimator(probability=self.probability)
elif estimator is LinearSVC:
self.probability = False
self.svm = estimator()
self.classifier = estimator()
elif estimator is LogisticRegression:
self.probability = True
self.svm = LogisticRegression()
self.classifier = LogisticRegression()
def fit(self,X,y,groups=None):
if not isinstance(y,np.ndarray): y=np.array(y)
positive_examples = y.sum()
if positive_examples >= self.nfolds:
print('optimizing {}'.format(self.svm.__class__.__name__))
print('optimizing {}'.format(self.classifier.__class__.__name__))
# if groups is None or len(np.unique(groups[y==1])):
folds = list(StratifiedKFold(n_splits=self.nfolds).split(X, y))
# folds = list(GroupKFold(n_splits=self.nfolds).split(X,y,groups))
self.estimator = GridSearchCV(self.svm, param_grid=self.params, cv=folds, scoring=make_scorer(f1), n_jobs=-1)
self.estimator = GridSearchCV(self.classifier, param_grid=self.params, cv=folds, scoring=make_scorer(f1), n_jobs=-1)
else:
self.estimator = self.svm
self.estimator = self.classifier
self.estimator.fit(X, y)
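The self.svm to self.classifier rename acknowledges that the wrapped estimator need not be an SVM; when there are at least nfolds positive examples, fitting goes through a grid search over stratified folds scored by F1. Roughly (a sketch with an assumed parameter grid; sklearn's f1_score stands in for the module's own f1 function):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def fit_verificator(X, y, nfolds=10):
    # Grid-search the classifier over stratified folds, selecting by F1.
    classifier = LogisticRegression()
    param_grid = {'C': [0.1, 1, 10]}  # illustrative grid, not the project's
    folds = list(StratifiedKFold(n_splits=nfolds).split(X, y))
    estimator = GridSearchCV(classifier, param_grid=param_grid, cv=folds,
                             scoring=make_scorer(f1_score), n_jobs=-1)
    return estimator.fit(X, y)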

src/pan2015_eval.py Normal file → Executable file

src/util/color_visualization.py Normal file → Executable file

@@ -1,14 +1,17 @@
from matplotlib.cm import get_cmap
def color_tag(text, probability, cmap):
probability = (probability-0.5)*0.75+0.5
def color_tag(index, text, probability, cmap):
probability *= 0.6
# probability = (probability-0.5)*0.75+0.5
r,g,b,_ = cmap(probability)
# reliable = abs(probability-0.5) > 0.25*0.75
# text = '<font color="white">{}</font>'.format(text) if reliable else text
return '<a style="background-color:rgb({:.0f},{:.0f},{:.0f});">{} </a>'.format(r*255,g*255,b*255,text)
return f'<b>&nbsp;P{index}:</b> <a style="background-color:rgb({r*255:.0f},{g*255:.0f},{b*255:.0f});">{text} </a>'
def color(path, texts, probabilities, title):
cmap = get_cmap('RdYlGn')
def color(path, texts, probabilities, title, paragraph_offset=1):
# cmap = get_cmap('RdYlGn')
# cmap = get_cmap('Greens')
cmap = get_cmap('Greys')
with open(path, 'wt') as fo:
fo.write("""
@@ -21,10 +24,8 @@ def color(path, texts, probabilities, title):
<body>
<h1>{}</h1>
""".format(title,title))
fo.write('<p>')
for line,probability in zip(texts,probabilities):
fo.write(color_tag(line,probability,cmap))
fo.write('</p>')
for i,(line,probability) in enumerate(zip(texts,probabilities)):
fo.write(color_tag(paragraph_offset + i, line,probability,cmap))
fo.write("""
</body>
</html>
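color_tag now takes the paragraph index, compresses the probability (probability *= 0.6, keeping the Greys colormap away from full black), and emits the fragment with a P{index} label. The core mapping, as a self-contained sketch of the same idea:

from matplotlib.cm import get_cmap  # as imported in the module

cmap = get_cmap('Greys')

def color_tag(index, text, probability):
    # Compress the probability, then map it through the colormap to an
    # RGB background; prefix the fragment with its paragraph number.
    r, g, b, _ = cmap(probability * 0.6)
    return (f'<b>&nbsp;P{index}:</b> <a style="background-color:'
            f'rgb({r*255:.0f},{g*255:.0f},{b*255:.0f});">{text} </a>')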

src/util/disable_sklearn_warnings.py Normal file → Executable file

src/util/epistole_split.py Normal file → Executable file