yet more cleaning

commit 436550cf91 (parent c2a020e34e)
@@ -1 +0,0 @@
-Illustrissimo domino Henricho, inclyto Romanorum imperatori et semper augusto, C. capitaneus ueronensis deuotione fidelitatis continua semper insistere uotis suis. Cum serena pacis tranquillitas, decora genitrix artium et alumna, multiplicet et dilatet quam plurimum commoda populorum, cura uigili procurare tenetur cuiuslibet principatus intenctio que sonoro laudis preconio desiderat predicari, ut inuiolatus permaneat status pacificus subiectorum. Nam, ut lectio testatur diuina, illud imperium, illud regnum quod diuersis uoluntatibus intercisum in se non continet unionem, desolationem incurrit, nec in illo corpore sospitatis ilaritas perseuerat, cuius partes uel membra passionibus aliquibus singulariter affliguntur. Quippe recenter uobis hoc notifico euenisse, quod quidam iniquitatis alumni, uasa scelerum et putei uitiorum, quorum propositum est clandestinum et nefandum, sub cuius effectus specie imperiale decus corruere moliuntur, quod absit, inter uirum magnificum, dominum P., inclytum principem Achaie, et hominem excelse potentie, dominum G., comitem, quos in istis partibus prefeceratis in presides et rectores, malignis afflatibus seminauerunt de nouo semen et materiam uitiorum, ita quod uterque ipsorum, cum suorum comitiua sequacium, contentionum ardoribus concitatus, ad prouintiam alterius prorumpere iam presunsit multotiens. Itaque fere iam partis cuiuslibet acies concurrissent, conquassatis capitibus plurimorum, nisi forent quorundam magnatum fidelium imperii suadele que ad salutem et robur imperialis diadematis aspirantes, pro uiribus studuerunt extinguere iracundiam iam conceptam, quod nondum tamen efficaciter potuerunt, malignante diabolo, bonorum operum subuersore, propter quod prouincia Lombardorum tota concutitur tremebunda timore, ne causa huius scandali lanietur, quassantibus inimicis propter casum huiusmodi, dum ex hoc cogitant euenire quod iampridem attentius desideratis affectibus cupiuerunt. Studeat igitur imperatoria celsitudo sui maturitate consilii has radices amarissimas et pericula submouere. Nam si membra talia uestri gubernaculi tam excelsi sic inter se iam ceperint debaccari, quin contra se ipsa alia non insurgant non debet fore dubitabile menti uestre.
Binary file not shown.
Binary file not shown.
Binary file not shown.
File diff suppressed because one or more lines are too long
Binary file not shown.
Binary file not shown.
@@ -1,13 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-<meta charset="UTF-8">
-<title>Epistola I</title>
-</head>
-<body>
-<h1>Epistola I</h1>
-<p><a style="background-color:rgb(171,219,109);">Magnifico atque uictorioso domino, domino Cani Grandi de la Scala, sacratissimi cesarei principatus in urbe Uerona et ciuitate Uicentie uicario generali, deuotissimus suus Dantes Alagherii, Florentinus natione non moribus, uitam orat per tempora diuturna felicem, et gloriosi nominis perpetuum incrementum. Inclita uestre magnificentie laus, quam fama uigil uolitando disseminat, sic distrahit in diuersa diuersos, ut hos in spem sue prosperitatis attollat, hos exterminii deiciat in terrorem. Huius quidem preconium, facta modernorum exsuperans, tanquam ueri existentia latius, arbitrabar aliquando superfluum. </a><a style="background-color:rgb(250,150,86);">Uerum, ne diuturna me nimis incertitudo suspenderet, uelut Austri regina Ierusalem petiit, uelut Pallas petiit Elicona, Ueronam petii fidis oculis discussurus audita, ibique magnalia uestra uidi, uidi beneficia simul et tetigi; et quemadmodum prius dictorum ex parte suspicabar excessum, sic posterius ipsa facta excessiua cognoui. Quo factum est ut ex auditu solo cum quadam animi subiectione beniuolus prius exstiterim; sed ex uisu postmodum deuotissimus et amicus. Nec reor amici nomen assumens, ut nonnulli forsitan obiectarent, reatum presumptionis incurrere, cum non minus dispares connectantur quam pares amicitie sacramento. </a><a style="background-color:rgb(253,173,96);">Nam si delectabiles et utiles amicitias inspicere libeat, illis persepius inspicienti patebit, preheminentes inferioribus coniugari personas. Et si ad ueram ac per se amicitiam torqueatur intuitus, nonne illustrium summorumque principum plerunque uiros fortuna obscuros, honestate preclaros, amicos fuisse constabit? Quidni, cum etiam Dei et hominis amicitia nequaquam impediatur excessu? </a><a style="background-color:rgb(255,253,188);">Quod si cuiquam, quod asseritur, nunc uideretur indignum, Spiritum Sanctum audiat, amicitie sue participes quosdam homines profitentem. 
-Nam in Sapientia de sapientia legitur, quoniam . Sed habet imperitia uulgi sine discretione iudicium; et quemadmodum solem pedalis magnitudinis arbitratur, sic et circa mores uana credulitate decipitur. Nos autem, quibus optimum quod est in nobis noscere datum est, gregum uestigia sectari non decet, quin ymo suis erroribus obuiare tenemur. </a><a style="background-color:rgb(242,104,65);">Nam intellectu ac ratione degentes, diuina quadam libertate dotati, nullis consuetudinibus astringuntur; nec mirum, cum non ipsi legibus, sed ipsis leges potius dirigantur. Liquet igitur, quod superius dixi, me scilicet esse deuotissimum et amicum, nullatenus esse presumptum. Preferens ergo amicitiam uestram quasi thesaurum carissimum, prouidentia diligenti et accurata solicitudine illam seruare desidero. </a><a style="background-color:rgb(249,145,83);">Itaque, cum in dogmatibus moralis negotii amicitiam adequari et saluari analogo doceatur, ad retribuendum pro collatis beneficiis plus quam semel analogiam sequi michi uotiuum est; et propter hoc munuscula mea sepe multum conspexi et ab inuicem segregaui, nec non segregata percensui, dignius gratiusque uobis inquirens. Neque ipsi preheminentie uestre congruum comperi magis quam Comedie sublimem canticam, que decoratur titulo Paradisi; et illam sub presenti epistola, tanquam sub epigrammate proprio dedicatam, uobis ascribo, uobis offero, uobis denique recommendo. Illud quoque preterire silentio simpliciter inardescens non sinit affectus, quod in hac donatione plus dono quam domino et honoris et fame conferri potest uideri. Quidni cum eius titulum iam presagiam de gloria uestri nominis ampliandum? </a><a style="background-color:rgb(244,112,68);">Satis actenus uidebar expressisse quod de proposito fuit; sed zelus gratie uestre, quam sitio quasi uitam paruipendens, a primordio metam prefixam urget ulterius. Itaque, formula consumata epistole, ad introductionem oblati operis aliquid sub lectoris officio compendiose aggrediar. 
-</a></p>
-</body>
-</html>
File diff suppressed because one or more lines are too long
@@ -104,7 +104,7 @@ for epistola in [1]:
 'RyccardusDeSanctoGermano', 'ZonoDeMagnalis']
 paragraphs = range(14, 91)
 assert len(authors)==20, f'unexpected number of authors ({len(authors)})'
-path+='_tutti'
+

 discarded = 0
 f1_scores = []
@@ -98,7 +98,6 @@ for epistola in [1]:
 authors = ['Dante'] + authors1
 elif epistola == 2:
 authors = ['Dante'] + authors2
-path += '_tutti'
 else:
 authors = ['Dante'] + authors3

@@ -13,7 +13,7 @@ from util.color_visualization import color
 # TODO: sentence length (Mendenhall-style) ?

-for epistola in [2]:
+for epistola in [1]:
 if epistola==1:
 authors = ['Dante','ClaraAssisiensis', 'GiovanniBoccaccio', 'GuidoFaba','PierDellaVigna']
 else:
@@ -50,7 +50,8 @@ for epistola in [2]:
 features_sentenceLengths=True,
 tfidf_feat_selection_ratio=0.1,
 wordngrams=True, n_wordngrams=(1, 2),
-charngrams=True, n_charngrams=(3, 4, 5), preserve_punctuation=False,
+charngrams=True, n_charngrams=(3, 4, 5),
+preserve_punctuation=False,
 split_documents=True, split_policy=split_by_sentences, window_size=3,
 normalize_features=True)
@@ -4,6 +4,8 @@ from data.features import *
 from model import AuthorshipVerificator, f1_from_counters
 from sklearn.svm import LinearSVC, SVC
 from util.color_visualization import color
+import pickle
+import os

 # DONE: ngrams should contain punctuation marks according to Sapkota et al. [39] in the PAN 2015 overview
 # (More recently, it was shown that character
@@ -13,49 +15,69 @@ from util.color_visualization import color
 # TODO: sentence length (Mendenhall-style) ?

-for epistola in [2]:
+for epistola in [1]:

 print('Epistola {}'.format(epistola))
 print('='*80)
 path = '../testi_{}'.format(epistola)
+if epistola==1:
+paragraphs = range(1, 14)
 if epistola==2:
-path+='_interaEpistola'
+paragraphs = range(14, 91)

-positive, negative, ep_text = load_texts(path, positive_author='Dante', unknown_target='EpistolaXIII_{}.txt'.format(epistola))
-n_full_docs = len(positive) + len(negative)
+target = [f'EpistolaXIII_{epistola}_{paragraph}.txt' for paragraph in paragraphs]
+positive, negative, ep_texts = load_texts(path, positive_author='Dante', unknown_target=target)

-feature_extractor = FeatureExtractor(function_words_freq='latin',
-conjugations_freq='latin',
-features_Mendenhall=True,
-features_sentenceLengths=True,
-tfidf_feat_selection_ratio=0.1,
-wordngrams=True, n_wordngrams=(1, 2),
-charngrams=True, n_charngrams=(3, 4, 5), preserve_punctuation=False,
-split_documents=True, split_policy=split_by_sentences, window_size=3,
-normalize_features=True)
-
-Xtr,ytr,groups = feature_extractor.fit_transform(positive, negative)
-print(ytr)
-
-ep, ep_fragments = feature_extractor.transform(ep_text, return_fragments=True, window_size=3)
+pickle_file = f'../dante_color/epistola{epistola}.pkl'
+if os.path.exists(pickle_file):
+print(f'loading pickle file {pickle_file}')
+probabilities = pickle.load(open(pickle_file, 'rb'))
+else:
+print(f'generating pickle file')
+n_full_docs = len(positive) + len(negative)
+
+feature_extractor = FeatureExtractor(function_words_freq='latin',
+conjugations_freq='latin',
+features_Mendenhall=True,
+features_sentenceLengths=True,
+tfidf_feat_selection_ratio=0.1,
+wordngrams=True, n_wordngrams=(1, 2),
+charngrams=True, n_charngrams=(3, 4, 5),
+preserve_punctuation=False,
+split_documents=True, split_policy=split_by_sentences, window_size=3,
+normalize_features=True)
+
+Xtr,ytr,groups = feature_extractor.fit_transform(positive, negative)
+print(ytr)

 print('Fitting the Verificator')
-av = AuthorshipVerificator(nfolds=10, estimator=LogisticRegression)
+av = AuthorshipVerificator(nfolds=10, estimator=LogisticRegression, author_name='Dante')
 av.fit(Xtr,ytr,groups)

-print('Predicting the Epistola {}'.format(epistola))
-title = 'Epistola {}'.format('I' if epistola==1 else 'II')
-av.predict(ep, title)
-fulldoc_prob, fragment_probs = av.predict_proba(ep, title)
-color(path='../dante_color/epistola{}.html'.format(epistola), texts=ep_fragments, probabilities=fragment_probs, title=title)
+probabilities = []
+for i, target_text in enumerate(ep_texts):
+ep = feature_extractor.transform(target_text, avoid_splitting=True)
+prob, _ = av.predict_proba(ep, epistola_name=target[i])
+probabilities.append(prob)
+
+pickle.dump(probabilities, open(pickle_file, 'wb'), pickle.HIGHEST_PROTOCOL)
+
+color(path=f'../dante_color/epistola{epistola}.html', texts=ep_texts, probabilities=probabilities, title=f'Epistola {("I" if epistola==1 else "II")}', paragraph_offset=paragraphs[0])
+
+# print('Predicting the Epistola {}'.format(epistola))
+# title = 'Epistola {}'.format('I' if epistola==1 else 'II')
+# av.predict(ep, title)
+# fulldoc_prob, fragment_probs = av.predict_proba(ep, title)
+# color(path='../dante_color/epistola{}.html'.format(epistola), texts=ep_fragments, probabilities=fragment_probs, title=title)

 # score_ave, score_std = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=False)
 # print('LOO[full-and-fragments]={:.3f} +-{:.5f}'.format(score_ave, score_std))

-score_ave, score_std, tp, fp, fn, tn = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=True, counters=True)
+# score_ave, score_std, tp, fp, fn, tn = av.leave_one_out(Xtr, ytr, groups, test_lowest_index_only=True, counters=True)
 # print('LOO[full-docs]={:.3f} +-{:.5f}'.format(score_ave, score_std))
-f1_ = f1_from_counters(tp, fp, fn, tn)
-print('F1 = {:.3f}'.format(f1_))
+# f1_ = f1_from_counters(tp, fp, fn, tn)
+# print('F1 = {:.3f}'.format(f1_))

 # score_ave, score_std = av.leave_one_out(Xtr, ytr, None)
 # print('LOO[w/o groups]={:.3f} +-{:.5f}'.format(score_ave, score_std))
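The rewritten driver above caches the per-paragraph probabilities in a pickle file so the expensive train-and-predict pipeline only runs once per epistola. The load-or-compute pattern in isolation (file name and values are illustrative, not from the repository):

```python
import os
import pickle
import tempfile

def cached(pickle_file, compute):
    """Return the pickled result if it exists; otherwise compute, store, and return it."""
    if os.path.exists(pickle_file):
        with open(pickle_file, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(pickle_file, 'wb') as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result

path = os.path.join(tempfile.mkdtemp(), 'epistola_demo.pkl')
first = cached(path, lambda: [0.9, 0.2, 0.7])  # computed and stored
second = cached(path, lambda: [0.0])           # loaded from disk; the lambda is never called
```

On a second run the `compute` callable is skipped entirely, which is why the commit can move the whole model-fitting block under the `else:` branch.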
@@ -13,8 +13,6 @@ for epistola in [1,2,3]: #3 means "both Ep1 and Ep2 corpora"
 print('='*80)
 path = '../testiXIV_{}'.format(epistola)
 paragraphs = range(1, 6)
-if epistola==2:
-path+='_tutti'

 target = [f'Epistola_ArigoVII.txt'] + [f'Epistola_ArigoVII_{paragraph}.txt' for paragraph in paragraphs]
 positive, negative, ep_texts = load_texts(path, positive_author='Dante', unknown_target=target, train_skip_prefix='Epistola_ArigoVII')
@@ -28,13 +28,13 @@ def remove_citations(doc):
 doc = remove_pattern(doc, start_symbol='{', end_symbol='}', counter=counter)
 return doc

-def load_texts(path, positive_author='Dante', unknown_target=None):
+def load_texts(path, positive_author='Dante', unknown_target=None, train_skip_prefix='EpistolaXIII_'):
 # load the training data (all documents but Epistolas 1 and 2)
 positive,negative = [],[]
 authors = []
 ndocs=0
 for file in os.listdir(path):
-if file.startswith('EpistolaXIII_'): continue
+if file.startswith(train_skip_prefix): continue
 file_clean = file.replace('.txt','')
 author, textname = file_clean.split('_')[0],file_clean.split('_')[1]
 text = open(join(path,file), encoding= "utf8").read()
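`load_texts` now takes the skip prefix as a parameter, and it derives author and text name from the `<Author>_<TextName>.txt` file-name convention used by the corpus. That parsing, reduced to a standalone sketch (the file names below are hypothetical examples of the convention, not files from the repository):

```python
def parse_author(filename, skip_prefix='EpistolaXIII_'):
    """Split '<Author>_<TextName>.txt' into (author, textname);
    return None for files matching the training skip prefix."""
    if filename.startswith(skip_prefix):
        return None
    file_clean = filename.replace('.txt', '')
    return file_clean.split('_')[0], file_clean.split('_')[1]

author, textname = parse_author('GuidoFaba_Epistola3.txt')
skipped = parse_author('EpistolaXIII_1.txt')  # matches the skip prefix
```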
@@ -63,7 +63,7 @@ def load_texts(path, positive_author='Dante', unknown_target=None):
 return positive, negative


-def list_texts(path):
+def ___list_texts(path):
 authors = {}
 for file in os.listdir(path):
 if file.startswith('EpistolaXIII_'): continue
@@ -26,7 +26,7 @@ latin_conjugations = ['o', 'eo', 'io', 'as', 'es', 'is', 'at', 'et', 'it', 'amus
 'abis', 'ebis', 'ies', 'abit', 'ebit', 'iet', 'abimus', 'ebimus', 'emus', 'iemus', 'abitis',
 'ebitis', 'ietis', 'abunt', 'ebunt', 'ient', 'abor', 'ebor', 'ar', 'iar', 'aberis', 'eberis',
 'ieris', 'abitur', 'ebitur', 'ietur', 'abimur', 'ebimur', 'iemur', 'abimini', 'ebimini', 'iemini',
-'abuntur', 'ebuntur', 'ientur', 'i', 'isti', 'it', 'imus', 'istis', 'erunt', 'em', 'eam', 'eas',
+'abuntur', 'ebuntur', 'ientur', 'i', 'isti', 'it', 'istis', 'erunt', 'em', 'eam', 'eas',
 'ias', 'eat', 'iat', 'eamus', 'iamus', 'eatis', 'iatis', 'eant', 'iant', 'er', 'ear', 'earis',
 'iaris', 'eatur', 'iatur', 'eamur', 'iamur', 'eamini', 'iamini', 'eantur', 'iantur', 'rem', 'res',
 'ret', 'remus', 'retis', 'rent', 'rer', 'reris', 'retur', 'remur', 'remini', 'rentur', 'erim',
@@ -34,7 +34,7 @@ latin_conjugations = ['o', 'eo', 'io', 'as', 'es', 'is', 'at', 'et', 'it', 'amus
 'ere', 'ire', 'ato', 'eto', 'ito', 'atote', 'etote', 'itote', 'anto', 'ento', 'unto', 'iunto',
 'ator', 'etor', 'itor', 'aminor', 'eminor', 'iminor', 'antor', 'entor', 'untor', 'iuntor', 'ari',
 'eri', 'iri', 'andi', 'ando', 'andum', 'andus', 'ande', 'ans', 'antis', 'anti', 'antem', 'antes',
-'antium', 'antibus', 'antia', 'esse', 'sum', 'es', 'est', 'sumus', 'estis', 'sunt', 'eram', 'eras',
+'antium', 'antibus', 'antia', 'esse', 'sum', 'est', 'sumus', 'estis', 'sunt', 'eram', 'eras',
 'erat', 'eramus', 'eratis', 'erant', 'ero', 'eris', 'erit', 'erimus', 'eritis', 'erint', 'sim',
 'sis', 'sit', 'simus', 'sitis', 'sint', 'essem', 'esses', 'esset', 'essemus', 'essetis', 'essent',
 'fui', 'fuisti', 'fuit', 'fuimus', 'fuistis', 'fuerunt', 'este', 'esto', 'estote', 'sunto']
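Both `latin_conjugations` hunks delete a suffix (`'imus'`, then `'es'`) that already occurs earlier in the list, i.e. the commit deduplicates the suffix inventory so no conjugation ending is counted twice. A small helper to check that invariant (the slice below is an illustrative excerpt, not the full list):

```python
suffixes = ['o', 'eo', 'io', 'as', 'es', 'is', 'at', 'et', 'it']  # illustrative slice

def duplicates(items):
    """Return the entries that occur more than once, in order of first repeat."""
    seen, dups = set(), []
    for x in items:
        if x in seen and x not in dups:
            dups.append(x)
        seen.add(x)
    return dups

dups = duplicates(suffixes + ['es'])  # re-adding 'es' reintroduces a duplicate
```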
@@ -139,7 +139,9 @@ def _features_function_words_freq(documents, lang):
 funct_words_freq = [1000. * freqs[function_word] / nwords for function_word in function_words]
 features.append(funct_words_freq)

-return np.array(features)
+f_names = [f'funcw::{f}' for f in function_words]
+
+return np.array(features), f_names


 def _features_conjugations_freq(documents, lang):
@@ -156,7 +158,9 @@ def _features_conjugations_freq(documents, lang):
 conjugation_freq = [1000. * freqs[conjugation] / nwords for conjugation in conjugations]
 features.append(conjugation_freq)

-return np.array(features)
+f_names = [f'conj::{f}' for f in conjugations]
+
+return np.array(features), f_names


 def _features_Mendenhall(documents, upto=23):
@@ -176,7 +180,10 @@ def _features_Mendenhall(documents, upto=23):
 for i in range(1, upto):
 tokens_count.append(1000.*(sum(j>= i for j in tokens_len))/nwords)
 features.append(tokens_count)
-return np.array(features)
+
+f_names = [f'mendenhall::{c}' for c in range(1,upto)]
+
+return np.array(features), f_names


 def _features_sentenceLengths(documents, downto=3, upto=70):
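`_features_Mendenhall` now returns one name per column alongside the feature matrix. The statistic itself is the rate, per 1000 tokens, of tokens at least `i` characters long, for `i = 1 .. upto-1`. A standalone sketch of a single document's vector, following the new return-names convention (the example tokens are illustrative):

```python
def mendenhall_features(tokens, upto=23):
    """Per-1000-token rate of tokens whose length is at least i, for i in 1..upto-1."""
    nwords = len(tokens)
    tokens_len = [len(t) for t in tokens]
    feats = [1000. * sum(j >= i for j in tokens_len) / nwords for i in range(1, upto)]
    names = [f'mendenhall::{c}' for c in range(1, upto)]
    return feats, names

# lengths: 2, 9, 4, 6
feats, names = mendenhall_features(['in', 'principio', 'erat', 'uerbum'], upto=5)
```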
@@ -200,9 +207,10 @@ def _features_sentenceLengths(documents, downto=3, upto=70):
 for i in range(downto, upto):
 sent_count.append(1000.*(sum(j>= i for j in sent_len))/nsent)
 features.append(sent_count)
-return np.array(features)
+
+f_names = [f'sentlength::{c}' for c in range(downto, upto)]
+
+return np.array(features), f_names


 def _features_tfidf(documents, tfidf_vectorizer=None, min_df = 1, ngrams=(1,1)):
@@ -306,6 +314,7 @@ class FeatureExtractor:
 self.normalize_features=normalize_features
 self.window_size = window_size
 self.verbose = verbose
+self.feature_names = None


 def fit_transform(self, positives, negatives):
@@ -313,6 +322,7 @@ class FeatureExtractor:
 authors = [1]*len(positives) + [0]*len(negatives)
 n_original_docs = len(documents)
 groups = list(range(n_original_docs))
+self.feature_names = []

 if self.split_documents:
 doc_fragments, authors_fragments, groups_fragments = splitter(documents, authors,
@@ -332,47 +342,71 @@ class FeatureExtractor:

 # dense feature extraction functions
 if self.function_words_freq:
-X = self._addfeatures(X, _features_function_words_freq(documents, self.function_words_freq))
+F, f_names = _features_function_words_freq(documents, self.function_words_freq)
+X = self._addfeatures(X, F)
+self.feature_names.extend(f_names)
 self._print('adding function words features: {} features'.format(X.shape[1]))
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'

 if self.conjugations_freq:
-X = self._addfeatures(X, _features_conjugations_freq(documents, self.conjugations_freq))
+F, f_names = _features_conjugations_freq(documents, self.conjugations_freq)
+X = self._addfeatures(X, F)
+self.feature_names.extend(f_names)
 self._print('adding conjugation features: {} features'.format(X.shape[1]))
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'

 if self.features_Mendenhall:
-X = self._addfeatures(X, _features_Mendenhall(documents))
+F, f_names = _features_Mendenhall(documents)
+X = self._addfeatures(X, F)
+self.feature_names.extend(f_names)
 self._print('adding Mendenhall words features: {} features'.format(X.shape[1]))
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'

 if self.features_sentenceLengths:
-X = self._addfeatures(X, _features_sentenceLengths(documents))
-self._print('adding sentence lengths features: {} features'.format(X.shape[1]))
+F, f_names = _features_sentenceLengths(documents)
+X = self._addfeatures(X, F)
+self.feature_names.extend(f_names)
+self._print('adding sentence lengths features: {} features'.format(X.shape[1]))
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'

 # sparse feature extraction functions
 if self.tfidf:
 X_features, vectorizer = _features_tfidf(documents, ngrams=self.wordngrams)
 self.tfidf_vectorizer = vectorizer
+index2word = {i: w for w, i in vectorizer.vocabulary_.items()}
+f_names = [f'tfidf::{index2word[i]}' for i in range(len(index2word))]

 if self.tfidf_feat_selection_ratio < 1.:
 if self.verbose: print('feature selection')
 X_features, feat_sel = _feature_selection(X_features, y, self.tfidf_feat_selection_ratio)
 self.feat_sel_tfidf = feat_sel
+f_names = [f_names[i] for i in feat_sel.get_support(indices=True)]

 X = self._addfeatures(_tocsr(X), X_features)
+self.feature_names.extend(f_names)
 self._print('adding tfidf words features: {} features'.format(X.shape[1]))
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'

 if self.ngrams:
 X_features, vectorizer = _features_ngrams(documents, self.ns,
 preserve_punctuation=self.preserve_punctuation)
 self.ngrams_vectorizer = vectorizer
+index2word = {i: w for w, i in vectorizer.vocabulary_.items()}
+f_names = [f'ngram::{index2word[i]}' for i in range(len(index2word))]

 if self.tfidf_feat_selection_ratio < 1.:
 if self.verbose: print('feature selection')
 X_features, feat_sel = _feature_selection(X_features, y, self.tfidf_feat_selection_ratio)
 self.feat_sel_ngrams = feat_sel
+f_names = [f_names[i] for i in feat_sel.get_support(indices=True)]

 X = self._addfeatures(_tocsr(X), X_features)
+self.feature_names.extend(f_names)
 self._print('adding ngrams character features: {} features'.format(X.shape[1]))

+self.feature_names = np.asarray(self.feature_names)
+
+assert X.shape[1] == len(self.feature_names), f'wrong number of feature names, expected {X.shape[1]} found {len(self.feature_names)}'
 # print summary
 if self.verbose:
 print(
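The `fit_transform` hunk keeps a `feature_names` list in lockstep with the columns of `X`: names are extended together with each feature block, filtered through the selector's support indices after feature selection, and checked against `X.shape[1]` after every step. That bookkeeping, reduced to plain lists (the names, values, and support indices below are hypothetical):

```python
def select_columns(names, values, support):
    """Keep only the columns whose index is in the selector's support,
    filtering names and values in parallel so they stay aligned."""
    new_names = [names[i] for i in support]
    new_values = [values[i] for i in support]
    assert len(new_names) == len(new_values), 'names out of sync with columns'
    return new_names, new_values

names = ['tfidf::et', 'tfidf::in', 'tfidf::non', 'tfidf::ad']
values = [0.1, 0.0, 0.7, 0.3]
support = [0, 2, 3]  # e.g. what feat_sel.get_support(indices=True) would return
sel_names, sel_values = select_columns(names, values, support)
```

Applying the same index list to both sequences is what makes the per-block `assert X.shape[1] == len(self.feature_names)` checks in the diff hold after selection.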
@@ -401,19 +435,23 @@ class FeatureExtractor:

         # dense feature extraction functions
         if self.function_words_freq:
-            TEST = self._addfeatures(TEST, _features_function_words_freq(test, self.function_words_freq))
+            F, _ = _features_function_words_freq(test, self.function_words_freq)
+            TEST = self._addfeatures(TEST, F)
             self._print('adding function words features: {} features'.format(TEST.shape[1]))

         if self.conjugations_freq:
-            TEST = self._addfeatures(TEST, _features_conjugations_freq(test, self.conjugations_freq))
+            F, _ = _features_conjugations_freq(test, self.conjugations_freq)
+            TEST = self._addfeatures(TEST, F)
             self._print('adding conjugation features: {} features'.format(TEST.shape[1]))

         if self.features_Mendenhall:
-            TEST = self._addfeatures(TEST, _features_Mendenhall(test))
+            F, _ = _features_Mendenhall(test)
+            TEST = self._addfeatures(TEST, F)
             self._print('adding Mendenhall words features: {} features'.format(TEST.shape[1]))

         if self.features_sentenceLengths:
-            TEST = self._addfeatures(TEST, _features_sentenceLengths(test))
+            F, _ = _features_sentenceLengths(test)
+            TEST = self._addfeatures(TEST, F)
             self._print('adding sentence lengths features: {} features'.format(TEST.shape[1]))

         # sparse feature extraction functions
@@ -454,8 +492,6 @@ class FeatureExtractor:


     def _addfeatures(self, X, F):
-        # plt.matshow(F[:25])
-        # plt.show()
         if self.normalize_features:
             normalize(F, axis=1, copy=False)
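`_addfeatures` rescales each row of the incoming block to unit L2 norm in place before stacking it. For instance:

```python
import numpy as np
from sklearn.preprocessing import normalize

F = np.array([[3.0, 4.0],
              [1.0, 0.0]])
normalize(F, axis=1, copy=False)  # rescales each row to unit L2 norm, in place
# first row becomes [0.6, 0.8] (a 3-4-5 triangle); second stays [1.0, 0.0]
```

`copy=False` avoids allocating a second matrix, which matters when the dense blocks are large.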
@@ -47,26 +47,26 @@ class AuthorshipVerificator:
         if estimator is SVC:
             self.params['kernel'] = ['linear', 'rbf']
             self.probability = True
-            self.svm = estimator(probability=self.probability)
+            self.classifier = estimator(probability=self.probability)
         elif estimator is LinearSVC:
             self.probability = False
-            self.svm = estimator()
+            self.classifier = estimator()
         elif estimator is LogisticRegression:
             self.probability = True
-            self.svm = LogisticRegression()
+            self.classifier = LogisticRegression()

     def fit(self,X,y,groups=None):
         if not isinstance(y,np.ndarray): y=np.array(y)
         positive_examples = y.sum()
         if positive_examples >= self.nfolds:
-            print('optimizing {}'.format(self.svm.__class__.__name__))
+            print('optimizing {}'.format(self.classifier.__class__.__name__))
             # if groups is None or len(np.unique(groups[y==1])):
             folds = list(StratifiedKFold(n_splits=self.nfolds).split(X, y))
             # folds = list(GroupKFold(n_splits=self.nfolds).split(X,y,groups))

-            self.estimator = GridSearchCV(self.svm, param_grid=self.params, cv=folds, scoring=make_scorer(f1), n_jobs=-1)
+            self.estimator = GridSearchCV(self.classifier, param_grid=self.params, cv=folds, scoring=make_scorer(f1), n_jobs=-1)
         else:
-            self.estimator = self.svm
+            self.estimator = self.classifier

         self.estimator.fit(X, y)
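Beyond the `svm` → `classifier` rename, the interesting logic in `fit` is the fallback: grid-search with stratified folds only when there are at least `nfolds` positive examples, otherwise fit the base classifier directly. A self-contained sketch of that branch (the function name is mine; the diff's custom `scoring` and `n_jobs=-1` are omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def fit_with_optional_search(classifier, params, X, y, nfolds=3):
    """Grid-search only when there are enough positives to stratify into
    nfolds; otherwise fit the base classifier directly (a sketch of the
    fit() branch shown in the diff, not the repository's exact code)."""
    y = np.asarray(y)
    if y.sum() >= nfolds:
        folds = list(StratifiedKFold(n_splits=nfolds).split(X, y))
        estimator = GridSearchCV(classifier, param_grid=params, cv=folds)
    else:
        estimator = classifier  # too few positives for stratified CV
    estimator.fit(X, y)
    return estimator
```

Stratification would otherwise raise when a fold cannot contain at least one positive, which is a real risk in authorship verification where the positive class (one author) is tiny.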
@@ -1,14 +1,17 @@
 from matplotlib.cm import get_cmap

-def color_tag(text, probability, cmap):
-    probability = (probability-0.5)*0.75+0.5
+def color_tag(index, text, probability, cmap):
+    probability *= 0.6
+    # probability = (probability-0.5)*0.75+0.5
     r,g,b,_ = cmap(probability)
     # reliable = abs(probability-0.5) > 0.25*0.75
     # text = '<font color="white">{}</font>'.format(text) if reliable else text
-    return '<a style="background-color:rgb({:.0f},{:.0f},{:.0f});">{} </a>'.format(r*255,g*255,b*255,text)
+    return f'<b> P{index}:</b> <a style="background-color:rgb({r*255:.0f},{g*255:.0f},{b*255:.0f});">{text} </a>'

-def color(path, texts, probabilities, title):
-    cmap = get_cmap('RdYlGn')
+def color(path, texts, probabilities, title, paragraph_offset=1):
+    # cmap = get_cmap('RdYlGn')
+    # cmap = get_cmap('Greens')
+    cmap = get_cmap('Greys')

     with open(path, 'wt') as fo:
         fo.write("""
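`color_tag` maps a paragraph's probability to a background colour and now prefixes the paragraph number. A dependency-free sketch of the same idea, replacing matplotlib's 'Greys' colormap with a hand-rolled greyscale ramp (so this is deliberately not the repository's exact function):

```python
def color_tag(index, text, probability):
    """Greyscale version of the diff's color_tag: damp the probability,
    then encode it as an rgb background (hand-rolled ramp standing in
    for matplotlib's 'Greys' colormap)."""
    p = probability * 0.6             # same damping as in the diff
    level = round(255 * (1 - p))      # 'Greys' style: higher p -> darker
    return (f'<b> P{index}:</b> '
            f'<a style="background-color:rgb({level},{level},{level});">{text} </a>')
```

The `* 0.6` damping keeps even probability-1 paragraphs light enough that black text stays readable on the grey background.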
@@ -21,10 +24,8 @@ def color(path, texts, probabilities, title):
 <body>
 <h1>{}</h1>
 """.format(title,title))
-    fo.write('<p>')
-    for line,probability in zip(texts,probabilities):
-        fo.write(color_tag(line,probability,cmap))
-    fo.write('</p>')
+    for i,(line,probability) in enumerate(zip(texts,probabilities)):
+        fo.write(color_tag(paragraph_offset + i, line,probability,cmap))
     fo.write("""
 </body>
 </html>