Draft: check properties for Tools and Services in the MarketPlace Dataset¶
This notebook checks values in the MarketPlace dataset for Tools and Services.
External libraries and a function to download descriptions from the MarketPlace dataset using the API¶
The following two cells import the external libraries used in this notebook and define a helper function; in the final release of this notebook the function will (possibly) be optimized and provided as an external library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def getMPDescriptions(url, pages):
    # Download every page of the paginated API and concatenate the results.
    # DataFrame.append is deprecated in recent pandas, so the pages are
    # collected in a list and concatenated once at the end.
    frames = []
    for page in range(1, pages + 1):
        turl = url + str(page) + "&perpage=20"
        frames.append(pd.read_json(turl, orient='columns'))
    return pd.concat(frames, ignore_index=True)
Get the descriptions of Tools and Services¶
The MarketPlace API is used to download the descriptions of Tools and Services.
df_tool_all = getMPDescriptions("https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools-services?page=", 81)
df_tool_all.index
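The number of pages (81) is hardcoded above. A more robust variant, sketched below, could derive the page count from the first API response; this assumes the payload exposes a pages field alongside tools, which is an assumption not verified here.
# Hypothetical sketch: derive the page count from the first response instead of hardcoding it
first_page = pd.read_json("https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools-services?page=1&perpage=20", orient='columns')
n_pages = int(first_page['pages'].iloc[0])  # assumption: the API returns a 'pages' field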
A quick look at the data¶
df_tool_flat = pd.json_normalize(df_tool_all['tools'])
#df_tool_work=df_tool_flat[['id', 'category', 'label', 'licenses', 'contributors', 'accessibleAt', 'sourceItemId']]
df_tool_flat.count()
df_tool_flat_opt = pd.json_normalize(df_tool_all['tools'])
df_tool_flat_opt = df_tool_flat_opt.replace('No description provided.', np.nan)
# Treat empty lists as missing values so that isnull() counts them
for col in ['licenses', 'externalIds', 'contributors', 'accessibleAt',
            'relatedItems', 'olderVersions', 'newerVersions', 'properties']:
    df_tool_flat_opt[col] = df_tool_flat_opt[col].apply(lambda y: np.nan if len(y) == 0 else y)
print('{:<35}Number of missing values'.format("Property"), end='\n')
df_tool_flat_opt.isnull().sum()
Checking description values¶
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df_tool_flat_opt['description'].str.len(), bins=100)
ax.set_title('Description Length')
ax.set_xlabel('Characters in description')
ax.set_ylabel('Frequency');
print (f"\n There are {df_tool_flat_opt['description'].isna().sum()} Tools and Services with empty descriptions\n")
# Save the Tools and Services with empty descriptions to a CSV file
df_tool_flat_opt_e=df_tool_flat_opt[df_tool_flat_opt['description'].isna()]
df_tool_flat_opt_e[['id', 'label', 'description']].sort_values('label').to_csv(path_or_buf='ts_emptydescription.csv')
Count all Tools and Services whose description is shorter than an old-school tweet (140 characters), arguably the minimum needed to express any meaningful information these days... :)
df_tool_flat_d = df_tool_flat_opt[(df_tool_flat_opt['description'].notnull()) & (df_tool_flat_opt['description'].str.len()<140)]
print (f'\n There are {df_tool_flat_d["description"].count()} Tools and Services where the description has less than 140 characters\n')
The following table shows some Tools and Services with short descriptions. The full list is currently saved to a file; this could change if we decide that this is a significant curation feature.
df_tool_flat_d[['id', 'label', 'description']].sort_values('label').head().style.set_properties(subset=['description'], **{'width': '600px'})
df_tool_flat_d[['id', 'label', 'description']].sort_values('label').to_csv(path_or_buf='ts_shortdescription.csv')
Checking values on Contributors¶
print (f"\n There are {df_tool_flat_opt['contributors'].isna().sum()} Tools and Services with no contributors\n")
df_prop_data_co = pd.json_normalize(data=df_tool_all['tools'], record_path='contributors', meta_prefix='tool_', meta=['label'])
df_prop_data_co.sort_values('tool_label').info()
df_prop_data_co.sort_values('tool_label').head()
df_prop_data_contrib = pd.json_normalize(data=df_tool_all['tools'], record_path='contributors', meta_prefix='tool_', meta=['label'])
df_prop_data_contrib['actor.externalIds'] = df_prop_data_contrib['actor.externalIds'].apply(lambda y: np.nan if len(y)==0 else y)
df_prop_data_contrib['actor.affiliations'] = df_prop_data_contrib['actor.affiliations'].apply(lambda y: np.nan if len(y)==0 else y)
print('{:<15}Number of missing values'.format("Property"), end='\n')
df_prop_data_contrib.isnull().sum()
Check the validity of URLs in the actor.website property using the HTTP Result Status¶
The code below executes an HTTP request for every URL, waits for the response status code, and records it.
Depending on connections and server response times, it may take several minutes to process all URLs.
In the final release of this notebook this code will (possibly) be optimized and provided as an external library.
df_tool_work_urls = df_prop_data_contrib[df_prop_data_contrib['actor.website'].str.len() > 0]
df_urls = df_tool_work_urls['actor.website'].values
http_status_rows = []  # collect results here and build the DataFrame once at the end
import requests
import re
regex = re.compile(
r'^(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
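# Illustrative sanity check of the validation regex (example values, not taken
# from the dataset): it accepts well-formed http(s)/ftp URLs and rejects the rest.
assert re.match(regex, "https://example.org/path")
assert re.match(regex, "http://127.0.0.1:8080")
assert not re.match(regex, "not a url")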
for var in df_urls:
    if var and re.match(regex, var):
        try:
            r = requests.get(var, timeout=8)
            http_status_rows.append({'url': var, 'status': int(r.status_code)})
        except requests.exceptions.Timeout:
            # covers ConnectTimeout and ReadTimeout; must be caught before
            # ConnectionError, from which ConnectTimeout also inherits
            http_status_rows.append({'url': var, 'status': 408})
        except requests.exceptions.ConnectionError:
            http_status_rows.append({'url': var, 'status': 503})
        except requests.exceptions.RequestException:
            http_status_rows.append({'url': var, 'status': 500})
        except TypeError:
            http_status_rows.append({'url': var, 'status': 400})
    else:
        # empty or malformed URL
        http_status_rows.append({'url': var, 'status': 400})
df_tool_work_aa_http_status = pd.DataFrame(http_status_rows, columns=['url', 'status'])
df_tool_work_aa_http_status.head()
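One possible optimization for the final release, sketched here only as an illustration, is to issue the requests concurrently. The check_url helper and the worker count are assumptions, not part of the current notebook.
# Hypothetical concurrent variant mirroring the sequential logic above
from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    if not (url and re.match(regex, url)):
        return {'url': url, 'status': 400}
    try:
        return {'url': url, 'status': int(requests.get(url, timeout=8).status_code)}
    except requests.exceptions.Timeout:
        return {'url': url, 'status': 408}
    except requests.exceptions.ConnectionError:
        return {'url': url, 'status': 503}
    except requests.exceptions.RequestException:
        return {'url': url, 'status': 500}
    except TypeError:
        return {'url': url, 'status': 400}

with ThreadPoolExecutor(max_workers=8) as pool:  # worker count is an arbitrary choice
    df_status_parallel = pd.DataFrame(list(pool.map(check_url, df_urls)))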
df_http_status_sub=df_tool_work_aa_http_status[df_tool_work_aa_http_status['status'] != 1]
df_db_st = df_http_status_sub['status'].value_counts()
print('{:<8}Frequency'.format("Status"))
df_db_st.head(10)
The first column in the table above shows the HTTP status codes obtained when trying to connect to the actor.website URLs; the second column shows the number of URLs returning each status.
Notice that while 404 means the resource was not found, other status codes may indicate temporary problems.
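As a possible curation aid, the statuses could be split into likely-permanent and possibly-temporary failures. The grouping below is a judgment call for illustration, not a standard classification.
# Hypothetical split: 408/5xx are often transient, other 4xx likely permanent
maybe_temporary = df_tool_work_aa_http_status[df_tool_work_aa_http_status['status'].isin([408, 500, 502, 503, 504])]
likely_broken = df_tool_work_aa_http_status[df_tool_work_aa_http_status['status'].between(400, 499) & (df_tool_work_aa_http_status['status'] != 408)]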
The chart below summarizes the results above.
fig, ax = plt.subplots(figsize=(15, 6))
df_db_st.plot(kind='bar', ax=ax)
plt.grid(alpha=0.6)
ax.set_title("Number of Result Codes in actor.website", fontsize=15)
ax.set_xlabel('Result Code', fontsize=14)
ax.set_ylabel('Frequency', fontsize=14)
plt.show()
The list of possibly wrong URLs is saved in a comma-separated values (CSV) file with the following columns: actor.id, tool_label, actor.website, status. The final release of this notebook will save this data in the curation dataset.
df_http_status_err=df_http_status_sub[df_http_status_sub['status'] != 200]
df_list_of_tools_wrongaa=pd.merge(left=df_prop_data_contrib, right=df_http_status_err, left_on='actor.website', right_on='url')
df_list_of_tools_wrongaa.head()
df_list_of_tools_wrongaa[['actor.id', 'tool_label', 'actor.website', 'status']].sort_values('tool_label').to_csv(path_or_buf='ts_wrongcontributorsurls.csv')
Checking values on Properties¶
print (f"\n There are {df_tool_flat_opt['properties'].isna().sum()} Tools and Services with no properties\n")
#TODO: Print/Save the Tools and Services with empty properties
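A minimal sketch for the TODO above, mirroring the empty-description export earlier in the notebook; the output filename is a placeholder.
# Hypothetical sketch for the TODO above; the filename is an assumption
df_tool_flat_opt_p = df_tool_flat_opt[df_tool_flat_opt['properties'].isna()]
df_tool_flat_opt_p[['id', 'label']].sort_values('label').to_csv(path_or_buf='ts_emptyproperties.csv')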
df_prop_data_ts = pd.json_normalize(data=df_tool_all['tools'], record_path='properties', meta_prefix='ts_', meta=['label'])
#df_prop_data_ts.sort_values('ts_label').head(7)
df_prop_data_ts['type.allowedVocabularies'] = df_prop_data_ts['type.allowedVocabularies'].apply(lambda y: np.nan if len(y)==0 else y)
print('{:<25}Number of missing values'.format("Property"), end='\n')
df_prop_data_ts.isnull().sum()
Values in type.code¶
a_df=df_prop_data_ts.drop_duplicates(['type.code','ts_label'])
df_temp_tc_label = a_df['type.code'].value_counts()
print('{:<28}Frequency'.format("Type Code"), end='\n')
df_temp_tc_label.head(39)
df_prop_data_ts[(df_prop_data_ts['type.code']=='terms-of-use')].sort_values('ts_label').tail(7)
df_typecode_ts=df_prop_data_ts[['ts_label', 'type.code']]
df_typecode_ts_flat=df_typecode_ts.groupby('ts_label')['type.code'].apply(set).reset_index(name='typecodes')
df_typecode_ts_flat.head()
from collections import Counter
import itertools

cooccurrences = []
for props in df_typecode_ts_flat['typecodes']:
    for pair in itertools.combinations(props, 2):
        cooccurrences.append(tuple(sorted(pair)))

# Count the frequency of each cooccurring pair.
properties_co_counter = Counter(cooccurrences)
print("Top TypeCodes Cooccurrences by Frequency", '\n')
print('{:<50}{}'.format('Cooccurrence', 'Frequency'))
for k, v in properties_co_counter.most_common(10):
    topics = '[' + k[0] + ', ' + k[1] + ']'
    print(f'{topics:<50}{v}')
property_cooccurrences = list(
    itertools.chain(*[[tuple(sorted(c)) for c in itertools.combinations(d, 2)]
                      for d in df_typecode_ts_flat['typecodes']])
)
# Count the frequency of each cooccurring pair.
property_edge_counter = Counter(property_cooccurrences)
property_cooccurrence_df = pd.DataFrame({
    'prop0': [dcc[0] for dcc in property_edge_counter.keys()],
    'prop1': [dcc[1] for dcc in property_edge_counter.keys()],
    'count': list(property_edge_counter.values()),
}).pivot_table(index='prop0', columns='prop1')['count']
import seaborn as sns
fig, ax = plt.subplots(figsize=(22, 18))
sns.heatmap(property_cooccurrence_df, annot=True, linewidths=0.2, fmt='.0f', ax=ax, cbar=None, cmap='Blues', linecolor='gray')
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
ax.invert_yaxis()
ax.set_xlabel(None)
ax.set_ylabel(None)
#sns.heatmap(df, linewidths=2, linecolor='yellow')
title = 'Cooccurrences of TypeCode in Tools and Services\n'
plt.title(title, loc='left', fontsize=20)
plt.show()
Values in concept.vocabulary.code¶
acvc_df=df_prop_data_ts.drop_duplicates(['concept.vocabulary.code','ts_label'])
df_temp_cvc_label = acvc_df['concept.vocabulary.code'].value_counts()
print('{:<18}Frequency'.format("Vocabulary Code"), end='\n')
df_temp_cvc_label.head(39)
df_temp_cvc_concept = acvc_df['concept.label'].value_counts()
df_temp_cvc_concept.head(39)