sshoc-skosmapping/TAPoRCheck.ipynb

2020-08-05 16:03:30 +02:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Review of data ingested from TAPoR (draft)\n",
"\n",
"This document checks the TAPoR dataset using the Python library pandas.\n",
"\n",
"Reference to ticket: https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/7\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preamble"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import sys\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from bokeh.io import output_notebook, show\n",
"from bokeh.plotting import figure\n",
"\n",
"from im_tutorials.data import *\n",
"from im_tutorials.utilities import flatten_lists\n",
"from im_tutorials.features.text_preprocessing import *\n",
"from im_tutorials.features.document_vectors import document_vector\n",
"from im_tutorials.features.dim_reduction import WrapTSNE, GaussianMixtureEval\n",
"# for db\n",
"import sqlalchemy as db\n",
"from sqlalchemy import *"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"engine = create_engine(\n",
" \"connection_string\")\n",
"connection = engine.connect()\n",
"metadata = db.MetaData()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query the DB to get TAPoR data\n",
"\n",
"The TAPoR dataset used in this document is the SQL dump published by the Education and Research Archive (ERA) of the University of Alberta:\n",
"\n",
"https://era.library.ualberta.ca/items/f2da0666-f523-44d4-a83c-fa06351a1e94\n",
"\n",
"(creation date: 2020-01-01).\n",
"The table *tool* contains 1504 records, each describing a tool.\n",
"Records have been filtered on the value of the field *tool.is_approved*, which leaves 1363 *approved* records.\n",
"In this document this dataset will be called the **TAPoR dataset**.\n",
"\n",
"*Note that the TAPoR dataset reviewed here is not the same one that was used for the MP ingestion; this document will be updated when that dataset becomes available.*\n"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=1363, step=1)"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools=pd.read_sql_query('SELECT * FROM TaPOR.tools where is_approved=1 order by last_updated', connection)\n",
"df_db_tools.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### An example of a TAPoR item\n",
"Let's take a look at a single entry of the TAPoR dataset (row 500).\n",
"(The database schema of the TAPoR dataset is described here: https://era.library.ualberta.ca/items/f2da0666-f523-44d4-a83c-fa06351a1e94/download/8057eae2-3fae-4afa-bc8e-6dcc2a257b6f.)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 254\n",
"user_id NaN\n",
"name TextQuest\n",
"detail <p>TextQuest is a text analysis program availa...\n",
"url http://www.textquest.de/pages/en/general-infor...\n",
"is_approved 1\n",
"creators_name Social Science Consulting\n",
"creators_email info@textquest.de\n",
"creators_url http://www.textquest.de/\n",
"image_url images/tools/0/254.png\n",
"star_average 0\n",
"is_hidden 0\n",
"last_updated 2013-05-13\n",
"documentation_url http://www.textquest.de/pages/en/analysis-of-t...\n",
"code None\n",
"repository \n",
"language NaN\n",
"nature 0\n",
"created_at 2013-05-13 18:57:27\n",
"updated_at 2017-10-31 14:25:28\n",
"recipes \n",
"Name: 500, dtype: object"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#df_db_tools.dtypes\n",
"df_db_tools.iloc[500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following table shows the first 5 records of the TAPoR dataset, sorted by name."
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>name</th>\n",
" <th>detail</th>\n",
" <th>url</th>\n",
" <th>is_approved</th>\n",
" <th>creators_name</th>\n",
" <th>creators_email</th>\n",
" <th>creators_url</th>\n",
" <th>image_url</th>\n",
" <th>...</th>\n",
" <th>is_hidden</th>\n",
" <th>last_updated</th>\n",
" <th>documentation_url</th>\n",
" <th>code</th>\n",
" <th>repository</th>\n",
" <th>language</th>\n",
" <th>nature</th>\n",
" <th>created_at</th>\n",
" <th>updated_at</th>\n",
" <th>recipes</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>906</th>\n",
" <td>937</td>\n",
" <td>1.0</td>\n",
" <td>140kit</td>\n",
" <td>&lt;p&gt;140kit provides a management layer for twee...</td>\n",
" <td>https://github.com/WebEcologyProject/140kit</td>\n",
" <td>1</td>\n",
" <td>Ian Pearce, Devin Gaffney</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>images/tools/1/937.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2018-10-05</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2015-05-24 00:00:00</td>\n",
" <td>2018-10-05 04:43:34</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>334</th>\n",
" <td>1229</td>\n",
" <td>1.0</td>\n",
" <td>3DVIA Virtools</td>\n",
" <td>&lt;p&gt;A software tool for the creation of 3D inte...</td>\n",
" <td>None</td>\n",
" <td>1</td>\n",
" <td>Dassault Systemes</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2014-12-29 00:00:00</td>\n",
" <td>2014-12-29 00:00:00</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>688</th>\n",
" <td>783</td>\n",
" <td>1.0</td>\n",
" <td>4th Dimension</td>\n",
" <td>4th Dimension is a graphic environment for dev...</td>\n",
" <td>http://www.4d.com/products/4d2004/4dstandarded...</td>\n",
" <td>1</td>\n",
" <td>4D</td>\n",
" <td>None</td>\n",
" <td>http://www.4d.com/</td>\n",
" <td>images/tools/1/783.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2018-09-18</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2015-05-24 00:00:00</td>\n",
" <td>2018-09-18 20:39:31</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1156</th>\n",
" <td>648</td>\n",
" <td>937.0</td>\n",
" <td>80legs</td>\n",
" <td>80legs is a web crawling service. You need to ...</td>\n",
" <td>http://80legs.com/</td>\n",
" <td>1</td>\n",
" <td>80legs</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>images/tools/1/648.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2018-10-30</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2017-10-15 23:04:46</td>\n",
" <td>2018-10-30 16:03:45</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>770</th>\n",
" <td>1454</td>\n",
" <td>1.0</td>\n",
" <td>960 Grid System</td>\n",
" <td>&lt;p&gt;960 Grid System is a CSS template that come...</td>\n",
" <td>https://960.gs/</td>\n",
" <td>1</td>\n",
" <td>Nathan Smith</td>\n",
" <td>None</td>\n",
" <td>http://sonspring.com/</td>\n",
" <td>images/tools/2/1454.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2018-09-27</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>https://github.com/nathansmith/960-Grid-System</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2014-12-29 00:00:00</td>\n",
" <td>2018-09-27 22:29:43</td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" id user_id name \\\n",
"906 937 1.0 140kit \n",
"334 1229 1.0 3DVIA Virtools \n",
"688 783 1.0 4th Dimension \n",
"1156 648 937.0 80legs \n",
"770 1454 1.0 960 Grid System \n",
"\n",
" detail \\\n",
"906 <p>140kit provides a management layer for twee... \n",
"334 <p>A software tool for the creation of 3D inte... \n",
"688 4th Dimension is a graphic environment for dev... \n",
"1156 80legs is a web crawling service. You need to ... \n",
"770 <p>960 Grid System is a CSS template that come... \n",
"\n",
" url is_approved \\\n",
"906 https://github.com/WebEcologyProject/140kit 1 \n",
"334 None 1 \n",
"688 http://www.4d.com/products/4d2004/4dstandarded... 1 \n",
"1156 http://80legs.com/ 1 \n",
"770 https://960.gs/ 1 \n",
"\n",
" creators_name creators_email creators_url \\\n",
"906 Ian Pearce, Devin Gaffney None None \n",
"334 Dassault Systemes None None \n",
"688 4D None http://www.4d.com/ \n",
"1156 80legs \n",
"770 Nathan Smith None http://sonspring.com/ \n",
"\n",
" image_url ... is_hidden last_updated documentation_url \\\n",
"906 images/tools/1/937.png ... 0 2018-10-05 None \n",
"334 None ... 0 None None \n",
"688 images/tools/1/783.png ... 0 2018-09-18 None \n",
"1156 images/tools/1/648.png ... 0 2018-10-30 None \n",
"770 images/tools/2/1454.png ... 0 2018-09-27 None \n",
"\n",
" code repository language nature \\\n",
"906 None None NaN 0 \n",
"334 None None NaN 0 \n",
"688 None None NaN 0 \n",
"1156 None NaN 0 \n",
"770 None https://github.com/nathansmith/960-Grid-System NaN 0 \n",
"\n",
" created_at updated_at recipes \n",
"906 2015-05-24 00:00:00 2018-10-05 04:43:34 \n",
"334 2014-12-29 00:00:00 2014-12-29 00:00:00 \n",
"688 2015-05-24 00:00:00 2018-09-18 20:39:31 \n",
"1156 2017-10-15 23:04:46 2018-10-30 16:03:45 \n",
"770 2014-12-29 00:00:00 2018-09-27 22:29:43 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools.sort_values('name').head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for duplicates in the TAPoR dataset\n",
"Considering the values of 'name' and 'url', the TAPoR dataset appears to contain 4 duplicated descriptions."
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>name</th>\n",
" <th>detail</th>\n",
" <th>url</th>\n",
" <th>is_approved</th>\n",
" <th>creators_name</th>\n",
" <th>creators_email</th>\n",
" <th>creators_url</th>\n",
" <th>image_url</th>\n",
" <th>...</th>\n",
" <th>is_hidden</th>\n",
" <th>last_updated</th>\n",
" <th>documentation_url</th>\n",
" <th>code</th>\n",
" <th>repository</th>\n",
" <th>language</th>\n",
" <th>nature</th>\n",
" <th>created_at</th>\n",
" <th>updated_at</th>\n",
" <th>recipes</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1358</th>\n",
" <td>148</td>\n",
" <td>NaN</td>\n",
" <td>AntConc</td>\n",
" <td>AntConc is free concordance software. It is mu...</td>\n",
" <td>http://www.laurenceanthony.net/software/antconc/</td>\n",
" <td>1</td>\n",
" <td>Laurence Anthony</td>\n",
" <td>anthony@waseda.jp</td>\n",
" <td>http://www.antlab.sci.waseda.ac.jp/index.html</td>\n",
" <td>images/tools/0/148.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2019-08-19</td>\n",
" <td>http://www.laurenceanthony.net/software/antcon...</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2012-07-30 18:25:44</td>\n",
" <td>2019-08-19 00:37:45</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1362</th>\n",
" <td>1565</td>\n",
" <td>1201.0</td>\n",
" <td>SentiStrength</td>\n",
" <td>SentiStrength is a sentiment analysis (opinion...</td>\n",
" <td>http://sentistrength.wlv.ac.uk/</td>\n",
" <td>1</td>\n",
" <td>Mike Thelwall</td>\n",
" <td>m.thelwall@wlv.ac.uk</td>\n",
" <td>http://sentistrength.wlv.ac.uk</td>\n",
" <td>images/tools/3/1565.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2019-09-27</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2019-09-20 05:03:47</td>\n",
" <td>2019-09-27 10:03:35</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>652</th>\n",
" <td>580</td>\n",
" <td>937.0</td>\n",
" <td>Voyant 2.0: Knots</td>\n",
" <td>Voyant Knots is a visualization where a line i...</td>\n",
" <td>http://voyant-tools.org/?view=knots</td>\n",
" <td>1</td>\n",
" <td>Stéfan Sinclair and Geoffrey Rockwell</td>\n",
" <td>stefan.sinclair@mcgill.ca</td>\n",
" <td>http://stefansinclair.name/</td>\n",
" <td>images/tools/1/580.png</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>2016-04-29</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2016-04-29 16:08:28</td>\n",
" <td>2017-10-31 14:26:36</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>653</th>\n",
" <td>581</td>\n",
" <td>937.0</td>\n",
" <td>Voyant 2.0: Knots</td>\n",
" <td>Voyant Knots is a visualization where a line i...</td>\n",
" <td>http://voyant-tools.org/?view=knots</td>\n",
" <td>1</td>\n",
" <td>Stéfan Sinclair and Geoffrey Rockwell</td>\n",
" <td>stefan.sinclair@mcgill.ca</td>\n",
" <td>http://stefansinclair.name/</td>\n",
" <td>images/tools/1/581.png</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2016-04-29</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2016-04-29 16:11:55</td>\n",
" <td>2017-10-31 14:26:36</td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>4 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" id user_id name \\\n",
"1358 148 NaN AntConc \n",
"1362 1565 1201.0 SentiStrength \n",
"652 580 937.0 Voyant 2.0: Knots \n",
"653 581 937.0 Voyant 2.0: Knots \n",
"\n",
" detail \\\n",
"1358 AntConc is free concordance software. It is mu... \n",
"1362 SentiStrength is a sentiment analysis (opinion... \n",
"652 Voyant Knots is a visualization where a line i... \n",
"653 Voyant Knots is a visualization where a line i... \n",
"\n",
" url is_approved \\\n",
"1358 http://www.laurenceanthony.net/software/antconc/ 1 \n",
"1362 http://sentistrength.wlv.ac.uk/ 1 \n",
"652 http://voyant-tools.org/?view=knots 1 \n",
"653 http://voyant-tools.org/?view=knots 1 \n",
"\n",
" creators_name creators_email \\\n",
"1358 Laurence Anthony anthony@waseda.jp \n",
"1362 Mike Thelwall m.thelwall@wlv.ac.uk \n",
"652 Stéfan Sinclair and Geoffrey Rockwell stefan.sinclair@mcgill.ca \n",
"653 Stéfan Sinclair and Geoffrey Rockwell stefan.sinclair@mcgill.ca \n",
"\n",
" creators_url image_url \\\n",
"1358 http://www.antlab.sci.waseda.ac.jp/index.html images/tools/0/148.png \n",
"1362 http://sentistrength.wlv.ac.uk images/tools/3/1565.png \n",
"652 http://stefansinclair.name/ images/tools/1/580.png \n",
"653 http://stefansinclair.name/ images/tools/1/581.png \n",
"\n",
" ... is_hidden last_updated \\\n",
"1358 ... 0 2019-08-19 \n",
"1362 ... 0 2019-09-27 \n",
"652 ... 1 2016-04-29 \n",
"653 ... 0 2016-04-29 \n",
"\n",
" documentation_url code repository \\\n",
"1358 http://www.laurenceanthony.net/software/antcon... None \n",
"1362 None None \n",
"652 None None \n",
"653 None None \n",
"\n",
" language nature created_at updated_at recipes \n",
"1358 NaN 0 2012-07-30 18:25:44 2019-08-19 00:37:45 \n",
"1362 NaN 0 2019-09-20 05:03:47 2019-09-27 10:03:35 \n",
"652 NaN 0 2016-04-29 16:08:28 2017-10-31 14:26:36 \n",
"653 NaN 0 2016-04-29 16:11:55 2017-10-31 14:26:36 \n",
"\n",
"[4 rows x 21 columns]"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"duplicateRowsDF0 = df_db_tools[df_db_tools.duplicated(['name', 'url'])].sort_values('name')\n",
"#print(\"The (possibly) duplicated items in TAPoR dataset:\")\n",
"duplicateRowsDF0.head(15)"
]
},
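{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on reading the table above: by default `DataFrame.duplicated` flags only the second and later occurrences of each ('name', 'url') pair, so the first member of each group is not listed. The following is a minimal sketch on made-up rows (the names and URLs are illustrative, not taken from the TAPoR dump) of how `keep=False` lists every member of each duplicated group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data: hypothetical rows, not real TAPoR records\n",
"toy = pd.DataFrame({\n",
"    'name': ['ToolA', 'ToolA', 'ToolB', 'ToolA'],\n",
"    'url': ['http://a.example', 'http://a.example', 'http://b.example', 'http://a.example'],\n",
"})\n",
"\n",
"# Default keep='first': only the repeats are flagged (rows 1 and 3)\n",
"later_copies = toy[toy.duplicated(['name', 'url'])]\n",
"\n",
"# keep=False: every row of a duplicated group is flagged (rows 0, 1 and 3)\n",
"all_copies = toy[toy.duplicated(['name', 'url'], keep=False)]\n",
"\n",
"print(len(later_copies), len(all_copies))"
]
},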
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get the ingested TAPoR data in the Market Place (using the API)\n",
"\n",
"The SSHOC Market Place API endpoint:\n",
"\n",
" https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools\n",
"\n",
"has been used to extract the TAPoR descriptions imported into the SSHOC Market Place. In the rest of the document this dataset will be called the **MP dataset**."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=1353, step=1)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fetch all 68 result pages (20 tools per page) and concatenate them.\n",
"# pd.concat is used because DataFrame.append was removed in pandas 2.0.\n",
"base_url = 'https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools'\n",
"pages = [pd.read_json(f'{base_url}?page={page}&perpage=20', orient='columns') for page in range(1, 69)]\n",
"df_tool_all = pd.concat(pages, ignore_index=True)\n",
"df_tool_all.index"
]
},
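{
"cell_type": "markdown",
"metadata": {},
"source": [
"The page range above is hard-coded to the 68 pages available when this notebook was run. A minimal sketch of a loop that stops at the first empty page instead is given below. It assumes that each response is a JSON object whose `tools` key holds the list of records (the key used by `df_tool_all['tools']` in this notebook) and that a page past the end returns an empty list; neither assumption has been checked against the API documentation, and the helper name `fetch_all_tools` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"\n",
"BASE_URL = 'https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools'\n",
"\n",
"def fetch_all_tools(base_url=BASE_URL, perpage=20, max_pages=1000):\n",
"    # Hypothetical helper: collect the 'tools' list of every page until one is empty\n",
"    records = []\n",
"    for page in range(1, max_pages + 1):\n",
"        response = requests.get(base_url, params={'page': page, 'perpage': perpage}, timeout=30)\n",
"        response.raise_for_status()\n",
"        tools = response.json().get('tools', [])  # assumed response shape\n",
"        if not tools:\n",
"            break\n",
"        records.extend(tools)\n",
"    # Flatten the nested JSON records into one row per tool\n",
"    return pd.json_normalize(records)\n",
"\n",
"# Example call (requires network access):\n",
"# df_tool_flat_alt = fetch_all_tools()"
]
},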
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 1353 tool descriptions in the MP dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at row 500 of the MP dataset."
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 1388\n",
"category tool\n",
"label InTEXT\n",
"version None\n",
"description InTEXT is a legacy, commercial suite of progra...\n",
"licenses []\n",
"contributors [{'actor': {'id': 956, 'name': 'InTEXT Systems...\n",
"properties [{'id': 14091, 'type': {'code': 'tadirah-metho...\n",
"accessibleAt http://intext.com/\n",
"sourceItemId 247\n",
"relatedItems []\n",
"informationContributors [{'id': 4, 'username': 'System importer', 'dis...\n",
"lastInfoUpdate 2020-06-28T18:25:58+0000\n",
"status ingested\n",
"comments []\n",
"olderVersions []\n",
"newerVersions []\n",
"repository None\n",
"source.id 1\n",
"source.label TAPoR\n",
"source.url http://tapor.ca\n",
"source.urlTemplate http://tapor.ca/tools/{source-item-id}\n",
"source NaN\n",
"Name: 500, dtype: object"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the tool descriptions are nested JSON; flatten them into a dataframe\n",
"df_tool_flat = pd.json_normalize(df_tool_all['tools'])\n",
"df_tool_flat.iloc[500]\n",
"#df_tool_flat.sort_values('label').head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the MP dataset there are 1353 tool descriptions."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=1353, step=1)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tool_flat.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Considering the values of 'label' and 'accessibleAt', the MP dataset appears to contain 9 duplicated descriptions"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>category</th>\n",
" <th>label</th>\n",
" <th>version</th>\n",
" <th>description</th>\n",
" <th>licenses</th>\n",
" <th>contributors</th>\n",
" <th>properties</th>\n",
" <th>accessibleAt</th>\n",
" <th>sourceItemId</th>\n",
" <th>...</th>\n",
" <th>status</th>\n",
" <th>comments</th>\n",
" <th>olderVersions</th>\n",
" <th>newerVersions</th>\n",
" <th>repository</th>\n",
" <th>source.id</th>\n",
" <th>source.label</th>\n",
" <th>source.url</th>\n",
" <th>source.urlTemplate</th>\n",
" <th>source</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>326</th>\n",
" <td>335</td>\n",
" <td>tool</td>\n",
" <td>EVI-LINHD</td>\n",
" <td>None</td>\n",
" <td>EVI-LINHD is a free and open-source cloud plat...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 275, 'name': 'Elena González...</td>\n",
" <td>[{'id': 2702, 'type': {'code': 'thumbnail', 'l...</td>\n",
" <td>http://www.evilinhd.com/</td>\n",
" <td>594</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>532</th>\n",
" <td>776</td>\n",
" <td>tool</td>\n",
" <td>JSAN</td>\n",
" <td>None</td>\n",
" <td>The Integrated JStylo and Anonymouth Package. ...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 493, 'name': '18th Connect',...</td>\n",
" <td>[{'id': 7310, 'type': {'code': 'thumbnail', 'l...</td>\n",
" <td>https://github.com/psal/jstylo</td>\n",
" <td>1559</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>533</th>\n",
" <td>451</td>\n",
" <td>tool</td>\n",
" <td>JSAN</td>\n",
" <td>None</td>\n",
" <td>The Integrated JStylo and Anonymouth Package. ...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 493, 'name': '18th Connect',...</td>\n",
" <td>[{'id': 4037, 'type': {'code': 'keyword', 'lab...</td>\n",
" <td>https://github.com/psal/jstylo</td>\n",
" <td>1557</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>697</th>\n",
" <td>1186</td>\n",
" <td>tool</td>\n",
" <td>NodeXL</td>\n",
" <td>None</td>\n",
" <td>NodeXL is a free, open source tool for generat...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 832, 'name': 'M. Smith, N. M...</td>\n",
" <td>[{'id': 11766, 'type': {'code': 'license-type'...</td>\n",
" <td>http://nodexl.codeplex.com/</td>\n",
" <td>482</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>854</th>\n",
" <td>560</td>\n",
" <td>tool</td>\n",
" <td>Python Tools for Text-Analysis</td>\n",
" <td>None</td>\n",
" <td>This is a set of simple, free tools for analyz...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 424, 'name': 'David L. Hoove...</td>\n",
" <td>[{'id': 5060, 'type': {'code': 'thumbnail', 'l...</td>\n",
" <td>https://wp.nyu.edu/exceltextanalysis/python_to...</td>\n",
" <td>1507</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>947</th>\n",
" <td>1136</td>\n",
" <td>tool</td>\n",
" <td>SentiStrength</td>\n",
" <td>None</td>\n",
" <td>SentiStrength is a tool for sentiment analysis...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 799, 'name': 'Thelwall, M., ...</td>\n",
" <td>[{'id': 11290, 'type': {'code': 'keyword', 'la...</td>\n",
" <td>http://sentistrength.wlv.ac.uk/</td>\n",
" <td>453</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>948</th>\n",
" <td>378</td>\n",
" <td>tool</td>\n",
" <td>SentiStrength</td>\n",
" <td>None</td>\n",
" <td>It is a sentiment analysis program. Automatic ...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 493, 'name': '18th Connect',...</td>\n",
" <td>[{'id': 3210, 'type': {'code': 'thumbnail', 'l...</td>\n",
" <td>http://sentistrength.wlv.ac.uk/</td>\n",
" <td>1564</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1187</th>\n",
" <td>607</td>\n",
" <td>tool</td>\n",
" <td>UCINET</td>\n",
" <td>None</td>\n",
" <td>UCINET is a social media analysis set for soft...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 459, 'name': 'Borgatti, S.P....</td>\n",
" <td>[{'id': 5501, 'type': {'code': 'tadirah-method...</td>\n",
" <td>https://sites.google.com/site/ucinetsoftware/home</td>\n",
" <td>576</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>476</th>\n",
" <td>165</td>\n",
" <td>tool</td>\n",
" <td>igraph</td>\n",
" <td>None</td>\n",
" <td>igraph is an open source collection of network...</td>\n",
" <td>[]</td>\n",
" <td>[{'actor': {'id': 147, 'name': 'Gábor Csárdi, ...</td>\n",
" <td>[{'id': 771, 'type': {'code': 'tadirah-methods...</td>\n",
" <td>http://igraph.org/</td>\n",
" <td>623</td>\n",
" <td>...</td>\n",
" <td>ingested</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>[]</td>\n",
" <td>None</td>\n",
" <td>1.0</td>\n",
" <td>TAPoR</td>\n",
" <td>http://tapor.ca</td>\n",
" <td>http://tapor.ca/tools/{source-item-id}</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9 rows × 23 columns</p>\n",
"</div>"
],
"text/plain": [
" id category label version \\\n",
"326 335 tool EVI-LINHD None \n",
"532 776 tool JSAN None \n",
"533 451 tool JSAN None \n",
"697 1186 tool NodeXL None \n",
"854 560 tool Python Tools for Text-Analysis None \n",
"947 1136 tool SentiStrength None \n",
"948 378 tool SentiStrength None \n",
"1187 607 tool UCINET None \n",
"476 165 tool igraph None \n",
"\n",
" description licenses \\\n",
"326 EVI-LINHD is a free and open-source cloud plat... [] \n",
"532 The Integrated JStylo and Anonymouth Package. ... [] \n",
"533 The Integrated JStylo and Anonymouth Package. ... [] \n",
"697 NodeXL is a free, open source tool for generat... [] \n",
"854 This is a set of simple, free tools for analyz... [] \n",
"947 SentiStrength is a tool for sentiment analysis... [] \n",
"948 It is a sentiment analysis program. Automatic ... [] \n",
"1187 UCINET is a social media analysis set for soft... [] \n",
"476 igraph is an open source collection of network... [] \n",
"\n",
" contributors \\\n",
"326 [{'actor': {'id': 275, 'name': 'Elena González... \n",
"532 [{'actor': {'id': 493, 'name': '18th Connect',... \n",
"533 [{'actor': {'id': 493, 'name': '18th Connect',... \n",
"697 [{'actor': {'id': 832, 'name': 'M. Smith, N. M... \n",
"854 [{'actor': {'id': 424, 'name': 'David L. Hoove... \n",
"947 [{'actor': {'id': 799, 'name': 'Thelwall, M., ... \n",
"948 [{'actor': {'id': 493, 'name': '18th Connect',... \n",
"1187 [{'actor': {'id': 459, 'name': 'Borgatti, S.P.... \n",
"476 [{'actor': {'id': 147, 'name': 'Gábor Csárdi, ... \n",
"\n",
" properties \\\n",
"326 [{'id': 2702, 'type': {'code': 'thumbnail', 'l... \n",
"532 [{'id': 7310, 'type': {'code': 'thumbnail', 'l... \n",
"533 [{'id': 4037, 'type': {'code': 'keyword', 'lab... \n",
"697 [{'id': 11766, 'type': {'code': 'license-type'... \n",
"854 [{'id': 5060, 'type': {'code': 'thumbnail', 'l... \n",
"947 [{'id': 11290, 'type': {'code': 'keyword', 'la... \n",
"948 [{'id': 3210, 'type': {'code': 'thumbnail', 'l... \n",
"1187 [{'id': 5501, 'type': {'code': 'tadirah-method... \n",
"476 [{'id': 771, 'type': {'code': 'tadirah-methods... \n",
"\n",
" accessibleAt sourceItemId ... \\\n",
"326 http://www.evilinhd.com/ 594 ... \n",
"532 https://github.com/psal/jstylo 1559 ... \n",
"533 https://github.com/psal/jstylo 1557 ... \n",
"697 http://nodexl.codeplex.com/ 482 ... \n",
"854 https://wp.nyu.edu/exceltextanalysis/python_to... 1507 ... \n",
"947 http://sentistrength.wlv.ac.uk/ 453 ... \n",
"948 http://sentistrength.wlv.ac.uk/ 1564 ... \n",
"1187 https://sites.google.com/site/ucinetsoftware/home 576 ... \n",
"476 http://igraph.org/ 623 ... \n",
"\n",
" status comments olderVersions newerVersions repository source.id \\\n",
"326 ingested [] [] [] None 1.0 \n",
"532 ingested [] [] [] None 1.0 \n",
"533 ingested [] [] [] None 1.0 \n",
"697 ingested [] [] [] None 1.0 \n",
"854 ingested [] [] [] None 1.0 \n",
"947 ingested [] [] [] None 1.0 \n",
"948 ingested [] [] [] None 1.0 \n",
"1187 ingested [] [] [] None 1.0 \n",
"476 ingested [] [] [] None 1.0 \n",
"\n",
" source.label source.url source.urlTemplate \\\n",
"326 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"532 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"533 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"697 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"854 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"947 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"948 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"1187 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"476 TAPoR http://tapor.ca http://tapor.ca/tools/{source-item-id} \n",
"\n",
" source \n",
"326 NaN \n",
"532 NaN \n",
"533 NaN \n",
"697 NaN \n",
"854 NaN \n",
"947 NaN \n",
"948 NaN \n",
"1187 NaN \n",
"476 NaN \n",
"\n",
"[9 rows x 23 columns]"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_p_d=df_tool_flat[df_tool_flat.duplicated(['label', 'accessibleAt'])].sort_values('label')\n",
"test_p_d"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"#df_tool_flat.dtypes "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 https://github.com/WebEcologyProject/140kit\n",
"1 \n",
"2 http://www.4d.com/products/4d2004/4dstandarded...\n",
"3 http://80legs.com/\n",
"4 https://960.gs/\n",
" ... \n",
"1348 \n",
"1349 https://www.zotero.org/\n",
"1350 http://zotfile.com/\n",
"1351 https://wordpress.org/plugins/zotpress/\n",
"1352 http://www.zubrag.com/tools/html-tags-stripper...\n",
"Name: accessibleAt, Length: 1353, dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# normalise missing and whitespace-only URLs to the empty string, then show the result\n",
"df_tool_flat['accessibleAt'] = df_tool_flat['accessibleAt'].fillna('').replace(r'^\\s*$', '', regex=True)\n",
"df_tool_flat['accessibleAt']\n",
"#df_tool_flat['accessibleAt'].isnull()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"#dataframe for MP properties\n",
"df_prop_data = pd.json_normalize(data=df_tool_all['tools'], record_path='properties', meta=['label'])\n",
"#df_prop_data.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"#dataframe for MP contributors\n",
"df_contr_data = pd.json_normalize(data=df_tool_all['tools'], record_path='contributors', meta=['label'])\n",
"#df_contr_data.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"df_mpdatasets=df_tool_flat.join(df_contr_data.set_index('label'), on='label')\n",
"#df_mpdatasets.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparing TAPoR dataset and MP datasets to find import issues"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"906 https://github.com/WebEcologyProject/140kit\n",
"334 \n",
"688 http://www.4d.com/products/4d2004/4dstandarded...\n",
"1156 http://80legs.com/\n",
"770 https://960.gs/\n",
" ... \n",
"816 http://www.jasondavies.com/wordtree/\n",
"520 http://code.google.com/p/word2vec/\n",
"815 https://code.google.com/p/wordsimilarity/\n",
"702 http://www.tei-c.org/Vault/MembersMeetings/200...\n",
"45 \n",
"Name: url, Length: 1359, dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#create a dataframe with a subset of columns for the TAPoR dataset\n",
"df_tapor_worksub=df_db_tools.sort_values('name')[['name', 'url']].drop_duplicates()\n",
"df_tapor_worksub['url'].replace(np.nan, \"\", inplace=True)\n",
"df_tapor_worksub['url'].replace(r\"\\s+\", np.nan, regex=True)\n",
"#df_tapor_worksub['url'].isnull()\n",
"#df_tapor_worksub.tail(30)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"#create a dataframe with a subset of columns for the MP dataset and change column names to have homogenous formats\n",
"df_mp_taporsub= df_tool_flat[df_tool_flat['source.label'] == 'TAPoR']\n",
"df_mp_worksub=df_mp_taporsub.sort_values('label')[['label','accessibleAt']].drop_duplicates()\n",
"df_mp_worksub=df_mp_worksub.rename(columns={\"label\": \"name\", 'accessibleAt':'url'})\n",
"#df_mp_worksub['url'].isnull()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# define a function that compares dataframes\n",
"def dataframe_difference(df1, df2, which):\n",
" \"\"\"Find rows which are different between two DataFrames.\"\"\"\n",
" comparison_df = df1.merge(df2,\n",
" indicator=True,\n",
" how='outer')\n",
" if which is None:\n",
" diff_df = comparison_df[comparison_df['_merge'] != 'both']\n",
" else:\n",
" diff_df = comparison_df[comparison_df['_merge'] == which]\n",
" diff_df.to_csv('data/diff.csv')\n",
" return diff_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Considering values for 'name' and 'url', there are 1260 tool descriptions in MP dataset that are identical to descriptions in TAPoR dataset"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n",
" ...\n",
" 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342],\n",
" dtype='int64', length=1260)"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_both=dataframe_difference(df_mp_worksub, df_tapor_worksub, 'both')\n",
"df_both.index"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>url</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>140kit</td>\n",
" <td>https://github.com/WebEcologyProject/140kit</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3DVIA Virtools</td>\n",
" <td></td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4th Dimension</td>\n",
" <td>http://www.4d.com/products/4d2004/4dstandarded...</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>80legs</td>\n",
" <td>http://80legs.com/</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>960 Grid System</td>\n",
" <td>https://960.gs/</td>\n",
" <td>both</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name url _merge\n",
"0 140kit https://github.com/WebEcologyProject/140kit both\n",
"1 3DVIA Virtools both\n",
"2 4th Dimension http://www.4d.com/products/4d2004/4dstandarded... both\n",
"3 80legs http://80legs.com/ both\n",
"4 960 Grid System https://960.gs/ both"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_both.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Considering values for 'name' and 'url', there are 83 tool descriptions in MP dataset but not in TAPoR dataset"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>url</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>142</th>\n",
" <td>CONDOR</td>\n",
" <td>http://www.ickn.org/ckntools.html</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>144</th>\n",
" <td>CQPweb</td>\n",
" <td>https://cqpweb.lancs.ac.uk/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>CSV Sort</td>\n",
" <td>https://bitbucket.org/richardpenman/csvsort</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>CasualConc</td>\n",
" <td>https://sites.google.com/site/casualconc/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>161</th>\n",
" <td>Chartle</td>\n",
" <td></td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>163</th>\n",
" <td>Chorus</td>\n",
" <td>http://chorusanalytics.co.uk/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165</th>\n",
" <td>Chronos Timeline</td>\n",
" <td>http://hyperstudio.mit.edu/software/chronos-ti...</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>180</th>\n",
" <td>Code Bubbles</td>\n",
" <td>http://cs.brown.edu/~spr/codebubbles/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>184</th>\n",
" <td>Colaboratory</td>\n",
" <td>https://colab.research.google.com/notebooks/we...</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214</th>\n",
" <td>ContaWords</td>\n",
" <td>http://contawords.iula.upf.edu/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>215</th>\n",
" <td>Contropedia</td>\n",
" <td>http://contropedia.net/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>220</th>\n",
" <td>Cowo</td>\n",
" <td>https://github.com/seinecle/Cowo/blob/master/R...</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>223</th>\n",
" <td>Critic Markup</td>\n",
" <td>http://criticmarkup.com/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>228</th>\n",
" <td>Cytoscape</td>\n",
" <td>http://www.cytoscape.org/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>254</th>\n",
" <td>Density Design - Knot</td>\n",
" <td>http://www.densitydesign.org/research/knot/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>255</th>\n",
" <td>DfR Browser</td>\n",
" <td>https://agoldst.github.io/dfr-browser/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>300</th>\n",
" <td>EVI-LINHD</td>\n",
" <td>http://www.evilinhd.com/</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>307</th>\n",
" <td>EgoWeb 2.0</td>\n",
" <td>http://www.rand.org/methods/egoweb.html</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>332</th>\n",
" <td>Facepager</td>\n",
" <td>https://github.com/strohne/Facepager</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>342</th>\n",
" <td>Find Locations from A Text (Named-Entity Recog...</td>\n",
" <td></td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"142 CONDOR \n",
"144 CQPweb \n",
"146 CSV Sort \n",
"156 CasualConc \n",
"161 Chartle \n",
"163 Chorus \n",
"165 Chronos Timeline \n",
"180 Code Bubbles \n",
"184 Colaboratory \n",
"214 ContaWords \n",
"215 Contropedia \n",
"220 Cowo \n",
"223 Critic Markup \n",
"228 Cytoscape \n",
"254 Density Design - Knot \n",
"255 DfR Browser \n",
"300 EVI-LINHD \n",
"307 EgoWeb 2.0 \n",
"332 Facepager \n",
"342 Find Locations from A Text (Named-Entity Recog... \n",
"\n",
" url _merge \n",
"142 http://www.ickn.org/ckntools.html left_only \n",
"144 https://cqpweb.lancs.ac.uk/ left_only \n",
"146 https://bitbucket.org/richardpenman/csvsort left_only \n",
"156 https://sites.google.com/site/casualconc/ left_only \n",
"161 left_only \n",
"163 http://chorusanalytics.co.uk/ left_only \n",
"165 http://hyperstudio.mit.edu/software/chronos-ti... left_only \n",
"180 http://cs.brown.edu/~spr/codebubbles/ left_only \n",
"184 https://colab.research.google.com/notebooks/we... left_only \n",
"214 http://contawords.iula.upf.edu/ left_only \n",
"215 http://contropedia.net/ left_only \n",
"220 https://github.com/seinecle/Cowo/blob/master/R... left_only \n",
"223 http://criticmarkup.com/ left_only \n",
"228 http://www.cytoscape.org/ left_only \n",
"254 http://www.densitydesign.org/research/knot/ left_only \n",
"255 https://agoldst.github.io/dfr-browser/ left_only \n",
"300 http://www.evilinhd.com/ left_only \n",
"307 http://www.rand.org/methods/egoweb.html left_only \n",
"332 https://github.com/strohne/Facepager left_only \n",
"342 left_only "
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#tools in TAPoR but not in MP datset\n",
"df_lo=dataframe_difference(df_mp_worksub.sort_values('name'), df_tapor_worksub.sort_values('name'), 'left_only')\n",
"# see 20 records in MP dataset but not in TAPoR\n",
"df_lo.head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comparing values for 'name' and 'url', there are 99 tool descriptions in TAPoR dataset but not in MP dataset"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>url</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1343</th>\n",
" <td>ANNIS</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1344</th>\n",
" <td>Adobe Flash</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1345</th>\n",
" <td>Ainm.ie</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1346</th>\n",
" <td>Alpheios</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1347</th>\n",
" <td>Anastasia</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1348</th>\n",
" <td>ArcExplorer</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1349</th>\n",
" <td>AroniSmartIntelligence™</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1350</th>\n",
" <td>Aruspix</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1351</th>\n",
" <td>BASE</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1352</th>\n",
" <td>Basement Waterproofing: Tips and Instructions</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1353</th>\n",
" <td>Berkeley Parser</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1354</th>\n",
" <td>CATMA (Computer Aided Textual Markup and Analy...</td>\n",
" <td>http://www.catma.de/</td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1355</th>\n",
" <td>Canva \"The Amazingly Simple Graphic Design Sof...</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1356</th>\n",
" <td>Chicken</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1357</th>\n",
" <td>CloudConvert</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1358</th>\n",
" <td>Collocate</td>\n",
" <td>http://</td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1359</th>\n",
" <td>Commentpress</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1360</th>\n",
" <td>CoolTool NeuroLab</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1361</th>\n",
" <td>Datapress</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1362</th>\n",
" <td>Delicious</td>\n",
" <td></td>\n",
" <td>right_only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name url \\\n",
"1343 ANNIS \n",
"1344 Adobe Flash \n",
"1345 Ainm.ie \n",
"1346 Alpheios \n",
"1347 Anastasia \n",
"1348 ArcExplorer \n",
"1349 AroniSmartIntelligence™ \n",
"1350 Aruspix \n",
"1351 BASE \n",
"1352 Basement Waterproofing: Tips and Instructions \n",
"1353 Berkeley Parser \n",
"1354 CATMA (Computer Aided Textual Markup and Analy... http://www.catma.de/ \n",
"1355 Canva \"The Amazingly Simple Graphic Design Sof... \n",
"1356 Chicken \n",
"1357 CloudConvert \n",
"1358 Collocate http:// \n",
"1359 Commentpress \n",
"1360 CoolTool NeuroLab \n",
"1361 Datapress \n",
"1362 Delicious \n",
"\n",
" _merge \n",
"1343 right_only \n",
"1344 right_only \n",
"1345 right_only \n",
"1346 right_only \n",
"1347 right_only \n",
"1348 right_only \n",
"1349 right_only \n",
"1350 right_only \n",
"1351 right_only \n",
"1352 right_only \n",
"1353 right_only \n",
"1354 right_only \n",
"1355 right_only \n",
"1356 right_only \n",
"1357 right_only \n",
"1358 right_only \n",
"1359 right_only \n",
"1360 right_only \n",
"1361 right_only \n",
"1362 right_only "
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Tools in MP dataset but not in TAPoR\n",
"df_ro=dataframe_difference(df_mp_worksub.sort_values('name'), df_tapor_worksub.sort_values('name'), 'right_only')\n",
"df_ro.head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution of items in TAPoR dataset by 'last_updated' value\n",
"\n",
"Check the content of the field 'last_update' for TAPoR dataset descriptions. This value *seems* the date when a description of a tool has been updated the last time.\n"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>url</th>\n",
" <th>last_updated</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>423</th>\n",
" <td>List Words - HTML (TAPoRware)</td>\n",
" <td>http://taporware.ualberta.ca/~taporware/htmlTo...</td>\n",
" <td>2011-11-27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>424</th>\n",
" <td>List Words - XML (TAPoRware)</td>\n",
" <td>http://taporware.ualberta.ca/~taporware/xmlToo...</td>\n",
" <td>2011-11-27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>425</th>\n",
" <td>List Words - Plain Text (TAPoRware)</td>\n",
" <td>http://taporware.ualberta.ca/~taporware/textTo...</td>\n",
" <td>2011-11-28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>426</th>\n",
" <td>List Tags - HTML (TAPoRware)</td>\n",
" <td>http://taporware.ualberta.ca/~taporware/htmlTo...</td>\n",
" <td>2011-11-28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>427</th>\n",
" <td>List XML Elements (TAPoRware)</td>\n",
" <td>http://taporware.ualberta.ca/~taporware/xmlToo...</td>\n",
" <td>2011-11-28</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"423 List Words - HTML (TAPoRware) \n",
"424 List Words - XML (TAPoRware) \n",
"425 List Words - Plain Text (TAPoRware) \n",
"426 List Tags - HTML (TAPoRware) \n",
"427 List XML Elements (TAPoRware) \n",
"\n",
" url last_updated \n",
"423 http://taporware.ualberta.ca/~taporware/htmlTo... 2011-11-27 \n",
"424 http://taporware.ualberta.ca/~taporware/xmlToo... 2011-11-27 \n",
"425 http://taporware.ualberta.ca/~taporware/textTo... 2011-11-28 \n",
"426 http://taporware.ualberta.ca/~taporware/htmlTo... 2011-11-28 \n",
"427 http://taporware.ualberta.ca/~taporware/xmlToo... 2011-11-28 "
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools['correctdata']=pd.to_datetime(df_db_tools['last_updated'])\n",
"df_db_tools['justdata'] = df_db_tools['correctdata'].dt.year\n",
"df_reg_tm_sorted=df_db_tools.sort_values('last_updated')\n",
"df_reg_tools_sub=df_reg_tm_sorted[['name', 'url', 'last_updated']]\n",
"df_reg_tools_sub.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Number of tools by year their description has been updated')"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA20AAAF3CAYAAAA2IKMeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdeXxb93nn++/DXRIlURQXUJu1WCuheJOV2E68yySdNE57Jx23aet0mrq9cZvm3pk2STtt0874XrdNl+n0ZqaeJI077cRN0mbiJhFoeZF3W5ZXAdr3jeAiihIlivvv/nEObQgGRZAEeQDi83698ALxw++c8+AsEh6c5/yOOecEAAAAAMhOBUEHAAAAAAAYHUkbAAAAAGQxkjYAAAAAyGIkbQAAAACQxUjaAAAAACCLkbQBAAAAQBYjaQOygJl91cycmTWneO/7ZrZ9GmO53Y8lPF3LHA8zW29mL5jZRT/O5Sn6lPjr9NopjOPbZrZzkvPYbmbfz1RMucDMavxtszypPWP7nZl91p9X+WTnlTDPaT0Ox2JmR83sa+OcJuVxYWbL/fX1icxGOWocgf4bk4/H3Wgmsl+b2Rp/P6oIMg4g35C0AdnlHjO7MeggstyfSaqQ9ElJN0lqSdGnRNIfSpqypA0TViNv2yyfwmX8WN6+0TOFywjaT0v663FOM9px0SJvfb2Ygbgw862Rtx9lLGkDMLaioAMA8J5OSScl/Z6kTwUcy5QxszLnXO8kZrFO0hPOuaczFRMyz8xmOecuBbFs51y7pPYr9QkyvskYids591am5umc65P0aqbmBwDIPM60AdnDSfp/JH3SzDaO1skvS+lI0e7M7DcSXh81s6+Z2ZfNrMXMzpnZn5vnXjOLmVm3mf1vM1uQYlGLzOxHfhnicTP79RTL/KiZPWdmPWZ2xsz+h5nNTXh/pExts1+SdEnSb1/hs11rZk/78ztrZv9oZrX+e8vNzElaJen/8ue7fZRZdfvPf+f3e6+M0syqzOwxP94eP65NSXEU+uv5uJn1+evq50eL25+mwsy+YWanzazXn/Z/XGmahGkf9LfXJTP7sZktTnjvdTP7uxTTPGZmb44yv3r/M9+W1F5uZhfM7AsJbWNtwzoz+5aZHfbj229m/9nMShL6jJTXfcbM/t7MuiT9a4q4lkva5b98dmTbJHWrMrPv+XEeNrPPp5hPuvtd+XjiS5h+qZn9xP+8R83sc6P0C/vbq9t/fM/MQgnvF5t3DI7sR6fN7AdJ6+4qM/uOmXX4n+fdkX3tSnFbUnmk+eW6ZvYpM9vr74MvmtmGhJBTHheWojwynWMgYZlb/Lgv+susH23dJrnitjazm8zsCX+9XTSzt83sM0l9puS4898vM7M/NbMT/jp4x8zuTTGfz/nrp8/MjpnZ72RiPSXvxwntydt+u3nlhWN9njH3azNbZ2aP+5+5x/9cXzSzAv/92/X+sXPEj+9owvTL/Ok7/embzWzteOMAkIJzjgcPHgE/JH1VUoe8H1L2Sno84b3vS9qe3DfFPJyk30h4fVTembt/kdQo7wyek/SXkt6Q9DOSPiPprKT/njDd7X6/E/KSyAZJf+u3fSKh3y2S+iT9k6R7Jf2ipFOSvp/Q57P+dIck/QdJd0i6bpR1UC2pS9Ir8s40/oIf/7vyyrpKJX1EXinXP/p/bxhlXnf4y/1Pfr+PSCr133tRUlzSL0v6KUnPy/sye3XC9A9LGpD0H/3P/6g/v59L6PNtSTsTXn/L33b/VtJtfvyPjrHdt/vrbJe/PX7eX++vJ/T5NUkXJJUntJX7bb95hXm/IunbSW2/7G+zqnFsw42SvuZvk9sk/arf528T+iz310+LpP9P0hZJd6aIqdT/jE7S50e2TdJ+d8Bf71v8deokbZ7gflc+nvj8vibpTUnH/Vh/xt8+p3T5cXi1pHOSnvbXzf8habek1yWZ3+cP/GU+IOlWST
/r7zez/PdrJJ2WdNCP+S5JvyXpS2PFLe/4/lrS/tgu6bC843ok7hOSyq50XCQsJ/H4TvcYaJP0trz9/pOS9kuKjayDUdZxutv6fklf8rfznZJ+X1J/UgxTctz5/X7kf77/U9I9kr4haVDStQl9fttfTw/7n+PL8vbP38jAevqsEvbjpH/bvzaez6P09+u7JP2RvH8bb5f0RXn7+Vf89+dJ+vd+XD/t70PX+e9V+vN/S96+/gl5/96e0Pv7fFpx8ODB44OPwAPgwYPH5YmY/x/1kKQ1/uvJJG0HJRUmtO3wv3SsSGj7U0mtCa9v9+f1aNL8t0l6NeH1C5KeTepzpz9tOOGzOEm/lcY6eERe0jYvoW2zPvhF8bIvLKPMq9yf7rNJ7Y1++20JbXPkfdn9W/91paSLkv4wadqfSNqX8Prbujxpi+oKSdQocW6X94XvqoS2W/wYG/3X8/x4fjmhz7+T98Vw4RXm/Tl9MNl7XpcnN2NuwxTzLfK/bPVKKvHblvvT/CCNzxz2+96e1D6y3/1xQluxv20emeB+l5y0pRPfvX7fDye0XSXvuEk8Dv+npH0j68BvWy3v2P24//pHkv78Csv6f/1tWzfK+6PGrdRJm5N0c4q4f32M42JkOZ+YwDEwKGl1Qtun/Hmtu8LnTmtbJ01j/r73t5KemYbj7i4l/VuRcAx9L+HYvJBiPf2xvB+GCie5ni7bj6+w7dP5PGnt16Os89+VdDih/RP+vJYn9f9Pks5IqkxoWyAv6XtoonHw4MHDe1AeCWSff5D3K+RXMjCv7c65oYTXByUddc4dSWqrTizZ8v0g6fW/SLrBL5uaLW/ggu+aWdHIQ96vqgOSbkia9sdpxLpZ0pPOufMjDc65HfK+oHw0jenTsVlSu3PuuYRlXJT35XpkGWFJsyV9L2naf5K0xsxqRpn325J+28w+b2ZrxhHTm865YwnxvCTvV/nN/uvz8hL3zyZM81l51/WducJ8H/efPy1JZrZK3mf8O/91WtvQPF80s93mlbcOyDvTWSppWdIy09nOY3ly5A/n3IC8szFLxhPzFaS7H7Y6515LiOOYvLPTie6Wd4wMJ8RxRN7+OlJu+7akz5rZ75jZh8zMkuZxp6SIcy7VYDrjjVuS2pxzL6eIe3Oa048YzzFw1Dl3IOH1bv95SRrLGXVbS5KZLTCzvzazY/K274CkB+UNhDFiSo47eds3LumlpH3tab2/fW+S96PP95L6PCOpVpevg8msp0x8nrT2a78k9I/M7KC8H4ZGziKu8D/bldwt78e98wnrottfxsg6S/f4ApCEpA3IMs65QXlnv37BzK6a5Oy6kl73j9Jm8koQE7WleF0kqUrer6eFkr6u979MDcj7T75Y0tKkaVvTiLVulH6t8n75z4R0llGX0JbcR/I+eyq/Iel/yyuJ22dmB8zs/jRiSl7PI211Ca+/KeljZrbKT74+Jq8sbFTOuQuSviuvJFLyEr24pEjC50hnG35R0p/LS1Duk/el6yH/vbKkxaaznceSav8cWc5497tk6cQX0ujbJFGVvNK9gaTHyoQ4/rO8ssbPS3pH0gkz+62EeSxU6tFPJxJ3qhhH2upStF/JeI6BVNtL+uC+kcqVtrXknaH6t/JGjL1H0o3y9vvEPlN13FXJ2xeSt+9X9f72rfKfY0l9nvXbE/fHyayndIz1edLdr/9EXin7o/LOit0obz+Wxo61St72Sl5nd+j9dZFuHACSMHokkJ2+Je9ajy+leK9XSQmWpR5IZLKSzyjVyCth6ZD3n7eT9wXmJymmPZ302qWxvJYUy5S8X6wz9SvslZbRmdBHfr8zSX2U0O8yzrkuSV+Q9AUz+5Ck35H0j2b2rnNud6ppEpaTqu29L/POuefN7IC8a6NM3vp9MsV0yb4h70zBakm/JOnvE868dim9bfhpeeVgvzfyhl0+uEWidLbzZKQb82jSiS+u0bdJ4miTnfIS2W+k6NshSc4bJfUPJP2Bvw1+Xd
Jfmdk+51xE3v6VTkKV7nodLe5YmtOPmNAxkElmVibp4/JKvv97QvtlPzZP4XHXKe86qyuN5DuyHj6h1In1vitMm46
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"f, ax1 = plt.subplots(nrows=1, figsize=(15,6))\n",
"df_reg_tm_sorted.justdata.value_counts().reindex(sorted(df_reg_tm_sorted.justdata.value_counts().index)).plot(ax=ax1)\n",
"ax1.set_title('Number of tools by year their description has been updated', fontsize=15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check URL in TAPoR dataset\n",
"In TAPoR dataset there are descriptions where the URL of a Tool is not provided"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name 136\n",
"url 136\n",
"last_updated 136\n",
"dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_reg_tools_sub_emurl=df_reg_tools_sub[df_reg_tools_sub['url'] == '']\n",
"#print(\"number of record with missed URL in TAPoR dataset:\")\n",
"df_reg_tools_sub_emurl.count()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([423, 446, 444, 443, 442, 441, 440, 439, 438, 437,\n",
" ...\n",
" 413, 414, 415, 416, 417, 418, 419, 420, 421, 422],\n",
" dtype='int64', length=1227)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_reg_tools_sub_whurl=df_reg_tools_sub[df_reg_tools_sub['url'] != '']\n",
"df_reg_tools_sub_whurl.index"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"#df_reg_tools_sub.head()\n",
"#for column in df_reg_tools_sub[['name', 'url']]:\n",
"# # Select column contents by column name using [] operator\n",
"# columnSeriesObj = df_reg_tools_sub[column]\n",
"# print('Colunm Name : ', column)\n",
"# print('Column Contents : ', columnSeriesObj.values)\n",
"df_urls=df_reg_tools_sub_whurl.url.values\n",
"#df_urls"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>url</th>\n",
" <th>status</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>test</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>http://taporware.ualberta.ca/~taporware/htmlTo...</td>\n",
" <td>404.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>http://taporware.ualberta.ca/~taporware/textTo...</td>\n",
" <td>404.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>http://taporware.ualberta.ca/~taporware/htmlTo...</td>\n",
" <td>404.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>http://taporware.ualberta.ca/~taporware/textTo...</td>\n",
" <td>404.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" url status\n",
"0 test 1.0\n",
"1 http://taporware.ualberta.ca/~taporware/htmlTo... 404.0\n",
"2 http://taporware.ualberta.ca/~taporware/textTo... 404.0\n",
"3 http://taporware.ualberta.ca/~taporware/htmlTo... 404.0\n",
"4 http://taporware.ualberta.ca/~taporware/textTo... 404.0"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = {'url': ['test'],'status': [1]}\n",
"df_http_status = pd.DataFrame (data, columns = ['url','status'])\n",
"import requests\n",
"import re\n",
"regex = re.compile(\n",
" r'^(?:http|ftp)s?://' # http:// or https://\n",
" r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\\.)+(?:[A-Z]{2,6}\\.?|[A-Z0-9-]{2,}\\.?)|' #domain...\n",
" r'localhost|' #localhost...\n",
" r'\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})' # ...or ip\n",
" r'(?::\\d+)?' # optional port\n",
" r'(?:/?|[/?]\\S+)$', re.IGNORECASE)\n",
"\n",
"\n",
"for var in df_urls:\n",
" # print(var)\n",
" if ( var != \"\" and var!=None and re.match(regex, var)):\n",
" try:\n",
" r =requests.get(var,timeout=8)\n",
" #print(\"result: \"+var+ \" \",r.status_code)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(r.status_code)}, ignore_index=True)\n",
" except requests.exceptions.ConnectionError:\n",
" # print(var)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(503)}, ignore_index=True)\n",
" except requests.exceptions.ConnectTimeout:\n",
" # print(var)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(408)}, ignore_index=True)\n",
" except requests.exceptions.ReadTimeout:\n",
" # print(var)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(408)}, ignore_index=True)\n",
" except requests.exceptions.RequestException:\n",
" # print(var)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(500)}, ignore_index=True)\n",
" except TypeError:\n",
" # print(var)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(400)}, ignore_index=True)\n",
" else:\n",
" # print(var ,0)\n",
" df_http_status = df_http_status.append({'url': var, 'status': int(400)}, ignore_index=True)\n",
"df_http_status.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The HTTP result status values for URL in TAPoR dataset descriptions\n",
"\n",
"The table below shows the HTTP Status code (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) obtained when 'clicking' on URL of tool descriptions of TAPoR dataset.\n",
"\n",
"There is a significant number of URLs that seems not correct (status 404, 503, 500, 508....)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"200.0 652\n",
"400.0 442\n",
"404.0 83\n",
"503.0 25\n",
"403.0 11\n",
"406.0 7\n",
"408.0 3\n",
"500.0 2\n",
"502.0 1\n",
"420.0 1\n",
"Name: status, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_http_status_sub=df_http_status[df_http_status['status'] != 1]\n",
"df_db_st = df_http_status_sub['status'].value_counts()\n",
"df_db_st.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TAPoR dataset 'creators' \n",
"There are 164 descriptions in TAPoR dataset that don't have values in *creators_name* field, and there are 924 different creators. \n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([649, 686, 697, 701, 706, 719, 733, 736, 746, 765,\n",
" ...\n",
" 405, 407, 408, 410, 412, 414, 416, 417, 420, 422],\n",
" dtype='int64', length=164)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools_na=df_db_tools[df_db_tools['creators_name'] == ''].sort_values('last_updated')\n",
"df_db_tools_na.index"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"924"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#the number of creators\n",
"len(df_db_tools['creators_name'].unique())-1"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"df_db_tools.loc[df_db_tools['creators_name']=='','creators_name']='n/a'\n",
"df_db_tech_NoCoT = df_db_tools['creators_name'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA4gAAAG5CAYAAADMCRrvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdebid0/n/8fdHDEkkoiTUfFqUxhRyYgylVdVqa4oGaSsd+Pp1QFtapV+C0rS+HQxFQwmaomaVlpgiCSI5mSOoKbSh5oQQEXH//lhr82Rn732GnDgnJ5/XdZ3rPHs9a7jX2keuc1vr2UcRgZmZmZmZmdkqbR2AmZmZmZmZtQ9OEM3MzMzMzAxwgmhmZmZmZmaZE0QzMzMzMzMDnCCamZmZmZlZ5gTRzMzMzMzMACeIZmZmKzVJwyX9so3GlqQrJL0uacJyHqtOUkhatZnthkj6y/KKy9qGpGskHdTWcTRG0hqSHpO0XlvHYisPJ4hmZmbtiKTZkl6UtGah7LuSRrdhWMtLf+DzwMYRsXP5TUmDJY376MNasUkaLem7bR1HeyVpe2AH4FZJp0ian7/ekbS48PqRVhpvdUk35P+2Q9LeZfcl6deSXs1fv5EkgIhYCFwO/Kw1YjFrCieIZmZm7c+qwPFtHURzSerUzCabAbMj4q3lEc+Kqrm7nK08tiR19N8P/wcYEck5EdEtIroBxwIPlV5HxDatOOY44OvAfyvcOwY4iJS0bg98OcdY8lfgKElrtGI8ZlV19H8AzMzMVkTnAidKWrv8RqWjksUdo7zr9oCk30uaK+lpSbvn8n9LeknSUWXd9pR0l6Q3Jd0vabNC31vne69JelzS1wr3hku6WNI/JL0F7FMh3g0l3ZbbPynp6Fz+HeAyYLe8W3NGWbtPA5cU7s/N5T0kXSXpZUnPSvpFKaGRtEp+/Wye51WSelRa4LweT+c5PyNpUI33o7Ok63LdyZJ2yH2cJOnGsn4vkPSHKmNuIummHPurki4sxFJ6z14DhuSjhf8n6bm8o3yJpC65/sck3Z77eT1fb5zvnQ3sCVyY1600xu6SJkqal7/vXohrtKSzJT0AvA18sqnro3QE9295rd+U9Iik+sL9kyU9le/NknRw2XvQ5J/VRtakZ16HuflnbayqJ7pfBO6vcq84t8bW7FeSJuT7t0pap1I/EfFuRPwhIsYBiytUOQr4bUT8JyLmAL8FBhfa/wd4Hdi1sZjNWoMTRDMzs/anARgNnNjC9rsA04F1SbsP1wL9gC1IuxgXSupWqD8IOAvoCUwFRgAoHXO9K/exHnAEcJGk4s7KkcDZQHfSLkm5a4D/ABsCA4BzJH0uIv7Mkjs2pxcbRcSjZfdLyfIFQA/gk8BngG8C38r3BuevffL9bsCF5QHleZ0PfDEiugO753lXcyBwPbBOXotbJK0G/AXYXzmRV0raBwJXVxizE3A78CxQB2xEel9KdgGeJq3z2cCvgU8BfUjv20bAabnuKsAVpB3YTYEFpXlGxKnAWOAHed1+kBOXkXnO6wK/A0ZKWrcw/jdIO1ndgZebuT5fzXNZG7iNJdf8KVLC2gM4A/iLpA3K5t3Un9Vaa/IT0s9ZL2B94BQgygPN7/0ngMdrzIcmrtk3gW+Tfrbfy3VbYhtgWuH1tFxW9Chph9FsuXOCaGZm1j6dBvxQUq8WtH0mIq6IiMXAdcAmwJkRsTAiRgHvkn7BLhkZEWPy806nknbtNiEddZud+3ovIiYDN5ISvZJbI+KBiHg/It4pBpH76A/8LCLeiYippF3Db7RgTqUkayDw84h4MyJmk3ZbSv0NAn4XEU9HxHzg58Dhqnxk831gW0ldIuKFiKj1vNmkiLghIhaREoXOwK4R8QIwBjgs19sfeCUiJlXoY2dSInFSRLyV16OYUD8fERdExHvAO8DRwI8i4rWIeBM4BzgcICJejYgbI+LtfO9sUrJczQHAEx
FxdX4frwEeA75SqDM8Ih7J47/XzPUZFxH/yD9vV1NIZCLi+oh4Pv98XAc8kdeipEk/q5JUa02ARcAGwGYRsSgixkbEUgkiKYkFeLPGfJq6ZldHxMx8RPp/ga+p+cesIf2PjHmF1/OAbnnOJW8WYjdbrpwgmpmZtUMRMZO043RyC5q/WLhekPsrLyvuIP67MO584DVSMrMZsEs+tjdX6ZjnIODjldpWsCFQ+mW+5FnSzk9L9ARWz31U6m/DCvdWJe0ofSD/Qj+QtEP5gqSRkrauMW5xfd7nwx1RgCtJO13k70vtHmabAM/mBKzmGKRdsK7ApMK635HLkdRV0p+UjtK+QUpS166RnJSvCyz9PhTn2Nz1KT5X9zbpSO6qOdZvSppamMe2pPexpKk/qzXXhHQs+0lgVD6qWu2/m7n5e/ca84Fmrlm+txpLzq2p5gNrFV6vBcwvS3C782HsZsuVE0QzM7P263TSrknxl9LSB7p0LZQVE7aW2KR0kY/zrQM8T/oF+P6IWLvw1S0i/l+hbaVdmpLngXUkFX8Z3xSY08S4yvt+hbRTtFmhrNjf8xXuvceSSUjqOOLOiPg8adfpMeDSGnEU12cVYOM8FsAtwPaStiXtuI6o0se/gU2r7GbCknN9hZQYbVNY9x75g1QgHafcCtglItYC9iqFV6EvWHpdYOn3YYk2zVyfipSeZb0U+AGwbj4mPLMQZ3PUXJO8o/yTiPgkaZfvx5I+V95JTn6fIh1VraUpa7ZJ2b1FOc7meoQlj4/ukMuKPs2Sx1DNlhsniGZmZu1URDxJOnZ3XKHsZdIvqV+X1EnSt4HNl3GoL0nqL2l10rOID0fEv0k7mJ+S9A1Jq+WvfkofINOU+P8NPAj8SlJnpT8v8B2qJ1HlXgQ2znGRjyH+DThbUvecgPyY9CwgpOcdfyTpEznRPQe4rnzXTtL6kr6an0dbSNrBqfThISV9JR2Sk7sTcpvxOaZ3gBtIz89NiIjnqvQxAXgBGCppzbwee1SqmHcpLwV+r/z37yRtJOkLuUp3UrI0Nz8rd3pZFy+SnsEs+QfpfTxS0qqSBgK9Se/vUlqwPtWsSUo8X879fou0g9hsja2JpC9LKh1FfSPHWy3mf1D7SG6pTmNr9nVJvSV1Bc4Ebsg/o0tR+oCdzvnl6vn9LyXKV5ES2o0kbUj6HwDDC203Iv1Pm/GNxGzWKpwgmpmZtW9nkn7RLjoaOAl4lfRhFg8u4xh/JSUZrwF9ScdIyUdD9yM95/U86Sjhr4HmfNz+EaQPZXkeuBk4PSLuamLbe0k7Kf+VVNqZ+SFpF/Vp0ofi/JX0d+LI368mHbl8hvQs3w8r9LsK6Zfw50lz/gzwvRpx3Eo6cvk66XnHQ/LziCVXAttR/XhpKbn9CunZz+dIx1QH1hjzZ6Qjk+PzMdK7SbuGAH8AupB2q8aTjloWnQcMUPqE0/Mj4lXS7uZPSD8zPwW+HBHVdruauz4VRcQs0jOiD5GS1u2AB5rbT0GtNdkyv56fx7soIkZX6WcYMKjsGb/y2JuyZleTErn/kp5LPY7qHicl9RsBd+br0g7ln4C/AzNIO6wjc1nJkcCV+Rlhs+VOlZ/fNTMzM7OmkLQp6RjmxyPijbaOxxon6a/A3yLilha2Hw38JSIua9XAlh5nDdLR0r0i4qXlOZZZSZv9IVYzMzOzFV1+JvHHwLVODlccEXFkW8fQFHnXsNYHBJm1OieIZmZmZi2Qn9F7kfQJlvu3cThmZq3CR0zNzMzMzMwM8IfUmJmZmZmZWeYjpmZmraBnz55RV1fX1mGYmZmZNWrSpEmvRESvSvecIJqZtYK6ujoaGhraOgwzMzOzRkl6tto9HzE1MzMzMzMzwAmimZmZmZmZZU4QzczMzMzMDPAziGZmrWLGnHnUnTyyWW1mDz1gOUVjZmZm1jLeQTQzMzMzMzPACaLZEiTNL3s9WNKFy9jnbEk98/WD+fvekm6v0WZNSa9K6lFWfoukrzU1/i
bEtkbuc4akKZI+2Uj9HSWFpC80Z5zctk7SzOa2a+YYiyVNlTRT0vWSukqql3T+8hzXzMzMrKNwgmjWiiTVPLYdEbs
"text/plain": [
"<Figure size 720x504 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"df_db_tech_NoCoT.head(20).plot.barh(figsize=(10,7), ax=ax)\n",
"ax.set_title('Number of tools by creator name (Top 20)')\n",
"ax.set_xlabel('N. of tools')\n",
"ax.set_ylabel('Creators');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of tool descriptions in the TAPoR dataset that have no creator email"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"382"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools_naem=df_db_tools[df_db_tools['creators_email'] == ''].sort_values('last_updated')\n",
"#df_db_tools_naem.index\n",
"len(df_db_tools_naem)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of tool descriptions in the TAPoR dataset that have no creator URL"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"171"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools_nau=df_db_tools[df_db_tools['creators_url'] == ''].sort_values('last_updated')\n",
"len(df_db_tools_nau)"
]
},
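{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two counts above match only empty strings. A minimal sketch of a broader check, assuming that `creators_email` and `creators_url` may also hold NULL or whitespace-only values (not verified against the dump); `count_missing` is a hypothetical helper, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def count_missing(df, column):\n",
"    # Hypothetical helper: a value counts as missing when it is NULL/NaN\n",
"    # or blank after stripping whitespace.\n",
"    values = df[column]\n",
"    blank = values.astype(str).str.strip() == ''\n",
"    return int((values.isna() | blank).sum())\n",
"\n",
"{col: count_missing(df_db_tools, col) for col in ['creators_email', 'creators_url']}"
]
},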
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ------ "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# One row per (tool, attribute, tag) combination for approved tools.\n",
"# Inner joins drop tools that have no attribute or no tag.\n",
"query = '''\n",
"    select t.id, t.name, t.detail, t.creators_name, t.last_updated,\n",
"           at.name as \"attributetype\", av.name as \"attribute\", tags.text as \"tag\"\n",
"    from TaPOR.tools as t\n",
"    join TaPOR.tool_attributes as ta on ta.tool_id = t.id\n",
"    join TaPOR.attribute_values as av on av.id = ta.attribute_value_id\n",
"    join TaPOR.attribute_types as at on at.id = ta.attribute_type_id\n",
"    join TaPOR.tool_tags as tota on tota.tool_id = t.id\n",
"    join TaPOR.tags as tags on tags.id = tota.tag_id\n",
"    where t.is_approved = 1\n",
"'''\n",
"df_db_tech = pd.read_sql_query(query, connection)\n",
"#df_db_tech=pd.read_sql_table('tools', connection)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"#df_db_tech.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=43845, step=1)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tech.index"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['id', 'name', 'detail', 'creators_name', 'last_updated',\n",
" 'attributetype', 'attribute', 'tag'],\n",
" dtype='object')"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tech.columns"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"df_items=df_db_tech[['id', 'name', 'detail', 'creators_name', 'last_updated']].drop_duplicates()\n",
"#df_items.head(10)"
]
},
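{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query joins every approved tool with each of its attributes and each of its tags, so a tool appears once per (attribute, tag) pair. This is why the joined frame has 43845 rows. A quick check, as a sketch that assumes `id` uniquely identifies a tool; `rows_per_tool` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rows per tool in the joined frame (assumes `id` uniquely identifies a tool).\n",
"rows_per_tool = df_db_tech.groupby('id').size()\n",
"print(f'{len(df_db_tech)} joined rows for {rows_per_tool.size} distinct tools')\n",
"rows_per_tool.describe()"
]
},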
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attributes in TAPoR dataset items\n",
"\n",
"The following dataframe lists the attribute types defined in the TAPoR dataset to characterize tools"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Type of analysis</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Type of license</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Background Processing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Web Usable</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Ease of Use</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Warning</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Usage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Tool Family</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Historic Tool (developed before 2005)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Compute Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Link to Recipe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>TaDiRAH Goals</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>TaDiRAH Methods</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name\n",
"0 Type of analysis\n",
"1 Type of license\n",
"2 Background Processing\n",
"3 Web Usable\n",
"4 Ease of Use\n",
"5 Warning\n",
"6 Usage\n",
"7 Tool Family\n",
"8 Historic Tool (developed before 2005)\n",
"9 Compute Canada\n",
"10 Link to Recipe\n",
"11 TaDiRAH Goals\n",
"12 TaDiRAH Methods"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools_toa=pd.read_sql_query('SELECT distinct name FROM TaPOR.attribute_types', connection)\n",
"df_db_tools_toa.head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tools with no attribute in TAPoR dataset\n",
"\n",
"The following dataframe lists the approved tools in the TAPoR dataset that have no attribute values, showing their main fields"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>name</th>\n",
" <th>creators_name</th>\n",
" <th>url</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>579</td>\n",
" <td>Voyant 2.0: Knots</td>\n",
" <td>Stéfan Sinclair and Geoffrey Rockwell</td>\n",
" <td>http://voyant-tools.org/?view=knots</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>591</td>\n",
" <td>Warc Extractor</td>\n",
" <td>Ryan Chartier &amp; Internet Archive</td>\n",
" <td>https://github.com/recrm/ArchiveTools/blob/mas...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>754</td>\n",
" <td>TAGS https://t.co/T007ezdZoA</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>755</td>\n",
" <td>Multiple enhancements to DiRT Directory (tools...</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>758</td>\n",
" <td>RT : Today's \"dirt\": DiRT now uses TaDiRAH ter...</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>823</td>\n",
" <td>Basement Waterproofing: Tips and Instructions</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1017</td>\n",
" <td>Datapress</td>\n",
" <td>MIT CSAIL</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1063</td>\n",
" <td>WordVenture</td>\n",
" <td>WordNet</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1174</td>\n",
" <td>VoiceThread</td>\n",
" <td>VoiceThread LLC</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1183</td>\n",
" <td>Purdue OWL</td>\n",
" <td>Purdue University Writing Lab, Purdue Universi...</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1352</td>\n",
" <td>Aruspix</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1369</td>\n",
" <td>MMax2</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1377</td>\n",
" <td>Lextek</td>\n",
" <td></td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id name \\\n",
"0 579 Voyant 2.0: Knots \n",
"1 591 Warc Extractor \n",
"2 754 TAGS https://t.co/T007ezdZoA \n",
"3 755 Multiple enhancements to DiRT Directory (tools... \n",
"4 758 RT : Today's \"dirt\": DiRT now uses TaDiRAH ter... \n",
"5 823 Basement Waterproofing: Tips and Instructions \n",
"6 1017 Datapress \n",
"7 1063 WordVenture \n",
"8 1174 VoiceThread \n",
"9 1183 Purdue OWL \n",
"10 1352 Aruspix \n",
"11 1369 MMax2 \n",
"12 1377 Lextek \n",
"\n",
" creators_name \\\n",
"0 Stéfan Sinclair and Geoffrey Rockwell \n",
"1 Ryan Chartier & Internet Archive \n",
"2 \n",
"3 \n",
"4 \n",
"5 \n",
"6 MIT CSAIL \n",
"7 WordNet \n",
"8 VoiceThread LLC \n",
"9 Purdue University Writing Lab, Purdue Universi... \n",
"10 \n",
"11 \n",
"12 \n",
"\n",
" url \n",
"0 http://voyant-tools.org/?view=knots \n",
"1 https://github.com/recrm/ArchiveTools/blob/mas... \n",
"2 None \n",
"3 None \n",
"4 None \n",
"5 None \n",
"6 None \n",
"7 None \n",
"8 None \n",
"9 None \n",
"10 None \n",
"11 None \n",
"12 None "
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_tools_noatt=pd.read_sql_query('select distinct tools.id, tools.name, tools.creators_name, tools.url from TaPOR.tools where tools.is_approved=1 and tools.id not in (select distinct TaPOR.tool_attributes.tool_id from TaPOR.tool_attributes)', connection)\n",
"df_db_tools_noatt.head(19)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Types of license in TAPoR dataset items"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 8, 40, 44, 108, 170, 306, 330, 344, 362,\n",
" 380,\n",
" ...\n",
" 43679, 43728, 43730, 43751, 43762, 43779, 43814, 43816, 43821,\n",
" 43824],\n",
" dtype='int64', length=1024)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_sub=df_db_tech[['id', 'name', 'detail', 'creators_name', 'last_updated', 'attributetype', 'attribute']]\n",
"df_to=df_db_sub[df_db_sub['attributetype'] == 'Type of license'].drop_duplicates()\n",
"df_to.index"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Free 470\n",
"Open Source 256\n",
"Closed Source 195\n",
"Commercial 79\n",
"Creative Commons 22\n",
"Shareware 2\n",
"Name: attribute, dtype: int64"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_lic = df_to['attribute'].value_counts()\n",
"df_db_lic.head(10)"
]
},
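{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw counts can also be read as shares of all license assignments. A small sketch using `value_counts(normalize=True)`; `df_db_lic_pct` is a name introduced here for illustration, and a tool that carries several license values is counted once per value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Share (in percent) of each license value among all license assignments.\n",
"df_db_lic_pct = df_to['attribute'].value_counts(normalize=True).mul(100).round(1)\n",
"df_db_lic_pct"
]
},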
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"# Draw on the axes created above; Series.plot ignores x= and y=, so they are omitted.\n",
"df_db_lic.plot(kind='bar', figsize=(15,6), ax=ax)\n",
"plt.grid(alpha=0.6)\n",
"ax.yaxis.set_label_text(\"\")\n",
"ax.set_title(\"Number of Tools by License\", fontsize=15)\n",
"ax.set_xlabel('License', fontsize=14)\n",
"ax.set_ylabel('N of Tools', fontsize=14);\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"#df_db_tech.loc[df_db_tech['country']=='', 'country']='N/A'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## *Type of analysis* in TAPoR dataset items\n",
"\n",
"A tool description can have more than one value for *Type of analysis* (i.e. a tool can perform one or more types of analysis)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>name</th>\n",
" <th>detail</th>\n",
" <th>creators_name</th>\n",
" <th>last_updated</th>\n",
" <th>attributetype</th>\n",
" <th>attribute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>43724</th>\n",
" <td>1499</td>\n",
" <td>iPhoto</td>\n",
" <td>&lt;p&gt;iPhoto is a digital photograph manipulation...</td>\n",
" <td>Apple</td>\n",
" <td>2018-10-12</td>\n",
" <td>Type of analysis</td>\n",
" <td>Organizing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43726</th>\n",
" <td>1499</td>\n",
" <td>iPhoto</td>\n",
" <td>&lt;p&gt;iPhoto is a digital photograph manipulation...</td>\n",
" <td>Apple</td>\n",
" <td>2018-10-12</td>\n",
" <td>Type of analysis</td>\n",
" <td>Storage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43748</th>\n",
" <td>1500</td>\n",
" <td>Google 3D Warehouse</td>\n",
" <td>&lt;p&gt;A collection of free-to-download 3D models ...</td>\n",
" <td>Google</td>\n",
" <td>2018-11-06</td>\n",
" <td>Type of analysis</td>\n",
" <td>Collaboration</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43749</th>\n",
" <td>1500</td>\n",
" <td>Google 3D Warehouse</td>\n",
" <td>&lt;p&gt;A collection of free-to-download 3D models ...</td>\n",
" <td>Google</td>\n",
" <td>2018-11-06</td>\n",
" <td>Type of analysis</td>\n",
" <td>Dissemination</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43750</th>\n",
" <td>1500</td>\n",
" <td>Google 3D Warehouse</td>\n",
" <td>&lt;p&gt;A collection of free-to-download 3D models ...</td>\n",
" <td>Google</td>\n",
" <td>2018-11-06</td>\n",
" <td>Type of analysis</td>\n",
" <td>Modeling</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43759</th>\n",
" <td>1501</td>\n",
" <td>SketchUp (Formerly Google SketchUp)</td>\n",
" <td>&lt;p&gt;Google SketchUp is easy-to-use free 3D mode...</td>\n",
" <td>Google</td>\n",
" <td>2018-10-26</td>\n",
" <td>Type of analysis</td>\n",
" <td>Creation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43760</th>\n",
" <td>1501</td>\n",
" <td>SketchUp (Formerly Google SketchUp)</td>\n",
" <td>&lt;p&gt;Google SketchUp is easy-to-use free 3D mode...</td>\n",
" <td>Google</td>\n",
" <td>2018-10-26</td>\n",
" <td>Type of analysis</td>\n",
" <td>Interpretation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43761</th>\n",
" <td>1501</td>\n",
" <td>SketchUp (Formerly Google SketchUp)</td>\n",
" <td>&lt;p&gt;Google SketchUp is easy-to-use free 3D mode...</td>\n",
" <td>Google</td>\n",
" <td>2018-10-26</td>\n",
" <td>Type of analysis</td>\n",
" <td>Modeling</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43790</th>\n",
" <td>1502</td>\n",
" <td>GIMP (GNU Image Manipulation Program)</td>\n",
" <td>&lt;p&gt;GIMP is image editing software, much like P...</td>\n",
" <td>GIMP Team</td>\n",
" <td>None</td>\n",
" <td>Type of analysis</td>\n",
" <td>Creation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43818</th>\n",
" <td>1556</td>\n",
" <td>Reaper</td>\n",
" <td>REAPER is a complete digital audio production ...</td>\n",
" <td>Cockos</td>\n",
" <td>2019-03-24</td>\n",
" <td>Type of analysis</td>\n",
" <td>Creation</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id name \\\n",
"43724 1499 iPhoto \n",
"43726 1499 iPhoto \n",
"43748 1500 Google 3D Warehouse \n",
"43749 1500 Google 3D Warehouse \n",
"43750 1500 Google 3D Warehouse \n",
"43759 1501 SketchUp (Formerly Google SketchUp) \n",
"43760 1501 SketchUp (Formerly Google SketchUp) \n",
"43761 1501 SketchUp (Formerly Google SketchUp) \n",
"43790 1502 GIMP (GNU Image Manipulation Program) \n",
"43818 1556 Reaper \n",
"\n",
" detail creators_name \\\n",
"43724 <p>iPhoto is a digital photograph manipulation... Apple \n",
"43726 <p>iPhoto is a digital photograph manipulation... Apple \n",
"43748 <p>A collection of free-to-download 3D models ... Google \n",
"43749 <p>A collection of free-to-download 3D models ... Google \n",
"43750 <p>A collection of free-to-download 3D models ... Google \n",
"43759 <p>Google SketchUp is easy-to-use free 3D mode... Google \n",
"43760 <p>Google SketchUp is easy-to-use free 3D mode... Google \n",
"43761 <p>Google SketchUp is easy-to-use free 3D mode... Google \n",
"43790 <p>GIMP is image editing software, much like P... GIMP Team \n",
"43818 REAPER is a complete digital audio production ... Cockos \n",
"\n",
" last_updated attributetype attribute \n",
"43724 2018-10-12 Type of analysis Organizing \n",
"43726 2018-10-12 Type of analysis Storage \n",
"43748 2018-11-06 Type of analysis Collaboration \n",
"43749 2018-11-06 Type of analysis Dissemination \n",
"43750 2018-11-06 Type of analysis Modeling \n",
"43759 2018-10-26 Type of analysis Creation \n",
"43760 2018-10-26 Type of analysis Interpretation \n",
"43761 2018-10-26 Type of analysis Modeling \n",
"43790 None Type of analysis Creation \n",
"43818 2019-03-24 Type of analysis Creation "
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_to_ta=df_db_sub[df_db_sub['attributetype'] == 'Type of analysis'].drop_duplicates()\n",
"df_to_ta.tail(10)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Analysis 434\n",
"Visualization 236\n",
"Content Analysis 185\n",
"Search 139\n",
"Natural Language Processing 125\n",
"Discovering 124\n",
"Capture 113\n",
"Gathering 97\n",
"Publishing 92\n",
"Dissemination 91\n",
"Enrichment 90\n",
"Annotating 83\n",
"Collaboration 80\n",
"Organizing 71\n",
"Creation 52\n",
"Uncategorized 49\n",
"Storage 40\n",
"Web development 39\n",
"Modeling 25\n",
"Programming 22\n",
"Interpretation 18\n",
"RDF 12\n",
"Name: attribute, dtype: int64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_db_a = df_to_ta['attribute'].value_counts()\n",
"df_db_a.head(25)"
]
},
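{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because a tool can carry several *Type of analysis* values, the counts above can add up to more than the number of tools. A sketch of how many values each tool has, assuming `id` uniquely identifies a tool; `types_per_tool` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Distinct Type of analysis values per tool, then how many tools have 1, 2, 3, ... values.\n",
"types_per_tool = df_to_ta.groupby('id')['attribute'].nunique()\n",
"types_per_tool.value_counts().sort_index()"
]
},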
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"# Draw on the axes created above; Series.plot ignores x= and y=, so they are omitted.\n",
"df_db_a.plot(kind='bar', figsize=(15,6), ax=ax)\n",
"plt.grid(alpha=0.6)\n",
"ax.yaxis.set_label_text(\"\")\n",
"ax.set_title(\"Number of Tools by Type of Analysis\", fontsize=15)\n",
"ax.set_xlabel('Type of Analysis', fontsize=14)\n",
"ax.set_ylabel('N of Tools', fontsize=14);\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## *Tool families* in TAPoR dataset items"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TAPoRware 55\n",
"Voyant 18\n",
"Digital Methods Initiative 12\n",
"Stanford NLP 11\n",
"SEASR 8\n",
"SIMILE Widgets 6\n",
"EURAC 5\n",
"CNRTL 5\n",
"Visualizing Literature 5\n",
"Book Genome Project 5\n",
"CHNM 4\n",
"Orlando 3\n",
"Laurence Anthony 3\n",
"Stanford HCI Group 2\n",
"Stanford Vis Group 2\n",
"Scholars' Lab 2\n",
"Name: attribute, dtype: int64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_to_tf=df_db_sub[df_db_sub['attributetype'] == 'Tool Family'].drop_duplicates()\n",
"df_to_tf = df_to_tf['attribute'].value_counts()\n",
"df_to_tf.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"# Draw on the axes created above; Series.plot ignores x= and y=, so they are omitted.\n",
"df_to_tf.plot(kind='bar', figsize=(15,6), ax=ax)\n",
"plt.grid(alpha=0.6)\n",
"ax.set_title(\"Number of Tools by Tool Families\", fontsize=15)\n",
"ax.set_xlabel('Tool Family', fontsize=14)\n",
"ax.set_ylabel('Number of Tools', fontsize=14)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## *Web Usable* in TAPoR items"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>name</th>\n",
" <th>detail</th>\n",
" <th>creators_name</th>\n",
" <th>last_updated</th>\n",
" <th>attributetype</th>\n",
" <th>attribute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>1</td>\n",
" <td>List Words - HTML (TAPoRware)</td>\n",
" <td>&lt;p&gt;This tool lists words in an HTML document, ...</td>\n",
" <td>Geoffrey Rockwell et. al.</td>\n",
" <td>2011-11-27</td>\n",
" <td>Web Usable</td>\n",
" <td>Run in Browser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>4</td>\n",
" <td>Wordle</td>\n",
" <td>&lt;p&gt;Wordle is an online toy for generating &lt;a h...</td>\n",
" <td>Jonathan Feinberg</td>\n",
" <td>2018-10-17</td>\n",
" <td>Web Usable</td>\n",
" <td>Run in Browser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>5</td>\n",
" <td>OrlandoVision (OVis)</td>\n",
" <td>&lt;p&gt;An application for visualizing a specific c...</td>\n",
" <td>The Orlando Project</td>\n",
" <td>2018-11-01</td>\n",
" <td>Web Usable</td>\n",
" <td>Software you Download and Install</td>\n",
" </tr>\n",
" <tr>\n",
" <th>118</th>\n",
" <td>8</td>\n",
" <td>Voyant Cirrus</td>\n",
" <td>&lt;p&gt;Cirrus is a visualization tool that display...</td>\n",
" <td>Stéfan Sinclair and Geoffrey Rockwell</td>\n",
" <td>2018-10-05</td>\n",
" <td>Web Usable</td>\n",
" <td>Run in Browser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>9</td>\n",
" <td>Voyant Links</td>\n",
" <td>&lt;p&gt;Links finds collocates for words and displa...</td>\n",
" <td>Stéfan Sinclair and Geoffrey Rockwell</td>\n",
" <td>2018-09-18</td>\n",
" <td>Web Usable</td>\n",
" <td>Run in Browser</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id name \\\n",
"16 1 List Words - HTML (TAPoRware) \n",
"52 4 Wordle \n",
"82 5 OrlandoVision (OVis) \n",
"118 8 Voyant Cirrus \n",
"192 9 Voyant Links \n",
"\n",
" detail \\\n",
"16 <p>This tool lists words in an HTML document, ... \n",
"52 <p>Wordle is an online toy for generating <a h... \n",
"82 <p>An application for visualizing a specific c... \n",
"118 <p>Cirrus is a visualization tool that display... \n",
"192 <p>Links finds collocates for words and displa... \n",
"\n",
" creators_name last_updated attributetype \\\n",
"16 Geoffrey Rockwell et. al. 2011-11-27 Web Usable \n",
"52 Jonathan Feinberg 2018-10-17 Web Usable \n",
"82 The Orlando Project 2018-11-01 Web Usable \n",
"118 Stéfan Sinclair and Geoffrey Rockwell 2018-10-05 Web Usable \n",
"192 Stéfan Sinclair and Geoffrey Rockwell 2018-09-18 Web Usable \n",
"\n",
" attribute \n",
"16 Run in Browser \n",
"52 Run in Browser \n",
"82 Software you Download and Install \n",
"118 Run in Browser \n",
"192 Run in Browser "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_to_bp = df_db_sub[df_db_sub['attributetype'] == 'Web Usable'].drop_duplicates()\n",
"df_to_bp.head()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Run in Browser 503\n",
"Other 400\n",
"Software you Download and Install 187\n",
"Web Application you Launch 8\n",
"Name: attribute, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_to_bp = df_to_bp['attribute'].value_counts()\n",
"df_to_bp.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(figsize=(15, 6))\n",
"# df_to_bp is a Series of counts (from value_counts), so x/y column names do not apply;\n",
"# draw the bars on the axes created above.\n",
"df_to_bp.plot(kind='bar', ax=ax)\n",
"ax.grid(alpha=0.6)\n",
"ax.set_title(\"Number of Tools by Web Usability\", fontsize=15)\n",
"ax.set_xlabel('Web Usable', fontsize=14)\n",
"ax.set_ylabel('Number of Tools', fontsize=14)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ------"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}