Example of using litstudy
This notebook shows an example of how to use litstudy from inside a Jupyter notebook. It shows how to load a dataset, plot general statistics, perform topic modeling, do network analysis, and use some more advanced features.
This notebook focuses on the topic of programming models for GPUs. GPUs (Graphics Processing Units) are specialized processors that are used in many data centers and supercomputers for data processing and machine learning. However, programming these devices remains difficult, which is why there is a plethora of research on developing programming models for GPUs.
Imports
[1]: # Import other libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbs

# Options for plots
plt.rcParams['figure.figsize'] = (10, 6)
sbs.set('paper')

# Import litstudy
path = os.path.abspath(os.path.join('..'))
if path not in sys.path:
    sys.path.append(path)

import litstudy

Collecting the dataset
For this example, we have queried both IEEE Xplore and Springer Link for "GPU" and "programming model". IEEE Xplore gives 5 CSV files (1 per page) and Springer Link gives a single CSV file. We load all CSV files and merge the resulting document sets into one.
[2]: # Load the CSV files
docs1 = litstudy.load_ieee_csv('data/ieee_1.csv')
docs2 = litstudy.load_ieee_csv('data/ieee_2.csv')
docs3 = litstudy.load_ieee_csv('data/ieee_3.csv')
docs4 = litstudy.load_ieee_csv('data/ieee_4.csv')
docs5 = litstudy.load_ieee_csv('data/ieee_5.csv')
docs_ieee = docs1 | docs2 | docs3 | docs4 | docs5
print(len(docs_ieee), 'papers loaded from IEEE')

docs_springer = litstudy.load_springer_csv('data/springer.csv')
print(len(docs_springer), 'papers loaded from Springer')

# Merge the two document sets
docs_csv = docs_ieee | docs_springer
print(len(docs_csv), 'papers loaded from CSV')

441 papers loaded from IEEE
1000 papers loaded from Springer
1441 papers loaded from CSV

We can also exclude some papers that we are not interested in. Here, we load a document set from a RIS file and subtract these documents from our original document set.
[3]: docs_exclude = litstudy.load_ris_file('data/exclude.ris')
docs_remaining = docs_csv - docs_exclude
print(len(docs_exclude), 'papers were excluded')
print(len(docs_remaining), 'papers remaining')

1 papers were excluded
1440 papers remaining

The amount of metadata provided by the CSV files is minimal. To enhance the metadata, we can find the corresponding articles on Scopus using refine_scopus. This function returns two sets: the set of documents that were found on Scopus and the set of original documents that were not found. We have two options for handling these two sets: (1) merge the two sets back into one set or (2) discard the documents that were not found. We choose the second option here for simplicity.
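For reference, option (1) would simply merge the two returned sets back together with the same | operator used above. A minimal sketch, using the variable names from the next cell:

# Option (1): keep all documents by merging the refined set with the
# documents that were not found on Scopus
docs_all = docs_scopus | docs_notfound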
[4]: import logging
logging.getLogger().setLevel(logging.CRITICAL)

docs_scopus, docs_notfound = litstudy.refine_scopus(docs_remaining)
print(len(docs_scopus), 'papers found on Scopus')
print(len(docs_notfound), 'papers were not found and were discarded')

100%|██████████| 1440/1440 [00:03<00:00, 361.20it/s]
1387 papers found on Scopus
53 papers were not found and were discarded

Next, we plot the number of documents per publication year.
[5]: litstudy.plot_year_histogram(docs_scopus);
The histogram reveals that one document was published in 1997. This document should not be in our set, since GPUs were not used for general-purpose computing before 2006. We can remove it by filtering on the year of publication, as the next cell does.
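Since filter_docs accepts an arbitrary predicate over each document, the same mechanism works for other metadata fields as well. A minimal sketch that filters on the title attribute (which is used later in this notebook); the actual year filter follows in the next cell:

# Illustrative only: keep documents whose title mentions CUDA
docs_cuda = docs_scopus.filter_docs(
    lambda d: 'cuda' in (d.title or '').lower()
)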
[6]: docs = docs_scopus.filter_docs(lambda d: d.publication_year >= 2000)

Print how many papers are left.
[7]: print(len(docs), 'papers remaining')

1386 papers remaining

General statistics
litstudy supports plotting many general statistics of the document set as histograms. We show some simple examples below.
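Several of these plot_* functions have a compute_* counterpart that returns the underlying pandas DataFrame instead of drawing a figure; compute_year_histogram and compute_source_histogram are used later in this notebook. A minimal sketch, assuming the groups argument is optional:

# Get the raw publication counts per year as a pandas DataFrame
year_counts = litstudy.compute_year_histogram(docs)
print(year_counts.head())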
[8]: litstudy.plot_year_histogram(docs, vertical=True);
[9]: litstudy.plot_affiliation_histogram(docs, limit=15);
[10]: litstudy.plot_author_histogram(docs);
[11]: litstudy.plot_language_histogram(docs);
[12]: litstudy.plot_number_authors_histogram(docs);
[13]: # These source names are long, which is why we provide short abbreviations.
mapping = {
    "IEEE International parallel and distributed processing symposium IPDPS":
        "IEEE IPDPS",
    "IEEE International parallel and distributed processing symposium workshops IPDPSW":
        "IEEE IPDPS Workshops",
}

litstudy.plot_source_histogram(docs, mapper=mapping, limit=15);
[14]: litstudy.plot_country_histogram(docs, limit=15);
[15]: litstudy.plot_continent_histogram(docs);
Network analysis
The network below shows an example of a co-citation network. This is a type of network where nodes represent documents and edges connect pairs of documents that have been cited together by other papers. The strength of an edge indicates how often the two documents have been cited together. Two papers with a high co-citation strength (i.e., a stronger edge) are usually highly related.
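A co-citation network can also be analyzed programmatically rather than only plotted. The sketch below assumes your litstudy version exposes a build_cocitation_network function that returns the graph as a networkx object; both the function name and its max_edges argument are assumptions here:

# Assumption: build_cocitation_network mirrors plot_cocitation_network and
# returns the co-citation graph as a networkx graph object
g = litstudy.build_cocitation_network(docs, max_edges=500)
print(g.number_of_nodes(), 'nodes and', g.number_of_edges(), 'edges')

# The five highest-degree nodes, i.e. the most frequently co-cited documents
print(sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:5])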
[16]: litstudy.plot_cocitation_network(docs, max_edges=500)

100%|██████████| 1000/1000 [00:00<00:00, 1752.38it/s]
BarnesHut Approximation took 0.14 seconds
Repulsion forces took 0.32 seconds
Gravitational forces took 0.01 seconds
Attraction forces took 0.01 seconds
AdjustSpeedAndApplyForces step took 0.04 seconds

Topic modeling
litstudy supports automatic topic discovery based on the words used in document abstracts. We show an example below. First, we need to build a corpus from the document set. Note that build_corpus supports many arguments to tweak the preprocessing stage. In this example, we pass ngram_threshold=0.8. This argument adds commonly used n-grams (i.e., frequent sequences of consecutive words) to the corpus. For instance, "artificial intelligence" is a common bigram, so the token artificial_intelligence is added to the corpus.
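To pick a suitable threshold, one can compare how many n-grams are detected at different values; presumably a lower threshold admits more n-grams, but this is an assumption worth verifying on your own data. A minimal sketch that reuses build_corpus and the compute_word_distribution function introduced below (the two threshold values are illustrative):

# Count how many n-gram tokens (containing '_') are detected per threshold
for threshold in (0.7, 0.9):
    c = litstudy.build_corpus(docs, ngram_threshold=threshold)
    ngrams = litstudy.compute_word_distribution(c).filter(like='_', axis=0)
    print(f'ngram_threshold={threshold}: {len(ngrams)} n-gram tokens')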
[17]: corpus = litstudy.build_corpus(docs, ngram_threshold=0.8)

We can compute a word distribution using litstudy.compute_word_distribution, which shows how often each word occurs across all documents. In this example, we focus only on n-grams by selecting tokens that contain a _. We see that terms such as artificial intelligence and trade offs have indeed been recognized as common bigrams.
[18]: litstudy.compute_word_distribution(corpus).filter(like='_', axis=0).sort_index()

[18]: | | count |
|---|---|
| artificial_intelligence | 13 |
| author_exclusive | 41 |
| berlin_heidelberg | 83 |
| chinese_academy | 6 |
| coarse_grained | 16 |
| ... | ... |
| synthetic_aperture | 7 |
| trade_offs | 10 |
| unified_device | 108 |
| xeon_phi | 21 |
| zhejiang_university | 6 |
63 rows × 1 columns
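Since the distribution is a regular pandas DataFrame with a count column, standard pandas operations apply. For example, a minimal sketch that lists the ten most frequent n-grams instead of sorting them alphabetically:

dist = litstudy.compute_word_distribution(corpus)
# Show the ten most frequent n-grams (tokens containing '_')
print(dist.filter(like='_', axis=0).sort_values('count', ascending=False).head(10))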
Let’s visualize the word distribution from this corpus.
[19]: plt.figure(figsize=(20, 3))
litstudy.plot_word_distribution(corpus, limit=50, title="Top words",
                                vertical=True, label_rotation=45);
This word distribution looks reasonable. Next, we train an NMF topic model. Topic modeling is a technique from natural language processing for discovering abstract "topics" in a set of documents. We need to manually select the number of desired topics; here we choose 15. It is recommended to experiment with more or fewer topics to obtain topics that are more fine-grained or more coarse-grained, as sketched below.
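For example, a minimal sketch that compares several topic counts side by side, using the same train_nmf_model and best_tokens_for_topic functions that the following cells use (the particular counts are illustrative):

# Train a model for several topic counts and print the top words of each topic
for k in (5, 10, 15, 20):
    model = litstudy.train_nmf_model(corpus, k, max_iter=250)
    print(f'--- {k} topics ---')
    for i in range(k):
        print(f'Topic {i + 1}:', model.best_tokens_for_topic(i))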
[20]: num_topics = 15
topic_model = litstudy.train_nmf_model(corpus, num_topics, max_iter=250)

To understand the result of NMF, we can print the top words for each topic.
[21]: for i in range(num_topics):
    print(f'Topic {i+1}:', topic_model.best_tokens_for_topic(i))

Topic 1: ['cluster', 'mpi', 'node', 'hybrid', 'communication']
Topic 2: ['mapreduce', 'big', 'data', 'hadoop', 'cloud']
Topic 3: ['simulation', 'particle', 'numerical', 'fluid', 'flow']
Topic 4: ['learning', 'network', 'deep', 'deep_learning', 'training']
Topic 5: ['fpga', 'memory', 'access', 'cache', 'bandwidth']
Topic 6: ['openacc', 'compiler', 'openmp', 'directive', 'language']
Topic 7: ['image', 'segmentation', 'algorithm', 'medical', 'sensing']
Topic 8: ['sequence', 'alignment', 'protein', 'database', 'search']
Topic 9: ['video', 'decoding', 'encoding', 'ldpc', 'motion']
Topic 10: ['gpgpu', 'cuda', 'code', 'general_purpose', 'general']
Topic 11: ['energy', 'heterogeneous', 'power', 'consumption', 'systems']
Topic 12: ['graph', 'vertex', 'framework', 'analytics', 'edge']
Topic 13: ['scheduling', 'task', 'heterogeneous', 'resources', 'execution']
Topic 14: ['intel', 'matrix', 'phi', 'xeon', 'cloud']
Topic 15: ['opencl', 'portability', 'benchmark', 'platforms', 'sycl']

An alternative way to visualize the output of NMF is to plot each discovered topic as a word cloud. The size of each word in a cloud indicates the importance of that word for that topic.
[22]: plt.figure(figsize=(15, 5))
litstudy.plot_topic_clouds(topic_model, ncols=5);
These 15 topics look promising. For example, there is one topic on graphs, one on OpenACC (the Open Accelerators programming standard), one on OpenCL (the Open Computing Language), one on FPGAs (field-programmable gate arrays), etc.
We can visualize the results as a "landscape" plot. This is a visually appealing way to place documents on a 2D plane. The documents are placed such that similar documents are located close to each other. Note that this is a non-linear embedding, so distances between documents should not be interpreted as linear measures of similarity.
[23]: plt.figure(figsize=(20, 20))
litstudy.plot_embedding(corpus, topic_model);
Advanced topic modeling
We can combine the results of topic modeling with the plotting of statistics. Here we show a simple example.
One of the topics appears to be about deep learning. First, we find the id of the topic to which the token "deep_learning" most strongly belongs.
[24]: topic_id = topic_model.best_topic_for_token('deep_learning')

Let's print the top 10 papers that most strongly belong to this topic to check the results. We see that these are indeed documents on the topic of deep learning.
[25]: for doc_id in topic_model.best_documents_for_topic(topic_id, limit=10):
    print(docs[int(doc_id)].title)

High performance networked computing in media, services and information management
What do Programmers Discuss about Deep Learning Frameworks
Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware
Deep learning for intelligent traffic sensing and prediction: recent advances and future challenges
SOLAR: Services-oriented learning architectures: Deep learning as a service
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment
Network Management 2030: Operations and Control of Network 2030 Services
SOLAR: Services-Oriented Deep Learning Architectures-Deep Learning as a Service
DLPlib: A Library for Deep Learning Processor
Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Next, we annotate the document set with a "dl_topic" tag for documents that strongly belong to this topic (i.e., have a weight above a certain threshold).
After this, we define two groups: documents that have the tag "dl_topic" and documents that do not. Now we can, for instance, plot the publications per year for both groups to see whether interest in deep learning has increased or decreased over time.
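The threshold in the next cell is applied to topic_model.doc2topic which, judging from its use below, is a documents-by-topics weight matrix. A minimal sketch for inspecting these weights directly (the 0.1 cut-off is illustrative):

# Inspect the topic weights of the first document in the set
weights = topic_model.doc2topic[0]
for i, w in enumerate(weights):
    if w > 0.1:
        print(f'Topic {i + 1}: weight {w:.2f}')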
[26]: threshold = 0.2
dl_topic = topic_model.doc2topic[:, topic_id] > threshold
docs = docs.add_property('dl_topic', dl_topic)

groups = {
    'deep learning related': 'dl_topic',
    'other': 'not dl_topic',
}

litstudy.plot_year_histogram(docs, groups=groups, stacked=True);
The histogram shows that interest in deep learning has clearly risen over the years. We can quantify this by computing the percentage of documents on deep learning per year. The example below shows that this percentage has increased from just 3.4% in 2011 to 13.6% in 2021.
[27]: table = litstudy.compute_year_histogram(docs, groups=groups)
table.div(table.sum(axis=1), axis=0) * 100

[27]: | | deep learning related | other |
|---|---|---|
| 2005 | 0.000000 | 100.000000 |
| 2006 | 0.000000 | 100.000000 |
| 2007 | 0.000000 | 100.000000 |
| 2008 | 6.250000 | 93.750000 |
| 2009 | 0.000000 | 100.000000 |
| 2010 | 2.127660 | 97.872340 |
| 2011 | 3.409091 | 96.590909 |
| 2012 | 2.941176 | 97.058824 |
| 2013 | 3.738318 | 96.261682 |
| 2014 | 3.305785 | 96.694215 |
| 2015 | 9.565217 | 90.434783 |
| 2016 | 7.272727 | 92.727273 |
| 2017 | 3.448276 | 96.551724 |
| 2018 | 7.200000 | 92.800000 |
| 2019 | 9.027778 | 90.972222 |
| 2020 | 9.734513 | 90.265487 |
| 2021 | 13.600000 | 86.400000 |
| 2022 | 14.285714 | 85.714286 |
Alternatively, we can plot the two groups per publication source. We can see that some journals/conferences have a strong focus on deep learning (e.g., "Neural Computing and Applications"), while others have few or no publications on deep learning (e.g., "Journal of Real Time Image Processing").
[28]: plt.figure(figsize=(10, 10))
litstudy.plot_source_histogram(docs, groups=groups, limit=25, stacked=True);
We can even rank the publication venues in our dataset by their share of deep-learning-related papers using some simple pandas functions. By this measure, "Neural Computing and Applications" comes out on top.
[29]: # Compute histogram by publication venue
table = litstudy.compute_source_histogram(docs, groups=groups)

# Add column 'total'
table['total'] = table['deep learning related'] + table['other']

# Remove rare venues that have less than 5 publications
table = table[table['total'] >= 5]

# Add column 'ratio'
table['ratio'] = table['deep learning related'] / table['total'] * 100

# Sort by ratio in descending order
table.sort_values(by='ratio', ascending=False)

[29]: | | deep learning related | other | total | ratio |
|---|---|---|---|---|
| Neural Computing and Applications | 3 | 3 | 6 | 50.000000 |
| Computing | 3 | 12 | 15 | 20.000000 |
| IEEE International Symposium on Workload Characterization IISWC | 1 | 4 | 5 | 20.000000 |
| Science China Information Sciences | 2 | 8 | 10 | 20.000000 |
| IEEE High Performance Extreme Computing Conference HPEC | 1 | 7 | 8 | 12.500000 |
| European Physical Journal C | 1 | 7 | 8 | 12.500000 |
| Journal of Grid Computing | 1 | 7 | 8 | 12.500000 |
| Journal of Computer Science and Technology | 3 | 21 | 24 | 12.500000 |
| Multimedia Tools and Applications | 3 | 25 | 28 | 10.714286 |
| Journal of Supercomputing | 20 | 205 | 225 | 8.888889 |
| BMC Bioinformatics | 2 | 22 | 24 | 8.333333 |
| Journal of Big Data | 1 | 11 | 12 | 8.333333 |
| Cluster Computing | 3 | 52 | 55 | 5.454545 |
| International Journal of Parallel Programming | 5 | 87 | 92 | 5.434783 |
| IEEE International Parallel and Distributed Processing Symposium Workshops IPDPSW | 1 | 21 | 22 | 4.545455 |
| Journal of Signal Processing Systems | 1 | 42 | 43 | 2.325581 |
| Soft Computing | 0 | 16 | 16 | 0.000000 |
| BMC Research Notes | 0 | 6 | 6 | 0.000000 |
| Journal of the Brazilian Society of Mechanical Sciences and Engineering | 0 | 5 | 5 | 0.000000 |
| IEEE International Conference on Cluster Computing CLUSTER | 0 | 5 | 5 | 0.000000 |
| IEEE International Symposium on High Performance Computer Architecture HPCA | 0 | 5 | 5 | 0.000000 |
| IEEE International Conference on Parallel and Distributed Systems ICPADS | 0 | 5 | 5 | 0.000000 |
| Journal of Real Time Image Processing | 0 | 57 | 57 | 0.000000 |
| Journal of Scientific Computing | 0 | 5 | 5 | 0.000000 |
| Computing and Visualization in Science | 0 | 5 | 5 | 0.000000 |
| Visual Computer | 0 | 7 | 7 | 0.000000 |
| IEEE ACM International Symposium on Cluster Cloud and Grid Computing CCGrid | 0 | 6 | 6 | 0.000000 |
| Euromicro International Conference on Parallel Distributed and Network Based Processing PDP | 0 | 6 | 6 | 0.000000 |
| IEEE Transactions on Parallel and Distributed Systems | 0 | 7 | 7 | 0.000000 |
| International Conference for High Performance Computing Networking Storage and Analysis SC | 0 | 7 | 7 | 0.000000 |
| Frontiers of Computer Science | 0 | 8 | 8 | 0.000000 |
| VLDB Journal | 0 | 8 | 8 | 0.000000 |
| Computer Science Research and Development | 0 | 10 | 10 | 0.000000 |
| IEEE International Parallel and Distributed Processing Symposium IPDPS | 0 | 11 | 11 | 0.000000 |
| International Conference on High Performance Computing and Simulation HPCS | 0 | 5 | 5 | 0.000000 |