arXiv: a rudimentary EDA

Check out the notebook for this post here

In this post we’re going to perform a straightforward Exploratory Data Analysis (EDA) on a dataset: we load it, perform some sensible preprocessing steps, generate some summary statistics to get a sense of the data, then answer some more interesting questions about the dataset with a few plots.

The dataset we’re going to examine is the arXiv dataset from Kaggle, a “repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.”

arXiv provides open access to scholarly articles (such as research papers), so we can do some interesting analysis of, say, how interest in AI/ML related articles has changed over time.

Creating a DataFrame

import numpy as np 
import pandas as pd
import plotly.express as px  # plotly_express is now bundled with plotly as plotly.express
import os
import json
pd.set_option('float_format', '{:f}'.format)

Let’s load the data. The JSON file in the dataset is huge, so we use ‘yield’ to read it one line at a time inside a loop, which avoids memory problems.

file_path = 'arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(file_path, 'r') as f:
        for line in f:
            yield line
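
As a minimal sketch of the same lazy-loading idea, the generator below also parses each line and stops after a couple of records. The in-memory sample and the helper name iter_records are assumptions made for illustration; they stand in for the real snapshot file.

```python
import io
import json
from itertools import islice


def iter_records(handle):
    """Yield one parsed JSON record per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Tiny in-memory stand-in for the real snapshot file (hypothetical records).
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph"}\n'
    '{"id": "0704.0002", "categories": "math.CO cs.CG"}\n'
    '{"id": "0704.0003", "categories": "physics.gen-ph"}\n'
)

# islice pulls only the first two records; the third line is never parsed.
first_two = list(islice(iter_records(sample), 2))
print([record["id"] for record in first_two])
```

Because nothing is read until the loop asks for it, peeking at a few records costs almost nothing even on a very large file.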

Looking at one example of a paper we see lots of information available: a series of dates in the ‘versions’, author names, an abstract, categories and so on.

metadata = get_metadata()
for paper in metadata:
    for k, v in json.loads(paper).items():
        print(f'{k}: {v}')
    break
id: 0704.0001
submitter: Pavel Nadolsky
authors: C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan
title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies
comments: 37 pages, 15 figures; published version
journal-ref: Phys.Rev.D76:013009,2007
doi: 10.1103/PhysRevD.76.013009
report-no: ANL-HEP-PR-07-12
categories: hep-ph
license: None
abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
that enhanced sensitivity to the signal can be obtained with judicious
selection of events.

versions: [{'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}, {'version': 'v2', 'created': 'Tue, 24 Jul 2007 20:10:27 GMT'}]
update_date: 2008-11-26
authors_parsed: [['Balázs', 'C.', ''], ['Berger', 'E. L.', ''], ['Nadolsky', 'P. M.', ''], ['Yuan', 'C. -P.', '']]

We now extract the subset of fields on which we will perform our analysis.

titles, abstracts, versions, categories, doi, authors_parsed = [], [], [], [], [], []

metadata = get_metadata()
# loop over all papers
for paper in metadata:
    # extract single paper
    paper_dict = json.loads(paper)
    version = paper_dict.get('versions')
    try:
        versions.append(version[-1]['created']) # use the creation date of the most recent version
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))
        categories.append(paper_dict.get('categories'))
        doi.append(paper_dict.get('doi'))
        authors_parsed.append(paper_dict.get('authors_parsed'))
    except (TypeError, IndexError, KeyError):
        # skip records whose 'versions' field is missing or malformed
        pass

Let’s create a pandas dataframe to make our analysis easier:

papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'categories': categories,
    'version': versions,
    'doi': doi,
    'authors': authors_parsed
})
# free the memory held by the intermediate lists
del titles, abstracts, versions, categories, doi, authors_parsed

papers.head()
title abstract categories version doi authors
0 Calculation of prompt diphoton production cros... A fully differential calculation in perturba... hep-ph Tue, 24 Jul 2007 20:10:27 GMT 10.1103/PhysRevD.76.013009 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1 Sparsity-certifying Graph Decompositions We describe a new algorithm, the $(k,\ell)$-... math.CO cs.CG Sat, 13 Dec 2008 17:26:00 GMT None [[Streinu, Ileana, ], [Theran, Louis, ]]
2 The evolution of the Earth-Moon system based o... The evolution of Earth-Moon system is descri... physics.gen-ph Sun, 13 Jan 2008 00:36:28 GMT None [[Pan, Hongjun, ]]
3 A determinant of Stirling cycle numbers counts... We show that a determinant of Stirling cycle... math.CO Sat, 31 Mar 2007 03:16:14 GMT None [[Callan, David, ]]
4 From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... In this paper we show how to compute the $\L... math.CA math.FA Mon, 2 Apr 2007 18:09:58 GMT None [[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]

Preprocessing

As we can see, the columns hold different data types, and some None (missing) values are present. The isna() method is helpful for finding None or NaN values in the data. As an example, let’s see how many of our papers do not have a DOI:

print('{} of {} papers without DOI'.format(papers['doi'].isna().sum(), len(papers)))
846236 of 1765688 papers without DOI

Since a DOI is often added after a paper is published (source), we won’t drop the papers that lack one, but if we wanted to, something such as papers[papers.doi.notnull()] would do the trick.
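
A minimal sketch of counting and filtering missing values, on a toy frame with made-up rows (the titles and DOIs here are placeholders, not records from the dataset):

```python
import pandas as pd

# Toy frame (made-up rows): two of the four papers have no DOI.
toy = pd.DataFrame({
    'title': ['paper a', 'paper b', 'paper c', 'paper d'],
    'doi': ['10.1000/a', None, '10.1000/c', None],
})

n_missing = int(toy['doi'].isna().sum())  # count rows where doi is None/NaN
with_doi = toy[toy['doi'].notna()]        # keep only the rows that have a DOI

print('{} of {} papers without DOI'.format(n_missing, len(toy)))
print(with_doi['title'].tolist())
```

Note that the filter removes rows, not the column, so the remaining frame still has a doi column with no gaps.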

Next let’s clean up some of the columns:

  • We want the abstract to be one continuous string of text without new lines.
  • The version dates need to be datetime objects, from which we extract the month and year of publication.
  • The authors column would be better as a list of strings.
# clean abstracts: replace newlines with spaces so words at line breaks stay separate
papers['abstract'] = papers['abstract'].apply(lambda x: x.replace("\n", " "))
papers['abstract'] = papers['abstract'].apply(lambda x: x.strip())

# extract date time info 
papers['DateTime']= pd.to_datetime(papers['version'])
papers['month'] = pd.DatetimeIndex(papers['DateTime']).month
papers['year'] = pd.DatetimeIndex(papers['DateTime']).year

# clean authors
papers['authors']= papers['authors'].apply(lambda authors:[(" ".join(a)).strip() for a in authors])
papers.head()
title abstract categories version doi authors DateTime month year
0 Calculation of prompt diphoton production cros... A fully differential calculation in perturbati... hep-ph Tue, 24 Jul 2007 20:10:27 GMT 10.1103/PhysRevD.76.013009 [Balázs C., Berger E. L., Nadolsky P. M., Yuan... 2007-07-24 20:10:27+00:00 7 2007
1 Sparsity-certifying Graph Decompositions We describe a new algorithm, the $(k,\ell)$-pe... math.CO cs.CG Sat, 13 Dec 2008 17:26:00 GMT None [Streinu Ileana, Theran Louis] 2008-12-13 17:26:00+00:00 12 2008
2 The evolution of the Earth-Moon system based o... The evolution of Earth-Moon system is describe... physics.gen-ph Sun, 13 Jan 2008 00:36:28 GMT None [Pan Hongjun] 2008-01-13 00:36:28+00:00 1 2008
3 A determinant of Stirling cycle numbers counts... We show that a determinant of Stirling cycle n... math.CO Sat, 31 Mar 2007 03:16:14 GMT None [Callan David] 2007-03-31 03:16:14+00:00 3 2007
4 From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... In this paper we show how to compute the $\Lam... math.CA math.FA Mon, 2 Apr 2007 18:09:58 GMT None [Abu-Shammala Wael, Torchinsky Alberto] 2007-04-02 18:09:58+00:00 4 2007
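
To make the three cleaning steps concrete, here is a rough sketch that applies them to a single record using only the standard library. It is an illustration and not the pandas code used above; the record is a shortened copy of the first paper shown earlier, and the variable names are assumptions.

```python
from email.utils import parsedate_to_datetime

# One record in the raw format shown earlier in the post (abstract shortened).
raw = {
    'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\n'
                'presented for the production of massive photon pairs at hadron colliders.\n',
    'version': 'Tue, 24 Jul 2007 20:10:27 GMT',
    'authors': [['Balázs', 'C.', ''], ['Berger', 'E. L.', '']],
}

# Join the wrapped lines of the abstract with single spaces.
abstract = ' '.join(raw['abstract'].split())

# Parse the email-style date string into a timezone-aware datetime.
created = parsedate_to_datetime(raw['version'])

# Collapse each [last, first, suffix] triple into one string.
authors = [' '.join(parts).strip() for parts in raw['authors']]

print(abstract)
print(created.year, created.month)
print(authors)
```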

Analysis

Looking at each paper, we might want to know things such as: how many categories does it have? How many words are in the abstract? How many authors does it have? Pandas makes this easy with the apply method, which applies an arbitrary function to one column to produce a new one, as below.

papers['num_categories'] = papers['categories'].apply(lambda x: len(x.split())).astype('int')  # categories are space-separated
papers['num_words_abstract'] = papers['abstract'].apply(lambda x:len(x.split())).astype('int')
papers['num_authors'] = papers['authors'].apply(lambda x:len(x)).astype('int')
papers.head()
title abstract categories version doi authors DateTime month year num_categories num_words_abstract num_authors
0 Calculation of prompt diphoton production cros... A fully differential calculation in perturbati... hep-ph Tue, 24 Jul 2007 20:10:27 GMT 10.1103/PhysRevD.76.013009 [Balázs C., Berger E. L., Nadolsky P. M., Yuan... 2007-07-24 20:10:27+00:00 7 2007 6 127 4
1 Sparsity-certifying Graph Decompositions We describe a new algorithm, the $(k,\ell)$-pe... math.CO cs.CG Sat, 13 Dec 2008 17:26:00 GMT None [Streinu Ileana, Theran Louis] 2008-12-13 17:26:00+00:00 12 2008 13 105 2
2 The evolution of the Earth-Moon system based o... The evolution of Earth-Moon system is describe... physics.gen-ph Sun, 13 Jan 2008 00:36:28 GMT None [Pan Hongjun] 2008-01-13 00:36:28+00:00 1 2008 14 133 1
3 A determinant of Stirling cycle numbers counts... We show that a determinant of Stirling cycle n... math.CO Sat, 31 Mar 2007 03:16:14 GMT None [Callan David] 2007-03-31 03:16:14+00:00 3 2007 7 32 1
4 From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... In this paper we show how to compute the $\Lam... math.CA math.FA Mon, 2 Apr 2007 18:09:58 GMT None [Abu-Shammala Wael, Torchinsky Alberto] 2007-04-02 18:09:58+00:00 4 2007 15 35 2
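
The apply pattern can be sketched on a toy frame as follows. The rows and the helper name count_tokens are assumptions for illustration. Since categories is a space-separated string, the sketch splits it before counting.

```python
import pandas as pd

# Toy frame with two made-up rows in the same shape as the cleaned data.
toy = pd.DataFrame({
    'categories': ['hep-ph', 'math.CO cs.CG'],
    'abstract': ['A fully differential calculation', 'We describe a new algorithm'],
    'authors': [['Balázs C.', 'Berger E. L.'], ['Streinu Ileana']],
})


def count_tokens(text):
    """Number of whitespace-separated tokens in a string."""
    return len(text.split())


toy['num_categories'] = toy['categories'].apply(count_tokens)
toy['num_words_abstract'] = toy['abstract'].apply(count_tokens)
toy['num_authors'] = toy['authors'].apply(len)

# Equivalent vectorised form using the .str accessor.
toy['num_words_vectorised'] = toy['abstract'].str.split().str.len()

print(toy[['num_categories', 'num_words_abstract', 'num_authors']])
```

A named function is easier to test and reuse than a lambda, and on large frames the .str form usually avoids some of the per-row Python overhead.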

We’re now ready to ask some questions about the data, each of which can be answered in a few lines of code:

1. How many authors do papers have?

Here we use the describe() method to get a high-level summary of the column in question; a histogram isn’t necessary at this point. We see that most papers have 4 authors or fewer, but there’s a heavy right tail, with some papers having a few thousand authors. Odd.

papers['num_authors'].astype('int').describe()
count   1765688.000000
mean          4.153911
std          20.305943
min           1.000000
25%           2.000000
50%           3.000000
75%           4.000000
max        2832.000000
Name: num_authors, dtype: float64
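
To see why the mean and standard deviation sit so far from the quartiles, here is a small sketch with made-up author counts in which a single huge collaboration dominates:

```python
import pandas as pd

# Made-up author counts: six ordinary papers and one huge collaboration.
num_authors = pd.Series([1, 2, 2, 3, 3, 4, 2832])

summary = num_authors.describe()
print(summary[['mean', '50%', 'max']])

# Quantiles show where the bulk of the data sits, largely ignoring the tail.
print(num_authors.quantile([0.5, 0.75]))
```

The median stays at 3 while the mean is pulled into the hundreds, which is the same pattern, at a smaller scale, as in the summary above.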

Let’s look into the heavy tail a bit more. We see that the top 3 papers with the most authors are in the ‘hep-ex’ category, which stands for ‘High Energy Physics - Experiment’. If we read the abstract of the most-authored paper, we see that it is in fact the result of an experiment at the Large Hadron Collider at CERN, the product of worldwide scientific collaboration (which makes sense of the 2832 authors!).

papers.sort_values(by='num_authors', ascending=False).head()
title abstract categories version doi authors DateTime month year num_categories num_words_abstract num_authors
574385 Observation of the rare $B^0_s\to\mu^+\mu^-$ d... A joint measurement is presented of the branch... hep-ex hep-ph Mon, 17 Aug 2015 15:53:53 GMT 10.1038/nature14474 [CMS The, Collaborations LHCb, :, Khachatryan ... 2015-08-17 15:53:53+00:00 8 2015 13 99 2832
101754 Expected Performance of the ATLAS Experiment -... A detailed study is presented of the expected ... hep-ex Fri, 14 Aug 2009 13:50:42 GMT None [The ATLAS Collaboration, Aad G., Abat E., Abb... 2009-08-14 13:50:42+00:00 8 2009 6 80 2612
535194 The Physics of the B Factories This work is on the Physics of the B Factories... hep-ex hep-ph Sat, 31 Oct 2015 06:42:11 GMT 10.1140/epjc/s10052-014-3026-9 [Bevan A. J., Golob B., Mannel Th., Prell S., ... 2015-10-31 06:42:11+00:00 10 2015 13 111 2034
901222 Search for High-energy Neutrinos from Binary N... The Advanced LIGO and Advanced Virgo observato... astro-ph.HE Thu, 9 Nov 2017 05:44:40 GMT 10.3847/2041-8213/aa9aed [Albert A. ANTARES, IceCube, Pierre\n Auger,... 2017-11-09 05:44:40+00:00 11 2017 11 170 1945
1041880 Search for Multi-messenger Sources of Gravitat... Astrophysical sources of gravitational waves, ... astro-ph.HE Thu, 15 Nov 2018 21:37:04 GMT 10.3847/1538-4357/aaf21d [ANTARES, IceCube, LIGO, Collaborations Virgo,... 2018-11-15 21:37:04+00:00 11 2018 11 133 1595
papers.iloc[574385]['abstract'][:250]
'A joint measurement is presented of the branching fractions$B^0_s\\to\\mu^+\\mu^-$ and $B^0\\to\\mu^+\\mu^-$ in proton-proton collisions at theLHC by the CMS and LHCb experiments. The data samples were collected in 2011 ata centre-of-mass energy of 7 TeV, '

2. How many words does the average abstract have?

Next let’s look at how long abstracts are in general, where a histogram is a good option to visualise the data. As expected, most aren’t too long, with the average abstract being 122 words long.

fig = px.histogram(papers, x="num_words_abstract", nbins=500)
fig.show()

3. How many papers have been produced over time?

We plot a line chart to answer this question. To count the number of papers per year, we use the groupby() method, a powerful tool in pandas for aggregating information in a large variety of ways.

There’s a steady growth in papers published over time as arXiv became more popular and wide reaching, perhaps also reflecting higher research output in science across the world. Since we’re only halfway through 2020 at the time of writing, the line tails off at the end, as expected.

papers_per_year = papers.groupby(['year']).size().reset_index().rename(columns={0:'n_papers'})
fig = px.line(x='year', y='n_papers', data_frame=papers_per_year)
fig.show()
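
For reference, the counting step can be sketched on a toy frame like this. The years are made up, and reset_index(name=...) is used here as a shorter alternative to the rename call above.

```python
import pandas as pd

# Toy frame: one row per paper, with a made-up publication year.
toy = pd.DataFrame({'year': [2018, 2019, 2019, 2020, 2020, 2020]})

# size() counts rows per group; reset_index(name=...) names the count column directly.
per_year = toy.groupby('year').size().reset_index(name='n_papers')
print(per_year)
```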

4. In which months are the most papers published?

We see little variation across months except a slight decrease across the winter and Christmas months, but nothing significant.

papers_per_month = papers.groupby(['month']).size().reset_index().rename(columns={0:'n_papers'})
fig = px.bar(x='month', y='n_papers', data_frame=papers_per_month)
fig.show()

AI & ML

Now let’s filter for papers related to AI & ML. It’s not as simple as using papers['categories'].isin(ai_list), since many papers have more than one category. So we use the intermediate step of checking whether any of the paper’s categories are in the list. If at least one is, we assign a value of True to an intermediate column and filter on it. Note this could have been done in a single line, but for clarity we split it out.

ai_list = ['cs.AI', 'cs.LG', 'stat.ML']
# True if any of the paper's space-separated categories is an AI/ML category
papers['is_ai'] = papers['categories'].apply(lambda x: any(ele in x.split() for ele in ai_list))
ai_papers = papers[papers['is_ai']]
ai_papers.head()
title abstract categories version doi authors DateTime month year num_categories num_words_abstract num_authors is_ai
46 Intelligent location of simultaneously active ... The intelligent acoustic emission locator is d... cs.NE cs.AI Sun, 1 Apr 2007 13:06:50 GMT None [Kosel T., Grabec I.] 2007-04-01 13:06:50+00:00 4 2007 11 155 2 True
49 Intelligent location of simultaneously active ... Part I describes an intelligent acoustic emiss... cs.NE cs.AI Sun, 1 Apr 2007 18:53:13 GMT None [Kosel T., Grabec I.] 2007-04-01 18:53:13+00:00 4 2007 11 124 2 True
303 The World as Evolving Information This paper discusses the benefits of describin... cs.IT cs.AI math.IT q-bio.PE Wed, 13 Oct 2010 19:49:16 GMT 10.1007/978-3-642-18003-3_10 [Gershenson Carlos] 2010-10-13 19:49:16+00:00 10 2010 28 107 1 True
670 Learning from compressed observations The problem of statistical learning is to cons... cs.IT cs.LG math.IT Thu, 5 Apr 2007 02:57:15 GMT 10.1109/ITW.2007.4313111 [Raginsky Maxim] 2007-04-05 02:57:15+00:00 4 2007 19 138 1 True
953 Sensor Networks with Random Links: Topology De... In a sensor network, in practice, the communic... cs.IT cs.LG math.IT Fri, 6 Apr 2007 21:58:52 GMT 10.1109/TSP.2008.920143 [Kar Soummya, Moura Jose M. F.] 2007-04-06 21:58:52+00:00 4 2007 19 244 2 True
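
A small sketch of why isin() alone is not enough, using four made-up category strings. The names ai_set, exact and overlap are assumptions for this example.

```python
import pandas as pd

ai_set = {'cs.AI', 'cs.LG', 'stat.ML'}
cats = pd.Series(['hep-ph', 'cs.NE cs.AI', 'cs.LG', 'math.CO cs.CG'])

# isin compares the whole string, so multi-category papers are missed.
exact = cats.isin(ai_set)

# Splitting first and testing for any overlap catches them.
overlap = cats.apply(lambda s: not ai_set.isdisjoint(s.split()))

print(pd.DataFrame({'categories': cats, 'isin': exact, 'any_overlap': overlap}))
```

Only the single-category 'cs.LG' row matches with isin(), whereas the overlap test also picks up 'cs.NE cs.AI'.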

As before, let’s try to answer some questions about this subset of the data.

5. What is the growth of AI & ML papers published over the years?

For this we need to count both the total number of papers published each year as well as the total number of AI & ML papers published each year; then we compare the two and plot the result. As before we use the groupby() method on the main and smaller dataframes, then use merge() to compare the two.

The plot shows exponential growth in the topic’s share of papers kicking off around 2006, a few years before ImageNet was released in 2009, with several other seminal papers contributing to this explosion of growth.

# total papers published per year
all_papers_per_year = papers.groupby(['year']).size().reset_index().rename(columns={0:'all'})
# AI & ML papers published per year
ai_papers_per_year = ai_papers.groupby(['year']).size().reset_index().rename(columns={0:'AI'})

# merge on year and calculate the AI share of all papers
compare = all_papers_per_year.merge(ai_papers_per_year, on='year', how='inner')
compare['ratio'] = compare['AI']/compare['all']

# plot
fig = px.line(x='year', y='ratio', data_frame=compare)
fig.show()
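
One design choice worth noting: an inner merge drops any year that has no AI & ML papers at all. A possible variant, sketched here on made-up counts, uses a left merge and fills the gaps with zero so that those years appear with a ratio of 0.

```python
import pandas as pd

# Made-up yearly counts; 1993 has no AI papers, so it is absent from ai_per_year.
all_per_year = pd.DataFrame({'year': [1993, 1994, 1995], 'all': [100, 200, 400]})
ai_per_year = pd.DataFrame({'year': [1994, 1995], 'AI': [2, 10]})

# A left merge keeps every year from the full table; missing AI counts become NaN.
compare = all_per_year.merge(ai_per_year, on='year', how='left')
compare['AI'] = compare['AI'].fillna(0)
compare['ratio'] = compare['AI'] / compare['all']

print(compare)
```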

6. Which authors have published the most work?

Yoshua Bengio, one of the fathers of Deep Learning, comes top of the list with over 300 papers published in our dataset! An impressive number.

# flatten list of authors
authors = [y for x in ai_papers['authors'].tolist() for y in x]
authors_df = pd.DataFrame({'authors': authors}).groupby(['authors']).size().reset_index().rename(columns={0:'count'})
authors_df = authors_df.sort_values('count',ascending=False).head(15)
# plot
fig = px.bar(x='count', y='authors', data_frame=authors_df)
fig.show()
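
The flatten-and-count step can also be sketched with the standard library’s Counter. The author lists below are placeholders, not real data.

```python
from collections import Counter

# Placeholder author lists, one list per paper.
author_lists = [
    ['Author A', 'Author B'],
    ['Author A'],
    ['Author B', 'Author A', 'Author C'],
]

# Flatten the nested lists and count how often each name appears.
counts = Counter(name for authors in author_lists for name in authors)
top = counts.most_common(2)
print(top)
```

Counting by name string treats different people with the same name as one author and splits one person across spelling variants, so the totals are best read as approximate.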

Conclusion

There you have a quick intro to EDA, with some useful pandas methods to help you along the way. It’s always important to understand your dataset before diving into a Machine Learning model, and asking and answering some high-level questions is a good way to start. Your data should always inform your model choices and make you think about what to try and why.

Find the notebook for this post here
