(C) PLOS One

This story was originally published by PLOS One and is unaltered.

Short text classification with machine learning in the social sciences: The case of climate change on Twitter [1]

Authors: Karina Shyrokykh (Department of Economic History and International Relations, Stockholm University, Stockholm), Max Girnyk, Lisa Dellmuth

Date: 2023-11

In this section, we review existing automated text-classification methods. These methods rely on a similar workflow consisting of the collection of data and various text preprocessing techniques, which we discuss in detail in the S1 Appendix. We start with the lexicon-based classifier. After this, we move on to the basics of the supervised-learning approach to classification. This is followed by an overview of relevant traditional ML algorithms. Finally, we discuss advanced neural classifiers.

To describe the methods, we use the following notation. Boldface letters, such as $\mathbf{a}$, denote vectors. Capital boldface letters, such as $\mathbf{A}$, denote matrices, and calligraphic letters, such as $\mathcal{A}$, denote sets. Estimates are denoted as $\hat{a}$. A vector transpose is denoted as $\mathbf{a}^T$. Lower indices, such as in $a_j$, are used to denote features, whilst upper indices, such as in $a^{(i)}$, denote observations. The probability distribution of a variable $a$ is denoted as $p(a)$, and $p(a|b)$ designates the conditional probability distribution of $a$ given that $b$ has been observed, where we deliberately do not distinguish between the random variables and their realizations.

There are many possible ways to classify tweets with respect to the topic of climate change. However, there is no standard lexicon for this type of classification. For this article, in order to benchmark the ML methods against a reasonable classifier, we select the lexicon from [3]. It consists of the following key terms: “climate”, “climatechange”, “globalwarming”, “climaterealists” and “agw” (an abbreviation for “anthropogenic global warming”). The lexicon performs reasonably well and constitutes a good baseline for our empirical performance assessment.

When we use lexicon-based classification to categorize Twitter data from UN agencies, the content of each tweet is tokenized (i.e., split up into separate words) and then inspected. If one or more of the tokenized words in the tweet match key terms contained in the lexicon, then the tweet is classified as dealing with climate change. If no words match, then the tweet is classified as not dealing with climate change. The number of tokenized words that need to match for a positive classification is a design parameter.
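The lexicon-based procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `tokenize` and `is_climate_tweet`, the regular-expression tokenizer, and the parameter name `min_matches` are assumptions made here; only the key terms are taken from the text.

```python
import re

# Key terms quoted in the text (lexicon from reference [3]).
LEXICON = {"climate", "climatechange", "globalwarming", "climaterealists", "agw"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_climate_tweet(text, min_matches=1):
    """Return True if at least `min_matches` tokens appear in the lexicon."""
    matches = sum(1 for token in tokenize(text) if token in LEXICON)
    return matches >= min_matches


print(is_climate_tweet("Time to act on #ClimateChange now"))    # True
print(is_climate_tweet("New report on global trade released"))  # False
```

Raising `min_matches` makes the classifier stricter, which corresponds to the design parameter mentioned above.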

Lexicon-based classification is the most widely used approach to text classification in the social sciences. It is an unsupervised technique that is based on the filtering or labeling of texts with the help of a lexicon, or a list of relevant key terms (e.g., [ 8 , 10 ]). Typically, the lexicon that is used is created by an expert in the area of interest, who has knowledge of the terminology used in context in this area.

This approach makes it possible to validate the models and select the best one in terms of the out-of-sample loss. However, it might still not be possible to assess the actual performance of the model on as-yet-unseen data. This is because the process of model selection contaminates the dataset, introducing a dependency between the model and the data on which it was selected. Hence, the validation loss is not representative of the actual performance on unseen data. To tackle this problem, it is necessary to make yet another partition of the dataset. That is, prior to cutting the training and validation subsets from the dataset, another subset of data, called a test set, is withheld. This test set is not touched during training and validation, and is used purely for assessing the classifier’s performance on unseen data.

To detect overfitting, the entire dataset can be split into two parts: a training set and a validation set. This makes it possible to validate the out-of-sample performance of a model on the validation set. Several models can then be fitted to the training set, with their out-of-sample performance checked on the validation set. This validation can be used as a basis for selecting a model that is complex enough to generalize to new examples and performs best in terms of average loss among all considered models. Following this, the best model can be retrained using the entire dataset (minus the test set), whilst also ensuring that it generalizes well when applied to as-yet-unseen data.
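The partitioning into training, validation and test sets can be sketched as follows. This is an illustrative sketch only: the function name `three_way_split`, the 60/20/20 proportions and the fixed seed are assumptions, not values reported in the text.

```python
import random


def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, withhold the test set first, then cut validation and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # never used for fitting or model selection
    val = shuffled[n_test:n_test + n_val]    # used to compare candidate models
    train = shuffled[n_test + n_val:]        # used to fit each candidate model
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Once a model has been selected on the validation set, it can be refitted on `train + val` and evaluated a single time on `test`.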

The more complex the model, the greater the possibility to decrease the empirical loss in Eq (1) and improve its in-sample performance on the given dataset. However, when the number of parameters becomes too high (relative to the number of examples N in the dataset), the training process might start confusing the noise in the data with the behavior of the feature variables. The model becomes tied to particular examples in the dataset and does not generalize to unseen data. That is, the in-sample performance of the model on the dataset becomes highly misleading and does not reflect its out-of-sample performance, which is known as overfitting. Thus, a general rule for picking the model is the following: the model should be complex enough to avoid underfitting, but simple enough to avoid overfitting.

If the model is too simple (i.e., it has too few parameters $\mathbf{w}$ to capture the behavior of the data), it is unable to fit the data well and has poor in-sample performance. This is known as underfitting and is an indicator that the model is poor in general. To avoid underfitting, more complex models have to be considered. The true predictor $f(\cdot)$ can be a complicated relation between features and labels. To approximate it well, the model may sometimes need to be complicated and include a large number $K$ of parameters. For example, a model of this kind that is linear in its parameters could be given by the following expression: $\hat{f}(\mathbf{x}) = w_0 + \sum_{k=1}^{K-1} w_k \varphi_k(\mathbf{x})$, where $\varphi_k(\mathbf{x})$ is some function of features $\mathbf{x}$ for $k = 1, \ldots, K - 1$. More complicated models (e.g., polynomials) might involve even larger numbers of parameters.

The model, being a function of the features x, is also a function of a set of its own parameters w = [w_0, …, w_{K−1}]. The fitting of the model is therefore carried out by searching for the parameter vector w that minimizes the average loss in Eq (1). This process is referred to as the training of the model. By monitoring the loss, we can track the in-sample performance of the model, or more specifically how closely the model corresponds with the data to which it is being fitted. Because the model learns the “average” mapping from features into labels, we also expect the model to have a decent out-of-sample performance (i.e., performance on new and as-yet-unseen examples). Unfortunately, the in-sample performance of a model might not be indicative of its out-of-sample performance, as we shall see below.

An efficient alternative to lexicon-based classifiers is the use of supervised ML methods. This class of methods requires a manually labeled dataset (see Eq (1) and Fig 3 in the S1 Appendix). By fitting a set of N examples given in this labeled dataset, the methods obtain a model that approximates the predictor function. To fit the model, we choose a mapping that minimizes the empirical average loss. This means that, in fitting the model, the computer attempts to minimize the difference between the predicted value and the actual observation. This can be formally expressed as

$$\hat{f} = \arg\min_{f} \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(\mathbf{x}^{(i)}), y^{(i)}\big), \qquad (1)$$

where $\ell(\cdot, \cdot)$ is some chosen loss function that reflects the closeness of the model prediction $f(\mathbf{x}^{(i)})$ for features $\mathbf{x}^{(i)}$ to the true label $y^{(i)}$ for observation $i$.

Fig 5. The classifier fits two Gaussian distributions (depicted with contour plots) to the categorized labels in the training set. The decision boundary is determined by the point where the probability densities for the two categories take the same value.

There are several versions of the NB classifier (e.g., Gaussian, Bernoulli, multinomial). The operation of a Gaussian NB classifier on exemplary data is demonstrated in Fig 5 . Studies have reported that a multinomial mixture model shows improved performance scores on several data collections, when compared to other commonly used versions of the NB approach [ 46 ]. Because of this, for our empirical analysis, we use a multinomial NB classifier with the smoothing hyperparameter tuned to achieve the best performance.

The advantages of the NB classifier are its simplicity, good scalability, quick training and insensitivity to irrelevant data. The disadvantages of the method are the assumption of independent predictor features and zero-probability problem for features present in the test set, but not occurring in the training set.

Given an observation $\mathbf{x}$, such as a short text, the NB classifier assigns the most probable category according to

$$\hat{c} = \arg\max_{c} p(c|\mathbf{x}) = \arg\max_{c} p(\mathbf{x}|c)\, p(c) \qquad (3)$$

due to Bayes’ theorem. The term $p(\mathbf{x})$ does not depend on the class $c$ and is hence skipped. Since it is difficult to estimate $p(\mathbf{x}|c)$, the naïve Bayes assumption is made. Namely, we factorize this conditional distribution as $p(\mathbf{x}|c) = p(x_1|c)\, p(x_2|c) \cdots p(x_M|c)$, which simplifies the computations. Moreover, $p(c)$ is estimated as the fraction of the tweets in class $c$ in the training set. Meanwhile, $p(x_j|c)$ is estimated as the count of feature $x_j$ in class $c$ relative to the count of all features in that class, with Laplace smoothing applied.
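The estimation and decision steps of a multinomial NB classifier can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the SciKit-Learn implementation used in the article: the function names `train_nb` and `predict_nb`, the toy documents, and the choice to skip tokens never seen in training are all assumptions. Log-probabilities are used to avoid numerical underflow when many factors are multiplied.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log p(c) and log p(x_j | c) with Laplace smoothing `alpha`."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        token_counts[c].update(doc)
    # p(c): fraction of training documents in class c.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        # p(x_j | c): smoothed relative count of token x_j within class c.
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Return argmax_c of log p(c) + sum_j log p(x_j | c); unseen tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, made-up training data: 1 = climate-related, 0 = not climate-related.
docs = [["climate", "action", "now"], ["global", "warming", "climate"],
        ["trade", "deal", "signed"], ["health", "summit", "today"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(["climate", "summit"], *model))  # 1
```

With `alpha=1.0`, the token "summit" never occurs in class 1 but still receives a non-zero probability there, which illustrates how smoothing handles the zero-probability problem.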

Naïve Bayes (NB) is a probabilistic classifier whose basic idea is to use the joint probabilities of words and categories to estimate the conditional category probabilities given a data point [ 45 ]. It has been successfully used for text classification [ 46 , 47 ]. It is called “naïve” because it makes an assumption that features are conditionally independent given the label. This assumption simplifies the computations carried out by the NB classifier, when compared to a non-naïve Bayes classifier, because the NB classifier does not use word combinations as predictors.

In our analysis, we use KNN with uniform weights. Because prediction relies on majority voting, we set the hyperparameter k to an odd number, which rules out tied votes between the two classes. We sweep k and another hyperparameter, leaf size, to optimize the KNN performance.

The k nearest neighbors (KNN) algorithm [ 42 ] is a simple classifier which has been widely used by researchers for text categorization (see, e.g., [ 43 , 44 ]). In contrast to other ML algorithms, KNN does not have a training phase. Instead the training set is simply stored and the actual computation is deferred to the prediction phase. In a nutshell, given an unlabeled observation x , the algorithm calculates the distances d( x , x (i) ) to training observations x (i) , for i = 1, …, N, in the feature space and finds k nearest neighbors to x . The classes of these neighbors are used to weigh the available classes for the given observation. That is, majority voting is performed on the observations within the set of k nearest neighbors. Fig 4 illustrates a KNN classifier in action on synthetic data.
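A minimal sketch of the prediction step is given below. It assumes Euclidean distance and uniform weights; the function name `knn_predict` and the synthetic points are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(x, train_X, train_y, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Sort the stored training observations by their distance d(x, x_i) to the query.
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(x, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict((0.15, 0.15), train_X, train_y))  # 0
print(knn_predict((0.95, 1.0), train_X, train_y))   # 1
```

Note that there is no fitting step: all the work happens at prediction time, which is why large training sets make KNN predictions slow.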

The RF classifier has the advantages of good interpretability, possibility to assess the importance of input features, and efficient avoidance of overfitting with many trees. The main disadvantage of this method is its computational complexity and slow speed with a large number of trees, which limits its applicability for real-time operation.

An issue with decision trees is that they typically overfit training data. A widely used way to improve generalizability is to average the predictions of a number of decision trees by means of bagging [41]. With bagging, several bootstrap training sets are randomly sampled (with replacement). A decision-tree model is then fitted to each of these training sets. The predictions from the models are combined, e.g., using majority voting over decision trees in the forest. This improves the model’s generalizability by reducing the variance of the classifier, without increasing its bias. The RF approach is an extension of bagging that also randomly selects subsets of features used in each data sample. Fig 3 illustrates an RF classifier in action. A total of B decision trees are formed by considering various subsets of available features. Each decision tree makes its own prediction on its own subset of the training data. A majority vote then defines the final prediction label of the random forest classifier.
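The bagging step can be sketched as follows. To keep the example short, each "tree" is a depth-1 decision stump, and the random selection of feature subsets that distinguishes RF from plain bagging is omitted; the function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-feature threshold rule (a depth-1 tree) by minimizing training errors."""
    best = None
    for j in range(len(X[0])):
        for t in {row[j] for row in X}:
            for hi in (0, 1):  # label predicted when x_j >= t; the other label otherwise
                errors = sum((hi if row[j] >= t else 1 - hi) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, hi)
    _, j, t, hi = best
    return lambda row: hi if row[j] >= t else 1 - hi


def fit_bagged_stumps(X, y, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_majority(forest, row):
    """Combine the individual tree predictions by majority vote."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


X = [(0.1,), (0.2,), (0.3,), (0.4,), (0.6,), (0.7,), (0.8,), (0.9,)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_bagged_stumps(X, y)
print(predict_majority(forest, (0.05,)), predict_majority(forest, (0.95,)))
```

Individual stumps may differ because each sees a different bootstrap sample; the vote over many of them is what reduces the variance of the combined classifier.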

Fig 3. The classifier consists of a collection of decision trees, each making its own prediction regarding the class of an unseen data point. The predictions are then used in a majority voting for producing the final prediction.

Once the training is done, a new observation x is passed through the entire decision tree. As it passes through the tree, the observation is subjected to a sequence of learned conditional statements (greater than or lower than a threshold) for each feature x_j, where j = 1, …, M. At the bottom of the tree (in a leaf node, where there is no further split), the prediction is obtained as the majority class of the training data points satisfying all the conditions along the path. Illustrations of predictions with decision trees are shown as one block in Fig 3. Each circle depicts a conditional statement over a feature with respect to a threshold. The thick lines depict the paths of observations as they travel through the decision tree, satisfying all the conditions on their way. The numbers below the trees indicate the predicted classes.

Random forest (RF) classifiers [35, 36] are widely used for text classification (see, e.g., [37, 38]). The algorithm uses the so-called decision tree approach [39, 40]. The latter relies on the intuition that, instead of seeking a single complicated mapping function, one can partition the feature space into disjoint regions and fit a simpler model in each of those. This is done during training by recursively splitting the space of possible values for each feature at a certain threshold. For example, at a given iteration, the entire input space is split into a pair of half-spaces {x : x_j ≥ t_j} and {x : x_j < t_j}, where t_j is a threshold for a certain feature j. The split is done in a greedy way to minimize some loss function (e.g., misclassification error, Gini impurity, or entropy), without looking ahead at future splits. This process is repeated recursively on each of the half-spaces until a stopping criterion is fulfilled (e.g., there is no information gain from further splits). In this way, a binary “decision tree” is constructed, which sets the thresholds for the classification of new samples.
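A single greedy split of the kind described above can be sketched as follows, using Gini impurity as the loss. The function names `gini` and `best_split` and the four toy observations are assumptions for illustration; a full tree would apply `best_split` recursively to each resulting half-space.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y):
    """Greedily pick the (feature j, threshold t) that minimizes weighted Gini impurity."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] < t]    # {x : x_j < t}
            right = [label for row, label in zip(X, y) if row[j] >= t]  # {x : x_j >= t}
            if not left or not right:
                continue  # a split must leave observations on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


X = [(1.0, 5.0), (2.0, 3.0), (3.0, 6.0), (4.0, 2.0)]
y = [0, 0, 1, 1]
print(best_split(X, y))  # (0.0, 0, 3.0): split feature 0 at threshold 3.0
```

Here feature 0 separates the two classes perfectly at the threshold 3.0, so the weighted impurity after the split is zero and no further splitting would be needed.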

Because text classification problems are known to be usually linearly separable [12], we use an SVM implementation with a linear kernel for our empirical analysis. The regularization hyperparameter, defining the “softness” of the margin, is tuned to maximize the performance of the SVM.

The advantages of the SVM classifier are its efficiency with high-dimensional data, ability to model non-linear decision boundaries, and efficient handling of small datasets. The downsides are its sensitivity to the choice of the kernel function, high computational complexity for very large datasets, and the lack of probabilistic interpretation (unlike in LR).

Fig 2. A hyperplane is fitted to separate the two classes and maximize the margin. A linear kernel (a) is used for the case of linearly separable classes, while a radial basis function kernel (b) is used for the case with class overlap.

The above intuition only works for cases where classes are linearly separable (see Fig 2a). In the general case, when classes may overlap in such a way that it is not possible to fit a separating hyperplane between their data points, the SVM principle can be extended to soft margins. This allows some observations to reside on the other side of the hyperplane [34]. Furthermore, a non-linear transformation ϕ(⋅) can also be applied to the data, expanding the feature space to a higher dimension in which there may be a linearly separating hyperplane. This transformation can be done, e.g., by adding higher-order polynomial terms. However, a problem with this approach is that it might greatly increase the computational complexity of the algorithm and lead to overfitting. In order to overcome this problem, the “kernel trick” can be invoked. Namely, by means of Lagrangian duality, the maximization of the soft margin is reformulated in terms of scalar products between data points. There are kernel functions k(⋅, ⋅) for which $k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$, e.g., polynomial, Bessel, and radial basis function (RBF) kernels. This means that it is possible to skip computing the data transformation ϕ(x) explicitly. Instead, computing only the kernel function gives the dot products in the transformed feature space that are needed for the optimization. The obtained separating hyperplane translates back to the original feature space as a non-linear decision boundary (see Fig 2b, which illustrates an SVM with an RBF kernel on synthetic data).
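The kernel trick can be checked numerically on a small example. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the explicit feature map is known in closed form; the function names and the two sample points are chosen here purely for illustration.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def dot(u, v):
    """Ordinary scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, 0.5)
print(poly_kernel(x, z))                                   # 16.0
print(math.isclose(poly_kernel(x, z), dot(phi(x), phi(z))))  # True
```

The kernel is evaluated in the original two-dimensional space, yet it yields the same value as the scalar product in the three-dimensional transformed space. For an RBF kernel the transformed space is infinite-dimensional, so the explicit route is not available at all.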

SVM classification works by fitting a hyperplane that separates the classes and maximizes the separation margin [32]. Support vectors refer to those samples that are the closest to the separating hyperplane. These are the only samples that impact the SVM training, as they are most likely to cause misclassification. The margin is given by the length of the projections of the support vectors onto the vector of parameters w, which is perpendicular to the separating hyperplane. The average loss is given by Eq (1) with a hinge loss function.
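The hinge loss mentioned above can be sketched as follows. The example assumes the common convention of labels in {-1, +1} and a linear score w · x + b; this convention, the function name `hinge_loss` and the sample values are assumptions not stated in the text.

```python
def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y * (w . x + b)) for labels y in {-1, +1}."""
    total = 0.0
    for x, label in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        total += max(0.0, 1.0 - label * score)
    return total / len(X)


X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
# Only the middle point lies inside the margin and contributes a loss of 0.5.
print(hinge_loss((1.0, 0.0), 0.0, X, y))  # 0.5 / 3
```

Points that are correctly classified and lie outside the margin contribute nothing, which is why only the samples near the hyperplane influence the fitted solution.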

Support vector machines (SVMs) were first proposed in [32] for linearly separable problems and later generalized to non-linear problems in [33]. SVMs have been frequently used for the classification of political texts [12, 13, 18]. They are sometimes regarded as the method of choice for classifying short texts in the social sciences [8]. Unlike the LR method, which provides a probability as its output, an SVM outputs the predicted label directly.

A classifier is characterized by a set of hyperparameters which refer to parameters related to the model’s architecture and whose values are set before the learning process begins. This is in contrast to the model’s own parameters which are learned during the training. The SciKit-Learn implementation of the LR classifier has several hyperparameters, including regularization strength and maximum number of iterations. For our numerical performance assessment, these are tuned to get the best classification performance.

The advantages of the LR classifier are its simplicity, efficient training, interpretability and the absence of assumptions about the class distribution. The downsides are overfitting for a large number of features (compared to the number of observations) and sensitivity to multicollinearity among predictors.

An example of an LR classifier in action is shown in Fig 1 on exemplary synthetic data. The two classes are separated by a decision boundary, which is determined by the equal-probability line on the sigmoid-based probability map (see the black solid line in Fig 1). For all data points with probability above the decision threshold we predict $\hat{y} = 1$, while for the rest we predict $\hat{y} = 0$.

Logistic regression (LR) is a classification algorithm that was introduced in [29] and has been widely applied in text classification [30, 31]. The classifier models a binary random variable representing the label $y$ given the data $\mathbf{x}$. The model, parameterized by weights $\mathbf{w}$, is given by

$$p(y = 1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}, \qquad (2)$$

where $\sigma(\cdot)$ is the sigmoid function, representing the conditional probability of a positive class label given the observed features $x_1, \ldots, x_M$.
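The prediction step of an LR classifier can be sketched as below. The weights are hypothetical and hand-picked rather than learned, and the intercept is handled by assuming a constant 1 as the first feature; both choices, as well as the function names, are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """p(y = 1 | x) = sigmoid(w^T x); x is assumed to start with a constant 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def predict_label(w, x, threshold=0.5):
    """Predict 1 when the probability exceeds the decision threshold, else 0."""
    return 1 if predict_proba(w, x) > threshold else 0


w = (-1.0, 2.0)  # hypothetical weights: intercept and one feature
print(predict_proba(w, (1.0, 0.5)))  # 0.5, a point exactly on the decision boundary
print(predict_label(w, (1.0, 2.0)))  # 1
print(predict_label(w, (1.0, 0.0)))  # 0
```

With a threshold of 0.5 the decision boundary is the set of points where w^T x = 0, which is a straight line (or hyperplane) in the feature space.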

By traditional ML classifiers we mean a collection of ML algorithms which have previously been developed for classification tasks. Those are built on different principles and have different computational complexities. Their performance depends on the task at hand. In this section, we will review the most popular ML classifiers and specify the variants of the classifiers that will be used for our subsequent performance comparisons. In our analysis, we use the SciKit-Learn implementations [ 27 ] of the traditional ML algorithms, whereas for the deep neural classifiers, we use the TensorFlow implementations set up via the Keras API [ 28 ]. For each of the methods, we test the three available vectorizers, as well as optimize the number of features to maximize the classifier’s performance.

Deep learning

Deep learning (DL) is a prominent supervised ML framework for data-driven predictive modeling. Despite its promising classification capacities, DL continues to be a much less used approach to text classification in social science, when compared to its use in such fields as computer science. A DL classifier can be thought of as a black box: data go in, decisions come out (see Fig 6a). Inside the black box there are a large number of parameters that are learned during the training. In this way, the DL model approximates a predictor function that describes the observed data. Because of the way its operations resemble those of the human brain, the broader class of methods to which DL belongs is often referred to as artificial intelligence (AI).

Fig 6. Illustration of a deep-learning model. The classifier provides a complex non-linear function (a) that maps an input feature vector into a predicted label. The function is learned from the training set by adjusting the parameters of a set of internal units arranged in, e.g., a layered fully-connected structure (b). (a) Deep-learning model as a black box. (b) Fully-connected deep neural network. https://doi.org/10.1371/journal.pone.0290762.g006

A central element of DL is the concept of an artificial neural network (NN) which describes the inner-workings of the DL black box. It consists of a layered structure of units conducting mathematical operations on their inputs and passing along the results (see Fig 6b). The usefulness of an NN lies in its ability to infer a function from observations. In this way, the DL model can approximate the predictor function that describes the observed data which is essential for both classification and regression tasks. The idea of an NN was first proposed by Hebb [48] as an abstraction of a real biological network of neurons in the mammalian brain.

A biological NN comprises numerous layers, each consisting of a set of neurons (see Fig 4 in the S1 Appendix). Each neuron consists of dendrites that receive input signal pulses, a cell body that performs a transformation of the incoming combination of pulses, and an axon. The axon carries the output pulse through its trunk (covered with insulating myelin sheaths) to its ramified terminal, which has a set of synapses that connect to the dendrites of the downstream neurons in the network. The functioning of biological NNs has been studied and formalized independently in [49, 50].

Each neuron in the network conducts a non-linear operation on the superposition of the signals received by its dendrites. These signals are weighted by the strength of the synaptic connections to the upstream neurons. The process carried out by each neuron can therefore be modeled as a simple unit that computes the weighted sum of its inputs x_1, …, x_M with weights w_{j,1}, …, w_{j,M}, adds a bias term b_j, and applies a non-linear transformation a(⋅), called an activation function, to the result (see Fig 4a in the S1 Appendix). This unit, first proposed in [51], is referred to as an artificial neuron and constitutes the basic building block of every NN.

Information travels through the NN from the input layer towards the output layer, passing through all neurons on its way. For binary classification tasks, the output layer consists of a single neuron that outputs the probability of the observation belonging to one of the classes. The layers in between are called hidden layers, as they are not directly visible from the outside of the black box. An NN is referred to as a deep NN if it contains two hidden layers or more. Note that NNs could equally well be adapted for regression and multi-class classification tasks. For the latter, the last layer would consist of the same number of neurons as there are classes, each having a softmax activation function and outputting the probability of belonging to that class.
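For the multi-class case, the softmax activation of the output layer can be sketched as follows. The function name `softmax` and the three raw scores are illustrative assumptions; subtracting the maximum score is a standard numerical safeguard and does not change the result.

```python
import math


def softmax(scores):
    """Map raw output-layer scores to class probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is the one with the highest probability, here the first class because it has the largest raw score.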

It has been proven that sufficiently deep NNs are able to approximate any arbitrarily complex function f(⋅) [52]. Hence, deep NNs are particularly useful in applications where there are large amounts of data and significant computational resources available, and where manual analysis is tedious or the performance of traditional ML algorithms is unsatisfactory. In the context of text classification, NNs were first applied in [53, 54]. Only recently have they started to be used for political science research [55, 56]. There are various neural architectures available for this task. We review the most relevant ones below.

Fully-connected neural networks. The first attempt to combine artificial neurons into a layer was made by Rosenblatt [57]. He devised the concept of a perceptron, a binary classifier mapping an input feature vector into one of two available classes. Formally, given a feature vector $\mathbf{x}$, the perceptron outputs

$$\hat{y} = a(\mathbf{w}^T \mathbf{x}), \qquad (4)$$

where $\mathbf{w} = [w_1, \ldots, w_M]$ is the set of weights, and $a(\cdot)$ is the activation function. This step is referred to as the forward pass. Given an actual training output $y$, the algorithm readjusts its weights according to the gradient descent rule

$$w_j \leftarrow w_j + \alpha\, (y - \hat{y})\, x_j, \qquad (5)$$

where $\alpha$ is the so-called learning rate, which determines the convergence speed of loss minimization. This adjustment process is referred to as the backward pass. The process continues iteratively until the loss function in Eq (1) is below an acceptance threshold. If the classification problem is linearly separable, the perceptron is guaranteed to converge, and the predictor function is given by the forward pass with the learned weights $w_j$ for all $j = 1, \ldots, M$.

In a nutshell, the perceptron is an NN with a single layer. To improve its performance, Werbos [58] proposed stacking several perceptrons in consecutive layers. That study coined the concept of a multi-layer perceptron, also known as a fully-connected NN (FCNN). FCNNs exhibit a much better learning ability than the perceptron. They are the simplest form of the general class of feedforward NNs, i.e., neural architectures without cycles, which are the most common architectures in use today. An FCNN comprises a set of fully-connected (or dense) layers stacked in a line topology (see Fig 6b). Due to the presence of multiple layers, an FCNN is capable of representing multiple levels of abstraction. An example is its use for facial recognition [56]. Based on an input image, the first layers of the NN capture the simple characteristics of the image, such as oriented edges or corners. Further layers then react to more complicated shapes, such as noses or eyes. Finally, the last layers are able to detect the face itself. In this way, adding extra layers to the NN can enable solutions to many otherwise non-separable problems.

For a deep FCNN with $P$ hidden layers, the input-output relation at each layer $p = 1, \ldots, P + 1$ is given in matrix form as

$$\mathbf{z}_p = a_p(\mathbf{W}_p \mathbf{z}_{p-1} + \mathbf{b}_p), \qquad (6)$$

where $\mathbf{z}_{p-1}$ and $\mathbf{z}_p$ are vectors of the inputs and outputs of layer $p$, respectively. Then $\mathbf{W}_p$ is the matrix of weights between the input and output neurons of the layer, while $\mathbf{b}_p$ is the vector of bias terms of the given layer. Moreover, $a_p(\cdot)$ is the activation function, which is often chosen as sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU). Also note that here the input of the first layer is given by the input feature vector $\mathbf{z}_0 = \mathbf{x}$. Meanwhile, during the training, the output of the last layer in the network is set to the true label, i.e., $\mathbf{z}_{P+1} = y$. For a new data point, the output of the last layer provides the predicted label, i.e., $\hat{y} = \mathbf{z}_{P+1}$.

Learning for feedforward NNs is usually done by means of backpropagation. The NN parameters (i.e., weights and biases) for each layer are adjusted based on the adopted loss function in Eq (1), which captures the difference (or error) between the observation, $y$, and the output of the forward pass based on current parameters, $\hat{y}$. This learning process is often carried out by means of stochastic gradient descent or another optimization algorithm, such as Adam [59] or RMSProp [60]. This involves recursively propagating the gradients of the parameters backwards, from the last to the first layer, similarly to Eq (5). It is noteworthy that a mini-batch version of the gradient update is often used for more robust convergence. More concretely, the training dataset is split into small batches for which the model loss is calculated and model parameters are updated. In this way, the batch updates are computationally more efficient because the entire training dataset does not need to be held in the computer’s memory at once.

The advantages of FCNNs are their ability to solve complex non-linear problems, efficiency of handling large amounts of data, and quick predictions after the slow process of training is completed. Their disadvantages are slow training, poor interpretability due to their black-box nature, large numbers of parameters due to the fully-connected structure, and ignoring spatial information by accepting only vectorized inputs.

For our analysis, we have used the same architecture for all NNs under consideration, differing only in a single layer that is particular to the given NN. For the FCNN, this was a fully-connected layer with a certain number of units and a ReLU activation function. The number of neurons was optimized, alongside other hyperparameters, such as batch size, number of epochs, dropout rate, learning rate, regularization strength, embedding dimension, and maximum vocabulary length.
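The forward pass and weight update of a single perceptron can be sketched as follows. The sketch assumes a step activation and folds the bias into the weights by prepending a constant 1 to each input; these choices, the function names, and the logical-AND toy data are assumptions for illustration and are not taken from the article.

```python
def step(z):
    """Threshold activation a(z): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def forward(w, x):
    """Forward pass: y_hat = a(w^T x)."""
    return step(sum(wj * xj for wj, xj in zip(w, x)))


def train_perceptron(X, y, alpha=1.0, epochs=50):
    """Alternate forward passes and weight updates until an epoch has no errors."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            delta = alpha * (target - forward(w, x))
            if delta != 0.0:
                errors += 1
                # Backward pass: w_j <- w_j + alpha * (y - y_hat) * x_j
                w = [wj + delta * xj for wj, xj in zip(w, x)]
        if errors == 0:
            break
    return w


# Logical AND, a linearly separable problem; the leading 1 plays the role of the bias.
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1]
w = train_perceptron(X, y)
print([forward(w, x) for x in X])  # [0, 0, 0, 1]
```

Because the AND problem is linearly separable, training stops after a handful of epochs. On a problem such as XOR, which is not linearly separable, a single perceptron would never reach an error-free epoch, which motivates stacking layers into an FCNN.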

Convolutional neural networks. Convolutional neural networks (CNNs) refer to another subclass of feedforward NNs that are designed to capture spatial and temporal dependencies through the application of kernels. CNNs have the benefit of having a reduced number of parameters and reusable weights. The use of CNNs has become a state-of-the-art approach for the analysis of images [16]. They have also been shown to be useful for text classification [55, 61]. A CNN consists of two parts, where the first part is dedicated to feature extraction, while the second part performs the actual classification by means of an FCNN (see Fig 5 in the S1 Appendix).

CNNs often start with an embedding layer that maps a discrete variable to a vector of continuous numbers. This is done by means of a single fully-connected layer (or a set thereof) that is either learned during the training or substituted by a pre-trained embedding. In the subsequent convolution layer, a set of fixed-size kernels slide through the list of embeddings, performing a convolution operation. This layer functions as the CNN’s feature extractor because it learns to find local spatial features in the output of the embedding layer. The size of the kernel is the number of embeddings it sees at once multiplied by the length of an entire word embedding. The output of this layer is a set of feature maps that serve as the input for the subsequent pooling layer. Zero-padding is often used in this step, surrounding the input so that a feature map does not shrink.

The pooling layer then subsamples the feature maps input to it. It does this by selecting a single element from each region of the feature maps covered by its filter. The pooling layer operates on each feature map independently. This reduces the size of the feature representation, thereby effectively reducing the number of parameters in the network, whilst preserving the most prominent features.

Oftentimes, a dropout layer is also applied as a means to improve generalizability. Dropout destroys the learned co-dependencies between neurons that compensate for errors from the previous layers, which reduces overfitting at the cost of longer training. The output is then fed into a fully-connected layer (or an FCNN) that conducts the classification based on the features extracted by the previous layers.

CNNs have important advantages, such as automatic feature extraction, with the possibility of reusing the extracted features for new tasks, and weight sharing, which leads to a reduced number of parameters. They also have lower computational complexity than FCNNs. Their drawbacks are ignoring the position and orientation of the object in their predictions, their need for large amounts of data, and long training times.

In our performance comparisons, the CNN architecture includes an embedding layer, a dropout layer, a convolution layer and a max-pooling layer with output flattened to match a single neuron for final classification. We optimize the same hyperparameters as we tuned for the FCNN, except for the number of units. Instead, for the convolution layer we tune the number of filters and the size of the kernel.
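The convolution and pooling steps over a sequence of word embeddings can be sketched in pure Python. The sketch makes several simplifying assumptions: a single kernel, no zero-padding, a stride of one, a ReLU activation, and global max pooling over the whole feature map rather than pooling over smaller regions. The embeddings and kernel values are hypothetical, and the article itself uses Keras layers rather than hand-written loops.

```python
def conv1d(embeddings, kernel, bias=0.0):
    """Slide a kernel spanning len(kernel) consecutive word embeddings over the sequence.

    Returns one feature map (no padding, stride 1) after a ReLU activation.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        z = bias + sum(k * e
                       for k_row, e_row in zip(kernel, window)
                       for k, e in zip(k_row, e_row))
        feature_map.append(max(0.0, z))
    return feature_map


def global_max_pool(feature_map):
    """Keep only the strongest response in the feature map."""
    return max(feature_map)


# Four tokens with hypothetical 2-dimensional embeddings.
embeddings = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
kernel = [(1.0, -1.0), (0.5, 0.5)]  # sees 2 tokens at once, so its size is 2 x 2
fmap = conv1d(embeddings, kernel)
print(fmap, global_max_pool(fmap))  # [1.5, 0.0, 0.0] 1.5
```

The same kernel weights are reused at every position in the sequence, which is the weight sharing that keeps the number of parameters small. Without zero-padding, the feature map is shorter than the input (three positions for four tokens).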

---
[1] Url: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0290762

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
