Two other machine learning systems, Linguistic Profiling and TiMBL, come close to this result, at least when the input is first preprocessed with PCA.

Introduction

In the Netherlands, we have a rather unique resource in the form of the TwiNL data set: a daily updated collection that probably contains at least 30% of the Dutch public tweet production since 2011 (Tjong Kim Sang and van den Bosch 2013).
However, as with any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata.
We also varied the recognition features provided to the techniques, using both character and token n-grams.
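To make the two feature types concrete, the following minimal sketch extracts token and character n-grams from a single text. It is an illustration only: the function names are hypothetical, and splitting on whitespace is an assumption, since the tokenization used for the experiments is not specified at this point.

```python
from collections import Counter

def token_ngrams(text, n):
    """Count token n-grams; tokens come from whitespace splitting (an assumption)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    """Count character n-grams over the raw string, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

if __name__ == "__main__":
    tweet = "ik ga naar huis"  # made-up example text
    print(token_ngrams(tweet, 2))
    print(char_ngrams(tweet, 3))
```

Character n-grams need no tokenizer and can pick up spelling variation inside words, whereas token n-grams capture word choice and short word sequences.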
The age component of the system is described in (Nguyen et al. The authors apply logistic and linear regression on counts of token unigrams occurring at least 10 times in their corpus.
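The feature extraction step described here, counts of token unigrams that occur at least 10 times in the corpus, could be sketched as follows. This is not the cited authors' code: the function names are hypothetical, whitespace tokenization is an assumption, and the regression models themselves are left out.

```python
from collections import Counter

def build_vocabulary(documents, min_count=10):
    """Keep token unigrams whose total corpus frequency is at least min_count."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())  # whitespace tokenization (an assumption)
    return sorted(tok for tok, count in totals.items() if count >= min_count)

def count_vector(document, vocabulary):
    """Represent one document as raw counts of the vocabulary unigrams."""
    counts = Counter(document.split())
    return [counts[tok] for tok in vocabulary]

if __name__ == "__main__":
    corpus = ["a a b", "a c"]  # toy corpus, so a lower threshold is used
    vocab = build_vocabulary(corpus, min_count=2)
    print(vocab, count_vector(corpus[0], vocab))
```

The resulting count vectors are the kind of input a logistic or linear regression model would take; the frequency threshold keeps very rare tokens out of the feature space.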