One of the major tasks in natural language processing (NLP) is the part-of-speech (POS) tagging of sentences, i.e. categorizing the words according to grammatical properties. Common parts of speech are noun, verb, article, adjective, preposition, pronoun, adverb, conjunction and interjection.
With the recent R package RDRPOSTagger, POS tagging can now be performed within R for more than 40 languages, including English, Hungarian, French, German, Hindi, Italian, Thai, Vietnamese and many more. Below we give a brief introduction to the package via a simple R script.
First, we have to install/load the package. The tokenizers package is also needed for splitting the text into sentences.
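The setup step can be sketched as follows. This is a minimal sketch under assumptions: that RDRPOSTagger is installed from its GitHub repository (bnosac/RDRPOSTagger) with the remotes package, and that a working Java installation is available, since the package relies on Java as far as we know. The installation lines are commented out so that they are run only once.

```r
# One-time installation (assumptions: tokenizers comes from CRAN,
# RDRPOSTagger from the GitHub repository bnosac/RDRPOSTagger)
# install.packages(c("tokenizers", "remotes"))
# remotes::install_github("bnosac/RDRPOSTagger")

# Load the tagger and the sentence tokenizer used in the rest of the script
library(RDRPOSTagger)
library(tokenizers)
```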
The part-of-speech tags are listed along with their abbreviations.
unipostag_types <- c("ADJ" = "adjective", "ADP" = "adposition", "ADV" = "adverb", "AUX" = "auxiliary", "CONJ" = "coordinating conjunction", "DET" = "determiner", "INTJ" = "interjection", "NOUN" = "noun", "NUM" = "numeral", "PART" = "particle", "PRON" = "pronoun", "PROPN" = "proper noun", "PUNCT" = "punctuation", "SCONJ" = "subordinating conjunction", "SYM" = "symbol", "VERB" = "verb", "X" = "other")
Next, the text to analyse is added (source: Wikipedia).
text <- "Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik. Originally called the Magic Cube, the puzzle was licensed by Rubik to be sold by Ideal Toy Corp. in 1980 via businessman Tibor Laczi and Seven Towns founder Tom Kremer, and won the German Game of the Year special award for Best Puzzle that year. As of January 2009, 350 million cubes had been sold worldwide making it the world's top-selling puzzle game. It is widely considered to be the world's best-selling toy."
We split it into sentences.
sentences <- tokenize_sentences(text, simplify = TRUE)
The language and the type of tagging need to be defined.
unipostagger <- rdr_model(language = "UD_English", annotation = "UniversalPOS")
Finally, the tagging is performed.
unipostags <- rdr_pos(unipostagger, sentences)
unipostags$word.type <- unipostag_types[unipostags$word.type]
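The last line relies on indexing a named character vector by name, which replaces each abbreviation with its description. A minimal base R sketch of the same idiom is shown here, using a shortened lookup table `tag_names` and a three-row data frame `toy`. Both are hypothetical and invented for illustration; they are not RDRPOSTagger output.

```r
# Shortened lookup table (hypothetical subset of unipostag_types)
tag_names <- c("DET" = "determiner", "NOUN" = "noun", "VERB" = "verb")

# Hypothetical tagger-like output; real output comes from rdr_pos()
toy <- data.frame(
  word = c("a", "puzzle", "is"),
  word.type = c("DET", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# Indexing by name returns the description for each abbreviation;
# as.character() guards against factor columns, unname() drops the
# abbreviations that would otherwise be kept as element names
toy$word.type <- unname(tag_names[as.character(toy$word.type)])
print(toy)
```

The `as.character()` call matters when the column is a factor: indexing with a factor uses its integer codes rather than its labels, which would silently pick the wrong descriptions.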
The results for the first sentence can be seen below.
   sentence.id word.id         word                word.type
 1           1       1      Rubik's                     noun
 2           1       2         Cube                     noun
 3           1       3           is                     verb
 4           1       4            a               determiner
 5           1       5          3-D                  numeral
 6           1       6  combination                     noun
 7           1       7       puzzle                adjective
 8           1       8     invented                     verb
 9           1       9           in               adposition
10           1      10         1974                  numeral
11           1      11           by               adposition
12           1      12    Hungarian                adjective
13           1      13     sculptor                     noun
14           1      14          and coordinating conjunction
15           1      15    professor                     noun
16           1      16           of               adposition
17           1      17 architecture                     noun
18           1      18         Ernő              proper noun
19           1      19       Rubik.              proper noun
For more details about the RDRPOSTagger package, see the article "Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger".
OpenTrialsFDA works on making clinical trial data from the FDA (the US Food and Drug Administration) more easily accessible and searchable. Until now, this information has been hidden in the user-unfriendly Drug Approval Packages that the FDA publishes via its data portal Drugs@FDA. These are often just images of pages, so you cannot even search for a text phrase in them. OpenTrialsFDA scrapes all the relevant data and documents from these packages, runs Optical Character Recognition across all documents and links this information to other clinical trial data.
Explore the public beta version through a new user-friendly web interface at https://fda.opentrials.net.
OpenTrials aims to provide a comprehensive picture of the data and documents on all clinical trials conducted on medicines and other treatments. The platform will present data aggregated from a wide variety of existing sources, starting with clinical trial registers and moving on to academic journals, systematic reviews and other data sources.
The intention is to create an open, freely re-usable index of all such information, to increase discoverability, facilitate research, identify inconsistent data, enable audits on the availability and completeness of this information, support advocacy for better data and drive standards around open data in evidence-based medicine.
A public beta version of OpenTrials is also available.
The well-known quote from Andrew Lang reads as follows: "The statistician uses statistics as a drunken man uses lamp posts, for support rather than illumination." A mathematician or a statistician will readily interpret the result of a statistical analysis with caution, but someone who is interested only in the result and is less familiar with the mathematical background of the methods used can easily jump to a wrong conclusion. The simplest example showing why prudence is needed when applying statistical results is Simpson's paradox (first described by Edward H. Simpson in 1951).
Consider the following study. A new drug is being tested on a group of 800 people (400 men and 400 women) with a particular disease. The aim is to establish whether there is a link between taking the drug and recovery from the disease. In a standard scenario, half of the people (randomly selected) are given the drug and the other half are given a placebo. The results show that, of the 400 given the drug, 200 (50 %) recover from the disease; this compares favourably with just 160 out of the 400 (40 %) given the placebo who recover.
So clearly we can conclude that the drug has a positive effect. Or can we? A more detailed look at the data results in exactly the opposite conclusion. Specifically, the following table shows the results when broken down into male and female subjects.
| | Women, placebo | Women, drug | Men, placebo | Men, drug |
| Recovery rate | 30 % | 20 % | 70 % | 60 % |
Focusing first on the men, we find that 70 % of those taking the placebo recover, but only 60 % of those taking the drug recover. So, for men, the recovery rate is better without the drug. Similarly, for the women we find that 30 % of those taking the placebo recover, but only 20 % of those taking the drug recover. So, for women, the recovery rate is also better without the drug. We can therefore conclude that in every subcategory the drug is worse than the placebo.
The process of drilling down into the data this way (in this case by looking at men and women separately) is called stratification. Simpson’s paradox is simply the observation that, on the same data, stratified versions of the data can produce the opposite result to non-stratified versions. Often, there is a causal explanation. In this case men are much more likely to recover naturally from this disease than women. Although an equal number of subjects overall were given the drug as were given the placebo, and although there were an equal number of men and women overall in the trial, the drug was not equally distributed between men and women. More men than women were given the drug. Because of the men’s higher natural recovery rate, overall more people in the trial recovered when given the drug than when given the placebo.
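The reversal can be checked with a few lines of base R. This is a sketch under one stated assumption: the subgroup sizes and recovery counts below are not quoted in the text. They are derived from the stated totals (400 on the drug with 200 recoveries, 400 on the placebo with 160 recoveries) and the stated recovery rates, which together leave only one consistent split: 300 men and 100 women on the drug, 100 men and 300 women on the placebo. The names `trial` and `totals` are ours.

```r
# Subgroup sizes and recovery counts, derived from the totals and
# percentages given in the text (not copied from the original table)
trial <- data.frame(
  sex       = c("women", "women", "men", "men"),
  treatment = c("placebo", "drug", "placebo", "drug"),
  n         = c(300, 100, 100, 300),
  recovered = c(90, 20, 70, 180)
)

# Stratified view: recovery rate within each sex and treatment group
trial$rate <- trial$recovered / trial$n
print(trial)

# Aggregated view: sum over sex, then recompute the recovery rate
totals <- aggregate(cbind(n, recovered) ~ treatment, data = trial, FUN = sum)
totals$rate <- totals$recovered / totals$n
print(totals)
```

Within each sex the placebo rate is higher than the drug rate (30 % vs 20 % for women, 70 % vs 60 % for men), while the aggregated rates come out as 50 % for the drug and 40 % for the placebo, matching the figures quoted above.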
Someone may ask, "Does this difficulty arise in more general cases (e.g. if we stratify the data into more subgroups)?" or "How can we avoid this kind of effect?". For answers and more details, please refer to the following articles:
SatRdays are community-led, regional conferences to support collaboration, networking and innovation within the R community. The initiative of Steph Locke and Gergely Daroczi was accepted and funded by the R Consortium. The very first event of this series took place in Budapest, Hungary on September 3, 2016, with almost 200 attendees from 19 countries and 12 hours of pure R fun. The day began with various workshops, followed by two keynotes and several regular talks, and ended with a data visualization challenge. The complete schedule can be found on the conference website at http://budapest.satrdays.org. The talks were live-streamed and can be watched online at http://www.ustream.tv/channel/xFdxHeVnGKS. If you have only limited time, we recommend the following talks: the 1st keynote by Gábor Csárdi (R package history), Romain François's question section (including a marriage proposal), the 2nd keynote by Jeroen Ooms (HTTP requests, ImageMagick) and data sonification by Thomas Levine. Overall, the first satRdays event received very positive feedback from the R community and started to establish the reputation of the series. Personal thoughts about the conference from the main organizer were published at https://www.r-consortium.org/news/blogs/2016/09/start-satrdays.