Divining Twitterverse with R

In this post I continue my journey into the Twitterverse with R and capture the tweet frequency for the hashtags #NaMo, #AAP and #RaGa over the last 7 days. This seemed the most appropriate thing to do given that the 2014 Indian general election, the 16th, is just around the corner. The handshake that has to be established with Twitter is the same as described in my last post “To R is human …”

Here is a great blog post on measuring tweet frequencies – Getting Genetics done by Stephen Turner.

Once the initial handshake is done, the tweets can be searched. It appears that searchTwitter can only search tweets from the last 7 days, and only up to a maximum of 1500 tweets.

This is done as follows for the hashtag #NaMo. The dates variable holds the day boundaries, and the for loop runs searchTwitter once for each of the 7 daily windows between consecutive dates.

# Search the last 7 days for the hashtag #NaMo, one day at a time
tweets <- list()  # start with an empty list so c() has something to append to
dates <- paste("2014-03-", 11:18, sep="")  # need to go to 18th to catch tweets from 17th
for (i in 2:length(dates)) {
  print(paste(dates[i-1], dates[i]))
  tweets <- c(tweets, searchTwitter("#NaMo", since=dates[i-1], until=dates[i], n=1500))
}
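The date strings above are built by pasting day numbers onto a fixed month prefix, which only works within a single month. Here is a minimal sketch of an alternative using base R Date arithmetic. The names end_date and windows are illustrative and not part of the original code, and the end date is assumed to be the 18th so that the 17th is covered.

```r
# Build the day boundaries with Date arithmetic instead of paste(),
# so the windows stay correct across month ends.
end_date <- as.Date("2014-03-18")   # assumed exclusive upper bound
dates <- format(seq(end_date - 7, end_date, by = "day"))   # 8 boundaries

# Pair consecutive boundaries into 7 (since, until) windows
windows <- data.frame(since = head(dates, -1),
                      until = tail(dates, -1),
                      stringsAsFactors = FALSE)
print(windows)
```

Each row of windows could then supply the since and until arguments for one searchTwitter call.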

The tweets are then converted to dataframes for processing

# Create a dataframe from the tweets
tweets <- twListToDF(tweets)
tweets <- unique(tweets)

Finally the tweets are plotted using ggplot

# Plot the frequency of tweets in 2 hour windows
library(ggplot2)
minutes <- 120
ggplot(data=tweets, aes(x=created)) +
  geom_histogram(aes(fill=after_stat(count)), binwidth=60*minutes) +  # binwidth is in seconds
  scale_x_datetime("Date") +
  scale_y_continuous("Frequency") +
  labs(title="#NaMo Tweet Frequency March 11-17") +  # opts() is no longer available in ggplot2
  theme(legend.position='none')
ggsave(filename='NaMo-frequency.png', width=7, height=7, dpi=100)
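To see what the 2 hour binning amounts to, here is a small base R sketch that counts timestamps per window without ggplot. The timestamps are made up and merely stand in for the created column of the tweets dataframe; breaks, bins and counts are illustrative names.

```r
# Synthetic timestamps standing in for tweets$created (made-up data)
created <- as.POSIXct("2014-03-11 00:00:00", tz = "UTC") +
  c(0, 600, 3600, 7200, 7300, 20000)   # offsets in seconds

minutes <- 120
# Window boundaries every 2 hours, covering the whole range of timestamps
breaks <- seq(min(created), max(created) + 60 * minutes, by = 60 * minutes)

# Assign each timestamp to its window (left-closed) and count per window
bins <- cut(created, breaks = breaks, right = FALSE)
counts <- table(bins)
print(counts)
```

The bar heights in the plot correspond to counts of this kind, one per 2 hour window.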

The plot for #NaMo is shown below

[Figure: #NaMo tweet frequency]

The same is performed for #AAP

[Figure: #AAP tweet frequency]

And for #RaGa

[Figure: #RaGa tweet frequency]

While the number of tweets for #NaMo is very high, #RaGa seems to occur in lower numbers but consistently every day.

Of course we can also check whether the sentiment of the tweets for these hashtags is positive or negative. That's for another day though.

The code can be cloned at Rtweet-frequency

Find me on Google+

To R is human …

“To R is human, to dabble in it fun” one could say. In this post I try to be a little like Nate Silver, looking at the Twitterverse. Since the 2014 Indian general election, which will constitute the 16th Lok Sabha, is around the corner, I wanted to play around a little bit. Anyway, here goes.

To get started on Twitter with R, we first need to establish a handshake between Twitter and R. We need to authenticate our R application with Twitter to enable us to mine the tweets in the Twitterverse. The steps are fairly straightforward. The R app you create has to be authenticated and authorized with Twitter.

The first step is to create an app on Twitter at http://dev.twitter.com. Log in to your Twitter account. Click the drop-down at your photo and choose “My applications”. Then click “Create new application”. Now do the following:
– Enter a unique name for your application
– Enter a description
– For the ‘Website’ enter any valid URL
– Leave the Callback URL blank
– Accept the conditions

Leave this in your browser. The handshake between your R application and Twitter needs to be established as follows

#install the necessary packages
install.packages("ROAuth")
install.packages("twitteR")
install.packages("wordcloud")
install.packages("tm")

library("ROAuth")
library("twitteR")
library("wordcloud")
library("tm")
library(RCurl)

# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

require(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

Now go to your browser. In the created Twitter application, choose the API Keys tab. Copy and paste the API key and API secret in the next 2 lines

apiKey <- "Your API key here"
apiSecret <- "Your API secret here"
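# Create the OAuth credential object from the key, secret and URLs defined above
twitCred <- OAuthFactory$new(consumerKey=apiKey, consumerSecret=apiSecret, requestURL=reqURL, accessURL=accessURL, authURL=authURL)
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))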

When you enter this you should see the following
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=WnTGL4eHsiNJRFRiW1UU3GoYSvVZiYDBbO3WAsZO

Copy and paste the link given in a new tab in your browser. Copy the 7 digit PIN and paste it in the space below
When complete, record the PIN given to you and provide it here: 7377963

registerTwitterOAuth(twitCred)

This should complete the authorization. Now you are good to go.
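The API key and secret are pasted directly into the script above. One possible alternative is to read them from environment variables so that they never land in a file you share. This is only a sketch: the variable names TWITTER_API_KEY and TWITTER_API_SECRET are hypothetical and would have to be set by you, for example in ~/.Renviron.

```r
# Hypothetical environment variable names; set them outside the script
apiKey    <- Sys.getenv("TWITTER_API_KEY")
apiSecret <- Sys.getenv("TWITTER_API_SECRET")

# Sys.getenv() returns "" for an unset variable, so check before the handshake
credentials_ok <- nzchar(apiKey) && nzchar(apiSecret)
if (!credentials_ok) {
  message("Set TWITTER_API_KEY and TWITTER_API_SECRET before the handshake")
}
```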

Here is a short example of performing Text Mining with the help of package “tm”.

I wanted to create a word cloud around the hashtag #NaMo

So here is the code. We need to create a Corpus

#Search Twitter for the hashtag #NaMo

r_stats<- searchTwitter("#NaMo",n=500, cainfo="cacert.pem")


# Save text
r_stats_text <- sapply(r_stats, function(x) x$getText())
# Create a corpus
r_stats_text_corpus <- Corpus(VectorSource(r_stats_text))
# Clean up the text
# content_transformer() keeps the corpus structure intact in tm 0.6 and later
r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower))
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation)
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removeWords, stopwords())

# Now create a word cloud
wordcloud(r_stats_text_corpus)

[Figure: word cloud for #NaMo]

This will create a Wordcloud of the words most used with the hashtag, in this case #NaMo
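A word cloud is essentially a picture of word frequencies. As a rough illustration of the counting behind it, here is a base R sketch on a few made-up strings. The texts and the tiny stopword list are assumptions for the example, and the tm package does considerably more than this.

```r
# Made-up example texts standing in for r_stats_text
texts <- c("NaMo rally in Delhi today", "Delhi rally draws a crowd", "Rally!")

# Lower-case, strip punctuation, then split on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(texts)), "\\s+"))

# Drop a tiny illustrative stopword list (tm's stopwords() is far longer)
words <- words[!words %in% c("in", "a", "the")]

# Count each word and sort the most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

The more often a word occurs, the larger it would be drawn in the cloud.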

You can clone the code at Rwordcloud

Watch this space. Hasta la vista. I’ll be back!


C Language – The code of God


One could easily say “In the beginning there was C language. All else were variants” and not be far from the truth. As I headed to work today I was ruminating on the impact the C language has had on the computing landscape over the past 4 decades. No other language has had such a significant impact.

The C language was created by Dennis Ritchie at Bell Labs around 1972, and popularized by the book he wrote with Brian Kernighan. C was the trigger for many seismic shifts in the computing industry. The language is terse and compact. C strikes a rich balance between brevity and readability.

C language, in my opinion, is the code of God.

We can easily divide the epoch of programming languages into before C and after C. Before C, there was a babel of languages from FORTRAN, COBOL, Pascal, Basic, Prolog, Ada, Lisp and numerous others. When the C language entered the scene, many other languages simply faded away. C set the tone for programming and spawned an entire industry.

Many of the popular constructs like if-then-else and the for and while loops had a crisp simplicity in C. C included in its repertoire everything from the ability to manipulate the bits of registers all the way to creating complex and rich data structures with the help of structures and pointers. In fact C was probably one of the key enablers for the development of the legendary Operating System (OS), UNIX, from Bell Labs.

Building the innards of an OS is an undertaking of gigantic proportions and requires the ability to manipulate the registers of the numerous input/output devices, the processors and the memory. C was eminently suited for this job. Also, the complex algorithms of an OS, e.g. process scheduling, memory management, IO management and disk management, could now be programmed simultaneously in a bottom-up fashion by working at the ‘bit’ level and in a top-down fashion with the complex data structures and algorithms they require. With a language as powerful as C, the birth of UNIX was a given. When AT&T distributed UNIX to the universities in the late 1970s it created serious shock waves in the industry. Since then UNIX has resulted in numerous variants – Solaris, HP-UX, AIX, iOS, Linux and then Android and so on. Well, that’s another story!

C came at an opportune time, when the internet was in its infancy. C also proved useful for implementing the protocols of the internet, namely TCP/IP. C spawned an army of programmers, all keen to take on this new language, twiddling bits, bytes and the complex data structures of the OS and protocols.

C, UNIX and TCP/IP almost entirely power the internet and the WorldWideWeb.

The beauty and brevity of the language enabled programmers to easily express complex problems as units of C functions. Pointers and bit manipulation gave it a power that was unparalleled at that time. Soon C became the de facto programming standard. C, in fact, became a way of thinking about problems!

So it was not surprising that languages that came after C used the same or similar constructs. C++ kept the constructs of C to maintain backward compatibility as well as to allow the already existing millions of C programmers to easily assimilate the OO paradigm. Java, from Sun Microsystems, followed suit. Java, a very powerful and popular language, also retained the flavor of C.

Many interpreted and dynamic languages like Perl, Python and Ruby have C look-alike constructs.

Even in the languages of the World Wide Web, C familiarity is extremely useful. JavaScript and PHP look familiar to one who is proficient in C.

The only other language which is entirely different from C from the bottom up, in my opinion is Lisp. Lisp is older than C and requires an entirely different way of thinking. There are possibly others too.

C balances economy of syntax, style and structure in programming exquisitely. It does have a few shortcomings, as its detractors would like to say. For example, C in the hands of a novice can spell disaster. It has also been accused of allowing programmers to create impenetrable code. But in the hands of an experienced programmer it is possible to create really robust code. UNIX and its variants are considered to be more resilient to hackers than other OSes.

C is really the soul of programming!


The language R

In the universe of programming languages there is a rising staR. It is moving fasteR and getting biggeR and brighteR!

Ok, you get the hint! It is the language R or the R Language.

The R language is the successor to the language S. R is extremely powerful for statistical computing and processing. It is an interpreted language, much like Python and Perl. The power of R comes from the 4000+ software packages that make it almost indispensable for any type of statistical computing.

In my opinion, R is soon going to play a central role in the technological world. In today’s world we are flooded with data from all sides. To make sense of this data deluge we need techniques like Big Data, analytics and machine learning. This is where R, with its numerous packages that make short work of data, becomes critical. R also has very interesting graphics packages to display the data in many forms for faster analysis and easier consumption.

The language R can easily ingest large sets of data in CSV format and perform many computations on them. R language is being used in machine learning, data mining, classification and clustering, text mining besides also being utilized in sentiment analysis from social networks.
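As a small illustration of the CSV round trip, here is a sketch that writes a dataset to a temporary CSV file and reads it back. It uses R's built-in iris data purely as convenient sample data, and the names path, flowers and mean_petal are illustrative.

```r
# Write the built-in iris data to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)

# Read it back; stringsAsFactors = TRUE turns the Species column into a factor
flowers <- read.csv(path, stringsAsFactors = TRUE)
str(flowers)

# A simple computation on one of the ingested columns
mean_petal <- mean(flowers$Petal.Length)
print(mean_petal)

unlink(path)   # remove the temporary file
```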

The R language contains the usual programming constructs, namely logical operators, loops, assignment etc. The language makes it easy to assign values to vectors, matrices and arrays and to perform all the associated operations on them.
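A few of these constructs in action, as a minimal sketch with made-up values. All variable names here are illustrative.

```r
# Vectors: element-wise arithmetic and logical indexing
v <- c(2, 4, 6, 8)
doubled <- v * 2
big <- v[v > 4]

# Matrices: filled column by column, multiplied with %*%
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix
mm <- m %*% t(m)             # 2 x 2 matrix

# A loop with a conditional: sum the multiples of 4
total <- 0
for (x in v) {
  if (x %% 4 == 0) total <- total + x
}

print(list(doubled = doubled, big = big, mm = mm, total = total))
```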

The R language can be installed from R-project. R comes with many datasets collected from various sources. One such dataset is the Iris dataset, which describes the iris plant (Iris is a genus of 260–300 species of flowering plants with showy flowers).

The dataset contains 5 parameters
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width
5. Species

This dataset has been used in many research papers. R allows you to easily perform any sophisticated set of statistical operations on this data set. Included below are a sample set of operations you can perform on the Iris dataset or any dataset

> iris[1:5,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

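> hist(iris$Sepal.Length)

[Figure: histogram of sepal length]

Here is a scatter plot of the petal width, sepal length and sepal width. It needs the scatterplot3d package, which has to be installed separately.

> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

[Figure: 3D scatter plot of petal width, sepal length and sepal width]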

As can be seen R can really make short work of data with the numerous packages that come along with it. I have just skimmed the surface of R language.
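As one more example of a one-line analysis, here is a sketch that computes the mean of every measurement per species and the correlation between petal length and petal width, using only base R. The names species_means and petal_cor are illustrative.

```r
# Mean of each measurement per species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Correlation between petal length and petal width
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
print(petal_cor)
```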

I hope this has whetted your appetite. Do give R a spin!

Watch this space!
