# Category: pySpark

# Big Data-2: Move into the big league:Graduate from R to SparkR

**Note 2**: You can download this RMarkdown file from Github at Big Data- Python to Pyspark and R to SparkR

1a. Read CSV- R

# Big Data-1: Move into the big league:Graduate from Python to Pyspark

This post discusses similar constructs in Python and Pyspark. As in my earlier post, R vs Python: Different similarities and similar differences, the focus is on the key and common constructs to highlight the similarities.

**Important Note**: You can also access this notebook at the Databricks public site, Big Data-1: Move into the big league:Graduate from Python to Pyspark (the formatting there is much better!).

For this notebook I have used Databricks community edition.

You can download the notebook from Github at Big Data-1:PythontoPysparkAndRtoSparkR

Hope you found this useful!

**Note**: There are still a few more important constructs which I will be adding to this post.

Also see

1. My book “Deep Learning from first principles” now on Amazon

2. My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

3. Re-introducing cricketr! : An R package to analyze performances of cricketers

4. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables

5. Deblurring with OpenCV: Weiner filter reloaded

6. Design Principles of Scalable, Distributed Systems

# My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

*Then felt I like some watcher of the skies*

*When a new planet swims into his ken;*

*Or like stout Cortez when with eagle eyes*

*He star’d at the Pacific—and all his men*

*Look’d at each other with a wild surmise—*

*Silent, upon a peak in Darien.*

On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keats’s poem captures the exhilaration that one experiences when discovering something for the first time. It also summarizes, to some extent, my own enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post, as occasionally youngsters approach me and ask me where they should start their adventure in Data Science & Machine Learning. There are other times, when the ‘not-so-youngsters’ want to know what their next step should be after having done some courses. This post includes my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done AI).

By no means am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating field. I include a short review of the courses I have done below. I also include alternative routes through courses which I did not do, but which are probably equally good. Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend bricks-n-mortar classes. In any case, I hope the list below will provide you with some overall direction.

All my learning in the above domains has come from MOOCs, and I restrict myself to the top 3 MOOCs, or in my opinion, ‘the original MOOCs’, namely Coursera, edX or Udacity, though I may throw in some courses from other online sites if they are only available there. I would recommend these 3 MOOCs over the numerous other online courses, and also over face-to-face classroom courses, for the following reasons:

- The courses are offered by world-class colleges, and the lectures are delivered by top-class Professors who have great depth of knowledge and a wealth of experience
- The Professors, besides delivering quality content, also point out important tips, tricks and traps
- You can revisit lectures in online courses
- Lectures are usually short, between 8 and 15 mins (personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning and something quite obvious. No amount of courses, lectures or books will help if you don’t put it to use through some language like Octave, R or Python.

**The journey**

My trip through Data Science and Machine Learning started with an off-chance remark, about 3 years ago, from an old friend of mine who told me that he had done a few courses at Coursera and had really liked them. He further suggested that I should try. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 3 years (33 certifications completed and another 9 audited, that is, listened to only, without doing the assignments). For each course I have included a short review, whether I think the course is mandatory, the language on which the course is based, and finally whether I have done the course myself. I have also included alternative courses, which I may not have done, but which I think are equally good. Finally, I suggest some courses which I have heard of and which are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera

(Requirement: Mandatory, Language: Octave, Status: Completed)

This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression. There is also good coverage of topics like Neural Networks, SVMs, Anomaly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up on some basics of linear algebra, matrices and a little bit of calculus, specifically computing the local maxima/minima. You should be able to take this course even if you don’t know Octave, as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford (Requirement: Mandatory, Language: R, Status: Completed)

The course includes linear and polynomial regression, logistic regression. Details also include cross-validation and the bootstrap methods, how to do model selection and regularization (ridge and lasso). It also touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are also discussed. The 2 Professors take turns in delivering lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, Johns Hopkins University (Requirement: Option A, Language: R, Status: Completed)

This is a comprehensive 10 module specialization based on R. This Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, and how to build R products and R packages, and the Specialization ends with a very good Capstone project on NLP.

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)

In this specialization I only did the Applied Machine Learning in Python (Prof Kevyn Collins-Thompson). This is a very good course that covers a lot of Machine Learning algorithms (linear, logistic, ridge and lasso regression, kNN, SVMs etc.). Also included are confusion matrices, ROC curves etc. This is based on Python’s Scikit Learn.

3c. Machine Learning Specialization, University Of Washington (Requirement: Option C, Language: Python, Status: Not completed). This appears to be a very good Specialization in Python.

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language: R, Status: Not completed)

I audited (listened only to) the following 2 modules from this Specialization.

a. Inferential Statistics

b. Linear Regression and Modeling

Both these courses are taught by Prof Mine Çetinkaya-Rundel, who delivers her lessons with extraordinary clarity. Her lectures are filled with many examples which she walks you through in great detail.

5. Bayesian Statistics: From Concept to Data Analysis: Univ of California, Santa Cruz (Requirement: Optional, Language: R, Status: Completed)

This is an interesting course and provides an alternative point of view to the frequentist approach.

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Anthony Joseph, Prof Ameet Talwalkar, Prof Jon Bates

(Requirement: Mandatory for Big Data, Language: pySpark, Status: Completed)

This specialization contains 3 modules

a. Introduction to Apache Spark

b. Distributed Machine Learning with Apache Spark

c. Big Data Analysis with Apache Spark

This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging, and your code will predominantly be made up of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked the part where the Prof shows how a matrix computation that is of the order of O(nd^2+d^3) on a single machine (and which is the basis of Machine Learning) is reduced to O(nd^2) by taking outer products on data which is distributed.
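The outer-product idea can be sketched in plain Python. This is an illustration under my own assumptions, not the course's code: each row x_i of an n × d matrix X contributes the d × d outer product of x_i with itself, and summing those products gives X^T X, so every worker only needs the rows it already holds. All function names here are hypothetical.

```python
from functools import reduce


def outer(x):
    """d x d outer product of one row with itself: O(d^2) work per row."""
    return [[a * b for b in x] for a in x]


def add_matrices(p, q):
    """Element-wise sum of two d x d matrices."""
    return [[a + b for a, b in zip(row_p, row_q)] for row_p, row_q in zip(p, q)]


def gram_by_outer_products(rows):
    """X^T X as a sum of per-row outer products (a map followed by a reduce)."""
    return reduce(add_matrices, map(outer, rows))


def gram_direct(rows):
    """X^T X computed entry by entry, kept only as a cross-check."""
    d = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]


X = [[1, 2], [3, 4], [5, 6]]
print(gram_by_outer_products(X))  # [[35, 44], [44, 56]]
print(gram_by_outer_products(X) == gram_direct(X))  # True
```

On an RDD of rows the same two helpers would slot into `rdd.map(outer).reduce(add_matrices)`, which is why the per-row work parallelizes so naturally.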

7. Deep Learning, Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh (Requirement: Mandatory, Language: Python, Tensorflow, Status: Completed)

This course has 5 modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini batch normalization, Convolutional Neural Networks, Recurrent Neural Networks and LSTMs, applied to a wide variety of real world problems.

The modules are

a. Neural Networks and Deep Learning

In this course Prof Andrew Ng explains differential calculus, linear algebra and vectorized Python implementations of Deep Learning algorithms. The derivation for back-propagation is done, and then the Prof shows how to compute a multi-layered DL network.

b. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Deep Neural Networks can be very flexible, and come with lots of knobs (hyper-parameters) to tune. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much one should tune them. The course also covers regularization (L1, L2, dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam, LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus, the module also includes a great introduction to Tensorflow.

c. Structuring Machine Learning Projects

A very good module with useful tips, tricks and traps that need to be considered while working on Machine Learning and Deep Learning projects.

d. Convolutional Neural Networks

This domain has a lot of really cool ideas, where images, represented as 3D volumes, are compressed and stretched longitudinally before a multi-layered deep learning neural network is applied to this thin slice for classification, detection etc. The Prof provides a glimpse into this fascinating world of image classification, detection and neural art transfer with frameworks like Keras and Tensorflow.

e. Sequence Models

This module covers in good detail concepts like RNNs, GRUs, LSTMs, word embeddings, beam search and the attention model.

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton,University of Toronto

(Requirement: Mandatory, Language: Octave, Status: Completed)

This is a broad course which starts from the basics of Perceptrons and goes all the way to Boltzmann Machines, RNNs, CNNs, LSTMs etc. The course also covers regularization, learning rate decay, the momentum method etc.

9. Probabilistic Graphical Models, Stanford, Prof Daphne Koller (Language: Octave, Status: Partially completed)

This has 3 courses

a. Probabilistic Graphical Models 1: Representation – Done

b. Probabilistic Graphical Models 2: Inference – To do

c. Probabilistic Graphical Models 3: Learning – To do

This course discusses how a system, which can be represented as a complex interaction of probability distributions, will behave. This is probably the toughest course I did. I did manage to get through the 1st module. While I felt that I grasped a few things, I did not wholly understand the import of this. However, I feel this is an important domain and I will definitely revisit it in future.

10. Mining Massive Data Sets, Prof Jure Leskovec, Prof Anand Rajaraman and Prof Jeff Ullman, Online Stanford (Status: Partially done)

I did quickly audit this course, a year back, when it used to be on Coursera. It now seems to have moved to Stanford online. This is a very good course that discusses key concepts of mining Big Data of the order of a few petabytes.

11. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity

This is a really good course. I have started on this course a couple of times and somehow gave up. Will revisit to complete it in future. Quite extensive in its coverage. Touches BFS, DFS, A-Star, PGM, Machine Learning etc.

12. Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain.

Got started on this one and abandoned it some time back. It is on my to-do list though.

My learning journey is based on Lao Tzu’s dictum of ‘A good traveler has no fixed plans and is not intent on arriving’. You could have a goal and try to plan your courses accordingly.

And so my journey continues…

I hope you find this list useful.

Have a great journey ahead!!!

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski

Some programmers, when confronted with a problem, think “I know, I’ll use floating point arithmetic.” Now they have 1.999999999997 problems. – @tomscott

Some people, when confronted with a problem, think “I know, I’ll use multithreading”. Nothhw tpe yawrve o oblems. – @d6

Some people, when confronted with a problem, think “I know, I’ll use versioning.” Now they have 2.1.0 problems. – @JaesCoyle

Some people, when faced with a problem, think, “I know, I’ll use binary.” Now they have 10 problems. – @nedbat

## Introduction

The power of Spark, which operates on in-memory datasets, comes from the fact that it stores data as collections called Resilient Distributed Datasets (RDDs), which are themselves distributed in partitions across the cluster. RDDs are a fast way of processing data, as the data is operated on in parallel based on the map-reduce paradigm. RDDs can be used when the operations are low level, and are typically used on unstructured data like logs or text. For structured and semi-structured data, Spark has a higher abstraction called Dataframes. Handling data through Dataframes is extremely fast as they are optimized using the Catalyst optimization engine, and the performance is orders of magnitude faster than RDDs. In addition, Dataframes also use Tungsten, which handles memory management and garbage collection more effectively.

The picture below shows the performance improvement achieved with Dataframes over RDDs

Benefits from Project Tungsten

Note: The above data and graph are taken from the course Big Data Analysis with Apache Spark at edX, UC Berkeley

This post is a continuation of my 2 earlier posts

1. Big Data-1: Move into the big league:Graduate from Python to Pyspark

2. Big Data-2: Move into the big league:Graduate from R to SparkR

In this post I perform equivalent operations on a small dataset using RDDs, Dataframes in Pyspark & SparkR, and HiveQL. As in some of my earlier posts, I have used the tendulkar.csv file for this post. The dataset is small and allows me to do almost everything, from data cleaning and data transformation to grouping.

You can clone or fork the notebooks from github at Big Data:Part 3

## 1. RDD – Select all columns of tables
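The code cell for this step did not survive in this copy of the post. As a minimal sketch (not necessarily what the notebook used), the file can be read as an RDD of text lines and each line split into its fields. The path `tendulkar.csv` and the helper names `parse_line` and `show_all_columns` are assumptions made for illustration.

```python
import csv


def parse_line(line):
    """Split one CSV line into a list of string fields."""
    return next(csv.reader([line]))


def show_all_columns(path="tendulkar.csv", n=5):
    """Return the first n parsed rows of the file as lists of strings.

    Needs a working PySpark installation; the path is an assumption.
    """
    from pyspark import SparkContext  # imported lazily so parse_line works without Spark

    sc = SparkContext.getOrCreate()
    rdd = sc.textFile(path).map(parse_line)  # one list of fields per line
    return rdd.take(n)


# parse_line can be tried on its own, without Spark:
print(parse_line("15,28,24,2,0,62.5,6,bowled,2,v Pakistan,Karachi,15-Nov-89"))
```

Calling `show_all_columns()` on a cluster or a local Spark install returns the header row followed by the first data rows, since `textFile` does not treat the first line specially.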

## 1b. RDD – Select columns 1 to 4

    [['Runs', 'Mins', 'BF', '4s'],
     ['15', '28', '24', '2'],
     ['DNB', '-', '-', '-'],
     ['59', '254', '172', '4'],
     ['8', '24', '16', '1']]

## 1c. RDD – Select specific columns 0, 10

    [('Ground', 'Runs'),
     ('Karachi', '15'),
     ('Karachi', 'DNB'),
     ('Faisalabad', '59'),
     ('Faisalabad', '8')]
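Both selections above amount to mapping every parsed row to a slice or a tuple. Here is a sketch with hypothetical helper names, run on a few rows copied from the outputs in this post; it is one way to get output of this shape, not necessarily the notebook's code.

```python
def first_four(row):
    """Columns 1 to 4 (indices 0..3): Runs, Mins, BF, 4s."""
    return row[0:4]


def ground_and_runs(row):
    """Columns 10 and 0, returned as a (Ground, Runs) tuple."""
    return (row[10], row[0])


rows = [
    ["Runs", "Mins", "BF", "4s", "6s", "SR", "Pos", "Dismissal",
     "Inns", "Opposition", "Ground", "Start Date"],
    ["15", "28", "24", "2", "0", "62.5", "6", "bowled",
     "2", "v Pakistan", "Karachi", "15-Nov-89"],
    ["DNB", "-", "-", "-", "-", "-", "-", "-",
     "4", "v Pakistan", "Karachi", "15-Nov-89"],
]

print([first_four(r) for r in rows])
print([ground_and_runs(r) for r in rows])

# On an RDD of parsed rows the same helpers plug straight into map():
#   rdd.map(first_four).take(5)
#   rdd.map(ground_and_runs).take(5)
```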

## 2. Dataframe:Pyspark – Select all columns

| Runs | Mins | BF | 4s | 6s | SR | Pos | Dismissal | Inns | Opposition | Ground | Start Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 28 | 24 | 2 | 0 | 62.5 | 6 | bowled | 2 | v Pakistan | Karachi | 15-Nov-89 |
| DNB | - | - | - | - | - | - | - | 4 | v Pakistan | Karachi | 15-Nov-89 |
| 59 | 254 | 172 | 4 | 0 | 34.3 | 6 | lbw | 1 | v Pakistan | Faisalabad | 23-Nov-89 |
| 8 | 24 | 16 | 1 | 0 | 50 | 6 | run out | 3 | v Pakistan | Faisalabad | 23-Nov-89 |
| 41 | 124 | 90 | 5 | 0 | 45.55 | 7 | bowled | 1 | v Pakistan | Lahore | 1-Dec-89 |

only showing top 5 rows

## 2a. Dataframe:Pyspark- Select specific columns

| Runs | BF | Mins |
|---|---|---|
| 15 | 24 | 28 |
| DNB | - | - |
| 59 | 172 | 254 |
| 8 | 16 | 24 |
| 41 | 90 | 124 |
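The Dataframe version of the two selections can be sketched as follows. The original cells are missing from this copy, so the path `tendulkar.csv`, the application name and the function names are assumptions; `spark.read.csv`, `select` and `show` are the standard PySpark calls.

```python
SELECTED = ["Runs", "BF", "Mins"]  # the columns shown in the output above


def load_dataframe(path="tendulkar.csv"):
    """Read the CSV into a Dataframe, using the first line as column names."""
    from pyspark.sql import SparkSession  # imported lazily; needs PySpark

    spark = SparkSession.builder.appName("tendulkar").getOrCreate()
    return spark.read.csv(path, header=True)


def show_selection(path="tendulkar.csv", columns=SELECTED, n=5):
    """Print the first n rows of all columns, then of the chosen columns."""
    df = load_dataframe(path)
    df.show(n)                   # every column
    df.select(*columns).show(n)  # only the requested columns
```

Without `inferSchema=True` every column is read as a string, which matches the `DNB` and `-` entries sitting alongside numbers in the output.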

## 3. Dataframe:SparkR – Select all columns

## 3a. Dataframe:SparkR- Select specific columns

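| | Runs | BF | Mins |
|---|---|---|---|
| 1 | 15 | 24 | 28 |
| 2 | DNB | - | - |
| 3 | 59 | 172 | 254 |
| 4 | 8 | 16 | 24 |
| 5 | 41 | 90 | 124 |
| 6 | 35 | 51 | 74 |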

## 4. Hive QL – Select all columns

| Runs | Mins | BF | 4s | 6s | SR | Pos | Dismissal | Inns | Opposition | Ground | Start Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 28 | 24 | 2 | 0 | 62.5 | 6 | bowled | 2 | v Pakistan | Karachi | 15-Nov-89 |
| DNB | - | - | - | - | - | - | - | 4 | v Pakistan | Karachi | 15-Nov-89 |
| 59 | 254 | 172 | 4 | 0 | 34.3 | 6 | lbw | 1 | v Pakistan | Faisalabad | 23-Nov-89 |
| 8 | 24 | 16 | 1 | 0 | 50 | 6 | run out | 3 | v Pakistan | Faisalabad | 23-Nov-89 |
| 41 | 124 | 90 | 5 | 0 | 45.55 | 7 | bowled | 1 | v Pakistan | Lahore | 1-Dec-89 |

## 4a. Hive QL – Select specific columns

| Runs | BF | Mins |
|---|---|---|
| 15 | 24 | 28 |
| DNB | - | - |
| 59 | 172 | 254 |
| 8 | 16 | 24 |
| 41 | 90 | 124 |
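One way to run such a query from Pyspark is to register the Dataframe as a temporary view and pass SQL text to `spark.sql`. This is a sketch under assumptions: the view name `tendulkar`, the file path and both function names are hypothetical, and the missing original cell may have used a `%sql` notebook cell instead. Column names are wrapped in backticks so that names such as `Start Date` or `4s` would also work.

```python
def build_select(columns, table, limit=None):
    """Build a SELECT statement with backtick-quoted column names."""
    cols = ", ".join("`{}`".format(c) for c in columns)
    query = "SELECT {} FROM {}".format(cols, table)
    if limit is not None:
        query += " LIMIT {}".format(int(limit))
    return query


def run_select(path="tendulkar.csv"):
    """Register the CSV as a temporary view and query it with SQL.

    Needs a working PySpark installation; path and view name are assumptions.
    """
    from pyspark.sql import SparkSession  # imported lazily; needs PySpark

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv(path, header=True)
    df.createOrReplaceTempView("tendulkar")
    return spark.sql(build_select(["Runs", "BF", "Mins"], "tendulkar", limit=5))


print(build_select(["Runs", "BF", "Mins"], "tendulkar", limit=5))
# SELECT `Runs`, `BF`, `Mins` FROM tendulkar LIMIT 5
```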

## 5. RDD – Filter rows on specific condition

    [['Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos', 'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
     ['15', '28', '24', '2', '0', '62.5', '6', 'bowled', '2', 'v Pakistan', 'Karachi', '15-Nov-89'],
     ['DNB', '-', '-', '-', '-', '-', '-', '-', '4', 'v Pakistan', 'Karachi', '15-Nov-89'],
     ['59', '254', '172', '4', '0', '34.3', '6', 'lbw', '1', 'v Pakistan', 'Faisalabad', '23-Nov-89'],
     ['8', '24', '16', '1', '0', '50', '6', 'run out', '3', 'v Pakistan', 'Faisalabad', '23-Nov-89']]

## 5a. Dataframe:Pyspark – Filter rows on specific condition

| Runs | Mins | BF | 4s | 6s | SR | Pos | Dismissal | Inns | Opposition | Ground | Start Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 28 | 24 | 2 | 0 | 62.5 | 6 | bowled | 2 | v Pakistan | Karachi | 15-Nov-89 |
| 59 | 254 | 172 | 4 | 0 | 34.3 | 6 | lbw | 1 | v Pakistan | Faisalabad | 23-Nov-89 |
| 8 | 24 | 16 | 1 | 0 | 50 | 6 | run out | 3 | v Pakistan | Faisalabad | 23-Nov-89 |
| 41 | 124 | 90 | 5 | 0 | 45.55 | 7 | bowled | 1 | v Pakistan | Lahore | 1-Dec-89 |
| 35 | 74 | 51 | 5 | 0 | 68.62 | 6 | lbw | 1 | v Pakistan | Sialkot | 9-Dec-89 |

only showing top 5 rows

## 5b. Dataframe:SparkR – Filter rows on specific condition

## 5c Hive QL – Filter rows on specific condition

| Runs | BF | Mins |
|---|---|---|
| 15 | 24 | 28 |
| 59 | 172 | 254 |
| 8 | 16 | 24 |
| 41 | 90 | 124 |
| 35 | 51 | 74 |
| 57 | 134 | 193 |
| 0 | 1 | 1 |
| 24 | 44 | 50 |
| 88 | 266 | 324 |
| 5 | 13 | 15 |

only showing top 10 rows

## 6. RDD – Find rows where Runs > 50
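No code or output survives for this step, so here is a hedged sketch of how such a filter could be written. Because Runs is read as text and contains entries like `DNB`, the predicate first converts the value and skips anything non-numeric (which also drops the header row). The helper names are hypothetical, and treating a trailing `*` as a not-out marker is my assumption about the data.

```python
def runs_value(row):
    """Runs (column 0) as an int, or None for entries such as 'DNB' or the header."""
    text = row[0].rstrip("*")  # assumption: a trailing '*' marks a not-out score
    return int(text) if text.isdigit() else None


def scored_more_than(row, threshold=50):
    """True when the row has a numeric Runs value above the threshold."""
    value = runs_value(row)
    return value is not None and value > threshold


sample = [
    ["Runs", "Mins", "BF", "4s"],
    ["15", "28", "24", "2"],
    ["DNB", "-", "-", "-"],
    ["59", "254", "172", "4"],
    ["8", "24", "16", "1"],
]

print([row for row in sample if scored_more_than(row)])
# [['59', '254', '172', '4']]

# On an RDD of parsed rows: rdd.filter(scored_more_than).take(5)
```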

## 6a. Dataframe:Pyspark – Find rows where Runs >50

from pyspark.sql import SparkSession

| Runs | Mins | BF | 4s | 6s | SR | Pos | Dismissal | Inns | Opposition | Ground | Start Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 254 | 172 | 4 | 0 | 34.3 | 6 | lbw | 1 | v Pakistan | Faisalabad | 23-Nov-89 |
| 57 | 193 | 134 | 6 | 0 | 42.53 | 6 | caught | 3 | v Pakistan | Sialkot | 9-Dec-89 |
| 88 | 324 | 266 | 5 | 0 | 33.08 | 6 | caught | 1 | v New Zealand | Napier | 9-Feb-90 |
| 68 | 216 | 136 | 8 | 0 | 50 | 6 | caught | 2 | v England | Manchester | 9-Aug-90 |
| 114 | 228 | 161 | 16 | 0 | 70.8 | 4 | caught | 2 | v Australia | Perth | 1-Feb-92 |
| 111 | 373 | 270 | 19 | 0 | 41.11 | 4 | caught | 2 | v South Africa | Johannesburg | 26-Nov-92 |
| 73 | 272 | 208 | 8 | 1 | 35.09 | 5 | caught | 2 | v South Africa | Cape Town | 2-Jan-93 |
| 50 | 158 | 118 | 6 | 0 | 42.37 | 4 | caught | 1 | v England | Kolkata | 29-Jan-93 |
| 165 | 361 | 296 | 24 | 1 | 55.74 | 4 | caught | 1 | v England | Chennai | 11-Feb-93 |
| 78 | 285 | 213 | 10 | 0 | 36.61 | 4 | lbw | 2 | v England | Mumbai | 19-Feb-93 |

## 6b. Dataframe:SparkR – Find rows where Runs >50

## 7 RDD – groupByKey() and reduceByKey()

    ('Lahore', 17.0),
    ('Adelaide', 32.6),
    ('Colombo (SSC)', 77.55555555555556),
    ('Nagpur', 64.66666666666667),
    ('Auckland', 5.0),
    ('Bloemfontein', 85.0),
    ('Centurion', 73.5),
    ('Faisalabad', 27.0),
    ('Bridgetown', 26.0)]
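The per-ground means above can be computed by pairing each row's runs with a count of 1, adding the pairs per key, and dividing at the end. With `reduceByKey` the pairs are combined inside each partition before the shuffle, whereas `groupByKey` ships every value across the network first. The sketch below uses hypothetical helper names and a plain-Python stand-in for the Spark pipeline; it runs on four rows taken from this post, so its means differ from the full-dataset values.

```python
def to_pair(row):
    """(Ground, (runs, 1)) for one parsed row with a numeric Runs value."""
    return (row[10], (int(row[0]), 1))


def add_pairs(a, b):
    """Combine two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mean_of(total_count):
    """Turn a (sum, count) pair into a mean."""
    total, count = total_count
    return total / count


def mean_runs_by_ground(rows):
    """Plain-Python stand-in for map -> reduceByKey -> mapValues."""
    sums = {}
    for ground, pair in map(to_pair, rows):
        sums[ground] = add_pairs(sums[ground], pair) if ground in sums else pair
    return {ground: mean_of(pair) for ground, pair in sums.items()}


rows = [
    ["15", "28", "24", "2", "0", "62.5", "6", "bowled", "2", "v Pakistan", "Karachi", "15-Nov-89"],
    ["59", "254", "172", "4", "0", "34.3", "6", "lbw", "1", "v Pakistan", "Faisalabad", "23-Nov-89"],
    ["8", "24", "16", "1", "0", "50", "6", "run out", "3", "v Pakistan", "Faisalabad", "23-Nov-89"],
    ["41", "124", "90", "5", "0", "45.55", "7", "bowled", "1", "v Pakistan", "Lahore", "1-Dec-89"],
]
print(mean_runs_by_ground(rows))

# On an RDD of parsed rows that has already been filtered to numeric Runs:
#   rdd.map(to_pair).reduceByKey(add_pairs).mapValues(mean_of).collect()
# The groupByKey() equivalent collects all values per key before averaging:
#   rdd.map(lambda r: (r[10], int(r[0]))).groupByKey() \
#      .mapValues(lambda v: sum(v) / len(v)).collect()
```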

## 7a Dataframe:Pyspark – Compute mean, min and max

| Ground | avg(Runs) | min(Runs) | max(Runs) |
|---|---|---|---|
| Bangalore | 54.3125 | 0 | 96 |
| Adelaide | 32.6 | 0 | 61 |
| Colombo (PSS) | 37.2 | 14 | 71 |
| Christchurch | 12.0 | 0 | 24 |
| Auckland | 5.0 | 5 | 5 |
| Chennai | 60.625 | 0 | 81 |
| Centurion | 73.5 | 111 | 36 |
| Brisbane | 7.666666666666667 | 0 | 7 |
| Birmingham | 46.75 | 1 | 40 |
| Ahmedabad | 40.125 | 100 | 8 |
| Colombo (RPS) | 143.0 | 143 | 143 |
| Chittagong | 57.8 | 101 | 36 |
| Cape Town | 69.85714285714286 | 14 | 9 |
| Bridgetown | 26.0 | 0 | 92 |
| Bulawayo | 55.0 | 36 | 74 |
| Delhi | 39.94736842105263 | 0 | 76 |
| Chandigarh | 11.0 | 11 | 11 |
| Bloemfontein | 85.0 | 15 | 155 |
| Colombo (SSC) | 77.55555555555556 | 104 | 8 |
| Cuttack | 2.0 | 2 | 2 |

only showing top 20 rows
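Several rows in the output above show a min(Runs) that is larger than max(Runs), for example Centurion with 111 and 36. That pattern is what string comparison produces: Runs was read as text, so `min` and `max` compare character by character, while `avg` converts to a number. The sketch below shows the effect in plain Python and one possible fix, casting Runs to an integer before aggregating. The function names and path are assumptions, and the cast relies on Spark's default (non-ANSI) behaviour, where unparseable entries such as `DNB` become null.

```python
def string_min_max(values):
    """min and max using text comparison, as on a string column."""
    return min(values), max(values)


def numeric_min_max(values):
    """min and max after converting the values to integers."""
    numbers = [int(v) for v in values]
    return min(numbers), max(numbers)


centurion = ["111", "36"]
print(string_min_max(centurion))   # ('111', '36')
print(numeric_min_max(centurion))  # (36, 111)


def ground_stats(path="tendulkar.csv"):
    """Mean, min and max of Runs per Ground with Runs cast to int first.

    Needs a working PySpark installation; the path is an assumption.
    """
    from pyspark.sql import SparkSession, functions as F  # imported lazily

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv(path, header=True)
    df = df.withColumn("Runs", F.col("Runs").cast("int"))
    return df.groupBy("Ground").agg(F.avg("Runs"), F.min("Runs"), F.max("Runs"))
```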

## 7b Dataframe:SparkR – Compute mean, min and max

Also see

1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

2. My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

3. The Clash of the Titans in Test and ODI cricket

4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R

5. Latency, throughput implications for the Cloud

6. Simulating a Web Joint in Android

7. Pitching yorkpy … short of good length to IPL – Part 1

To see all posts click Index of Posts