
Workshop on Recommendation Engine using Apache Spark on Amazon Elastic MapReduce (EMR)

This is a detailed description of the initial setup to complete before attending the workshop on building a Recommendation Engine with Apache Spark running on Amazon EMR. The workshop uses Python 3 and the Spark DataFrames API. Amazon EMR ships with Python 3 installed on all cluster nodes; we use the emr-5.5.0 release for this workshop and configure the EMR cluster to have Zeppelin and Spark installed at launch. Refer to the CloudFormation template for more details.

NB: You may complete all the steps listed here in advance, except launching the cluster, which you should defer until much closer to the time of the workshop; that way you will not incur unnecessary charges for idle resources.

Clone this Git repository.

To clone this repository, issue the following command at your command prompt,

git clone http://github.com/OmarKhayyam/EC2Collection ./my_local_directory

Alternatively, you can use the Clone or download button on GitHub to get the contents of this repository.

Do not change the names of the files; we will be uploading them to S3 shortly.

personalRatings.txt - This file contains a sample set of movies that user zero (0) has rated. To see which movies these are, have a look at the rateMovies script. The file is generated by the rateMovies script, or you can manually edit the ratings for each of the movies.

rateMovies - This is the script used to rate movies; it generates the personalRatings.txt file and uploads it to an S3 bucket of your choice. We run this script without any parameters. If you are not able to run it, either make the necessary modifications to it based on the version of Python installed on your computer, or simply edit personalRatings.txt with your own ratings for the movies and upload it to the S3 bucket you will create in the next step.

mySparkAppV3.py - This is the application we will eventually run to generate the recommendations.

Creating a bucket

Create a bucket with a name of your choice in the region of your choice. Note that this decides the region in which your EMR cluster will be launched: it will be the same as the region where the bucket is created. You can create an Amazon S3 bucket using the AWS Management Console or using the CLI; here is an example of creating the bucket using the AWS CLI.

aws s3 mb s3://your-bucket-name
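
If your default CLI region is not the region you want the bucket (and therefore the EMR cluster) in, you can pass the region explicitly; the region name below is only an example, so substitute your own:

aws s3 mb s3://your-bucket-name --region ap-south-1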

Prepare the scripts for use later

Change the following files to use the name of your new bucket instead of the string myBucket (a one-liner for doing both substitutions is sketched after the file notes below).

rateMovies - Replace myBucket with the name of the bucket you just created above. If you are going to modify the personalRatings.txt file by hand, you do not need to modify this file.

mySparkAppV3.py - Replace myBucket with the name of the bucket you just created above.
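
If sed is available on your machine, one quick way to make both substitutions in a single pass is shown below; your-bucket-name is a placeholder for the bucket you created, and on macOS/BSD sed you would use -i '' instead of -i:

sed -i 's/myBucket/your-bucket-name/g' rateMovies mySparkAppV3.py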

Getting the Data

We will be using the dataset that was previously published on the Movielens site (this site, Movielens 100K dataset, now hosts newer data). For this exercise, download the following files:

https://s3.ap-south-1.amazonaws.com/rnsdemowritable/movies.csv

https://s3.ap-south-1.amazonaws.com/rnsdemowritable/ratings.csv
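
If you prefer the command line, you can fetch both files with curl (or wget), for example:

curl -O https://s3.ap-south-1.amazonaws.com/rnsdemowritable/movies.csv
curl -O https://s3.ap-south-1.amazonaws.com/rnsdemowritable/ratings.csv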

Upload the files to the bucket you just created above.

Uploading the files

We should now upload all of the files discussed so far, including the Movielens dataset, to the newly created S3 bucket. You can use the CLI to upload your files, like so,

aws s3 cp <source-Filename> s3://myBucket
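
For example, assuming the files are in your current directory and your-bucket-name is the bucket you created above, the individual uploads would look like this:

aws s3 cp mySparkAppV3.py s3://your-bucket-name/
aws s3 cp personalRatings.txt s3://your-bucket-name/
aws s3 cp movies.csv s3://your-bucket-name/
aws s3 cp ratings.csv s3://your-bucket-name/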

To use the CLI to upload the files, we will need to ensure that the CLI is configured; check here for guidance on how to download and configure the AWS CLI.

OR,

We could use the AWS Management Console to upload all the files to our bucket; refer to this documentation for guidance on using the AWS Management Console to upload your files.

At this point, the following files should be in the newly created S3 bucket:

  • mySparkAppV3.py

  • personalRatings.txt

  • All the files from the Movielens dataset (movies.csv and ratings.csv)

At this point, you should have the default IAM roles in place.

If you are using an IAM User to create these default roles for an Amazon EMR cluster you intend to launch, your IAM user will require the following permissions (a quick way to check the policies attached to your user is sketched after the list):

iam:CreateRole

iam:PutRolePolicy

iam:CreateInstanceProfile

iam:AddRoleToInstanceProfile

iam:ListRoles

iam:GetPolicy

iam:GetInstanceProfile

iam:GetPolicyVersion

iam:AttachRolePolicy

iam:PassRole
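
One quick sanity check is to list the managed policies attached to your IAM user; your-iam-user is a placeholder, and note that permissions granted through group membership or inline policies will not appear in this listing:

aws iam list-attached-user-policies --user-name your-iam-user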

Creation of default roles using the AWS Management Console

These IAM roles are expected by the CloudFormation template we will use below to spin up our cluster. If you have ever launched an Amazon EMR cluster before and used the default roles, you will already have them; they go by the names EMR_EC2_DefaultRole and EMR_DefaultRole. If you have never launched an EMR cluster in your AWS account, follow the instructions below to ensure that these roles exist before we use the CloudFormation template,

  1. Log in to your AWS account and navigate to the IAM Management Console; this is what it will look like,

  2. Click Roles (highlighted by the rectangle in the screen shot above); this will take you to the Roles screen. Click Create new role at the top, and you will be taken to a screen (see below) where you need to choose AWS Service Role as highlighted and then choose the role type. In our case, first select the role type named Amazon Elastic MapReduce (we will talk about the other role type shortly).

  3. This will take you to the Attach Policy screen (refer to the screen shot below); choose the policy and click Next step.

  4. In the next screen, give this role the name EMR_DefaultRole and click the Create role button at the bottom of your screen.

  5. Follow steps 2 through 4 for the other role; for the role type, choose Amazon Elastic MapReduce for EC2 (ref. screen shot in point 2 above). As in point 3, you will be presented with a single policy; select it, name the role EMR_EC2_DefaultRole, and click Create role.

Creation of default roles using the AWS CLI

If you have your access key and secret key in place and have configured the AWS CLI on your laptop/desktop, all you have to do to create these roles is issue the following command:

aws emr create-default-roles
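
Whether you created the roles through the console or the CLI, you can verify that they exist with, for example:

aws iam get-role --role-name EMR_DefaultRole
aws iam get-role --role-name EMR_EC2_DefaultRole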

Run the CloudFormation template

Now, open the CloudFormation console in the AWS Management Console and create the stack. The template will ask for some parameters; make sure you already have them at hand. They are,

EC2 Keypair name : Read here how to create an EC2 key pair. This will be used to set up SSH access from your laptop/desktop to the Master node.

Log URI : This is the location where EMR will store its logs, which is useful later for debugging issues you may face. It should be of the form s3://myBucket/EMRLogs; as you can see, you may use prefixes wherever you see fit.

The CloudFormation template that we need to run is part of this Git repository; the filename is launchclusterV3.template. If you prefer the CLI, a sketch of creating the stack from the command line follows the console steps below.

  1. Proceed to the Cloudformation service from the AWS Management Console.
  2. Click Create Stack.
  3. On the next page, choose Upload a template to Amazon S3. Browse to select the downloaded file.
  4. Provide a Stack name and then provide all the required parameters the template requires to create the stack.
  5. Click Next and then review.
  6. Click Create.
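
As an alternative to the console steps above, the stack could be created from the CLI. This is only a sketch: the stack name and the parameter keys (KeyName, LogUri) are assumptions here, so check the Parameters section of launchclusterV3.template for the actual names your template expects, and add --capabilities CAPABILITY_IAM only if the template creates IAM resources.

aws cloudformation create-stack \
  --stack-name emr-recommender-workshop \
  --template-body file://launchclusterV3.template \
  --parameters ParameterKey=KeyName,ParameterValue=your-ec2-keypair ParameterKey=LogUri,ParameterValue=s3://your-bucket-name/EMRLogs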

After the stack is created

You can find the Master node's FQDN in the Outputs tab of your CloudFormation console. You can also find your EMR cluster in the Amazon EMR console. We will need to modify the security group of the master node in the EMR cluster; you can find it by looking at the Cluster details and identifying the master node's security group, like so,

Click on the security group and add your IP address to the inbound list of allowed IP addresses and ports to permit SSH; for details, have a look at this.
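
If you prefer the CLI, the equivalent inbound rule can be added as shown below; the security group ID and the IP address are placeholders for your master node's security group and your own public IP:

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 \
  --cidr 203.0.113.10/32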

Download the Spark application into the cluster

You could do this when you bootstrap the EMR cluster, but we will do it manually here. Notice that you have not set up any AWS credentials on the cluster nodes; the EC2 instance role attached to the cluster (EMR_EC2_DefaultRole) is what allows the aws command below to read from S3. Log into the master node, whose FQDN you obtained above, using ssh, like so,

ssh -i ~/<path-to-keyfile> hadoop@<FQDN of the master node>

Once logged in, issue the following command,

aws s3 cp s3://myBucket/mySparkAppV3.py ./ && chmod u+x ./mySparkAppV3.py

Do not forget to replace the myBucket string in the above command with your specific bucket name. Exit the ssh session; you will not need to connect to the cluster again for the rest of this exercise.

Add a step to your EMR cluster

You are now all set to run your recommender application,

aws emr add-steps --cluster-id j-XXXXXXXXXXXXXX --steps Type=CUSTOM_JAR,Name="My Recommender",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-submit,--master,yarn,--deploy-mode,client,--conf,"spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048",--driver-memory,2g,--executor-memory,2g,/home/hadoop/mySparkAppV3.py]

You may, in some cases, have to specify the region.
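
The --region flag can simply be appended to the add-steps command above (for example, --region ap-south-1). Once the step has been submitted, you can watch its state from the CLI; the cluster ID and region below are placeholders:

aws emr list-steps --cluster-id j-XXXXXXXXXXXXXX --region ap-south-1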

The Results

Download the results from your S3 bucket to see the recommendations, like so,

aws s3 cp s3://myBucket/mySparkApp.out ./
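
If the download fails or you want to confirm that the application finished and wrote its output, you can list the bucket contents first (remember to replace myBucket with your bucket name):

aws s3 ls s3://myBucket/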
