This is a detailed description of the initial setup to complete before attending the workshop on building a Recommendation Engine using Apache Spark running on Amazon EMR. The workshop uses Python 3 and the Spark DataFrames API. Amazon EMR comes with Python 3 installed on all the cluster nodes; we use the emr-5.5.0 release for this workshop and configure our EMR cluster to have Zeppelin & Spark installed at launch. Refer to the Cloudformation template for more details.
NB: You may complete all the steps listed here now, except launching the cluster; hold that off until much closer to the time you will be attending the workshop, so that you do not incur unnecessary charges for resources you are not yet using.
To clone this repository, issue the following command at your command prompt:
git clone http://github.com/OmarKhayyam/EC2Collection ./my_local_directory
Alternatively, you can use the Clone or download button to get the contents of this repository.
Do not change the names of the files; we will soon be uploading them to S3.
personalRatings.txt - This file contains a sample set of movies that user zero (0) has rated. To see which movies these are, have a look at the rateMovies script. The file is generated by the rateMovies script, or you can manually edit the ratings for each of the movies.
rateMovies - This script is used to rate movies, generate the personalRatings.txt file, and upload it to an S3 bucket of your choice. We run this script without any parameters (see the note after this list). If you are not able to run it, either make the necessary modifications to it based on the version of Python installed on your computer, or simply edit personalRatings.txt with your ratings for the movies and upload it to the S3 bucket you will create in the next step.
mySparkAppV3.py - This is the application we will eventually run to generate the recommendations.
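Since the script takes no parameters, running it is typically just the following (a minimal sketch, assuming Python 3 is on your PATH and you are in the directory you cloned the repository into):
python3 rateMovies
ls -l personalRatings.txt   # confirm the ratings file was (re)generated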
Create an S3 bucket with a name of your choice in the region of your choice. Note that this decides the region in which your EMR cluster will be launched; it will be the same as the region where the bucket is created. You can create an Amazon S3 bucket using the AWS Management Console or using the CLI; here is an example of creating the bucket using the AWS CLI.
aws s3 mb s3://your-bucket-name
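Because the bucket's region determines where the cluster runs, you may want to pass the region explicitly and confirm the bucket exists. A sketch, where us-east-1 is only an example region and your-bucket-name is a placeholder:
aws s3 mb s3://your-bucket-name --region us-east-1
aws s3 ls   # the new bucket should appear in this listing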
Change the following files to use the name of your new bucket instead of the string myBucket (a one-liner that makes both substitutions is sketched after this list).
rateMovies - Replace myBucket with the name of the bucket you just created above. If you are going to modify the personalRatings.txt file by hand, you do not need to modify this file.
mySparkAppV3.py - Replace myBucket with the name of the bucket you just created above.
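One quick way to make both substitutions, assuming GNU sed and a bucket called your-bucket-name (on macOS, use sed -i '' instead of sed -i):
sed -i 's/myBucket/your-bucket-name/g' rateMovies mySparkAppV3.py
grep -n 'your-bucket-name' rateMovies mySparkAppV3.py   # verify the replacements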
We will be using the Movielens 100K dataset that was previously published on the Movielens site (the site now hosts newer data). For this exercise, download the following files:
https://s3.ap-south-1.amazonaws.com/rnsdemowritable/movies.csv
https://s3.ap-south-1.amazonaws.com/rnsdemowritable/ratings.csv
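If you prefer the command line, the downloads can be scripted like so (assuming wget is installed; curl -O works equally well):
wget https://s3.ap-south-1.amazonaws.com/rnsdemowritable/movies.csv
wget https://s3.ap-south-1.amazonaws.com/rnsdemowritable/ratings.csv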
Upload the files to the bucket you just created above.
We should now upload all of the files we have discussed so far, including the Movielens dataset, to the newly created S3 bucket. You can use the CLI to upload your files, like so:
aws s3 cp <source-Filename> s3://myBucket
To use the CLI to upload the files, we need to ensure that the CLI is configured; check here for guidance on how to download and configure the AWS CLI.
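If you have not configured it yet, a minimal interactive setup looks like this (you will be prompted for your access key, secret key, default region, and output format):
aws configure
aws sts get-caller-identity   # quick sanity check that the credentials work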
OR,
We could instead use the AWS Management Console to upload all the files to our bucket; refer to this documentation for guidance. Whichever method you choose, the files to upload are the following (a consolidated CLI upload is sketched after this list):
* mySparkAppV3.py
* personalRatings.txt
* All the files from the Movielens Dataset.
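Putting it together, a CLI sketch that uploads everything in one go (your-bucket-name is a placeholder; adjust the paths if the files live elsewhere):
aws s3 cp mySparkAppV3.py s3://your-bucket-name/
aws s3 cp personalRatings.txt s3://your-bucket-name/
aws s3 cp movies.csv s3://your-bucket-name/
aws s3 cp ratings.csv s3://your-bucket-name/
aws s3 ls s3://your-bucket-name/   # confirm all four files are present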
If you are using an IAM User to create these default roles for an Amazon EMR cluster you intend to launch, your IAM user will require the following permissions (a sketch of a policy that grants them is shown after the list):
iam:CreateRole
iam:PutRolePolicy
iam:CreateInstanceProfile
iam:AddRoleToInstanceProfile
iam:ListRoles
iam:GetPolicy
iam:GetInstanceProfile
iam:GetPolicyVersion
iam:AttachRolePolicy
iam:PassRole
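For illustration only, here is one way to grant those permissions from the CLI: write them into a customer-managed policy and attach it to your IAM user. The policy name here is arbitrary, <your-iam-user> and <your-account-id> are placeholders, and your account may scope Resource more tightly than "*".
cat > emr-default-roles-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:PutRolePolicy",
        "iam:CreateInstanceProfile",
        "iam:AddRoleToInstanceProfile",
        "iam:ListRoles",
        "iam:GetPolicy",
        "iam:GetInstanceProfile",
        "iam:GetPolicyVersion",
        "iam:AttachRolePolicy",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy --policy-name EMRDefaultRolesSetup --policy-document file://emr-default-roles-policy.json
aws iam attach-user-policy --user-name <your-iam-user> --policy-arn arn:aws:iam::<your-account-id>:policy/EMRDefaultRolesSetup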
Creation of default roles using the AWS Management Console
These IAM roles are expected by the Cloudformation template we will use below to spin up our cluster. If you have ever launched an Amazon EMR cluster before and used default roles for the cluster, you will already have them; the default roles go by the names EMR_EC2_DefaultRole and EMR_DefaultRole. If you have never launched an EMR cluster in your AWS account, follow the instructions below to ensure that you have created these roles before we use the Cloudformation template:
1. Login to your AWS account and navigate to your IAM Management Console; this is what it will look like:

2. Click Roles (highlighted by the rectangle in the screen shot above). This will take you to the Roles screen; click Create new role at the top. You will be taken to a screen (see below) where you need to choose AWS Service Role as highlighted and then choose the role type; in our case, first select the role type named Amazon Elastic MapReduce (we will talk about the other role type shortly).

3. This will take you to the Attach Policy screen (refer to the screen shot below); choose the policy and click Next step.

4. In the next screen, give this role the name EMR_DefaultRole and click the Create role button at the bottom of your screen.

5. Follow steps 2 through 4 for the other role. For the role type, choose Amazon Elastic MapReduce for EC2 (ref. screen shot in step 2 above). As in step 3 above, you will be presented with a single policy; select it, and finally name the role EMR_EC2_DefaultRole and click Create role.
Creation of default roles using the AWS CLI
If you have your access key and secret key in place and have configured the AWS CLI on your laptop/machine/desktop, all you have to do to create these roles is issue the following command:
aws emr create-default-roles
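Either way, you can verify that the two roles now exist with:
aws iam get-role --role-name EMR_DefaultRole
aws iam get-role --role-name EMR_EC2_DefaultRole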
Now, open the Cloudformation console in the AWS Management Console and create the stack. The template will ask for some parameters; make sure you already have them. They are:
EC2 Keypair name: Read here how to create an EC2 key pair. This will be used to set up SSH from your laptop/desktop to the Master node.
Log URI: This is the location where EMR will store its logs, which is useful later for debugging any issues you may face. It should be of the form s3://myBucket/EMRLogs; as you can see, you may use prefixes wherever you see fit.
The Cloudformation template that we need to run is part of this Git repository; the filename is launchclusterV3.template. You can create the stack from the console as described below, or from the CLI as sketched after the list.
- Proceed to the Cloudformation service from the AWS Management Console.
- Click Create Stack.
- On the next page, choose Upload a template to Amazon S3. Browse to select the downloaded file.
- Provide a Stack name and then supply all the parameters the template requires to create the stack.
- Click Next and then review.
- Click Create.
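If you prefer the CLI, something along these lines should work; note that the parameter keys (KeyName, LogUri) and the stack name are assumptions here, so check the Parameters section of launchclusterV3.template for the exact names:
aws cloudformation create-stack \
  --stack-name emr-recommender-workshop \
  --template-body file://launchclusterV3.template \
  --parameters ParameterKey=KeyName,ParameterValue=<your-ec2-keypair> ParameterKey=LogUri,ParameterValue=s3://your-bucket-name/EMRLogs
aws cloudformation describe-stacks --stack-name emr-recommender-workshop --query 'Stacks[0].StackStatus'   # wait for CREATE_COMPLETE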
You can find the Master node's FQDN in the Outputs tab of your Cloudformation console. You can also find your EMR cluster in the Amazon EMR console. We will need to modify the Security group of the master node in the EMR cluster; you can find it by looking at the Cluster details and identifying the master node's security group, like so:
Click on the security group and add your IP address to the inbound list of allowed IP addresses and ports for SSH; for details, have a look at this.
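If you would rather do this from the CLI, a sketch (the security group ID and your public IP address are placeholders you need to fill in):
aws ec2 authorize-security-group-ingress --group-id <master-security-group-id> --protocol tcp --port 22 --cidr <your-public-ip>/32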
You can do this when you bootstrap the EMR cluster, but we will do it manually here. Notice that you have not set up your AWS credentials on any of the nodes of the cluster. Log into the master node, whose FQDN you obtained above, using ssh like so:
ssh -i ~/<path-to-keyfile> hadoop@<FQDN of the master node>
Once logged in, issue the following command:
aws s3 cp s3://myBucket/mySparkAppV3.py ./ && chmod u+x ./mySparkAppV3.py
Do not forget to replace the myBucket string in the above command with your specific bucket name. Exit the ssh session; you will not need to connect to the cluster again for the rest of this exercise.
You are now all set to run your recommender application:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXXX --steps Type=CUSTOM_JAR,Name="My Recommender",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-submit,--master,yarn,--deploy-mode,client,--conf,"spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048",--driver-memory,2g,--executor-memory,2g,/home/hadoop/mySparkAppV3.py]
You may, in some cases, have to specify the region.
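For example, you can either append --region <your-region> to the command above, or set a default region once for the CLI (us-west-2 here is only a placeholder):
aws configure set region us-west-2
aws configure get region   # confirm the default region the CLI will use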
Download the results from your S3 bucket to see the recommendations, like so:
aws s3 cp s3://myBucket/mySparkApp.out ./
