Wrangle_Data_Project

I started the wrangling process by gathering the three datasets I needed. The first dataset (twitter_archive_enhanced.csv) was downloaded manually by clicking the provided link and saving the file to a local device, after which I loaded it into the Jupyter notebook I used for the wrangling process. The second dataset (image_predictions.tsv) was downloaded programmatically from Udacity's servers with the Requests library, using the provided URL, and then loaded into the same notebook. The third dataset (tweet_json.txt) came from the Twitter API, which I had to apply for permission to access due to privacy reasons. Once I had permission, I queried the Twitter API for each tweet's JSON data using Python's Tweepy library and stored each tweet's entire set of JSON data in a ‘tweet_json.txt’ file. I then read the .txt file line by line into a pandas DataFrame with the tweet ID, retweet count, and favorite count columns.
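A minimal sketch of the gathering step, not the exact notebook code. It assumes tweet_json.txt holds one JSON object per line and that each object carries the Twitter API fields `id`, `retweet_count`, and `favorite_count`. The helper names `download_file` and `tweets_to_frame`, and the sample lines, are made up for illustration.

```python
import json

import pandas as pd


def download_file(url, path):
    """Save the body of an HTTP response to a local file (hypothetical helper)."""
    import requests  # imported here so the parser below works without Requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(path, "wb") as handle:
        handle.write(response.content)


def tweets_to_frame(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Made-up sample lines standing in for the contents of tweet_json.txt
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
tweet_counts = tweets_to_frame(sample_lines)
```

With the real file, the same function can take the open file object directly, for example `with open("tweet_json.txt") as f: tweet_counts = tweets_to_frame(f)`.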

I proceeded to the next wrangling step, ‘assessing’ all the datasets, which I did both visually and programmatically. I opened the twitter_archive_enhanced.csv file in MS Excel to assess it visually and opened the other two datasets in a Jupyter notebook. Next, I used pandas functions such as ‘.info()’, ‘.describe()’, and ‘.head()’ to detect issues, and documented eight (8) quality issues and two (2) tidiness issues.
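A short sketch of what the programmatic assessment can look like. The DataFrame here is made-up data with illustrative column names, and the null and duplicate checks are additional common pandas checks, not necessarily the ones used in the notebook.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 4],
        "name": ["Charlie", "a", "a", None],
        "rating_denominator": [10, 10, 10, 0],
    }
)

archive.info()             # column dtypes and non-null counts
print(archive.head())      # first rows, for a quick visual check
print(archive.describe())  # summary statistics for numeric columns

# Two further checks that often surface quality issues
missing_per_column = archive.isnull().sum()
duplicate_ids = archive["tweet_id"].duplicated().sum()
```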

I proceeded to the last wrangling step, ‘cleaning’. The first thing I did was make copies of the datasets, which is good cleaning practice because it keeps the original data as a backup in case I need to reverse a wrong action or change. I then implemented the cleaning by converting each documented quality and tidiness issue into code, using pandas functions such as ‘.drop()’ to trim and rearrange the dataset columns and rows, remove what was not needed, and correct some errors.
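A minimal sketch of the copy-then-clean pattern described above. The data is made up, and removing retweet rows is only a hypothetical example of an issue; the actual issues addressed are the ones documented in the notebook.

```python
import pandas as pd

# Made-up rows; a non-null retweeted_status_id marks a retweet in this example
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "text": ["good dog", "RT good dog", "great dog"],
        "retweeted_status_id": [None, 99.0, None],
    }
)

# Work on a copy so the original stays available as a backup
archive_clean = archive.copy()

# Drop the rows that are retweets
is_retweet = archive_clean["retweeted_status_id"].notnull()
archive_clean = archive_clean.drop(index=archive_clean[is_retweet].index)

# Drop the column that is no longer needed
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])
```

Because `.drop()` returns a new DataFrame by default, `archive` is left unchanged and can be compared against `archive_clean` after each step.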

Lastly, I merged the three cleaned datasets into one master dataset and stored it in a CSV file called ‘twitter_archive_master.csv’ using the pandas ‘.to_csv()’ function.
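The merge and export can be sketched as follows. This assumes the three cleaned tables share a `tweet_id` key and that an inner join is wanted; the join type, the variable names, and the sample columns are assumptions for illustration. The file is written to a temporary directory here so the example does not overwrite anything.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for the three cleaned datasets
archive_clean = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
images_clean = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "corgi"]})
counts_clean = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 8]})

# Inner joins keep only the tweets present in all three tables
master = archive_clean.merge(images_clean, on="tweet_id", how="inner").merge(
    counts_clean, on="tweet_id", how="inner"
)

# index=False keeps the row index out of the saved file
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv(out_path)
```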

