BUS5DWR – Data Wrangling and R

Assignment 02: Data wrangling with R

Marks: 50 (will be scaled to 30)

Assignment Type: Individual

 

 

Overview

Over the past few weeks, you have learned how to use R to wrangle business data. This assignment gives you an opportunity to demonstrate your R skills for data wrangling. Using the tidyverse package is recommended but not compulsory.

Please carefully read the entire assignment to make sure you understand the requirements and also the submission format and marking rubrics before starting.

 

Academic Integrity

Plagiarism occurs when you use words, ideas, or work products attributable to another identifiable person or source:

•  without attributing the work to the source from which it was obtained

•  in a situation in which there is a legitimate expectation of original authorship

•  in order to obtain some benefit, credit, or gain which need not be monetary

Collusion is a form of cheating which occurs when people work together in a deceitful way to develop a submission for an assessment which has been restricted to individual effort.

                               

By submitting* this piece of work and signing this document, I declare that:

1.      The work is my own individual work.

2.      I have not previously submitted all or part of this work for assessment in any subject, unless the subject coordinator for the current subject (or my research supervisor, if applicable) has given me written permission to reuse specific material and I have correctly referenced the material taken from my own earlier work.

3.      I have read and agree to be bound by the Statutes, Regulations and Policies of the University relating to Academic Integrity available at http://www.latrobe.edu.au/students/academic-integrity; and

4.      I may be subject to student discipline processes in the event of an act of academic misconduct by me including an act of plagiarism or cheating.

I further grant to the University or any third party authorised by the University (www.latrobe.edu.au/text-match) the right to reproduce and/or communicate (make available online or electronically transmit) the work I have submitted for the purpose of detecting plagiarism.

 

Assignment Requirements

Part 1 [30 marks]

The given data files are female_players_23_updated.csv and Nationality.csv. The first file records information on female soccer players across the world in 2023, extracted from https://www.kaggle.com/datasets/joebeachcapital/fifa-players. The second file records the nationality id and the nationality name within FIFA records.

Write R code in an Rmd file to answer the following questions. When a question asks you to show/display information (on screen), the dataframe should remain unchanged. Each question should be presented in one code chunk.

 

1.1. Load the datasets from the given files into two dataframes. Rename the columns to remove any spaces in the column names (Hint: use str_replace_all to do this automatically for all columns). Make sure date values are stored with the correct type. Show a summary of each dataframe (including statistics of each column). (6 marks)

1.2. Write R code to investigate whether there are duplicate rows in the players dataframe; if so, report how many there are, display them and then remove them (Hint: check the duplicated() function). Display the number of rows before and after removal. Modify the “value euro” column so that null values are replaced by “NA”, and change its type to numeric if needed. (6 marks)

1.3. Choose a nationality and a preferred foot of interest to you, and write R code to display the players who are from this country, play with your chosen preferred foot and have potential above 85 in FIFA version 23 and update 9. Your result should not be empty. (2 marks)

1.4. Display players whose position in their national team (nation position) is “GK” (Goalkeeper) in FIFA version 23 and update 9. Display the player id, short name, nation position, age and international reputation, in descending order of international reputation and ascending order of age. Only show the top 5 rows. (3 marks)

1.5. Find the top three nationalities with the highest number of players whose international reputation is greater than or equal to 3, based on FIFA version 23 and update 9. For each country, display the number of players and the average overall score. (3 marks)

1.6. Draw a boxplot to compare the distribution of the overall scores for players who are from the United States and Australia, based on FIFA version 23 and update 9. What are the five values in each boxplot? Write a short paragraph (less than 100 words) to describe your insights. (6 marks)

1.7. To help a coach choose a good goalkeeper (“GK”) for their team, you are required to: (4 marks)

a)      Propose a ranking with at least 2 criteria to rank the players that best suit their team (provide justification). (1 mark)

b)      Then write R code to find the top 5 goalkeepers based on your proposed ranking (restrict your data to FIFA version 23 and update 9). (2 marks)

c)      Write a short paragraph to discuss your insights and recommendations. (1 mark)
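The hints in 1.1 and 1.2 can be tried out on a small made-up table before touching the real files. The sketch below is only an illustration: the table `toy`, its column names and values, and the assumption that nulls arrive as empty strings are all invented here and may not match the supplied data.

```r
library(stringr)

# Toy stand-in for a players table; every name and value here is invented.
toy <- data.frame(
  `player id`   = c(1, 2, 2, 3),
  `value euro`  = c("1000", "", "", "2500"),
  `date joined` = c("2021-07-01", "2022-01-15", "2022-01-15", "2020-03-09"),
  check.names = FALSE,          # keep the spaces so there is something to fix
  stringsAsFactors = FALSE
)

# Replace every space in every column name in one call.
names(toy) <- str_replace_all(names(toy), " ", "_")

# Store dates as Date rather than text (assumes ISO year-month-day strings).
toy$date_joined <- as.Date(toy$date_joined)

# duplicated() flags each row that repeats an earlier row.
n_before   <- nrow(toy)
dup_flag   <- duplicated(toy)
n_dups     <- sum(dup_flag)     # how many duplicate rows
dup_rows   <- toy[dup_flag, ]   # the duplicate rows themselves
toy_unique <- toy[!dup_flag, ]  # first occurrences only
n_after    <- nrow(toy_unique)

# Assumption: nulls arrive as empty strings. Map them to NA, then convert.
toy_unique$value_euro[toy_unique$value_euro == ""] <- NA
toy_unique$value_euro <- as.numeric(toy_unique$value_euro)

summary(toy_unique)
```

Because the de-duplicated rows are written to a new object, the original table stays unchanged, which matters whenever a question only asks you to display a result.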

 
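For the "five values" asked about in 1.6, it may help to see where a boxplot's numbers come from. The following is a minimal base R sketch on invented scores for two hypothetical groups, not the assignment data; `scores`, `group` and `overall` are names made up for the illustration.

```r
# Invented scores for two hypothetical groups.
scores <- data.frame(
  group   = rep(c("A", "B"), each = 5),
  overall = c(60, 65, 70, 75, 80, 50, 55, 60, 65, 90)
)

# boxplot() draws the plot and invisibly returns the statistics it used.
bp <- boxplot(overall ~ group, data = scores, ylab = "Overall score")

# One column per group. Rows, top to bottom: lower whisker end,
# lower hinge, median, upper hinge, upper whisker end.
bp$stats

# Points drawn individually as outliers (beyond 1.5 x IQR from the hinges).
bp$out
```

In group B the value 90 lies more than 1.5 times the interquartile range above the upper hinge, so it is drawn as an outlier and the upper whisker stops at 65. The whisker ends are therefore not always the minimum and maximum of the data.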

Part 2 [20 marks]

The given Excel file is named Players.xlsx with two worksheets. The Data worksheet records the number of players for each country for each position in each year and their average overall score. The second worksheet, which is Continent, records information about countries in each continent.

You will see that the data is far from being ready for analysis and needs to be ‘wrangled’. You are required to write R code to perform the following steps. When a step asks you to show/display information (on screen), the dataframe should remain unchanged.

 

2.1. Load the data from the Data worksheet into an R dataframe. Rename the columns to remove the word “Year” in the column names. Use glimpse to show the information of the data frame. (2 marks)

 

2.2. Transform the dataframe: (8 marks)

 

a)      Use pivot_longer to transform the dataframe into three columns, namely “Country / Position”, “Year”, and “Value”. Drop all rows having NA in Value. (2 marks)

 

b)      Split the first column into two columns and give meaningful column names to them. (1 mark)

 

c)      Split column “Value” into two columns, namely “NumberPlayers” and “Score”. Remove the string “/100” in the Score column. Make sure they have the correct data types. (2 marks)

 

d)      Display the number of columns and rows after transformation. (1 mark)

 

e)      Show the number of distinct countries and distinct years. (2 marks)

 

2.3. Which countries have an average score across all positions in 2023 between 65 and 70? (2 marks)

2.4. Load the data from the Continent worksheet. Rename the columns to “Country” and “Continent”. How many countries are there in total, and how many do not appear in the Data worksheet? (2 marks)

 

2.5. To help FIFA choose an African country to further improve soccer in that country, you are required to: (6 marks)

 

a)      Propose a criterion to rank the African countries (with justification). (1 mark)

 

b)      Then write R code to find the top five countries based on your proposed ranking. (2 marks)

 

c)      Draw a column/bar chart to compare these countries in terms of the measure you used for ranking. Order the result from the highest to the lowest value. (2 marks)

 

d)      Write a short paragraph (less than 100 words) to describe your insights. (1 mark)
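The reshaping steps named in 2.2 (pivot_longer, splitting columns, fixing types) can be rehearsed on a tiny invented table first. This is a sketch under stated assumptions: the table `wide`, the country names, the " / " and ", " separators and the layout of the cell values are all made up here, and the real worksheet may be laid out differently.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented wide table; the real worksheet's separators and layout may differ.
wide <- tibble(
  `Country / Position` = c("Aland / GK", "Borduria / ST"),
  `2022` = c("3, 61/100", NA),
  `2023` = c("4, 63/100", "2, 70/100")
)

long <- wide %>%
  # One row per (country/position, year) pair.
  pivot_longer(cols = -`Country / Position`,
               names_to = "Year", values_to = "Value") %>%
  drop_na(Value) %>%
  # Split the combined columns on the separators assumed above.
  separate(`Country / Position`, into = c("Country", "Position"), sep = " / ") %>%
  separate(Value, into = c("NumberPlayers", "Score"), sep = ", ") %>%
  # Strip the "/100" suffix and convert text to numbers.
  mutate(
    Year          = as.integer(Year),
    NumberPlayers = as.integer(NumberPlayers),
    Score         = as.numeric(str_remove(Score, "/100"))
  )

dim(long)                 # number of rows, then number of columns
n_distinct(long$Country)  # distinct countries
n_distinct(long$Year)     # distinct years
```

The result is assigned to a new object (`long`), so `wide` itself is left as loaded. `separate()` is used here for brevity; newer tidyr releases also offer `separate_wider_delim()` for the same job.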

 

Submission Guidelines

You have to submit A SINGLE file (.Rmd) comprising all the code to answer all the questions of the two parts in the given order.

 

Each question is one code chunk. PUT the question number before the code chunk. DO NOT include the question description (to avoid a high Turnitin similarity score).

 

When writing your code, keep the data files in the same directory as your notebook so that you DO NOT specify directories or file paths in your code. This allows us to run your code smoothly on our device.

 

Marks will be deducted if your submission does not follow the guidelines.

 

Marking Rubrics

For each question, full marks will be awarded for correct answers that run without errors. For open questions, full marks are given if all parts are answered, well justified and run without errors. Half of the marks will be given for an answer that is close.

Marks will be deducted if the R code does not work smoothly on the marker’s RStudio installation and we need to offer you an opportunity to show us that it does work on your installation. This means all paths or references to directories have to be removed and the packages being used have to be specified clearly. It is assumed that tidyverse and readxl have been installed on our device. If you use other packages, make sure to include commands to install and load them.
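One possible way to meet the package requirement is a small helper at the top of the notebook that installs a package only when it is missing and then loads it. This is a sketch, not a prescribed template: `ensure_packages` is a hypothetical helper name, and the base packages passed to it below are placeholders chosen only so the example runs on any installation.

```r
# Hypothetical helper: install each package only if it is missing, then load it.
ensure_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Placeholder call using packages that ship with R.
loaded <- ensure_packages(c("stats", "utils"))

# Data files sit next to the notebook, so they are read by bare file name
# with no directory prefix, for example:
# players <- readr::read_csv("female_players_23_updated.csv")
```

In a real submission the placeholder names would be replaced by whichever extra packages the notebook actually uses.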

Submissions having a high similarity in the answers with another submission will be considered as plagiarism/collusion.