7 💻 Essentials of Data Wrangling in R
Data wrangling, also called data cleaning, covers the preparation steps carried out before fitting any statistical procedure. It is needed because data, most of the time, arrive in a raw, dirty format. Why do data arrive raw and dirty? Because people often organise and handle data badly.
7.1 Intro to R
Let’s start from the basics!
3 + 5
#> [1] 8
12 / 7
#> [1] 1.714286
Store results in objects:
result <- 3 + 5
result
#> [1] 8
Then print them:
print(result)
#> [1] 8
Apply further operations to R objects:
result <- result * 3.1415
print(result)
#> [1] 25.132
Let’s define a vector (which is a type of object):
vector <- c(1, 3, 8, 13)
vector
#> [1] 1 3 8 13
How do we access the elements of a vector? We use square brackets: object[]. Don’t confuse this with data frames: a data frame is a different object class (that is, an R object with its own characteristics) and is indexed by both rows and columns.
Get the first element (play with it):
vector[1]
#> [1] 1
Now the first 3 elements. Notice the : operator, which means “up to”, i.e. from 1 up to 3; these are positional indices.
vector[1:3]
#> [1] 1 3 8
7.2 Subsetting Vectors
We have already introduced vectors; let’s look more closely at subsetting them. Subsetting means selecting specific elements from a vector.
7.2.1 Accessing Individual Elements
To access an individual element of a vector, put its position inside square brackets []. R counts positions from 1, so vector[1] is the first element and vector[4] is the fourth.
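As a minimal sketch of positional access, the snippet below rebuilds the vector from Section 7.1 under the hypothetical name `v` (so that it runs on its own) and extracts single elements.

```r
# Hypothetical vector with the same values as the one defined in Section 7.1
v <- c(1, 3, 8, 13)

first_element <- v[1]          # positions start at 1, not 0
last_element  <- v[length(v)]  # length() gives the number of elements

print(first_element)
#> [1] 1
print(last_element)
#> [1] 13
```

Using length() for the last position avoids hard-coding the size of the vector.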
7.2.2 Accessing Multiple Elements
You can also select multiple elements by specifying a range using the : operator. For example, to get the first three elements:
first_three <- vector[1:3]
print(first_three) # Output: [1] 1 3 8
#> [1] 1 3 8
7.2.3 Conditional Subsetting
You can subset a vector based on a condition. For example, to select elements greater than 5:
greater_than_5 <- vector[vector > 5]
print(greater_than_5) # Output: [1] 8 13
#> [1] 8 13
7.3 Data Frames
Now, let’s introduce data frames, a fundamental data structure in R.
7.3.1 What are Data Frames?
Data frames are like spreadsheets or tables where you can store and manipulate data. They consist of rows and columns, making them suitable for working with structured data.
7.3.2 Creating Data Frames
You can create a data frame by combining vectors of equal length into columns. Here’s how you can create a simple data frame:
# Create vectors for data
names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 30, 22)
# Combine vectors into a data frame
people_df <- data.frame(Name = names, Age = ages)
print(people_df)
#>      Name Age
#> 1   Alice  25
#> 2     Bob  30
#> 3 Charlie  22
7.3.3 Accessing Data Frame Elements
To access elements in a data frame, you can use square brackets [] just like with vectors but with an additional dimension. For example:
# Access the first row and the "Name" column
first_name <- people_df[1, "Name"]
print(first_name) # Output: [1] "Alice"
#> [1] "Alice"
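The general pattern is df[rows, columns]; leaving one side of the comma empty selects everything along that dimension. The sketch below rebuilds the same small data frame so that it runs on its own; the object names other than people_df are illustrative.

```r
# Same data frame as above, rebuilt so the block is self-contained
people_df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                        Age  = c(25, 30, 22))

second_row <- people_df[2, ]       # all columns of row 2 (still a data frame)
age_column <- people_df[, "Age"]   # all rows of the Age column (a numeric vector)
bob_age    <- people_df[2, "Age"]  # a single cell

print(age_column)
#> [1] 25 30 22
print(bob_age)
#> [1] 30
```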
7.4 Data Frame Essentials data.frame()
7.4.1 Data Frame Structure
Data frames are structured as rows and columns, making them a powerful tool for working with structured data. Use the str() function to display the structure of a data frame.
# Create a sample data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 22))
# Display the structure
str(data_frame)
#> 'data.frame':	3 obs. of  2 variables:
#>  $ Name: chr  "Alice" "Bob" "Charlie"
#>  $ Age : num  25 30 22
7.4.2 Summary Statistics
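You can generate summary statistics for every column of a data frame using the summary() function. Numeric columns are summarised by their minimum, quartiles, mean, and maximum; character columns report only their length and type.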
# Create a data frame with numeric data
data_frame <- data.frame(Age = c(25, 30, 22, 35, 28),
Height = c(165, 175, 160, 180, 170),
Name = c("Mark", "Rupert", "Celine", "Frank", "Madison"))
# Generate summary statistics
summary(data_frame)
#>       Age         Height        Name
#>  Min.   :22   Min.   :160   Length:5
#>  1st Qu.:25   1st Qu.:165   Class :character
#>  Median :28   Median :170   Mode  :character
#>  Mean   :28   Mean   :170
#>  3rd Qu.:30   3rd Qu.:175
#>  Max.   :35   Max.   :180
7.5 Filtering and Subsetting Data Frames
7.5.1 Filtering Rows
To filter rows in a data frame based on a condition, use the subset() function. For example, to select individuals older than 25:
# Filter data frame for ages > 25
subset_data <- subset(data_frame, Age > 25)
7.5.2 Selecting Columns
You can select a specific column from a data frame using the $ operator or double square brackets [[ ]]. For instance, to select the “Name” column with $ and the “Age” column with [[ ]]:
# Select the "Name" column using the $ operator
names <- data_frame$Name
# Select the "Age" column using square brackets
ages <- data_frame[["Age"]]
7.5.3 Combining Filtering and Subsetting
You can combine filtering and subsetting to select specific rows and columns simultaneously. For example, to select the “Name” column for individuals older than 25:
# Filter for ages > 25 and select the "Name" column
subset_data <- subset(data_frame, Age > 25, select = Name)
7.6 Merging Data Frames
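If you have multiple data frames and want to combine them, you can use functions like merge() or cbind(). merge() joins two data frames on a shared key column and, by default, keeps only the rows whose key appears in both; cbind() simply places columns side by side and requires the same number of rows.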
# Create two data frames
data_frame1 <- data.frame(ID = c(1, 2, 3), Score = c(85, 92, 78))
data_frame2 <- data.frame(ID = c(2, 3, 4), Grade = c("A", "B", "C"))
# Merge data frames by a common column (ID)
merged_data <- merge(data_frame1, data_frame2, by = "ID")
7.7 Aggregating Data
You can aggregate data to calculate summary statistics based on one or more columns. The aggregate() function is useful for this purpose.
# Calculate the mean score by grade
agg_data <- aggregate(Score ~ Grade, data = merged_data, FUN = mean)
7.8 Data Cleaning
Data cleaning is a crucial step in data analysis, ensuring that your data is accurate and ready for analysis. Here are some common data cleaning operations:
7.8.1 Dealing with Missing Values
Handling missing data is a common challenge in data cleaning. R represents missing values as NA. You can identify them with is.na() and remove them with na.omit(); the result of na.omit() also records, as attributes, the positions of the values it dropped.
# Create a vector with missing values
data <- c(1, NA, 3, 4, NA, 6)
# Check for missing values
missing_values <- is.na(data)
# Remove missing values
clean_data <- na.omit(data)
print(missing_values) # Output: [1] FALSE TRUE FALSE FALSE TRUE FALSE
#> [1] FALSE TRUE FALSE FALSE TRUE FALSE
print(clean_data) # Output: [1] 1 3 4 6
#> [1] 1 3 4 6
#> attr(,"na.action")
#> [1] 2 5
#> attr(,"class")
#> [1] "omit"
7.8.2 Removing Duplicates
Sometimes, your data may contain duplicate records that need to be removed. You can use the duplicated() and unique() functions for this.
# Create a vector with duplicates
data <- c(1, 2, 2, 3, 4, 4, 5)
# Find and remove duplicates
duplicates <- duplicated(data)
unique_data <- unique(data)
print(duplicates) # Output: [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
#> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
print(unique_data) # Output: [1] 1 2 3 4 5
#> [1] 1 2 3 4 5
7.8.3 Renaming Columns
If you’re working with data frames, you may need to rename columns to make them more descriptive or concise. You can use the colnames() function for this purpose.
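A minimal sketch of both common cases, renaming all columns at once and renaming a single column by its current name. The data frame scores_df and its column names are hypothetical and used only for illustration.

```r
# Hypothetical data frame with terse column names
scores_df <- data.frame(nm = c("Alice", "Bob"), sc = c(85, 92))

# Rename every column at once (the new names must be in column order)
colnames(scores_df) <- c("Name", "Score")

# Rename a single column by matching its current name
colnames(scores_df)[colnames(scores_df) == "Score"] <- "ExamScore"

print(colnames(scores_df))
#> [1] "Name"      "ExamScore"
```

Matching on the current name is safer than using a position, because it still works if the column order changes.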
7.8.4 Data Type Conversion
Data often need to be converted from one type to another, for example from character to numeric. You can use functions such as as.numeric() and as.character().
# Create a character vector
char_vector <- c("1", "2", "3")
# Convert to numeric
num_vector <- as.numeric(char_vector)
print(num_vector)
#> [1] 1 2 3
7.9 String Manipulation
Data cleaning often involves working with text data. You may need to clean and manipulate strings in various ways. Here are some common string operations in R:
7.9.1 Combining Strings
You can concatenate strings using the paste() function. This is useful for creating new variables or labels in your data.
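As an illustration, the sketch below joins two hypothetical character vectors element by element. paste() inserts a space by default, the sep argument changes the separator, and paste0() uses no separator at all.

```r
# Hypothetical name vectors
first <- c("Alice", "Bob")
last  <- c("Smith", "Johnson")

full_name <- paste(first, last)             # default separator is a single space
label     <- paste(first, last, sep = "_")  # custom separator
id_code   <- paste0("ID", 1:2)              # paste0() uses no separator

print(full_name)
#> [1] "Alice Smith" "Bob Johnson"
print(label)
#> [1] "Alice_Smith" "Bob_Johnson"
print(id_code)
#> [1] "ID1" "ID2"
```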
7.9.2 Splitting Strings
The strsplit() function allows you to split a string into multiple parts based on a delimiter.
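One detail that often surprises beginners is that strsplit() returns a list, with one element per input string, so the pieces have to be extracted with [[ ]]. A small sketch with a hypothetical input string:

```r
# Hypothetical input string
date_string <- "2023-10-05"

parts <- strsplit(date_string, split = "-")  # a list of length 1
year_month_day <- parts[[1]]                 # the character vector of pieces

print(year_month_day)
#> [1] "2023" "10"   "05"
```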
7.10 Date and Time Handling
If your data includes dates and times, you’ll need to understand how to manipulate them:
7.10.1 Date Conversion
To work with dates, you can use the as.Date() function to convert strings into date objects.
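A short sketch, assuming hypothetical date strings: text in the ISO form year-month-day converts directly, while other layouts need an explicit format argument. Once converted, dates support arithmetic such as differences in days.

```r
# Hypothetical dates in ISO (YYYY-MM-DD) form
date_strings <- c("2023-01-15", "2023-03-01")
dates <- as.Date(date_strings)

# A non-ISO layout needs an explicit format
other <- as.Date("15/01/2023", format = "%d/%m/%Y")

# Subtracting dates gives a time difference; as.numeric() extracts the days
days_between <- as.numeric(dates[2] - dates[1])
print(days_between)
#> [1] 45
print(other == dates[1])
#> [1] TRUE
```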
7.11 Handling Categorical Data
When dealing with categorical data, you may want to create dummy variables for analysis. The factor() function is used to represent categorical data.
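The sketch below shows one possible workflow on hypothetical data: factor() stores the categories with an explicit level order, and model.matrix() from the stats package (loaded by default) is one way, among others, to obtain 0/1 dummy columns.

```r
# Hypothetical categorical data
sizes <- c("small", "large", "medium", "small")

# Fix the order of the categories explicitly
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

print(levels(size_factor))
print(table(size_factor))   # counts per category

# One dummy column per level; "- 1" drops the intercept column
dummies <- model.matrix(~ size_factor - 1)
print(dim(dummies))
#> [1] 4 3
```

Each row of dummies contains a single 1, in the column of the category that observation belongs to.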
7.12 Exercises
Exercise 7.1 Given a vector of numeric data, access and print the third and fifth elements.
numeric_vector <- c(10, 20, 30, 40, 50, 60)
Exercise 7.2 Create a data frame for the following information and then print the entire data frame:
- Name: Alice, Bob, Charlie
- Age: 25, 30, 22
Exercise 7.3 Given a data frame with columns named “First_Name” and “Last_Name,” rename them to “First” and “Last.”
data_frame <- data.frame(First_Name = c("Alice", "Bob", "Charlie"),
Last_Name = c("Smith", "Johnson", "Brown"))
Exercise 7.4 From the given data frame, filter and print only the rows where “Age” is greater than 25.
# Create a data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 22))
# Your code here to filter and print rows where Age > 25
Exercise 7.5 From the given data frame, select and print the “Name” column.
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 22))
7.13 Solutions
Answer to Exercise 7.1:
element3 <- numeric_vector[3]
element5 <- numeric_vector[5]
print(element3) # Output: [1] 30
print(element5) # Output: [1] 50
Answer to Exercise 7.2:
# Create a data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 22))
print(data_frame)
Answer to Exercise 7.3:
colnames(data_frame) <- c("First", "Last")
print(data_frame)
Answer to Exercise 7.4:
filtered_data <- subset(data_frame, Age > 25)
print(filtered_data)
Answer to Exercise 7.5:
names <- data_frame$Name
print(names)