7 💻 Essentials of Data Wrangling in R

Data wrangling, also called data cleaning, covers the preparation steps carried out before fitting any statistical procedure. It is needed because data, most of the time, arrives in a raw, dirty format. Why does data arrive raw or dirty? Because people often organise and handle data badly.

7.1 Intro to R

Let’s start from the basics!

3+5
#> [1] 8
12/7
#> [1] 1.714286

Store results in objects:

result <- 3 + 5
result
#> [1] 8

Then print them:

print(result)
#> [1] 8

Apply operations to R objects:

result <- result * 3.1415
print(result)
#> [1] 25.132

Let’s define a vector (which is a type of object):

vector <- c(1, 3, 8, 13)
vector
#> [1]  1  3  8 13

How do we access the elements of a vector? We need square brackets: object[]. Don’t confuse vectors with data frames: a data frame is a different object class, that is, an R object with certain characteristics of its own.

Get the first element (play with it):

vector[1]
#> [1] 1

Now the first three elements. Notice the : operator, which means “up to”, i.e. from 1 up to 3; these are positions in the vector.

vector[1:3]
#> [1] 1 3 8

7.2 Subsetting Vectors

You’ve already introduced vectors, but let’s dive deeper into subsetting them. Subsetting means selecting specific elements from a vector.

7.2.1 Accessing Individual Elements

To access individual elements in a vector, use square brackets [] along with the position of the element you want to access. For example:

vector <- c(1, 3, 8, 13)
element1 <- vector[1]
element3 <- vector[3]

print(element1)  # Output: [1] 1
#> [1] 1
print(element3)  # Output: [1] 8
#> [1] 8

7.2.2 Accessing Multiple Elements

You can also select multiple elements by specifying a range using the : operator. For example, to get the first three elements:

first_three <- vector[1:3]

print(first_three)  # Output: [1] 1 3 8
#> [1] 1 3 8

7.2.3 Conditional Subsetting

You can subset a vector based on a condition. For example, to select elements greater than 5:

greater_than_5 <- vector[vector > 5]

print(greater_than_5)  # Output: [1]  8 13
#> [1]  8 13
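
The same bracket notation supports a few more patterns. The following is a minimal sketch in base R, reusing the vector from above; the object names (without_first, between_2_and_10, positions) are illustrative choices, not part of the original example.

```r
# Same vector as in the examples above
vector <- c(1, 3, 8, 13)

# A negative position drops that element instead of selecting it
without_first <- vector[-1]

# Conditions can be combined with & (and) or | (or)
between_2_and_10 <- vector[vector > 2 & vector < 10]

# which() returns the positions where the condition is TRUE
positions <- which(vector > 5)

print(without_first)
#> [1]  3  8 13
print(between_2_and_10)
#> [1] 3 8
print(positions)
#> [1] 3 4
```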

7.3 Data Frames

Now, let’s introduce data frames, a fundamental data structure in R.

7.3.1 What are Data Frames?

Data frames are like spreadsheets or tables where you can store and manipulate data. They consist of rows and columns, making them suitable for working with structured data.

7.3.2 Creating Data Frames

You can create a data frame by combining vectors of equal length into columns. Here’s how you can create a simple data frame:

# Create vectors for data
names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 30, 22)

# Combine vectors into a data frame
people_df <- data.frame(Name = names, Age = ages)

print(people_df)
#>      Name Age
#> 1   Alice  25
#> 2     Bob  30
#> 3 Charlie  22

7.3.3 Accessing Data Frame Elements

To access elements in a data frame, you can use square brackets [] just like with vectors but with an additional dimension. For example:

# Access the first row and the "Name" column
first_name <- people_df[1, "Name"]

print(first_name)  # Output: [1] "Alice"
#> [1] "Alice"
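
Leaving one of the two positions empty selects a whole row or a whole column. A short sketch, rebuilding people_df so that it runs on its own; second_row and all_ages are names chosen here for illustration.

```r
people_df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                        Age = c(25, 30, 22))

# Empty column position: the entire second row (still a data frame)
second_row <- people_df[2, ]

# Empty row position: the entire "Age" column (a plain vector)
all_ages <- people_df[, "Age"]

print(second_row)
#>   Name Age
#> 2  Bob  30
print(all_ages)
#> [1] 25 30 22
```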

7.4 Data Frame Essentials data.frame()

7.4.1 Data Frame Structure

Data frames are structured as rows and columns, making them a powerful tool for working with structured data. Use the str() function to display the structure of a data frame.

# Create a sample data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                         Age = c(25, 30, 22))

# Display the structure
str(data_frame)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ Name: chr  "Alice" "Bob" "Charlie"
#>  $ Age : num  25 30 22

7.4.2 Summary Statistics

You can generate summary statistics for the numerical columns in a data frame using the summary() function.

# Create a data frame with numeric data
data_frame <- data.frame(Age = c(25, 30, 22, 35, 28),
                         Height = c(165, 175, 160, 180, 170),
                         Name = c("Mark", "Rupert", "Celine", "Frank", "Madison"))

# Generate summary statistics
summary(data_frame)
#>       Age         Height        Name          
#>  Min.   :22   Min.   :160   Length:5          
#>  1st Qu.:25   1st Qu.:165   Class :character  
#>  Median :28   Median :170   Mode  :character  
#>  Mean   :28   Mean   :170                     
#>  3rd Qu.:30   3rd Qu.:175                     
#>  Max.   :35   Max.   :180

7.5 Filtering and Subsetting Data Frames

7.5.1 Filtering Rows

To filter rows in a data frame based on a condition, use the subset() function. For example, to select individuals older than 25:

# Filter data frame for ages > 25
subset_data <- subset(data_frame, Age > 25)
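
The same filter can also be written with square brackets and a logical condition in the row position. The sketch below rebuilds the data frame from section 7.4.2 so that it is self-contained; bracket_data is a name introduced here for comparison.

```r
data_frame <- data.frame(Age = c(25, 30, 22, 35, 28),
                         Height = c(165, 175, 160, 180, 170),
                         Name = c("Mark", "Rupert", "Celine", "Frank", "Madison"))

# subset() version, as in the text
subset_data <- subset(data_frame, Age > 25)

# Bracket version: logical condition for the rows, empty column position
bracket_data <- data_frame[data_frame$Age > 25, ]

print(subset_data)
#>   Age Height    Name
#> 2  30    175  Rupert
#> 4  35    180   Frank
#> 5  28    170 Madison
```

Both versions keep the original row numbers (2, 4, 5), which is a useful reminder of where the rows came from.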

7.5.2 Selecting Columns

You can select specific columns from a data frame using the $ operator or double square brackets [[ ]]. For instance, to select the “Name” and “Age” columns:

# Select the "Name" column using the $ operator
names <- data_frame$Name

# Select the "Age" column using double square brackets
ages <- data_frame[["Age"]]

7.5.3 Combining Filtering and Subsetting

You can combine filtering and subsetting to select specific rows and columns simultaneously. For example, to select the “Name” column for individuals older than 25:

# Filter for ages > 25 and select the "Name" column
subset_data <- subset(data_frame, Age > 25, select = Name)

7.5.4 Sorting Data Frames

To sort a data frame by a specific column, use the order() function. For example, to sort by “Age” in ascending order:

# Sort data frame by "Age" in ascending order
sorted_data <- data_frame[order(data_frame$Age), ]
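
For descending order, order() accepts a decreasing argument. A minimal sketch on a reduced copy of the data frame above (only two columns, to keep the output short); sorted_desc is an illustrative name.

```r
data_frame <- data.frame(Age = c(25, 30, 22, 35, 28),
                         Name = c("Mark", "Rupert", "Celine", "Frank", "Madison"))

# order() returns row positions; decreasing = TRUE puts the largest Age first
sorted_desc <- data_frame[order(data_frame$Age, decreasing = TRUE), ]

print(sorted_desc)
#>   Age    Name
#> 4  35   Frank
#> 2  30  Rupert
#> 5  28 Madison
#> 1  25    Mark
#> 3  22  Celine
```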

7.6 Merging Data Frames

If you have multiple data frames and want to combine them, you can use functions like merge() or cbind().

# Create two data frames
data_frame1 <- data.frame(ID = c(1, 2, 3), Score = c(85, 92, 78))
data_frame2 <- data.frame(ID = c(2, 3, 4), Grade = c("A", "B", "C"))

# Merge data frames by a common column (ID)
merged_data <- merge(data_frame1, data_frame2, by = "ID")
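
By default, merge() keeps only the IDs that appear in both data frames. The sketch below contrasts that default with all = TRUE, which keeps every ID and fills the gaps with NA; inner and full are names chosen here for illustration.

```r
data_frame1 <- data.frame(ID = c(1, 2, 3), Score = c(85, 92, 78))
data_frame2 <- data.frame(ID = c(2, 3, 4), Grade = c("A", "B", "C"))

# Default: only IDs present in both data frames (2 and 3)
inner <- merge(data_frame1, data_frame2, by = "ID")
print(inner)
#>   ID Score Grade
#> 1  2    92     A
#> 2  3    78     B

# all = TRUE: keep every ID, filling missing values with NA
full <- merge(data_frame1, data_frame2, by = "ID", all = TRUE)
print(full)
#>   ID Score Grade
#> 1  1    85  <NA>
#> 2  2    92     A
#> 3  3    78     B
#> 4  4    NA     C
```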

7.7 Aggregating Data

You can aggregate data to calculate summary statistics based on one or more columns. The aggregate() function is useful for this purpose.

# Calculate the mean score by grade
agg_data <- aggregate(Score ~ Grade, data = merged_data, FUN = mean)
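
To see the grouping at work, here is a small self-contained sketch. The scores data frame is made up for illustration (it is not the merged_data from above) and has more than one row per grade, so the mean actually summarises several values.

```r
# Hypothetical data: several scores per grade
scores <- data.frame(Grade = c("A", "B", "A", "B", "C"),
                     Score = c(90, 80, 94, 76, 65))

# Read the formula as "Score grouped by Grade"
agg_data <- aggregate(Score ~ Grade, data = scores, FUN = mean)

print(agg_data)
#>   Grade Score
#> 1     A    92
#> 2     B    78
#> 3     C    65
```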

7.8 Data Cleaning

Data cleaning is a crucial step in data analysis, ensuring that your data is accurate and ready for analysis. Here are some common data cleaning operations:

7.8.1 Dealing with Missing Values

Handling missing data is a common challenge in data cleaning. R represents missing values as NA. You can identify and deal with missing values using functions like is.na() and na.omit().

# Create a vector with missing values
data <- c(1, NA, 3, 4, NA, 6)

# Check for missing values
missing_values <- is.na(data)

# Remove missing values
clean_data <- na.omit(data)

print(missing_values)  # Output: [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
print(clean_data)      # Output: 1 3 4 6, plus attributes recording which positions were omitted
#> [1] 1 3 4 6
#> attr(,"na.action")
#> [1] 2 5
#> attr(,"class")
#> [1] "omit"
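
Many summary functions return NA as soon as one value is missing. The sketch below shows the na.rm argument of mean() and an alternative to na.omit() that drops missing values without attaching extra attributes; it reuses the vector from above.

```r
data <- c(1, NA, 3, 4, NA, 6)

# A single NA makes the mean NA
print(mean(data))
#> [1] NA

# na.rm = TRUE ignores missing values in the calculation
print(mean(data, na.rm = TRUE))
#> [1] 3.5

# Count the missing values (TRUE counts as 1)
print(sum(is.na(data)))
#> [1] 2

# Keep only the non-missing values, with no extra attributes
clean_data <- data[!is.na(data)]
print(clean_data)
#> [1] 1 3 4 6
```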

7.8.2 Removing Duplicates

Sometimes, your data may contain duplicate records that need to be removed. You can use the duplicated() and unique() functions for this.

# Create a vector with duplicates
data <- c(1, 2, 2, 3, 4, 4, 5)

# Find and remove duplicates
duplicates <- duplicated(data)
unique_data <- unique(data)

print(duplicates)   # Output: [1] FALSE  FALSE   TRUE FALSE FALSE  TRUE FALSE
#> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
print(unique_data)  # Output: [1] 1 2 3 4 5
#> [1] 1 2 3 4 5
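
duplicated() also works on whole rows of a data frame. A minimal sketch, assuming a small made-up table called records in which one row is repeated:

```r
# Hypothetical data frame with one repeated row (Bob appears twice)
records <- data.frame(ID = c(1, 2, 2, 3),
                      Name = c("Alice", "Bob", "Bob", "Charlie"))

# Keep only the rows that are NOT flagged as duplicates
deduplicated <- records[!duplicated(records), ]

print(deduplicated)
#>   ID    Name
#> 1  1   Alice
#> 2  2     Bob
#> 4  3 Charlie
```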

7.8.3 Renaming Columns

If you’re working with data frames, you may need to rename columns to make them more descriptive or concise. You can use the colnames() function for this purpose.

# Create a data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 22))

# Rename columns
colnames(data_frame) <- c("Full_Name", "Years")

print(data_frame)
#>   Full_Name Years
#> 1     Alice    25
#> 2       Bob    30
#> 3   Charlie    22
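
Assigning to colnames() replaces every name at once. To rename a single column and leave the others alone, one possible approach is to pick the name to change with a condition, as in this sketch:

```r
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 22))

# Replace only the column currently called "Age"
colnames(data_frame)[colnames(data_frame) == "Age"] <- "Years"

print(colnames(data_frame))
#> [1] "Name"  "Years"
```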

7.8.4 Data Type Conversion

Data often needs to be converted from one data type to another. For example, converting character data to numeric data. You can use functions like as.numeric(), as.character(), etc.

# Create a character vector
char_vector <- c("1", "2", "3")

# Convert to numeric
num_vector <- as.numeric(char_vector)

print(num_vector)
#> [1] 1 2 3
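
When a value cannot be read as a number, as.numeric() returns NA for it and raises a warning. The sketch below uses a made-up vector with one such value; suppressWarnings() is used here only to keep the example output tidy.

```r
# "three" cannot be converted to a number
char_vector <- c("1", "2.5", "three")

# The conversion yields NA for the entry that fails
num_vector <- suppressWarnings(as.numeric(char_vector))

print(num_vector)
#> [1] 1.0 2.5  NA

# Locate the entries that failed to convert
print(which(is.na(num_vector)))
#> [1] 3
```

Checking for new NA values after a conversion is a quick way to spot entries that need manual cleaning.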

7.9 String Manipulation

Data cleaning often involves working with text data. You may need to clean and manipulate strings in various ways. Here are some common string operations in R:

7.9.1 Combining Strings

You can concatenate strings using the paste() function. This is useful for creating new variables or labels in your data.

first_name <- "John"
last_name <- "Doe"
full_name <- paste(first_name, last_name)

print(full_name)  # Output: [1] "John Doe"
#> [1] "John Doe"
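
paste() inserts a space between the pieces by default. The following sketch shows the sep and collapse arguments and the paste0() shortcut; the strings are arbitrary examples.

```r
# Choose a different separator with sep
with_underscore <- paste("John", "Doe", sep = "_")
print(with_underscore)
#> [1] "John_Doe"

# paste0() joins with no separator and works element by element
labels <- paste0("ID", 1:3)
print(labels)
#> [1] "ID1" "ID2" "ID3"

# collapse glues all elements of a vector into a single string
one_string <- paste(c("a", "b", "c"), collapse = ", ")
print(one_string)
#> [1] "a, b, c"
```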

7.9.2 Splitting Strings

The strsplit() function allows you to split a string into multiple parts based on a delimiter.

text <- "apple,banana,cherry"
split_text <- strsplit(text, ",")

print(split_text)  # Output: [[1]]
#> [[1]]
#> [1] "apple"  "banana" "cherry"
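
Note that strsplit() returns a list, which is why the output starts with [[1]]. A minimal sketch of how to get back to an ordinary character vector (fruits is a name chosen here for illustration):

```r
text <- "apple,banana,cherry"
split_text <- strsplit(text, ",")

# Double brackets extract the first (and only) list element as a vector
fruits <- split_text[[1]]

print(fruits[2])
#> [1] "banana"
print(length(fruits))
#> [1] 3
```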

7.9.3 Removing Whitespace

To clean up extra whitespace in a string, you can use the gsub() function with regular expressions.

text <- "   This is a messy text.   "
cleaned_text <- gsub("\\s+", " ", text)

print(cleaned_text)  # Output: [1] " This is a messy text. "
#> [1] " This is a messy text. "
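
The gsub() call above collapses each run of whitespace into a single space, so one leading and one trailing space remain. A sketch, assuming you also want those removed, wraps the result in base R’s trimws():

```r
text <- "   This is a messy text.   "

# Collapse runs of whitespace, then trim both ends
cleaned_text <- trimws(gsub("\\s+", " ", text))

print(cleaned_text)
#> [1] "This is a messy text."
```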

7.10 Date and Time Handling

If your data includes dates and times, you’ll need to understand how to manipulate them:

7.10.1 Date Conversion

To work with dates, you can use the as.Date() function to convert strings into date objects.

date_string <- "2023-10-18"
date_object <- as.Date(date_string)

print(date_object)  # Output: [1] "2023-10-18"
#> [1] "2023-10-18"
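
as.Date() reads the year-month-day layout by default. For other layouts you can describe the input with the format argument. The sketch below assumes a day/month/year string, chosen purely as an example:

```r
# %d = day, %m = month, %Y = four-digit year
date_string <- "18/10/2023"
date_object <- as.Date(date_string, format = "%d/%m/%Y")

print(date_object)
#> [1] "2023-10-18"

# format() goes the other way and extracts parts of a date as text
print(format(date_object, "%Y"))
#> [1] "2023"
```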

7.10.2 Date Arithmetic

You can perform arithmetic operations with dates. For example, to calculate the difference between two dates:

date1 <- as.Date("2023-10-18")
date2 <- as.Date("2023-10-25")
date_diff <- date2 - date1

print(date_diff)  # Output: Time difference of 7 days
#> Time difference of 7 days
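
The difference between two dates is a special “difftime” object. A short sketch of two common follow-ups: turning the difference into a plain number, and shifting a date by a number of days (days_between and thirty_days_later are illustrative names).

```r
date1 <- as.Date("2023-10-18")
date2 <- as.Date("2023-10-25")

# Convert the difference to a plain number of days
days_between <- as.numeric(date2 - date1)
print(days_between)
#> [1] 7

# Adding a number to a date moves it forward by that many days
thirty_days_later <- date1 + 30
print(thirty_days_later)
#> [1] "2023-11-17"
```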

7.11 Handling Categorical Data

When dealing with categorical data, you may want to create dummy variables for analysis. The factor() function is used to represent categorical data.

# Create a vector of categorical data
categories <- c("Red", "Green", "Blue", "Red", "Green")
factor_data <- factor(categories)

print(factor_data)
#> [1] Red   Green Blue  Red   Green
#> Levels: Blue Green Red
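
A factor stores its categories as levels, sorted alphabetically by default. One way to obtain dummy (0/1) variables from a factor is model.matrix() from the stats package, which ships with R. The sketch below is one possible approach rather than the only one; the - 1 in the formula drops the intercept so that every level gets its own column.

```r
categories <- c("Red", "Green", "Blue", "Red", "Green")
factor_data <- factor(categories)

# The distinct categories
print(levels(factor_data))
#> [1] "Blue"  "Green" "Red"

# One 0/1 column per level
dummies <- model.matrix(~ factor_data - 1)
print(colnames(dummies))
#> [1] "factor_dataBlue"  "factor_dataGreen" "factor_dataRed"

# The "Red" column is 1 where the original value was "Red"
print(unname(dummies[, "factor_dataRed"]))
#> [1] 1 0 0 1 0
```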

7.12 Exercises

Exercise 7.1 Given a vector of numeric data, access and print the third and fifth elements.

numeric_vector <- c(10, 20, 30, 40, 50, 60)

Exercise 7.2 Create a data frame for the following information and then print the entire data frame:

  • Name: Alice, Bob, Charlie
  • Age: 25, 30, 22

Exercise 7.3 Given a data frame with columns named “First_Name” and “Last_Name,” rename them to “First” and “Last.”

data_frame <- data.frame(First_Name = c("Alice", "Bob", "Charlie"),
                         Last_Name = c("Smith", "Johnson", "Brown"))

Exercise 7.4 From the given data frame, filter and print only the rows where “Age” is greater than 25.

# Create a data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                         Age = c(25, 30, 22))

# Your code here to filter and print rows where Age > 25

Exercise 7.5 From the given data frame, select and print the “Name” column.

data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                         Age = c(25, 30, 22))

7.13 Solutions

Answer to Exercise 7.1:

element3 <- numeric_vector[3]
element5 <- numeric_vector[5]

print(element3)  # Output: [1] 30
print(element5)  # Output: [1] 50

Answer to Exercise 7.2:

# Create a data frame
data_frame <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                         Age = c(25, 30, 22))

print(data_frame)

Answer to Exercise 7.3:

# Uses data_frame as defined in Exercise 7.3
colnames(data_frame) <- c("First", "Last")
print(data_frame)

Answer to Exercise 7.4:

# Uses data_frame as defined in Exercise 7.4
filtered_data <- subset(data_frame, Age > 25)
print(filtered_data)

Answer to Exercise 7.5:

# Uses data_frame as defined in Exercise 7.5
names <- data_frame$Name
print(names)