Welcome to Data Wrangling with R! In this book, I will help you learn the essentials of preprocessing data with the R programming language so you can quickly turn noisy data into usable pieces of information. Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, or janitor work, can be a painstakingly laborious process. In fact, it's been stated that up to 80% of data analysis is spent on cleaning and preparing data. However, because it is a prerequisite to the rest of the data analysis workflow (visualization, analysis, reporting), it's essential that you become fluent and efficient in data wrangling techniques. This book will guide you through the data wrangling process and give you a solid foundation for working with data in R. My goal is to teach you how to easily wrangle your data, so you can spend more time focused on understanding its content via visualization, analysis, and reporting. By the time you finish reading this book, you will have learned:

- How to work with different types of data such as numerics, characters, regular expressions, factors, and dates
- The differences between the data structures and how to create, add components to, and subset each of them
- How to acquire and parse data from locations you may not have been able to access before, such as web scraping
- How to develop your own functions and use loop control structures to reduce code redundancy
- How to use pipe operators to simplify your code and make it more readable
- How to reshape the layout of your data, and manipulate, summarize, and join data sets

In essence, you will have the data wrangling toolbox required for modern-day data analysis.

  • Bradley C. Boehmke

Much like Samuel Taylor Coleridge's line in The Rime of the Ancient Mariner ("Water, water, every where, / Nor any drop to drink"), the degree to which data are useful is largely determined by an analyst's ability to wrangle data. In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time obtaining data, diagnosing data quality issues, and pre-processing data into a usable form. Research has illustrated that this portion of the data analysis process is the most tedious and time-consuming component, often consuming 50-80% of an analyst's time (cf. Wickham 2014; Dasu and Johnson 2003). Despite the challenges, data wrangling remains a fundamental building block that enables visualization and statistical modeling. Only through data wrangling can we make data useful. Consequently, one's ability to perform data wrangling tasks effectively and efficiently is fundamental to becoming an expert data analyst in their respective domain.

  • Bradley C. Boehmke

A language for data analysis and graphics. This definition of R was used by Ross Ihaka and Robert Gentleman in the title of their 1996 paper (Ihaka and Gentleman 1996) outlining their experience of designing and implementing the R software. It's safe to say this remains the essence of what R is; however, it's tough to encapsulate such a diverse programming language into a single phrase.

  • Bradley C. Boehmke

A computer language is described by its syntax and semantics, where syntax is about the grammar of the language and semantics the meaning behind the sentence. Jumping into a new programming language is akin to visiting a foreign country with only that ninth-grade Spanish 101 class under your belt; there is no better way to learn than to immerse yourself in the environment! Although it'll be painful early on and your nose will surely bleed, eventually you'll learn the dialect and the quirks that come along with it.

  • Bradley C. Boehmke

In this chapter you will learn the basics of working with numbers in R. This includes understanding how to manage the numeric type (integer vs. double), the different ways of generating non-random and random numbers, how to set seed values for reproducible random number generation, and the different ways to compare and round numeric values.
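For a flavor of these topics, here is a minimal base R sketch (the specific values are illustrative, not taken from the chapter):

    # integer vs. double
    x <- 5          # double by default
    y <- 5L         # the L suffix creates an integer
    typeof(x)       # "double"
    typeof(y)       # "integer"

    # reproducible random number generation
    set.seed(123)
    runif(3)        # three uniform draws on [0, 1]

    # comparing and rounding doubles
    sqrt(2)^2 == 2              # FALSE, due to floating point error
    all.equal(sqrt(2)^2, 2)     # TRUE, within a tolerance
    round(3.14159, digits = 2)  # 3.14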

  • Bradley C. Boehmke

Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in character strings. Consequently, handling, cleaning, and processing character strings is becoming a prerequisite in daily data analysis. This chapter is meant to give you the foundation of working with characters by covering some basics, followed by learning how to manipulate strings using base R functions along with the simplified stringr package.
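For example, a small sketch of both approaches, assuming the simplified package referred to is stringr:

    x <- c("Data", "Wrangling", "with", "R")

    # base R string manipulation
    toupper(x)                   # upper-case each element
    paste(x, collapse = " ")     # combine into one string
    substr("wrangling", 1, 5)    # "wrang"

    # the same tasks with stringr's consistent str_* interface
    library(stringr)
    str_to_upper(x)
    str_c(x, collapse = " ")
    str_sub("wrangling", 1, 5)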

  • Bradley C. Boehmke

A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several characters.
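A minimal illustration with base R's pattern-matching functions (the phone-number pattern and strings are hypothetical examples):

    x <- c("phone: 555-1234", "no number here", "call 555-9876")

    # detect a pattern: three digits, a hyphen, four digits
    grepl("[0-9]{3}-[0-9]{4}", x)                    # TRUE FALSE TRUE

    # extract and replace matches
    regmatches(x, regexpr("[0-9]{3}-[0-9]{4}", x))   # "555-1234" "555-9876"
    sub("[0-9]{3}-[0-9]{4}", "XXX-XXXX", x)          # mask the first match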

  • Bradley C. Boehmke

Factors are variables in R that take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling; since categorical variables enter into statistical models such as lm and glm differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly.
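A short sketch of how factors behave (illustrative data):

    # create a factor with an explicit level ordering
    sizes <- factor(c("small", "large", "medium", "small"),
                    levels = c("small", "medium", "large"))
    levels(sizes)       # "small" "medium" "large"
    table(sizes)        # counts per category
    as.integer(sizes)   # underlying integer codes: 1 3 2 1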

  • Bradley C. Boehmke

Real-world data are often associated with dates and times; however, dealing with dates accurately can appear to be a complicated task due to the variety of formats and the need to account for time-zone differences and leap years. R has a range of functions that allow you to work with dates and times. Furthermore, packages such as lubridate make it easier to work with dates and times.
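For instance, a brief sketch of base R date handling alongside lubridate (dates chosen purely for illustration):

    # base R
    d <- as.Date("2016-03-15")
    format(d, "%d %B %Y")       # "15 March 2016"
    Sys.Date() - d              # elapsed days as a difftime

    # lubridate parses many common formats and handles time zones
    library(lubridate)
    mdy("March 15, 2016")
    ymd_hms("2016-03-15 12:30:00", tz = "US/Eastern")
    leap_year(2016)             # TRUE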

  • Bradley C. Boehmke

Prior to jumping into the data structures, it's beneficial to understand two components of every data structure: the structure itself and its attributes.
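A quick way to see both components is with str() and attributes():

    x <- 1:6
    str(x)            # structure: int [1:6] 1 2 3 4 5 6
    attributes(x)     # NULL; a plain vector carries no attributes

    dim(x) <- c(2, 3) # adding a dim attribute turns the vector into a matrix
    attributes(x)     # $dim: 2 3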

  • Bradley C. Boehmke

The basic structure in R is the vector. A vector is a sequence of data elements of the same basic type: integer, double, logical, or character. The one-dimensional examples illustrated in the previous section are considered vectors. In this chapter I will illustrate how to create vectors, add additional elements to pre-existing vectors, add attributes to vectors, and subset vectors.
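For example (a minimal sketch, not the chapter's own examples):

    # create a vector and append elements
    v <- c(1, 2, 3)
    v <- c(v, 4, 5)

    # add a names attribute
    names(v) <- c("a", "b", "c", "d", "e")

    # subset by position, by name, or by logical condition
    v[2]        # second element
    v["d"]      # element named "d"
    v[v > 2]    # all elements greater than 2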

  • Bradley C. Boehmke

A list is an R structure that allows you to combine elements of different types and lengths, including lists embedded within a list. Many statistical outputs are provided as a list as well; therefore, it's critical to understand how to work with lists. In this chapter I will illustrate how to create lists, add additional elements to pre-existing lists, add attributes to lists, and subset lists.
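A brief sketch of these ideas (illustrative data):

    # a list mixes types and lengths; it can even hold model objects
    l <- list(nums = 1:3, name = "Boehmke", flag = TRUE)
    l$model <- lm(mpg ~ wt, data = mtcars)

    # subsetting: [ ] returns a list, [[ ]] returns the element itself
    l["nums"]     # a one-element list
    l[["nums"]]   # the integer vector 1 2 3
    l$name        # shorthand for l[["name"]]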

  • Bradley C. Boehmke

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. In R, the elements that make up a matrix must be of a consistent mode (i.e. all elements must be numeric, or character, etc.). Therefore, a matrix can be thought of as an atomic vector with a dimension attribute. Furthermore, all rows of a matrix must be of the same length. In this chapter I will illustrate how to create matrices, add additional elements to pre-existing matrices, add attributes to matrices, and subset matrices.
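For example (a minimal sketch):

    # an atomic vector with a dim attribute
    m <- matrix(1:6, nrow = 2, ncol = 3)
    dim(m)                           # 2 3

    # add a row and a dimnames attribute
    m <- rbind(m, c(7, 8, 9))
    colnames(m) <- c("a", "b", "c")

    # subset by index or name
    m[2, "b"]   # row 2, column "b"
    m[, 1]      # the entire first column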

  • Bradley C. Boehmke

A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column, and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (e.g., numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data whose rows are all of equal length. In this chapter I will illustrate how to create data frames, add additional elements to pre-existing data frames, add attributes to data frames, and subset data frames.
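A short sketch of these operations (illustrative data):

    # a data frame: a list of equal-length vectors
    df <- data.frame(name  = c("Ann", "Bob", "Cal"),
                     score = c(90, 85, 88),
                     stringsAsFactors = FALSE)

    # add a column, then subset rows and columns
    df$passed <- df$score >= 88
    df[df$passed, ]              # rows meeting a condition
    df[, c("name", "score")]     # columns by name
    str(df)                      # the list-of-vectors structure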

  • Bradley C. Boehmke

A common task in data analysis is dealing with missing values. In R, missing values are often represented by NA or by some sentinel value that stands in for them (e.g., 99). We can easily work with missing values, and in this chapter I illustrate how to test for, recode, and exclude missing values in your data.
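For instance (a minimal sketch, using 99 as the sentinel value):

    x <- c(4, NA, 7, 99, NA)

    is.na(x)               # test: FALSE TRUE FALSE FALSE TRUE
    x[x %in% 99] <- NA     # recode the sentinel 99 to NA
    mean(x, na.rm = TRUE)  # exclude NAs from a computation
    x[!is.na(x)]           # drop missing values entirely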

  • Bradley C. Boehmke

The first step of any data analysis process is to get the data. Data can come from many sources, but two of the most common are text and Excel files. This chapter covers how to import data into R by reading data from common text files and Excel spreadsheets. It also covers how to load data from saved R object files for holding or transferring data that has been processed in R. Beyond the commonly used base R importing functions, I will also cover functions from the popular readr, xlsx, and readxl packages.
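For example (the file names here are hypothetical placeholders):

    # base R: read a comma-delimited text file
    df1 <- read.csv("mydata.csv", stringsAsFactors = FALSE)

    # readr: faster parsing with better defaults
    library(readr)
    df2 <- read_csv("mydata.csv")

    # readxl: read a sheet from an Excel workbook
    library(readxl)
    df3 <- read_excel("mydata.xlsx", sheet = 1)

    # reload a saved R object file
    load("mydata.RData")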

  • Bradley C. Boehmke

The rapid growth of the World Wide Web has significantly changed the way we share, collect, and publish data. Vast amounts of information are stored online, in both structured and unstructured forms. For certain questions or research topics this has created a new problem: the concern is no longer data scarcity and inaccessibility but, rather, overcoming the tangled masses of online data.

  • Bradley C. Boehmke

Although getting data into R is essential, getting data out of R can be just as important. Whether you need to export data or analytic results simply to store, share, or feed into another system, it is generally a straightforward process. This section will cover how to export data to text files and Excel files (along with some additional formatting capabilities), and how to save to R data objects. Beyond the commonly used base R exporting functions, I will also cover functions from the popular readr and xlsx packages, along with a lesser known but useful package for Excel formatting.
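For example (again with hypothetical file names, and omitting the Excel-formatting package since it is unnamed above):

    # base R: write a data frame to a comma-delimited text file
    write.csv(mtcars, "mtcars.csv", row.names = FALSE)

    # readr equivalent
    library(readr)
    write_csv(mtcars, "mtcars.csv")

    # xlsx: export to an Excel workbook
    library(xlsx)
    write.xlsx(mtcars, "mtcars.xlsx", sheetName = "cars")

    # save R objects to reload in a later session
    save(mtcars, file = "mtcars.RData")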

  • Bradley C. Boehmke

R is a functional programming language, meaning that everything you do is basically built on functions. However, moving beyond simply using pre-built functions to writing your own functions is when your capabilities really start to take off and your code development takes on a new level of efficiency. Functions allow you to reduce code duplication by automating a generalized task to be applied repeatedly. Whenever you catch yourself repeating a function or copying and pasting code, there is a good chance that you should write a function to eliminate the redundancies.
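For example, a copy-pasted rescaling computation collapses into one small function (a minimal sketch):

    # rescale a numeric vector to the [0, 1] interval
    rescale <- function(x, na.rm = TRUE) {
      rng <- range(x, na.rm = na.rm)
      (x - rng[1]) / (rng[2] - rng[1])
    }

    rescale(c(1, 5, 10))   # 0.000 0.444 1.000
    rescale(c(2, NA, 8))   # NAs pass through; range() ignores them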

  • Bradley C. Boehmke

Looping is similar to creating functions in that both are merely a means to automate a certain multi-step process by organizing sequences of R expressions. R provides several loop control statements which allow you to perform repetitive code processes with different intentions and allow these automated expressions to naturally respond to features of your data. Consequently, learning these loop control statements will go a long way in reducing code redundancy and becoming a more efficient data wrangler.
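A small sketch of the main constructs:

    # for: iterate over a fixed sequence
    for (i in 1:3) print(i^2)

    # while: repeat until a condition fails
    i <- 1
    while (i <= 3) {
      print(i^2)
      i <- i + 1
    }

    # if/else lets the loop respond to features of the data
    for (x in c(-2, 0, 5)) {
      if (x > 0) print("positive") else print("non-positive")
    }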

  • Bradley C. Boehmke

Removing duplication is an important principle to keep in mind with your code; however, equally important is keeping your code efficient and readable. Efficiency is often accomplished by leveraging functions and control statements in your code. However, efficiency also includes eliminating the creation and saving of unnecessary objects that often result when you are trying to make your code more readable, clear, and explicit. Consequently, writing code that is simultaneously simple, readable, and efficient can seem contradictory. For this reason, the magrittr package is a powerful tool to have in your data wrangling toolkit.
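For instance, compare a nested call with its piped equivalent (a minimal sketch, assuming the package referenced is magrittr and its %>% operator):

    library(magrittr)

    # nested calls read inside-out
    round(exp(diff(log(c(1, 3, 9, 27)))), 1)

    # the pipe expresses the same steps left to right,
    # with no intermediate objects to name and save
    c(1, 3, 9, 27) %>%
      log() %>%
      diff() %>%
      exp() %>%
      round(1)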

  • Bradley C. Boehmke

Jenny Bryan stated that "classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth." In essence, her point was that when we learn a modeling approach in the classroom, the data are usually provided in a format that feeds neatly into the modeling tool of choice. In reality, datasets are messy and "every messy dataset is messy in its own way." The concept of "tidy data" was established by Hadley Wickham and represents "a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)." The objective should always be to get a dataset into a tidy form, which consists of:

- each variable forming a column,
- each observation forming a row, and
- each type of observational unit forming a table.
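As a sketch of what tidying looks like in practice (illustrative data; the tidyr package is one common tool for this reshaping):

    library(tidyr)

    # a "messy" layout: one column per year
    messy <- data.frame(country = c("A", "B"),
                        `2015` = c(10, 20),
                        `2016` = c(12, 24),
                        check.names = FALSE)

    # gather() folds the year columns into key-value pairs,
    # yielding one row per country-year observation
    gather(messy, key = "year", value = "count", `2015`:`2016`)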

  • Bradley C. Boehmke

Transforming your data is a basic part of data wrangling. This can include filtering, summarizing, and ordering your data by different means. It also includes combining disparate data sets, creating new variables, and many other manipulation tasks. Although many fundamental data transformation and manipulation functions exist in R, historically they have been a bit convoluted and lacked a consistent and cohesive code structure. Consequently, Hadley Wickham developed the very popular dplyr package to make these data processing tasks more efficient, with a syntax that is consistent and easier to remember and read.
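A minimal sketch of the dplyr grammar on a built-in data set (the km-per-liter conversion is an illustrative choice):

    library(dplyr)

    mtcars %>%
      filter(cyl != 6) %>%                # keep a subset of rows
      mutate(kml = mpg * 0.4251) %>%      # create a new variable
      group_by(am) %>%                    # group by transmission type
      summarise(avg_kml = mean(kml)) %>%  # summarize each group
      arrange(desc(avg_kml))              # order the result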

... A1. Data Wrangling This step implies the collection and preconditioning of data so it can be analyzed by algorithms. It is estimated that this step can consume 80% of a data scientist's time, since data quality is key to further success [28]. This step also involves dividing the data into a training set, for model development, and a testing set for model verification. ...

... No consensus exists except that: 1) the model is not trained on the test set, 2) approaches to dividing are well stated (random, deterministic) and discussed, and 3) percentages of the total data are reasonable (generally 10-50% for testing) [8]. Beyond these matters, data wrangling is outside the scope of this paper, and solid methodologies can be found in [28]. ...

As a result of the increased usage of the internet, huge amounts of data are collected from a variety of sources such as surveys, censuses, and sensors in the Internet of Things. This resultant data is termed big data, and its analysis leads to major decision making. Since the collected data is in raw form, it is difficult to understand its inherent properties, and it becomes just a liability if not analyzed, summarized, and visualized. Although text can be used to articulate the relation between facts and to explain findings, presenting data in the form of tables and graphs conveys information more effectively. The presentation of data using tools that create visual images in order to gain more insight into the data is called data visualization. Data analysis is the processing and interpretation of data to discover useful information and to deduce certain inferences based on the values. This chapter concerns the usage of the R tool and its effectiveness for data analysis and intelligent data visualization, experimenting on a data set obtained from the University of California Irvine Machine Learning Repository.

The theoretical foundations of Big Data Science are not fully developed, yet. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics challenges. Albeit not fully developed yet, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics of Alzheimer's disease case-study. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.

  • Fulya Gokalp Yavuz
  • Mark Daniel Ward

Data Science is one of the newest interdisciplinary areas. It is transforming our lives unexpectedly fast. This transformation is also happening in our learning styles and practicing habits. We advocate an approach to data science training that utilizes several types of computational tools, including R, bash, awk, regular expressions, SQL, and XPath, often used in tandem. We discuss ways for undergraduate mentees to learn about data science topics, at an early point in their training. We give some intuition for researchers, professors, and practitioners about how to effectively embed real-life examples into data science learning environments. As a result, we have a unified program built on a foundation of team-oriented, data-driven projects.
