
Merging Multiple Data Files into One Data Frame in R: 3 Options

First of all, HAPPY NEW YEAR!

Another Exciting Project

Recently, I started a new project with NIA to find topics and their trends over time (2005~2017) in news articles: around 15,000,000 articles in total, stored as several gigabytes of csv files. That is a lovely size for analysis!

Anyway, I received a Dropbox link containing those csv files, easily transferred them to my own account, and downloaded them to my desktop to start playing around with them in R.

As expected, the files are kept in several folders, and I was about to load them into RStudio Server since it has sufficient memory to deal with them.

Out of habit, I googled and double-checked whether there is any better NEW way in R to load all of the files at once and merge them into one huge data frame. Guess what? There is a WOW package (readbulk) that does exactly that!!! And as always, thank god I realized how stupid I am, every day!

Importing and Merging Multiple csv files into One Data Frame - 3 Ways

As such, I would like to summarize three ways to merge a bunch of csv files into one huge data frame in R, including readbulk: (1) fread(), (2) spark_read_csv(), and (3) read_bulk().

(1) data.table::fread()

# (1) Make sure where your files are located
csv_files <- list.files (path       = "data/2005-2017", 
                         pattern    = "\\.csv$",  # => regex: any file ending in .csv
                         full.names = TRUE)

# (2) Import all csv files with 'fread()' and bind them into one data frame
library (data.table)
library (tibble)  # => for as_tibble()
library (readr)   # => for write_excel_csv()

DATA_ALL <- as_tibble (rbindlist (lapply (csv_files, fread)))

# (3) Save
save (DATA_ALL, file = 'object/DATA_ALL.Rda') 
write_excel_csv (DATA_ALL, 'data/DATA_ALL.csv')

All the csv files are saved under the data/2005-2017 folder on my machine, so I pass that folder as the path to tell R where the files are located. Since I need every csv file, I set the pattern to match any file ending in .csv (and request full.names so each entry keeps its path). Then I use rbindlist() from the data.table package together with fread() and lapply() to load and stack all the files at once. Finally, the merged object is converted to a tibble and stored as DATA_ALL.
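If you want to confirm the merge worked before saving, a quick check like the sketch below helps (the exact dimensions and column types will of course depend on your own files):

# Quick sanity checks on the merged tibble
dim (DATA_ALL)             # => rows x columns after binding all csv files
dplyr::glimpse (DATA_ALL)  # => column names and types at a glance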

(2) sparklyr::spark_read_csv()

# Set Up Spark Connection
library (sparklyr)
sc <- spark_connect (master = "local") # => local = Desktop

First things first. To take full advantage of Spark with sparklyr in R, we first need to connect to an already installed Spark; spark_install() takes care of that installation. I assume we have already done that and proceed to import the csv files from the folder directly into Spark. This way, we don't have to worry about the memory issues we might encounter in R when dealing with tons of files.
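For reference, that one-time installation step is a single call; the version below is only an example, so pick whichever Spark version suits your setup:

# One-time setup: download and install a local Spark distribution
library (sparklyr)
spark_install (version = "2.3")  # => example version; only needed once per machine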

# Load multiple csv files DIRECTLY to Spark: R memory is not affected!
DATA_ALL <- spark_read_csv (sc,
                            name = "data",
                            path = "data/2005-17/*.csv")

# Check Spark "DATA_ALL"
library (dplyr)  # => for glimpse()
sdf_dim (DATA_ALL); sdf_nrow (DATA_ALL); sdf_ncol (DATA_ALL)
glimpse (DATA_ALL)
print (DATA_ALL, n = 2, width = Inf)
head (DATA_ALL)

Using the handy spark_read_csv() with a simple wildcard (*) in the path, all the csv files in that folder are easily loaded into Spark. As always, we can quickly check the imported data with sdf_dim(), and even with glimpse() from dplyr.

# Save merged file in Spark => Directly to Desktop as One csv File
DATA_ALL <- sdf_coalesce (DATA_ALL,        
                          partitions = 1)

spark_write_csv (DATA_ALL, 
                 "data/DATA_ALL.csv", # => File location where this will be saved
                 header = TRUE, 
                 mode = "overwrite",
                 charset = "UTF-8")

Prior to saving the combined result to the desktop, we might want to decide how many output csv files we want. In this case, I need only one big csv file, so I set partitions = 1 using the sdf_coalesce() function. Finally, with spark_write_csv(), the DATA_ALL object is saved under the "data" folder as DATA_ALL.csv. This is a simple but very effective approach to combining multiple files if you have more than two cores in your local machine: put your cores to work!
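Once everything is written to disk, it is good practice to close the Spark connection we opened at the beginning (a small housekeeping step not shown above):

# Close the Spark connection when finished
spark_disconnect (sc)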

(3) read_bulk() with SQLite

Now, I will use the recently discovered read_bulk() together with the handy SQLite() driver from the RSQLite package. SQLite is a really convenient database that requires no separate installation or configuration.

# (1) Connect to Database
library (DBI); library (RSQLite); library (readbulk)

DATA_DB <- dbConnect (RSQLite::SQLite(),  # => We use SQLite
                      "DATA_DB.sqlite")   # => Creates the 'sqlite' file on your machine

We first create and connect to the database, named "DATA_DB", with dbConnect(). A DATA_DB.sqlite file is then created in your working directory, and all the csv files to be combined will be inserted into it. In the meantime, you will be amazed by the size of the DATA_DB object in R: only a few kilobytes. What this means is that you are free of memory issues in R, since the actual csv data is stored directly in DATA_DB.sqlite instead. R here simply acts as an intermediary between the database and the csv files being transferred.
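If you are curious, you can verify this yourself; a minimal sketch (the exact numbers will depend on your data):

# The R connection object stays tiny; the data itself lives on disk
format (object.size (DATA_DB), units = "KB")  # => a few kilobytes in R
file.size ("DATA_DB.sqlite")                  # => size of the on-disk database, in bytes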

# (2) Load csv files to Database using 'read_bulk()'
system.time ( 
  dbWriteTable (
    DATA_DB,             # => DB connection created in (1)
    name   = "DATA_DB",  # => Name of the table inside the database
    append = TRUE,       # => Keep adding csv files into this single table
    value  = read_bulk (
      fun = data.table::fread,             # => or readr::read_csv(): any reading function
      directory = "data/2005-17",          # => File location
      #subdirectories = TRUE,              # => Set TRUE if there are nested folders
      #subdirectories = c("2010","2015"),  # => Or pick specific nested folders as needed
      extension = ".csv", 
      verbose   = TRUE)) )

One thing to note is that I always measure how long things take (i.e., with system.time()) when working with large volumes of data, including but not limited to model fitting. Once I know how long a specific task takes, it becomes much easier to manage multiple tasks at the same time. The clock is ticking!

The fun argument in read_bulk() allows you to use any function to import the files, so we could set it to readr::read_csv, rio::import, or utils::read.csv, to name a few. data.table::fread() is renowned for its import speed, so it is my choice of the day.
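For example, if you prefer readr, the same bulk import could be sketched as follows (directory and extension are the same as above; this is just an illustration of swapping the reader function):

# Same bulk import, but with readr::read_csv as the reader
DATA_MERGED <- read_bulk (directory = "data/2005-17",
                          extension = ".csv",
                          fun       = readr::read_csv)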

# (3) Queries with "dplyr" verbs: No SQL commands are necessary!

# Show what tables are in 'DATA_DB'
dbListTables (DATA_DB)

# Use `tbl()` to make `DATA_DB` accessible with "dplyr" verbs
library (dplyr); library (dbplyr)
DATA_ALL <- tbl (DATA_DB, "DATA_DB")

# Check with "dplyr" Verb with pipe operator
glimpse (DATA_ALL)

DATA_ALL %>% select (contents)

DATA_TO_R <- DATA_ALL %>% 
  arrange (-postTime) %>% 
  head (100) %>% 
  collect() # => Save it to "R" when you think you are done

# Save "DATA_TO_R"(R Object) to Disk
save (DATA_TO_R, file = "object/DATA_TO_R.Rda")

The beauty of dplyr also applies to SQL. Although not every SQL command is available through dplyr, it is more than enough to get the data we want (e.g. select, filter). After playing around with the data a few times, use collect() to pull the result into R (memory) for subsequent analysis. I made sure I kept it by saving DATA_TO_R as an Rda file on disk.
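And once the analysis is done, it is good practice to close the database connection as well (a small housekeeping step I did not show above):

# Close the SQLite connection when finished
dbDisconnect (DATA_DB)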

Wrap up

As always, tons of data (or files) need to be massaged (migrated, integrated, etc.) before we can analyze them the way we want. Of course, with R, there are multiple alternative ways to do the same task. Here, this post looked at the very fundamental stage of importing and combining a bunch of csv files into one dataset with three simple ways (that I know of): (1) fread(), (2) spark_read_csv(), and (3) read_bulk().

Again, any comment will be appreciated, and adding your own way of doing this job will also be greatly appreciated! Cheers!