First of all, HAPPY NEW YEAR!
Another Exciting Project
Recently, I started a new project with NIA to find topics and their moving trends over time (2005-2017) in news articles: about 15,000,000 articles in total, stored as several gigabytes of csv files. That is a lovely size for analysis!
Anyway, I received a Dropbox link containing the csv files, easily copied them to my own account, and then downloaded them to my desktop to start playing around with them.
As expected, the files are kept in several folders, and I was about to load them into RStudio Server, which has sufficient memory to deal with them.
Out of habit, I googled to double-check whether there is any better, newer way in R to load all of the files at once and merge them into one huge data frame. Guess what? There is a WOW package (readbulk) that does exactly that! And as always, thank goodness I get to realize how much I still don't know, every day!
Importing and Merging Multiple csv files into One Data Frame - 3 Ways
As such, I would like to summarize three ways to merge a bunch of csv files into one huge data frame in R: (1) data.table::fread() with rbindlist(), (2) spark_read_csv() from sparklyr, and (3) read_bulk() with SQLite.
# (1) Make sure where your files are located
csv_files <- list.files(path = "data/2005-2017", pattern = "\\.csv$", full.names = TRUE)

# (2) Import all csv files with 'fread()'
library(data.table)
library(tibble)   # for as_tibble()
DATA_ALL <- as_tibble(rbindlist(lapply(csv_files, fread)))

# (3) Save
library(readr)    # for write_excel_csv()
save(DATA_ALL, file = "object/DATA_ALL.Rda")
write_excel_csv(DATA_ALL, "data/DATA_ALL.csv")
All the csv files are saved in the data/2005-2017 folder on my machine, so I pass that folder as the path to tell R where the files are located. Since I need every csv file, the pattern matches any file ending in .csv, returned with its full path (full.names = TRUE). Then, I use the fread() function from the data.table package, applied over the file list with lapply(), to load all the files at once. Finally, rbindlist() merges them, and as_tibble() converts the result into the DATA_ALL object.
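If data.table is not installed, the same list-read-stack pattern can be sketched with base R alone; here is a minimal, self-contained illustration (the toy files and temporary folder below are made up purely for the demo):

```r
# Create two tiny csv files in a temporary folder (illustration only)
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, text = c("a", "b")),
          file.path(tmp, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, text = c("c", "d")),
          file.path(tmp, "part2.csv"), row.names = FALSE)

# Same pattern as above: list the files, read each one, stack the results
csv_files <- list.files(path = tmp, pattern = "\\.csv$", full.names = TRUE)
DATA_ALL  <- do.call(rbind, lapply(csv_files, read.csv))

nrow(DATA_ALL)   # 4: both files stacked into one data frame
```

This is slower than fread() on millions of rows, but handy when you cannot install packages.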
# Set Up Spark Connection
library(sparklyr)
sc <- spark_connect(master = "local")   # => local = Desktop
First things first: to take full advantage of Spark with sparklyr in R, we need to connect to an already installed Spark. If it is not installed yet, spark_install() takes care of that easily. I assume that is already done and proceed to import the csv files from the folder directly into Spark. This way, we do not have to worry about the memory issues we might otherwise encounter when dealing with tons of files.
# Load multiple csv files DIRECTLY to Spark: R memory is not affected!
DATA_ALL <- spark_read_csv(sc, "data", "data/2005-17/*.csv")

# Check Spark "DATA_ALL"
sdf_dim(DATA_ALL); sdf_nrow(DATA_ALL); sdf_ncol(DATA_ALL)
glimpse(DATA_ALL)   # glimpse() comes from dplyr
print(DATA_ALL, n = 2, width = Inf)
head(DATA_ALL)
Using the handy spark_read_csv() with a simple wildcard (*) in the path, all the csv files in that folder are easily loaded into Spark. As always, we can quickly check the imported data with sdf_dim(), or peek at it with glimpse() and head().
# Save merged file in Spark => Directly to Desktop as One csv File
DATA_ALL <- sdf_coalesce(DATA_ALL, partitions = 1)
spark_write_csv(DATA_ALL,
                "data/DATA_ALL.csv",   # => File location where this will be saved
                header = TRUE,
                mode = "overwrite",
                charset = "UTF-8")
Before saving the combined result back to the desktop, we should decide how many output csv files we want. In this case, I need only one big csv file, so I set partitions = 1 with the sdf_coalesce() function. Finally, spark_write_csv() saves the DATA_ALL object under the "data" folder as DATA_ALL.csv. This is a simple but very effective approach to combining multiple files if your local machine has at least two cores: distribute your cores to work!
read_bulk() with SQLite
Now, I will use the recently discovered read_bulk() together with the handy SQLite() driver from the RSQLite package. SQLite is a really handy database that requires no separate installation or configuration.
# (1) Connect to Database
library(RSQLite); library(readbulk)
DATA_DB <- dbConnect(RSQLite::SQLite(),   # => We use SQLite
                     "DATA_DB.sqlite")    # => Creates an 'sqlite' file on your machine
We first create and connect to the database with dbConnect(). A DATA_DB.sqlite file is created in your working directory, and all the csv files to be combined will be inserted into it. In the meantime, you will be amazed by the size of the DATA_DB object in R: only a few kilobytes. This means you are free of memory issues in R, since the actual csv contents are stored directly in DATA_DB.sqlite instead. R here acts merely as a liaison between the database and the csv file transfer.
# (2) Load csv files to Database using 'read_bulk()'
system.time(
  dbWriteTable(
    DATA_DB,                            # => Database connection created in (1)
    name   = "DATA_DB",                 # => Table name
    append = TRUE,                      # => Keep adding csv files into this single DB
    value  = read_bulk(
      fun       = data.table::fread,    # => or readr::read_csv(): any reading function
      directory = "data/2005-17",       # => File location
      # subdirectories = FALSE,         # => Set TRUE if there are nested folders
      # subdirectories = c("2010", "2015"),  # => Or pick specific nested folders
      extension = ".csv",
      verbose   = TRUE))
)
One thing to note: I always measure how long things take (with system.time()) when working with large volumes of data, including but not limited to model fitting. Once I know how long a specific task takes, it becomes much easier to manage multiple tasks at the same time. The clock is ticking!
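As a tiny illustration of that habit (the computation here is arbitrary, and timings will vary by machine):

```r
# system.time() returns user/system/elapsed seconds for any expression
timing <- system.time({
  x <- sum(sqrt(seq_len(1e6)))   # stand-in for a heavier task
})
timing["elapsed"]                # wall-clock seconds the task took
```

The "elapsed" entry is usually the one worth logging when planning parallel work.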
The fun = argument in read_bulk() lets you use any function to import the files, so we could set it to utils::read.csv or readr::read_csv(), to name a few. data.table::fread() is renowned for its import speed, so it is my choice of the day.
# (3) Queries with "dplyr" verbs: No SQL commands are necessary!
# Show what tables are in 'DATA_DB'
dbListTables(DATA_DB)

# Use 'tbl()' to make 'DATA_DB' accessible with "dplyr" verbs
library(dplyr); library(dbplyr)
DATA_ALL <- tbl(DATA_DB, "DATA_DB")

# Check with "dplyr" verbs and the pipe operator
glimpse(DATA_ALL)
DATA_ALL %>% select(contents)
DATA_TO_R <- DATA_ALL %>%
  arrange(-postTime) %>%
  head(100) %>%
  collect()   # => Pull it into "R" when you think you are done

# Save "DATA_TO_R" (R object) to disk
save(DATA_TO_R, file = "object/DATA_TO_R.Rda")
The beauty of dplyr also applies to SQL. Though not every SQL command has a dplyr equivalent, it is more than enough to get the data we want (e.g. select, filter). After playing around with the data a few times, use collect() to pull it into R memory for subsequent analysis. To be safe, I also saved DATA_TO_R to disk as an .Rda file.
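To see this lazy-query workflow end to end, here is a minimal, self-contained sketch using an in-memory SQLite database and a toy table (the table name, columns, and values are made up for the demo; assumes dplyr, dbplyr, and RSQLite are installed):

```r
library(DBI); library(RSQLite); library(dplyr); library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # throwaway in-memory DB
dbWriteTable(con, "news", data.frame(id = 1:3, hits = c(10, 30, 20)))

news <- tbl(con, "news")                  # lazy reference: no rows pulled yet
top  <- news %>% arrange(desc(hits)) %>% head(2)

show_query(top)          # peek at the SQL that dplyr generates behind the scenes
result <- collect(top)   # only now are rows actually brought into R memory
result$id                # 2 3: the two rows with the most hits
```

show_query() is a nice sanity check that the dplyr chain translates to the SQL you expect before you collect() a large result.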
As always, tons of data (or files) need to be massaged (migrated, integrated, etc.) before we can analyze them the way we want. Of course, with R, there are multiple alternative ways to do the same task. This post looked at the very first stage of importing and combining a bunch of csv files into one dataset, in three simple ways (that I know of): (1) data.table::fread() with rbindlist(), (2) spark_read_csv(), and (3) read_bulk() with SQLite.
Again, any comment will be appreciated, and sharing your own way of doing this job will be greatly appreciated too! Cheers!