Modifying CSV files with Node.js

I needed a way to quickly and efficiently modify a large CSV file, so I used Javascript.

Background

I work with Shopify every single day at my job, and parts of this job often involves the transportation of large amounts of e-commerce data. Let's say you have a store that's on Magento and you're about to migrate to Shopify, you'll need to move all of your existing products, customers, orders data over. The general idea is that you'll connect to your Magento database, export your Magento data into a usable format (like .csv files), and import them into Shopify. Seems simple enough?

The issue is that these are often extremely large chunks of data. Your historical orders may span years and hit the hundreds of thousands and your exported CSV files will become very large, needing to be formatted in in a way that Shopify can read.

There are tools out there to assist with this. Shopify provide their own (very basic) Shopify app and CLI tool to handle the input of Magento data, and there are powerful third party apps like Excelify which take advantage of the Shopify API to create a more advanced and robust data transportation experience.

Predictably, these apps don't handle any reformating of data - they'll take a value and either accept or reject it. So what happens if you need to modify data before uploading? A real world example of this can be seen from the Transporter documentation:

Email: The unique email address of the customer. This column is required if the Phone or the First Name and Last Name columns are not present. The domain (for example, @mydomain.com) needs to be valid.

The last sentence is a fun one. At first, I thought that this was going to be a simple regex string check (i.e. 'does this string follow the pattern of 'name@domain.com'). But no, as I found out the hard way of receiving thousands of upload rejections, what actually happens as you're importing your CSV is that Shopify will run a domain check for each customer name and validate if the domain is still registered and active, if the domain is not currently active, that customer is rejected.

So among other things, I needed to modify these records in bulk.

Creating a Node.js script

To preface all of this, I know nothing about CSV files outside of basic spreadsheet functions, I am a complete novice. Could there be a way to do this within Excel or Numbers? Probably. My sheets had around 750,000 rows altogether and I needed a way to quickly, efficiently, and automatically modify these data sets - so I turned to Node.js with the reasoning that I would be able to convert the CSV format into JSON and then use familiar ES6 array functionality like .map() and .filter().

It turned out to be quite a fun exploration, and the script works quite well! I've actually ended up using it a bunch of times for modifying multiple data sets (not just Shopify!) and upon realising that this thing seems to have repeated real-world use, I cleaned it up and open sourced it in case there is anybody else in my position:

https://github.com/jonashcroft/csv-json-boilerplate

From the GitHub readme:

This is a Node.js script that will allow you to parse a CSV File with Javascript, modify its contents, and output a new CSV file. This script will create a readable stream to your CSV file, parse the data into a JSON object and allow you to use modern Javascript functions on it before it outputting the everything into a new CSV file.

I wanted to make this as easy and readable as possible, so the script is essentially broken down into:

  1. Configure your CSV names and values
  2. Add your own ES6 functionality in the designated spaces
  3. Automatically export a new file.

The example CSV in the repo uses a table of the best selling records in the 2010s from Wikipedia. I've added a demo function to remove any albums that were released in 2015:

/**
 * Function that will remove items that don't match our desired years.
 */
function filterAlbumYears() {
  console.log('Removing items released in 2015');

  modifiedCsvJson = modifiedCsvJson.filter((item) => {
    return item['Year'] === '2015' ? null : item
  });

  console.log('...Done');
}

Earlier in this post, I mentioned needing to modify invalid emails with valid ones - I achieved this by using ES6 .map() functionality and laid out an example of this in the repo. The best-selling records sheets add brackets to the number of units sold (e.g. (2,000,000)), let's remove these:

/**
 * Removes the parenthesis from the 'Certified Units' field.
 */
function modifyCertifiedUnits() {
  console.log('Removing parenthesis from Units Sold...')

  modifiedCsvJson = modifiedCsvJson.map((item) => {
    const returnedItem = item
    const itemKey = 'Certified Units'

    returnedItem[itemKey] = item[itemKey].replace(/[{()}]/g, '');

    return returnedItem
  })

  console.log('...Done');
}

Conclusion

Overall, I am happy with how this came out, I think the fact that I have repeatedly used this for multiple data sets means that it's paid for itself, so I hope that there's people out there who can also find use of this!

The script wouldn't be possible without two excellent modules, csv-parser and csv-writer - both of these have really done the heavy lifting here, I just wanted to package up both of these modules into a useful boilerplate.

GitHub Repo: https://github.com/jonashcroft/csv-json-boilerplate