How to format incoming data?

I am working on a data viz project. The first step is data collection. After several tries, I formatted the data into a shape that was easier for me to manipulate in JS.

I had to go through several iterations to get the data to the point where it was easier to work with. But I was just guess-timating what to do until closer to the end when I started seeing the logic of what kind of data structure would be most beneficial.

So my question is: is this going into the realm of data engineering? (Albeit at a very low level?) And, since I am NOT a data engineer, what are the rules of thumb, best practices, etc that I need to know to be able to do this type of simple data formatting (BEFORE loading into JS file)? What resources should I read…?

Thanks!

What is the existing format of the incoming data that is being collected (e.g. web scraped data, flat files)?

I collected the data from tables in PDF files posted online and converted them into CSV files. It is these CSV files that I am manipulating before loading them into the JS file.

First off, I would start by thinking about the data type (e.g. text, decimal) of each data point you will be working with from the source data, and how those data points relate to each other. As a simple example, say the data in the PDF file contains company sales for a time period. In this case you’d work with these data points and formatting approaches:

  • Company name: text
  • Sales: decimal
  • Time period: a date, or text if the PDF reports sales for, say, the third quarter of a year. If it is already presented as a date, format it as a date. If it is text, such as Q3, translate it into the date range 07/01/YY - 09/30/YY. Alternatively, you can use two date fields, one for the start and one for the end.
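To make the list above concrete, here is a minimal sketch of that typing step in plain JS. The column names (`company`, `sales`, `period`, `year`), the sample values, and the helper names `quarterToRange` and `parseRow` are all made up for illustration, and it assumes each CSV row has already been read into an object whose values are strings.

```javascript
// Hypothetical rows as they might look right after reading a CSV:
// every value is still a string.
const rawRows = [
  { company: " Acme ", sales: "1,250.50", period: "Q3", year: "2021" },
  { company: "Globex", sales: "980", period: "Q1", year: "2021" },
];

// Translate a quarter label such as "Q3" into start and end dates
// (ISO strings, YYYY-MM-DD). Assumes calendar quarters.
function quarterToRange(quarter, year) {
  const q = Number(quarter.replace(/^Q/i, ""));
  if (!Number.isInteger(q) || q < 1 || q > 4) {
    throw new Error(`Unrecognized quarter: ${quarter}`);
  }
  const startMonth = (q - 1) * 3; // 0-based month index
  const start = new Date(Date.UTC(year, startMonth, 1));
  // Day 0 of the month after the quarter is the last day of the quarter.
  const end = new Date(Date.UTC(year, startMonth + 3, 0));
  return {
    start: start.toISOString().slice(0, 10),
    end: end.toISOString().slice(0, 10),
  };
}

// Give each field the type it should have: text, decimal, start/end dates.
function parseRow(row) {
  const { start, end } = quarterToRange(row.period, Number(row.year));
  return {
    company: row.company.trim(),
    sales: Number(row.sales.replace(/,/g, "")), // drop thousands separators
    periodStart: start,
    periodEnd: end,
  };
}

const rows = rawRows.map(parseRow);
console.log(rows);
```

Doing this once, right after loading, means the rest of the code can treat `sales` as a number and the period as real dates rather than re-parsing strings in every chart.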

Hope this helps.

That’s really helpful @xps321.

I think where I’m also getting stuck is in trying to figure out what data should be in rows and what data should be in columns. I’m using d3 for a lot of my code, and the d3 library often requires data to be in a certain format. Since I can create the CSV file however I want, it’s in my interest to take advantage of that as much as possible so I have less data manipulation to do in the JS file.

For now, I’m working backwards, i.e. looking at what the data needs to look like right before it goes into d3 and working back from that to find the best way to shape the CSV file to make my JS life easier.

I get the feeling that is just about all I can do…? There isn’t necessarily any best practice around that question…?
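One convention that may help with the rows-versus-columns question is to keep the CSV in "long" form, with one observation per row and one variable per column, and to reshape in JS only when a particular chart needs something else. The sketch below uses plain JS and made-up sample data. The helpers `toSeries` and `toWide` are hypothetical names, not part of d3, and the two target shapes are just common examples, not something every d3 chart requires.

```javascript
// Hypothetical "long" rows: one observation (company, quarter, sales) per row.
const longRows = [
  { company: "Acme", quarter: "Q1", sales: 100 },
  { company: "Acme", quarter: "Q2", sales: 120 },
  { company: "Globex", quarter: "Q1", sales: 90 },
  { company: "Globex", quarter: "Q2", sales: 95 },
];

// Group into one series per company, e.g. for drawing one line per company.
function toSeries(rows) {
  const byCompany = new Map();
  for (const { company, quarter, sales } of rows) {
    if (!byCompany.has(company)) byCompany.set(company, []);
    byCompany.get(company).push({ quarter, sales });
  }
  return Array.from(byCompany, ([company, values]) => ({ company, values }));
}

// Pivot into "wide" rows: one row per quarter, one column per company.
function toWide(rows) {
  const byQuarter = new Map();
  for (const { company, quarter, sales } of rows) {
    if (!byQuarter.has(quarter)) byQuarter.set(quarter, { quarter });
    byQuarter.get(quarter)[company] = sales;
  }
  return Array.from(byQuarter.values());
}

const series = toSeries(longRows);
const wide = toWide(longRows);
console.log(JSON.stringify(series));
console.log(JSON.stringify(wide));
```

Starting from the long form, both shapes come out of a short loop, whereas a CSV built wide for one chart usually has to be taken apart again for the next one.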