
February 26, 2013

Data File Processing, Data Mining and Report Normalization Tips

At Silva Tech Solutions LLC, we handle a lot of custom reporting tasks for clients in a variety of industries. Typically, we are asked to process an input file (a spreadsheet, text file, even a database), extract the required data, append to or manipulate the extracted data, and deliver a report in a file with a specific format (i.e., spreadsheet, text, database, etc.).

Our clients receive these raw input files from a variety of sources, ranging from a local municipality tax feed to a free online website. We can usually get the data automatically by crawling the website, or the client can simply send the input file as an attachment to a specific email address and the system will email back the final report.

These are some data normalization tips that we’ve collected over the years for extracting data from a variety of sources.

1.) We are Perl and PHP fans. Really, any language that lets you use regexes as the basis for pattern matching will work, but scripting languages allow you to adapt the reporting code quickly to changing conditions. (Geek note: we just love Perl for the on-the-fly/infinite hash mappings; there is a small sketch of this after the list.)

2.) The required output format of the final report is very important and should be determined right away. If the final file is going to be a CSV file, what delimiter are you going to use to separate the data? If it is a TAB character, is there any chance that tabs already exist in the input data and will mess up the output? If so, you'll need to choose another type of delimiter (a quick pre-flight check for this is sketched after the list).

3.) Always plan for changes in the data down the road. If the input data has fields added, will your process be able to adapt? This is especially true with municipality or local city data feeds. Federal data can change, but it is usually documented fairly well and/or older formats remain available. Also, data such as National Weather Service data is used by significantly more companies than, say, a feed from a local suburb in PA, so the formats are usually kept fairly stable to avoid a massive outcry. (Looking columns up by header name, shown below, is one way to absorb added fields.)

4.) Always plan to have basic unit tests for your reports. If some errant data is introduced to your program, you need to be able to detect it quickly and alert someone that the data has changed. This is especially true if the data is being crawled from a website. Websites change frequently, and if the web crawler cannot detect a page change, the data being retrieved can be utterly useless. (A few Test::More checks, sketched below, go a long way.)

5.) Excel spreadsheet version issues. Code that automatically generates Excel spreadsheets may work well in newer versions of Excel, but not necessarily in older versions. Generally, the data provided in the final Excel report does not rely on any of the newer features of MS Excel, so make sure the report works in the older versions of Excel first. Newer versions of Excel, in our experience, have no problem importing spreadsheets made in older versions. (A minimal old-format .xls example is included below.)
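To show what we mean by the hash mappings in tip 1, here is a minimal Perl sketch (the file name "tax_feed.txt" and the pipe-delimited field layout are made up for the example) that uses a regex to pull a date, town and amount out of each line and lets autovivification build the nested totals hash as it goes:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: pipe-delimited lines like "2013-02-26|Springfield|123.45"
my %totals;

open my $fh, '<', 'tax_feed.txt' or die "Cannot open input: $!";
while ( my $line = <$fh> ) {
    chomp $line;

    # Regex-based extraction of only the fields we care about
    next unless $line =~ /^(\d{4}-\d{2}-\d{2})\|([^|]+)\|([\d.]+)$/;
    my ( $date, $town, $amount ) = ( $1, $2, $3 );

    # Autovivification: the nested hash keys spring into existence on first use
    $totals{$town}{$date} += $amount;
}
close $fh;

for my $town ( sort keys %totals ) {
    for my $date ( sort keys %{ $totals{$town} } ) {
        printf "%s\t%s\t%.2f\n", $town, $date, $totals{$town}{$date};
    }
}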
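For the delimiter question in tip 2, a quick pre-flight scan of every field is a cheap hedge before committing to a tab-delimited report; the sample rows below are invented for the illustration (and the second one would trip the check):

#!/usr/bin/perl
use strict;
use warnings;

my $delimiter = "\t";

# Invented rows; note the embedded tab in the second company name
my @rows = (
    [ 'ACME Corp',    '123 Main St', '500.00' ],
    [ "Widgets\tInc", '9 Elm St',    '75.25'  ],
);

# Scan every field for the output delimiter before writing the report
for my $row (@rows) {
    for my $field (@$row) {
        die "Field contains the output delimiter, pick another one: '$field'\n"
            if index( $field, $delimiter ) >= 0;
    }
}

print join( $delimiter, @$_ ), "\n" for @rows;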
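For tip 3, looking columns up by their header names instead of fixed positions is one way to survive added or reordered fields. This sketch assumes a simple comma-delimited feed with a header row and no quoted commas; the feed name and column names are hypothetical:

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'municipal_feed.csv' or die "Cannot open feed: $!";

# Build a name-to-position map from the header row so newly added columns
# do not silently shift the fields we rely on
chomp( my $header = <$fh> );
my @names = split /,/, $header;
my %col;
@col{@names} = 0 .. $#names;

for my $required (qw(parcel_id owner amount_due)) {
    die "Feed format changed: missing column '$required'\n"
        unless exists $col{$required};
}

while ( my $line = <$fh> ) {
    chomp $line;
    my @fields = split /,/, $line, -1;
    printf "%s owes %s\n", $fields[ $col{owner} ], $fields[ $col{amount_due} ];
}
close $fh;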
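For tip 4, even a handful of Test::More checks over the parsed records will flag a silent format change. The expectations and the parse_feed() stub below are placeholders for whatever your real extraction code returns:

#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 3;

# Pretend this came back from the real crawl-and-extract code
my @records = parse_feed();

ok( scalar @records > 0, 'feed returned at least one record' );

my $valid_amounts = grep { $_->{amount} =~ /^\d+(\.\d{2})?$/ } @records;
ok( $valid_amounts == scalar @records, 'every amount looks like a dollar value' );

is( scalar keys %{ $records[0] }, 3, 'record layout still has 3 fields' );

sub parse_feed {
    # Stub standing in for the real parser
    return ( { town => 'Springfield', date => '2013-02-26', amount => '123.45' } );
}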
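And for tip 5, one option on the Perl side is the older-format .xls writer Spreadsheet::WriteExcel from CPAN, which produces files that even old copies of Excel open cleanly; a minimal sketch with made-up report data:

#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::WriteExcel;    # writes the older Excel 97-2003 .xls format

my $workbook = Spreadsheet::WriteExcel->new('final_report.xls')
    or die "Could not create spreadsheet: $!";
my $worksheet = $workbook->add_worksheet('Report');

# Stick to plain cells so older copies of Excel open the file cleanly
$worksheet->write( 0, 0, 'Town' );
$worksheet->write( 0, 1, 'Amount Due' );
$worksheet->write( 1, 0, 'Springfield' );
$worksheet->write_number( 1, 1, 123.45 );

$workbook->close();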


We have been doing data reporting for over 13 years now, so we hope that you will find some of these tips useful.

If you need any help at all, feel free to shoot us a message.

Also feel free to check out some of our previous data processing exploits by clicking here.