Importing Data

From IHRIS Wiki
Revision as of 00:23, 7 January 2013 by Sovello (talk | contribs)

This article describes the import tools developed for Botswana. The same can be adopted for other projects and be edited as necessary depending on the structure of the data you have. The import tools for Botswana can be accessed in launchpad from here. Here there are many files, however we have one important file the import_base.php which forms the basis for all the other tools. We will refer to this file in this article as simply the Import Base. The Import Base defines the classes which are then implemented by all the other tools.

Import Base

This file can be organized into different sections. The first section being lines 39-74. This is the part where we set the system wide variables like database credentials, where we find the other tools like I2CE and Command Line Interfacing tools (line 74). If you check closely, this first part resembles the site index.php file.

DataFile Class

The DataFile class is the second part of import base. The DataFile class (lines 85-114) defines the abstract methods for the real classes to implement. It defines methods like getDataRow(), hasDataRow(), getHeaders() to fetch one data row at a time, to check if the file we are reading has data rows in it and to get the headers from the file respectively. We are assuming the file we are going to work with has its data organized into rows. The constructor initializes the file that we will be working on and we have a method to fetch the file name and to close the file after we have finished reading it.

CSVDataFile Class

This is the first implementation of the DataFile class defined in the second part of the import base.

It implements the DataFile class depending on the features of the CSV File. Its constructor calls the parent constructor to initialize the file we are going to import data from, reads the size of this file and makes sure the file is open for reading otherwise it fails. (lines 123-125).

The hasDataRow() methods checks to see if there is some data in the file we just opened in the constructor. ([php.net/manual/en/function.ftell.php ftell() method]). Should there be no data, we return false otherwise we return true that the file has data in it.

Then we go and read the headers from this file. We assume the headers to be in the first row of the file [php.net/manual/en/function.fseek.php fseek() method) and that the separators could be one of a tab, comma or semicolon. (lines 147 and 148)

Then we read the header fields into a variable $headers line 149. Lines 150-153 we check if the $headers array has no data in it, or there are only two headers and if the user (one running the script) replies no the the displayed headers for confirmation, then we remain at the first line of the file and skip.

We assign the separator found in the file to the class feature 'in_file_sep' and stop checking for more. Lines 157 - 159 we are making sure we have the separator used in the CSV file and then we remove any white spaces before and after each of the headers we have found. ([php.net/manual/en/function.trim.php trim() method]).

Then finally the function returns the headers we have found from the file.

The getDataRow() (lines 167-169) returns an array of data read from the file based on the separator we found above. ([php.net/manual/en/function.fgetcsv.php fgetcsv() function]). After we have done everything we close the file we opened.

ExcelDataFile Class

We also assume if the data file we are reading is not a CSV then it could be an Excel spreadsheet. This is accomplished using the PHPExcel library which must be installed before this is done lines 187-189.

What this does in the constructor is initialization of the file we are going to work on like creating the reader resource, setting the file to readonly as we process, loading the file and the active sheet from the worksheet, and setting the row iterator in this worksheet.

The other methods perform the same function like in the CSVDataFile class above, except we have added a helper method _readRow, which reads the data from a row into an array and later this is used by the getHeaders() and getDataRow() methods. [This is due to the limited versatility of the PHPExcel library to process stuff more easily). More about PHPExcel library can be read from [phpexcel.codeplex.com/ here].

Processor Class

This is the main class which does all the processing of the importing. Whenever you define a class to import some data, make sure you implement this class.

The convert() function (lines 263-267) is utility function to convert file file sizes into bytes, megabytes, gigabytes etc. depending on the size value we have.

The constructor of the class reads the file and determines whether it is a CSV or an Excel spreadsheet and initializes them accordingly lines 287-294 of the import base. Then it initializes the database user and the Form Factory instance, the bad file and test mode and creates the import logger. The import logger is a database table which keeps track of where the last import ended in case we didn't finish. The logger is somewhat for a specific file so that if we started processing a file and something happened in between before we finish we can proceed from where we ended.

The getCurrentRow() fetches one row at a time from the file and returns it for processing, the hasDataRow() makes sure we have rows in the file we are reading and returns true on success.

The run() method handles all the import processes giving the user choices whether to run the script in test or production mode giving options to skip some of the rows. It also checks to see if there are more than ten consecutive empty rows and asks the user whether to proceed or stop processing. It also checks to see, if the script is run in test mode (meaning no data is saved in the database) it echoes back Success on Test meaning if we were running it in production mode this data from this row would be imported successfully into the database.

The getStats() is one of the utility method to collect statistics for the import process. At the end of the import process, the script gives statistics for the success and failed imports.

The processRow() processes each row encountered in the file and returns true on success and false on failure. If a row is processed (lines 393-395) we return a success (true) and proceed. Lines 399-404 we check if processing of the row has been successful we mark that as being processed and set that in the import logger (markProcessed()) and return true, otherwise we return false: failure.

The setTestMode() is a helper function to help set the mode of the script whether to run in test or production mode.

The save() method saves the data into the database and echoes back the row number and the form where the data is being saved and the memory used so far. If the script is run in test mode then nothing gets saved into the database lines 429-431 otherwise it saves the data with the user who saved that data and asks whether to continue or stop.

The getHeaderMap() method does some mapping of the headers from the file we are importing data from and the internal headers that you will have defined for reference. 'Check its use here. The indices of the array are the internal references to the headers found in the file to be imported. And the mapHeaders() this now maps the internal references to the column headers in the file.

The initBadFile() method (lines 520-528) initializes a file where we keep record of the import process. Every row we fail to process gets saved in this file together with the reasons for failure by adding a column at the end of the import file we are working on.

The addBadRecord() method adds all the failures into the file which holds all the unsuccessful imports.

The createLogger() method creates the database table into which we keep track of all the rows that have been processed so that next time we run the same script on the same file we proceed from where we last ended.

The alreadProcessed() method confirms that the row we are now working on has been processed or not.

The markProcessed() method marks each row as processed into the logger table.

The getDate() function is a utility function to get the current data and format it, so that we can use it in the name of the bad records that we will create later for storing unsuccessful imports.