Technology Solutions for Everyday Folks

Reconciling Disparate Datasets

A few weeks ago I was asked to help update a centralized inventory dataset, which to us is definitely a "tertiary" system. Several hundred device records were associated with our part of the organization, and every time I opened the list I quickly "noped" out of doing anything with it. But the time came when I needed to actually do something with it lest I wind up on someone's naughty list.

Since we have two systems that act as passive systems, along with a primary inventory system, I actually have a lot of data at the ready for a project like this. Better yet, those aforementioned systems have APIs I already use for other purposes.

The Task: Update External Dataset with Matching Localized Data

The easiest way to deal with this is (at least for me to think about/through it) was to pull our little corner of the world out into a separate CSV for manipulation and testing.

I also have a "data baby" that's auto-generated weekly for a different purpose which mashes together the passive and primary inventory systems for active and recently-known devices. I grab an export of that to CSV as well.

One thing I didn't have at the ready was a list of the stuff we've "deleted" or retired/recycled from our inventory. This data, however, is easily obtained from our primary inventory system with a modified API query, and then stuffed aside as its own CSV.

I now have three distinct datasets to work with/from:

  1. "External" records we need to update;
  2. "Active" records from our fleet; and
  3. "Deleted" or retired records from our fleet.

The Process (Part One): Search and Extract

It's pretty easy to pull these CSVs into matching associative arrays in PHP, where I use the serial number (or whatever primary identifier) as the key for each row in the disparate datasets, like so:

$externalDataCSV = '/path/to/external.csv';
$activeDataCSV = '/path/to/active.csv';
$deletedDataCSV = '/path/to/deleted.csv';

$externalData = array();
$activeData = array();
$deletedData = array();

if (($handle = fopen($externalDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $externalData[$inputData[7]] = $inputData;
  }
  fclose($handle);
}

if (($handle = fopen($activeDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $activeData[$inputData[24]] = $inputData;
  }
  fclose($handle);
}

if (($handle = fopen($deletedDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $deletedData[$inputData[0]] = $inputData;
  }
  fclose($handle);
}

One has to ensure is that the key value for each dataset be set to the $inputData column containing the primary identifier. For me this was column 8 in the external data, column 25 in the internal/active data, and column 1 in the deleted data. PHP arrays have zero-based indexes, so adjust accordingly.

From there I'm able to easily 'walk' the external system data to look for any matches in the active or deleted data, along with identifying anything that isn't recorded anywhere (the mismatches). For matches, I am preserving both datasets by merging them together, where the internal data is first, followed by two blank columns for separation, and finally the external data:

$matchedRecords = array();
$deletedRecords = array();
$unknownRecords = array();

foreach ($externalData as $deviceSerial => $deviceData) {
  if (array_key_exists($deviceSerial, $activeData)) {
    $matchedRecords[$deviceSerial] = array_merge($activeData[$deviceSerial], array(' ',' '), $deviceData);
  } else if (array_key_exists($deviceSerial, $deletedData)) {
    $deletedRecords[$deviceSerial] = array_merge($deletedData[$deviceSerial], array(' ',' '), $deviceData);
  } else {
    $unknownRecords[$deviceSerial] = $deviceData;
  }
}

Then all that's left is to write all the data out to a CSV where it can be looked at (by a human) in Excel or some similar too:

$fp = fopen('/path/to/Inventory-Matched.csv', 'w');
foreach ($matchedRecords as $row) {
  fputcsv($fp, $row);
}
fclose($fp);

$fp = fopen('/path/to/Inventory-Deleted.csv', 'w');
foreach ($deletedRecords as $row) {
  fputcsv($fp, $row);
}
fclose($fp);

$fp = fopen('/path/to/Inventory-Mismatched.csv', 'w');
foreach ($unknownRecords as $row) {
  fputcsv($fp, $row);
}
fclose($fp);

Now I have three nice CSVs that can be analyzed by myself and others on my team, based on the nature of the match.

The Process (Part Two): Some Human Analysis

In this case, about 80% of the records in scope to update were retired and showed up in the Inventory-Deleted.csv data, where I could literally fill in the required field once and do a "fill down" to record the edits. About 12% of the records in scope were in the Inventory-Matched.csv, where in most cases the edits were a matter of just reassigning the proper user or department...or similar "only a human knows what to insert here" adjustment.

That leaves the ~8% in the Inventory-Mismatched.csv. These are records that didn't match anything in our internal systems and need some more deliberate examination. This number was well under 50, and I could tell by some of the device model numbers that these mismatches were devices that existed before our current inventory system was implemented (and for various reasons never added to it). However, there were under two dozen devices that are "current" enough to be still in use, but aren't cropping up elsewhere. It's speculated that most are likely sitting on a shelf or in storage (alternatively, in a little-used space due to the pandemic) and haven't been placed back into service. Folks on my team are looking around for them as time allows, and in relatively short turnaround identified about half of these before I submitted any updates.

The Process (Part Three): Merge the Updated Info

With all but about 4% of records now having updated information (across three CSVs), it was time to pull it all back together. Since the datasets may have been sorted or had other row manipulations since it was originally sourced, I decided to script a "re-match" or merge back together into one dataset. This would allow me to actually copy and paste the four "updated" columns from my CSV into the master dataset provided for our consideration.

The process to do this is almost identical to how I split it all apart, except it's sort of reversed (loading the adjusted CSVs, a refreshed copy of the original external data, and walking the external data to add the data/field updates for matching records):

$externalDataCSV = '/path/to/external.csv';
$activeDataCSV = '/path/to/Inventory-Matched.csv';
$deletedDataCSV = '/path/to/Inventory-Deleted.csv';
$mismatchedDataCSV = '/path/to/Inventory-Mismatched.csv';

$reconciledData = array();

if (($handle = fopen($activeDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $reconciledData[$inputData[0]] = $inputData;
  }
  fclose($handle);
}

if (($handle = fopen($deletedDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $reconciledData[$inputData[0]] = $inputData;
  }
  fclose($handle);
}

if (($handle = fopen($mismatchedDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $reconciledData[$inputData[0]] = $inputData;
  }
  fclose($handle);
}

$externalData = array();

if (($handle = fopen($tcDataCSV, "r")) !== FALSE) {
  while (($inputData = fgetcsv($handle, 0, ",")) !== FALSE) {
    $externalData[$inputData[7]] = $inputData;
  }
  fclose($handle);
}

$updatedRecords = array();

foreach ($externalDataas $deviceSerial => $deviceData) {
  if (array_key_exists($deviceSerial, $reconciledData)) {
    $updatedRecords[$deviceSerial] = array(
      $reconciledData[$deviceSerial][18],
      $reconciledData[$deviceSerial][19],
      $reconciledData[$deviceSerial][20],
      $reconciledData[$deviceSerial][21],
      $deviceData[4],
      $deviceData[5],
      $deviceData[6],
      $deviceData[7],
      $deviceData[8],
      $deviceData[9],
      $deviceData[10],
      $deviceData[11],
      $deviceData[12],
      $deviceData[13],
      $deviceData[14],
      $deviceData[15]
    );
  } else {
    $updatedRecords[$deviceSerial] = $deviceData;
  }
}

All that's left is to write the $updatedRecords to CSV and we're good to go.

And then, quite literally from Excel I copied/pasted the rows back to the origin dataset. The last bit of code above only updated/modified the first four columns of any given row (with the information that we had either collectively or en masse updated based on the analysis). Simple. Effective. Did not require anyone to copy/paste/manually manipulate individual records in the origin data.

But...That Seems a Lot of Effort?

I pulled this whole thing together in under an hour. It would have taken an order of magnitude longer had we assigned someone to go look up all these records manually in our primary inventory system.

It was an effort that I'd not necessarily planned to undertake, but at the same time I have a separate inventory reconciliation project on the horizon and much of what I wrote above will be rolled into a more proper "tool" that we can use to keep our own stuff in sync. In a way, this effort was really just a first go at that project, and I know what I want to behave differently when that time comes.

To that end, the goal wasn't necessarily a tool...but having the arbitrary deadline of this request has provided me with the stub of what will become a tool in our arsenal.

Hopefully you draw some inspiration from something relatively simple like this. It saved us a boatload of time!