Butt Biter: Fix State Abbreviations && Large Data Set Updates

Ran into a little butt biter today… I imported a few thousand addresses that had the state in long form, like ‘Arizona’. But I commonly used the us_states plugin for popup selection of states. Well, us_states will be looking for state abbreviations, like ‘AZ’ when it selects it’s selected value. So what’s a girl (or dude) to do? You gotta convert them states!

But there are a few concerns:

  • you need to iterate over a large set of records, maybe an entire table
  • there may be memory issues with loading such a large set

The following rake task demonstrates a technique that is more memory efficient than loading the entire model/attributes (table/fields) into memory. Only load the ‘id’ and ‘state’ attributes when you need to find your initial record set. The ‘id’ will be used to find the record for updating, and the state is need to find the new abbreviation.

[sourcecode language=’ruby’]
#only load id and state attributes to save memory
Organization.find( :all, :select => ‘id, state’ )

#load the entire attribute list, uses more memory
Organization.find(:all)
[/sourcecode]

Organization.find(:all) just sucked up a ton of memory on my machine and slowed things down a bit. But if you only have a few hundred or a few thousand records in your table, you may not see much of a difference

There are some cons though. For every record that needs updating, there will be a query in addition to the update. That’s because the entire record’s attributes need to be loaded for update (at least mine did), so a find using the id along with the update is necessary. This technique really shines when you only need to change a minority of records.

You may not need to load additional attributes if no other attributes are called or modified in your model. In my case, I geocode each address, so the Organization model will need other fields during it’s execution both to read and to update. Though I could turn off the geocoding temporarily for this rake task and thus not need to load the rest of the attributes.

Save this in your /lib/tasks directory:
[sourcecode language=’ruby’]
namespace :db do
namespace :orgs do

desc “Convert states to two-letter abbr.”
task :convert_states => :environment do
US_STATES = [[“Alaska”, “AK”],[“Alabama”, “AL”],[“Arkansas”, “AR”],[“Arizona”, “AZ”],[“California”, “CA”],[“Colorado”, “CO”],[“Connecticut”, “CT”],[“District of Columbia”, “DC”],[“Delaware”, “DE”],[“Florida”, “FL”],[“Georgia”, “GA”],[“Hawaii”, “HI”],[“Iowa”, “IA”],[“Idaho”, “ID”],[“Illinois”, “IL”],[“Indiana”, “IN”],[“Kansas”, “KS”],[“Kentucky”, “KY”],[“Louisiana”, “LA”],[“Massachusetts”, “MA”],[“Maryland”, “MD”],[“Maine”, “ME”],[“Michigan”, “MI”],[“Minnesota”, “MN”],[“Missouri”, “MO”],[“Mississippi”, “MS”],[“Montana”, “MT”],[“North Carolina”, “NC”],[“North Dakota”, “ND”],[“Nebraska”, “NE”],[“New Hampshire”, “NH”],[“New Jersey”, “NJ”],[“New Mexico”, “NM”],[“Nevada”, “NV”],[“New York”, “NY”],[“Ohio”, “OH”],[“Oklahoma”, “OK”],[“Oregon”, “OR”],[“Pennsylvania”, “PA”],[“Rhode Island”, “RI”],[“South Carolina”, “SC”],[“South Dakota”, “SD”],[“Tennessee”, “TN”],[“Texas”, “TX”],[“Utah”, “UT”],[“Virginia”, “VA”],[“Vermont”, “VT”],[“Washington”, “WA”],[“Wisconsin”, “WI”],[“West Virginia”, “WV”],[“Wyoming”, “WY”],[“APO-Military”, “APO”],[“FPO-Military”, “FPO”]]
org_count = Organization.count
cnt = 0
Organization.transaction do
Organization.find(:all, :select => ‘id, state’).each do |org|
puts “#{cnt += 1} of #{org_count}: #{org.state}”
e = US_STATES.assoc(org.state)
Organization.find(org).update_attribute_with_validation_skipping( :state, e.last ) unless e.nil?
end
end
end

end
end
[/sourcecode]

Briefly, it iterates through the entire Organization table using only the ‘id’ and ‘state’ attributes. Next, it looks through the US_STATES array for a match of the state’s long name. If a match is found, then it loads the entire record, and updates the attribute without validations (your call here, but since you are just changing the state we can skip validations pretty safely). Everything is wrapped in a transaction so if there are any problems along the way, all updates are ROLLBACKed.

I usually add some sort of record-by-record output to the screen (via ‘puts’) so there is a visual indication of the progress. I used a counter, ala record number, with the ‘state’ value to give me some feedback.

You could also silence the logger as this task may be generating a lot of log entries. My log soared to over 30Mb. But remember, if you silence the log there will be no error logging either, but you should see any errors on the rake console. Note the double pipes inside the block with nothing inside.
[sourcecode language=’ruby’]
# wrap this inside the transaction
Organization.silence { ||
# my beautiful code
}
[/sourcecode]

Posted in Rails. Tagged with , , .

4 Responses