Ruby DeRailed: Emergent Data Design

In his talk about Redis at Scottish Ruby Conf a few years ago, Ryan Briones covered the importance of deferring the design of data structures of using, in his case, Redis as a data store. This resonated with me strongly. In my code I always look to defer design by using the heuristic of building the simplest thing that could possibly work. I then let the design emerge from my refactoring as the real use cases present themselves.

This transformation from forcing big design decisions up front to allowing an emergent design also fits with data. Each iteration of adding software features, and the data that goes with these features, provides me with more information on what the real use cases are. This process of learning is more accurate than trying to guess the use cases. Guessing cannot account for all the different pressures on an evolving system. In other words, guessing wastes time and effort by developing use cases which are not valid. Guessing is a gamble I have already lost enough.

An example of deferring data structure decisions is starting an application with a key-value data store. As the application matures, it creates data use cases; it becomes obvious what data is used where. Once these use cases are stable, it is easier to let the data design emerge into a structure based on how the data is used.

There are 2 ways to take a project to the point of emergent data design: local mapping and abstract persistence.

Local Mapping

This technique maps the data into each package. Say we are writing the payroll package for an HR application. The first thing I want to do is create a high level use case object:

module Payroll
		class PayEmployees
				def pay(employees, tax_service, pension_service, bank_information)
						data = PayEmployeesDataMapper.new(employees, tax_service, pension_service, bank_information)
						# algorithm to pay
				end
		end
end

This class defines a boundary object which takes data from three different sources: taxservice, pensionservice and bank_information. Rather than passing the data from the persistence layer in the package, the first task is to structure the data for this package. We need to create abstractions for the employee and bank, and then map the data for each of these abstractions:

module Payroll
		class PayEmployeeDataMapper
				def map(employees, tax_service, pension_service, bank)
						data = {}
						data[:employees] = map_employees(employees, tax_service, pension_service)
						data[:bank] = map_bank(bank)
						data
				end

				private

				def map_employees(employees, tax_service, pension_service)
						employees.map do |employee|
								{
										ssn: employee.ssn,
										fica: tax_service.fica_for(employee.number),
										ira: pension_service.pension_for(employee.number)
								}
						end
				end

				def map_bank(bank)
						{
								account_number: bank.payroll_account_number,
								routing_number: bank.payroll_account_routing_number
						}
				end
		end
end

Locally mapping the data allows an emergent design between packages. Before the use cases of an application become solid, and to some degree afterwards, there will be different tensions causing the structure of the data to change. Those tensions can have far reaching effects between packages. There will certainly be data from the data source that the package doesn’t need and would only clutter the package.

However, by using a map, all that changes is the mapping code in a package. Changing it may expose how the structure coming from the data store is similar to another use case or is now splitting away from a shared use case. We can handle both of these tensions without having to understand the behavior of a specific package.

This method also creates a local schema for the package that is independent of the global data source schema. When another developer has to change the payroll application, they don’t need to understand the entire HR system. They can find in the map a schema and structure of the data as it applies to what needs to change. It is easier to create a localized mental model in a large system by looking at the mapping.

This method uses System 1 thinking for coding as described here.

Abstract Persistence

Rails’s Object Record Model (ORM), Active Record, is one of the key reasons Rails has been so popular and successful. The basis of ORM is the premise that the data model can be decoupled completely from the implementation of the database. With a Ruby DSL built on top of the database that generates all the SQL, one can theoretically swap out an SQL Server database for a MySQL database.

One of the problems with using an ORM like Active Record is that it couples the design closely to a psuedo-db implementation. Active Record is such a leaky abstraction its DSL is modeled in no small part from SQL.

With an abstract persistence it is easier to change the persistence. While I have never actually switched an application from one SQL database to another, I have had to switch between a SQL database to an elastic search. There are different technologies that are appropriate for storing or accessing data in different ways.

The pattern for data abstraction I prefer is the repository pattern, as described here. Each package has its specific view of the persistence layer it cares about. This way one can implement multiple data stores, e.g. caching, in memory, key/value, SQL db, or elasticsearch. Each type of data store has its own advantages for different use cases.

Continuing the example above, when the payroll package needs to scale a specific data set from a SQL query to elasticsearch, we change the repository that belongs to payroll, not the data store itself.

module Payroll
		class EmployeeRepository
				def initialize(data_store)
						@data_store = data_store
				end

				def find_by_number(number)
						build_model @data_store.employee.find_by_id(number)
				end

				def find_all_by_pay_period(date)
						build_models @data_store.employees.find_all_by_pay_period(date)
				end

				private

				attr_reader :data_store

				def build_model(row)
						Employee.new(row) if row
				end

				def build_models(rows)
						rows.map { |row| build_model(row) }
				end
		end
end

Here the data store needs to implement two methods specific to our package: findbyid and findallbypayperiod. This design also follows the interface segregation principle by isolating a read only package from the common CRUD options in an Active Record model.

Using ideas like emergent data design feel at first like fighting against Rails. Since Rails has very strong opinions it can seem like a losing battle. However, the results are significant when you can use what Rails does well while still making good architectural decisions. It does mean I have to know a lot about the internals of what is going on in any persistence layer. I need to know the query language well. I need to know data structures well. I need to be able to listen to tensions and dependencies to decouple packages correctly.

I’ve found the benefit of keeping small contexts through mapping, letting the use cases drive the structure, and keeping boundaries between the use case of your data and how you store it worth the cost. For more on the Ruby DeRailed series, check out Ruby DeRailed: Fast Tests.