Building a Graph in SQL Land

Anna Rankin - Oct 8 '18 - Dev Community

In the spirit of sharing failures and learning from them, I'd like to tell you a story about a time a younger me helped create a "clever" system to organize a content hierarchy. This cautionary tale is mostly for fun, but also to document some of the neat things and painful lessons I learned along the way. Names and entities have been changed to protect the innocent 😜

But ...why?

Long, long ago in a product team far away, there existed a project to create a new, more flexible hierarchy for collections of books. Originally, our data was set up like so:

original data setup - collections, genres, and books

A user had access to a collection of books ("Fiction" in this case); a collection contained several genres, which in turn contained individual books. A user could be granted access to see everything within a collection, and could create "reading lists" so they could favorite/categorize books based on their own criteria.

This worked well - a user could see all the fiction books in the collection, and they could even add these books to a personalized reading list. Later on, however, our product team started testing out a new collection:

same diagram, but with a new "nonfiction" collection in addition to the first

We didn't initially want to allow all users to access the new "Nonfiction" collection, so we granted access to that separately.

We had a few issues - mainly that a user could only access and create reading lists from one collection at a time, and they had to switch between them manually in the UI. This worked well when we only had one collection, but lacked flexibility when we started adding more. Additionally, if a book appeared in more than one genre, its entry would need to be duplicated - our data structure didn't expect a book to have more than one genre 😬 Users also couldn't add a book to their reading lists if it didn't yet belong to any of our collections. If they wanted a one-stop shop to track what they were reading, they'd be out of luck.

The idea was that as we added more collections, we needed a better way to categorize our content and avoid duplication. Folks from across the organization worked together to create a new content structure:

proposed content hierarchy

Rather than segmenting our content into "collections," we'd house everything in a mega-collection - called the "library" - and allow books to appear in multiple genres. We hoped this would make our content more discoverable in searches, less duplicative, and would allow users to have reading lists with both fiction AND non-fiction titles on them.

Technologies

We were already working with some tried-and-true technologies in this application:

Now, our job as engineers was to implement the product vision and provide the tools our coworkers needed to interact with this new model. We had a choice: Do we create a series of discrete tables linked to each other by standard relations, or do we try something new?

(spoiler alert - we tried something new.)


Going Graph

It didn't take us long to decide that if we were going to create this new "world view," we wanted to play with some cool new technology along the way - and to be fair, it did seem like a graph model would serve the current use case well. We drew out some concepts of what we expected our entities and relations to look like.

image of graph containing library, reading list, genres and books a user can access

Users would be able to add their own books to their reading lists if they so chose - but that didn't mean they belonged in the library's collection.

That done, we discussed implementation - how would we bring this into our application? How would this fit into our current codebase? After some research and exploration, it was decided that we'd avoid bringing in a true graph database (like Neo4j, OrientDB, etc.). Cost was potentially an issue, and we wanted to avoid forcing our developers to learn new concepts and a new query language until we were sure this model was something that we really wanted to stick with.

That having been decided, we drew up our approach to storing our graph entities and relations ("nodes" and "edges") in our relational database:

table setup
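
The table diagram isn't reproduced here, so the exact columns are unknown. As a rough sketch only, assuming one table of nodes (`id`, `type`, `name`) and one table of edges (`parent_id`, `child_id`), the idea can be modeled in plain Ruby. Every name below is hypothetical, and the arrays stand in for database tables:

```ruby
# Hypothetical stand-ins for two relational tables: one row per node,
# one row per edge. The column names are assumptions, not the original schema.
Node = Struct.new(:id, :type, :name)
Edge = Struct.new(:parent_id, :child_id)

NODES = [
  Node.new(1, 'Library', 'Library'),
  Node.new(2, 'Genre', 'Fantasy'),
  Node.new(3, 'Genre', 'Adventure'),
  Node.new(10, 'Book', 'Book 1'),
  Node.new(20, 'Book', 'Book 2')
].freeze

EDGES = [
  Edge.new(1, 2),   # Library -> Fantasy
  Edge.new(1, 3),   # Library -> Adventure
  Edge.new(2, 10),  # Fantasy -> Book 1
  Edge.new(2, 20),  # Fantasy -> Book 2
  Edge.new(3, 10)   # Adventure -> Book 1 (second genre, no duplicate book row)
].freeze

# Roughly: select the child nodes of a node by joining through the edges.
def children_of(node_id)
  child_ids = EDGES.select { |edge| edge.parent_id == node_id }.map(&:child_id)
  NODES.select { |node| child_ids.include?(node.id) }
end

# Roughly: select the parent nodes of a node by joining through the edges.
def parents_of(node_id)
  parent_ids = EDGES.select { |edge| edge.child_id == node_id }.map(&:parent_id)
  NODES.select { |node| parent_ids.include?(node.id) }
end

puts children_of(2).map(&:name).inspect
puts parents_of(10).map(&:name).inspect
```

Because a book is linked to its genres only through edge rows, putting "Book 1" in a second genre means adding one edge rather than duplicating the book.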

The Good

Once we got our new hierarchy set up, rendering nested relationships became simple. We could easily convert a higher-level model (like a genre) into a tree-like JSON representation (in the real world, we had several additional levels to deal with - think category, subcategory, etc. - so this was pretty helpful).

library.to_json
# => 
{
  id: 1,
  name: 'Library',
  genres: [{
    id: 1,
    name: 'Fantasy',
    books: [{
      name: 'Book 1',
      id: 10
    },{
      name: 'Book 2',
      id: 20
    }]
  }]
}
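
How that conversion might work can be sketched with a short recursive function. This is not the original implementation: the `NODE_ROWS` and `CHILD_IDS` hashes are hypothetical in-memory stand-ins for the node and edge tables, and grouping children under a pluralized type key is an assumption made to match the shape of the output above. Since all nodes share one id space in this sketch, the genre's id is 2 rather than 1.

```ruby
require 'json'

# Hypothetical in-memory rows; the real data lived in database tables.
NODE_ROWS = {
  1  => { id: 1,  type: 'library', name: 'Library' },
  2  => { id: 2,  type: 'genre',   name: 'Fantasy' },
  10 => { id: 10, type: 'book',    name: 'Book 1' },
  20 => { id: 20, type: 'book',    name: 'Book 2' }
}.freeze

# parent id => child ids (assumed to contain no cycles)
CHILD_IDS = { 1 => [2], 2 => [10, 20] }.freeze

# Recursively turn a node and everything beneath it into a nested hash.
# Children are grouped under a pluralized type key ("genres", "books").
def serialize(node_id)
  node = NODE_ROWS.fetch(node_id)
  tree = { id: node[:id], name: node[:name] }
  CHILD_IDS.fetch(node_id, []).each do |child_id|
    key = "#{NODE_ROWS.fetch(child_id)[:type]}s".to_sym
    (tree[key] ||= []) << serialize(child_id)
  end
  tree
end

puts JSON.pretty_generate(serialize(1))
```

The same function works for any number of levels, which is what makes this approach attractive for deeper hierarchies.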

Interacting with the data also became pretty simple - our nodes all had the same attributes, so we used pretty generalized views and services to display, create, and edit everything from the library and reading lists to genres and books.

<h1>Editing <%= @node.name %></h1>
<%= render 'node-edit-form', node: @node %>

Another bonus was getting to understand graph structures pretty well. We learned how to traverse a graph, how to filter a graph, how to check for cyclical references - and more.
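
As one illustration of the cyclical-reference check, here is a minimal depth-first search sketch. It assumes the edges have already been loaded into a hash of node id to child ids. The method names are hypothetical, and this is a textbook technique rather than the code we actually shipped:

```ruby
require 'set'

# edges: hash of node id => array of child ids (a hypothetical adjacency list).
def cyclic?(edges)
  on_path = Set.new   # nodes on the current depth-first path
  finished = Set.new  # nodes whose descendants have all been checked
  edges.keys.any? { |node| cycle_from?(node, edges, on_path, finished) }
end

def cycle_from?(node, edges, on_path, finished)
  return false if finished.include?(node)
  return true if on_path.include?(node) # we walked back into our own path

  on_path << node
  found = edges.fetch(node, []).any? do |child|
    cycle_from?(child, edges, on_path, finished)
  end
  on_path.delete(node)
  finished << node
  found
end

shared = { 1 => [2, 3], 2 => [10, 20], 3 => [10] }
loopy  = { 1 => [2], 2 => [10], 10 => [1] }

puts cyclic?(shared) # false: a book under two genres is shared, not cyclic
puts cyclic?(loopy)  # true: 1 -> 2 -> 10 -> 1
```

Tracking the current path separately from the finished nodes is what lets a book with two parent genres pass the check while a genuine loop fails it.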

The Bad

Everything worked well in the beginning - when everything looked the same. Before too long, though, we found that we needed additional information in the payloads we sent down to the front end.

# Our data model started to become more complex...
{
  id: 1,
  name: 'Library',
  genres: [{
    id: 1,
    name: 'Fantasy',
    book_count: 2,
    books: [{
      name: 'Book 1',
      available_for_checkout: true,
      author: 'Bob Bobberson',
      id: 10
    },{
      name: 'Book 2',
      available_for_checkout: false,
      author: 'Frances Farina',
      id: 20
    }]
  }]
}

We started mixing behavior and additional attributes in with the structure of the library hierarchy rather than allowing the graph structure to do what it was good at - defining the relationships of entities to one another. We started shoehorning in details about individual nodes, complicating the logic required to render our over-generalized views. These views quickly became complicated and full of switches on node type.

<h1>Editing <%= @node.name %></h1>

<% if @node.type == 'Genre' %>
  <p>Book count: <%= @node.book_count %></p>
  <%= render 'node-edit-form', node: @node %>
<% elsif @node.type == 'Book' %>
  <%= render 'book-node-edit-form', node: @node %>
<% else %>
  <%= render 'node-edit-form', node: @node %>
<% end %>

After a while, it became clear that our application was a graph database that we'd smashed into the shape of a bookstore, instead of a bookstore that utilized a graph database to store data. Our codebase was littered with overly generic, meaningless methods and controllers that were difficult at best to understand.

# Instead of this...
def save_book(book_data)
  # Clearly saving a book to the DB
  save_to_graph(book_data) 
end

# we ended up with this:
def save_node(node_data)
  # What are we saving here??
  Nodes.save(node_data)
end

The Awful

As the library grew in size, querying became downright awful. Queries ballooned in complexity; complicated preloading was required in order to avoid making hundreds of queries. Many of our services relied on recursion to generate serialized JSON or aggregate data, which added mental overhead when trying to figure out what your bug was, what was causing it, and where your extra queries were being made. When new devs joined our team, it was harder for them to ramp up on what we were doing, which caused frustration and wasted a lot of time. We used tools like recursive SQL views and terrifying "octo-UI" (see below) admin tools that caused more grief than joy.

image of a force-directed graph

This graph is similar to a stopgap admin interface we implemented to allow admins to interact with the graph - screenshot taken from the D3 example gallery
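
To make the query blow-up described above concrete, here is a small sketch. `EdgeTable` is a hypothetical in-memory stand-in for the edges table that simply counts how many "queries" it serves. It is not ActiveRecord, and loading every edge up front is only one possible form of preloading:

```ruby
# Hypothetical stand-in for the edges table; counts the queries it serves.
class EdgeTable
  attr_reader :query_count

  def initialize(rows)
    @rows = rows # array of [parent_id, child_id] pairs
    @query_count = 0
  end

  # Roughly: SELECT child_id FROM edges WHERE parent_id = ?
  def children_of(parent_id)
    @query_count += 1
    @rows.select { |parent, _child| parent == parent_id }.map(&:last)
  end

  # Roughly: SELECT parent_id, child_id FROM edges
  def all_grouped_by_parent
    @query_count += 1
    @rows.group_by(&:first).transform_values { |pairs| pairs.map(&:last) }
  end
end

# Naive recursion: one query for every node visited.
# (Counts paths, so the sample data below is kept tree-shaped.)
def count_descendants_naive(table, node_id)
  table.children_of(node_id).sum do |child|
    1 + count_descendants_naive(table, child)
  end
end

# Preloaded: one query up front, then recursion over an in-memory hash.
def count_descendants_preloaded(table, node_id)
  children = table.all_grouped_by_parent
  walk = lambda do |id|
    children.fetch(id, []).sum { |child| 1 + walk.call(child) }
  end
  walk.call(node_id)
end

rows = [[1, 2], [1, 3], [2, 10], [2, 20], [3, 30]]

naive = EdgeTable.new(rows)
count_descendants_naive(naive, 1)
puts naive.query_count     # 6 (one per node)

preloaded = EdgeTable.new(rows)
count_descendants_preloaded(preloaded, 1)
puts preloaded.query_count # 1
```

With six nodes the difference looks harmless. With a library of thousands of nodes, the naive version issues thousands of queries, and every recursive service needs its own preloading workaround.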

The last straw(s)

All told, we worked with this pseudo-graph structure for a little under a year before we gave up the ghost and started ripping it out. Some notable reasons:

  • As mentioned above, writing services and views for a general idea of a "node" did not work well once our data model evolved
  • This idea of graph-like structures stored in a relational database was certainly not supported by ActiveRecord - this resulted in inefficient and confusing queries
  • Super importantly, we (as developers) didn’t create tooling that would have made these concepts easy to work with, for other developers and for the end users of our product.

Lessons learned

While I wouldn't do this again, I certainly learned a few things along the way.

  1. Recursive views are cool 🤓
  2. Never put your personal curiosity ahead of someone else's livelihood.
  3. Use the right tool for the job!

The next time you find yourself looking at a shiny new technology and feel the desire to use it in a product that's critical to your company's day-to-day business, I encourage you to consider your choices carefully.

Do I even need INSERT SHINY THING HERE? ...do I really?

Most often, it's in your best interest to choose "boring technology" (Choose Boring Technology, Dan McKinley) over "the new hotness." If you consistently opt for exciting, innovative technologies over longer-lived and widely-understood systems, you're going to increase the cost/headache related to onboarding new engineers and maintaining your entire system.

I would argue that this also applies to the patterns we use to build our software. When we hid the implementation details of our graph structure poorly, we introduced code that felt unintuitive and confused new developers. If we had done a better job isolating that code, it would have been much easier to swap out our "homegrown" version for the real thing later on.
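
As a sketch of what better isolation could have looked like: the rest of the codebase talks to a domain-named class, and only that class knows a generic node store exists. Both `InMemoryGraphStore` and `BookCatalog` are hypothetical names invented for this example, not classes from the original application:

```ruby
# Hypothetical generic store; stands in for the homegrown node/edge tables.
class InMemoryGraphStore
  def initialize
    @nodes = {}
    @next_id = 1
  end

  # Stores a node and returns the saved record, including its new id.
  def save_node(type:, attributes:)
    id = @next_id
    @next_id += 1
    @nodes[id] = { id: id, type: type }.merge(attributes)
  end

  def nodes_of_type(type)
    @nodes.values.select { |node| node[:type] == type }
  end
end

# Domain-facing wrapper: callers talk about books, never about nodes.
# Swapping in a real graph database would mean passing a different store
# object that responds to the same two methods.
class BookCatalog
  def initialize(store)
    @store = store
  end

  def save_book(name:, author:)
    @store.save_node(type: 'Book', attributes: { name: name, author: author })
  end

  def books
    @store.nodes_of_type('Book')
  end
end

catalog = BookCatalog.new(InMemoryGraphStore.new)
catalog.save_book(name: 'Book 1', author: 'Bob Bobberson')
puts catalog.books.map { |book| book[:name] }.inspect
```

Controllers and views that only ever call `save_book` and `books` stay readable, and the generic node vocabulary stays confined to one small class.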

The article linked above makes a great case for limiting the number of technical solutions you use as a team. Instead of spending your energy troubleshooting and maintaining a homegrown solution, you can choose to focus on solving business problems - actual people's problems. Add too many unfamiliar technologies, and you run the risk of having one or two "experts" (masters of the arcane) on how your application works rather than a straightforward codebase that most folks can get up to speed on in a reasonable amount of time.

If the honest answer to "do I need this?" is "yes, this is by far the best solution for my problem," ensure you have the time, support, team bandwidth, and expertise to do it well. If these conditions are met, then you need to do your research - and please, use the right tools for the job 🙇‍♀️
