I am currently working on a team project in which we were tasked with developing a biology domain-driven application. It is a research project and part of the requirements for my current academic programme, so I can only talk about the generic parts: the technologies learnt and used. I was very excited about this project, especially the data storage and processing part, where we need to work with Neo4J. I will try to document my thoughts and experience here as much as possible as things unfold.
This article is just a quick overview of the topic; much more detail can be found in the book Learning Neo4J by Rik Van Bruggen, which I reference a lot. The book, as well as other interesting ones, can be obtained for free from the official Neo4J website.
What is Neo4J?
Neo4J is an open-source graph database system based on graph theory (Wikipedia). Generally, it is classified as a NoSQL database management system, even though it arguably belongs to a class of its own, as you will discover later in this write-up. The NoSQL family of database management systems can be further categorised as follows:
- Key-Value stores, e.g. Redis and DynamoDB.
- Column-Family stores, e.g. Google’s Bigtable.
- Document stores, e.g. Couchbase and MongoDB.
- Graph databases, e.g. Neo4J, OrientDB and FlockDB.
They have also been described as “task-oriented” database management systems: they focus on the tasks that must be performed to meet certain goals or performance standards rather than on generic use, which makes them more adaptable and suitable for developing domain-driven applications.
A graph structure means that we will be using vertices and edges (or nodes and relationships, as we prefer to call these elements) to store data in a persistent manner. As a consequence, the graph structure enables us to:
- Represent data in a much more natural way, without some of the distortions of the relational data model.
- Apply various types of graph algorithms on these structures.
Neo4J, in particular, is:
- Built for graphs, from the ground up.
- A transactional, ACID-compliant database.
- Made for Online Transaction Processing.
- Designed for scalability.
- Equipped with a declarative query language: Cypher.
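To make the vertices-and-edges idea concrete, here is a minimal Python sketch, not Neo4J code. It stores a tiny graph in plain dictionaries and runs one classic graph algorithm on it (breadth-first search for the shortest path). The node names and the `shortest_path` helper are made up for illustration.

```python
from collections import deque

# Hypothetical toy data: each key is a node, each value lists the nodes
# it has an outgoing relationship to.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return a path with the fewest relationships, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal is not reachable from start


if __name__ == "__main__":
    print(shortest_path(GRAPH, "alice", "erin"))
```

The relationships here are directional, so a path exists from "alice" to "erin" but not the other way round.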
Software engineering best practice emphasises using the best tool for each task, as no tool, be it a programming language, a development tool or a technology, is the best fit for everything. The same applies to graph databases: they are not a fit for every database scenario but, like other technologies, were developed to address specific needs.
Why NOT Relational Database Management Systems
My first encounter with a database management system, more than a decade ago, was with a relational one (MySQL), and I have used it for almost every application I have worked on since. I had another, shorter encounter with NoSQL (CouchDB) about two years ago, when an application I was working on was scaling faster than planned, required us to perform analytics on the data at a faster rate, and had expanding use cases that called for a flexible “schema”. Thanks to my then Technical Team Lead for suggesting and arguing in favour of NoSQL. It is also safe to conclude that one might never experience the limitations of relational database management systems until there is a need to scale.
In summary, the issues with relational database systems are:
- Relational database systems suffer at scale: as the data grows, query response times get worse.
- Relational databases are quite “anti-relational”: the design grows more complex as the domain becomes more varied, and with it the join operations become more complex, which hurts performance.
- Relational databases impose a schema before we put any data into the database, which does not suit the many domains where flexibility is a primary need.
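The schema-first point can be illustrated with a small sketch using Python's built-in sqlite3 module. The `person` table, the `nickname` column and the helper names are hypothetical and chosen only for this example. Other relational systems report the error differently, but the need to change the schema first is the same.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a fixed, hypothetical `person` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def insert_with_nickname(conn, name, nickname):
    """Try to store an extra field; return False if the schema lacks the column."""
    try:
        conn.execute(
            "INSERT INTO person (name, nickname) VALUES (?, ?)", (name, nickname)
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement because `nickname` is not a declared column.
        return False


if __name__ == "__main__":
    conn = make_db()
    print(insert_with_nickname(conn, "Grace", "Amazing Grace"))  # column missing
    conn.execute("ALTER TABLE person ADD COLUMN nickname TEXT")  # schema change
    print(insert_with_nickname(conn, "Grace", "Amazing Grace"))  # column exists now
```

The new field can only be stored after the table definition itself has been altered.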
PROPERTY GRAPH: The Data Model for Graph Database
“A data model refers to the logical inter-relationships and data flow between different data elements involved in the information world. It also documents the way data is stored and retrieved. Data models facilitate communication business and technical development by accurately representing the requirements of the information system and by designing the responses needed for those requirements. Data models help represent what data is required and what format is to be used for different business processes.” (Techopedia)
“A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world entities.” (Wikipedia)
The property graph model defines how we store and retrieve our data in a graph database.
Features of the Data Model:
- There is no fixed schema.
- Partly because of this schema-less nature, it is a very good fit for semi-structured data. If one node or relationship has more or fewer properties than another, we do not have to alter the design; we can handle that difference in structure automatically and work with both in exactly the same way.
- Nodes and node properties are quite easy to understand. A node is analogous to a record (row) in a relational table, and its properties to that record’s column values.
- Relationships are a bit different. They are directional: each one always has a start node and an end node, which may be the same node (a self-referencing relationship). They are explicit and can also have properties.
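The features above can be sketched in a few lines of Python. This is only an in-memory illustration of the property graph idea, not how Neo4J stores data internally. The `Node` and `Relationship` classes and the sample data are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    label: str
    # Free-form key/value pairs: no fixed schema is imposed on a node.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node  # relationships are directional: start -> end
    end: Node
    type: str
    # Relationships can carry properties too.
    properties: Dict[str, Any] = field(default_factory=dict)

    def is_self_referencing(self) -> bool:
        """True when the relationship starts and ends on the same node."""
        return self.start is self.end


# Two nodes with the same label but a different number of properties.
alice = Node("Person", {"name": "Alice", "age": 34})
bob = Node("Person", {"name": "Bob"})

knows = Relationship(alice, bob, "KNOWS", {"since": 2015})
mentors_self = Relationship(alice, alice, "MENTORS")

if __name__ == "__main__":
    print(knows.start.properties["name"], knows.type, knows.end.properties["name"])
    print(mentors_self.is_self_referencing())
```

Note that `bob` has no `age` property, and nothing in the design had to change to allow that.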
WHY and WHEN to use a Graph Database:
- When you are dealing with complex queries: These are questions (queries) that are inherently composed of a number of complex join-style operations. In a relational system this poses a problem that grows exponentially with every table join you add, especially with large datasets. A simple hypothetical example: finding all the restaurants in a certain London neighbourhood that serve Indian food, are open on Sundays, and cater for kids. In relational terms, this would mean joining data from the restaurant table, the food type table, the opening hours table, the caters-for table, and the zip-code table holding the London neighbourhoods before an answer can be provided.
- When you need to perform in-the-clickstream queries on live data: This is about query response time, and the index-free adjacency property of graph databases is one of the reasons they perform well here.
- When performing path-finding queries: These are queries where you want to find out how different data elements are related to each other, and they are extremely well suited to graph databases.
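As a rough illustration of the restaurant example, the sketch below answers the question by starting at the neighbourhood node and following relationships outwards, instead of joining five tables. It is plain Python over made-up data. The restaurant names, relationship types and the `find_restaurants` helper are all hypothetical, and a real Neo4J query would be written in Cypher.

```python
from collections import defaultdict

# Hypothetical data as (start node, relationship type, end node) triples.
RELATIONSHIPS = [
    ("Spice House", "LOCATED_IN", "Soho"),
    ("Spice House", "SERVES", "Indian"),
    ("Spice House", "OPEN_ON", "Sunday"),
    ("Spice House", "CATERS_FOR", "Kids"),
    ("Curry Corner", "LOCATED_IN", "Soho"),
    ("Curry Corner", "SERVES", "Indian"),
    ("Curry Corner", "OPEN_ON", "Saturday"),
    ("Pasta Place", "LOCATED_IN", "Soho"),
    ("Pasta Place", "SERVES", "Italian"),
    ("Pasta Place", "OPEN_ON", "Sunday"),
    ("Pasta Place", "CATERS_FOR", "Kids"),
    ("Tandoor Hut", "LOCATED_IN", "Camden"),
    ("Tandoor Hut", "SERVES", "Indian"),
    ("Tandoor Hut", "OPEN_ON", "Sunday"),
    ("Tandoor Hut", "CATERS_FOR", "Kids"),
]

# Each node keeps direct references to its own relationships, so a query
# only touches the neighbourhood of the nodes it visits.
outgoing = defaultdict(set)  # node -> {(relationship type, end node)}
incoming = defaultdict(set)  # node -> {(relationship type, start node)}
for start, rel_type, end in RELATIONSHIPS:
    outgoing[start].add((rel_type, end))
    incoming[end].add((rel_type, start))


def find_restaurants(neighbourhood, cuisine, day, audience):
    """Start at the neighbourhood node and keep restaurants matching every criterion."""
    wanted = {("SERVES", cuisine), ("OPEN_ON", day), ("CATERS_FOR", audience)}
    return sorted(
        start
        for rel_type, start in incoming[neighbourhood]
        if rel_type == "LOCATED_IN" and wanted <= outgoing[start]
    )


if __name__ == "__main__":
    print(find_restaurants("Soho", "Indian", "Sunday", "Kids"))
```

The work done depends on how many restaurants sit in the chosen neighbourhood, not on the total number of restaurants stored.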
When not to use a Graph Database
- Large, set-oriented queries: If you are trying to put together large lists of things (effectively sets) that do not require a lot of joining, or that require a lot of aggregation (summing, counting, averaging, and so on), then the graph database will compare less favourably with other database management systems. A graph database can certainly perform these operations, but its performance advantage will be smaller, or perhaps even negative. Set-oriented databases such as relational database management systems will most likely perform just as well or better.
- Graph global operations: Finding clusters of nodes, discovering unknown patterns of relationships between nodes, and determining the centrality and/or betweenness of specific graph components are extremely interesting concepts, but they are very different from the ones graph databases excel at. They look at the graph in its entirety, and we refer to them as graph global operations. While graph databases are extremely powerful at answering “graph local” questions, there is an entire category of graph tools (often referred to as graph processing engines or graph compute engines) that address graph global problems.
- Simple, aggregate-oriented queries: Simple queries, where write patterns and read patterns align with the aggregates we are trying to store, are typically served quite inefficiently by a graph and would be handled more efficiently by an aggregate-oriented Key-Value or Document store. If complexity is low, the advantage of using a graph database will be lower too.
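The difference between “graph local” and “graph global” questions can be sketched as follows. This is a toy Python example over a made-up undirected graph. Degree centrality is used here only as one simple example of a global measure, and the function names are assumptions for this sketch.

```python
# Hypothetical undirected graph: node -> set of adjacent nodes.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def neighbours(graph, node):
    """Graph-local question: only the relationships of one node are read."""
    return graph[node]


def degree_centrality(graph):
    """Graph-global question: every node in the graph must be visited."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    # Fraction of the other nodes that each node is directly connected to.
    return {node: len(adjacent) / (n - 1) for node, adjacent in graph.items()}


if __name__ == "__main__":
    print(neighbours(GRAPH, "d"))
    scores = degree_centrality(GRAPH)
    print(max(scores, key=scores.get))
```

The local question stays cheap however large the graph becomes, while the global one grows with the size of the whole graph, which is why dedicated graph compute engines exist for that kind of work.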