This post is part of a series of learning topics presented by technology experts. Today's post comes courtesy of Michael Sirinotis, Data Scientist, Senior Project Manager at Powerwrap Limited and architect of the Powerwrap Operational Data Store, who answers some questions about data architecture. This content comes from a recent Q&A session Michael delivered in our workplace.
Thanks so much, Michael, for this Q&A session. First off, what is “data architecture”?
Data architecture is the design of a system to manage data throughout its life cycle: how the components interact and how the system fulfils the business’s needs.
What is the difference between data architecture and data infrastructure?
Data architecture is the overall setup and design of your system – e.g., choosing which components are used for what function, and how they fit together. Infrastructure is the actual physical implementation of those components.
As an example, in AWS the networking, virtual machines and the database would describe its infrastructure. However, how those components communicate, share data via APIs and so on is the architecture. In practice, the terms “architecture” and “infrastructure” often get conflated.
OK, that makes sense. Now for the next couple of questions I’m going to go specific.
What exactly are ‘containers’?
Containers package up a lightweight operating-system layer together with an application and its dependencies, and run as a single portable executable.
That means they are stand-alone, can be shared and run on any platform, and solve the problem of an application behaving differently across isolated environments. They are often used in a micro-services architecture. A step further towards separating business logic from the underlying hardware is represented by services like Amazon’s Lambda, which let you run a snippet of code at a time on managed infrastructure as a ‘Function as a Service’.
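To make the ‘Function as a Service’ idea concrete, here is a minimal sketch in the shape of an AWS Lambda Python handler. The `(event, context)` signature matches Lambda’s Python runtime; the payload and logic are made up, and we invoke it locally just to show the idea.

```python
def handler(event, context):
    # The platform passes the trigger payload in `event`;
    # the function returns a plain dict as its response.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello, {name}"}

# Locally we can simulate an invocation (context is unused here):
print(handler({"name": "Powerwrap"}, None))
```

On the real service the platform handles provisioning, scaling and invocation; you supply only the function body.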
What are ‘micro-services’?
Micro-services are used to split out processes, usually along business-domain lines, so that the operations can be independently managed.
For example, in an e-commerce site, you might have an Orders service and a Users service, so that if you get a lot of orders but not a big growth in users, you can scale up the Orders service without having to scale up the Users service as well. A major advantage of micro-services is for operational purposes: each team can update and iterate its own service as needed without everything having to be packaged up as a whole-system release.
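The independent-scaling point can be sketched in a few lines. The service and instance names below are hypothetical; the toy “router” just shows that each service keeps its own replica pool, so one pool can grow without touching the other.

```python
from itertools import cycle

# Each service scales its replica count independently of the others.
replicas = {
    "orders": [f"orders-{i}" for i in range(3)],  # scaled up to 3
    "users":  [f"users-{i}" for i in range(1)],   # still 1 replica
}

# Simple round-robin dispatch per service.
pools = {svc: cycle(insts) for svc, insts in replicas.items()}

def route(service):
    return next(pools[service])

print(route("orders"))  # orders-0
print(route("orders"))  # orders-1
print(route("users"))   # users-0
```

A real deployment would do this with a load balancer and an orchestrator, but the principle is the same: scaling decisions are made per service, not for the system as a whole.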
The disadvantage of micro-services is more overhead; they may not be suitable for small operations. Things can get very complex very quickly, with more monitoring, management and deployment to handle.
What’s the difference between a data lake and a data warehouse?
A data lake is raw, unmodified data stored somewhere. Taking the raw data, cleaning it into a more usable format and then storing it in a database forms the data warehouse. For example, as we get incoming data from third parties or from our own applications, we store it initially in raw form, and that storage repository is a data lake. Some of that data gets processed into a separate data warehouse for analytical reporting.
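The lake-to-warehouse step can be sketched in miniature. The raw records, field names and cleaning rules below are all invented, and an in-memory SQLite table stands in for a real warehouse; the point is just raw-in, cleaned-and-typed-out.

```python
import json
import sqlite3

# Raw records as they might land in the lake (unmodified, messy types).
lake = [
    '{"order_id": 1, "amount": "120.50", "currency": "aud"}',
    '{"order_id": 2, "amount": "80.00",  "currency": "AUD"}',
]

# "Warehouse": an in-memory SQLite table stands in for a real one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")

# ETL step: parse, clean (fix types and casing), then load.
for line in lake:
    rec = json.loads(line)
    db.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (rec["order_id"], float(rec["amount"]), rec["currency"].upper()),
    )

# Analytical queries run against the cleaned warehouse, not the lake.
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 200.5
```

The lake keeps the raw lines forever; the warehouse holds the queryable, consistent version.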
What are the differences between Amazon, Azure and Google Cloud Platform?
All have relative parity in features available such that you can build almost anything using them. AWS tends to be slightly more developer-centric and includes a wider range of services. Azure is perhaps slightly more user-friendly and GCP has a stronger focus on open-source and containerisation.
With everyone using cloud infrastructure now, how do companies guard against outages by the cloud giants?
These days Amazon and the like rarely have outages, so this is not as risky as it might once have been. To answer specifically about Amazon: AWS has regions where data centres are clustered, and within those regions are availability zones (AZs). Companies can choose to replicate their applications and processes across two or more AZs to increase fault tolerance. Whether a company does this, and to what extent, depends on how critical the data and processes are. Banks, for example, run critical operations and have essentially zero tolerance for downtime or data loss. A small company operating a non-financial service will have less need for that. Everything is a trade-off: the more replication you have, the higher the cost, both in money and in maintenance. The best approach is to put in only as much replication and redundancy as you need, with the ability to add or reduce in future as needed.
How do you decide what vendors and components to use?
It depends on the business requirements, and what fits in best in the desired data architecture. For operational reasons it’s often best to choose components that are vertically integrated and managed by your cloud provider. If that functionality is lacking, too costly or not performant enough, you can assess other third party vendors to fill the need.
What makes good architecture?
The key things to consider are these:
Availability: the system must continue to function after failures or outages.
Durability: protection against data loss. What is the Disaster Recovery (DR) strategy? What is the RTO (recovery time objective: how long until things are up and running again)? What is the RPO (recovery point objective: how far do we roll back, and how much data loss is acceptable)?
Scalability: how the system handles increases in scale in different components. There are two basic ways to scale: horizontally (add more machines to distribute the processing) or vertically (increase the resources on existing machines). Scaling vertically is simpler and less complex, but has limits for large tasks. In general, you scale vertically as far as possible, then scale horizontally as needed.
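Horizontal scaling can be sketched as splitting one workload across several independent workers and combining their results; in this toy version the “machines” are just chunks processed in a loop, and the workload is a made-up sum of squares. Vertical scaling, by contrast, would mean giving a single machine more CPU or memory and leaving the code alone.

```python
def process_chunk(chunk):
    return sum(x * x for x in chunk)  # stand-in for real work

def scale_out(data, machines):
    # Split the data into one chunk per "machine".
    size = -(-len(data) // machines)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # In a real system each chunk would run on its own node/worker;
    # here we process them sequentially and combine the results.
    return sum(process_chunk(c) for c in chunks)

data = list(range(10))
print(scale_out(data, 3))  # 285, same answer as a single machine
```

The answer is identical whatever the machine count; what changes in a real cluster is wall-clock time, plus the coordination overhead that makes horizontal scaling the more complex option.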
Security: a deep topic in its own right, but the aim is defence in depth, with many layers of protection. Minimise access to only those who absolutely need it, and keep thorough audit logs so every action is traceable. Encrypting data ‘at rest’ and ‘in transit’ is paramount, as is responsible storage of confidential information.
Cost: good architecture is designed to cost only what it needs to. Cloud services typically charge by usage. It’s also important to consider the total cost of ownership (TCO): the people required to manage the system, not just the components themselves.
Monitoring: logs, alerts, failure detection, retries, self-healing. Systems generate a lot of information; make sure you capture the right level of detail so you can act when required.
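The “retries and self-healing” idea can be sketched as a small retry-with-backoff wrapper, the kind of automated recovery step that runs before an alert ever fires. The `flaky` function is a made-up stand-in for a transiently failing call.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up and let monitoring/alerting take over
            time.sleep(base_delay * (2 ** attempt))  # back off: 10ms, 20ms, ...

# A stand-in for a transiently failing dependency: fails twice, then works.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky))  # ok (after two silent retries)
```

The self-healing part is that transient failures never reach a human; only the final, exhausted failure surfaces as an alert.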
Performance: speed, usually measured in percentiles. Averages are not particularly helpful here; we usually want statements like “99% of queries return within 100 milliseconds”. Even knowing that 80% of queries are performing well can still hide a ‘long tail’ of bad performance.
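A tiny numeric sketch shows why percentiles beat averages. The latency figures are made-up sample data (in milliseconds), and the percentile function uses the simple nearest-rank method.

```python
import math

# Made-up latency samples: 90% of requests are fast, 10% are slow.
latencies = [20] * 90 + [250] * 10

def percentile(values, p):
    # Nearest-rank percentile: the value below which p% of samples fall.
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[k]

mean = sum(latencies) / len(latencies)
print(mean)                       # 43.0 -- the average looks acceptable
print(percentile(latencies, 50))  # 20   -- the typical request is fast
print(percentile(latencies, 99))  # 250  -- but the tail is 12x slower
```

The average (43 ms) hides the fact that one request in ten takes 250 ms; the p99 figure exposes exactly the long tail users actually feel.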
Simplicity: This is arguably the most important factor, as it affects operational speed and maintainability. The more complex your system, the more overhead both in terms of system performance and in cost and effort to maintain. Complexity makes it harder and slower to make changes and roll out new features and can mean higher people costs and more key person risk if you need specialised developers or additional time and training to run niche systems.
Key considerations in designing architecture: Availability, Durability, Scalability, Security, Cost, Monitoring, Performance, Simplicity.
What makes bad architecture?
Not assessing the above areas and weighing the trade-offs. A common pitfall is introducing too much complexity too early through premature scaling, which slows the delivery of new features. Build the product first; if you ever reach the point of needing immense distributed scale, you will have plenty of customers (and thus resources) to deal with it.
To finish off, a couple of lighter questions:
What is hard or annoying for data scientists?
I’d say all the data manipulation. 80-90% of data science is transforming the data before you get to the fun stuff. Another thing is that, because it is a science, we spend a lot of time testing hypotheses that might never have a business outcome. We might attempt something, do two weeks of work on it and then throw it away because the outcome didn’t work (though often with valuable learnings). That can be hard when you’re working in a traditional business that has structured planning and timelines.
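A small taste of that 80-90%: even a tiny, invented dataset needs several cleaning steps (missing values, inconsistent casing, numbers stored as strings) before any analysis can start. Every name and rule below is hypothetical.

```python
# Made-up raw records, with typical real-world messiness.
raw = [
    {"client": " Acme ", "balance": "1,200.00"},
    {"client": "acme",   "balance": None},
    {"client": "Beta",   "balance": "300"},
]

clean = []
for row in raw:
    if row["balance"] is None:
        continue  # drop records we can't use
    clean.append({
        "client": row["client"].strip().title(),            # normalise names
        "balance": float(row["balance"].replace(",", "")),  # parse numbers
    })

# Only now does the "fun stuff" start:
total = sum(r["balance"] for r in clean)
print(total)  # 1500.0
```

Three lines of analysis at the end, a loop of cleaning above it; on real datasets that ratio gets much worse.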
What is fun or satisfying for data scientists?
This is a response from the developers in our team: It’s great for data scientists now to be able to store and play with data in the cloud instead of on local machines. It saves a lot of time and the vendors provide excellent services for working with data. Things like AWS’s EMR (Elastic MapReduce) allow you to process and review huge amounts of data quickly and easily.
I’d agree – having all those resources available in the cloud makes scaling data processes up and down much easier. And also… data science is fun and satisfying when it works!
Thanks, Michael, for a great Q&A. I finally got to grips with a couple of concepts I had been very hazy on!
You might also be wondering, what does a data architect do?
Here is a good definition, from Discover Data Science:
“Data architects design and manage vast electronic databases to store and organize data. They investigate a company’s current data infrastructure and develop a plan to integrate current systems with a desired future state. Data architects then write code to create new, secure framework for databases that may be used by hundreds or thousands of people.”
If you want to learn more about data architecture, try these sites:
And if you want to learn more about Michael, say hi on LinkedIn!