Top 5 Macro Data Trends for 2023

Guha Ayan
6 min read · Dec 23, 2022

Holiday greetings!! It is that time of the year when we are buzzing with anticipation of the upcoming holidays, and of what the new year will bring us. This is also a great time to look back and identify the big trends we saw this year, and what we expect to continue next year. Here are my highlights, synthesised.

Macro Data Trends: 2023

I. Data Reliability will be key

In the distant past, life was simple. Data lived in a single, often giant, appliance-based data warehouse. Data processing, storage and analysis were all co-located. This had an excellent property: a single control-plane touchpoint.

Fast forward to today: our data storage and compute systems are decoupled, pipelines often span multiple toolsets, and there are requirements to enable multiple consumption patterns for data. All for good reasons. However, our control planes are now widely fragmented. So what is the impact? In our experience, this fragmentation directly impacts non-functional requirements (NFRs) such as quality, performance, cost and security.

This year we saw an emerging trend of renewed focus on data reliability engineering, the key theme being consolidation of tools and technologies to provide a streamlined way to manage and measure NFRs. A number of concepts and implementations were introduced around this theme, such as:

Distributed Data Governance

  • Platform specific governance tools such as Unity Catalog, AWS Lake Formation, Alation etc.
  • Tag/Attribute-based policy enforcement: a big leap in terms of governance and policy management
  • Data security, including perimeter security, AuthN and AuthZ, hashing, and encryption at rest and in flight
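Tag-based policy enforcement is easiest to see in miniature. The sketch below is a toy illustration of the idea, with hypothetical tag, role and column names: policies are written against tags rather than individual objects, so any newly tagged column inherits the right policy automatically.

```python
# Minimal sketch of tag/attribute-based access control.
# All tag, role and column names here are illustrative.

# Columns are tagged once, at registration time.
column_tags = {
    "customers.email": {"pii"},
    "customers.segment": {"marketing"},
    "orders.total": {"finance"},
}

# Policies reference tags, not individual columns.
role_allowed_tags = {
    "analyst": {"marketing", "finance"},
    "privacy_officer": {"pii", "marketing", "finance"},
}

def can_read(role: str, column: str) -> bool:
    """A column is readable if the role's policy covers all of its tags."""
    tags = column_tags.get(column, set())
    return tags <= role_allowed_tags.get(role, set())

print(can_read("analyst", "customers.email"))         # False: 'pii' not granted
print(can_read("privacy_officer", "customers.email"))  # True
```

Platform tools such as Unity Catalog and Lake Formation apply the same principle at catalogue scale, which is why tagging discipline matters more than per-table grants.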

Data Ops

  • Integration of data assets with CI/CD
  • Automated Data Quality measurement
  • Using software engineering practices to ensure code quality, especially with Spark and dbt
  • Automated, AI/ML-driven proactive security and quality management, such as GCP Cloud DLP
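An automated data quality gate is the kind of check a CI/CD pipeline would run before publishing a dataset. Here is a minimal sketch of the pattern; the rule names and sample rows are illustrative, not taken from any specific tool.

```python
# A toy data quality gate: named rules run against a batch of rows,
# and the pipeline fails fast if any rule is violated.

def check_not_null(rows, column):
    return all(row.get(column) is not None for row in rows)

def check_unique(rows, column):
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 2, "email": None},  # duplicate id, missing email
]

results = {
    "id_not_null": check_not_null(rows, "id"),
    "id_unique": check_unique(rows, "id"),
    "email_not_null": check_not_null(rows, "email"),
}
failed = [name for name, ok in results.items() if not ok]
print("FAILED:" if failed else "PASSED", failed)
```

Tools like dbt tests and Great Expectations follow this same shape, declarative rules evaluated automatically on every run, which is what makes quality measurable rather than anecdotal.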

Data Observability

  • Single Pane of Glass for end to end data visibility
  • Centralised logging, monitoring and notification system

II. Unified Data Platforms

Unification of data platforms has been a theme this year, and we expect it to continue next year as well. The two biggest manifestations of this concept can be grouped as below:

Unification of Processing Semantics

The key focus is treating batch data as a bounded dataset and real-time data as an unbounded dataset, and building processing for both around that single abstraction.

This trend has been shaping up for the last couple of years. Apache Spark has been leading the way. Apache Beam is built on this idea (the name itself fuses two words, Batch and Stream) and is at the core of GCP's Dataflow offering. Recently, AWS Kinesis Data Analytics introduced support for Apache Flink. This is definitely a space that will evolve rapidly.
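The bounded/unbounded idea can be illustrated in plain Python: write the transformation once against an iterator, then feed it either a finite (batch) source or a never-ending (stream-like) source. This is only a toy analogy; engines such as Spark, Beam and Flink add windowing, state and fault tolerance on top of the same abstraction.

```python
# One pipeline definition, applied to both bounded and unbounded inputs.
from itertools import islice
from typing import Iterable, Iterator

def pipeline(events: Iterable[int]) -> Iterator[int]:
    """The logic is written once, independent of boundedness."""
    return (e * 2 for e in events if e % 2 == 0)

# Batch: a bounded dataset that ends on its own.
batch = [1, 2, 3, 4]
print(list(pipeline(batch)))  # [4, 8]

# "Streaming": an unbounded generator, consumed incrementally.
def sensor_stream():
    n = 0
    while True:  # never terminates on its own
        yield n
        n += 1

print(list(islice(pipeline(sensor_stream()), 3)))  # [0, 4, 8]
```

The point is that the business logic (`pipeline`) never changes; only the source and the consumption strategy do, which is exactly the promise of unified processing semantics.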

Feature Consolidation At Platform level

Each platform vendor is flexing its "We can do it all" muscles. This market trend is pushing each vendor to close major gaps in its offerings. Key examples are Snowflake's Snowpark, Databricks' Unity Catalog and AWS Lake Formation.

III. Novel Data Consumption Patterns

While the traditional BI consumption pattern still leads the board, various novel consumption patterns are emerging. To support them, platform-level modifications are emerging simultaneously.

Data Assets — Data As Product

One of the key trends is identifying information in the form of data assets and making them first-class objects in a platform. An asset can be a BI report in its traditional form, an entire dataset, a synthesised notebook that runs a set of analytical workloads, or a fully trained ML model. The key here is how information is shared and consumed across the board.

Data Sharing, Clean Rooms, Marketplaces

The ability to share data securely, and without creating copies, is an absolute game changer. Cross-industry data sharing and the ability to enable multi-party governance and access control bring unprecedented agility. This concept now extends to various data assets as well.

Snowflake and GCP BigQuery are leading the way, but Databricks and other vendors are catching up quickly. A few key trends around data sharing are emerging, such as:

  • Marketplace: one-way sharing of data, where one organisation is the publisher and others are consumers.
  • Exchange: Data exchange mechanism enables multiple publishers and subscribers to use the same platform to form a data ecosystem.
  • Clean Rooms: clean room technology evolved from the need for multiple parties to come to a common platform to work on a specific problem without revealing their inner workings. Needless to say, this trend enabled quick data exchange during the pandemic, while the crisis forced organisations to fine-tune the technology from a performance and security standpoint.

Data asset sharing is expected to continue evolving as a trend: many offerings currently in preview will soon become generally available, while a host of new features are introduced.

IV. Interaction with AI/ML Assets

The industry has spent the last five years making model development commonplace. Strong focus and initiatives were, in general, successful. Writing and developing a custom ML model for a specific use case is the new normal. Now the conversation is moving from science to engineering: the efficiency, reliability and security of the model, as the industry focuses on intelligent products.

There are a couple of trends emerging in this area:

ML Operations (MLOps)

MLOps and ML Engineering are so similar as terms that they are often conflated. MLOps is about how the model lifecycle is managed, all the way from model development to model deployment, and then measuring model performance while it runs in production. There have been a few interesting developments in this area.

ML Engineering

There is a renewed focus on considering ML applications as engineering artefacts and treating them as such. This includes version control, quality management and CI/CD. Feature stores are now showing up in enterprise architectures, especially where big organisations have started to rationalise feature engineering across many bespoke models.
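The feature-store idea is worth making concrete. Below is a deliberately minimal sketch, with hypothetical feature and entity names: features are computed once, registered under a name and version, and then looked up consistently by both training and serving code, which is exactly the rationalisation mentioned above.

```python
# A toy feature store: versioned feature values, shared between
# training and serving. Real systems add freshness, point-in-time
# correctness and online/offline storage; this only shows the contract.
from collections import defaultdict

class FeatureStore:
    def __init__(self):
        # name -> version -> {entity_id: value}
        self._features = defaultdict(dict)

    def register(self, name, version, values):
        self._features[name][version] = values

    def get(self, name, version, entity_id):
        return self._features[name][version].get(entity_id)

store = FeatureStore()
# One team computes and registers the feature once...
store.register("avg_order_value", "v1", {"cust_1": 42.5, "cust_2": 13.0})
# ...and both training and serving read the same versioned definition.
print(store.get("avg_order_value", "v1", "cust_1"))  # 42.5
```

The design point is the versioned, named contract: once many bespoke models read from the same registered features, feature engineering stops being duplicated per model.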

The application engineering space is also carving out room for ML Engineers to integrate with low-effort AI/ML solutions, such as AutoML and the cognitive services offered by the prominent cloud platforms.

V. SQL Going Serverless — dbt will help

This is not a new trend; rather, it is an affirmation of the continuing focus on SQL. In fact, all prominent data vendors are investing effort and features to make SQL as performant as possible. One prominent trend in this space is serverless offerings: Redshift and Databricks both came out with serverless SQL endpoints, while Azure already had similar offerings around SQL pools in the Synapse ecosystem.

This year, not many data platform discussions concluded without mentioning dbt. It is an excellently conceived, neat little tool that has had a great impact on how teams write consistent SQL code. It was introduced as a tool for analytics engineers, but it is often used in the data processing space as well. Both trends are expected to continue evolving next year.
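Part of dbt's appeal is how simple its core trick is: models reference each other via `ref()`, and at compile time those references are resolved to qualified table names (and a dependency graph). The sketch below is a simplified re-implementation of that idea for illustration, with made-up model names; it is not dbt's actual API.

```python
# Toy illustration of dbt-style compilation: {{ ref('model') }}
# placeholders in SQL are resolved to schema-qualified table names.
import re

SCHEMA = "analytics"  # hypothetical target schema
models = {
    "stg_orders": "select id, total from raw.orders",
    "daily_revenue": "select sum(total) as revenue from {{ ref('stg_orders') }}",
}

def compile_model(sql: str) -> str:
    # Replace each {{ ref('name') }} with a qualified table name,
    # roughly as dbt does when it renders a model.
    return re.sub(
        r"\{\{\s*ref\('(\w+)'\)\s*\}\}",
        lambda m: f"{SCHEMA}.{m.group(1)}",
        sql,
    )

print(compile_model(models["daily_revenue"]))
# select sum(total) as revenue from analytics.stg_orders
```

Because references are declared rather than hard-coded, the tool knows the build order and can retarget the same SQL at dev, test and prod schemas, which is much of why teams end up with more consistent SQL.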

Conclusion

Putting everything together, here is the outlook:

The data engineering technology stack is mature, and the focus is shifting towards reliability. The data science space is evolving rapidly, and investments, in both technology and mind share, in building engineering foundations around it are on the rise.

** This article focuses on macro trends with potential impact on the overall analytical data platform. There are trends within each specific area with finer and subtler impact. Of course, these are personal views and observations.
