Commit 4563066

Editors pass over the blog
1 parent ea60c52 commit 4563066

1 file changed: +19 -12 lines changed

pgml-docs/docs/blog/data-is-living-and-relational.md

Lines changed: 19 additions & 12 deletions
@@ -37,28 +37,35 @@ A common problem with data science and machine learning tutorials is the publish
 
 </center>
 
-- It’s usually denormalized into a single tabular form, e.g. csv file
-- It’s often relatively tiny to medium amounts of data, not big data
-- It’s always static, new rows are never added
-- It’s sometimes been pre-treated to clean or simplify the data
+They are:
 
-As Data Science transitions from academia into industry, those norms influence organizations and applications. Professional Data Scientists now need teams of Data Engineers to move the data from production databases into centralized data warehouses and denormalized schemas that are more familiar, and ideally easier to work with. Large offline batch jobs are a typical integration point between Data Scientists and their Engineering counterparts who deal with online systems. As the systems grow more complex, additional specialized Machine Learning Engineers are required to optimize performance and scalability bottlenecks between databases, warehouses, models and applications.
+- usually denormalized into a single tabular form, e.g. a CSV file,
+- often relatively tiny to medium amounts of data, not big data,
+- always static, with new rows never added,
+- and sometimes pre-treated to clean or simplify the data.
 
-This eventually leads to expensive maintenance and then to terminal complexity where new improvements to the system become exponentially more difficult. Ultimately, previously working models start getting replaced by simpler solutions, so the business can continue to iterate. This is not a new phenomenon, see the fate of the Netflix Prize.
+As Data Science transitions from academia into industry, these norms influence organizations and applications. Professional Data Scientists need teams of Data Engineers to move data from production databases into data warehouses and denormalized schemas which are more familiar, and ideally easier to work with. Large offline batch jobs are a typical integration point between Data Scientists and their Engineering counterparts, who primarily deal with online systems. As the systems grow more complex, additional specialized Machine Learning Engineers are required to optimize performance and scalability bottlenecks between databases, warehouses, models and applications.
+
+This eventually leads to expensive maintenance and to terminal complexity: new improvements to the system become exponentially more difficult. Ultimately, previously working models start getting replaced by simpler solutions, so the business can continue to iterate. This is not a new phenomenon, see the fate of the Netflix Prize.
 
 Announcing the PostgresML Gym 🎉
 -------------------------------
 
-Instead of starting from the academic perspective that data is dead, PostgresML embraces the living and dynamic nature of data inside modern organizations. It's relational and growing in multiple dimensions.
+Instead of starting from the academic perspective that data is dead, PostgresML embraces the living and dynamic nature of data produced by modern organizations. It's relational and growing in multiple dimensions.
 
 ![relational data](/images/illustrations/uml.png)
 
-- Schemas are normalized for real time performance and correctness considerations
-- New rows are constantly added and updated, which form the incomplete features for a prediction
-- Denormalized datasets may grow to billions of rows, and terabytes of data
-- The data often spans multiple iterations of the schema, and software bugs can introduce outlier data
+Relational data:
+
+- is normalized for real time performance and correctness considerations,
+- and has new rows added and updated constantly, which form the incomplete features for a prediction.
+
+Meanwhile, denormalized data sets:
+
+- may grow to billions of rows, and terabytes of data,
+- and often span multiple iterations of the schema, with software bugs introducing outliers.
 
-We think it’s worth attempting to move the machine learning process and modern data architectures beyond the status quo. To that end, we’re building the PostgresML Gym to provide a test bed for real world ML experimentation in a Postgres database. Your personal gym will include the PostgresML dashboard and several tutorial notebooks to get you started.
+We think it’s worth attempting to move the machine learning process and modern data architectures beyond the status quo. To that end, we’re building the PostgresML Gym, a free offering, to provide a test bed for real world ML experimentation in a Postgres database. Your personal Gym will include the PostgresML dashboard, several tutorial notebooks to get you started, and access to your own personal PostgreSQL database, supercharged with our machine learning extension.
 
 <center>
 <video autoplay loop muted width="90%" style="box-shadow: 0 0 8px #000;">
