Apr '20 [tensor]werk Heartbeat

Hey Folks! Here we come with our fourth Heartbeat: we are glad to present a super nice use case for Hangar, and a very timely one considering the SARS-CoV-2 global pandemic we are all facing right now. We will also be talking about RedisConf 2020, the international Redis conference, happening virtually on May 12-13.

Every month we share news on the projects we are working on, the conferences and events we attend, our plans for the future, and everything else related to data.

A collaborative annotation tool for COVID-19 datasets

Today we are proud to present a practical use case for Hangar that we are working on. We are building a collaborative image annotation tool on top of the secure foundations of Hangar, with the hope that it can serve the community in these times of emergency.

Using this system, you get annotation versioning and the best backend storage for your data, for free. Moreover, it enables collaborative dataset curation among different contributors, without the hassle of maintaining a collection of (maybe) CSV files with names and timestamps, or even worse… formatted Excel files (yes, there are plenty of people still doing that).

The annotation interface is built on top of LOST and is based on a coarse point-counting grid. Stereology has proven to be an effective approach for reducing annotation cost while still enabling the training of a segmentation model with satisfying performance.
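As a rough illustration of the point-grid idea, here is a toy sketch (not LOST's actual grid or API; all names below are ours): instead of tracing full masks, the annotator only labels the pixels under a sparse grid of points, which is far cheaper yet still informative for training a segmenter.

```python
# Toy sketch of coarse point-grid (stereological) annotation.
# A sparse grid of points is overlaid on the image; only the class
# under each point is labeled, rather than a full segmentation mask.

def grid_points(width, height, spacing):
    """Place a coarse point grid over a width x height image."""
    return [(x, y)
            for y in range(spacing // 2, height, spacing)
            for x in range(spacing // 2, width, spacing)]

def fraction_positive(mask, points):
    """Stereological estimate: fraction of grid points hitting the class."""
    hits = sum(mask[y][x] for x, y in points)
    return hits / len(points)

# 8x8 "mask" whose left half is the positive class
mask = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(8)]
pts = grid_points(8, 8, 4)           # points at (2,2), (6,2), (2,6), (6,6)
print(fraction_positive(mask, pts))  # → 0.5
```

With a fine enough grid, such point counts approximate area fractions well, which is what makes the approach cheap without being useless for model training.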

You can either distribute the annotation workload across annotators, assigning each of them a different Hangar branch. Each person sees only a subset of the data and can carry out the task individually. When the annotators are done, you simply merge all the branches and get annotations for the whole dataset.

Alternatively, you can still assign a different Hangar branch to each person involved in the annotation process, but let everyone see and annotate the whole dataset. This way you get several independent annotations for the same image, so you stop relying on the eyes of a single radiologist. This also helps overcome the bias of having each image seen by only one reader, limiting possible batch effects.
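The two workflows can be sketched in miniature, with plain dicts standing in for Hangar branches (this is the merge logic only, not Hangar's API; all names are ours):

```python
# Toy sketch of the two branch-merge workflows described above.
# Branches are modeled as plain dicts of {image_id: annotation}.

def merge_disjoint(*branches):
    """Workflow 1: each annotator labels a different subset; the merge
    is a simple union of non-overlapping annotation sets."""
    merged = {}
    for branch in branches:
        overlap = merged.keys() & branch.keys()
        if overlap:
            raise ValueError(f"conflict on images: {overlap}")
        merged.update(branch)
    return merged

def merge_overlapping(*branches):
    """Workflow 2: every annotator sees the whole dataset; the merge
    collects all annotations per image, so no single reader is trusted."""
    merged = {}
    for branch in branches:
        for image_id, annotation in branch.items():
            merged.setdefault(image_id, []).append(annotation)
    return merged

alice = {"img_001": "opacity", "img_002": "clear"}
bob   = {"img_003": "opacity"}
print(merge_disjoint(alice, bob))      # every image annotated exactly once

carol = {"img_001": "clear", "img_003": "opacity"}
print(merge_overlapping(alice, carol)) # img_001 now carries two independent reads
```

In the real tool, Hangar's version control engine handles the merge (and conflict detection) for you; the sketch just shows why the two branch layouts yield either one annotation per image or several.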

If you want to have a closer look at the code, please visit https://github.com/hhsecond/coviddatastore.

RedisConf 2020 Takeaway, May 12-13

Like many other technical conferences in the world, due to the outbreak of COVID-19, RedisConf 2020 became a virtual event. The good news is that it also became a free event, so every one of you can participate from your own couch, enjoying a live keynote, 50+ breakout sessions, a hackathon, 1:1 office hours with Redis experts, group chats, games, and more. Just tune in on May 12-13!

👉 Registration and more info at the official website. 👈

Our CEO Luca Antiga has also been invited to speak at the conference! Make sure to follow his breakout session on RedisAI. You will discover the latest features coming with the RedisAI 1.0 release, including auto-batching, DAG commands, MLflow integration and revamped docs. You'll also get a glimpse of what's baking for the next releases. If you want to find out more, check out our previous Heartbeat, which was entirely dedicated to RedisAI!

Meet the people: Luca Antiga

Luca (lantiga on Twitter and GitHub) is a co-founder and CEO at Tensorwerk.

As a kid, he started coding on his Sinclair ZX Spectrum 48k in the mid-'80s, but he didn't do much with it until much later in life (he still thinks that having BASIC instructions stamped on the keyboard was a great way to get a kid's attention). A bioengineer in training, he went on as a researcher in medical image analysis and cardiovascular biomechanics in the 2000s. He picked up C++ and Python, and after noodling with connections between vascular morphology, computational geometry and fluid dynamics, he released the Vascular Modeling Toolkit in 2004, an open-source project still in use today in bioengineering departments. He later contributed to the Insight Toolkit and 3DSlicer, and authored scientific papers on these subjects.

In 2009 he left research to co-found Orobix, a company based in Bergamo (Italy), initially focused on medical image analysis, which around 2014 became an AI engineering company operating in different sectors, such as healthcare, manufacturing, gaming and astrophysics. In 2017 Luca started contributing to PyTorch and was a core contributor for a couple of years. In the meantime he started co-authoring Deep Learning with PyTorch for Manning. As Orobix developed, ideas for new tools filling the gaps in the AI tooling landscape came up. Those ideas converged into a new initiative, stemming from the experience at Orobix but focused on developing core tools for Software 2.0. This is how Tensorwerk came to be. ✨

At Tensorwerk he’s busy with directions and design, and he’s directly involved in the development of RedisAI. Along with family and work, Luca has a passion for listening to jazz and its surroundings. He spends a few hours a week trying to make some sense out of the sounds coming out of his guitar.

Reach out

If you’d like to have a peek into our vision and our upcoming developments, please send us a note at info@tensorwerk.com. In any case, we will be posting our updates regularly here on Substack. Have fun and stay tuned.


If you want to stay up to date with ideas, projects and plans for the future at [tensor]werk, subscribe to our publication and receive the Heartbeat directly in your inbox.

Mar '20 [tensor]werk Heartbeat

Hey there! While the SARS-CoV-2 global pandemic is affecting most of the world's population, fortunately we were able to continue our activities from NY, Italy and Bangalore. Here comes March's Heartbeat, covering one of our core products, RedisAI, and, hopefully, taking your mind off things.

Every month we share news on the projects we are working on, the conferences and events we attend, our plans for the future, and everything else related to data.

What’s going on?

We released Hangar 0.5! 🎉 As we mentioned in our Feb Heartbeat, this release introduces our last breaking API change, and for the better. In particular, you will find the new columns API replacing the old arraysets and metadata terminology. An arrayset could only represent tensors as a data type, while a column can also represent strings, replacing the functionality of metadata in a way that is simpler, more effective and more extensible.
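To make the change concrete, here is a toy, pure-Python model of the idea (the class and method names are ours, not Hangar's actual API): a single column abstraction whose cells hold either fixed-shape arrays or strings, so tensors and the old per-sample metadata live behind one uniform interface.

```python
# Toy model of the columns idea: one abstraction whose cells can be
# either numeric arrays (the old arraysets) or strings (the old
# metadata). Nested lists stand in for n-dimensional arrays here.

class Column:
    def __init__(self, name, dtype):
        if dtype not in ("ndarray", "str"):
            raise ValueError("unsupported column dtype")
        self.name, self.dtype, self._cells = name, dtype, {}

    def __setitem__(self, key, value):
        if self.dtype == "str" and not isinstance(value, str):
            raise TypeError(f"column {self.name!r} holds strings")
        if self.dtype == "ndarray" and not isinstance(value, list):
            raise TypeError(f"column {self.name!r} holds arrays")
        self._cells[key] = value

    def __getitem__(self, key):
        return self._cells[key]

# tensors and per-sample labels behind the same interface
images = Column("images", "ndarray")
labels = Column("labels", "str")   # would have been "metadata" before
images["sample_0"] = [[0.1, 0.2], [0.3, 0.4]]
labels["sample_0"] = "pneumonia"
```

The point of the sketch is the uniformity: once strings are just another column type, there is no separate metadata machinery to learn or maintain.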

This month we also submitted a PR to the MLflow project to support our RedisAI plugin, in order to seamlessly deploy MLflow models directly to RedisAI. Stay tuned for the updates!

RedisAI

While we briefly introduced RedisAI in our very first Heartbeat a couple of months ago, we would like to explain better what kind of problems it solves and why it’s a convenient tool to adopt in your stack.

Before diving into the details, let’s take a step back: what is Redis? Redis, which stands for Remote Dictionary Server, is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue.

At a glance, RedisAI is a Redis module for serving tensors and executing deep learning models, born from a collaboration between [tensor]werk and RedisLabs. With the RedisAI module, Redis can store another data type, the Tensor.

As our CEO Luca Antiga likes to say, "Don't say AI until you productionize": taking your deep learning model prototype out of that Jupyter notebook and putting it into production is a critical step for real-world AI applications.

RedisAI's strongest point is its ease of use:

  • It is easy to work with models defined in different deep learning frameworks. In fact, RedisAI understands PyTorch and TensorFlow models directly, plus models saved in the ONNX interchange format (exported from almost any machine learning framework, including scikit-learn). RedisAI can even execute models from multiple frameworks as part of a single pipeline.

  • It is easy to switch the device your model executes on: there is no separate workflow for GPU or CPU.

  • It is easy to become an MLOps engineer: if you already have Redis in your stack, setting up RedisAI is a no-brainer and your DevOps engineers don't need to learn anything else. Setting it up is a matter of 5-6 shell commands (or a single Docker command 😉).

  • And since Redis is doing the work underneath, it is easy to scale your production runtime to a multi-node cluster setup with failover.

  • It is easy to manage the deployment even without Python (the language of choice of deep learning practitioners) if you don't want it in your production tech stack! Clients are available for Python, Go and Java; for other languages, users can rely on native Redis client libraries. Have a look at this repo for a showcase of Python, Go, Node.js and bash clients.

  • It is easy to reduce response time: RedisAI is architected so that users can keep the data local, keep the model hot and keep the stack short. Fewer moving parts mean lower cost and fewer headaches.

Oh, and not to mention that RedisAI is ~3 times faster than a REST API server you would build yourself with Django or Flask.
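To give a feel for the interaction model, here is a toy, in-process stand-in for the RedisAI flow of setting a tensor, running a model on it, and reading the result back. The real interface is a handful of Redis commands (such as AI.TENSORSET, AI.MODELSET, AI.MODELRUN and AI.TENSORGET in the 1.0 era); the Python below only mimics the shape of that flow, it is not the actual client.

```python
# Toy stand-in for the RedisAI flow: tensors and models live under
# keys in one store; running a model reads input tensors and writes
# an output tensor, all without the data leaving the store.

store = {}

def tensorset(key, values):
    """Store a (flat) tensor under a key, like AI.TENSORSET."""
    store[key] = list(values)

def modelset(key, fn):
    """Store a model under a key, like AI.MODELSET. A real model would
    be a serialized TF / PyTorch / ONNX graph, not a Python callable."""
    store[key] = fn

def modelrun(model_key, input_key, output_key):
    """Run the stored model on a stored tensor, like AI.MODELRUN."""
    store[output_key] = store[model_key](store[input_key])

def tensorget(key):
    """Read a tensor back, like AI.TENSORGET."""
    return store[key]

tensorset("input", [1.0, 2.0, 3.0])
modelset("doubler", lambda xs: [2 * x for x in xs])
modelrun("doubler", "input", "output")
print(tensorget("output"))  # → [2.0, 4.0, 6.0]
```

Notice that the data never crosses a process boundary between "set" and "run": that locality is exactly the "keep the data local, keep the model hot" argument above.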

Although it was written a year ago, this blog post by our Sherin is still a comprehensive walkthrough of RedisAI's features, including a comparison of existing deep learning runtimes, details on the installation, and a step-by-step practical example of an object detector with YOLO v3.

Meet the people: Sherin Thomas

Sherin, a.k.a. hhsecond, is a senior developer at [tensor]werk who started working with the team even before the company was founded. Sherin has spent his fair share of time on each tool [tensor]werk has developed so far. He also created Stockroom, a high-level data + model + parameter versioning platform built on the foundation of Hangar and git. Sherin extends our super-distributed team across the globe by working (and staying) in Bangalore. He is an author and speaker, and also teaches students (and professionals) about programming in general and about the different components of software 2.0.
Outside of work, Sherin is still a programmer 😁 He reads a lot (about a multitude of topics spanning psychology, tech, black holes, aliens, biology, management, etc.) and is quite attached to the Bangalore startup ecosystem. He believes the current education system still runs on decades-old methodologies and ideas and should be rebuilt, which is why he helped build fullstackengineering.ai. He is also fond of farming (although he has never actually farmed, apparently) and will probably use some AI stuff he copied from GitHub to build a robot to disrupt the Indian agriculture industry (that's what he says 😛).

Reach out

If you’d like to have a peek into our vision and our upcoming developments, please send us a note at info@tensorwerk.com. In any case, we will be posting our updates regularly here on Substack. Have fun and stay tuned.



Feb '20 [tensor]werk Heartbeat

Hey Folks! This is our second [tensor]werk Heartbeat, sharing what happened on our side during February. Did you miss the first Heartbeat? Get up to date with the latest news and read it here.

Every month we share news on the projects we are working on, the conferences and events we attend, our plans for the future, and everything else related to data.

What’s going on?

We are working on a brand new design of our logo and website, in collaboration with Evoque. For the moment no spoilers, stay tuned on our channels 😉

We are introducing a few changes to the Hangar APIs: the new columns API replaces the old arraysets and metadata terminology. An arrayset could only represent tensors as a data type, while a column can also represent strings, replacing the functionality of metadata in a way that is simpler, more effective and more extensible. We foresee this being the last major API change before 1.0.

Hangar

We are dedicating this Heartbeat to our fastest-growing project: Hangar!

We briefly mentioned what Hangar is in the last Heartbeat. In this Heartbeat, we would like to provide a few more insights on what kind of problems it solves and some of the motivations behind it.

Hangar was born with the aim of simplifying data lifecycle operations in Machine Learning and Deep Learning workflows, allowing efficient versioning and collaboration on numerical data.

Organizations & projects commonly rely on storing data on disk in some domain-specific binary format (e.g. .jpg images, .nii neuroimaging studies, .csv tabular data, etc.), and just deal with the hassle of maintaining all the infrastructure around reading, writing, transforming, and preprocessing these files into usable numerical data every time they want to interact with them. In fact, almost every data format requires a different program or library to process it. This implies that, most of the time, people who work with data need knowledge of the data formats and the tools designed for them, not only of deep learning applied to their specific field (medical, agricultural, meteorological, etc.). On top of this, a file-based representation of datasets is inherently rigid when it comes to making datasets evolve over time, collaborating on them, or keeping multiple revisions around for training or validation.

Hangar is designed to overcome these issues: we provide an API for storing and loading data in a standard and efficient way, and we aim to reduce the domain-specific knowledge required to work with different data types. When writing to a Hangar repository, you process the data into n-dimensional arrays once. Then, when you retrieve it, you are handed the same array, in the same shape and datatype, already initialized in memory and ready to compute on instantly.

All this *wow* comes with an intelligent choice of the backend used to store your data: you won't need to worry about the best storage system for your data. Based on heuristics, Hangar decides the best backend for the job, considering the arrayset specification (dimensionality, shape, sparseness, etc.) and system details. We currently support HDF5, memory-mapped arrays, TileDB and LMDB. As a user, this is completely transparent to you.

Hangar puts all these features behind a version control engine (based on a Merkle DAG, just like git). This enables data to be versioned, branched, merged and reconciled in case of conflicts. It means we can time-travel along the history of the data, know who did what to it, and access multiple branches in parallel, for instance to run multiple training jobs on different versions of a dataset at the same time.
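The versioning core can be illustrated with a miniature content-addressed commit chain, in the spirit of a Merkle DAG (a toy sketch, vastly simplified, not Hangar's implementation): each commit hashes its data together with its parent's digest, so any change anywhere in history changes every descendant digest, and old versions stay reachable by digest.

```python
import hashlib
import json

# Toy content-addressed commit chain (no trees, no branches on disk).
commits = {}  # digest -> {"parent": digest | None, "data": ...}

def commit(data, parent=None):
    """Store data, hashed together with the parent digest."""
    payload = json.dumps({"parent": parent, "data": data}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    commits[digest] = {"parent": parent, "data": data}
    return digest

def checkout(digest):
    """Time-travel: materialize the data exactly as it was at a commit."""
    return commits[digest]["data"]

v1 = commit({"samples": 100})
v2 = commit({"samples": 150}, parent=v1)

assert checkout(v1) == {"samples": 100}   # the old version is still reachable
assert checkout(v2)["samples"] == 150
assert commits[v2]["parent"] == v1        # history is linked, like git
```

Because digests depend on content plus parentage, two collaborators who end up with the same history get the same digests, which is what makes merging and conflict detection tractable.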

Just like git, Hangar enables collaboration through remotes. A dataset can be cloned (that is, the repository information and commit history are copied from a remote, but not the actual data), fetched (the actual data is materialized), partially fetched (a subset of the data is materialized), pushed to and pulled from. In practice, this means we can work on a huge dataset stored on a remote, materialize a fraction of it (for instance, to prototype a model on sample data) and push modifications to that fraction upstream.
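The access pattern this enables can be sketched with a toy in-memory "remote" (not Hangar's actual API; all names here are illustrative): the remote holds both lightweight commit history and heavy data blobs, cloning copies only the history, and data is materialized on demand.

```python
# Toy illustration of clone vs. fetch vs. partial fetch.
remote = {
    "history": ["c1", "c2"],                      # lightweight metadata
    "blobs": {"img_a": b"...", "img_b": b"..."},  # heavy array data
}

def clone(remote):
    """History comes over; the blobs stay on the server."""
    return {"history": list(remote["history"]), "blobs": {}}

def fetch(local, remote, keys=None):
    """Partial fetch: materialize only the requested subset of the data."""
    keys = remote["blobs"].keys() if keys is None else keys
    for k in keys:
        local["blobs"][k] = remote["blobs"][k]

local = clone(remote)
assert local["history"] == ["c1", "c2"] and local["blobs"] == {}

fetch(local, remote, keys=["img_a"])   # e.g. prototype a model on a sample
assert set(local["blobs"]) == {"img_a"}
```

The separation of history from data is the whole trick: you can reason about a terabyte-scale dataset's versions while holding only megabytes locally.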

Last but not least, Hangar seamlessly integrates with Machine Learning and Deep Learning workflows, through data loaders for PyTorch and TensorFlow. You can start training a model against data stored in a Hangar repo by literally adding one line to your existing code.

If you want to get started with Hangar, check out this brief tutorial from Jithin James 👉🏼 http://bit.ly/hangartutorial

Meet the people: Rick Izzo

Rick is Co-Founder and CTO of [tensor]werk, as well as the lead architect of Hangar. Prior to the founding of [tensor]werk in 2019, Rick was a Ph.D. candidate in Biomedical Engineering in the Endovascular Device Development Lab at SUNY Buffalo, Biomedical Engineer at the Jacobs Institute, fellow of the Prentice Family Foundation, and maintainer of the open-source Vascular Modeling Toolkit project (originally developed by fellow [tensor]werk Co-Founder, Luca Antiga).

He has expertise, and has published academic papers, in a wide range of engineering, medical, and computer science domains, including: Computational Fluid Dynamics, Medical Image Analysis & Segmentation, CAD for freeform anatomical model generation, Polymer-based multi-material 3D Printing, Interventional Cardiology (with a specific focus on transcatheter aortic/mitral valve replacement), Interventional Neurosurgery (catheter-based treatment of aneurysm, stroke, & pediatric hydrocephalus), storage, transformation, & computation on medical image datasets, and development & maintenance of open-source libraries written in Python & C++.

Outside of the lab, Rick is an amateur guitarist (primarily playing fingerstyle adaptations of bluegrass, classic rock, & anything by Mark Knopfler) and snow-skier. He is an avid traveler who loves to explore the world around him. He is currently well on his way towards a personal goal of traveling to 30 countries by the time he is 30 years old.

As CTO of [tensor]werk, Rick wrote an early prototype of the RedisAI runtime, and acts as lead architect and developer of Hangar. His current interests revolve around the efficient storage and usage of data (and how version control plays a role when curating and utilizing generalized datasets). His driving belief is that in the future (with tools like Hangar), the open-source software ethos will transfer into widespread open-source dataset curation and collaboration, allowing anyone who desires to generate insights from data collected about the world around us. We welcome anyone who connects with this belief to join our community and work!

Reach out

If you’d like to have a peek into our vision and our upcoming developments, please send us a note at info@tensorwerk.com. In any case, we will be posting our updates regularly here on Substack. Have fun and stay tuned.



Jan '20 [tensor]werk Heartbeat

Hey Folks! This is our very first [tensor]werk Heartbeat and we are very excited to start sharing what we do at [tensor]werk and what’s on our mind!

Every month we share news on the projects we are working on, the conferences and events we attend, our plans for the future, and everything else related to data.


Welcome

In this first Heartbeat, we thought it would be good to introduce ourselves and the projects we are working on. Although the company is new, we already have several interesting pieces to show.

Who we are

So, [tensor]werk: we are a small group of people coming from software development and data science. While [tensor]werk is incorporated in NY, we are a distributed team, currently spanning the US, India and Italy.

Our dream is to build tools for developers of systems whose behavior is learnt from data, rather than coded top-down. We call it data-defined software (this concept has been well captured by Andrej Karpathy in his Software 2.0 post from a while back).

As machine learning is entering the realm of software development, best software engineering practices developed over the course of decades need to be recast to this new paradigm. Many questions arise:

  • If data is the new source code, how can we keep track of it as it changes over time?

  • How do we work collaboratively on data-defined software?

  • How can we test data-defined software and what is TDD in this context?

  • How can we audit data-defined software and describe its behavior?

  • How do we factor data-defined software in building blocks and compose them at scale?

  • How do we build and maintain production systems at scale?

Of course, we’re not the only ones mumbling on this, at the same time there’s so much to do and a lot of space to get creative.

Now, how does all this translate into practice? At [tensor]werk we are currently working on a suite of tools, each with its own individual focus, each one filling a gap in the current tooling landscape.

What’s on our plate

Here is an outlook on the tools we are building at present:

  • RedisAI: a Redis module for serving tensors and executing deep learning models, developed in collaboration with Redis Labs. It turns Redis into a multi-backend deep learning runtime (it currently supports PyTorch, TF, TFLite, ONNX, and ONNX-ML on CPU and GPU) while retaining the operational simplicity of Redis. With RedisAI, you can literally productionize your model in minutes.
    We recently integrated RedisAI as a deploy target for MLflow, and more integrations with lifecycle management tools are on the way.

  • Hangar: a Python module and CLI providing versioning for numerical data. Think git for tensors, or multidimensional arrays. It is designed to solve many of the problems tackled by version control systems for source code, just adapted to numerical data, so:

    • time traveling through the history of a dataset

    • zero cost branching

    • merging and conflict resolution

    • cloning, pushing, pulling to/from remotes

    Working with Hangar is convenient. One can work with a huge dataset and just materialize a part of it locally. Or work collaboratively on a dataset, making sure no changes get lost. Or train models on different branches from different processes, all against the same Hangar repo.
    On top of all this, Hangar provides fast access to data and can compress data very effectively. Using it from PyTorch or TensorFlow, instead of data files, is literally a one-liner.

  • Stockroom: a very recent development. It is a tool built on top of Hangar and git to version models, data, (hyper)parameters and metrics alongside your model source code. It's a natural complement to Hangar, which is focused on versioning numerical data. At the same time, it provides a simpler surface for users to track their experiments without necessarily becoming Hangar or git ninjas. We are working hard to ship an alpha release soon!

What’s going on

After months of perfecting our vision and developing our core projects, we are getting close to releasing RedisAI and Hangar in general availability, meaning that users can start trusting these tools for production use.

The tooling landscape is very wide, and although we are filling very specific gaps, there's work to do to make our tools known to our potential user base. A website would help. Right, we are working on that too :-)

Reach out

If you’d like to have a peek into our vision and our upcoming developments, please send us a note at info@tensorwerk.com. In any case, we will be posting our updates regularly here on Substack. Have fun and stay tuned.


