Process columnar data and build high-performance query engines on modern hardware such as CPUs and GPUs using a standardized, language-independent memory format
Key Features
- Learn about basic Arrow data types and understand how they are represented
- Explore Arrow's interoperability with pandas and Parquet, as well as the various IPC formats
- Work with Arrow Flight RPC protocols, Arrow Compute APIs, and Arrow Dataset APIs to produce and consume tabular data
Book Description
Apache Arrow is designed to accelerate analytics and make it easy to exchange data across big data systems.
In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow environment, then helps you understand Arrow's versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks like enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and discuss the relationships between Arrow, Parquet, Feather, Protobuf, FlatBuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn how Dremio uses Apache Arrow to enhance SQL analytics and understand how Arrow can be used in browser-based web apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.
By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.
What you will learn
- Use Arrow libraries to access data files both locally and in the cloud
- Understand the zero-copy elements of the Arrow format and binary data
- Improve performance with memory-mapping in Arrow
- Interact and communicate with an Arrow C data producer or consumer
- Use the Arrow Compute APIs to build a simple analytics query engine
- Compile expressions for higher-performance analytics using the Gandiva library
- Create basic Flight servers and clients for querying and sending Arrow data
- Get well-versed in Apache Arrow build scripts and automation jobs
Who This Book Is For
This book is for data analysts and data scientists looking to explore the capabilities of Apache Arrow from the ground up. It will also be useful for engineers who are building utilities for data analytics or query engines, or otherwise working with tabular data, regardless of the language they program in. Some familiarity with basic concepts of data analysis will help you get the most out of this book.
Table of Contents
- Getting Started with Apache Arrow
- Working with Key Arrow Specifications
- Data Science with Apache Arrow
- Format and Memory Handling
- Deep Dive into the Arrow Libraries
- Exploring Apache Arrow Flight RPC
- Powered By Apache Arrow
- How to Leave Your Mark on Arrow
- Future Development and Plans