In-Memory Analytics with Apache Arrow

Perform fast and efficient data analytics on both flat and hierarchical structured data

By: Matthew Topol

eText | 24 June 2022 | Edition Number 1

At a Glance

Format
ePUB

eText

$79.19

Process columnar data and build high-performance query engines on modern hardware such as CPUs and GPUs, using a standardized, language-independent memory format

Key Features

Learn about basic Arrow data types and understand how they are represented
Explore Arrow's interoperability with data science tools like pandas and Parquet and the various IPC formats
Work with Arrow Flight RPC protocols, Arrow Compute APIs, and Arrow Dataset APIs to produce and consume tabular data

Book Description

Apache Arrow is designed to accelerate analytics and make it easy to exchange data across big data systems.

In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow environment, then helps you understand Arrow's versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, and working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and examine the relationships between Arrow, Parquet, Feather, Protobuf, FlatBuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn how Dremio uses Apache Arrow to enhance SQL analytics and how Arrow can be used in browser-based web apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.

By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.

What you will learn

Use Arrow libraries to access data files both locally and in the cloud
Understand the zero-copy elements of the Arrow format and binary data
Improve performance with memory-mapping in Arrow
Interact and communicate with an Arrow C data producer or consumer
Use the Arrow Compute APIs to build a simple analytics query engine
Compile expressions for higher performance analytics using the Gandiva library
Create basic Flight servers and clients for querying and sending Arrow data
Get well-versed in Apache Arrow build scripts and automation jobs

Who This Book Is For

This book is for data analysts and data scientists looking to explore the capabilities of Apache Arrow from the ground up. It will also be useful for engineers who are building utilities for data analytics or query engines, or who otherwise work with tabular data, regardless of the language they program in. Some familiarity with basic concepts of data analysis will help you get the most out of this book.