1 PyData Berlin 2018 Uwe L. Korn Extending Pandas using Apache Arrow and Numba
3 PyData Berlin 2018 Uwe L. Korn Strings, Strings, please give me Strings!
4 About me • Senior Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Data Engineer and Architect with heavy focus around Pandas • xhochy • mail@uwekorn.com
5 Agenda 1. Shortcomings of Pandas 2. ExtensionArrays 3. Arrow for storage 4. Numba for compute 5. All the stuff
6 Pandas Series • Payload stored in a numpy.ndarray • Index for data alignment • Rich analytical API • Accessors like .dt or .str
7 Shortcomings • Limited to NumPy data types; everything else falls back to object • NumPy’s focus is numerical data and tensors • Pandas performs well when NumPy performs well • Most prominent gaps: no native variable-length strings, non-nullable integers
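The two gaps named on this slide can be reproduced in a few lines. This is a minimal sketch, assuming only NumPy and pandas are installed; the variable names are illustrative and not from the talk.

```python
import numpy as np
import pandas as pd

# NumPy strings are fixed-width: every element is padded to the longest one.
fixed_width = np.array(["a", "bbbb"])

# As soon as a missing value appears, NumPy falls back to generic objects.
with_missing = np.array(["a", "bbbb", None])

# Integers cannot hold a missing value, so pandas silently switches to float64.
ints = pd.Series([1, 2, 3])
ints_with_gap = pd.Series([1, None, 3])

print(fixed_width.dtype, with_missing.dtype, ints.dtype, ints_with_gap.dtype)
```

The fixed-width array spends 4 characters (16 bytes) per element, even for the one-character string.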
8 What’s the problem?
10 Why are objects bad? Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016 https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
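A rough way to see the cost of objects is to compare a packed int64 array with the same values stored as Python objects. This sketch assumes CPython and NumPy; the exact per-object size depends on the interpreter version, so no absolute number is claimed for it.

```python
import sys
import numpy as np

ints = np.arange(1000, dtype=np.int64)
objs = ints.astype(object)  # array of pointers to boxed Python ints

# Packed layout: 8 bytes per value, contiguous in memory.
packed_bytes = ints.nbytes

# Object layout: one pointer per value plus a full Python object behind each pointer.
pointer_bytes = objs.nbytes
boxed_bytes = sum(sys.getsizeof(o) for o in objs)

print(packed_bytes, pointer_bytes + boxed_bytes)
```

Beyond the extra memory, every access to an object element has to follow a pointer and check its type, which is what defeats vectorised NumPy loops.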
11 Extending Pandas (0.23+) • Two new interfaces: • ExtensionDtype: what type of scalars? • ExtensionArray: implement basic array ops • Pandas provides algorithms on top
12 10x !!1
13 Extending Pandas (0.23+) • _from_sequence • _from_factorized • __getitem__ • __len__ • dtype • nbytes • isna • copy • _concat_same_type https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
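To make the method list concrete, here is a minimal sketch of the two interfaces. `TextDtype` and `TextArray` are hypothetical names invented for this illustration; the array simply wraps a NumPy object array, so it shows the shape of the interface rather than a fast implementation. Newer pandas releases expect further methods (for example `take`) before such an array is fully usable inside a Series.

```python
import numpy as np
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class TextDtype(ExtensionDtype):
    """Hypothetical dtype: answers 'what type of scalars?'."""

    name = "text"
    type = str
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return TextArray


class TextArray(ExtensionArray):
    """Hypothetical array of Python strings, None marks a missing value."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]  # scalar access
        return type(self)(self._data[item])  # slice or mask access

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return TextDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._data for a in to_concat]))


arr = TextArray._from_sequence(["a", None, "ccc"])
print(len(arr), arr.isna(), arr[2])
```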
14 Apache Arrow • Specification for in-memory columnar data layout • No overhead for cross-system communication • Designed for efficiency (exploits SIMD, cache locality, ...) • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
15 Nice properties • More native datatypes: string, date, nullable int, list of X, … • Everything is nullable • Memory can be chunked • Zero-copy to other ecosystems like Java / R • Highly efficient I/O
16 Not so nice properties • Still a young project • Not much analytics on top (yet!) • Core is in modern C++ • Extremely fast but hard to extend in Python
17 Writing algorithms in Python is easy! But slow.
18 Photo by Matthew Brodeur on Unsplash
19 Fast for-loops with Numba
20 Anatomy of an Arrow StringArray • 3 memory buffers • bitmap to indicate valid (non-null) entries • int32 array of offsets: "where does the string start?" • uint8 array of characters (UTF-8 encoded) • int64 offset • allows zero-copy slicing
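The layout can be rebuilt by hand with NumPy to see how one string is looked up. This is a sketch under the assumptions of the Arrow columnar format (least-significant-bit ordering in the validity bitmap, signed 32-bit offsets with one more entry than there are strings); `get_string` is a hypothetical helper for illustration.

```python
import numpy as np

# The array ["foo", None, "bär"] expressed as three buffers.
valid_bitmap = np.array([0b00000101], dtype=np.uint8)  # entries 0 and 2 are valid
offsets = np.array([0, 3, 3, 7], dtype=np.int32)       # start of each string, plus the end
data = np.frombuffer("foobär".encode("utf-8"), dtype=np.uint8)


def get_string(i, array_offset=0):
    """Return entry i of a (possibly sliced) array, or None if it is null."""
    j = i + array_offset
    if not (valid_bitmap[j >> 3] >> (j & 7)) & 1:
        return None
    start, end = offsets[j], offsets[j + 1]
    return data[start:end].tobytes().decode("utf-8")


print([get_string(i) for i in range(3)])
# A slice starting at entry 1 reuses the same buffers and only changes the offset.
print([get_string(i, array_offset=1) for i in range(2)])
```

Because slicing only changes `array_offset`, no buffer has to be copied.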
21 Numba @jitclass
23 Photo by Niklas Tidbury on Unsplash
24 Fletcher https://github.com/xhochy/fletcher • Implements Extension{Array,Dtype} with Apache Arrow as storage • Uses Numba to implement the necessary analytics on top
25 Demo
26 Fletcher Demo
30 ExtensionArray Implementations • cyberpandas: IPArray (PR), https://github.com/ContinuumIO/cyberpandas • geopandas: GeometryArray (WIP), https://github.com/geopandas/geopandas • fletcher: Apache Arrow + Numba backed arrays, https://github.com/xhochy/fletcher
31 Photo by Israel Sundseth on Unsplash pip install fletcher
32 24-26 October + 2 days of sprints (27-28 October) • ZKM Karlsruhe, DE • Call for Participation opens next week. Photo by JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
33 I’m Uwe Korn Twitter: @xhochy https://github.com/xhochy Thank you!
