1 The Fletcher Framework for Programming FPGAs OpenPOWER Summit Europe October 3, 2018 Johan Peltenburg Accelerated Big Data Systems Quantum & Computer Engineering Department Delft University of Technology
2 Additional Credits Development (TU Delft) Jeroen van Straten Matthijs Brobbel Laurens van Dam Lars Wijtemans Lars van Leeuwen Support Zaid Al-Ars (TU Delft) Peter Hofstee (IBM/TU Delft) Jinho Lee (IBM) CAPI SNAP team (IBM) Cathal McCabe (Xilinx)
3 Heterogeneous software processes
Language | Runs on top of... | Methods
C/C++, Fortran, Rust, Julia, ... | CPU | Compiled to machine instructions (sometimes called native instructions)
Java, Scala | Java Virtual Machine | Compiled to Java Bytecode; could be Just-In-Time compiled to machine instructions
Python, R | Interpreter | Interpreted; strong integration with native libraries
4 Heterogeneous computing ● Big data systems are becoming increasingly heterogeneous. – Many different “types” of processes in both SW and HW. ● Example: TensorFlowOnSpark[1] – You can run a Python program ● That uses NumPy (Python bindings on top of a C core) – Interfacing with TensorFlow, programmed in CUDA ● Running on a GPU – On top of Spark, written in Scala/Java ● Running on a Java Virtual Machine ● That runs on your CPU ● What challenges does this bring? [1] https://github.com/yahoo/TensorFlowOnSpark
5 A string [Figure: in-memory layout of a string in C++ (string size, pointer to char buffer, internal char array optionally used, optionally allocated char array), Java 8 (JVM object header, hash cache, UTF-16 array reference, plus a separate JVM array object holding the UTF-16 array), Python (variable-length object header, hash, state, variable-length character array), and on the FPGA (a length stream and a character stream).]
6 Serialized collection in shared memory [Figure: Collection X in the memory of Process A is serialized into an intermediate format and deserialized into the memory of Process B.] Serialization means: ● Iterate over all objects in the collection ● Traverse all object graphs (memory latency) ● Copy fields to some intermediate format both A and B understand (bandwidth lost) ● Reconstruct objects in B (allocations) A toy example of this cost is sketched below.
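As a toy illustration of that cost (plain C++, not from the slides; the Serialize/Deserialize helpers are purely illustrative): moving a collection of strings between processes this way means walking every object, copying its bytes into an intermediate buffer, and re-allocating every object on the receiving side.

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Serialize: walk every string and copy its length and bytes into one
// contiguous intermediate buffer (extra bandwidth spent on the copy).
std::vector<uint8_t> Serialize(const std::vector<std::string>& strings) {
  std::vector<uint8_t> buffer;
  for (const auto& s : strings) {
    uint32_t len = static_cast<uint32_t>(s.size());
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
    buffer.insert(buffer.end(), p, p + sizeof(len));
    buffer.insert(buffer.end(), s.begin(), s.end());
  }
  return buffer;
}

// Deserialize: reconstruct every string object on the receiving side
// (one allocation per object).
std::vector<std::string> Deserialize(const std::vector<uint8_t>& buffer) {
  std::vector<std::string> strings;
  size_t pos = 0;
  while (pos + sizeof(uint32_t) <= buffer.size()) {
    uint32_t len;
    std::memcpy(&len, buffer.data() + pos, sizeof(len));
    pos += sizeof(len);
    strings.emplace_back(reinterpret_cast<const char*>(buffer.data() + pos), len);
    pos += len;
  }
  return strings;
}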
7 Relative impact on accelerators [Figure: timeline comparison of CPU compute time, (de)serialize/copy time, and accelerator compute time for three cases: the original process on the CPU, the accelerated process on GPGPU/FPGA with serialization, and the desired acceleration profile.]
8 Overcoming serialization bottlenecks ● We (de)serialize a lot… Can we do this smarter? ● What if data is… – In a standardized format? ● Every language run-time can use it. – As contiguous as possible? ● We can move it quickly without traversing object graphs
9 Apache Arrow[2] ● Standardized representation in-memory: Common Data Layer ● Columnar format – Hardware friendly while iterating over data (SIMD, caches, etc…) ● Libraries and APIs for various languages to build and access data [2] https://arrow.apache.org/
10 Arrow in-memory example
Logical table — Index | A | B | C: 0 | 1.33f | beer | {1, 3.14}; 1 | 7.01f | is | {5, 1.41}; 2 | ∅ | nice | {3, 1.61}
A value buffer (Index: Data): 0: 1.33f, 1: 7.01f, 2: X
A validity buffer (Index: Valid): 0: 1, 1: 1, 2: 0
B offset buffer (Index: Offset): 0: 0, 1: 4, 2: 6, 3: 10
B value buffer (Offset: Data): 0: b, 1: e, 2: e, 3: r, 4: i, 5: s, 6: n, 7: i, 8: c, 9: e
C.E value buffer (Index: Data): 0: 1, 1: 5, 2: 3
C.F value buffer (Index: Data): 0: 3.14, 1: 1.41, 2: 1.61
Schema X { A: Float (nullable), B: List<Char>, C: Struct{ E: Int16, F: Double } }
Arrow terminology — Schema: description of the data types in a RecordBatch. RecordBatch: tabular structure containing arrays. Arrays: combination of buffers, can be nested. Buffers: contiguous C-like arrays.
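For concreteness, the schema above could be declared with the Arrow C++ API roughly as follows; a minimal sketch in which the mapping of B: List<Char> to arrow::list(arrow::int8()) and the helper name MakeExampleSchema are illustrative choices, not taken from the slides.

#include <iostream>
#include <memory>
#include <arrow/api.h>

// Build the example schema X: A is a nullable 32-bit float, B is a list of
// 8-bit characters, and C is a struct of an Int16 field E and a Double field F.
std::shared_ptr<arrow::Schema> MakeExampleSchema() {
  auto a = arrow::field("A", arrow::float32(), /*nullable=*/true);
  auto b = arrow::field("B", arrow::list(arrow::int8()), /*nullable=*/false);
  auto c = arrow::field("C",
                        arrow::struct_({arrow::field("E", arrow::int16(), false),
                                        arrow::field("F", arrow::float64(), false)}),
                        /*nullable=*/false);
  return arrow::schema({a, b, c});
}

int main() {
  std::cout << MakeExampleSchema()->ToString() << std::endl;
  return 0;
}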
11 Integrating FPGA and Arrow ● Arrow is hardware-friendly – Standardized format ● If you know the schema, you know exactly where the data is. – Contiguous & columnar format ● Iterate over a column in streaming fashion ● Useful for: maps, reductions, filters, etc... – Parallel accessible format ● Uses offsets, not lengths, for variable length data (see the sketch below) ● Useful for: maps, reductions, filters, etc… ● We can generate easy-to-use hardware interfaces automatically
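To illustrate the offsets-versus-lengths point, here is a small host-side C++ sketch (illustrative only, not Fletcher code; PrintElement is a made-up helper): because element i of a variable-length column is bounded by offset[i] and offset[i+1], any element can be located independently, which is what allows hardware to fetch several elements in parallel.

#include <cstdint>
#include <iostream>
#include <memory>
#include <arrow/api.h>

// Element i of a variable-length (string) column spans bytes
// [offset[i], offset[i+1]) of the value buffer, so it can be located
// without scanning the preceding elements.
void PrintElement(const arrow::StringArray& column, int64_t i) {
  int32_t begin = column.value_offset(i);
  int32_t end = column.value_offset(i + 1);
  std::cout << "element " << i << " occupies bytes [" << begin << ", " << end
            << "): " << column.GetString(i) << std::endl;
}

int main() {
  arrow::StringBuilder builder;
  arrow::Status st = builder.AppendValues({"beer", "is", "nice"});
  std::shared_ptr<arrow::Array> array;
  st = builder.Finish(&array);
  auto strings = std::static_pointer_cast<arrow::StringArray>(array);
  for (int64_t i = 0; i < strings->length(); i++) {
    PrintElement(*strings, i);
  }
  return 0;
}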
12 Fletcher[3] architecture: [3] https://github.com/johanpel/fletcher
13 Generated Interface internals ● Based on streaming primitives – Slices, splitters, etc… – Arbiters, serializers, parallelizers, etc... – Normalizers, accumulators, etc… ● Each Arrow buffer gets its own BufferReader/Writer ● A combination of BufferReaders/Writers forms a ColumnReader/Writer ● Generated through pure HDL; vendor agnostic – Simulation in GHDL, QuestaSim, XSIM, PSLSE ● Verification: random schema and testbench generation, over 1000 schemas tested
14 Internals: Fixed length data (with validity bitmap) ● User streams in the first and last index in the table. ● Column Reader streams the requested rows in order. ● Internal command stream: – First element offset in the data word. – No. of valid elements in the data word. ● Response handler aligns and serializes or parallelizes the data. [Example buffers: value buffer (0: 1.33f, 1: 7.01f, 2: X) and validity buffer (0: 1, 1: 1, 2: 0).]
15 Internals: Variable length data (without validity bitmaps) [Example buffers: offset buffer (0: 0, 1: 4, 2: 6, 3: 10) and value buffer (0: b, 1: e, 2: e, 3: r, 4: i, 5: s, 6: n, 7: i, 8: c, 9: e).]
16 Internals: Structs (without validity bitmaps) [Example buffers: E value buffer (0: 1, 1: 5, 2: 3) and F value buffer (0: 3.14, 1: 1.41, 2: 1.61).]
17 Motivating use case: Regular Expression Matching ● Given N strings ● Match M regular expressions ● Count matches for each regexp ● Example:
18 Results RegExp on 1GiB of tweet-like strings.
19 Hands on: sum example ● Suppose we want to add all integers in a column. [Figure: a stream of values …, weight[2], weight[1], weight[0] feeds an accumulator that produces the result.] A host-side reference is sketched below.
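Before building hardware, it helps to pin down what the kernel must compute. A host-side reference in plain Arrow C++ (a sketch, not part of Fletcher; the function name SumWeights is invented for illustration):

#include <cstdint>
#include <arrow/api.h>

// Reference behaviour of the sum kernel: accumulate all int64 values in the
// "weight" column, skipping any null entries.
int64_t SumWeights(const arrow::Int64Array& weights) {
  int64_t sum = 0;
  for (int64_t i = 0; i < weights.length(); i++) {
    if (weights.IsValid(i)) {
      sum += weights.Value(i);
    }
  }
  return sum;
}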
20 Step 1: Generate Schema ● Currently the C++ libraries are the most advanced in Arrow ● Examples will use C++. – Python, Java, R, Rust, Go, etc… also possible ● Create a list of Schema fields: name, type, nullable ● Add metadata for Fletcher – read/write, fields to ignore – bus width, elements per cycle – No. of MMIO registers for the user, etc... ● Save the Schema as a Flatbuffer file (a fuller sketch follows below) std::vector<std::shared_ptr<arrow::Field>> schema_fields = { arrow::field("weight", arrow::int64(), false) };
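A sketch of how the snippet above could be completed: build the schema, attach Fletcher metadata, and serialize it to sum.fbs. The metadata key fletcher_mode is only illustrative (the actual keys are defined by the Fletcher toolchain), and the serialization calls follow a recent Arrow C++ API, which may differ from the release used at the time.

#include <memory>
#include <vector>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

int main() {
  // The field list from the slide: one non-nullable 64-bit integer column.
  std::vector<std::shared_ptr<arrow::Field>> schema_fields = {
      arrow::field("weight", arrow::int64(), false)};

  // Attach metadata for Fletcher (the key and value here are placeholders).
  auto metadata = arrow::key_value_metadata({"fletcher_mode"}, {"read"});
  auto schema = arrow::schema(schema_fields)->WithMetadata(metadata);

  // Serialize the schema as a Flatbuffer and write it to sum.fbs.
  auto buffer = arrow::ipc::SerializeSchema(*schema).ValueOrDie();
  auto file = arrow::io::FileOutputStream::Open("sum.fbs").ValueOrDie();
  arrow::Status st = file->Write(buffer);
  st = file->Close();
  return st.ok() ? 0 : 1;
}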
21 Step 2: Fletchgen Wrapper Generator ● Generates wrapper based on schema ● Specify desired top-level – Currently AXI is supported ● AXI4 master interface to (host/on-board) memory ● AXI4-lite slave interface for MMIO – Simulation top-level available ● Can provide Arrow RecordBatch to simulation ● Compatible with baseline project for CAPI SNAP / AWS EC2 F1 $ fletchgen --input sum.fbs --output fletcher_wrapper.vhd --axi axi_top.vhd
22 Step 3: Implement Accelerator Kernel ● Accelerator kernel template. ● Two streams appear (for this example):
weight_cmd_firstIdx : out std_logic_vector(INDEX_WIDTH-1 downto 0);
weight_cmd_lastIdx  : out std_logic_vector(INDEX_WIDTH-1 downto 0);
weight_cmd_ready    : in  std_logic;
weight_cmd_valid    : out std_logic;
weight_valid        : in  std_logic;
weight_ready        : out std_logic;
weight_data         : in  std_logic_vector(63 downto 0);
weight_last         : in  std_logic;
23 Step 4: Finishing touches ● Simulate, Debug, Place, Route ● Easy-to-use run-time interfaces provided – C++ available – Python incoming – Other languages with Arrow support in the future ● Set custom MMIO registers with the desired configuration ● Put data in an Arrow RecordBatch (see the sketch below) ● Throw it at Fletcher
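The "put data in an Arrow RecordBatch" step could look like the sketch below. This uses only the standard Arrow C++ API; the call into the Fletcher run-time itself is omitted, and the helper name MakeWeightBatch is just an illustration.

#include <cstdint>
#include <memory>
#include <vector>
#include <arrow/api.h>

// Build a RecordBatch with the single non-nullable int64 "weight" column of
// the sum example, ready to hand to the Fletcher run-time library.
std::shared_ptr<arrow::RecordBatch> MakeWeightBatch(
    const std::vector<int64_t>& values) {
  arrow::Int64Builder builder;
  arrow::Status st = builder.AppendValues(values);
  std::shared_ptr<arrow::Array> weights;
  st = builder.Finish(&weights);

  auto schema = arrow::schema({arrow::field("weight", arrow::int64(), false)});
  return arrow::RecordBatch::Make(schema, weights->length(), {weights});
}

The resulting batch is the host-side object that the Fletcher run-time exposes to the accelerator before the kernel is started.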
24 Future work ● Continued development – More applications for showcasing/verification – Support for more Arrow-supported languages ● HLS integration for map/reduce/filter lambdas ● SQL integration
25 Summary ● Accelerators can be heavily burdened by serialization overhead in heterogeneous systems ● The Apache Arrow format prevents serialization overhead and allows hardware interface generation ● Paves the way for more efficient FPGA acceleration in any of the supported languages ● Fletcher is the framework! https://github.com/johanpel/fletcher
26 References
[1] https://github.com/yahoo/TensorFlowOnSpark
[2] https://arrow.apache.org/
[3] https://github.com/johanpel/fletcher
Example projects / existing applications:
● Regular expression matching example: https://github.com/johanpel/fletcher/tree/master/examples/regexp
● Writing strings to Arrow format using CAPI 2.0 and SNAP @ 11 GB/s: https://github.com/johanpel/fletcher/tree/master/examples/stringwrite
● Posit arithmetic on FPGA, accelerated through Fletcher/SNAP, by Laurens van Dam: https://github.com/lvandam/posit_blas_hdl
● PairHMM accelerator with posit arithmetic, by Laurens van Dam & Johan Peltenburg: https://github.com/lvandam/pairhmm_posit_hdl_arrow
