1 The Fletcher Framework for Programming FPGAs OpenPOWER Summit Europe October 3, 2018 Johan Peltenburg Accelerated Big Data Systems Quantum & Computer Engineering Department Delft University of Technology
2 Additional Credits Development (TU Delft) Jeroen van Straten Matthijs Brobbel Laurens van Dam Lars Wijtemans Lars van Leeuwen Support Zaid Al-Ars (TU Delft) Peter Hofstee (IBM/TU Delft) Jinho Lee (IBM) CAPI SNAP team (IBM) Cathal McCabe (Xilinx)
3 Heterogeneous software processes
Language | Runs on top of... | Methods
C/C++, Fortran, Rust, Julia, ... | CPU | Compiled to machine instructions (sometimes called native instructions)
Java, Scala | Java Virtual Machine | Compiled to Java Bytecode; could be Just-In-Time compiled to machine instructions
Python, R | Interpreter | Interpreted; strong integration with native libraries
4 Heterogeneous computing ● Big data systems are becoming increasingly heterogeneous. – Many different “types” of processes in both SW and HW. ● Example: TensorFlowOnSpark[1] – You can run a Python program ● That uses NumPy (Python bindings on top of a C core) – Interfacing with TensorFlow, programmed in CUDA ● Running on a GPU – On top of Spark, written in Scala/Java ● Running on a Java Virtual Machine ● That runs on your CPU ● What challenges does this bring? [1] https://github.com/yahoo/TensorFlowOnSpark
5 A string [Figure: in-memory layout of a string in C++ (string size, pointer to char buffer, internal char array optionally used, optionally allocated char array), Java 8 (JVM object header, hash cache, UTF-16 array reference, plus a separate JVM array object holding the UTF-16 array), Python (variable-length object header, hash, state, variable-length character array), and on the FPGA (a length stream and a character stream).]
6 Serialized collection in shared memory [Figure: Collection X in the memory of Process A is serialized into an intermediate format and deserialized into the memory of Process B.] Serialization means: ● Iterate over all objects in the collection ● Traverse all object graphs (memory latency) ● Copy fields to some intermediate format both A and B understand (bandwidth lost) ● Reconstruct objects in B (allocations) A toy example of this cost is sketched below.
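As a toy illustration of that cost (plain C++, not from the slides; the Serialize/Deserialize helpers are purely illustrative): moving a collection of strings between processes this way means walking every object, copying its bytes into an intermediate buffer, and re-allocating every object on the receiving side.

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Serialize: walk every string and copy its length and bytes into one
// contiguous intermediate buffer (extra bandwidth spent on the copy).
std::vector<uint8_t> Serialize(const std::vector<std::string>& strings) {
  std::vector<uint8_t> buffer;
  for (const auto& s : strings) {
    uint32_t len = static_cast<uint32_t>(s.size());
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
    buffer.insert(buffer.end(), p, p + sizeof(len));
    buffer.insert(buffer.end(), s.begin(), s.end());
  }
  return buffer;
}

// Deserialize: reconstruct every string object on the receiving side
// (one allocation per object).
std::vector<std::string> Deserialize(const std::vector<uint8_t>& buffer) {
  std::vector<std::string> strings;
  size_t pos = 0;
  while (pos + sizeof(uint32_t) <= buffer.size()) {
    uint32_t len;
    std::memcpy(&len, buffer.data() + pos, sizeof(len));
    pos += sizeof(len);
    strings.emplace_back(reinterpret_cast<const char*>(buffer.data() + pos), len);
    pos += len;
  }
  return strings;
}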
7 Relative impact on accelerators [Figure: timeline comparison of CPU compute time, (de)serialize/copy time, and accelerator compute time for three cases: the original process on the CPU, the accelerated process on GPGPU/FPGA with serialization, and the desired acceleration profile.]
8 Overcoming serialization bottlenecks ● We (de)serialize a lot… Can we do this smarter? ● What if data is… – In a standardized format? ● Every language run-time can use it. – As contiguous as possible? ● We can move it quickly without traversing object graphs
9 Apache Arrow[2] ● Standardized representation in-memory: Common Data Layer ● Columnar format – Hardware friendly while iterating over data (SIMD, caches, etc…) ● Libraries and APIs for various languages to build and access data [2] https://arrow.apache.org/
10 Arrow in-memory example
Logical table — Index | A | B | C: 0 | 1.33f | beer | {1, 3.14}; 1 | 7.01f | is | {5, 1.41}; 2 | ∅ | nice | {3, 1.61}
A value buffer (Index: Data): 0: 1.33f, 1: 7.01f, 2: X
A validity buffer (Index: Valid): 0: 1, 1: 1, 2: 0
B offset buffer (Index: Offset): 0: 0, 1: 4, 2: 6, 3: 10
B value buffer (Offset: Data): 0: b, 1: e, 2: e, 3: r, 4: i, 5: s, 6: n, 7: i, 8: c, 9: e
C.E value buffer (Index: Data): 0: 1, 1: 5, 2: 3
C.F value buffer (Index: Data): 0: 3.14, 1: 1.41, 2: 1.61
Schema X { A: Float (nullable), B: List<Char>, C: Struct{ E: Int16, F: Double } }
Arrow terminology — Schema: description of the data types in a RecordBatch. RecordBatch: tabular structure containing arrays. Arrays: combination of buffers, can be nested. Buffers: contiguous C-like arrays.
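For concreteness, the schema above could be declared with the Arrow C++ API roughly as follows; a minimal sketch in which the mapping of B: List<Char> to arrow::list(arrow::int8()) and the helper name MakeExampleSchema are illustrative choices, not taken from the slides.

#include <iostream>
#include <memory>
#include <arrow/api.h>

// Build the example schema X: A is a nullable 32-bit float, B is a list of
// 8-bit characters, and C is a struct of an Int16 field E and a Double field F.
std::shared_ptr<arrow::Schema> MakeExampleSchema() {
  auto a = arrow::field("A", arrow::float32(), /*nullable=*/true);
  auto b = arrow::field("B", arrow::list(arrow::int8()), /*nullable=*/false);
  auto c = arrow::field("C",
                        arrow::struct_({arrow::field("E", arrow::int16(), false),
                                        arrow::field("F", arrow::float64(), false)}),
                        /*nullable=*/false);
  return arrow::schema({a, b, c});
}

int main() {
  std::cout << MakeExampleSchema()->ToString() << std::endl;
  return 0;
}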
11 Integrating FPGA and Arrow ● Arrow is hardware-friendly – Standardized format ● If you know the schema, you know exactly where the data is. – Contiguous & columnar format ● Iterate over a column in streaming fashion ● Useful for: maps, reductions, filters, etc... – Parallel accessible format ● Uses offsets, not lengths, for variable length data (see the sketch below) ● Useful for: maps, reductions, filters, etc… ● We can generate easy-to-use hardware interfaces automatically
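To illustrate the offsets-versus-lengths point, here is a small host-side C++ sketch (illustrative only, not Fletcher code; PrintElement is a made-up helper): because element i of a variable-length column is bounded by offset[i] and offset[i+1], any element can be located independently, which is what allows hardware to fetch several elements in parallel.

#include <cstdint>
#include <iostream>
#include <memory>
#include <arrow/api.h>

// Element i of a variable-length (string) column spans bytes
// [offset[i], offset[i+1]) of the value buffer, so it can be located
// without scanning the preceding elements.
void PrintElement(const arrow::StringArray& column, int64_t i) {
  int32_t begin = column.value_offset(i);
  int32_t end = column.value_offset(i + 1);
  std::cout << "element " << i << " occupies bytes [" << begin << ", " << end
            << "): " << column.GetString(i) << std::endl;
}

int main() {
  arrow::StringBuilder builder;
  arrow::Status st = builder.AppendValues({"beer", "is", "nice"});
  std::shared_ptr<arrow::Array> array;
  st = builder.Finish(&array);
  auto strings = std::static_pointer_cast<arrow::StringArray>(array);
  for (int64_t i = 0; i < strings->length(); i++) {
    PrintElement(*strings, i);
  }
  return 0;
}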
12 Fletcher[3] architecture: [3] https://github.com/johanpel/fletcher
13 Generated Interface internals ● Based on streaming primitives – Slices, splitters, etc… – Arbiters, serializers, parallelizers, etc... – Normalizers, accumulators, etc… ● Each Arrow buffer gets its own BufferReader/Writer ● A combination of BufferReaders/Writers forms a ColumnReader/Writer ● Generated through pure HDL; vendor agnostic – Simulation in GHDL, QuestaSim, XSIM, PSLSE ● Verification: random schema and testbench generation, over 1000 schemas tested
14 Internals: Fixed length data (with validity bitmap) ● User streams in the first and last index in the table. ● Column Reader streams the requested rows in order. ● Internal command stream: – First element offset in the data word. – No. of valid elements in the data word. ● Response handler aligns and serializes or parallelizes the data. [Example buffers: value buffer (0: 1.33f, 1: 7.01f, 2: X) and validity buffer (0: 1, 1: 1, 2: 0).]
15 Internals: Variable length data (without validity bitmaps) [Example buffers: offset buffer (0: 0, 1: 4, 2: 6, 3: 10) and value buffer (0: b, 1: e, 2: e, 3: r, 4: i, 5: s, 6: n, 7: i, 8: c, 9: e).]
16 Internals: Structs (without validity bitmaps) [Example buffers: E value buffer (0: 1, 1: 5, 2: 3) and F value buffer (0: 3.14, 1: 1.41, 2: 1.61).]
17 Motivating use case: Regular Expression Matching ● Given N strings ● Match M regular expressions ● Count matches for each regexp ● Example:
18 Results RegExp on 1GiB of tweet-like strings.
19 Hands on: sum example ● Suppose we want to add all integers in a column. [Figure: a stream of values …, weight[2], weight[1], weight[0] feeds an accumulator that produces the result.] A host-side reference is sketched below.
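Before building hardware, it helps to pin down what the kernel must compute. A host-side reference in plain Arrow C++ (a sketch, not part of Fletcher; the function name SumWeights is invented for illustration):

#include <cstdint>
#include <arrow/api.h>

// Reference behaviour of the sum kernel: accumulate all int64 values in the
// "weight" column, skipping any null entries.
int64_t SumWeights(const arrow::Int64Array& weights) {
  int64_t sum = 0;
  for (int64_t i = 0; i < weights.length(); i++) {
    if (weights.IsValid(i)) {
      sum += weights.Value(i);
    }
  }
  return sum;
}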
20 Step 1: Generate Schema ● Currently the C++ libraries are the most advanced in Arrow ● Examples will use C++. – Python, Java, R, Rust, Go, etc… also possible ● Create a list of Schema fields: name, type, nullable ● Add metadata for Fletcher – read/write, fields to ignore – bus width, elements per cycle – No. of MMIO registers for the user, etc... ● Save the Schema as a Flatbuffer file (a fuller sketch follows below) std::vector<std::shared_ptr<arrow::Field>> schema_fields = { arrow::field("weight", arrow::int64(), false) };
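A sketch of how the snippet above could be completed: build the schema, attach Fletcher metadata, and serialize it to sum.fbs. The metadata key fletcher_mode is only illustrative (the actual keys are defined by the Fletcher toolchain), and the serialization calls follow a recent Arrow C++ API, which may differ from the release used at the time.

#include <memory>
#include <vector>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

int main() {
  // The field list from the slide: one non-nullable 64-bit integer column.
  std::vector<std::shared_ptr<arrow::Field>> schema_fields = {
      arrow::field("weight", arrow::int64(), false)};

  // Attach metadata for Fletcher (the key and value here are placeholders).
  auto metadata = arrow::key_value_metadata({"fletcher_mode"}, {"read"});
  auto schema = arrow::schema(schema_fields)->WithMetadata(metadata);

  // Serialize the schema as a Flatbuffer and write it to sum.fbs.
  auto buffer = arrow::ipc::SerializeSchema(*schema).ValueOrDie();
  auto file = arrow::io::FileOutputStream::Open("sum.fbs").ValueOrDie();
  arrow::Status st = file->Write(buffer);
  st = file->Close();
  return st.ok() ? 0 : 1;
}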
21 Step 2: Fletchgen Wrapper Generator ● Generates wrapper based on schema ● Specify desired top-level – Currently AXI is supported ● AXI4 master interface to (host/on-board) memory ● AXI4-lite slave interface for MMIO – Simulation top-level available ● Can provide Arrow RecordBatch to simulation ● Compatible with baseline project for CAPI SNAP / AWS EC2 F1 $ fletchgen --input sum.fbs --output fletcher_wrapper.vhd --axi axi_top.vhd
22 Step 3: Implement Accelerator Kernel ● Accelerator kernel template. ● Two streams appear (for this example):
weight_cmd_firstIdx : out std_logic_vector(INDEX_WIDTH-1 downto 0);
weight_cmd_lastIdx  : out std_logic_vector(INDEX_WIDTH-1 downto 0);
weight_cmd_ready    : in  std_logic;
weight_cmd_valid    : out std_logic;
weight_valid        : in  std_logic;
weight_ready        : out std_logic;
weight_data         : in  std_logic_vector(63 downto 0);
weight_last         : in  std_logic;
23 Step 4: Finishing touches ● Simulate, Debug, Place, Route ● Easy-to-use run-time interfaces provided – C++ available – Python incoming – Other languages with Arrow support in the future ● Set custom MMIO registers with the desired configuration ● Put data in an Arrow RecordBatch (see the sketch below) ● Throw it at Fletcher
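The "put data in an Arrow RecordBatch" step could look like the sketch below. This uses only the standard Arrow C++ API; the call into the Fletcher run-time itself is omitted, and the helper name MakeWeightBatch is just an illustration.

#include <cstdint>
#include <memory>
#include <vector>
#include <arrow/api.h>

// Build a RecordBatch with the single non-nullable int64 "weight" column of
// the sum example, ready to hand to the Fletcher run-time library.
std::shared_ptr<arrow::RecordBatch> MakeWeightBatch(
    const std::vector<int64_t>& values) {
  arrow::Int64Builder builder;
  arrow::Status st = builder.AppendValues(values);
  std::shared_ptr<arrow::Array> weights;
  st = builder.Finish(&weights);

  auto schema = arrow::schema({arrow::field("weight", arrow::int64(), false)});
  return arrow::RecordBatch::Make(schema, weights->length(), {weights});
}

The resulting batch is the host-side object that the Fletcher run-time exposes to the accelerator before the kernel is started.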
24 Future work ● Continued development – More applications for showcasing/verification – Support for more Arrow-supported languages ● HLS integration for map/reduce/filter lambdas ● SQL integration
25 Summary ● Accelerators can be heavily burdened by serialization overhead in heterogeneous systems ● The Apache Arrow format prevents serialization overhead and allows hardware interface generation ● Paves the way for more efficient FPGA acceleration in any of the supported languages ● Fletcher is the framework! https://github.com/johanpel/fletcher
26 References
[1] https://github.com/yahoo/TensorFlowOnSpark
[2] https://arrow.apache.org/
[3] https://github.com/johanpel/fletcher
Example projects / existing applications:
● Regular expression matching example: https://github.com/johanpel/fletcher/tree/master/examples/regexp
● Writing strings to Arrow format using CAPI 2.0 and SNAP @ 11 GB/s: https://github.com/johanpel/fletcher/tree/master/examples/stringwrite
● Posit arithmetic on FPGA, accelerated through Fletcher/SNAP, by Laurens van Dam: https://github.com/lvandam/posit_blas_hdl
● PairHMM accelerator with posit arithmetic, by Laurens van Dam & Johan Peltenburg: https://github.com/lvandam/pairhmm_posit_hdl_arrow
