FBTFTP
 Facebook’s Python 3 open-source framework to build dynamic TFTP servers
 Angelo Failla, Production Engineer
 Cluster Infrastructure team, Facebook Ireland
Who am I?
 • A Production Engineer (similar to SRE / DevOps)
 • Based in Dublin, Facebook Ireland, since 2011
 • Cluster Infrastructure team member: the team owns data center core services and end-to-end automation for bare-metal provisioning and cluster management
“There is no cloud, just other people’s computers…” - a (very wise) person on the interwebz “… and someone’s got to provision them.” - Angelo
POPs (Points of Presence) and data center locations. (POP locations shown are fictional.)
HANDS-FREE PROVISIONING:
[Diagram: everything involved in provisioning a server — BIOS/UEFI firmware, DHCP (v4/v6), TFTP, bootloader and its config, kernel/initrd, kickstart/anaconda, RPM locations (HTTP repos), partitioning schemas, Chef tier, 3rd-party tools, OOB, server type/vendor/model, MySQL inventory system, build-control tooling ("cyborg")]
TFTP
• Common in data center / ISP environments
• Simple protocol specification, easy to implement
• UDP-based → small code footprint, fits in small boot ROMs
• Used by embedded devices and network equipment
• Traditionally used for netboot (with DHCPv[46])
Provisioning phases: POWER ON → DHCPv[46] (KEA) → TFTP → NBP → NETBOOT → ANACONDA → CHEF → REBOOT → PROVISIONED
• DHCP: provides network config and the path to the NBP binaries
• TFTP: provides the NBPs, config files for the NBPs, and kernel/initrd
• NBP: fetches its config via TFTP, then fetches kernel/initrd (via HTTP or TFTP)
A 30+ year-old protocol (photo: me, circa 1982)
Protocol in a nutshell (RRQ):
• the client sends an RRQ from its port X to server port 69
• the server answers from a fresh port Y: DAT 1 → ACK 1, … DAT N → ACK N — lock-step, one block per round trip
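The exchange above can be sketched with the wire formats from RFC 1350. A minimal illustration of the three packet types (this is hand-rolled for clarity, not fbtftp's internal API):

```python
import struct

def encode_rrq(filename, mode="octet"):
    # RRQ: opcode 1, then NUL-terminated filename and transfer mode.
    return struct.pack("!H", 1) + filename.encode() + b"\x00" + mode.encode() + b"\x00"

def encode_data(block_no, payload):
    # DATA: opcode 3, 16-bit block number, up to 512 bytes of payload
    # (a short final payload signals end of transfer).
    return struct.pack("!HH", 3, block_no) + payload

def encode_ack(block_no):
    # ACK: opcode 4, 16-bit block number.
    return struct.pack("!HH", 4, block_no)

print(encode_rrq("pxelinux.0"))  # → b'\x00\x01pxelinux.0\x00octet\x00'
```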
Latency between a POP and the DC: ~150 ms (POP locations are fictional). A client in a POP fetching from a TFTP server in the DC (RRQ to port 69, then DAT/ACK lock-step) sees:

File size   Block size      Latency   Time to download
80 MB       512 B           150 ms    12.5 hours
80 MB       1400 B          150 ms    4.5 hours
80 MB       512 B / 1400 B  1 ms      < 1 minute
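The table above follows directly from the lock-step design: one DATA/ACK round trip per block. A back-of-the-envelope check (assuming 80 MB = 80·10^6 bytes and RTT = 2 × one-way latency; the slide's exact figures likely use slightly different conventions):

```python
import math

def tftp_hours(file_bytes, block_bytes, one_way_latency_s):
    # Lock-step TFTP: total time ~= number of blocks * round-trip time.
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks * (2 * one_way_latency_s) / 3600

print(round(tftp_hours(80e6, 512, 0.150), 1))   # → 13.0 (≈ the 12.5 h row)
print(round(tftp_hours(80e6, 1400, 0.150), 1))  # → 4.8  (≈ the 4.5 h row)
```

At 1 ms latency the transfer stops being latency-bound and bandwidth dominates, hence the sub-minute figure in the last row.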
A look at the past, ~2014 (and its problems)
Setup: a hardware load balancer exposing a cluster VIP in front of an active/passive in.tftpd pair; automation writes the configs; a 7 GB repo rsynced to the servers.
Problems:
• Physical load balancers — a waste of resources
• Automation needs to know which server is active
• No stats
• TFTP is a bad protocol in high-latency environments
• Too many moving parts
How did we solve those problems?
We built FBTFTP… a Python 3 framework to build dynamic TFTP servers
• Supports only RRQ (the fetch operation)
• Implements the main TFTP spec [1], the Option Extension [2], the Block Size option [3], and the Timeout Interval and Transfer Size options [4]
• Extensible:
  • Define your own logic
  • Push your own statistics (per session or global)
[1] RFC 1350, [2] RFC 2347, [3] RFC 2348, [4] RFC 2349
Framework overview [diagram]: a client's RRQ reaches BaseServer, which calls get_handler() and fork()s a child process; the BaseHandler child runs the transfer session and calls get_response_data() to obtain a ResponseData object; server and session callbacks push stats to the monitoring infrastructure.
Example: a simple server serving files from disk
A file-like class that represents a file being served:

class FileResponseData(ResponseData):
    def __init__(self, path):
        self._size = os.stat(path).st_size
        self._reader = open(path, 'rb')

    def read(self, n):
        return self._reader.read(n)

    def size(self):
        return self._size

    def close(self):
        self._reader.close()
A class that deals with a transfer session:

class StaticHandler(BaseHandler):
    def __init__(self, server_addr, peer, path, options, root,
                 stats_callback):
        super().__init__(server_addr, peer, path, options, stats_callback)
        self._root = root
        self._path = path

    def get_response_data(self):
        return FileResponseData(os.path.join(self._root, self._path))
The BaseServer class ties everything together:

class StaticServer(BaseServer):
    def __init__(self, address, port, retries, timeout, root,
                 handler_stats_callback, server_stats_callback):
        self._root = root
        self._handler_stats_callback = handler_stats_callback
        super().__init__(address, port, retries, timeout,
                         server_stats_callback)

    def get_handler(self, server_addr, peer, path, options):
        return StaticHandler(server_addr, peer, path, options,
                             self._root, self._handler_stats_callback)
The “main”:

def print_session_stats(stats):
    print(stats)

def print_server_stats(stats):
    counters = stats.get_and_reset_all_counters()
    print('Server stats - every {} seconds'.format(stats.interval))
    print(counters)

server = StaticServer(
    address='', port=69, retries=3, timeout=5, root='/var/tftproot/',
    handler_stats_callback=print_session_stats,
    server_stats_callback=print_server_stats)
try:
    server.run()
except KeyboardInterrupt:
    server.close()
How do we use it?
fbtftp servers sit in front of the provisioning backends and an HTTP repo; static files are cached on local disk, dynamic files are generated per request, and requests can hit any server.
Improvements:
• No more physical LBs — no waste of resources
• Stats!
• TFTP servers are dynamic: config files (e.g. GRUB/iPXE configs) are generated, static files are streamed
• You can hit any server — no need to rsync data
• Container-friendly
Routing TFTP traffic
The LBs are gone: which TFTP server will serve a given client?
• NetNorad publishes latency maps periodically; DHCP consumes them, combining the location of the server to provision, TFTP health checks, and service discovery to pick the closest TFTP server.
• Read about NetNorad on our blog: http://tinyurl.com/hacrw7c
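The selection step above can be sketched as follows. All names and data here are illustrative (a hypothetical latency map and health-check set), not fbtftp or NetNorad APIs:

```python
# Hypothetical latency map: client location -> {tftp server: latency in ms},
# as DHCP might consume it from periodic NetNorad-style publications.
LATENCY_MS = {
    "pop1": {"tftp1.pop1": 2.0, "tftp2.pop1": 3.5, "tftp1.dc": 150.0},
}

# Result of TFTP health checks / service discovery (tftp1.pop1 is down).
HEALTHY = {"tftp2.pop1", "tftp1.dc"}

def closest_tftp_server(client_location):
    # Keep only healthy candidates, then pick the lowest-latency one.
    candidates = {
        server: ms
        for server, ms in LATENCY_MS[client_location].items()
        if server in HEALTHY
    }
    return min(candidates, key=candidates.get)

print(closest_tftp_server("pop1"))  # → tftp2.pop1
```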
Local fbtftp instances in each POP (POP1, POP2, …) cache static files; they fetch from the closest origin (the DC) only on cache misses or when files have changed. (POP locations are fictional.)
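The POP-side cache described above can be sketched like this. It is a hypothetical helper, not part of fbtftp (which only needs a ResponseData-like object on top of it); the origin URL and cache layout are illustrative:

```python
import os
import urllib.request

def cached_path(cache_root, origin_url, name):
    # Serve from local disk; fetch from the closest origin only on a miss.
    local = os.path.join(cache_root, name)
    if not os.path.exists(local):  # cache miss: pull from origin once
        urllib.request.urlretrieve(origin_url + "/" + name, local)
    return local
```

A real implementation would also revalidate when the origin copy changes (e.g. via checksum or mtime comparison), matching the "or if files changed" condition above.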
Thanks for listening! Feel free to email me at pallotron@fb.com Project home:
 https://github.com/facebook/fbtftp/ Install and play with it: 
 $ pip3 install fbtftp Poster session Tuesday at 14.45:
 Python in Production Engineering
