Apache IoTDB: a Time Series Database for Industrial IoT

Xiangdong Huang (1) and Julian Feinauer (2), on behalf of the IoTDB community
(1) Tsinghua University, Beijing, China
(2) Pragmatic Minds, Stuttgart, Germany
Berlin, Germany, 2019.10.23

Outline • Who We Are • Why IoTDB Was Born • Overview of Apache IoTDB (incubating): Main Features • Working with Current Ecosystems • Performance Evaluation • Use Cases • Future Works

IoTDB • IoTDB = IoT + DB, a database for managing (Industrial) IoT data • IoTDB is an IoT DB (search Google for “IoTDB”, not “IoT DB”) • “You can find many substances about IoTDB in Germany” • IIoT: turbines, excavators, trucks, modern cars • DB: Deutsche Bahn (the real meaning)

Who We Are (The Community) • We come from the Apache IoTDB (incubating) community • A young community: entered the Apache Incubator on 2018-11-18 • Mentors: Christofer Dutz, Justin Mclean, Kevin A. McGrail (Champion), Willem Jiang • Devoted to building the best time series database for IoT in the world

Who We Are (Individual) • Xiangdong Huang (sainthxd@gmail.com) • PhD, postdoc, and now Assistant Researcher at Tsinghua University, Beijing, China • Has used Apache Cassandra (for managing time series data) since 2012 • Has developed IoTDB since 2017 • One of the initial committers of Apache IoTDB (incubating)

Who We Are (Individual) • Julian Feinauer (j.feinauer@pragmaticminds.de) • Founder of the startup pragmatic minds in Germany • The first committer who was not an initial committer • Release manager of the first IoTDB release • Committer on Apache PLC4X, Edgent, etc.

The 4th Industrial Revolution • Diagram labels: Industry 4.0, Industry Internet, advanced data analytics, Data + Model (Germany, China, USA) • Data analytics and utility is the key • Data is becoming the most important aspect of this era

Machine Data (Time Series Data): the Largest Volume in Industrial Data • Industrial big data: machine data, manufacturing enterprise data, other domain data • Diagram labels: environment, meteorology, geography; video, model, doc, drawings

How to Manage Time Series Data • Diagram labels: save data locally; Network, MQ; PI System (PI Server), RDBMS; insertion, query

How to Manage Time Series Data • Diagram labels: save data locally; Network, MQ, Database; insertion, query; Network, analysis

The Problems
Pipeline: Network, MQ, Database (insertion)
● Millions of data points per second?
● 10 million data points per second?
● Billions of data points per second?
Big data: data is produced 24/7 with high frequency and large volume
• Example: 50 Hz, 500 points/machine, 20K wind turbines, up to 500 million points/sec in total
More features
• Sometimes out of order
• Sparse table (different machines have different sensors)
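The 500 million points/sec figure follows directly from the three numbers on the slide. A quick check, with variable names chosen here only for illustration:

```python
# Figures taken from the slide; the variable names are illustrative.
sampling_hz = 50            # samples per second for each point
points_per_machine = 500    # measured points per machine
machines = 20_000           # wind turbines

points_per_second = sampling_hz * points_per_machine * machines
points_per_day = points_per_second * 24 * 60 * 60

print(f"{points_per_second:,} points/sec")
print(f"{points_per_day:,} points/day")
```

The per-day total is not on the slide; it is only the same rate multiplied by the number of seconds in a day.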

The Problems
Pipeline: Network, MQ, Database, query, analysis
Features of data queries
• The time dimension is always accessed
• Aggregation is a first-class citizen
  ■ Sometimes we do not need raw data; knowing the count/min/max/avg value is enough.
  ■ For visualization, the screen resolution is limited, e.g., 1024*768, so fetching more than 1024 points is pointless (use aggregation for downsampling)
• Time-series-specific query and analysis
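Downsampling by aggregation can be sketched in a few lines. This is a minimal illustration of the idea, not IoTDB's implementation; the function name `downsample` and the sample data are made up for the example.

```python
def downsample(points, start, end, interval):
    """Group (timestamp, value) pairs into windows of `interval` time units
    over [start, end) and return count/min/max/avg for each non-empty window."""
    windows = {}
    for t, v in points:
        if start <= t < end:
            # Align the timestamp to the start of its window.
            key = start + ((t - start) // interval) * interval
            windows.setdefault(key, []).append(v)
    return [
        {"time": k, "count": len(vs), "min": min(vs), "max": max(vs),
         "avg": sum(vs) / len(vs)}
        for k, vs in sorted(windows.items())
    ]


# 10,000 raw points, reduced to at most 1024 windows for a 1024-pixel-wide chart.
raw = [(t, t % 100) for t in range(10_000)]
interval = -(-10_000 // 1024)   # ceiling division, so the window count stays <= 1024
rows = downsample(raw, 0, 10_000, interval)
print(len(rows), rows[0])
```

A real engine would compute these aggregates from stored statistics where it can, rather than materializing every raw value as this sketch does.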

The Problems
Pipeline: Network, MQ, Database, query, analysis
● Get a mass of data QUICKLY (ETL)
● Then convert it into an analysis-friendly file format
● Time consuming

What we want: Challenges • Large Volume • High Throughput • Low Cost (historical data) • Low Latency for Query • Fast Aggregation • Query-Analysis hybrid workloads

Different Solutions for Managing Time Series
• RDB
  • Limited number of columns: 1600 columns in a table
  • Limited number of rows: <=10M rows is better
  • Manual sharding
• KVDB
  • Supports big data
  • Limited queries: lacks time filtering, value filtering, and multiple time series alignment
• TSDB
  • LSM based: efficient file structure, more query functions; not optimized for some application scenarios
  • Based on PG: auto sharding, query optimization; performance degrades sharply after writing data for a long time
  • HBase/Cassandra based: partitioned by TS-UID and time range; storage inefficiency, limited queries

Time Series DB for Industrial Internet
• “Tsinghua Shuwei” (清华数为) industrial internet time series database, now called Apache IoTDB (incubating)
• Each node can manage:
  ★ Tens of millions of time series
  ★ Trillions of data points
  ★ Tens of TB of data
• Supports Hadoop, Spark, Matlab, Grafana, etc.

Apache IoTDB Features
Persist data efficiently
• Ingestion of millions of points per sec per node
• Tens of millions of time series
Query data with low latency
• Efficiently filter data: millions of points per sec
• Aggregation: tens of ms latency on billions of points
Exclusive operations of time series
• Segmentation • Representation • Subsequence matching • Time-frequency transform • Visualization
Integration with existing ecosystem
• Kafka • MatLab • Spark • MapReduce • Grafana
Highlights
• Connecting Edge to the Cloud • Powerful query engine • User-friendly analytics
Cover the life cycle of data: Collection, Storage, Process, Learning, Application

Architecture
• IoTDB
• TsFile: time series optimized file format
• TsFile-CLI: interactive client command line
• IoTDB-CLI: interactive client command line
• IoTDB-JDBC
• Grafana-Adaptor: web dashboard to visualize time series data
• I/E Tool: batch load and export data
• IoTDBSync
• UDF: outlier detection, machine learning
• Hadoop/Spark: big data framework cluster
• Surrounding systems: other databases, applications, message queue, DevOps, device

Concepts in IoTDB (The Schema)
Device (i.e., data source)
• A machine instance, e.g., a Cadillac XT5
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• Represented as a path that begins with root, like “root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention

The schema mapping
Time series paths:
  root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain
  root.Cadillac_XT5.USA.CA.7BTC409.speed
  root.Cadillac_XT5.USA.NV.6BAC321.speed
Equivalent table (table name: Cadillac_XT5):
  country  state  device name  timestamp  fuelRemain  speed
  USA      CA     7BTC409      t1         5.0         120
  USA      CA     7BTC409      t2         4.9         109
  USA      NV     6BAC321      t1         NULL        50
  USA      NV     6BAC321      t3         NULL        65
• The country, state, and device name columns correspond to tags, and fuelRemain and speed to fields, in InfluxDB, KairosDB, OpenTSDB…
• The table name is called a measurement in InfluxDB
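The mapping from paths to table rows can be sketched as follows. This is only an illustration of the correspondence shown on the slide; the function `paths_to_rows` and the fixed six-level path layout are assumptions made for the example, not part of IoTDB.

```python
from collections import defaultdict


def paths_to_rows(points):
    """Turn (path, timestamp, value) triples into relational-style rows.

    Assumes every path has the slide's layout:
    root.<table>.<country>.<state>.<device>.<measurement>
    Returns {table: {(country, state, device, timestamp): {measurement: value}}}.
    """
    tables = defaultdict(dict)
    for path, ts, value in points:
        parts = path.split(".")
        if len(parts) != 6 or parts[0] != "root":
            raise ValueError(f"unexpected path layout: {path}")
        table, tags, measurement = parts[1], tuple(parts[2:5]), parts[5]
        row = tables[table].setdefault(tags + (ts,), {})
        row[measurement] = value
    return tables


points = [
    ("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "t1", 5.0),
    ("root.Cadillac_XT5.USA.CA.7BTC409.speed", "t1", 120),
    ("root.Cadillac_XT5.USA.NV.6BAC321.speed", "t1", 50),
]
tables = paths_to_rows(points)
for key, row in tables["Cadillac_XT5"].items():
    # A measurement the device does not report shows up as None (NULL in the table).
    print(key, row.get("fuelRemain"), row.get("speed"))
```

The second device has no fuelRemain series, which is why the relational view is sparse while the path view simply has no entry for it.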

SQL in IoTDB
Set storage group
  SET STORAGE GROUP TO root.laptop.d1.s1;
Create timeseries
  CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert data
  INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete data
  DELETE FROM d1.s1 WHERE time < 1000;
Update data
  UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query data (filter, aggregation, group by time interval)
  SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
  SELECT count(status), max_value(temperature) from root.ln.wf01.wt01;
  SELECT count(status) from root.ln.wf01.wt01 group by(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);

Supported data type • Boolean • Int • Long • Float • Double • String • GPS (TODO) -> for trajectory data management • Array (TODO) -> for unstructured data management

TsFile: Zip File Born for Time Series Data
Columnar store
• Reduces disk I/O
• Improves compression
Compression & encoding
• Improves compression greatly
• 15% better than InfluxDB in real applications
Time-domain statistics info, natively
• Supports fast queries in the time domain, the value domain, and the frequency domain (TODO)
Detailed specification:
• http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
• https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format

TsFile: comparison with Parquet
You say, “tomato”...
  Parquet        TsFile        Target in TsFile
  Row Group      Chunk Group   The data that belongs to a device instance
  Column Chunk   Chunk         The data that belongs to the device’s measurement
  Page           Page          A part of the data that belongs to a Chunk
The differences
❏ Each Page actually has two columns: the time column and the value column
❏ No Repeat and Duplication Field needed
❏ More summary info for a Page/Chunk: min/max timestamp, min/max value, count
❏ FileMetadata: devices info (Level 1, Level 2)
Diagram labels: Page Header (statistics), Page Data (Timestamps, Values)
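The value of per-page summary info can be shown with a small sketch: a count over a time range can skip pages outside the range and answer fully covered pages from their header alone. The `Page` class and `count_in_range` function below are hypothetical and do not reflect the actual TsFile layout or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    timestamps: List[int]   # the time column
    values: List[float]     # the value column

    def __post_init__(self):
        # Summary info kept in the page header (computed once, at write time).
        self.min_time = min(self.timestamps)
        self.max_time = max(self.timestamps)
        self.min_value = min(self.values)
        self.max_value = max(self.values)
        self.count = len(self.values)


def count_in_range(pages, start, end):
    """Count points with start <= t <= end.

    Returns (count, pages_decoded) so the saving is visible.
    """
    total, decoded = 0, 0
    for p in pages:
        if p.max_time < start or p.min_time > end:
            continue                      # no overlap: skipped using the header only
        if start <= p.min_time and p.max_time <= end:
            total += p.count              # fully covered: answered from the header
        else:
            decoded += 1                  # partial overlap: the data must be read
            total += sum(1 for t in p.timestamps if start <= t <= end)
    return total, decoded


# Ten pages of 100 points each, timestamps 0..999.
pages = [Page(list(range(i * 100, (i + 1) * 100)),
              [float(t) for t in range(i * 100, (i + 1) * 100)])
         for i in range(10)]
print(count_in_range(pages, 150, 749))
```

Only the two pages at the edges of the range are decoded; the same idea applies to min/max value filters using `min_value` and `max_value`.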

TsFile: comparison with Parquet • Apache Parquet: general file format • TsFile: time series data • Diagram labels: Chunk Group, Chunk, File Metadata; Time Series (Time1, Value1, Time2, Value2)

TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• 128, 136, 144, 152, 160, … has 1st differences 8, 8, 8, 8 (constant) and 2nd differences 0, 0, 0: 1-bit storage needed!
• 128, 135, 143, 154, 163, … has 1st differences 7, 8, 11, 9 (not constant) and 2nd differences 1, 3, -2: 2-bit storage needed!
• Unified support of fixed-frequency and irregular-frequency time series
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff
• See next page
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with a variable-length encoding scheme
RLE encoding – repeated Int or Long
• For a repeated sequence: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Compression algorithms
• Snappy • Gzip (TODO) • LZO (TODO)
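The second-difference idea and run-length encoding can both be sketched briefly. This is a simplified illustration using the slide's own number sequences; it does not reproduce TsFile's actual bit layout, and the function names are made up for the example.

```python
from itertools import groupby


def deltas(xs):
    """Differences between consecutive elements."""
    return [b - a for a, b in zip(xs, xs[1:])]


def second_diff_encode(timestamps):
    """Keep the first value, the first delta, and the second-order deltas."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first = deltas(timestamps)
    return timestamps[0], first[0], deltas(first)


def second_diff_decode(t0, d0, second):
    """Rebuild the original sequence from the encoded form."""
    out, d = [t0, t0 + d0], d0
    for dd in second:
        d += dd
        out.append(out[-1] + d)
    return out


def rle_encode(values):
    """Run-length encoding: one (value, count) pair per run."""
    return [(v, len(list(run))) for v, run in groupby(values)]


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
print(second_diff_encode(regular))      # second differences are all zero
print(second_diff_encode(irregular))    # small second differences
print(rle_encode([7, 7, 7, 7, 2, 2, 9]))
```

The small second-order values are what a bit-packing step would then store in a few bits each; that step is left out here.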

TsFile: Encoding and Compression • Adaptive TS2Diff encoding – Int or Long (TODO) • For time series with outliers or missing points • Stores second-order delta values and a boolean flag array

Data Query
• Only records root nodes in memory and builds virtual trees, to reduce memory cost and disk I/O
• Fast aggregation method for time series
IoTDB-SQL: a concise time series operations language
• DML, read (R): select (raw, aggregate), filter, device (single, across), metric (single, across), time (certain, range), group by (time interval, series), order by (ASC, DESC), fill (interpolation, latest), limit, slimit, index
• DML, write: C, U, D
• DDL
✔ 8 types of sub-clause
✔ ≥1052 kinds of query
JDBC: reduces the cost of learning
Interfaces: JDBC, TsFile API, CLI, etc.

Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
✔ Split the pattern and the data stream into equal-length fragments
✔ Extract features to reduce the dimension
✔ Accelerate the search by using features
✔ Scenario: real-time fault alarms
  SELECT wind_3s FROM china.farm1.tb2 WHERE time > t1 AND time < t2 AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
✔ Indexing data in key-value form
✔ Scenarios: outlier detection, historical data analysis, …

From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: fast writes, high compression ratio, support for simple queries
• Simply put, TsFile is a zip file for time series data
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on the time series of multiple TsFiles, including CRUD and advanced queries like max, min, avg, and temporal alignment
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd-party systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center

A Process to Manage Time Series Data • Write: data source, or JDBC / Session API • Read: JDBC / Session API • Visualization (manual data exploration): Grafana-Adaptor • Analysis with a big data framework (big data sets): Spark-TsFile-Adaptor • Analysis with Matlab (small data sets): JDBC

Using JDBC to Write Data • set storage group • create timeseries • insert data • https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1

Using Session API to Write Data (more efficient) • set storage group • create timeseries • insert data

Using JDBC to Query Data • raw data query • aggregation query • downsampling query • print result • https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1

Using Grafana to Visualize Data https://iotdb.apache.org/#/Tools/Grafana • Install simple-json-datasource plugin • Config iotdb-grafana-connector • application.properties • Start iotdb-grafana-connector • java -jar iotdb-grafana-0.8.0.war • Add IoTDB data source(Simplejson) • choose connector IP • Config dashboard and Enjoy!

Using Matlab to Analyze Data • read from IoTDB via JDBC • fast Fourier transform • plot
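The slide's analysis step is done in Matlab. As a language-neutral illustration of what that step computes, here is a minimal radix-2 FFT in plain Python applied to a synthetic signal; the signal stands in for values read from IoTDB and is an assumption of this sketch, as are the sampling rate and frequency.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


# Synthetic stand-in for sensor readings: a 5 Hz sine sampled at 64 Hz for one second.
n, rate, freq = 64, 64, 5
signal = [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

spectrum = fft(signal)
magnitudes = [abs(c) for c in spectrum[: n // 2]]   # keep the non-mirrored half
peak_bin = magnitudes.index(max(magnitudes))
print("dominant frequency:", peak_bin * rate / n, "Hz")
```

In practice a library routine (Matlab's `fft`, for example) would be used; the recursion here only shows the transform that sits between the read and plot steps.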

Using Spark to Analyze Data • create table • SQL query • read TsFile • write to TsFile • https://iotdb.apache.org/#/Tools/Spark

Demo • Writing Data Locally • Show data with Grafana • Analyze data using SparkSQL • https://github.com/jixuan1989/iotdb-tutorial

Demo Video • Writing Data on HDFS directly • using Hive to analyze it • Video

Language • Written in Java • The RPC is implemented with Thrift • so it is easy to get APIs for other languages

Say Hi to the Apache Ecosystem
In the IoTDB repository:
• RocketMQ: https://github.com/apache/incubator-iotdb/tree/master/example/rocketmq
• Kafka: https://github.com/apache/incubator-iotdb/tree/master/example/kafka
Third party:
• EMQx (MQTT server): https://github.com/jixuan1989/iotdb-tutorial
• Spark: https://github.com/jixuan1989/iotdb-tutorial
• Calcite: https://github.com/EJTTianYu/iotdb-calcite
• PLC4X:
• MapReduce:

Application 1: The Next Generation of Big Data Platform for Meteorology • 1073 kinds of meteorological data • ~150K stations collect more than 100 metrics every 5 minutes • The platform is deployed across China • After the upgrade, performance improved by two orders of magnitude

Application 2: Data Management for Equipment Monitoring
• The data records the operational status of the equipment, e.g., a vehicle’s speed, fuel consumption and malfunctions
• TIANYUAN (with Komatsu); Komatsu excavator
• Diagram labels: execute, collect, decision, transfer
• Table columns (values not recoverable): #devices (excavators etc.), #metrics, collection times per minute
• Before: sharding every day; only 3 months of data stored; more than 10 minutes for some queries
• After: the whole data set is stored; several seconds for complex queries

Application 3: Shanghai METRO Monitoring
• Before: 144 trains, 3200 points/500 ms/train, stored in KairosDB + Cassandra (the diagram shows the counts 9 and 14)
• After the upgrade: 300 trains, 3200 points/200 ms/train, 414 billion data points per day, using just ONE IoTDB instance
• KDB-compatible RESTful services are kept just to avoid modifying current programs
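The 414 billion figure can be checked from the slide's own numbers. The helper below is written just for this check; the "before" total is derived the same way and is not stated on the slide.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def points_per_day(trains, points_per_sample, interval_ms):
    """Total data points per day for a fleet sampled at a fixed interval."""
    samples_per_day = MS_PER_DAY // interval_ms
    return trains * points_per_sample * samples_per_day


before = points_per_day(trains=144, points_per_sample=3200, interval_ms=500)
after = points_per_day(trains=300, points_per_sample=3200, interval_ms=200)

print(f"before: {before:,} points/day")
print(f"after:  {after:,} points/day")
print(f"growth: {after / before:.2f}x")
```

The result for 300 trains is about 414.7 billion points per day, which matches the slide's rounded figure.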

Future Works • Make it easy to use! • Relational model: integration with Calcite • step 1: support relational SQL • step 2: standard JDBC • Big Data! • better integration with Hive, etc. • Cluster! • writing data on HDFS is supported now, but a shared-nothing architecture is wanted • Advanced functions! • integration with data streaming engines, etc.

Join Us • Mailing list: • subscribe: dev-subscribe@iotdb.incubator.apache.org • discussion: dev@iotdb.apache.org • Bug reports: https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB • Website: https://iotdb.apache.org • Ecosystem target: IoTDB v0.8.0 is released! (the first Apache release version)
