Open Source BI Deep Dive Ravi Samji 21/04/2011
Agenda Business Intelligence – Why, What & Who? Open Source BI – Introduction, Tech Stack OLAP Engine – Mondrian UI Layer – JPivot Performance & Scalability Constraints
About Yodlee
Business Intelligence – Why? Data is the biggest asset Structured and Unstructured format Most of our assets are buried Helps us understand customer behavior Helps us deliver better business value Measure performance
Business Intelligence – What? Reporting Analytics Data/Text Mining ETL Predictive Analytics
Business Intelligence – Who?
Open Source BI – Introduction Mondrian – OLAP Engine Initially Independent Open Source Initiative Now Part of Pentaho Open Source BI Suite 100% Pure Java Supports MDX and XML/A Bundled With Other Open Source Packages
Open Source BI – Tech Stack JFreeChart WCF log4j log4j JPivot Mondrian RDBMS
OLAP Engine – Mondrian Cube Definition – schema.xml MDX – Query language to access multi dimensional data Operates on normalized relational database
Mondrian – schema.xml Logical model of a multi dimensional database Cube, VirtualCube Dimensions, Hierarchies, Levels Measure, CalculatedMember
Logical Model – Multi Dimensional <Schema> Database <Cube name="Sales"> <Table name="sales_fact_1997"/> <Dimension name="Gender" foreignKey="customer_id"> <Hierarchy hasAll="true" allMemberName="All Genders" primaryKey="customer_id"> <Table name="customer"/> <Level name="Gender" column="gender" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Time" foreignKey="time_id"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###"/> <Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##"/> <Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00"/> <CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]"> <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00"/> </CalculatedMember> </Cube> </Schema>
Dimensions & Shared Dimensions <Schema> <Dimension name="Time"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> <Cube name=“Warehouse"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
Hierarchies <Schema> <Dimension name="Time"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> <Hierarchy name=“Fiscal Calendar” hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column=“fiscal_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column=“fiscal_quarter" uniqueMembers="false"/> <Level name="Month" column=“fiscal_month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
Schema.xml – Extensions Plug-in classes In-line tables Views User defined functions
Extensions – Plug-in Classes Member Reader Member Formatter Cell Reader Cell Formatter Property Formatter
Extensions – In-line Tables <Dimension name="Severity"> <Hierarchy hasAll="true" primaryKey="severity_id"> <InlineTable alias="severity"> <ColumnDefs> <ColumnDef name="id" type="Numeric"/> <ColumnDef name="desc" type="String"/> </ColumnDefs> <Rows> <Row> <Value column="id">1</Value> <Value column="desc">High</Value> </Row> <Row> <Value column="id">2</Value> <Value column="desc">Medium</Value> </Row> <Row> <Value column="id">3</Value> <Value column="desc">Low</Value> </Row> </Rows> </InlineTable> <Level name="Severity" column="id" nameColumn="desc" uniqueMembers="true"/> </Hierarchy> </Dimension>
Extensions – Views <Cube name="Operations"> <View alias="StateCountyCity"> <SQL dialect="generic"> <![CDATA[ SELECT s.state_name, c.county_name, t.city_name, s.state_id, c.county_id, t.city_id FROM state s LEFT JOIN county c ON (c.state_id = s.state_id) LEFT JOIN city t ON (c.county_id = t.county_id) ]]> </SQL> </View> </Cube>
Extensions – User Defined Functions Must implement mondrian.spi.UserDefinedFunction Implementation must be available in classpath UDF Definition in schema.xml <Schema> ... <UserDefinedFunction name="PlusOne" className=“my.udf.PlusOne" /> </Schema> MDX Usage WITH MEMBER [Measures].[Unit Sales Plus One] AS 'PlusOne([Measures].[Unit Sales])' SELECT {[Measures].[Unit Sales]} ON COLUMNS, {[Gender].MEMBERS} ON ROWS FROM [Sales]
MDX / JDBC Parallels Mondrian JDBC Connection – mondrian.olap.Connection Connection – java.sql.Connection Query – mondrian.olap.Query Statement – java.sql.Statement Result – mondrian.olap.Result ResultSet – java.sql.ResultSet Access Axis & Cell from Result Access Rows & Columns from ResultSet
UI Layer – JPivot
Performance & Scalability Enable SQL statement logging to analyze mondrian generated SQL statements Index on foreign/join keys Use Aggregate Tables & Materialized Views Query results in session
Constraints Composite key joins are not supported Uniqueness within a level is not based on id Have had issues re-using same table with a different alias Make mondrian happy schema – must be normalized Requires dedicated Time dimension table
Summary 100% Pure Java BI tool Not too difficult to work with Extensible for different front-end layers Scalable Viable alternative to proprietary tools No vendor lock-in – Open Source Less TCO Quicker Time To Market

Learning Open Source Business Intelligence

  • 1.
    Open Source BI DeepDive Ravi Samji 21/04/2011
  • 2.
    Agenda Business Intelligence –Why, What & Who? Open Source BI – Introduction, Tech Stack OLAP Engine – Mondrian UI Layer – JPivot Performance & Scalability Constraints
  • 3.
  • 4.
    Business Intelligence –Why? Data is the biggest asset Structured and Unstructured format Most of our assets are buried Helps us understand customer behavior Helps us deliver better business value Measure performance
  • 5.
    Business Intelligence –What? Reporting Analytics Data/Text Mining ETL Predictive Analytics
  • 6.
  • 7.
    Open Source BI– Introduction Mondrian – OLAP Engine Initially Independent Open Source Initiative Now Part of Pentaho Open Source BI Suite 100% Pure Java Supports MDX and XML/A Bundled With Other Open Source Packages
  • 8.
    Open Source BI– Tech Stack JFreeChart WCF log4j log4j JPivot Mondrian RDBMS
  • 9.
    OLAP Engine –Mondrian Cube Definition – schema.xml MDX – Query language to access multi dimensional data Operates on normalized relational database
  • 10.
    Mondrian – schema.xml Logicalmodel of a multi dimensional database Cube, VirtualCube Dimensions, Hierarchies, Levels Measure, CalculatedMember
  • 11.
    Logical Model –Multi Dimensional <Schema> Database <Cube name="Sales"> <Table name="sales_fact_1997"/> <Dimension name="Gender" foreignKey="customer_id"> <Hierarchy hasAll="true" allMemberName="All Genders" primaryKey="customer_id"> <Table name="customer"/> <Level name="Gender" column="gender" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Time" foreignKey="time_id"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###"/> <Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##"/> <Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00"/> <CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]"> <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00"/> </CalculatedMember> </Cube> </Schema>
  • 12.
    Dimensions & SharedDimensions <Schema> <Dimension name="Time"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> <Cube name=“Warehouse"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
  • 13.
    Hierarchies <Schema> <Dimension name="Time"> <HierarchyhasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> <Hierarchy name=“Fiscal Calendar” hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column=“fiscal_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column=“fiscal_quarter" uniqueMembers="false"/> <Level name="Month" column=“fiscal_month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
  • 14.
    Schema.xml – Extensions Plug-inclasses In-line tables Views User defined functions
  • 15.
    Extensions – Plug-inClasses Member Reader Member Formatter Cell Reader Cell Formatter Property Formatter
  • 16.
    Extensions – In-lineTables <Dimension name="Severity"> <Hierarchy hasAll="true" primaryKey="severity_id"> <InlineTable alias="severity"> <ColumnDefs> <ColumnDef name="id" type="Numeric"/> <ColumnDef name="desc" type="String"/> </ColumnDefs> <Rows> <Row> <Value column="id">1</Value> <Value column="desc">High</Value> </Row> <Row> <Value column="id">2</Value> <Value column="desc">Medium</Value> </Row> <Row> <Value column="id">3</Value> <Value column="desc">Low</Value> </Row> </Rows> </InlineTable> <Level name="Severity" column="id" nameColumn="desc" uniqueMembers="true"/> </Hierarchy> </Dimension>
  • 17.
    Extensions – Views <Cubename="Operations"> <View alias="StateCountyCity"> <SQL dialect="generic"> <![CDATA[ SELECT s.state_name, c.county_name, t.city_name, s.state_id, c.county_id, t.city_id FROM state s LEFT JOIN county c ON (c.state_id = s.state_id) LEFT JOIN city t ON (c.county_id = t.county_id) ]]> </SQL> </View> </Cube>
  • 18.
    Extensions – UserDefined Functions Must implement mondrian.spi.UserDefinedFunction Implementation must be available in classpath UDF Definition in schema.xml <Schema> ... <UserDefinedFunction name="PlusOne" className=“my.udf.PlusOne" /> </Schema> MDX Usage WITH MEMBER [Measures].[Unit Sales Plus One] AS 'PlusOne([Measures].[Unit Sales])' SELECT {[Measures].[Unit Sales]} ON COLUMNS, {[Gender].MEMBERS} ON ROWS FROM [Sales]
  • 19.
    MDX / JDBCParallels Mondrian JDBC Connection – mondrian.olap.Connection Connection – java.sql.Connection Query – mondrian.olap.Query Statement – java.sql.Statement Result – mondrian.olap.Result ResultSet – java.sql.ResultSet Access Axis & Cell from Result Access Rows & Columns from ResultSet
  • 20.
  • 21.
    Performance & Scalability EnableSQL statement logging to analyze mondrian generated SQL statements Index on foreign/join keys Use Aggregate Tables & Materialized Views Query results in session
  • 22.
    Constraints Composite key joinsare not supported Uniqueness within a level is not based on id Have had issues re-using same table with a different alias Make mondrian happy schema – must be normalized Requires dedicated Time dimension table
  • 23.
    Summary 100% Pure JavaBI tool Not too difficult to work with Extensible for different front-end layers Scalable Viable alternative to proprietary tools No vendor lock-in – Open Source Less TCO Quicker Time To Market