Skip to content

Cache 0.1 #81

@wangxiaoying

Description

@wangxiaoying

Summary

Simple cache support for query result in order to speed up loading the same query multiple times. Extensible to more sophisticated design and implementation.

Design-level Explanation Actions

  • Investigate existing cache / materialized view solutions for external storage
  • Decide the scope of the first version:
    • what to cache (admission and eviction)
    • how to use the cache
    • how to maintain the cache
  • Decide the storage engine to store local cache in the first version (sqlite)
  • Design the implementation of cache logic

Design-level Explanation

Scope of Cache 0.1

  • What to cache
    • Manually add query result to cache
    • Do not remove cached data (do not consider limited disk space)
  • How to use the cache
    • Load the entire cache if and only if exactly match on the entire query (without do any filtering or manipulation)
  • How to maintain the cache
    • Manually force refreshing the cache when issuing query
    • Refresh the entire cache (do not consider incremental refresh)

API Design

User Interface

def read_sql(..., enable_cache=True, force_query=False)
  • enable_cache(string or bool, optional(default True)) - Whether or not to cache the result data. If False is set, do not cache the result. If True is set, cache the result to /tmp with connection and sql as name. If a string is set, cache the result to the corresponding path.
  • force_query(bool, optional(default False)) - Whether or not to force download the data from database no matter there is a cache or not. If True, also update the local cache.

Logical workflow

workflow

Cache Module Implementation

pub trait Cache { fn init(conn: str) -> Result<()>; // init cache source, init metadata if not exists fn query_match(query: str) -> Result<(Vec<str>, Vec<str>)>; // lookup metadata, split query into probe query and remainder query, and partition each query fn post_execute(dests: Vec<Box<dyn Destination>>) -> Result<Destination>; // produce final result }

overview

Implementation-level Explanation

Left: current implementation, Right: implementation supporting cache

implementation

Either cache_queries or db_queries will be empty in the first version that only support exactly match.

Rational and Alternatives

  • Able to extend to more cache backends, data format and policies in the future.
  • Easy to incorporate in the current workflow of ConnectorX
  • Able to make use current loading and writing mechanism in ConnectorX

Prior Art

Some related works:

Future Possibilities

  • Support different external storages for cache (e.g. Redis)
  • Support using cache on partial match queries
    • partial attribute
    • partial predicate
    • partial table (in join)
  • Admission and eviction policy given limited space budget
  • Incremental refresh the cache

Implementation-level Actions

  • Add cache source on decided storage
  • Add cache destination on decided storage
  • Support multiple source to multiple destination logic (source partition -> destination partition combinator)
  • read_sql API support
  • Add tests
  • Add documentation

Additional Tasks

  • This task is put into a correct pipeline (Development Backlog or In Progress).
  • The label of this task is setting correctly.
  • The issue is assigned to the correct person.
  • The issue is linked to related Epic.
  • The documentation is changed accordingly.
  • Tests are added accordingly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions