Building reliable models that produce causal explanations for performance debugging and transfer better to new environments
[Figure: Throughput (FPS) plotted against Cache Misses, shown overall and by Cache Policy (LRU, FIFO, LIFO, MRU).]
Cache Policy affects Throughput via Cache Misses.
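The "affects via" claim can be illustrated with a toy structural model. The sketch below is not the authors' model: the miss counts per policy and the linear throughput equation are hypothetical values chosen only to show that, when Cache Policy acts solely through Cache Misses, fixing the misses removes the policy's effect.

```python
# Hypothetical structural model: policy -> cache misses -> throughput.
# All numbers are illustrative assumptions, not measurements from the source.
MISSES_BY_POLICY = {"LRU": 50_000, "FIFO": 100_000, "LIFO": 150_000, "MRU": 200_000}


def throughput_fps(policy, misses=None):
    """Throughput depends on the policy only through the number of cache misses."""
    if misses is None:
        # No intervention on misses: they follow from the chosen policy.
        misses = MISSES_BY_POLICY[policy]
    # Assumed linear relation between misses and throughput.
    return 20.0 - misses / 10_000


# Total effect of switching MRU -> LRU (misses are free to change).
total_effect = throughput_fps("LRU") - throughput_fps("MRU")

# Direct effect with misses held fixed by intervention: the policy alone changes nothing.
direct_effect = throughput_fps("LRU", misses=100_000) - throughput_fps("MRU", misses=100_000)

print(total_effect, direct_effect)
```

In this toy model the whole effect is mediated: the total effect is nonzero while the direct effect is zero.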
FlexiBO: A multi-objective optimization approach that trades off information gain against the cost of design evaluations
• FlexiBO is a cost-aware approach for multi-objective optimization that iteratively selects a design and an objective for evaluation.
• It allows us to trade off the additional information gained through an evaluation against the cost incurred by that evaluation.
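One way to picture the trade-off is a selection rule that scores each (design, objective) pair by expected gain per unit cost. This is a minimal sketch under that assumption; the source does not state FlexiBO's actual acquisition function, and the function name, gain estimates, and costs below are all hypothetical.

```python
from typing import Dict, Tuple


def select_evaluation(
    expected_gain: Dict[Tuple[str, str], float],
    cost: Dict[str, float],
) -> Tuple[str, str]:
    """Return the (design, objective) pair with the highest gain-per-cost ratio."""
    return max(expected_gain, key=lambda pair: expected_gain[pair] / cost[pair[1]])


# Hypothetical estimates of how much each evaluation would tell us.
expected_gain = {
    ("design_a", "accuracy"): 0.9,
    ("design_a", "energy"): 0.3,
    ("design_b", "accuracy"): 0.6,
    ("design_b", "energy"): 0.4,
}
# Hypothetical evaluation cost per objective (e.g. measuring accuracy needs training).
cost = {"accuracy": 10.0, "energy": 1.0}

choice = select_evaluation(expected_gain, cost)
print(choice)
```

With these numbers the cheap energy measurement of design_b wins even though the accuracy evaluation of design_a has the largest raw gain, which is the behavior a cost-aware rule is meant to produce.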
Finding root causes of configuration issues in highly-configurable robots
We discovered that root causes of task failures in robots can be captured by estimating the causal effects of task inputs and robot configurations.
Looking for winning tickets in over-parametrized networks
• How robust are the discovered sub-networks (e.g., to adversarial attacks or distributional shift)?
• Is there any always-winning lottery ticket hidden in a randomly initialized network?
• Is it possible to train the sparse sub-network efficiently?
Is there anything special about contrastive learning in terms of adversarial robustness?
• Contrastive learning (CL) without label information is less robust than other learning schemes.
• Semi-supervised learning (SL-CL or SCL-CL) is more robust than CL.
• Adversarial training causes similar representations between consecutive layers.
• Fully adversarial fine-tuning can improve clean accuracy (red line) and robustness (blue line) by eliminating these similarities.
• The lack of differentiated layer-wise representations after adversarial training may hinder neural networks from achieving high clean/adversarial accuracy.
Hardware-aware partitioning and mapping for multi-chiplet and multi-card AI inference systems
[Figure: framework inputs and outputs. Panel labels: Workload Computation Graph, Set of Chiplets, Heterogeneous System Interconnect Graph, Inter-Chiplet Interconnect Graph, Intra-Chiplet Interconnect Graph (Vendor A, Vendor B), Partitioned Computation Graph, Mapping, Pipeline Schedule (Module# - Chiplet# over Time).]
Reconciling high accuracy, cost-efficiency, and low latency of inference serving systems
• Model variants provide different accuracy/latency trade-offs.
• Models’ performance varies under different resource assignments.
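The two observations above suggest a joint choice of model variant and resource assignment. The following is a minimal sketch of such a choice, not the system's actual policy: the `Option` type, the profiled numbers, and the rule "most accurate feasible option, then fewest cores" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    """One profiled (model variant, resource assignment) pair; values are hypothetical."""
    variant: str
    cpu_cores: int
    accuracy: float
    latency_ms: float


def pick_option(options: List[Option], latency_slo_ms: float, core_budget: int) -> Optional[Option]:
    """Pick the most accurate option that meets the latency SLO within the core budget."""
    feasible = [
        o for o in options
        if o.latency_ms <= latency_slo_ms and o.cpu_cores <= core_budget
    ]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties by using fewer cores (lower cost).
    return max(feasible, key=lambda o: (o.accuracy, -o.cpu_cores))


# Hypothetical profiling data: the same variant gets faster with more cores.
profiles = [
    Option("small", 1, 0.70, 40.0),
    Option("small", 2, 0.70, 25.0),
    Option("large", 2, 0.80, 120.0),
    Option("large", 4, 0.80, 70.0),
    Option("large", 8, 0.80, 45.0),
]

relaxed = pick_option(profiles, latency_slo_ms=80.0, core_budget=4)
tight = pick_option(profiles, latency_slo_ms=30.0, core_budget=4)
print(relaxed, tight)
```

Under the relaxed SLO the large variant on 4 cores is selected; tightening the SLO forces a fallback to the small variant, trading accuracy for latency.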
A new paradigm that integrates probabilistic model checking with causal inference to enable planning and verification in autonomous systems
• RQ1: How can structural causal models be integrated with probabilistic model checking to provide a framework for planning tasks in autonomous systems?
• RQ2: How can counterfactual reasoning be integrated with probabilistic model checking to analyze the effect of interventions that have not been observed in the system's behavior?
Independent modular networks for learning robust and disentangled representations
• Modular networks can automatically decompose the shapes into different learnable representations.
• With the introduction of the ID classifier, the decomposition is improved significantly, where a large majority of the images for each shape are passed through one module.
Unit cycle architecture: an intrinsically faster approach
• Problem:
• Single-cycle and multi-cycle microarchitectures waste time because the clock period is set by the critical path.
• If the longest instruction takes 1100 ps, then every instruction takes 1100 ps.
• Solution:
• Use a timer to measure the time.
• Set the timer to the duration of the instruction.
• When the timer runs out, move to the next instruction.

                       Single-Cycle   Multi-Cycle   Unit-Cycle
Clock Period (ps)              1100           300          100
Cycles Executed                 360         1,316        2,748
Execution Time (ps)         396,000       394,800      274,800

Benchmark program: Square root
Unit-Cycle is more than 40% faster than Single-Cycle or Multi-Cycle (a speedup of about 1.44x on this benchmark).
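The figures in the table can be checked with a few lines of arithmetic. The sketch below recomputes execution time as clock period times cycles executed and derives the speedup. The `unit_cycles` helper is an assumption about how the timer might count an instruction's duration in 100 ps units; the source does not describe the timer at that level.

```python
def execution_time_ps(clock_period_ps, cycles):
    """Execution time is the clock period multiplied by the number of cycles executed."""
    return clock_period_ps * cycles


def unit_cycles(instruction_latency_ps, unit_period_ps=100):
    """Assumed timer setting: instruction duration rounded up to whole unit cycles."""
    return -(-instruction_latency_ps // unit_period_ps)  # ceiling division


# (clock period in ps, cycles executed) from the square-root benchmark table.
designs = {
    "Single-Cycle": (1100, 360),
    "Multi-Cycle": (300, 1316),
    "Unit-Cycle": (100, 2748),
}

times = {name: execution_time_ps(period, cycles) for name, (period, cycles) in designs.items()}
speedup = {name: times[name] / times["Unit-Cycle"] for name in designs}

for name in designs:
    print(f"{name}: {times[name]} ps, Unit-Cycle speedup {speedup[name]:.2f}x")
```

Under the assumed timer model, the 1100 ps worst-case instruction would occupy 11 unit cycles, while shorter instructions finish in fewer instead of waiting for the full critical path.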