Artificial Intelligence and Systems Laboratory (AISys): A Research Overview

Pooyan Jamshidi, University of South Carolina
https://pooyanjamshidi.github.io/AISys/

• Saeid Ghafouri (PhD student)
• Fatemeh Ghofrani (PhD student)
• Abir Hossen (PhD student)
• Shahriar Iqbal (PhD student)
• Sonam Kharde (Postdoc)
• Hamed Damirchi (PhD student), co-advised by Forest Agostinelli
• Mehdi Yaghouti (Postdoc)
• Samuel Whidden (Undergraduate)
• Rasool Sharifi (PhD student)
• Kimia Noorbakhsh (Undergraduate)

Building reliable models that produce causal explanations for performance debugging and transfer better to new environments
[Figure: Throughput (FPS) versus Cache Misses under the LRU, FIFO, LIFO, and MRU cache policies.]
Cache Policy affects Throughput via Cache Misses (Cache Policy → Cache Misses → Throughput).
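The causal chain on this slide can be illustrated with a toy structural causal model. The sketch below is only an illustration: the miss counts, the throughput equation, and the function names are all invented for the example and are not measurements or code from the lab. It shows that once cache misses are fixed by an intervention, the cache policy no longer changes throughput, which is what "affects Throughput via Cache Misses" means.

```python
# Toy structural causal model: Cache Policy -> Cache Misses -> Throughput.
# All numbers are invented for illustration; they are not measurements.

# Hypothetical cache-miss counts for each policy.
MISSES_BY_POLICY = {"LRU": 50_000, "FIFO": 100_000, "LIFO": 150_000, "MRU": 200_000}


def cache_misses(policy):
    """Structural equation for the mediator: misses depend only on the policy."""
    return MISSES_BY_POLICY[policy]


def throughput(misses):
    """Structural equation for the outcome: FPS depends only on the misses."""
    return 25.0 - misses / 10_000


def simulate(policy, do_misses=None):
    """Run the model; do_misses overrides the mediator (an intervention)."""
    misses = cache_misses(policy) if do_misses is None else do_misses
    return throughput(misses)


# Without intervention, throughput differs across policies.
observed = {p: simulate(p) for p in MISSES_BY_POLICY}

# Under do(misses = 100k), every policy yields the same throughput,
# because the policy acts on throughput only through the misses.
intervened = {p: simulate(p, do_misses=100_000) for p in MISSES_BY_POLICY}

print(observed)
print(intervened)
```

In a real system the structural equations are unknown and would have to be learned from performance measurements; the sketch only fixes the vocabulary.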

FlexiBO: A multi-objective optimization approach that trades off information gain against the cost of design evaluations
• FlexiBO is a cost-aware approach for multi-objective optimization that iteratively selects a design and an objective for evaluation.
• It allows us to trade off the additional information gained through an evaluation against the cost incurred by that evaluation.
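The trade-off described above can be sketched as a selection rule that picks the (design, objective) pair with the most expected information gain per unit of evaluation cost. This is a simplified illustration, not FlexiBO's actual acquisition function: the designs, objectives, gain estimates, costs, and the name `select_evaluation` are all hypothetical.

```python
# Hypothetical cost of evaluating each objective (e.g. in seconds).
COST = {"accuracy": 10.0, "energy": 1.0}

# Hypothetical estimates of the information gained by evaluating
# a given objective on a given design.
GAIN = {
    ("d1", "accuracy"): 5.0,
    ("d1", "energy"): 0.8,
    ("d2", "accuracy"): 3.0,
    ("d2", "energy"): 0.2,
}


def select_evaluation(gain, cost):
    """Return the (design, objective) pair with the highest gain per unit cost."""
    return max(gain, key=lambda pair: gain[pair] / cost[pair[1]])


choice = select_evaluation(GAIN, COST)
print(choice)  # the pair evaluated in this iteration
```

With these made-up numbers the rule prefers the cheap energy measurement on d1 (0.8 gain per unit cost) over the more informative but expensive accuracy measurement (0.5 gain per unit cost), which is the kind of decision a cost-unaware optimizer would not make.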

Finding root causes of configuration issues in highly-configurable robots
We discovered that the root causes of task failures in robots can be captured by estimating the causal effects of task inputs and robot configurations.

Sim-to-real by enabling causal transfer learning
Causal models learned in simulation can be transferred to real robots to find the root causes of failures of physical robots.

Looking for winning tickets in over-parametrized networks
• How robust are the discovered sub-networks (e.g., to adversarial attacks and distributional shift)?
• Is there an always-winning lottery ticket hidden in a randomly initialized network?
• Is it possible to train the sparse sub-network efficiently?

Is there anything special about contrastive learning in terms of adversarial robustness?
• Contrastive learning (CL) without label information is less robust than other learning schemes.
• Semi-supervised learning (SL-CL or SCL-CL) is more robust than CL.

Is there anything special about contrastive learning in terms of adversarial robustness?
• Adversarial training causes similar representations between consecutive layers.
• Fully adversarial fine-tuning can improve clean accuracy (red line) and robustness (blue line) by eliminating these similarities.
• The lack of differentiated layer-wise representations after adversarial training may hinder neural networks from achieving high clean/adversarial accuracy.

Hardware-aware partitioning and mapping for multi-chiplet and multi-card AI inference systems
[Figure: framework overview. Panel labels: Framework Inputs, Framework Outputs, Workload Computation Graph, Set of Chiplets, Intra-Chiplet Interconnect Graph (Vendor A, Vendor B), Inter-Chiplet Interconnect Graph, Heterogeneous System Interconnect Graph, Partitioned Computation Graph, Mapping, Pipeline Schedule (Module# - Chiplet# over Time).]

Reconciling high accuracy, cost-efficiency, and low latency of inference serving systems
• Model variants offer different accuracy/latency trade-offs.
• Models' performance varies under different resource assignments.
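The two observations above suggest treating each (model variant, resource assignment) pair as a separate serving option. The sketch below is a minimal illustration under that assumption: it picks the most accurate option that meets a latency target and breaks ties by using fewer cores. The `Option` class, the `choose` function, and every number in the profile table are hypothetical, not taken from the lab's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    variant: str       # hypothetical model variant name
    cpu_cores: int     # resources assigned to the variant
    accuracy: float    # hypothetical profiled accuracy
    latency_ms: float  # hypothetical profiled latency at this allocation


# Invented profile: the same variant gets faster with more cores.
PROFILE = [
    Option("small", 1, 0.70, 40.0),
    Option("small", 2, 0.70, 25.0),
    Option("large", 2, 0.80, 120.0),
    Option("large", 8, 0.80, 45.0),
]


def choose(options, latency_slo_ms):
    """Most accurate option within the latency target; fewer cores on ties."""
    feasible = [o for o in options if o.latency_ms <= latency_slo_ms]
    if not feasible:
        return None  # no configuration can meet the target
    return max(feasible, key=lambda o: (o.accuracy, -o.cpu_cores))


for slo in (150.0, 50.0, 30.0, 10.0):
    print(slo, choose(PROFILE, slo))
```

Tightening the latency target first forces a larger resource assignment for the accurate variant and eventually a switch to the less accurate one, which is the three-way tension between accuracy, cost, and latency named in the slide title.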


A new paradigm that integrates probabilistic model checking with causal inference to enable planning and verification in autonomous systems
• RQ1: How can structural causal models be integrated with probabilistic model checking to provide a framework for planning tasks in autonomous systems?
• RQ2: How can counterfactual reasoning be integrated with probabilistic model checking to analyze the effect of interventions that have not been observed in the system's behavior?
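To make the two ingredients concrete, the sketch below computes a reachability probability in a small discrete-time Markov chain (a basic probabilistic model checking query) and then recomputes it after an intervention that replaces one state's transition distribution. It is a generic illustration, not the lab's framework: the states, the probabilities, and the helper names `reach_probability` and `intervene` are all hypothetical.

```python
def reach_probability(transitions, goal, start, sweeps=200):
    """Approximate P(eventually reach goal | start) by fixed-point iteration."""
    prob = {s: 1.0 if s == goal else 0.0 for s in transitions}
    for _ in range(sweeps):
        for state, successors in transitions.items():
            if state != goal:
                prob[state] = sum(p * prob[t] for t, p in successors.items())
    return prob[start]


def intervene(transitions, state, new_distribution):
    """Return a copy of the chain in which one state's behavior is replaced."""
    changed = dict(transitions)
    changed[state] = new_distribution
    return changed


# Hypothetical robot task: from "start" the robot reaches the goal,
# fails, or has to retry.
CHAIN = {
    "start": {"goal": 0.6, "fail": 0.2, "start": 0.2},
    "goal": {"goal": 1.0},
    "fail": {"fail": 1.0},
}

baseline = reach_probability(CHAIN, "goal", "start")

# Intervention: a configuration change that was never observed in the logs.
upgraded = intervene(CHAIN, "start", {"goal": 0.8, "fail": 0.1, "start": 0.1})
after = reach_probability(upgraded, "goal", "start")

print(round(baseline, 4), round(after, 4))
```

Here the new transition probabilities are simply assumed; in the setting of RQ1 and RQ2 they would have to come from a causal model, since the intervened behavior has not been observed.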

Independent modular networks for learning robust and disentangled representations
• Modular networks can automatically decompose the shapes into different learnable representations.
• With the introduction of the ID classifier, the decomposition improves significantly: a large majority of the images for each shape are passed through one module.

Pretrained language models are symbolic mathematics solvers too!
• Does this pretrained model help us use less data for fine-tuning?
• Does the result of this fine-tuning depend on the languages used for pretraining?
• How robust is this fine-tuned model to a distribution shift between the test data and the fine-tuning data?

Unit cycle architecture: an intrinsically faster approach
• Problem:
  • Single/multi-cycle microarchitectures waste time because the clock period is set by the critical path.
  • If the longest instruction takes 1100 ps, then every instruction takes 1100 ps.
• Solution:
  • Use a timer to measure the time.
  • Set the timer to the duration of the instruction.
  • When the timer runs out, move to the next instruction.

Benchmark program: Square root

                       Single-Cycle   Multi-Cycle   Unit-Cycle
Clock Period (ps)             1,100           300          100
Cycles Executed                 360         1,316        2,748
Execution Time (ps)         396,000       394,800      274,800

Unit-Cycle is more than 40% faster than Single-Cycle or Multi-Cycle.
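The benchmark figures follow from execution time = clock period × cycles executed. The short script below reproduces that arithmetic from the clock periods and cycle counts on the slide; the variable names are illustrative, and the speedup is computed as the ratio of execution times.

```python
# (clock period in ps, cycles executed) for the square-root benchmark,
# as given on the slide.
DESIGNS = {
    "Single-Cycle": (1100, 360),
    "Multi-Cycle": (300, 1316),
    "Unit-Cycle": (100, 2748),
}

# Execution time (ps) = clock period (ps) * cycles executed.
exec_time_ps = {name: period * cycles for name, (period, cycles) in DESIGNS.items()}

# Speedup of Unit-Cycle over each design, as a ratio of execution times.
speedup = {
    name: time / exec_time_ps["Unit-Cycle"] for name, time in exec_time_ps.items()
}

for name in DESIGNS:
    print(f"{name}: {exec_time_ps[name]:,} ps, Unit-Cycle speedup {speedup[name]:.2f}x")
```

Both ratios come out at roughly 1.44, which is where the "more than 40% faster" claim comes from. Unit-Cycle executes many more cycles, but each one is short enough that instructions no longer pay for the critical path.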

Artificial Intelligence and Systems Laboratory (AISys)
https://pooyanjamshidi.github.io/AISys/
Research Areas:
- Causal AI
- ML for Systems
- Systems for ML
- Adversarial ML
- Robot Learning
- Representation Learning
