Slide 1

Slide 1 text

Artificial Intelligence and Systems Laboratory (AISys) Research Overview
Pooyan Jamshidi, University of South Carolina
https://pooyanjamshidi.github.io/AISys/

Slide 2

Slide 2 text

Saeid Ghafouri (PhD student), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Shahriar Iqbal (PhD student), Sonam Kharde (Postdoc), Hamed Damirchi (PhD student, co-advised by Forest Agostinelli), Mehdi Yaghouti (Postdoc), Samuel Whidden (Undergraduate), Rasool Sharifi (PhD student), Kimia Noorbakhsh (Undergraduate)

Slide 3

Slide 3 text

Building reliable models that produce causal explanations for performance debugging and transfer better to new environments
[Figure: throughput (FPS, 0-20) vs. cache misses (100k-200k) under the LRU, FIFO, LIFO, and MRU cache policies; cache policy affects throughput via cache misses: Cache Policy -> Cache Misses -> Throughput.]
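The causal chain on this slide (Cache Policy -> Cache Misses -> Throughput) can be sketched on synthetic data; the linear model and all numbers below are hypothetical, chosen only to mirror the slide's axes (misses around 100k-200k, throughput around 10-20 FPS):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical linear model of the slide's causal chain:
# cache policy (treatment) -> cache misses (mediator) -> throughput (outcome).
policy = rng.integers(0, 2, n)                # 0 = LRU, 1 = FIFO (illustrative)
misses = 100_000 + 50_000 * policy + rng.normal(0, 5_000, n)
throughput = 25.0 - 1e-4 * misses + rng.normal(0, 0.5, n)

# Total effect of the policy on throughput, flowing through the mediator:
effect = throughput[policy == 1].mean() - throughput[policy == 0].mean()
print(round(effect, 1))   # ~ -5.0 FPS: FIFO adds ~50k misses, each costing 1e-4 FPS
```

Because the policy is randomized here, the difference in means recovers the causal effect; on observational system data, the whole point of the causal model is to adjust for confounders before making such comparisons.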

Slide 4

Slide 4 text

FlexiBO: multi-objective optimization that trades off information gain against the cost of design evaluations
• FlexiBO is a cost-aware approach for multi-objective optimization that iteratively selects a design and an objective for evaluation.
• It allows us to trade off the additional information gained through an evaluation against the cost incurred by that evaluation.
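A minimal sketch of the selection rule described above, under our own simplifying assumption (not from the slide) that predictive variance stands in for information gain; all numbers are invented:

```python
import numpy as np

# Hypothetical predictive variances: rows are candidate designs,
# columns are objectives (e.g., accuracy, energy).
variance = np.array([[0.9, 0.4],
                     [0.2, 0.8],
                     [0.5, 0.5]])
cost = np.array([1.0, 5.0])       # evaluating objective 1 is 5x costlier

# Cost-aware selection: maximize information gained per unit cost,
# jointly over (design, objective) pairs.
utility = variance / cost         # broadcast: value-per-cost for each pair
design, objective = np.unravel_index(np.argmax(utility), utility.shape)
print(design, objective)          # -> 0 0: design 0's cheap objective wins
```

Note that design 1 has the highest variance on objective 1 (0.8), but its cost-adjusted utility (0.16) loses to the cheap evaluation of design 0 (0.9), which is exactly the trade-off the slide describes.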

Slide 6

Slide 6 text

Finding root causes of configuration issues in highly configurable robots
We discovered that the root causes of task failures in robots can be captured by causal effect estimation over task inputs and robot configurations.
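The causal effect estimation mentioned above can be illustrated on toy data; the configuration option, failure rates, and the randomization assumption below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical robot data: one boolean configuration option and a task outcome.
# In this toy model the option raises the failure probability from 10% to 30%.
option_on = rng.integers(0, 2, n).astype(bool)
failure = rng.random(n) < np.where(option_on, 0.30, 0.10)

# Difference-in-means estimate of the causal effect (valid here only
# because option_on is randomized in the toy data):
ate = failure[option_on].mean() - failure[~option_on].mean()
print(round(ate, 2))   # close to the true +0.20 effect
```

An estimated effect of roughly +20 percentage points flags this option as a likely root cause; in real deployments the configuration is not randomized, so the causal graph must be used to adjust for confounding first.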

Slide 7

Slide 7 text

Sim-to-real by enabling causal transfer learning
Causal models learned in simulation can be transferred to real robots to find the root causes of failures of physical robots.

Slide 9

Slide 9 text

Looking for winning tickets in over-parametrized networks
• How robust are the discovered sub-networks (e.g., to adversarial attacks or distributional shift)?
• Is there any always-winning lottery ticket hidden in a randomly initialized network?
• Is it possible to train the sparse sub-network efficiently?
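A sketch of the prune-and-rewind step behind lottery-ticket experiments; the layer size, sparsity level, and one-shot magnitude criterion are illustrative assumptions of ours, not the exact procedure behind these questions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single layer: weights at initialization and after training.
w_init = rng.normal(size=(256, 256))
w_trained = w_init + rng.normal(scale=0.5, size=w_init.shape)

# One-shot magnitude pruning: keep the largest-magnitude trained weights.
sparsity = 0.9                                  # prune 90% of weights
threshold = np.quantile(np.abs(w_trained), sparsity)
mask = np.abs(w_trained) >= threshold           # candidate "winning ticket" mask

# Lottery-ticket rewind: surviving weights go back to their INITIAL values,
# so the sparse sub-network is retrained from its original initialization.
ticket = w_init * mask
print(round(mask.mean(), 2))                    # -> 0.1: 10% of weights survive
```

The robustness questions on the slide then amount to stress-testing the network retrained from `ticket` (e.g., under adversarial perturbations or shifted data) rather than the dense original.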

Slide 11

Slide 11 text

Is there anything special about contrastive learning in terms of adversarial robustness?
• Contrastive learning (CL) without label information is less robust than other learning schemes.
• Semi-supervised learning (SL-CL or SCL-CL) is more robust than CL.

Slide 12

Slide 12 text

Is there anything special about contrastive learning in terms of adversarial robustness?
• Adversarial training causes similar representations between consecutive layers.
• Fully adversarial fine-tuning can improve clean accuracy (red line) and robustness (blue line) by eliminating these similarities.
• The lack of differentiated layer-wise representations after adversarial training may hinder neural networks from achieving high clean/adversarial accuracy.
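Layer-wise representation similarity of the kind described above is commonly measured with indices such as linear CKA; this is our own toy construction, not the slide's experiment, showing how a near-duplicate layer scores close to 1 while an unrelated layer scores low:

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA similarity between two activation matrices (rows = examples)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, 'fro') ** 2
    den = np.linalg.norm(x.T @ x, 'fro') * np.linalg.norm(y.T @ y, 'fro')
    return num / den

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(500, 64))                            # hypothetical activations
layer_b = layer_a + rng.normal(scale=0.1, size=layer_a.shape)   # near-copy of layer_a
layer_c = rng.normal(size=(500, 64))                            # unrelated layer

print(linear_cka(layer_a, layer_b) > 0.9)   # True: "collapsed" consecutive layers
print(linear_cka(layer_a, layer_c) < 0.3)   # True: differentiated layers score low
```

Under the slide's finding, adversarially trained networks would look like the `layer_a`/`layer_b` pair across many consecutive layers, and fully adversarial fine-tuning would push them back toward the `layer_a`/`layer_c` regime.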

Slide 14

Slide 14 text

Hardware-aware partitioning and mapping for multi-chiplet and multi-card AI inference systems
[Figure: framework inputs are a workload computation graph, a heterogeneous system interconnect graph (host/CPU, PCIe switches, modules M1-M8 with chiplets C0-C7 and D2D links), and vendor A/B intra-chiplet and inter-chiplet interconnect graphs; framework outputs are a partitioned computation graph (subgraphs SG1, SG2), a mapping onto the hardware, and a pipeline schedule of batches B-1..B-n over module-chiplet pairs.]

Slide 15

Slide 15 text

Hardware-aware partitioning and mapping for multi-chiplet and multi-card AI inference systems
[Figure: the same framework inputs/outputs diagram, repeated on this slide.]

Slide 16

Slide 16 text

Hardware-aware partitioning and mapping for multi-chiplet and multi-card AI inference systems
[Figure: the framework diagram again, now with a set of available chiplets from vendors A and B as an additional input.]

Slide 18

Slide 18 text

Reconciling high accuracy, cost-efficiency, and low latency of inference serving systems
• Model variants offer different accuracy/latency trade-offs.
• A model's performance varies under different resource assignments.
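The first bullet's trade-off can be sketched as a tiny variant-selection rule; the model names, accuracies, and latencies below are invented for illustration:

```python
# Hypothetical model variants with their accuracy and tail-latency profiles.
variants = [
    {"name": "resnet18",  "acc": 0.70, "p99_ms": 12},
    {"name": "resnet50",  "acc": 0.76, "p99_ms": 35},
    {"name": "resnet152", "acc": 0.78, "p99_ms": 90},
]

def pick_variant(slo_ms):
    """Serve the most accurate variant whose p99 latency meets the SLO."""
    feasible = [v for v in variants if v["p99_ms"] <= slo_ms]
    return max(feasible, key=lambda v: v["acc"])["name"] if feasible else None

print(pick_variant(40))    # -> resnet50
print(pick_variant(10))    # -> None: no variant meets a 10 ms SLO
print(pick_variant(100))   # -> resnet152
```

The second bullet is what makes the real problem hard: each variant's latency column is not a constant but a function of its resource assignment, so the serving system must search over variants and resources jointly.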

Slide 19

Slide 19 text

Reconciling high accuracy, cost-efficiency, and low latency of inference serving systems

Slide 21

Slide 21 text

A new paradigm that integrates probabilistic model checking with causal inference to enable planning and verification in autonomous systems
• RQ1: How can structural causal models be integrated with probabilistic model checking to provide a framework for planning tasks in autonomous systems?
• RQ2: How can counterfactual reasoning be integrated with probabilistic model checking to analyze the effect of interventions that have not been observed in the system's behavior?
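A core primitive of probabilistic model checking, reachability probability in a discrete-time Markov chain, can be computed by solving a linear system; the four-state chain below is a toy example of ours, not a model from this project:

```python
import numpy as np

# Toy DTMC. States: 0 = start, 1 = retry, 2 = success (target), 3 = failure.
# Rows are current states, columns are next states; each row sums to 1.
P = np.array([[0.0, 0.5, 0.4, 0.1],
              [0.3, 0.0, 0.5, 0.2],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Standard reachability: for transient states S, the probabilities x of
# eventually hitting the target satisfy x = P_SS @ x + P_S,target.
transient = [0, 1]
A = np.eye(2) - P[np.ix_(transient, transient)]
b = P[transient, 2]
x = np.linalg.solve(A, b)
print(round(x[0], 3))   # -> 0.765: probability of success from the start state
```

In the proposed paradigm, a structural causal model would let one intervene on such a chain (e.g., change the retry behavior) and re-verify the reachability property under the intervention, rather than only under observed behavior.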

Slide 22

Slide 22 text

Independent modular networks for learning robust and disentangled representations
• Modular networks can automatically decompose shapes into different learnable representations.
• With the introduction of the ID classifier, the decomposition improves significantly: a large majority of the images for each shape pass through a single module.

Slide 24

Slide 24 text

Pretrained language models are symbolic mathematics solvers too!
• Does this pretrained model help us use less data for fine-tuning?
• Does the result of this fine-tuning depend on the languages used for pretraining?
• How robust is this fine-tuned model to distribution shift between the test data and the fine-tuning data?

Slide 26

Slide 26 text

Unit-cycle architecture: an intrinsically faster approach
• Problem:
  • Single- and multi-cycle microarchitectures waste time because the clock period is fixed by the critical path.
  • If the longest instruction takes 1100 ps, then every instruction takes 1100 ps.
• Solution:
  • Use a timer to measure elapsed time.
  • Set the timer to the duration of the current instruction.
  • When the timer runs out, move to the next instruction.

Benchmark program: square root
                     Single-Cycle   Multi-Cycle   Unit-Cycle
Clock Period (ps)    1100           300           100
Cycles Executed      360            1,316         2,748
Execution Time (ps)  396,000        394,800       274,800

Unit-cycle is more than 40% faster than single-cycle or multi-cycle.
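The table's arithmetic can be checked directly (all numbers taken from the slide): execution time = clock period x cycles executed.

```python
# (clock period in ps, cycles executed) for the square-root benchmark.
designs = {
    "single-cycle": (1100, 360),
    "multi-cycle":  (300, 1_316),
    "unit-cycle":   (100, 2_748),
}

# Execution time = clock period x cycles executed.
times = {name: period * cycles for name, (period, cycles) in designs.items()}
print(times)   # {'single-cycle': 396000, 'multi-cycle': 394800, 'unit-cycle': 274800}

# Speedup of unit-cycle over single-cycle.
speedup = times["single-cycle"] / times["unit-cycle"] - 1
print(f"{speedup:.0%}")   # -> 44%, consistent with the slide's ">40% faster" claim
```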

Slide 27

Slide 27 text

Artificial Intelligence and Systems Laboratory (AISys)
https://pooyanjamshidi.github.io/AISys/
Research Areas: - Causal AI - ML for Systems - Systems for ML - Adversarial ML - Robot Learning - Representation Learning
Sponsors: Collaborators:
Saeid Ghafouri (PhD student), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Shahriar Iqbal (PhD student), Sonam Kharde (Postdoc), Hamed Damirchi (PhD student), Mehdi Yaghouti (Postdoc), Samuel Whidden (Undergraduate), Rasool Sharifi (PhD student), Kimia Noorbakhsh (Undergraduate)