---
title: "Abstraction and Reasoning Challenge: Kaggle's Toughest Competition"
author: "Matthew Emery (lstmemery)"
date: "7/9/2020"
bibliography: arc.bib
output:
  revealjs::revealjs_presentation:
    theme: night
    css: style.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
---
![](img/easy1.png)
---
![](img/easy2.png)
## Why are machines so bad at this?
## Table of Contents
- What is the goal of this competition?
- **What is intelligence?**
- The Abstraction and Reasoning Corpus (ARC)
- Kaggle Solutions
<aside class="notes">
This competition is different from any other. We will spend most of the time discussing the philosophy of the dataset instead of the solutions.
</aside>
## What is the goal of the Abstraction and Reasoning Corpus (ARC)?
<div id="left">
> The Abstraction and Reasoning Corpus (ARC) provides a benchmark to measure AI skill-acquisition on unknown tasks, with the constraint that only a handful of demonstrations are shown to learn a complex task. It provides a glimpse of a future where AI could quickly learn to solve new problems on its own.
</div>
<div id="right">
![](img/chollet.jpg)
</div>
## What percentage of the problems in the ARC test set were solved by the 1st place AI?
## What is the goal of AI Research?
> AI is the science of making machines capable of performing tasks that would require intelligence if done by humans. - Marvin Minsky
<aside class="notes">
Notice that this is a pretty circular argument. How would we test this?
</aside>
## Weaknesses of Current AI Research Approach
<div id="left">
- Current algorithms are narrow and data-hungry
- Famous AI competitions like the Turing Test rely on human judges
- Lack of agreement on what intelligence is biases research toward narrow, well-defined skills
</div>
<div id="right">
![](img/loebner.png)
</div>
## Defining Intelligence Currently
> Intelligence measures an agent’s ability to *achieve goals* in a *wide range of environments*. - Legg and Hutter
1. Achieve goals
2. Wide range of environments (Adaptability and Generalization)
Chollet's argument: We've spent too much time on the first part!
## Measuring the right thing
<div id="left">
- If the real world is a chaotic system, we need to optimize for adaptability
- Deep Blue didn't get us much closer to AGI
- Neither did the latest reinforcement learning algorithms
- Often, these challenges measure the ML engineer's intelligence, not the system itself
</div>
<div id="right">
![](img/alphago.jpg)
</div>
<aside class="notes">
- Note that a k-nearest neighbour algorithm can solve any task, given enough data
</aside>
## Current trends in AGI evaluation: Reinforcement Learning (RL)
<div id="left">
- Most RL tests don't measure robustness
- RL can sample arbitrary amounts of data
- Brittle to exploits
</div>
<div id="right">
![](img/openai5cheese.png)
</div>
<aside class="notes">
This tweet is saying that one of OpenAI's Dota 2 bots can be tricked into teleporting onto a ledge it can't get down from.
</aside>
## AI Effect
> "Every time somebody figured out how to make a computer do something...there was a chorus of critics to say 'that's not thinking'." - Pamela McCorduck
## What would a good AI adaptability benchmark have?
- Reproducibility
- Fairness
- Scalability
- Flexibility/Generalization
## Two ways of thinking about generalization
<div id="left">
#### System-centric generalization
- Most machine learning is here
- The developer tries to compensate for data not in the dataset
- Known unknowns
</div>
<div id="right">
#### Developer-aware generalization
- Generalizations beyond what the developer expected
- Unknown unknowns
</div>
<aside class="notes">
The game of Go never changes. What would happen if we changed the size of the board?
</aside>
---
<img src="./img/impact.gif" height="200%" width="200%">
## The generalization spectrum
1. **Absence of generalization:** Classical algorithms. No uncertainty.
2. **Local generalization:** Most machine learning
3. **Broad generalization:** Wozniak's coffee cup challenge
4. **Extreme generalization:** Human-level intelligence
## What's Human Intelligence?
> AI is the science of making machines capable of performing tasks that would require intelligence if done by humans. - Marvin Minsky
## Psychometrics Perspective
![](img/psychometrics.png)
<aside class="notes">
- There's a whole field dedicated to measuring intelligence called psychometrics
</aside>
## G Factor
- Think of G Factor like general athleticism
- There are limits to athletic measurement: we wouldn't measure humans at the bottom of the ocean or on Mars
- Humans are incredibly efficient at solving 2D and 3D problems, but we are terrible at problems in 4 or more dimensions
<aside class="notes">
- IQ tests are very human-centric
- Why not consider octopus camouflage intelligence?
</aside>
## Lessons from Psychometrics
- Measure abilities, not skills
- Use a battery of tests
- Set standards on reliability and validity
- Remove the need for non-universal knowledge
- Focus on skill acquisition efficiency
## Human Priors
- To compare intelligence we need to control for experience, priors and the difficulty of the task
- Human intelligence is not hard-coded, but we also aren't blank slates
- Some human priors:
    - Reflexes (uninteresting to us for this challenge)
    - Metalearning priors (these are the priors we are trying to reverse engineer)
    - High-level knowledge priors (e.g. object permanence)
## Machine Priors
- We should make the machine priors as close to human priors as possible
- The fewer priors the machine has, the more disadvantaged it is
- **All priors should be enumerated**
## ARC Priors
- **Objectness and elementary physics**
    - Object recognition
    - Cohesion
    - Persistence
    - Contact
- **Agency**
    - Some objects seem to react to their environment (like avoiding walls)
    - Some objects seem to react to other objects (a pip following another pip)
- **Natural numbers up to 10**
    - Comparison/Sorting
    - Addition and subtraction
- **Geometry/Topology**
    - Symmetry
    - Orientation
    - In/Out relationships
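
To make the cohesion prior concrete, here is a minimal sketch that treats an "object" as a group of same-colored cells that touch along an edge. This is one possible reading of the prior, not a definition taken from ARC: the function name `find_objects`, the choice of 4-connectivity, and the use of color 0 as background are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]
Cell = Tuple[int, int]


def find_objects(grid: Grid, background: int = 0) -> List[Set[Cell]]:
    """Group same-colored, 4-connected, non-background cells into objects."""
    seen: Set[Cell] = set()
    objects: List[Set[Cell]] = []
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color == background or (r, c) in seen:
                continue
            # Flood fill outward from this cell (assumed 4-connectivity).
            stack, obj = [(r, c)], set()
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                obj.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append(obj)
    return objects


# A made-up grid with three objects of sizes 3, 2 and 1.
grid = [
    [1, 1, 0, 2],
    [0, 1, 0, 2],
    [3, 0, 0, 0],
]
objects = find_objects(grid)
```

Once cells are grouped this way, priors such as persistence or size comparison can be phrased over objects rather than over individual cells.
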
## Contact/Goal-Directedness
![](img/contactness-before.png)
## Contact/Goal-Directedness
![](img/contactness-after.png)
## Distance/In-Out Relationship
![](img/distance.png)
## Distance/In-Out Relationship
![](img/distance2.png)
## Object Persistence and Size Comparison
![](img/hard-arc.png)
## Object Persistence and Size Comparison
![](img/hard-arc2.png)
## Francois Chollet's Definition of Intelligence
> The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.
$$I_{\mathrm{IS},\mathrm{scope}}^{\theta_{T}}=\underset{T \in \mathrm{scope}}{\mathrm{Avg}}\left[\omega_{T} \cdot \theta_{T} \sum_{C \in \mathrm{Cur}_{T}^{\theta_{T}}}\left[P_{C} \cdot \frac{\mathrm{GD}_{\mathrm{IS},T,C}^{\theta_{T}}}{P_{\mathrm{IS},T}^{\theta_{T}}+E_{\mathrm{IS},T,C}^{\theta_{T}}}\right]\right]$$
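
The structure of the formula can be illustrated with a toy calculation. The sketch below only mirrors the arithmetic: an average over tasks of a weight times a sum over curricula of probability times generalization difficulty divided by priors plus experience. The class names, field names and numbers are all made up for illustration and are not measurements of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Curriculum:
    prob: float        # P_C: probability of this curriculum
    gen_diff: float    # GD: generalization difficulty
    experience: float  # E: experience consumed along the curriculum


@dataclass
class Task:
    weight: float      # omega_T * theta_T: value of reaching the skill threshold
    priors: float      # P: priors the system brings to this task
    curricula: List[Curriculum]


def toy_intelligence(tasks: List[Task]) -> float:
    """Average over tasks of weight * sum_C P_C * GD / (P + E)."""
    per_task = []
    for task in tasks:
        efficiency = sum(
            c.prob * c.gen_diff / (task.priors + c.experience)
            for c in task.curricula
        )
        per_task.append(task.weight * efficiency)
    return sum(per_task) / len(per_task)


# Same task and difficulty; only the amount of experience needed differs.
fast = Task(weight=1.0, priors=1.0, curricula=[Curriculum(1.0, 4.0, 1.0)])
slow = Task(weight=1.0, priors=1.0, curricula=[Curriculum(1.0, 4.0, 7.0)])
```

Here `toy_intelligence([fast])` is larger than `toy_intelligence([slow])`: a system that reaches the same skill with less experience scores higher, and the same holds for a system that needs fewer priors.
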
## Implications of the Intelligence Definition
- Creating an intelligent system can be seen as an optimization problem
- Focus on skill acquisition efficiency
- Controls for priors and experience
- Can compare humans to AI
## Measuring Intelligence: A Diagram
![](img/arc1.png)
<aside class="notes">
- This is a different way of thinking about machine learning
- In training, the machine learning model is constantly producing skill programs and the task is responding to the skill program itself
- For example, a neural network produces arrays of numbers (the skill program) and the task gives feedback (in the form of gradients) based on the skill program's response
</aside>
## ARC test description
- 400 training tasks, 400 public evaluation tasks, 200 private evaluation tasks
- Up to 10 unique symbols/colors
- Grid sizes from 1x1 to 30x30
- Usually 3 examples per task
- A typical human can solve most ARC problems with no previous training
- Each task was solved by at least 1 of 3 high-IQ humans
- The intelligent system gets 3 guesses per task
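
As a rough sketch of what this looks like in practice: ARC tasks are stored as JSON with `train` and `test` lists of `input`/`output` grid pairs. The task below is made up for illustration (it is not from the corpus), and `is_solved` is a hypothetical helper that reflects the rule from the list above: up to 3 guesses, and only an exact match counts.

```python
from typing import Dict, List

Grid = List[List[int]]  # each cell is a color code from 0 to 9

# A made-up task in the ARC JSON layout: mirror the grid left to right.
toy_task: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [7, 0]], "output": [[5, 0], [0, 7]]}],
}


def is_solved(guesses: List[Grid], expected: Grid, max_guesses: int = 3) -> bool:
    """A test input counts as solved only if one of the first
    `max_guesses` guesses matches the expected grid exactly."""
    return any(guess == expected for guess in guesses[:max_guesses])


test_pair = toy_task["test"][0]
guesses = [
    test_pair["input"],                         # guess 1: copy the input
    [row[::-1] for row in test_pair["input"]],  # guess 2: mirror each row
]
solved = is_solved(guesses, test_pair["output"])
```

There is no partial credit in this sketch: a guess that is off by a single cell scores the same as a blank grid.
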
## Differences with Psychometric Tests
- No crystallized knowledge (like NLP or object recognition)
- ARC challenges are not generated programmatically
## What would a possible Intelligent System look like?
- Deep learning won't work here
- Focus on program synthesis
- Make a Domain Specific Language (DSL) to describe all possible situations
- Have a good way of finding candidate programs
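
A minimal sketch of the idea, assuming a toy DSL of three grid operations and a brute-force search over short compositions. The primitive names and the `search` function are hypothetical; a realistic DSL would be far larger and would need a much smarter search than enumeration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]

# A tiny, hypothetical DSL: each primitive maps a grid to a grid.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives from left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(examples: List[Pair], max_depth: int = 3) -> Optional[Tuple[str, ...]]:
    """Return a shortest program that reproduces every example, if any."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# Demonstrations of a 90-degree clockwise rotation.
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0]], [[5], [0], [0]]),
]
program = search(examples)
```

No single primitive rotates a grid, so the search has to compose two of them. The number of candidate programs grows exponentially with depth, which is why finding candidates efficiently matters so much.
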
## ARC Weaknesses & Future Refinement
- An ARC solver could have human-like intelligence, or not!
- Generalization difficulty is not quantified
- Test validity not established
- Dataset diversity is limited
- Evaluation is overly closed-ended
- Priors may not be well captured in ARC
## The Kaggle competition
- 20k USD in prize money
- Chollet would be surprised if 20% of tasks in the private set were solved
## 1st place by Johan Sokrates Wind
- 20.6% solved
- Implemented the first 100 training tasks by hand, chunking the solutions into useful functions
- Built a domain-specific language (DSL) in more than 10k lines of multithreaded C++
- Used a directed acyclic graph (DAG) to compose combinations of functions
## 2nd place by Alejandro de Miquel Bleier, Roderic Guigo Corominas and Yuji Ariyasu
- 19.7% solved
- Solved less than 2% of tasks for the first 6 weeks
- Used a guided evolution strategy, applying random operations in the training loop and keeping the top 3 results
- Correctness was based on a proxy measure: pixel-wise distance
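
The general shape of such a strategy can be sketched as follows. This is not the team's code: the three operations, the shape-mismatch penalty, and every function name here are assumptions chosen to keep the example small. It extends programs with random operations, scores them by pixel-wise distance on the training pairs, and keeps the 3 best each round.

```python
import random
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Tuple[str, ...]

# Hypothetical operations, chosen only for illustration.
OPS: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def pixel_distance(actual: Grid, expected: Grid) -> int:
    """Count mismatched cells; a shape mismatch counts every cell as wrong."""
    same_shape = len(actual) == len(expected) and all(
        len(a) == len(e) for a, e in zip(actual, expected)
    )
    if not same_shape:
        return sum(len(row) for row in expected)
    return sum(
        a != e
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )


def run(program: Program, grid: Grid) -> Grid:
    for name in program:
        grid = OPS[name](grid)
    return grid


def score(program: Program, examples: List[Pair]) -> int:
    return sum(pixel_distance(run(program, x), y) for x, y in examples)


def evolve(examples: List[Pair], generations: int = 20, keep: int = 3,
           children: int = 4, seed: int = 0) -> Program:
    rng = random.Random(seed)
    population: List[Program] = [()]  # start from the empty (identity) program
    for _ in range(generations):
        candidates = set(population)
        for parent in population:
            for _ in range(children):
                candidates.add(parent + (rng.choice(list(OPS)),))
        # Parents stay in the pool, so the best score never gets worse.
        ranked = sorted(candidates, key=lambda p: (score(p, examples), len(p), p))
        population = ranked[:keep]
    return population[0]


examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
best = evolve(examples)
```

Pixel-wise distance is only a proxy: a program can be one operation away from correct and still score badly, so a search guided this way can stall in local optima.
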
## When do you think an algorithm will outperform humans on ARC?
# References