Generally, a home connection gets you a public IP, but it is not static. (The IP rarely changes, but every once in a while it does.) To deal with this, many people set up DDNS, usually by signing up for no-ip and configuring it on their router, or by using the iptime domain that iptime routers provide. On top of that, no-ip requires you to confirm the registration by email every 30 days.
I didn't like either of these options, so I looked into running DDNS from a personal server behind my router. The Cloudflare documentation mentions ddclient, which looked good, so I set it up.
There isn't much to it, really. Go to https://github.com/ddclient/ddclient and either build from source or install it with a package manager. It is packaged for quite a few package managers, so installation was easy. I just ran
sudo apt install ddclient
Then I checked the service with
sudo systemctl status ddclient.service
If it comes up without warnings, you are done. In my case the zone= entry was missing from the config, so a warning was being printed (related issue). So I opened /etc/ddclient.conf directly, added the zone, and enabled the service:
sudo systemctl enable ddclient.service
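For reference, a minimal sketch of what the Cloudflare part of /etc/ddclient.conf might look like; every value below is a placeholder, and the exact keys can differ between ddclient versions, so check the ddclient and Cloudflare docs for your setup:
# /etc/ddclient.conf (sketch; all values are placeholders)
use=web                        # discover the current public IP via a web service
protocol=cloudflare
zone=example.com               # the zone= entry that was missing in my case
login=you@example.com          # or whatever your ddclient version expects for API tokens
password=your-cloudflare-api-token
home.example.com               # the DNS record to keep updated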
To use C and C++ in vim I had set up coc.clangd. But coc.clangd started throwing errors on basic things like
#include <cstring>
so I started troubleshooting.
Since g++ was already installed, I almost wrote it off as a non-issue, but installing g++-12 solved it nicely; presumably clangd could not find a usable set of C++ standard library headers, and the newer g++ brought in headers it could pick up.
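On Debian/Ubuntu that came down to the following (assuming the g++-12 package exists for your release):
sudo apt install g++-12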
https://stackoverflow.com/a/29821538
When you push to GitHub and do similar work, you have to authenticate with an SSH key. If you mostly work on a remote server, this authentication can be awkward, and sometimes you have no choice but to keep a key on the server. On a shared server, though, it is better not to leave keys there, and with ssh-agent forwarding you can use the SSH key on your local machine from the server as well. GitHub has a detailed article explaining this at this link. That article also covers troubleshooting, but I want to summarize it briefly here.
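Forwarding itself is turned on from the local machine, either ad hoc with ssh -A server or per host in ~/.ssh/config; the host names below are placeholders:
Host myserver
    HostName server.example.com
    ForwardAgent yes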
One thing to be careful about: since this literally forwards your local ssh-agent, you must not start another ssh-agent after forwarding. If you do, $SSH_AUTH_SOCK, which was pointing at the forwarded agent, gets overwritten and the forwarding becomes useless.
This happens especially easily with multiple-hop ssh agent forwarding, and particularly when your shell rc starts ssh-agent.
To prevent this, check whether $SSH_AUTH_SOCK is already set, and only start ssh-agent when it is not.
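A minimal sketch of that guard (the systemd-based variant I actually use appears further down):
if ! test "$SSH_AUTH_SOCK" ; then
  eval "$(ssh-agent -s)"
fi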
You can check by seeing whether
echo $SSH_AUTH_SOCK
prints anything. Also,
ssh-add -L
should print your public key(s).
Run
ssh-add
to register the keys under ~/.ssh with ssh-agent automatically, or
ssh-add -K path/to/private_key
to register a private key in another location with ssh-agent and the keychain. After that,
ssh-add -L
should again print the public key.
As the troubleshooting article also points out, ssh-agent has to be running at all times for ssh agent forwarding to work reliably. So in an environment where you can use systemd, a user-level systemd service can keep ssh-agent running even after a reboot, or after your session ends and you reconnect.
Create a new file at ~/.config/systemd/user/ssh-agent.service with the following contents:
[Unit]
Description=SSH key agent

[Service]
Type=simple
Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
ExecStart=/usr/bin/ssh-agent -D -a $SSH_AUTH_SOCK

[Install]
WantedBy=default.target
Then run:
systemctl --user daemon-reload
systemctl --user enable --now ssh-agent
With this, the agent listens on a socket file at ${XDG_RUNTIME_DIR}/ssh-agent.socket, which is what SSH_AUTH_SOCK should point to.
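A quick way to check that the agent and its socket are actually there (assuming a systemd-based system):
systemctl --user status ssh-agent
ls -l "${XDG_RUNTIME_DIR}/ssh-agent.socket"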
As mentioned above, to avoid the SSH agent overwrite issue, it is a good idea to add the following to your shell rc (.bashrc, .zshrc, etc.).
This snippet points SSH_AUTH_SOCK at the socket of the agent we set up above, but only when SSH_AUTH_SOCK is not already set.
if ! test "$SSH_AUTH_SOCK" ; then
  export SSH_AUTH_SOCK="${XDG_RUNTIME_DIR}/ssh-agent.socket"
fi
The reason ssh agent forwarding breaks inside tmux is that tmux outlives the ssh session, so it keeps an SSH_AUTH_SOCK value that points at the socket of an ssh session that has already died. The fix is to symlink the temp file that SSH_AUTH_SOCK points to into your home directory, and have tmux look at that symlink instead.
Add the following to ~/.ssh/rc:
# Fix SSH auth socket location so agent forwarding works with tmux
if test "$SSH_AUTH_SOCK" ; then
  ln -sf $SSH_AUTH_SOCK ~/.ssh/ssh_auth_sock
fi
(Edited 2024-03-19) The shell rc snippet below checks ~/.ssh/ssh_auth_sock, SSH_AUTH_SOCK, and XDG_RUNTIME_DIR, in that order, and sets SSH_AUTH_SOCK accordingly.
if test -e "$(readlink -f $HOME/.ssh/ssh_auth_sock)" ; then
  # The tmux-friendly symlink exists and points at a live socket: use it
  export SSH_AUTH_SOCK="$HOME/.ssh/ssh_auth_sock"
elif ! test "$SSH_AUTH_SOCK" ; then
  # No forwarded agent: fall back to the user-level ssh-agent service
  export SSH_AUTH_SOCK="${XDG_RUNTIME_DIR}/ssh-agent.socket"
fi
The authors identify that adversarial robustness affects transfer learning performance.
Despite being less accurate on ImageNet, adversarially robust neural networks match or improve on the transfer performance of their standard counterparts.
They establish this trend in both the “fixed-feature” setting, in which one trains a linear classifier on top of features extracted from a pre-trained network, and the “full-network” setting, in which the pre-trained model is entirely fine-tuned on the relevant downstream task.
Prior works suggest that accuracy on the source dataset is a strong indicator of performance on downstream tasks.
Still, it is unclear if improving ImageNet accuracy is the only way to improve performance. After all, the behavior of fixed-feature transfer is governed by models’ learned representations, which are not fully described by source-dataset accuracy.
These representations are, in turn, controlled by the priors that we put on them during training.
Adversarial robustness refers to a model’s invariance to small (often imperceptible) perturbations of its inputs.
Robustness is typically induced at training time by replacing the standard empirical risk minimization objective with a robust optimization objective:
\[\min _{\theta} \mathbb{E}_{(x, y) \sim D}[\mathcal{L}(x, y ; \theta)] \Longrightarrow \min _{\theta} \mathbb{E}_{(x, y) \sim D}\left[\max _{\|\delta\|_{2} \leq \varepsilon} \mathcal{L}(x+\delta, y ; \theta)\right]\]where $\theta$ are the model parameters, $\mathcal{L}$ is the loss function, and $(x, y) \sim D$ are training image-label pairs.
Rather than minimizing the loss on the training points, this objective minimizes the worst-case loss over an $\varepsilon$-ball around each training point.
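A rough sketch of how this inner maximization is commonly approximated with projected gradient descent (PGD) in PyTorch; this is generic adversarial training, not the authors' exact code, and the batch shape, eps, step size, and iteration count are illustrative assumptions:
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps=3.0, step=0.5, iters=7):
    """Approximate max over ||delta||_2 <= eps of L(x + delta, y) with a few PGD steps."""
    delta = torch.zeros_like(x)                       # x assumed to be a 4-D image batch
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # gradient *ascent* step, normalized per example
            g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
            delta = delta + step * grad / g_norm
            # project back into the L2 ball of radius eps
            d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = delta * (eps / d_norm).clamp(max=1.0)
    return (x + delta).detach()

# robust training step (sketch): train on the perturbed inputs instead of the clean ones
# x_adv = pgd_l2(model, x, y)
# loss = F.cross_entropy(model(x_adv), y); loss.backward(); optimizer.step()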
On one hand, robustness to adversarial examples may seem somewhat tangential to transfer performance. In fact, adversarially robust models are known to be significantly less accurate than their standard counterparts, suggesting that using adversarially robust feature representations should hurt transfer performance.
On the other hand, recent work has found that the feature representations of robust models carry several advantages over those of standard models. For example, adversarially robust representations have better-behaved gradients, and they are approximately invertible, meaning that an image can be approximately reconstructed directly from its robust representation. Engstrom et al. hypothesize that the robust training objective leads to feature representations that are more similar to what humans use.
Robust models match or improve on the transfer learning performance of standard ones.
In this section, we take a closer look at the similarities and differences in transfer learning between robust networks and standard networks.
The authors hypothesize that robustness and accuracy have counteracting yet separate effects. In other words, higher accuracy at a fixed robustness level and higher robustness at a fixed accuracy level both improve transfer learning.
To test this hypothesis, they first study the relationship between ImageNet accuracy and transfer accuracy for each of the robust models that they trained. They find that the previously observed linear relationship between accuracy and transfer performance is often violated once the robustness aspect comes into play (Figure 5).
They also find that when the robustness level is held fixed, the accuracy-transfer correlation observed by prior works for standard models holds for robust models too. Table 2 shows that for these models, improving ImageNet accuracy improves transfer performance at around the same rate as for standard models.
⇒ Transfer learning performance can be further improved by applying known techniques that increase the accuracy of robust models. Accuracy alone is not sufficient for measuring feature quality or versatility. But for now we do not know why robust networks transfer well.
The authors observe that although the best robust models often outperform the best standard models, the optimal choice of the robustness parameter $\epsilon$ varies widely between datasets. They explain that this variability might relate to dataset granularity. They hypothesize that on datasets where leveraging finer-grained features is necessary, the most effective values of $\epsilon$ will be much smaller than on datasets where leveraging coarser-grained features suffices.
Although we lack a quantitative notion of granularity (in reality, features are not simply singular pixels), the authors consider image resolution as a crude proxy. They attempt to calibrate the granularities of the 12 image classification datasets used in this work by first downscaling all the images to the size of CIFAR-10 (32 x 32) and then upscaling them to ImageNet size again. They then repeat the fixed-feature regression experiments from the prior sections, plotting the results in Figure 7. After controlling for original dataset dimension, the datasets' epsilon vs. transfer accuracy curves all behave almost identically to the CIFAR-10 and CIFAR-100 ones.
Figure 8b shows that transfer learning from adversarially robust models outperforms transfer learning from texture-invariant models on all considered datasets.
Figure 8a shows that robust models outperform standard ImageNet models when evaluated (top) or fine-tuned (bottom) on Stylized-ImageNet.
>>> [] and {}
[] # what???
Python's and and or operators don't necessarily return True or False; they return one of their operands.
A or B   # is roughly equal to
A if A else B
# and also
A and B  # is roughly equal to
B if A else A
# (except that with or/and, A is evaluated only once)
Note that neither and nor or restrict the value and type they return to False and True, but rather return the last evaluated argument.
ref: https://docs.python.org/3/reference/expressions.html#boolean-operations
>>> def f(x):
... print(x)
... return x
...
>>> 1 > f(2) > f(3)
2
False
>>> 1 < f(2) < f(3)
2
3
True
>>> 1 > f(2) and f(2) > f(3)
2
False
>>> 1 < f(2) and f(2) < f(3)
2
2
3
True
>>> a = 2
>>> a > 1 == a > 1
False
>>> a < 3 == True
False
>>> 1 < 3 is True
False
Comparisons can be chained arbitrarily, e.g., x < y <= z is equivalent to x < y and y <= z, except that y is evaluated only once (but in both cases z is not evaluated at all when x < y is found to be false). Formally, if a, b, c, …, y, z are expressions and op1, op2, …, opN are comparison operators, then a op1 b op2 c ... y opN z is equivalent to a op1 b and b op2 c and ... y opN z, except that each expression is evaluated at most once.
ref: https://docs.python.org/3/reference/expressions.html#comparisons
Let's think about a situation where you need to encode the characters A, B, C, D, which appear stochastically in a sentence, as bits. You can simply encode each character with 2 bits, for example "00" for A, "01" for B, "10" for C, "11" for D. If every character has the same probability (\(\text{P}(A) = \text{P}(B) = \text{P}(C) = \text{P}(D) = 1/4\)), this encoding is optimal: you use 2 bits per character on average (2 * 1/4 * 4).
But how about \(\text{P}(A) = 1/2, \text{P}(B) = 1/4, \text{P}(C) = \text{P}(D) = 1/8\)? If you use the same scheme, you will still use 2 bits per character on average (2 * 1/2 + 2 * 1/4 + 2 * 1/8 * 2 = 2 * (1/2 + 1/4 + 1/8 + 1/8)). However, this is not the optimal scheme for encoding the four characters. Since A is used more frequently than the others, encoding A with fewer bits lets you use fewer bits on average. So with "1" for A, "01" for B, "000" for C, "001" for D, 1.75 bits are used on average (1 * 1/2 + 2 * 1/4 + 3 * 1/8 * 2 = 1.75). You can still decode such a bit stream unambiguously, because no codeword is a prefix of another: read bits until they match one of the codewords, emit that character, and continue.
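A small sketch to sanity-check the arithmetic and the decoding, using the probabilities and codewords from the example above:
import math

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "1", "B": "01", "C": "000", "D": "001"}

# average code length: 1*1/2 + 2*1/4 + 3*1/8 + 3*1/8 = 1.75 bits
avg_len = sum(probs[c] * len(code[c]) for c in probs)
# entropy: -sum p*log2(p) = 1.75 bits, so this code is optimal for these probabilities
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)          # 1.75 1.75

# prefix-free decoding: the first codeword that matches is the right one
inv = {v: k for k, v in code.items()}
def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(decode("101000001"))       # -> "ABCD"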
The cross-entropy of the distribution \(q\) relative to distribution \(p\) over a given set is defined as follows:
\[\text{H}(p,q) = \text{E}_p[l] = - \text{E}_p[\text{log}_2(q)] = - \sum_{x \in X} p(x) \text{log}_2(q(x)) = \text{H}(p) + D_{KL}(p \Vert q) \tag{1}\]You can think of cross-entropy as the average code length you get when you apply a coding scheme that is optimal for probability distribution \(q\) (\(l_i = - \text{log}_2(q(x_i)) \Leftrightarrow q(x_i) = (\frac{1}{2})^{l_i}\)) to data that actually follows probability distribution \(p\), where \(l_i\) is the length in bits of the codeword for the i-th symbol.
Kullback–Leibler divergence (KL divergence) can be thought of as a measure of how far the distribution \(q\) is from the distribution \(p\).
\[D_{KL}(p \Vert q) = \sum_{x \in X} p(x)\text{log}(\frac{p(x)}{q(x)}) = - \sum_{x \in X}p(x)\text{log}q(x) - (- \sum_{x \in X}p(x)\text{log}p(x))\\ = \text{H}(p,q) - \text{H}(p)\]In deep learning, \(p\) is the data (label) distribution and \(q\) is the neural network's output distribution. Making the cross-entropy loss smaller means making the KL divergence \(D_{KL}(p \Vert q)\) smaller, because \(\text{H}(p)\) is a fixed value.
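A quick numeric sketch of the identity \(\text{H}(p,q) = \text{H}(p) + D_{KL}(p \Vert q)\), reusing the distributions from the coding example above (plain Python, just for illustration):
import math

p = [1/2, 1/4, 1/8, 1/8]    # "true" distribution
q = [1/4, 1/4, 1/4, 1/4]    # distribution the code / model assumes

H_p  = -sum(pi * math.log2(pi) for pi in p)                    # 1.75 bits
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))        # 2.0 bits
D_kl =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))   # 0.25 bits

assert abs(H_pq - (H_p + D_kl)) < 1e-12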
If $p$ is probability..
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. The normalization function could be the sigmoid function in binary classification or the softmax function in multi-class classification.
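For instance, a minimal softmax in plain Python/NumPy (illustrative only):
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative, sums to 1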
Lecture 1 by Bolei Zhou: Exploring and Exploiting Interpretable Semantics in GANs. video, slide, bili
Lecture 2 by Zeynep Akata: Modeling Conceptual Understanding in Image Reference Games video, slide, bili
Lecture 3 by Ruth C. Fong: Understanding Deep Neural Networks video, slide, bili
Lecture 4 by Christopher Olah: Introduction to Circuits in CNNs. video, slide, bili
→ controlling semantics; these approaches use a pre-trained classifier (supervised)
some unsupervised ways… (I can’t understand)
GAN inversion
want to generate an image (encode a latent vector) of an unseen image domain
→ does not work well with unseen image domains (Asian faces, non-face images, etc.) (there is no constraint that the encoded latent vector should lie in the original latent domain)
Interpretability tools are crucial for high-impact, high-risk applications of deep learning.
supervised deep learning: Inputs (what is the model looking at) + Internal Representation (what and how does the model encode) + Training Procedure (how can we improve the model)
We want the model not to cheat; we want the model to gain intuition from the dataset.
→ But datasets have biases. A classifier may fail to gain that intuition and just cheat on the dataset.
→ Foreground evidence is usually sufficient / Large objects are recognized by their details / multiple objects contribute cumulatively / suppressing the background may overdrive the network
Adversarial Defense
right graph: networks trained to take a heatmap as input and predict whether the heatmap comes from a clean or an adversarial image can detect this properly and can even recover the original label.
See the video for the details
→ Adversarial Defense is possible using heatmap!
⇒ How to use?
Two main ways to view intermediate activations
How do groups of channels work together to encode? ← this is similar to how neuroscientists often study not just a single neuron in the brain but coordinated collections or populations of neurons
What groups of channels are responsible for the model's decision?
one filter might be packed with multiple concepts, and one concept might be encoded using multiple filters
reverse engineering NNs! ← only a small fraction of interpretability work targets this
What does it mean to understand an NN?
→ Chris thinks understanding an NN is akin to understanding the variables or registers of a computer program when reverse engineering it.
The weights are the actual “assembly code” of our model!
Reverse engineer a NN in two steps!
Neurons combine together to make new detectors
over five layers, building up to curve detectors.
(Wrote a small python program that filled in the weights of a neural network)
ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. The authors then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on 'Stylized-ImageNet', a stylized version of ImageNet.
→ ImageNet-trained CNNs concentrate on texture, and this can be addressed with Stylized-ImageNet.
Concentrating on shape rather than texture is a much better fit for human behavioural performance and has benefits such as improved object detection performance and robustness towards a wide range of image distortions.
In this work authors aim to shed light on this debate with a number of carefully designed yet relatively straightforward experiments.
shape hypothesis (Context)
texture hypothesis (Locality)
→ It seems that local textures indeed provide sufficient information about object classes: ImageNet object recognition could, in principle, be achieved through texture recognition alone. Object textures are more important than global object shapes for CNN object recognition.
Experiments
Utilizing style transfer (Gatys et al., 2016), the authors created images with a texture-shape cue conflict such as Figure 1c. They perform nine comprehensive and careful psychophysical experiments comparing humans against CNNs on exactly the same images, totaling 48,560 psychophysical trials across 97 observers.
→ These experiments provide behavioral evidence in favor of the texture hypothesis: A cat with an elephant texture is an elephant to CNNs, and still a cat to humans.
Contributions
Beyond quantifying existing biases, the authors subsequently present results for their two other main contributions: changing biases, and discovering emergent benefits of changed biases. They show that the texture bias in standard CNNs can be overcome and changed towards a shape bias if the network is trained on a suitable dataset. Networks with a higher shape bias are more robust and reach higher performance on classification and object recognition tasks.
Models are ImageNet-pretrained networks.
Variable definitions
It is important to note that the authors only selected object and texture images that are correctly classified by all four networks.
Images generated using iterative style transfer (Gatys et al., 2016) between an image of the Texture data set (as style) and an image from the original data set (as content). The authors generated a total of 1280 cue conflict images (80 per category), which allows for presentation to human observers within a single experimental session.
Starting from ImageNet authors constructed a new data set (termed Stylized-ImageNet or SIN) by stripping every single image of its original texture and replacing it with the style of a randomly selected painting through AdaIN style transfer (Huang & Belongie, 2017) (see examples in Figure 3) with a stylization coefficient of \(\alpha\) = 1.0.
Stylized-ImageNet is produced differently from the cue-conflict images.
One confound in the five simple recognition tasks is that CNNs tend not to cope well with domain shifts, i.e. the large change in image statistics from natural images (on which the networks have been trained) to sketches (which the networks have never seen before).
ImageNet can be solved to high accuracy using only local information. In other words, it might simply suffice to integrate evidence from many local texture features rather than going through the process of integrating and classifying global shapes.
The SIN-trained ResNet-50 shows a much stronger shape bias in the cue conflict experiment (Figure 5), which increases from 22% for an IN-trained model to 81%. In many categories, the shape bias is almost as strong as for humans.
Does the increased shape bias, and thus the shifted representations, also affect the performance or robustness of CNNs? In addition to the IN- and SIN-trained ResNet-50 architectures, the authors analyse two joint training schemes:
Authors compared these models with a vanilla ResNet-50 on these three experiments:
Could the robustness simply come from data augmentation?
Char-Net consists of a word-level encoder, a character-level encoder, and an LSTM-based decoder.
The authors wanted the word-level encoder (WLE) to produce a feature map that carries not only semantic information but also spatial information about each character. Since the feature maps of deeper CNN layers are semantically stronger but spatially coarser, the WLE stacks convolutional feature maps from different levels via several hyper-connections to increase spatial information.
HAM consists of the recurrent RoIWarp layer and character-level attention layer. Given the feature map \(F\) of the text image produced by the word-level encoder, HAM aims at producing a context vector \(z^t_c\) for predicting the label \(y^t\) of the character being considered at time step \(t\).
RLN is responsible for recurrently locating each character RoI. RLN uses the score map \(s^t\) to predict spatial information of each RoI.
\[\left(q_{x}^{t}, q_{y}^{t}, q_{w}^{t}, q_{h}^{t}\right)=\operatorname{MLP}_{l}\left(\mathrm{s}^{t}\right)\]As there is no direct supervision for RLN, it is hard to optimize RLN from scratch. Hence, the Authors pre-train a variant of the traditional attention mechanism to ease the difficulty in training the recurrent localization network. The generated weight set \(\alpha^{t}\) can be interpreted as an attention distribution over all the feature points in the convolutional feature map.
Instead of generating an unconstrained distribution by normalizing the relevancy score map, we model the attention distribution as a 2-D Gaussian distribution and calculate its parameters by
\[\left(\mu_{x}^{t}, \mu_{y}^{t}, \sigma_{x}^{t}, \sigma_{y}^{t}\right)=\operatorname{MLP}_{g}\left(\mathrm{s}^{t}\right)\]where \((\mu_x^t, \mu_y^t)\) and \((\sigma_x^t, \sigma_y^t)\) are the center and the standard deviations of the distribution, respectively.
Similar to the traditional attention mechanism, this Gaussian attention mechanism can be easily optimized in an end-to-end manner. The authors then directly use the parameters of the Gaussian attention mechanism to initialize our RLN.
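A rough sketch of what such a 2-D Gaussian attention map could look like over an H x W feature map (NumPy; the names and shapes are illustrative assumptions, not the paper's implementation):
import numpy as np

def gaussian_attention(mu_x, mu_y, sigma_x, sigma_y, H, W):
    """2-D Gaussian attention weights over an H x W feature map."""
    ys, xs = np.mgrid[0:H, 0:W]
    w = np.exp(-((xs - mu_x) ** 2 / (2 * sigma_x ** 2)
                 + (ys - mu_y) ** 2 / (2 * sigma_y ** 2)))
    return w / w.sum()                       # normalize so the weights sum to 1

# usage sketch: attention-weighted sum over a feature map F of shape (H, W, C)
# z = (gaussian_attention(mu_x, mu_y, sx, sy, H, W)[..., None] * F).sum(axis=(0, 1))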
This layer targets cropping out the character of interest at each time step and warping it into a fixed size of \(W_c \times H_c \times C_f\).
CLA takes the responsibility of selecting the most relevant features from the rectified character feature map produced by the character-level encoder to generate the context vector \(z_c^t\).
CLA is essential because it is difficult for the recurrent RoIWarp layer to precisely crop out a small feature region that contains features only from the corresponding character. Even if the RLN predicts the warping region perfectly, the distortion exhibited in the scene text would cause the warped feature region to also include features from neighboring characters.
From their experiments, the authors find that features from neighboring characters and cluttered backgrounds would mislead the parameter updates during training if they did not employ CLA, and this would prevent them from training Char-Net in an end-to-end manner.
In the training process, we minimize the sum of the negative log-likelihood of Eq. (2) over the whole training dataset. During testing, we directly pick the label with the highest probability in Eq. (3) as the output in each decoding step.