Generally, a home connection gets you a public IP, but it is not static. (The IP rarely changes, but every once in a while it does.) To deal with this, many people set up DDNS, usually by signing up for no-ip and configuring it on their router, or by using the iptime domain that iptime routers provide. On top of that, no-ip requires you to confirm the registration by email every 30 days.
I didn't like either of these options, so I looked into running DDNS from a personal server behind my router. The Cloudflare documentation mentions ddclient, which looked good, so I set it up.
There isn't much to it, really. Go to https://github.com/ddclient/ddclient and either build from source or install it with a package manager. It is packaged for quite a few package managers, so installation was easy. I just ran
sudo apt install ddclient
Then I checked the service with
sudo systemctl status ddclient.service
If it comes up without warnings, you are done. In my case the zone= entry was missing from the config, so a warning was being printed (related issue). So I opened /etc/ddclient.conf directly, added the zone, and enabled the service:
sudo systemctl enable ddclient.service
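For reference, a minimal sketch of what the Cloudflare part of /etc/ddclient.conf might look like; every value below is a placeholder, and the exact keys can differ between ddclient versions, so check the ddclient and Cloudflare docs for your setup:
# /etc/ddclient.conf (sketch; all values are placeholders)
use=web                        # discover the current public IP via a web service
protocol=cloudflare
zone=example.com               # the zone= entry that was missing in my case
login=you@example.com          # or whatever your ddclient version expects for API tokens
password=your-cloudflare-api-token
home.example.com               # the DNS record to keep updated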
To use C and C++ in vim I had set up coc.clangd. But coc.clangd started throwing errors on basic things like
#include <cstring>
so I started troubleshooting.
Since g++ was already installed, I almost wrote it off as a non-issue, but installing g++-12 solved it nicely; presumably clangd could not find a usable set of C++ standard library headers, and the newer g++ brought in headers it could pick up.
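On Debian/Ubuntu that came down to the following (assuming the g++-12 package exists for your release):
sudo apt install g++-12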
https://stackoverflow.com/a/29821538
When you push to GitHub and do similar work, you have to authenticate with an SSH key. If you mostly work on a remote server, this authentication can be awkward, and sometimes you have no choice but to keep a key on the server. On a shared server, though, it is better not to leave keys there, and with ssh-agent forwarding you can use the SSH key on your local machine from the server as well. GitHub has a detailed article explaining this at this link. That article also covers troubleshooting, but I want to summarize it briefly here.
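Forwarding itself is turned on from the local machine, either ad hoc with ssh -A server or per host in ~/.ssh/config; the host names below are placeholders:
Host myserver
    HostName server.example.com
    ForwardAgent yes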
One thing to be careful about: since this literally forwards your local ssh-agent, you must not start another ssh-agent after forwarding. If you do, $SSH_AUTH_SOCK, which was pointing at the forwarded agent, gets overwritten and the forwarding becomes useless.
This happens especially easily with multiple-hop ssh agent forwarding, and particularly when your shell rc starts ssh-agent.
To prevent this, check whether $SSH_AUTH_SOCK is already set, and only start ssh-agent when it is not.
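A minimal sketch of that guard (the systemd-based variant I actually use appears further down):
if ! test "$SSH_AUTH_SOCK" ; then
  eval "$(ssh-agent -s)"
fi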
You can check by seeing whether
echo $SSH_AUTH_SOCK
prints anything. Also,
ssh-add -L
should print your public key(s).
Run
ssh-add
to register the keys under ~/.ssh with ssh-agent automatically, or
ssh-add -K path/to/private_key
to register a private key in another location with ssh-agent and the keychain. After that,
ssh-add -L
should again print the public key.
As the troubleshooting article also points out, ssh-agent has to be running at all times for ssh agent forwarding to work reliably. So in an environment where you can use systemd, a user-level systemd service can keep ssh-agent running even after a reboot, or after your session ends and you reconnect.
Create a new file at ~/.config/systemd/user/ssh-agent.service with the following contents:
[Unit]
Description=SSH key agent

[Service]
Type=simple
Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
ExecStart=/usr/bin/ssh-agent -D -a $SSH_AUTH_SOCK

[Install]
WantedBy=default.target
Then run:
systemctl --user daemon-reload
systemctl --user enable --now ssh-agent
With this, the agent listens on a socket file at ${XDG_RUNTIME_DIR}/ssh-agent.socket, which is what SSH_AUTH_SOCK should point to.
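A quick way to check that the agent and its socket are actually there (assuming a systemd-based system):
systemctl --user status ssh-agent
ls -l "${XDG_RUNTIME_DIR}/ssh-agent.socket"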
As mentioned above, to avoid the SSH agent overwrite issue, it is a good idea to add the following to your shell rc (.bashrc, .zshrc, etc.).
This snippet points SSH_AUTH_SOCK at the socket of the agent we set up above, but only when SSH_AUTH_SOCK is not already set.
if ! test "$SSH_AUTH_SOCK" ; then
  export SSH_AUTH_SOCK="${XDG_RUNTIME_DIR}/ssh-agent.socket"
fi
The reason ssh agent forwarding breaks inside tmux is that tmux outlives the ssh session, so it keeps an SSH_AUTH_SOCK value that points at the socket of an ssh session that has already died. The fix is to symlink the temp file that SSH_AUTH_SOCK points to into your home directory, and have tmux look at that symlink instead.
Add the following to ~/.ssh/rc:
# Fix SSH auth socket location so agent forwarding works with tmux
if test "$SSH_AUTH_SOCK" ; then
  ln -sf $SSH_AUTH_SOCK ~/.ssh/ssh_auth_sock
fi
(Edited 2024-03-19) The shell rc snippet below checks ~/.ssh/ssh_auth_sock, SSH_AUTH_SOCK, and XDG_RUNTIME_DIR, in that order, and sets SSH_AUTH_SOCK accordingly.
if test -e "$(readlink -f $HOME/.ssh/ssh_auth_sock)" ; then
  # The tmux-friendly symlink exists and points at a live socket: use it
  export SSH_AUTH_SOCK="$HOME/.ssh/ssh_auth_sock"
elif ! test "$SSH_AUTH_SOCK" ; then
  # No forwarded agent: fall back to the user-level ssh-agent service
  export SSH_AUTH_SOCK="${XDG_RUNTIME_DIR}/ssh-agent.socket"
fi
The authors identify that adversarial robustness affects transfer learning performance.
Despite being less accurate on ImageNet, adversarially robust neural networks match or improve on the transfer performance of their standard counterparts.
They establish this trend in both the “fixed-feature” setting, in which one trains a linear classifier on top of features extracted from a pre-trained network, and the “full-network” setting, in which the pre-trained model is entirely fine-tuned on the relevant downstream task.
Prior works suggest that accuracy on the source dataset is a strong indicator of performance on downstream tasks.
Still, it is unclear if improving ImageNet accuracy is the only way to improve performance. After all, the behavior of fixed-feature transfer is governed by models’ learned representations, which are not fully described by source-dataset accuracy.
These representations are, in turn, controlled by the priors that we put on them during training.
Adversarial robustness refers to a model’s invariance to small (often imperceptible) perturbations of its inputs.
Robustness is typically induced at training time by replacing the standard empirical risk minimization objective with a robust optimization objective:
\[\min _{\theta} \mathbb{E}_{(x, y) \sim D}[\mathcal{L}(x, y ; \theta)] \Longrightarrow \min _{\theta} \mathbb{E}_{(x, y) \sim D}\left[\max _{\|\delta\|_{2} \leq \varepsilon} \mathcal{L}(x+\delta, y ; \theta)\right]\]where $\theta$ are the model parameters, $\mathcal{L}$ is the loss function, and $(x, y) \sim D$ are training image-label pairs.
Rather than minimizing the loss on the training points, this objective minimizes the worst-case loss over an $\varepsilon$-ball around each training point.
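A rough sketch of how this inner maximization is commonly approximated with projected gradient descent (PGD) in PyTorch; this is generic adversarial training, not the authors' exact code, and the batch shape, eps, step size, and iteration count are illustrative assumptions:
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps=3.0, step=0.5, iters=7):
    """Approximate max over ||delta||_2 <= eps of L(x + delta, y) with a few PGD steps."""
    delta = torch.zeros_like(x)                       # x assumed to be a 4-D image batch
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # gradient *ascent* step, normalized per example
            g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
            delta = delta + step * grad / g_norm
            # project back into the L2 ball of radius eps
            d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = delta * (eps / d_norm).clamp(max=1.0)
    return (x + delta).detach()

# robust training step (sketch): train on the perturbed inputs instead of the clean ones
# x_adv = pgd_l2(model, x, y)
# loss = F.cross_entropy(model(x_adv), y); loss.backward(); optimizer.step()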
On one hand, robustness to adversarial examples may seem somewhat tangential to transfer performance. In fact, adversarially robust models are known to be significantly less accurate than their standard counterparts, suggesting that using adversarially robust feature representations should hurt transfer performance.
On the other hand, recent work has found that the feature representations of robust models carry several advantages over those of standard models. For example, adversarially robust representations have better-behaved gradients, and they are approximately invertible, meaning that an image can be approximately reconstructed directly from its robust representation. Engstrom et al. hypothesize that the robust training objective leads to feature representations that are more similar to what humans use.
Robust models match or improve on the transfer learning performance of standard ones.
In this section, we take a closer look at the similarities and differences in transfer learning between robust networks and standard networks.
The authors hypothesize that robustness and accuracy have counteracting yet separate effects. In other words, higher accuracy at a fixed robustness level and higher robustness at a fixed accuracy level both improve transfer learning.
To test this hypothesis, they first study the relationship between ImageNet accuracy and transfer accuracy for each of the robust models that they trained. They find that the previously observed linear relationship between accuracy and transfer performance is often violated once the robustness aspect comes into play (Figure 5).
They also find that when the robustness level is held fixed, the accuracy-transfer correlation observed by prior works for standard models holds for robust models too. Table 2 shows that for these models, improving ImageNet accuracy improves transfer performance at around the same rate as for standard models.
⇒ Transfer learning performance can be further improved by applying known techniques that increase the accuracy of robust models. Accuracy alone is not sufficient for measuring feature quality or versatility. But for now we do not know why robust networks transfer well.
The authors observe that although the best robust models often outperform the best standard models, the optimal choice of the robustness parameter $\epsilon$ varies widely between datasets. They explain that this variability might relate to dataset granularity. They hypothesize that on datasets where leveraging finer-grained features is necessary, the most effective values of $\epsilon$ will be much smaller than on datasets where leveraging coarser-grained features suffices.
Although we lack a quantitative notion of granularity (in reality, features are not simply singular pixels), the authors consider image resolution as a crude proxy. They attempt to calibrate the granularities of the 12 image classification datasets used in this work by first downscaling all the images to the size of CIFAR-10 (32 x 32) and then upscaling them to ImageNet size again. They then repeat the fixed-feature regression experiments from the prior sections, plotting the results in Figure 7. After controlling for original dataset dimension, the datasets' epsilon vs. transfer accuracy curves all behave almost identically to the CIFAR-10 and CIFAR-100 ones.
Figure 8b shows that transfer learning from adversarially robust models outperforms transfer learning from texture-invariant models on all considered datasets.
Figure 8a shows that robust models outperform standard ImageNet models when evaluated (top) or fine-tuned (bottom) on Stylized-ImageNet.
>>> [] and {}
[] # what???
Python's and and or operators don't necessarily return True or False; they return one of their operands.
A or B   # is roughly equal to
A if A else B
# and also
A and B  # is roughly equal to
B if A else A
# (except that with or/and, A is evaluated only once)
Note that neither and nor or restrict the value and type they return to False and True, but rather return the last evaluated argument.
ref: https://docs.python.org/3/reference/expressions.html#boolean-operations
>>> def f(x):
... print(x)
... return x
...
>>> 1 > f(2) > f(3)
2
False
>>> 1 < f(2) < f(3)
2
3
True
>>> 1 > f(2) and f(2) > f(3)
2
False
>>> 1 < f(2) and f(2) < f(3)
2
2
3
True
>>> a = 2
>>> a > 1 == a > 1
False
>>> a < 3 == True
False
>>> 1 < 3 is True
False
Comparisons can be chained arbitrarily, e.g., x < y <= z is equivalent to x < y and y <= z, except that y is evaluated only once (but in both cases z is not evaluated at all when x < y is found to be false). Formally, if a, b, c, …, y, z are expressions and op1, op2, …, opN are comparison operators, then a op1 b op2 c ... y opN z is equivalent to a op1 b and b op2 c and ... y opN z, except that each expression is evaluated at most once.
ref: https://docs.python.org/3/reference/expressions.html#comparisons
Let's think about a situation where you need to encode the characters A, B, C, D, which appear stochastically in a sentence, as bits. You can simply encode each character with 2 bits, for example "00" for A, "01" for B, "10" for C, "11" for D. If every character has the same probability (\(\text{P}(A) = \text{P}(B) = \text{P}(C) = \text{P}(D) = 1/4\)), this encoding is optimal: you use 2 bits per character on average (2 * 1/4 * 4).
But how about \(\text{P}(A) = 1/2, \text{P}(B) = 1/4, \text{P}(C) = \text{P}(D) = 1/8\)? If you use the same scheme, you will still use 2 bits per character on average (2 * 1/2 + 2 * 1/4 + 2 * 1/8 * 2 = 2 * (1/2 + 1/4 + 1/8 + 1/8)). However, this is not the optimal scheme for encoding the four characters. Since A is used more frequently than the others, encoding A with fewer bits lets you use fewer bits on average. So with "1" for A, "01" for B, "000" for C, "001" for D, 1.75 bits are used on average (1 * 1/2 + 2 * 1/4 + 3 * 1/8 * 2 = 1.75). You can still decode such a bit stream unambiguously, because no codeword is a prefix of another: read bits until they match one of the codewords, emit that character, and continue.
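A small sketch to sanity-check the arithmetic and the decoding, using the probabilities and codewords from the example above:
import math

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "1", "B": "01", "C": "000", "D": "001"}

# average code length: 1*1/2 + 2*1/4 + 3*1/8 + 3*1/8 = 1.75 bits
avg_len = sum(probs[c] * len(code[c]) for c in probs)
# entropy: -sum p*log2(p) = 1.75 bits, so this code is optimal for these probabilities
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)          # 1.75 1.75

# prefix-free decoding: the first codeword that matches is the right one
inv = {v: k for k, v in code.items()}
def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(decode("101000001"))       # -> "ABCD"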
The cross-entropy of the distribution \(q\) relative to distribution \(p\) over a given set is defined as follows:
\[\text{H}(p,q) = \text{E}_p[l] = - \text{E}_p[\text{log}_2(q)] = - \sum_{x \in X} p(x) \text{log}_2(q(x)) = \text{H}(p) + D_{KL}(p \Vert q) \tag{1}\]You can think of cross-entropy as the average code length you get when you apply a coding scheme that is optimal for probability distribution \(q\) (\(l_i = - \text{log}_2(q(x_i)) \Leftrightarrow q(x_i) = (\frac{1}{2})^{l_i}\)) to data that actually follows probability distribution \(p\), where \(l_i\) is the length in bits of the codeword for the i-th symbol.
Kullback–Leibler divergence (KL divergence) can be thought of as a measure of how far the distribution \(q\) is from the distribution \(p\).
\[D_{KL}(p \Vert q) = \sum_{x \in X} p(x)\text{log}(\frac{p(x)}{q(x)}) = - \sum_{x \in X}p(x)\text{log}q(x) - (- \sum_{x \in X}p(x)\text{log}p(x))\\ = \text{H}(p,q) - \text{H}(p)\]In deep learning, \(p\) is the data (label) distribution and \(q\) is the neural network's output distribution. Making the cross-entropy loss smaller means making the KL divergence \(D_{KL}(p \Vert q)\) smaller, because \(\text{H}(p)\) is a fixed value.
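A quick numeric sketch of the identity \(\text{H}(p,q) = \text{H}(p) + D_{KL}(p \Vert q)\), reusing the distributions from the coding example above (plain Python, just for illustration):
import math

p = [1/2, 1/4, 1/8, 1/8]    # "true" distribution
q = [1/4, 1/4, 1/4, 1/4]    # distribution the code / model assumes

H_p  = -sum(pi * math.log2(pi) for pi in p)                    # 1.75 bits
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))        # 2.0 bits
D_kl =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))   # 0.25 bits

assert abs(H_pq - (H_p + D_kl)) < 1e-12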
If $p$ is probability..
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. The normalization function could be the sigmoid function in binary classification or the softmax function in multi-class classification.
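For instance, a minimal softmax in plain Python/NumPy (illustrative only):
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative, sums to 1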
Lecture 1 by Bolei Zhou: Exploring and Exploiting Interpretable Semantics in GANs. video, slide, bili
Lecture 2 by Zeynep Akata: Modeling Conceptual Understanding in Image Reference Games video, slide, bili
Lecture 3 by Ruth C. Fong: Understanding Deep Neural Networks video, slide, bili
Lecture 4 by Christopher Olah: Introduction to Circuits in CNNs. video, slide, bili
→ controlling semantics; these approaches use a pre-trained classifier (supervised)
some unsupervised ways… (I can’t understand)
GAN inversion
want to generate an image (encode a latent vector) of an unseen image domain
→ does not work well with unseen image domains (Asian faces, non-face images, etc.) (there is no constraint that the encoded latent vector should lie in the original latent domain)
Interpretability tools are crucial for high-impact, high-risk applications of deep learning.
supervised deep learning: Inputs (what is the model looking at) + Internal Representation (what and how does the model encode) + Training Procedure (how can we improve the model)
We want the model not to cheat; we want the model to gain intuition from the dataset.
→ But datasets have biases. A classifier may fail to gain that intuition and just cheat on the dataset.
→ Foreground evidence is usually sufficient / Large objects are recognized by their details / multiple objects contribute cumulatively / suppressing the background may overdrive the network
Adversarial Defense
right graph: networks trained to take a heatmap as input and predict whether the heatmap comes from a clean or an adversarial image can detect this properly and can even recover the original label.
See the video for the details
→ Adversarial Defense is possible using heatmap!
⇒ How to use?
Two main ways to view intermediate activations
How do groups of channels work together to encode? ← this is similar to how neuroscientists often study not just a single neuron in the brain but coordinated collections or populations of neurons
What groups of channels are responsible for the model's decision?
one filter might be packed with multiple concepts, and one concept might be encoded using multiple filters
reverse engineering NNs! ← only a small fraction of interpretability work targets this
What does it mean to understand an NN?
→ Chris thinks understanding an NN is akin to understanding the variables or registers of a computer program when reverse engineering it.
The weights are the actual “assembly code” of our model!
Reverse engineer a NN in two steps!
Neurons combine together to make new detectors
over five layers, building up to curve detectors.
(Wrote a small python program that filled in the weights of a neural network)
ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. The authors then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on 'Stylized-ImageNet', a stylized version of ImageNet.
→ ImageNet-trained CNNs concentrate on texture, and this can be addressed with Stylized-ImageNet.
Concentrating on shape rather than texture is a much better fit for human behavioural performance and has benefits such as improved object detection performance and robustness towards a wide range of image distortions.
In this work authors aim to shed light on this debate with a number of carefully designed yet relatively straightforward experiments.
shape hypothesis (Context)
texture hypothesis (Locality)
→ It seems that local textures indeed provide sufficient information about object classes: ImageNet object recognition could, in principle, be achieved through texture recognition alone. Object textures are more important than global object shapes for CNN object recognition.
Experiments
Utilizing style transfer (Gatys et al., 2016), the authors created images with a texture-shape cue conflict such as Figure 1c. They perform nine comprehensive and careful psychophysical experiments comparing humans against CNNs on exactly the same images, totaling 48,560 psychophysical trials across 97 observers.
→ These experiments provide behavioral evidence in favor of the texture hypothesis: A cat with an elephant texture is an elephant to CNNs, and still a cat to humans.
Contributions
Beyond quantifying existing biases, the authors subsequently present results for their two other main contributions: changing biases, and discovering emergent benefits of changed biases. They show that the texture bias in standard CNNs can be overcome and changed towards a shape bias if the network is trained on a suitable dataset. Networks with a higher shape bias are more robust and reach higher performance on classification and object recognition tasks.
Models are ImageNet-pretrained networks.
Variable definitions
It is important to note that the authors only selected object and texture images that are correctly classified by all four networks.
Images generated using iterative style transfer (Gatys et al., 2016) between an image of the Texture data set (as style) and an image from the original data set (as content). The authors generated a total of 1280 cue conflict images (80 per category), which allows for presentation to human observers within a single experimental session.
Starting from ImageNet authors constructed a new data set (termed Stylized-ImageNet or SIN) by stripping every single image of its original texture and replacing it with the style of a randomly selected painting through AdaIN style transfer (Huang & Belongie, 2017) (see examples in Figure 3) with a stylization coefficient of \(\alpha\) = 1.0.
Stylized-ImageNet is produced differently from the cue-conflict images.
One confound in the five simple recognition tasks is that CNNs tend not to cope well with domain shifts, i.e. the large change in image statistics from natural images (on which the networks have been trained) to sketches (which the networks have never seen before).
ImageNet can be solved to high accuracy using only local information. In other words, it might simply suffice to integrate evidence from many local texture features rather than going through the process of integrating and classifying global shapes.
The SIN-trained ResNet-50 shows a much stronger shape bias in the cue conflict experiment (Figure 5), which increases from 22% for an IN-trained model to 81%. In many categories, the shape bias is almost as strong as for humans.
Does the increased shape bias, and thus the shifted representations, also affect the performance or robustness of CNNs? In addition to the IN- and SIN-trained ResNet-50 architectures, the authors analyse two joint training schemes:
Authors compared these models with a vanilla ResNet-50 on these three experiments:
Could the robustness simply come from data augmentation?
Char-Net consists of a word-level encoder, a character-level encoder, and an LSTM-based decoder.
The authors wanted the word-level encoder (WLE) to produce a feature map that carries not only semantic information but also spatial information about each character. Since the feature maps of deeper CNN layers are semantically stronger but spatially coarser, the WLE stacks convolutional feature maps from different levels via several hyper-connections to increase spatial information.
HAM consists of the recurrent RoIWarp layer and character-level attention layer. Given the feature map \(F\) of the text image produced by the word-level encoder, HAM aims at producing a context vector \(z^t_c\) for predicting the label \(y^t\) of the character being considered at time step \(t\).
RLN is responsible for recurrently locating each character RoI. RLN uses the score map \(s^t\) to predict spatial information of each RoI.
\[\left(q_{x}^{t}, q_{y}^{t}, q_{w}^{t}, q_{h}^{t}\right)=\operatorname{MLP}_{l}\left(\mathrm{s}^{t}\right)\]As there is no direct supervision for RLN, it is hard to optimize RLN from scratch. Hence, the Authors pre-train a variant of the traditional attention mechanism to ease the difficulty in training the recurrent localization network. The generated weight set \(\alpha^{t}\) can be interpreted as an attention distribution over all the feature points in the convolutional feature map.
Instead of generating an unconstrained distribution by normalizing the relevancy score map, we model the attention distribution as a 2-D Gaussian distribution and calculate its parameters by
\[\left(\mu_{x}^{t}, \mu_{y}^{t}, \sigma_{x}^{t}, \sigma_{y}^{t}\right)=\operatorname{MLP}_{g}\left(\mathrm{s}^{t}\right)\]where \((\mu_x^t, \mu_y^t)\) and \((\sigma_x^t, \sigma_y^t)\) are the center and the standard deviations of the distribution, respectively.
Similar to the traditional attention mechanism, this Gaussian attention mechanism can be easily optimized in an end-to-end manner. The authors then directly use the parameters of the Gaussian attention mechanism to initialize our RLN.
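A rough sketch of what such a 2-D Gaussian attention map could look like over an H x W feature map (NumPy; the names and shapes are illustrative assumptions, not the paper's implementation):
import numpy as np

def gaussian_attention(mu_x, mu_y, sigma_x, sigma_y, H, W):
    """2-D Gaussian attention weights over an H x W feature map."""
    ys, xs = np.mgrid[0:H, 0:W]
    w = np.exp(-((xs - mu_x) ** 2 / (2 * sigma_x ** 2)
                 + (ys - mu_y) ** 2 / (2 * sigma_y ** 2)))
    return w / w.sum()                       # normalize so the weights sum to 1

# usage sketch: attention-weighted sum over a feature map F of shape (H, W, C)
# z = (gaussian_attention(mu_x, mu_y, sx, sy, H, W)[..., None] * F).sum(axis=(0, 1))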
This layer targets cropping out the character of interest at each time step and warping it into a fixed size of \(W_c \times H_c \times C_f\).
CLA takes the responsibility of selecting the most relevant features from the rectified character feature map produced by the character-level encoder to generate the context vector \(z_c^t\).
CLA is essential because it is difficult for the recurrent RoIWarp layer to precisely crop out a small feature region that contains features only from the corresponding character. Even if the RLN predicts the warping region perfectly, the distortion exhibited in the scene text would cause the warped feature region to also include features from neighboring characters.
From their experiments, the authors find that features from neighboring characters and cluttered backgrounds would mislead the parameter updates during training if they did not employ CLA, and this would prevent them from training Char-Net in an end-to-end manner.
In the training process, we minimize the sum of the negative log-likelihood of Eq. (2) over the whole training dataset. During testing, we directly pick the label with the highest probability in Eq. (3) as the output in each decoding step.