<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[格物治用]]></title><description><![CDATA[Practice, exploration, reflection]]></description><link>https://blog.approachai.com/</link><image><url>https://blog.approachai.com/favicon.png</url><title>格物治用</title><link>https://blog.approachai.com/</link></image><generator>Ghost 3.7</generator><lastBuildDate>Fri, 17 Apr 2026 07:14:03 GMT</lastBuildDate><atom:link href="https://blog.approachai.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Versioning in Main-Memory Database Systems: From MusaeusDB to TardisDB]]></title><description><![CDATA[Website, blog post, and paper recommendations and reviews]]></description><link>https://blog.approachai.com/versioning-in-main-memory-database-systems-from-musaeusdb-to-tardisdb/</link><guid isPermaLink="false">61aef6264cea2827f8d001bc</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Tue, 07 Dec 2021 05:50:30 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The main contribution is extending OrpheusDB to multiple tables. In addition, it implements multi-versioning for a main-memory database on top of MVCC.</p>
<p><a href="https://dl.acm.org/doi/10.1145/3335783.3335792">https://dl.acm.org/doi/10.1145/3335783.3335792</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[OrpheusDB: Bolt-on Versioning for Relational Databases]]></title><description><![CDATA[Website, blog post, and paper recommendations and reviews]]></description><link>https://blog.approachai.com/orpheusdb-bolt-on-versioning-for-relational-databases/</link><guid isPermaLink="false">61aebb0a4cea2827f8d0018f</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Tue, 07 Dec 2021 01:38:18 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="listidlistlistkvsql">A version-branching database built on a conventional RDBMS</h2><p>The paper analyzes several data models for versioned storage: record plus version number (or timestamp); record plus version list; and record id plus record plus a separate version table (organized either as a version list or as a record list). Its distinguishing feature is being built on an existing relational database (not a KV store), so transactions, access control, and SQL querying need no extra work.</p>
<p><a href="https://arxiv.org/abs/1703.02475">https://arxiv.org/abs/1703.02475</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Housekeeping for Paper: Erasure Coding and Read/Write Separation in Flash Storage]]></title><description><![CDATA[Performance consistency in erasure-coded flash arrays]]></description><link>https://blog.approachai.com/jiu-shan/</link><guid isPermaLink="false">6160f0f14cea2827f8cff73f</guid><dc:creator><![CDATA[Bao Jianguo]]></dc:creator><pubDate>Sat, 09 Oct 2021 02:22:14 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The paper addresses performance inconsistency in flash arrays.<br>
The earlier Flash on Rails design used replicas to separate reads from writes and so achieve consistent performance.<br>
Building on Flash on Rails, this paper studies performance consistency under erasure coding combined with read/write separation.<br>
The main finding is that in medium-sized arrays, although EC computation overhead grows quickly with array size, it has no significant impact.<br>
For large arrays it does matter, so the authors, drawing on CRUSH, propose an overlapping logical redundancy group scheme, represented as a hypergraph, which generates small sub-arrays to keep EC computation cheap while preserving read/write separation.<br>
The write side of read/write separation is not studied; the authors only suggest batching writes, e.g., via journaling.</p>
<p>On read/write separation with two replica drives: reads normally go through the cache and then to SSD1, while writes in the current period go to the cache and the previous period's cached writes are flushed to SSD2. After a fixed interval, the SSDs swap read/write roles. With EC across multiple drives, the scheme is analogous.</p>
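The periodic role swap described above can be sketched as a toy shell simulation (the names `ssd1`/`ssd2` and the three periods are illustrative, not from the paper):

```shell
# Toy simulation of the periodic role swap: in each period, reads are served
# from one SSD while the other absorbs the flushed cached writes; at the
# period boundary the two SSDs exchange roles.
read_ssd="ssd1"; write_ssd="ssd2"
for period in 1 2 3; do
  echo "period $period: reads -> $read_ssd, flush cached writes -> $write_ssd"
  tmp=$read_ssd; read_ssd=$write_ssd; write_ssd=$tmp   # swap roles
done
```

After an odd number of periods the roles are reversed, which is why a read never has to wait behind a write on the same device.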
<p><a href="https://www.usenix.org/system/files/conference/inflow14/inflow14-skourtis.pdf">https://www.usenix.org/system/files/conference/inflow14/inflow14-skourtis.pdf</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Setting up Vim, and the problems along the way]]></title><description><![CDATA[Setting up Vim, and the problems along the way]]></description><link>https://blog.approachai.com/vim-da-jian-yu-wen-ti/</link><guid isPermaLink="false">5e73599d4cea2827f8cf4b9e</guid><category><![CDATA[Personal Development]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Thu, 02 Jul 2020 08:38:11 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id>Introduction</h1>
<p>This post covers the plugins I commonly use and the problems I ran into.</p>
<h2 id="vimpythonxmodulenotfounderrornomodulenamedneovim">Vim(pythonx):ModuleNotFoundError: No module named 'neovim'</h2>
<p>After installing the completion plugin deoplete.nvim, Vim reported this error on every start.<br>
Running :echo neovim_rpc#serveraddr() inside Vim gives a more detailed error:</p>
<pre><code>failed executing: pythonx import neovim
Vim(pythonx):ModuleNotFoundError: No module named 'neovim'
</code></pre>
<h3 id>Searching the web</h3>
<p>Most answers describe a simple install: on Python 3, pip3 install pynvim is enough. But my macOS Python setup is currently a mess: some versions are managed by pyenv, a virtualenv was created with python -m venv appenv on top of the system default Python 3, and further versions come from brew. Installing the module with the default python still left Vim failing.</p>
<ol>
<li>pyenv<br>
Although pyenv is installed, it is not currently active, as the following commands show.</li>
</ol>
<pre><code>$ pyenv versions
* system (set by ~/.pyenv/version)
  3.5.7
  3.6.8
  3.7.3
$ pyenv which python2   # see the next section: the shell's default python comes from brew's Cellar
/usr/local/bin/python2
$ pyenv which python3
/Library/Frameworks/Python.framework/Versions/3.7/bin/python3
$ pyenv which python3.7
/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7
# no other python versions are present
</code></pre>
<p>pyenv initialization is configured in ~/.bash_profile.</p>
<ol start="2">
<li>Where the shell's default python comes from</li>
</ol>
<pre><code># ~/.bash_profile sets the default python PATH to /Library/Frameworks/Python.framework/Versions/3.7/bin, currently Python 3.7.1
$ which python2   # the default python2 comes from brew; python2.7 resolves the same way
/usr/local/bin/python2
$ ls -lah /usr/local/bin/python2
lrwxr-xr-x  1 jg  admin    39B Feb 19  2019 /usr/local/bin/python2 -&gt; ../Cellar/python@2/2.7.15_3/bin/python2

$ which python3   # the default python3 is the system 3.7.1, resolved through the pyenv shim
/Users/jg/.pyenv/shims/python3
$ ls -lah /Users/jg/.pyenv/shims/python3
-rwxr-xr-x  1 jg  staff   415B Sep  6  2019 /Users/jg/.pyenv/shims/python3
$ python3 --version
Python 3.7.1
$ pyenv which python3   # the system default 3.7.1; the only Python 3 under /Library
/Library/Frameworks/Python.framework/Versions/3.7/bin/python3
$ /Library/Frameworks/Python.framework/Versions/3.7/bin/python3 --version
Python 3.7.1

</code></pre>
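To see concretely why `which python3` resolves to the pyenv shim, here is a minimal PATH-ordering demo (the `shims`/`framework` directories and the `py3demo` name are made up for illustration):

```shell
# Whichever PATH entry appears first wins; a shims directory placed before
# the framework bin directory shadows it, exactly as in the output above.
demo=$(mktemp -d)
mkdir -p "$demo/shims" "$demo/framework"
printf '#!/bin/sh\necho shim\n' > "$demo/shims/py3demo"
printf '#!/bin/sh\necho framework\n' > "$demo/framework/py3demo"
chmod +x "$demo/shims/py3demo" "$demo/framework/py3demo"
PATH="$demo/shims:$demo/framework:$PATH"
command -v py3demo   # resolves to .../shims/py3demo
py3demo              # runs the shim, not the framework binary
```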
<ol start="3">
<li>macOS's own Python, not accessible by default<br>
The system python2 is a symlink into the /System/ directory, which carries only python2 packages; python3 is installed directly under /usr/bin.</li>
</ol>
<pre><code>$ ls -lah /usr/bin/python2
lrwxr-xr-x  1 root  wheel    75B Nov 13 23:07 /usr/bin/python2 -&gt; ../../System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7
$ /usr/bin/python2 --version
Python 2.7.16

$ /usr/bin/python3 --version   # who installed this?? pyenv, perhaps??
Python 3.7.3
$ ls -lah /usr/bin/python3
-rwxr-xr-x  1 root  wheel    31K Jan 23 22:13 /usr/bin/python3
</code></pre>
<ol start="4">
<li>Versions installed by brew<br>
Not on the default PATH.</li>
</ol>
<pre><code>$ ls /usr/local/Cellar/python*/
/usr/local/Cellar/python/:
3.7.4	3.7.5

/usr/local/Cellar/python3/:
3.7.4	3.7.5

/usr/local/Cellar/python@2/:
2.7.15_3
</code></pre>
<h3 id>Establishing the facts</h3>
<p>I installed the deoplete.nvim plugin into MacVim from brew, not Neovim.</p>
<pre><code>$ ls -lah `which vim`
lrwxr-xr-x  1 jg  admin    32B Nov 25 17:08 /usr/local/bin/vim -&gt; ../Cellar/macvim/8.1-161/bin/vim
# Which vimrc configuration files does it use? Confirm with:
$ vim --version
VIM - Vi IMproved 8.1 (2018 May 18, compiled Oct 30 2019 23:05:13)
macOS version
Included patches: 1-2234
Compiled by Homebrew
   system vimrc file: &quot;$VIM/vimrc&quot;
     user vimrc file: &quot;$HOME/.vimrc&quot;
 2nd user vimrc file: &quot;~/.vim/vimrc&quot;
      user exrc file: &quot;$HOME/.exrc&quot;
  system gvimrc file: &quot;$VIM/gvimrc&quot;
    user gvimrc file: &quot;$HOME/.gvimrc&quot;
2nd user gvimrc file: &quot;~/.vim/gvimrc&quot;
       defaults file: &quot;$VIMRUNTIME/defaults.vim&quot;
    system menu file: &quot;$VIMRUNTIME/menu.vim&quot;
  fall-back for $VIM: &quot;/Applications/MacVim.app/Contents/Resources/vim&quot;
</code></pre>
<p>Following various posts, I did set let g:python3_host_prog=&quot;PATH&quot; in vimrc, but the problem persisted.</p>
<h3 id>Rethinking</h3>
<p>The module was installed yet had no effect, so Vim is probably not using the Python configured in vimrc.<br>
Here vim --version was very instructive: the end of its output shows the link options used at build time, revealing that the Python library path is /usr/local/opt/python/Frameworks/Python.framework/Versions/3.7.<br>
So pynvim should be installed under that Python. With no pip on the PATH, use:</p>
<pre><code>/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/bin/python3  -m pip install pynvim
</code></pre>
<p>Note: brew upgrade tmux also upgraded vim, and opening files hit the same problem again; vim --version showed the Python version had changed. The same fix worked.</p>
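The detective step above can be scripted. This sketch parses a sample `vim --version` link line (the string is a stand-in for real output; the grep pattern assumes a Homebrew layout) to recover the interpreter that must receive pynvim:

```shell
# Extract the Python framework prefix from vim's build-time link flags;
# that prefix's bin/python3 is the interpreter pynvim must be installed into.
link_line='Linking: clang ... -L/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/config-3.7m-darwin -lpython3.7 ...'
py_home=$(echo "$link_line" | grep -o '/usr/local/opt/python/Frameworks/Python.framework/Versions/[0-9.]*' | head -1)
echo "$py_home"
# real usage: inspect `vim --version`, then run
#   "$py_home/bin/python3" -m pip install pynvim
```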
<h3 id>Success</h3>
<p>This problem had existed ever since a system upgrade a year earlier. It gave me a sense of defeat with Vim and kept me at basic usage. It also caused a very unpleasant side effect: with the bug present, after opening and then quitting Vim, scrolling back through iTerm's history showed the output of commands run just before Vim almost completely garbled and unsearchable.</p>
<p>I do not lack the ability to solve problems; what is alarming is meeting a problem and accepting the status quo instead of seriously solving it. That is not acceptable.</p>
<p>Open questions</p>
<ul>
<li>How to launch and debug Vim</li>
<li>Why a failing plugin garbles iTerm's scrollback</li>
</ul>
<h2 id="rust">Rust plugins</h2>
<h3 id="installvimracefordeopletevim">Install vim-racer for deoplete (Vim)</h3>
<ul>
<li>Install racer</li>
</ul>
<pre><code>https://github.com/racer-rust/racer
$ rustup toolchain add nightly
info: syncing channel updates for 'nightly-x86_64-apple-darwin'
464.0 KiB / 464.0 KiB (100 %) 199.5 KiB/s in  3s ETA:  0s
info: latest update on 2020-03-19, rust version 1.44.0-nightly (f509b26a7 2020-03-18)
info: downloading component 'cargo'
  3.8 MiB /   3.8 MiB (100 %) 128.0 KiB/s in 28s ETA:  0s
info: downloading component 'clippy'
  1.4 MiB /   1.4 MiB (100 %) 115.2 KiB/s in 12s ETA:  0s
info: downloading component 'rust-docs'
 12.1 MiB /  12.1 MiB (100 %)  22.4 KiB/s in  7m  6s ETA:  0s
info: downloading component 'rust-std'
 16.4 MiB /  16.4 MiB (100 %)  25.6 KiB/s in  7m 42s ETA:  0s
info: downloading component 'rustc'
 56.0 MiB /  56.0 MiB (100 %)   1.3 MiB/s in  5m  0s ETA:  0s
info: downloading component 'rustfmt'
  2.2 MiB /   2.2 MiB (100 %)   1.3 MiB/s in  1s ETA:  0s
info: installing component 'cargo'
info: installing component 'clippy'
info: installing component 'rust-docs'
 12.1 MiB /  12.1 MiB (100 %)   2.9 MiB/s in  3s ETA:  0s
info: installing component 'rust-std'
 16.4 MiB /  16.4 MiB (100 %)  15.5 MiB/s in  1s ETA:  0s
info: installing component 'rustc'
 56.0 MiB /  56.0 MiB (100 %)  10.0 MiB/s in  5s ETA:  0s
info: installing component 'rustfmt'

  nightly-x86_64-apple-darwin installed - rustc 1.44.0-nightly (f509b26a7 2020-03-18)

info: checking for self-updates
$ cargo +nightly install racer
http_proxy='http://192.168.1.1:8123' https_proxy='http://192.168.1.1:8123' cargo +nightly install racer

    Updating crates.io index
  Downloaded racer v2.1.31
  Downloaded 1 crate (182.1 KB) in 22.27s
  Installing racer v2.1.31
  Downloaded derive_more v0.99.3
  Downloaded rls-span v0.5.2
  Downloaded lazycell v1.2.1
  Downloaded racer-cargo-metadata v0.1.1
  Downloaded env_logger v0.7.1
  Downloaded clap v2.33.0
  Downloaded rustc-ap-rustc_ast_pretty v642.0.0
  Downloaded rustc-ap-rustc_data_structures v642.0.0
  Downloaded rustc-ap-rustc_errors v642.0.0
  Downloaded rustc-ap-rustc_session v642.0.0
  Downloaded rustc-ap-rustc_span v642.0.0
  Downloaded rustc-ap-syntax v642.0.0
  Downloaded humantime v2.0.0
  Downloaded rustc-ap-rustc_parse v642.0.0
  Downloaded termcolor v1.1.0
  Downloaded racer-interner v0.1.0
  Downloaded ansi_term v0.11.0
  Downloaded termize v0.1.1
  Downloaded rustc-rayon-core v0.3.0
  Downloaded strsim v0.8.0
  Downloaded rustc-ap-graphviz v642.0.0
  Downloaded crossbeam-utils v0.6.6
  Downloaded rustc-hash v1.1.0
  Downloaded stable_deref_trait v1.1.1
  Downloaded rustc-ap-rustc_index v642.0.0
  Downloaded rustc-ap-serialize v642.0.0
  Downloaded rustc-ap-rustc_feature v642.0.0
  Downloaded libc v0.2.68
  Downloaded rustc-ap-rustc_fs_util v642.0.0
  Downloaded humantime v1.3.0
  Downloaded crossbeam-queue v0.1.2
  Downloaded serde v1.0.105
  Downloaded measureme v0.7.1
  Downloaded parking_lot v0.9.0
  Downloaded ena v0.13.1
  Downloaded rustc-ap-arena v642.0.0
  Downloaded jobserver v0.1.21
  Downloaded scoped-tls v1.0.0
  Downloaded rustc-ap-rustc_macros v642.0.0
  Downloaded rustc-ap-rustc_lexer v642.0.0
  Downloaded rustc-ap-rustc_attr v642.0.0
  Downloaded quick-error v1.2.3
  Downloaded crossbeam-deque v0.7.3
  Downloaded rustc-ap-rustc_target v642.0.0
  Downloaded synstructure v0.12.3
  Downloaded annotate-snippets v0.6.1
  Downloaded maybe-uninit v2.0.0
  Downloaded rustc-rayon v0.3.0
  Downloaded itertools v0.8.2
  Downloaded lock_api v0.3.3
  Downloaded serde_derive v1.0.105
  Downloaded memmap v0.7.0
  Downloaded parking_lot_core v0.6.2
  Downloaded rustc_version v0.2.3
  Downloaded crossbeam-epoch v0.8.2
  Downloaded crossbeam-utils v0.7.2
  Downloaded smallvec v0.6.13
  Downloaded either v1.5.3
  Downloaded scopeguard v1.1.0
  Downloaded semver v0.9.0
  Downloaded memoffset v0.5.4
  Downloaded semver-parser v0.7.0  
  .cargo/bin/racer
  
$ rustup component add rust-src
info: downloading component 'rust-src'
  2.2 MiB /   2.2 MiB (100 %) 108.8 KiB/s in 18s ETA:  0s
info: installing component 'rust-src'

# add
export RUST_SRC_PATH=~/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/src
# to ~/.bashrc (or set it in vimrc), then
source ~/.bashrc
$ racer complete std::io::B
# test succeeded
To complete names in external crates, Racer needs Cargo.lock. So, when you add a dependency in your Cargo.toml, you have to run a build command such as cargo build or cargo test, to get completions.
</code></pre>
<ul>
<li>Install vim-racer<br>
<a href="https://github.com/racer-rust/vim-racer">https://github.com/racer-rust/vim-racer</a></li>
</ul>
<pre><code>&quot; add the plugin to vimrc (I use Vundle):
Plugin 'racer-rust/vim-racer'
set hidden
let g:racer_cmd = &quot;/Users/jg/.cargo/bin/racer&quot;
let g:racer_experimental_completer = 1
let g:racer_insert_paren = 1
augroup Racer
    autocmd!
    autocmd FileType rust nmap &lt;buffer&gt; gd         &lt;Plug&gt;(rust-def)
    autocmd FileType rust nmap &lt;buffer&gt; gs         &lt;Plug&gt;(rust-def-split)
    autocmd FileType rust nmap &lt;buffer&gt; gx         &lt;Plug&gt;(rust-def-vertical)
    autocmd FileType rust nmap &lt;buffer&gt; gt         &lt;Plug&gt;(rust-def-tab)
    autocmd FileType rust nmap &lt;buffer&gt; &lt;leader&gt;gd &lt;Plug&gt;(rust-doc)
augroup END
</code></pre>
<p>Use C-x C-o to search for completions; the plugin also provides several &lt;Plug&gt; mappings for source code navigation.<br>
How does this relate to deoplete?</p>
<h3 id="bugs">bugs</h3>
<h5 id="unexpectedkeywordargumentencodingwhenenterinsertmodeonopenedrustfile">unexpected keyword argument 'encoding' when enter insert mode on opened rust file</h5>
<p><a href="https://github.com/Shougo/deoplete.nvim/issues/1071">https://github.com/Shougo/deoplete.nvim/issues/1071</a><br>
<a href="https://github.com/Shougo/deoplete.nvim/pull/1070">https://github.com/Shougo/deoplete.nvim/pull/1070</a><br>
Fix: update all Vundle plugins with<br>
:PluginUpdate<br>
and pip install msgpack.</p>
<h4 id="howtouseracer">how to use racer ?</h4>
<p>macOS dark mode conflicts with the iTerm Solarized color scheme, so use light mode on macOS for now.<br>
Set up a proper color scheme for macOS development later.</p>
<h2 id="pluginrustlangrustvim">Plugin 'rust-lang/rust.vim'</h2>
<p><a href="https://github.com/rust-lang/rust.vim">https://github.com/rust-lang/rust.vim</a><br>
:h vundle<br>
:help rust-syntastic<br>
The installation of Tagbar along with Universal Ctags is recommended for a good Tagbar experience. For other kinds of setups, rust.vim tries to configure Tagbar to some degree.</p>
<h2 id="rlsinsall">rls install</h2>
<h3 id="httpsgithubcomrustlangrlssetup"><a href="https://github.com/rust-lang/rls#setup">https://github.com/rust-lang/rls#setup</a></h3>
<p>rustup component add rls<br>
error: toolchain 'stable-x86_64-apple-darwin' does not contain component 'rls' for target 'x86_64-apple-darwin'; did you mean 'rustc'?<br>
Fixed by reinstalling the toolchain:<br>
rustup toolchain remove stable &amp;&amp; rustup toolchain add stable<br>
rustup component add rls rust-analysis rust-src<br>
<a href="https://rls.booyaa.wtf/">https://rls.booyaa.wtf/</a> (setup screenshots)</p>
<pre><code>Plug 'rust-lang/rust.vim'
Plug 'prabirshrestha/async.vim'
Plug 'prabirshrestha/vim-lsp'
Plug 'prabirshrestha/asyncomplete.vim'
Plug 'prabirshrestha/asyncomplete-lsp.vim'
</code></pre>
<p><a href="https://hoverbear.org/blog/setting-up-a-rust-devenv/">https://hoverbear.org/blog/setting-up-a-rust-devenv/</a><br>
Nice design: good table of contents and font styling.<br>
<a href="https://github.com/Hoverbear/hoverbear.org/">https://github.com/Hoverbear/hoverbear.org/</a><br>
rust-clippy - a linter.<br>
rustfmt - a code formatter.<br>
racer - an autocompletion utility.<br>
rls - a language server (for symbol search etc.)<br>
cargo +nightly install clippy</p>
<pre><code>cargo +nightly clippy
# Excellent. You can see the different lints clippy detects, and configure
# them in clippy.toml according to the options listed in the wiki.
# To disable (or warn instead of deny) various lints, add deny/warn/allow
# flags to your crate attributes:

#![cfg_attr(feature = &quot;cargo-clippy&quot;, deny(empty_enum))]
#![cfg_attr(feature = &quot;cargo-clippy&quot;, warn(indexing_slicing))]
#![cfg_attr(feature = &quot;cargo-clippy&quot;, allow(print_with_newline))]

cargo install rustfmt
# At this point we can run cargo fmt to format a repository. It runs rustfmt
# in 'replace' mode, which creates backup files with a .bk extension. If the
# project is already in version control we may not want this; if so, edit
# rustfmt.toml to include:

write_mode = &quot;overwrite&quot;
# rustfmt.toml lets us configure the various options listed by
# rustfmt --config-help.
</code></pre>
<ul>
<li>rename</li>
<li>hover to inspect the type</li>
<li>go to definition</li>
<li>find symbol</li>
<li>autocomplete</li>
<li>diagnostics</li>
<li>formatting</li>
<li>code action</li>
</ul>
<p>There are three choices of Vim plugins for Language Server Protocol (LSP) clients. Follow the respective link for information about how to register a language server with one of them:</p>
<p>LanguageClient-neovim<br>
vim-lsp<br>
vim-lsc</p>
<h2 id="vimtips">vim tips</h2>
<p>If :echo has(&quot;python3&quot;) returns 1, then you have Python 3 support.<br>
pip3 install --user pynvim<br>
:h vundle</p>
<h2 id="vscodetips">vs code tips</h2>
<p>Cmd+P for quick open<br>
hover, go to symbol, go to definition, peek, format document, code completion</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Nginx/OpenResty with HTTP/3 (QUIC) on OpenSSL and QUICHE]]></title><description><![CDATA[Build Nginx/OpenResty against a modified OpenSSL using Cloudflare quiche (a QUIC implementation) to support HTTP/3]]></description><link>https://blog.approachai.com/ji-yu-opensslhe-quichede-nginx-openresty-http3-quic-gou-jian/</link><guid isPermaLink="false">5ed13ff54cea2827f8cf82d7</guid><category><![CDATA[Network]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Fri, 29 May 2020 17:16:25 GMT</pubDate><media:content url="https://blog.approachai.com/content/images/2020/05/FireShot-Capture-084---qvis_-tools-and-visualizations-for-QUIC-and-HTTP_3---qvis.edm.uhasselt.be.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="abstract">Abstract</h1>
<img src="https://blog.approachai.com/content/images/2020/05/FireShot-Capture-084---qvis_-tools-and-visualizations-for-QUIC-and-HTTP_3---qvis.edm.uhasselt.be.png" alt="Building Nginx/OpenResty with HTTP/3 (QUIC) on OpenSSL and QUICHE"><p>Build OpenResty/Nginx against an OpenSSL that provides BoringSSL's QUIC API for HTTP/3, using Cloudflare's quiche implementation.</p>
<ul>
<li>OpenSSL 1.1.1g with QUIC draft-27 support</li>
<li>Cloudflare quiche master built against OpenSSL</li>
<li>Nginx/OpenResty 1.17.10 built against OpenSSL</li>
</ul>
<h1 id="buildprocess">Build process</h1>
<pre><code>mkdir build
cd build
git clone -b nginx-1.17.10-quic-support --single-branch  https://github.com/inevity/openresty.git
git clone -b openresty-packaging-quic-support --single-branch https://github.com/inevity/openresty-packaging.git
cd openresty
make clean &amp;&amp; make
cd ../openresty-packaging/deb
make zlib-build
sudo apt-get install ./a.deb ./a-dev.deb
make pcre-build
sudo apt-get install ./a.deb ./a-dev.deb
make openssl111-build
sudo apt-get install ./openresty-openssl111-dev_1.1.1g-1~focal1_amd64.deb ./openresty-openssl111_1.1.1g-1~focal1_amd64.deb
make openssl111-debug-build
sudo apt-get install ./openresty-openssl111-debug-dev_1.1.1g-1~focal1_amd64.deb ./openresty-openssl111-debug_1.1.1g-1~focal1_amd64.deb

make openresty-build
sudo apt-get install ./openresty_1.17.10.1rc1-1~focal1_amd64.deb

make openresty-debug-build
sudo apt-get install ./openresty-debug_1.17.10.1rc1-1~focal1_amd64.deb


mkdir ~/build/http3
rsync -av /usr/local/openresty/nginx/conf http3/
rsync -av /usr/local/openresty/nginx/html http3/
rsync -av /usr/local/openresty/nginx/logs http3/
cd ~/build/http3
sudo openresty -p .
sudo openresty -p . -s reload
sudo openresty -p . -s stop
sudo openresty-debug -p .

/curl/bin/curl -vvv --http3  https://approachai.com
tail -f logs/error.log
</code></pre>
<h1 id="devnotes">Dev Notes</h1>
<h2 id="opesnsslhttp3">HTTP/3 support in OpenSSL</h2>
<p>Essentially this implements BoringSSL's QUIC API; see the <a href="https://github.com/akamai/openssl/tree/OpenSSL_1_1_1g-quic">Akamai OpenSSL QUIC Branch</a>.<br>
Discussion: <a href="https://github.com/openssl/openssl/pull/8797">WIP: master QUIC support #8797</a><br>
The last two patches below adapt to BoringSSL's changes to the QUIC transport-layer methods and currently support the latest Cloudflare QUICHE.</p>
<pre><code>ubuntu@easybubuild:~/build/openresty-packaging$ cat deb/openresty-openssl111/debian/patches/series
openssl-1.1.1c-sess_set_get_cb_yield.patch
0001-Add-support-for-BoringSSL-QUIC-APIs.patch
0002-Fix-resumption-secret.patch
0003-QUIC-Handle-EndOfEarlyData-and-MaxEarlyData.patch
0004-QUIC-Increase-HKDF_MAXBUF-to-2048.patch
0005-Fall-through-for-0RTT.patch
0006-Some-cleanup-for-the-main-QUIC-changes.patch
0007-Prevent-KeyUpdate-for-QUIC.patch
0008-Test-KeyUpdate-rejection.patch
0009-Fix-out-of-bounds-read-when-TLS-msg-is-split-up-into.patch
0001-update-quice-method.patch
fupdatesetread.patch
</code></pre>
<h2 id="cloudflarequicheopensll">Changes to Cloudflare QUICHE to compile against OpenSSL</h2>
<p>QUICHE originally built only against BoringSSL; this patch makes building on OpenSSL possible.<br>
Discussion: <a href="https://github.com/cloudflare/quiche/pull/126">WIP tls: add feature to build against OpenSSL #126</a><br>
My changes mainly adapt to recent BoringSSL API changes:<br>
for example, adding early data / 0-RTT and working around SSL_get_peer_signature_algorithm, SSL_get_curve_id, and others.</p>
<pre><code>ubuntu@easybubuild:~/build/openresty$ ls patches/0001-tls-add-feature-to-build-against-OpenSSL.patch
patches/0001-tls-add-feature-to-build-against-OpenSSL.patch
</code></pre>
<h2 id="openrestynginxhttp3">HTTP/3 support in OpenResty/Nginx</h2>
<p>Essentially, an HTTP/3 implementation built on TLS 1.3 and the transport-layer and HTTP/3 APIs that QUICHE provides.<br>
The core patch is Cloudflare's <a href="https://github.com/cloudflare/quiche/tree/master/extras/nginx">BoringSSL-based nginx QUIC patch</a>:<br>
nginx-1.17.10-quiche.patch.<br>
Because QUICHE is built here via pkgconfig, the core patch's modification of nginx's OpenSSL build (<code>auto/lib/openssl/make</code>) was removed; support for a separately built OpenSSL package was added, along with debug options.</p>
<pre><code>ubuntu@easybubuild:~/build/openresty$ ls patches/nginx-1.17.10-quiche* -t
patches/nginx-1.17.10-quiche-remove_opennssl_make_fix.patch  patches/nginx-1.17.10-quiche_openssldebug.patch
patches/nginx-1.17.10-quiche.patch                           patches/nginx-1.17.10-quiche_sshheader.patch
</code></pre>
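Since the build consumes QUICHE through pkg-config rather than nginx's bundled OpenSSL rules, a quick sanity check looks like this (the `PKG_CONFIG_PATH` value is an example location, not something the packages above are guaranteed to use):

```shell
# Check whether a quiche.pc is visible to pkg-config; nginx's configure
# would pick up the --cflags/--libs from here.
export PKG_CONFIG_PATH="/usr/local/openresty/openssl111/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
if command -v pkg-config >/dev/null && pkg-config --exists quiche; then
  pkg-config --cflags --libs quiche
else
  echo "quiche.pc not found"
fi
```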
<h1 id="havedone">Have Done</h1>
<ul>
<li>Debian packages openresty-openssl111 and openresty, plus their debug variants.</li>
</ul>
<h1 id="todo">To DO</h1>
<ul>
<li>QUIC draft 28</li>
<li>RPM package</li>
<li>Alpine image</li>
<li>0-RTT test</li>
<li>Test nginx with the OpenSSL source-dir option, since quiche also needs OpenSSL.</li>
<li>Remove the hard-coded cargo path and OpenSSL lib path from the nginx build.</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Revisiting an iSCSI target hang with the tcmu glusterfs backstore]]></title><description><![CDATA[How an iSCSI target hang was diagnosed and resolved]]></description><link>https://blog.approachai.com/revisting-iscsi-target-hung-issue-with-tcmu-glusterfs-backstore/</link><guid isPermaLink="false">5eccc39e4cea2827f8cf82ae</guid><category><![CDATA[Linux Kernel]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Tue, 26 May 2020 11:12:30 GMT</pubDate><media:content url="https://blog.approachai.com/content/images/2020/05/1_lvbxTEOrbB2zujhgmgNscw.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id>Reflections and lessons on the problem-solving process</h2>
<img src="https://blog.approachai.com/content/images/2020/05/1_lvbxTEOrbB2zujhgmgNscw.png" alt="Revisiting an iSCSI target hang with the tcmu glusterfs backstore"><p>This report was written on 2019-07-10 and revisited on 2020-05-25.<br>
Beyond understanding the code, you also need the ability to form and explore hypotheses in order to locate and fix a bug quickly and accurately. Reflecting on the process below: had I combined code understanding and bug reproduction with a few hypotheses, I might have analyzed and solved this problem much faster.</p>
<ul>
<li>On reproduction: lacking concrete details about the production environment, I had to explore on my own. I came extremely close to the essence of the problem, but then stopped short.</li>
<li>On code understanding: I read and analyzed the code closely, but never stepped out of the details to form a hypothesis from an intuitive or macro perspective that could reframe the problem or suggest new reproduction ideas.</li>
</ul>
<h2 id>The problem</h2>
<p>Three machines: 243, 244, and 245. Under IO, some targets' iSCSI sessions disconnect and the target hangs.</p>
<p>gluster-block creates a glfs backstore and a LUN on top of it; by default this also creates a TPG and a portal (HA=1 here). Additional LUNs are then created under that TPG and portal with targetcli: gluster-block normally creates only one LUN per TPG, so we used targetcli to add the rest.<br>
Each of the three machines forming the gluster volume performs the same steps, effectively producing three targets, i.e., three IQNs.<br>
Each target corresponds to one machine, each target has one TPG, and each TPG carries multiple LUNs.<br>
The initiator (a separate machine) connects to all three targets and reads and writes the LUNs in turn.</p>
<h2 id>Symptoms</h2>
<ul>
<li>
<p>On 244/245 the target's connection to one initiator is stuck in CLOSE-WAIT, while on 243 all I-T connections are in the normal ESTABLISHED state.</p>
</li>
<li>
<p>On 244 the np thread's accept queue is full; 245 was not checked.</p>
</li>
</ul>
<pre><code>[root@xdfs2 ~]# ss -n4lt 'sport = :iscsi-target'
State       Recv-Q Send-Q               Local Address:Port         Peer Address:Port
LISTEN      257    256                  192.168.227.244:3260
</code></pre>
<ul>
<li>244/245 np/trx/ttx thread fd counts<br>
ls /proc/*/fd | wc -l<br>
The data collection was wrong; ignore.</li>
<li>iscsiadm -m discovery<br>
Running iscsiadm -m discovery -t st -p 192.168.227.245/244 -d8 on 244 times out on connect; 243 works.<br>
iscsiadm -m discovery -t st -p 192.168.227.243 lists three target IQNs: 688d0c0a, ada71018 (the one created by gluster-block), and 8132c0d0.</li>
<li>iscsiadm -m session<br>
On 243, iscsiadm -m session reports no active session, so this machine has never initiated, or has no logged-in, session.<br>
On 244, iscsiadm -m session -P 3 shows<br>
three sessions, each with conn state and session state logged in:<br>
one to 245's target 7a734 (sid=1),<br>
one to 244's own target d6726 (sid=2, created by gluster-block),<br>
one to 244's own target 0d7c4b (sid=3).<br>
243: gluster-block created ada71018-8d55-4106-9787-17f7df92c643.<br>
244: gluster-block created d672686d-dda6-4c31-ace6-4db75e9e12d3.</li>
<li>D state: only 243 has none; 244/245 occasionally show glusterfs/glusterfsd in D state, which is normal.<br>
244/245 hang as follows.<br>
Machine 244:</li>
</ul>
<pre><code>  scsit [iscsi_ttx]                 18041 D

  iscsit [iscsi_np]                 21181 D

[root@xdfs2 ~]# cat /proc/21181/stack

[&lt;ffffffffc07c5c48&gt;] iscsit_stop_session+0x1c8/0x1e0 [iscsi_target_mod]

[&lt;ffffffffc07b67d3&gt;] iscsi_check_for_session_reinstatement+0x213/0x280 [iscsi_target_mod]

[&lt;ffffffffc07b95d5&gt;] iscsi_target_check_for_existing_instances+0x35/0x40 [iscsi_target_mod]

[&lt;ffffffffc07b96d9&gt;] iscsi_target_do_login+0xf9/0x640 [iscsi_target_mod]

[&lt;ffffffffc07ba836&gt;] iscsi_target_start_negotiation+0x56/0xc0 [iscsi_target_mod]

[&lt;ffffffffc07b824e&gt;] __iscsi_target_login_thread+0x89e/0x1000 [iscsi_target_mod]

[&lt;ffffffffc07b89d8&gt;] iscsi_target_login_thread+0x28/0x60 [iscsi_target_mod]

[&lt;ffffffffb62c1c31&gt;] kthread+0xd1/0xe0

[&lt;ffffffffb6974c37&gt;] ret_from_fork_nospec_end+0x0/0x39

[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff

[root@xdfs2 ~]# cat /proc/18041/stack

[&lt;ffffffffc07be837&gt;] iscsit_check_session_usage_count+0x87/0x90 [iscsi_target_mod]

[&lt;ffffffffc07c6665&gt;] iscsit_close_session+0x165/0x230 [iscsi_target_mod]

[&lt;ffffffffc07cc522&gt;] lio_tpg_close_session+0x12/0x20 [iscsi_target_mod]

[&lt;ffffffffc0811c8c&gt;] target_release_session+0x2c/0x30

[target_core_mod]

[&lt;ffffffffc0813cec&gt;] target_put_session+0x2c/0x30 [target_core_mod]

[&lt;ffffffffc07c67fe&gt;] iscsit_logout_post_handler+0xce/0x260 [iscsi_target_mod]

[&lt;ffffffffc07c6a66&gt;] iscsit_response_queue+0xd6/0x730 [iscsi_target_mod]

[&lt;ffffffffc07c56df&gt;] iscsi_target_tx_thread+0x1bf/0x240 [iscsi_target_mod]

[&lt;ffffffffb62c1c31&gt;] kthread+0xd1/0xe0

[&lt;ffffffffb6974c37&gt;] ret_from_fork_nospec_end+0x0/0x39

[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
</code></pre>
<p>Machine 245</p>
<pre><code>  iscsit [iscsi_np]                  25620 D

[root@xdfs3 ~]# cat /proc/25620/stack

[&lt;ffffffffc0851c48&gt;] iscsit_stop_session+0x1c8/0x1e0 [iscsi_target_mod]

[&lt;ffffffffc08427d3&gt;] iscsi_check_for_session_reinstatement+0x213/0x280 [iscsi_target_mod]

[&lt;ffffffffc08455d5&gt;] iscsi_target_check_for_existing_instances+0x35/0x40 [iscsi_target_mod]

[&lt;ffffffffc08456d9&gt;] iscsi_target_do_login+0xf9/0x640 [iscsi_target_mod]

[&lt;ffffffffc0846836&gt;] iscsi_target_start_negotiation+0x56/0xc0 [iscsi_target_mod]

[&lt;ffffffffc084424e&gt;] __iscsi_target_login_thread+0x89e/0x1000 [iscsi_target_mod]

[&lt;ffffffffc08449d8&gt;] iscsi_target_login_thread+0x28/0x60 [iscsi_target_mod]

[&lt;ffffffff962c1c31&gt;] kthread+0xd1/0xe0

[&lt;ffffffff96974c37&gt;] ret_from_fork_nospec_end+0x0/0x39

[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
</code></pre>
<ul>
<li>
<p>Process states</p>
<p>gluster-blockd --- active (running)</p>
<p>gluster-block-target --- active (exited)</p>
<p>target --- inactive (dead). Should it be active (running)? Worth watching.</p>
<p>tcmu-runner --- active (running)</p>
</li>
<li>
<p>Kernel-triggered call traces of blocked threads</p>
<p>Consistent with the cat /proc/*/stack output.</p>
</li>
</ul>
<h2 id>Analysis</h2>
<ol>
<li>
<p>Full accept queue</p>
<p>I tried raising it and found the system default is net.core.somaxconn = 128, while the kernel target code sets ISCSIT_TCP_BACKLOG = 256, which seems reasonable. One question: somaxconn is the upper bound, so how could ISCSIT_TCP_BACKLOG exceed it? I lean toward the full accept queue being a consequence of the hang rather than its cause.</p>
</li>
<li>
<p>Searched for commits matching the call traces; found nothing.</p>
</li>
</ol>
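The somaxconn question above has a concrete answer: the kernel's `listen()` silently clamps the requested backlog to `net.core.somaxconn`, so a larger ISCSIT_TCP_BACKLOG is capped. A toy calculation with the values from this analysis (the clamping itself is standard kernel behavior):

```shell
# listen(fd, backlog) effectively uses min(backlog, somaxconn), so the
# target's request of 256 is capped at the old default of 128.
somaxconn=128     # system default net.core.somaxconn
requested=256     # ISCSIT_TCP_BACKLOG in the kernel target code
effective=$(( requested < somaxconn ? requested : somaxconn ))
echo "effective accept backlog: $effective"
```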
<h2 id>Reproduction attempts</h2>
<p>On 244, one trace is in a normal login path and one in a logout path. Analyzing the call traces, the two appear independent and not to affect each other (?). We try to simulate login-triggered session reinstatement: in essence, the initiator re-logs in by itself after a failed login. This is an internal mechanism, not a manual re-login.</p>
<ol>
<li>
<p>To rule out tcmu, test the pscsi backstore type with a disk emulated by scsi_debug:<br>
After a certain number of reads/writes (scsi_debug parameter every_nth = 30), /dev/sda ignores IO (parameter opts = 0x4 means the cmd is dropped without a response), i.e., a permanent timeout. After roughly the device timeout, the initiator's own timeout mechanism fires; after re-checking and confirming the command timeout, it sends an abort (for the timed-out cmd) to the target. The initiator's abort timeout is set to 5s, but the target's /dev/sda takes between 120s and 650s to execute the abort (e.g., scsi_debug parameter abort_sleep = 120), so the abort fails. The initiator then sends a LUN reset, then a target reset, then tries to reset the session and waits for a re-login, performing session recovery in the meantime. If that fails, it takes the device offline.</p>
<p>On the initiator side, set:</p>
<ol>
<li>
<p>echo 5 &gt; /sys/block/sda/device/timeout (default 30s). With the default, initiator IO usually takes around 180s to trigger the timeout, then re-confirmation, abort/reset, and so on, which takes a long time; set it to 5s to reproduce as quickly as possible.</p>
</li>
<li>
<p>node.session.timeo.replacement_timeout = 5 (default 120s)</p>
</li>
<li>
<p>node.conn[0].timeo.login_timeout = 30 (default 15s)</p>
</li>
<li>
<p>node.conn[0].timeo.noop_out_timeout, unchanged, and<br>
node.conn[0].timeo.noop_out_interval = 5, unchanged</p>
</li>
<li>
<p>node.session.err_timeo.abort_timeout = 5 (default 15s)</p>
</li>
</ol>
<p>Items 2-4 are settings in the initiator's /etc/iscsi/iscsid.conf; replacement_timeout is the session-recovery timeout.</p>
<p>abort_timeout is the timeout for the abort command. noop_out_timeout is the timeout for pings sent to the target, repeated at noop_out_interval (5s).</p>
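The timeout interplay in this setup reduces to a simple inequality: the abort succeeds only if the target finishes it before the initiator's abort_timeout expires. A toy check with the values above:

```shell
# With abort_timeout=5 on the initiator and abort_sleep=120 on the target's
# scsi_debug device, the abort must time out, so error recovery escalates:
# abort -> LUN reset -> target reset -> session reset / recovery.
abort_timeout=5          # node.session.err_timeo.abort_timeout (initiator)
backend_abort_sleep=120  # scsi_debug abort_sleep (target)
if [ "$abort_timeout" -lt "$backend_abort_sleep" ]; then
  echo "abort times out; initiator escalates to LUN/target reset"
fi
```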
<p>Reproduction results:</p>
<p>The same np-thread hang as the original FiberHome bug was reproduced, <strong>but</strong> after a certain time np recovered. The duration depends on how long the backend actually hangs, i.e., abort_sleep in this test.<br>
Varying the test conditions never reproduced a permanent hang.<br>
While np was stuck, a command-line initiator login to the target did not complete immediately; pressing Ctrl+C and retrying the login repeatedly eventually filled the accept queue of the target's port-3260 listen socket, with all I-T connections in CLOSE-WAIT.<br>
So a full accept queue is basically reproducible: np is stuck, or for some reason cannot close already-established sockets in time, hence CLOSE-WAIT. In production, however, only one connection was in CLOSE-WAIT, yet the listen socket's accept queue was also full.</p>
</li>
<li>
<p>Single-brick glusterfs I-T environment:<br>
The target is configured by hand, not through gluster-block. The concrete differences are:</p>
<ul>
<li>
<p>/sys/kernel/config/target/core/user_1/dust2lun/attrib/cmd_time_out: gluster-block sets 130s; the manual default is 30s.</p>
</li>
<li>
<p>/sys/kernel/config/target/core/user_1/dust2lun/attrib/max_data_area_mb: gluster-block sets 64MB; the manual default is 1024MB. A 1MB setting was also tested.</p>
</li>
</ul>
<ul>
<li>a. Built the dm_dust kernel module to simulate bad sectors; parameters at manual defaults. No hang appeared, only input/output errors on initiator IO; this test was abandoned. (Worth retrying with continuous looping IO that keeps going even through errors, like the test below.)</li>
<li>b. Create an LVM or DM device as the brick; after the initiator logs in, run dmsetup suspend dmdevice on the brick machine to pause brick IO.
<ul>
<li>With manual defaults, i.e., cmd_time_out=30 and max_data_area_mb=1024MB:<br>
with continuous initiator IO, no D state appeared on the target at all.<br>
Tentative explanation: tcmu's cmd_time_out mechanism makes the target return an error to the initiator, sidestepping the backstore hang. Non-tcmu targets have no such timeout; all timeout handling is left to the initiator.</li>
<li>With cmd_time_out=30 and max_data_area_mb=1MB, testing the full-ring-buffer, i.e., extreme or abnormal, case:<br>
with libaio and direct=0, the target's tcmu ring buffer filled quickly and many cmds queued up. This only hurt performance; there were no IO errors and no hang (in that test environment).<br>
With libaio and direct=1, presumably because the backend storage was not fast enough, the ring buffer never filled and no hang appeared.</li>
<li>cmd_time_out=120，max_data_area_mb=1MB, node.session.timeo.replacement_timeout = 10<br>
为了让在 tcmu 情况下也出现 hung 的情况，分析了 target/i 端的 pscsi/tcmu 下的 debug log 和代码，试图设置 target 端的 set attribute login_timeout=10（目的是触发 target login timeout），以前设置的值是 15s，发现设置没有成功。然后设置 tcmu cmd_time_out=120/130（这个时间长了也会触发 target login timeout，与 pscsi 情况就有可能相同）。同时设置 node.session.timeo.replacement_timeout = 10，增加了原来 5s 的设置，延长 target 时间来恢复。<br>
在 libaio，direct=1 下，每隔 1s watch 进程状态，发现出现了 np 和 trx 的 D 状态。而且持续时间很长。但是可能是 cmd_time_out 设置的时间的问题，基本都是在一定时间比如小于 120s 后就又恢复为 S 状态，然后就又 D 状态一段时间。所以 dmesg 里并不会出现 np 等的 hung call trace，只是有当时 suspend brick 出现的 glusterfs io thread 线程的 call trace。</li>
</ul>
</li>
</ul>
</li>
</ol>
<h2 id>Preliminary conclusions</h2>
<ol>
<li>A hung backend generally also hangs np, but so far only a near-hang has been reproduced, never a complete one, because cmd_time_out keeps requests from never returning.</li>
<li>If the backend hangs, the glusterfs problem itself still needs fixing. But given the cmd_time_out mechanism of point 1, point 2 cannot be the cause either.</li>
<li>As for the target itself, including tcmu, no self-inflicted hang was reproduced, but the bug fixes below may be needed; they can be adopted directly or after verification.<br>
Revisit:<br>
For point 3, I should have thought more divergently, or sought out colleagues and friends with different viewpoints, to find the essential problem. The code had been analyzed thoroughly, yet the direction of the fix stayed unclear; in hindsight it seems almost obvious.<br>
In fact points 1 and 2 already showed that the essence lay in point 3. From the stack traces and the code it was quite possible to conclude that multiple threads (a logout thread and a login thread) were waiting on one resource and finally hanging. Attempts were made in that direction, detailed below. The way the bug was eventually reproduced was to disable the tpg, i.e. iscsit_free_session, during logout; that does the same thing the login thread does, both ultimately calling iscsi_check_for_session_reinstatement. But without a deep grasp of the problem's essence, the reproduction attempts stayed confined to logout/login threads and never hit it directly.</li>
</ol>
<h2 id="bug">Possible bug fixes</h2>
<ol>
<li>lost wakeup<br>
the target itself hangs
<ul>
<li><a href="https://www.spinics.net/lists/target-devel/msg16362.html">https://www.spinics.net/lists/target-devel/msg16362.html</a></li>
<li><a href="https://www.spinics.net/lists/target-devel/msg16463.html">https://www.spinics.net/lists/target-devel/msg16463.html</a><br>
The author says:</li>
</ul>
</li>
</ol>
<pre><code>   Disclaimer: I do not understand why this problem did not show up before tcp prequeue removal.
</code></pre>
<p>Our kernel has not removed the tcp prequeue, so this bug should not be triggered.<br>
Still, sysctl tcp_low_latency=1 (which bypasses the tcp prequeue path) could be tried to verify whether it works around the hang.<br>
2. lost cmd timeout<br>
<a href="https://www.spinics.net/lists/target-devel/msg17276.html">scsi: tcmu: avoid cmd/qfull timers updated whenever a new cmd comes</a><br>
This patch fixes the tcmu kernel module losing the timeout timer of some cmds, which could leave those cmds never returning and hang np.</p>
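<p>A toy model (my own sketch, not the kernel code) of the failure mode that patch addresses: when a single shared timer is re-armed from every newly arriving command, the oldest command's effective deadline keeps sliding and it may never time out.</p>

```python
def oldest_deadline_buggy(arrival_times, timeout):
    # Buggy scheme: the shared timer is re-armed from the NEWEST arrival,
    # so the oldest outstanding command's deadline keeps drifting forward.
    return arrival_times[-1] + timeout

def oldest_deadline_fixed(arrival_times, timeout):
    # Fixed scheme: the timer tracks the OLDEST outstanding command.
    return arrival_times[0] + timeout

arrivals = [0, 5, 9, 14]  # seconds at which cmds were queued
assert oldest_deadline_fixed(arrivals, 30) == 30
assert oldest_deadline_buggy(arrivals, 30) == 44  # deadline drifted by 14s
```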
<h2 id>Concepts and things to note</h2>
<ol>
<li>
<p>max_data_area_mb per lun<br>
tcmu itself defaults to 1024MB, but gluster-block sets it to 64MB.<br>
The control portion of a scsi cmd is basically fixed; what varies is each cmd's data buffer. A 1GB data area goes with roughly 8MB of cmd area.<br>
The target-wide total defaults to 2GB.<br>
For high-IOPS, low-throughput I/O the 64MB gluster-block setting works, but for high-throughput I/O it should be set larger;<br>
otherwise plenty of problems are likely to follow.</p>
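<p>The rule of thumb above (a 1GB data area pairs with about 8MB of cmd area, i.e. roughly 128:1) can be turned into a quick estimate; this helper is my own illustration, not part of tcmu:</p>

```python
def approx_cmd_area_mb(data_area_mb):
    # Estimate the cmd-area size from the data-area size,
    # using the ~128:1 ratio quoted in the text above.
    return data_area_mb / 128

assert approx_cmd_area_mb(1024) == 8.0  # the 1GB / 8MB pairing from the text
assert approx_cmd_area_mb(64) == 0.5    # gluster-block's 64MB data area
```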
</li>
<li>
<p>tcmu cmd_time_out setting</p>
<ul>
<li>a cmd_time_out=0, the previous gluster-block default<br>
Setting cmd_time_out to 0 means that the iscsi_trx will wait infinitely until the cmds fired to tcmu-runner return. In case tcmu-runner is killed/terminated, the iscsi_trx will hang in D state forever. <a href="https://github.com/gluster/gluster-block/pull/92">link</a><br>
Even with cmd_time_out &gt; 0, if some cmds lose their timeout timer, D state can still occur.</li>
<li>b cmd_time_out &gt; 0<br>
With the kernel tcmu cmd timeout mechanism, even if glusterfs hangs, np normally will not hang completely.<br>
Under alua HA, cmd_time_out must satisfy: cmd_time_out &gt; GLUSTER ping timeout (42s default / 5s in our setup) &amp;&amp; cmd_time_out &gt; replacement_timeout.
<ul>
<li>When alua ha &gt; 1, cmd_time_out must be greater than replacement_timeout so that alua paths can recover properly. gluster-block configures cmd_time_out=130s and replacement_timeout=120s. It must also be greater than the glusterfs ping timeout (42s default; 5s is our glusterfs setting).<br>
When the brick hangs, or tcmu-runner hangs or is killed, the target blocks for 130s (D states appear, along with dmesg &quot;blocked for more than 120 seconds&quot; call traces), then tries to recover.<br>
The relationship between cmd_time_out and replacement_timeout is explained below:<br>
<img src="https://blog.approachai.com/content/images/2020/05/image-20190709184448428.png" alt="Revisting iscsi target hung issue with tcmu glusterfs backstore"></li>
<li>When alua ha=1, judging from the initiator-side iscsi kernel code, cmd_time_out and replacement_timeout should be unrelated.<br>
The FiberHome production configuration is cmd_time_out=130s and replacement_timeout=120s (assuming the customer uses the standard initiator), the same as for HA &gt; 1.<br>
With cmd_time_out=30s and replacement_timeout=5s, np does not hang and login succeeds. The target-side flow of tcmu session reinstatement is as follows
<ol>
<li>45:38 recv abort_task</li>
<li>45:43 rx=0</li>
<li>45:45 take action for connect exit<br>
cleanup_wait status, close connection 0 on sid 1816<br>
transport_generic_free_cmd detects cmd_t_aborted for it</li>
<li>45:45 recv login, sid 1816 active, perform session reinstatement.</li>
<li>45:58 tcmu_cmd_timedout:1287: dust2lun cmd timeout has expired</li>
</ol>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<pre><code>            target_core_user:tcmu_check_expired_cmd:1267: Timing out cmd 38 on dev dust2lun that is inflight.
            _transport_wait_for_tasks:3026: wait_for_tasks: Stopped wait_for_completion(&amp;cmd- &gt;t_transport_stop_comp) for ITT: 0x80000009
            ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2147483657 1's
            lio_release_cmd
            iscsit_close_connection
            iscsit_close_session
            session reinstatement completes
            Sending Login Response (success)
</code></pre>
<p>In the flow above, about 13s after step 4 the cmd_time_out timer expires, the cmd is freed along with the original resource references, session reinstatement succeeds, and login succeeds.<br>
In the pscsi test, 15s after step 4 (i.e. when the target login timeout handler fires) the target closes its own socket, session reinstatement fails, and login fails.<br>
With cmd_time_out=130s and replacement_timeout=5s/10s, np hangs, as in the last reproduction condition of the &quot;single-brick glusterfs I-T environment&quot; item in the reproduction section. (replacement_timeout=120s, similar to the FiberHome environment, could also be tested.)</p>
<ol start="3">
<li>
<p>For understanding the iscsi login process, especially the target-side handling logic, this mailing list thread and the corresponding RFC can be consulted (the actual implementation may cover less than the RFC describes)</p>
<ul>
<li><a href="https://www.spinics.net/lists/target-devel/msg14971.html">https://www.spinics.net/lists/target-devel/msg14971.html</a></li>
<li><a href="https://www.spinics.net/lists/target-devel/msg15401.html">https://www.spinics.net/lists/target-devel/msg15401.html</a></li>
<li><a href="https://www.spinics.net/lists/target-devel/msg15400.html">https://www.spinics.net/lists/target-devel/msg15400.html</a></li>
</ul>
</li>
</ol>
<h2 id="todo">Todo:</h2>
<ol>
<li>
<p>Can it be reproduced by setting the initiator-side login timeout even smaller, instead of the target-side login timeout? Worth a try.</p>
</li>
<li>
<p>With HA=1, what exactly is affected? Should cmd_time_out be smaller or larger than replacement_timeout?</p>
</li>
<li>
<p>When reproducing in production, enable initiator-side and target-side debugging</p>
</li>
</ol>
<p>target:</p>
<pre><code>echo -n 'module iscsi_target_mod =pflmt; module target_core_mod  =pflmt; module target_core_user  =pflmt; module uio  =pflmt' &gt; /sys/kernel/debug/dynamic_debug/control
</code></pre>
<p>i:</p>
<pre><code>echo -n 'module libiscsi =pflmt; module libiscsi_tcp =pflmt; module scsi_transport_iscsi =pflmt; module iscsi_tcp =pflmt' &gt; /sys/kernel/debug/dynamic_debug/control

echo 1 &gt; /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp
echo 1 &gt; /sys/module/libiscsi/parameters/debug_libiscsi_conn
echo 1 &gt; /sys/module/libiscsi/parameters/debug_libiscsi_eh
echo 1 &gt; /sys/module/libiscsi/parameters/debug_libiscsi_session
echo 1 &gt; /sys/module/scsi_transport_iscsi/parameters/debug_conn
echo 1 &gt; /sys/module/scsi_transport_iscsi/parameters/debug_session
echo 1 &gt; /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
</code></pre>
<ol start="4">
<li>
<p>When reproducing in production, stop gluster-blockd and gluster-block-target to prevent interference; or leave them running, to confirm whether their interference causes the hang.</p>
</li>
<li>
<p>There is a bug when the initiator uses the standard client: logging out and back in during I/O leaves the discovered devices in an abnormal state, and currently only a machine reboot restores them. Being investigated; bug filed at <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1726051">https://bugzilla.redhat.com/show_bug.cgi?id=1726051</a></p>
</li>
<li>
<p>When reproducing in production, watch the state of the target service and verify that glusterfs is genuinely healthy (reboot the gluster volume, etc.). cat'ing the lun image file does not seem enough to rule out problems. Also, last time the statedump produced no output file at all.</p>
</li>
<li>
<p>Test with parameters configured strictly to match the FiberHome environment.</p>
</li>
<li>
<p>The target kernel warning <code>_iscsit_free_cmd+0x26e/0x290</code> needs to be root-caused</p>
</li>
</ol>
<p>Revisit: although titled Todo, in practice the investigation was not continued; it was given up.</p>
<h2 id="20200525revisit">Revisit on 2020-05-25</h2>
<p>Curious whether the community had solved it by now, I went back through the mailing list from July 2019 and found the fix:<br>
<a href="https://marc.info/?l=linux-scsi&amp;m=158411923026860&amp;w=4">scsi: target: fix hang when multiple threads try to destroy the same iscsi session</a><br>
<a href="https://www.mail-archive.com/kernel-packages@lists.launchpad.net/msg400627.html">LIO hanging in iscsit_free_session and iscsit_stop_session</a><br>
It matches the original stack-trace and code analysis; I just failed to deliberately pivot from that angle and design test cases for it. A long-time target-devel/iscsi maintainer would presumably have had enough experience to judge where the problem lay.<br>
Judging from the community threads, the problem seems to have existed since 2018, but it was finally fixed only on 2020-03-12.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Transparent Proxy]]></title><description><![CDATA[A learning journey through transparent proxying in practice]]></description><link>https://blog.approachai.com/tou-ming-dai-li/</link><guid isPermaLink="false">5eabcc164cea2827f8cf81b1</guid><category><![CDATA[Network]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Fri, 01 May 2020 07:14:22 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id>Environment</h1>
<p>Optical modem ------ router A (ASUS Merlin) -------- router B (OpenWrt)<br>
Originally all internet access went through A's dial-up; B sat idle.<br>
To explore transparent proxying without misconfiguration breaking normal internet access, the plan is to experiment on B and, once things mature, consider migrating to A.<br>
One LAN port of A connects to a LAN port of B; B obtains an IP from A, or is set statically. B's WAN dial-up is disabled for now and its WiFi is enabled, so clients can join either WIFI-A or WIFI-B.</p>
<p>Once B implements transparent proxying (DNS plus proxying), a client machine can browse transparently either by obtaining settings via DHCP (joining WIFI-B), or by manually setting its route/gateway IP to B's IP and its DNS to B's IP. Clients already on WIFI-A can keep using WIFI-A as before, ensuring a smooth transition.</p>
<h1 id>Plan</h1>
<h2 id>Basic prototype</h2>
<p>See <a href="https://blog.approachai.com/tou-ming-dai-li/%E5%8F%82%E8%80%83v%E5%AE%98%E6%96%B9%E6%95%99%E7%A8%8B">Transparent proxy with TPROXY</a>.</p>
<h2 id>Optimization</h2>
<p>Consider performance, DNS split-routing, DoT, IPv6, and nftables.</p>
<h1 id>Practice</h1>
<h2 id>Basic prototype</h2>
<h3 id="b">Preparing B</h3>
<p>B itself must have internet access before it can serve as a transparent proxy.</p>
<ul>
<li>B's LAN1 connects to A's LAN1</li>
<li>In /etc/config/network, configure the lan interface in multi-IP mode; the gateway IP is A's IP.</li>
</ul>
<pre><code>list ipaddr '192.168.1.16/24'
list ipaddr '192.168.2.10/24'
option gateway '192.168.1.1'
</code></pre>
<ul>
<li>In /etc/config/network, disable WAN; do not use DHCP, otherwise B would pick up settings from A and pollute its routing.</li>
<li>Configure dnsmasq and add a DNS server</li>
</ul>
<pre><code>uci add_list dhcp.@dnsmasq[0].server=&quot;192.168.1.1&quot;
uci commit dhcp
service dnsmasq restart
</code></pre>
<p>This can of course also be done via the web UI or by editing /etc/config/dhcp directly. The two steps above ensure that B itself can ping www.baidu.com.</p>
<ul>
<li>cat /proc/sys/net/ipv4/ip_forward    = 1</li>
<li>Manually configure a client to point its default gateway at B's IP, i.e. 192.168.1.16. It should then be able to browse normally.</li>
</ul>
<h3 id="v">Installing and configuring the v client</h3>
<ul>
<li>Choosing which v build to download: first determine the CPU model and whether it is big- or little-endian</li>
</ul>
<pre><code>root@LEDE:~# cat /proc/cpuinfo
system type		: Broadcom BCM4716
machine			: Asus RT-N16
processor		: 0
cpu model		: MIPS 74Kc V4.0
BogoMIPS		: 239.61
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 64
extra interrupt vector	: yes
hardware watchpoint	: yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa			: mips1 mips2 mips32r1 mips32r2
ASEs implemented	: mips16 dsp dsp2
Options implemented	: tlb 4kex 4k_cache prefetch mcheck ejtag llsc dc_aliases vint perf_cntr_intr_bit nan_legacy nan_2008 perf
shadow register sets	: 1
kscratch registers	: 0
package			: 0
core			: 0
VCED exceptions		: not available
VCEI exceptions		: not available
root@LEDE:~# hexdump -s 5 -n 1 -C /bin/busybox
00000005  01                                                |.|
00000006
root@LEDE:~# echo -n I | hexdump -o | awk '{ print substr($2,6,1); exit}'
1 //1 means little-endian
</code></pre>
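<p>Where a Python interpreter happens to be available, the same endianness question can be cross-checked without the hexdump trick; a minimal sketch (my own, not from the original post):</p>

```python
import sys

# sys.byteorder reports the platform's native byte order,
# matching what the hexdump of /bin/busybox infers above.
print(sys.byteorder)  # 'little' on a MIPSle router like this one
```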
<ul>
<li>v installation<br>
Unpacked, v is about 60MB. Given B's free space it could fit in /tmp after deleting duplicated files from the archive, and some people build it themselves and compress it with upx, but to get there quickly a USB drive was used this time</li>
</ul>
<pre><code>root@LEDE:~# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/root                 2.5M      2.5M         0 100% /rom
tmpfs                    61.1M      1.1M     60.0M   2% /tmp
/dev/mtdblock5           27.8M    996.0K     26.8M   4% /overlay
overlayfs:/overlay       27.8M    996.0K     26.8M   4% /
tmpfs                   512.0K         0    512.0K   0% /dev

opkg update &amp;&amp; opkg install block-mount e2fsprogs kmod-fs-ext4 kmod-usb-storage kmod-usb2 kmod-usb3
block detect | uci import fstab
uci set fstab.@mount[0].enabled='1' &amp;&amp; uci set fstab.@global[0].anon_mount='1' &amp;&amp; uci commit
/sbin/block mount &amp;&amp; service fstab enable
#copy to B 
scp ~/Downloads/v-linux-mipsle.zip root@192.168.1.16:/mnt/sdav/v
</code></pre>
<ul>
<li>run v</li>
</ul>
<pre><code>  root@LEDE:~# opkg install curl
  # config from system proxy, or see the v client config in the transparent-proxy TPROXY reference 
  root@LEDE:/mnt/sda/v# ./v -c config.jsonsystemproxy  
  root@LEDE:~# curl -so /dev/null -w &quot;%{http_code}&quot; xxxx.com -x socks5://127.0.0.1:1080
  # a 301/200 response is good enough. 
</code></pre>
<ul>
<li>config v for tproxy</li>
</ul>
<pre><code>  root@LEDE:/mnt/sda/v# opkg install iptables-mod-tproxy ipset
  # maybe we will use ipset later
 
  root@OpenWrt:/mnt/sda/v# cat mytest3.sh
  #!/bin/sh
  
  ip rule add fwmark 1 table 100 # add ip rule ,once a packet match the fwmark = 1,then lookup the route table 100.
  ip route add local 0.0.0.0/0 dev lo src 127.0.0.1 table 100 # add local type's route entry to route table  100
  
  iptables -t mangle -N ONLYONE
  iptables -t mangle -A ONLYONE -j MARK --set-mark 1
  iptables -t mangle -A ONLYONE -j ACCEPT
  iptables -t mangle -A PREROUTING -m socket -j ONLYONE # prevent packets of existing connections from entering tproxy twice
  
  iptables -t mangle -N V
  iptables -t mangle -A V -d 127.0.0.1/32 -j RETURN
  iptables -t mangle -A V -d 224.0.0.0/4 -j RETURN
  iptables -t mangle -A V -d 255.255.255.255/32 -j RETURN
  iptables -t mangle -A V -d 192.168.0.0/16 -p tcp -j RETURN
  iptables -t mangle -A V -d 192.168.0.0/16 -p udp ! --dport 53 -j RETURN # let port 53 udp packet enter into the v app
  iptables -t mangle -A V -p udp -j TPROXY --on-port 13345 --on-ip 127.0.0.1 --tproxy-mark 0x1/0x1 # use tproxy to forward the packet to local port 13345,meantime mark 1
  iptables -t mangle -A V -p tcp -j TPROXY --on-port 13345 --on-ip 127.0.0.1 --tproxy-mark 0x1/0x1
  iptables -t mangle -A PREROUTING -j V # let other new packets be processed by the V chain
  
  iptables -t mangle -N V_MASK
  iptables -t mangle -A V_MASK -d 224.0.0.0/4 -j RETURN
  iptables -t mangle -A V_MASK -d 255.255.255.255/32 -j RETURN
  iptables -t mangle -A V_MASK -d 192.168.0.0/16 -p tcp -j RETURN
  iptables -t mangle -A V_MASK -d 192.168.0.0/16 -p udp ! --dport 53 -j RETURN
  iptables -t mangle -A V_MASK -j RETURN -m mark --mark 0xff # let packets marked 0xff (255) go out directly
  iptables -t mangle -A V_MASK -p udp -j MARK --set-mark 1
  iptables -t mangle -A V_MASK -p tcp -j MARK --set-mark 1
  iptables -t mangle -A OUTPUT -j V_MASK
</code></pre>
<p>Verify by running the script. Once verified, it can be persisted in /etc/firewall.user. The v client configuration is, for now, the one from the reference link.</p>
<ul>
<li>Following configurations found online, set v up as a service so it starts automatically at boot.</li>
</ul>
<h3 id>Caveats</h3>
<ul>
<li>B openwrt failsafe mode<br>
It is best to research up front how to enter failsafe mode, and to avoid bricking the router while experimenting.<br>
I put iptables rules into <code>/etc/firewall.user</code> and then could no longer reach the router at all; in the end the only way out was ASUS recovery mode and reflashing the OpenWrt firmware. Even after much fiddling I never managed to enter OpenWrt's failsafe mode.</li>
</ul>
<h2 id>Optimization</h2>
<h3 id="dnsovertls">DNS over TLS</h3>
<p>DoH, riding on browser adoption, is easy to deploy and spreading, but reportedly may leak the user agent. For a combination of reasons, DoT was chosen in the end.</p>
<h4 id="amerlindot">Configuring DoT on router A (Merlin)</h4>
<p>See <a href="https://github.com/RMerl/asuswrt-merlin.ng/wiki/DNS-Privacy">DNS privacy </a><br>
On Android and PC, set the DNS IP to A's IP, 192.168.1.1.<br>
Open https://www.cloudflare.com/ssl/encrypted-sni/ to test.<br>
Routing (gateway) and DNS are configured separately here,<br>
so although B's transparent proxy currently also hijacks DNS traffic on port 53, we can still use A as the DNS server; it just has to be set manually.</p>
<h4 id="dotesni">Dot+ESNI</h4>
<p>ESNI is a concept on the client side of a web server: a TLSv1.3 extension that encrypts the name of the server we want to reach.</p>
<ol>
<li>Support:
<ul>
<li>firefox: supported since version 73.</li>
<li>chrome: not supported; see <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=908132">Support for Encrypted SNI (ESNI)</a></li>
</ul>
</li>
<li>Testing and verification:
<ul>
<li>firefox<br>
Overall the results were disappointing.<br>
The Mac used DoT for DNS, i.e. its DNS IP was set to router A's IP 192.168.1.1 (with DoH disabled on the router). With network.trr.mode set to 0/2/3/5 in about:config, ESNI was occasionally observed working, but it failed most of the time.<br>
Verified with wireshark captures, the <a href="https://www.cloudflare.com/ssl/encrypted-sni/">cloudflare check</a>, and the <a href="https://www.cloudflare.com/cdn-cgi/trace">cloudflare trace api</a>.<br>
Reasons for the poor results: 1. enabling ESNI requires enabling DoH, i.e. DoH is a dependency, see this <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1500289">bug</a>; 2. other interference, for example the authoritative DNS server being out of sync with the TLS server, causes the same failure.<br>
In short, ESNI cannot be enabled painlessly, especially with DoT on. Shelved for now.</li>
</ul>
</li>
</ol>
<h4 id="dotoverv">Dot over V</h4>
<h3 id>Performance</h3>
<h3 id="dns">DNS split-routing</h3>
<h3 id="ipv6">IPv6</h3>
<h3 id="ipv6nftable">IPv6 Nftable</h3>
<h1 id>Concepts and open questions</h1>
<ul>
<li>router<br>
Any machine which will accept and forward packets between two networks is a router. Every router is at least dual-homed; one interface connects to one network, and a second interface connects to another network.</li>
<li>route -n == netstat -rn</li>
<li>ip rule</li>
</ul>
<pre><code>root@OpenWrt:/mnt/sda/v# ip rule
0:	from all lookup local
32765:	from all fwmark 0x1 lookup 100
32766:	from all lookup main
32767:	from all lookup default
</code></pre>
<p>ip rule lists the routing policy rules; 32765 is the priority. By default the system has the main, local, and default routing tables. The part before lookup is the selector and lookup is the action: once a packet matches a rule, the named routing table is consulted. Routing tables can have names or plain numbers like table 100.</p>
<ul>
<li>The route selection algorithm under linux<br>
Routing is either destination-based or policy-based.<br>
Advanced routing is also called policy routing.<br>
The routing policy database (rpdb) is iterated in priority order, and the longest prefix match selection algorithm tries to find a matching route; if one is found the packet is forwarded accordingly, otherwise the next rule is tried.<br>
The main table usually contains the default route, which all remaining IPs follow.</li>
<li>When is the reroute check done?<br>
grep &quot;Reroute for ANY change&quot; net/ipv4/netfilter/iptable_mangle.c.<br>
Basically &quot;any&quot; means saddr, daddr, mark and tos, the four routing-influencing parameters.<br>
In the OUTPUT chain rules configured above, changing any of these causes the packet to be rerouted off its normal path.<br>
Of course, the reroute check is not necessarily performed only in the OUTPUT chain.</li>
<li>Why add src 127.0.0.1 in the policy route?<br>
ip route add local 0.0.0.0/0 dev lo src 127.0.0.1 table 100 #<br>
Make it explicit that the source IP used for this network when connecting locally should be in 127.0.0.0/8 range. This is needed since otherwise the TPROXY rule would match both forward and backward traffic. We want it to catch forward traffic only.<br>
ref <a href="https://blog.cloudflare.com/how-we-built-spectrum/">cloudflare blog</a><br>
Not fully understood yet, but it works fine in practice.</li>
<li>Why add the socket-match rule?<br>
Explained in <a href="https://www.kernel.org/doc/Documentation/networking/tproxy.txt">tproxy</a>.<br>
Mainly to stop packets of already-established connections, TCP or UDP, from needlessly entering tproxy again; they are only marked 1 so that policy routing routes them via the mark=1 rule.<br>
Presumably a performance gain; benchmark it if interested.</li>
<li>How does dnsmasq interact with dot?</li>
<li>openwrt fw3 and nftable cheatsheet</li>
<li>Why does this fail?<br>
A problem hit along the way; it was solved in the end, but the cause is still unclear.<br>
To introduce the socket-match rule, the approach below was used.<br>
After step two, DNS was unusable, though external IPs could still be pinged. Step three was then run with its last rule omitted; running that final rule afterwards made the server unreachable.<br>
Could it be a matter of iptables rule ordering?</li>
</ul>
<pre><code># Key points:
# The ONLYONE chain was added to stop subsequent packets belonging to an existing connection from entering tproxy twice.
# In step three, iptables -t mangle -A BYPASS_MASK -d 127.0.0.1/32 -j RETURN was added;
# the original reference lacks this line, and it is the only difference between the direct-connection rules of steps two and three.
# The v client config is the one from the official tproxy tutorial; inbounds does not expose port 53, and the OS has its own dns application.

# Steps
## 1. Policy routing
ip rule add fwmark 1 table 100
ip route add local 0.0.0.0/0 dev lo table 100

## 2. Packets arriving from outside at the transparent gateway

iptables -t mangle -N BYPASS
iptables -t mangle -A BYPASS -d 127.0.0.1/32 -j RETURN
iptables -t mangle -A BYPASS -d 224.0.0.0/4 -j RETURN
iptables -t mangle -A BYPASS -d 255.255.255.255/32 -j RETURN

iptables -t mangle -A BYPASS -d 192.168.0.0/16 -p tcp -j RETURN
iptables -t mangle -A BYPASS -d 192.168.0.0/16 -p udp ! --dport 53 -j RETURN


iptables -t mangle -N ONLYONE
iptables -t mangle -A ONLYONE -j MARK --set-mark 1
iptables -t mangle -A ONLYONE -j ACCEPT

iptables -t mangle -N V
iptables -t mangle -A V -p udp -j TPROXY --on-port 12345 --tproxy-mark 1
# forward via tproxy to 127.0.0.1 13345, i.e. let v handle it
iptables -t mangle -A V -p tcp -j TPROXY --on-port 13345 --tproxy-mark 1


iptables -t mangle -A PREROUTING -j BYPASS  # BYPASS some private/intranet addresses

**iptables -t mangle -A PREROUTING -m socket -j ONLYONE**
# apply the ONLYONE chain in the mangle PREROUTING chain; prevents later packets of an existing connection from entering tproxy twice:
# they go straight through ONLYONE, i.e. are routed directly to loopback without tproxy, so the rule below no longer runs for them

iptables -t mangle -A PREROUTING -j V 
# apply the V chain in the mangle PREROUTING chain: mark UDP/TCP with 1, tproxy-forward to v's port 13345, and let route selection begin


## 3. Packets generated by the transparent gateway itself: packets v needs to send out, plus other locally generated packets

iptables -t mangle -N BYPASS_MASK
**iptables -t mangle -A BYPASS_MASK -d 127.0.0.1/32 -j RETURN**
iptables -t mangle -A BYPASS_MASK -d 224.0.0.0/4 -j RETURN
iptables -t mangle -A BYPASS_MASK -d 255.255.255.255/32 -j RETURN

iptables -t mangle -A BYPASS_MASK -d 192.168.0.0/16 -p tcp -j RETURN
iptables -t mangle -A BYPASS_MASK -d 192.168.0.0/16 -p udp ! --dport 53 -j RETURN

iptables -t mangle -N V_MASK
iptables -t mangle -A V_MASK -j RETURN -m mark --mark 0xff # let marked packets generated by v go out directly
iptables -t mangle -A V_MASK -p udp -j MARK --set-mark 1 # which app's traffic might this be? reroute it to loopback
iptables -t mangle -A V_MASK -p tcp -j MARK --set-mark 1 # is this the system dns traffic? reroute it to loopback

iptables -t mangle -A OUTPUT -j BYPASS_MASK # apply the bypass rules
**iptables -t mangle -A OUTPUT -j V_MASK** 
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Introduction to Linux interfaces for virtual networking - Red Hat Developer]]></title><description><![CDATA[Website, blog post, and paper recommendations or reviews]]></description><link>https://blog.approachai.com/introduction-to-linux-interfaces-for-virtual-networking-red-hat-developer/</link><guid isPermaLink="false">5ea3f18c4cea2827f8cf7a02</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Sat, 25 Apr 2020 08:15:08 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>A brief introduction to all commonly used virtual network interface types</p>
<p><a href="https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking/">https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking/</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[curl build with quic]]></title><description><![CDATA[Spent four hours on one problem, and only found a workaround. Chasing cleanliness and perfection really does get in the way.]]></description><link>https://blog.approachai.com/curl-build-with-quic/</link><guid isPermaLink="false">5e9fd0df4cea2827f8cf7096</guid><category><![CDATA[Algorithms]]></category><dc:creator><![CDATA[Bao Jianguo]]></dc:creator><pubDate>Wed, 22 Apr 2020 05:20:49 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="url">url</h1>
<p><a href="https://raw.githubusercontent.com/cloudflare/homebrew-cloudflare/master/curl.rb">https://raw.githubusercontent.com/cloudflare/homebrew-cloudflare/master/curl.rb</a><br>
<a href="https://github.com/curl/curl/blob/master/docs/HTTP3.md">https://github.com/curl/curl/blob/master/docs/HTTP3.md</a></p>
<h1 id="issue">issue</h1>
<ul>
<li>
<p>'openssl/opensslconf.h' file not found</p>
<pre><code>./configure --disable-dependency-tracking --disable-silent-rules --prefix=/code/quiche/curl --with-secure-transport --without-ca-bundle  --without-ca-path --with-ssl=/code/quiche/curl-7.69.1/quiche/deps/boringssl/.openssl --with-quiche=quiche/target/release --enable-alt-svc
make
</code></pre>
<p>The error <code>md4.c:31:10: fatal error: 'openssl/opensslconf.h' file not found</code> appears.</p>
<p>Even though --with-ssl=boringssl is already specified, this error still occurs.<br>
See this <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=711670">chromium bug</a>.<br>
Adding include &lt;openssl/base.h&gt; to md4.c did not help either.<br>
In the end the system brew openssl headers had to be used as a temporary workaround:<br>
before configure, <code>export CPPFLAGS=-I/usr/local/opt/openssl/include</code>.</p>
</li>
<li>
<p>ln issue<br>
On macOS, ln -s on a directory seems problematic: files under the linked directory are not visible through the symlink. Worked around for now.</p>
</li>
<li>
<p>Sometimes even a copy makes running configure fail. Watch the text format.</p>
</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Two Sum in Rust]]></title><description><![CDATA[Brute force first, analyze complexity, think about optimizations, get familiar with the language's patterns]]></description><link>https://blog.approachai.com/two-sum/</link><guid isPermaLink="false">5e9550f64cea2827f8cf6afd</guid><category><![CDATA[Algorithms]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Tue, 14 Apr 2020 06:10:07 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="problem">problem</h1>
<p><a href="https://leetcode.com/problems/two-sum/description/">two sum</a></p>
<h1 id>Approach</h1>
<p>Start with the most direct brute-force method, then analyze its time and space complexity, and from the algorithm's operations look for possible optimizations or transformations.<br>
Since we are looking up a number and returning its index, a hashmap comes to mind.<br>
Note that Rust's for differs from the usual usage;<br>
for defaults to usize.<br>
Methods A/B/C/D/E are given as variations. For B, converting the Option match into map_or_else() was considered, but it got complicated and was abandoned.</p>
<h1 id>Solution</h1>
<pre><code>/*
impl Solution {
    pub fn two_sum(nums: Vec&lt;i32&gt;, target: i32) -&gt; Vec&lt;i32&gt; {
        let n = nums.len();
        for i in 0..n {
            for j in i+1..n {
                if nums[i] + nums[j] == target {
                    return vec![i as i32, j as i32];
                }
            } 
        }
        // return vec![];
        // vec![]
        unreachable!()
    }
}
*/
// complexity analysis:
// time: action by action, O(n^2); space: O(1)
// improving the last action: trade time for space; in essence, to find an index, use a hashmap(value, index)
use std::collections::HashMap;
use std::convert::TryFrom;
impl Solution {
    pub fn two_sum(nums: Vec&lt;i32&gt;, target: i32) -&gt; Vec&lt;i32&gt; {
       // A
        let mut comps: HashMap&lt;i32, i32&gt; = HashMap::new();
        for i in 0..nums.len() {
            // match to map
            match comps.get(&amp;nums[i]) {
                Some(&amp;x) =&gt; return vec![x, i as i32],
                None =&gt; comps.insert(target - nums[i], i as i32)
            };
        }
      // B
      //  let mut comps: HashMap&lt;i32, i32&gt; = HashMap::new();
      //  for i in 0..nums.len() {

      //      let mut result = comps.get(&amp;nums[i]).map_or_else(|| { comps.insert(target -nums[i], i as i32); vec![]}, 
      //                                      |x| vec![x, i32::try_from(i).as_ref().unwrap()]);
      //      if result.is_empty() &amp;&amp; (i &lt; nums.len() - 1) {
      //          continue;
      //      } else {
      //          result
      //      };                             

      //  }
      // C
       // one improve: let mut index_hashmap = HashMap::with_capacity(nums.len());
      // D
       // let mut seen = HashMap::new();// no need specify type later will
       // for (i, num) in nums.iter().enumerate() { //use tuple,enumerate return tuple iter
       //     if seen.contains_key(num) {
       //         return vec![seen[num] as i32, i as i32];
       //     } else {
       //         seen.insert(target - num, i); // can use two push 
       //     }
       // }
       // here we can also use if let Some(&amp;k) = seen.get(&amp;(target -num)) 
       //                                 return vec![k as i32, i as i32];
       //                        } else {
       //                            seen.insert(target - num, i);
       //                        }
       // E

//        let mut dict: HashMap&lt;&amp;i32, usize&gt;  = HashMap::new();
//        let mut res: Vec&lt;i32&gt; = vec![0; 2];
//        for (i, item) in nums.iter().enumerate(){
//            if dict.contains_key(&amp;(target-item)) {
//                res[0] = dict[&amp;(target-item)] as i32;
//                res[1] = i as i32;
//                return res;
//            }
//            else {
//                dict.insert(item, i);
//            }
//        } 


        vec![]
    }
}
</code></pre>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Signs You Aren't Really Building a Minimum Viable Product - Anthony Panozzo's Blog]]></title><description><![CDATA[网站、博客文章、论文推荐或评论]]></description><link>https://blog.approachai.com/signs-you-arent-really-building-a-minimum-viable-product-anthony-panozzos-blog/</link><guid isPermaLink="false">5e9480244cea2827f8cf6af7</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Mon, 13 Apr 2020 15:07:16 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>MVP, despite the name, is not about creating minimal products.</p>
<p><a href="http://www.panozzaj.com/blog/2012/01/11/signs-you-arent-really-building-a-minimum-viable-product/">http://www.panozzaj.com/blog/2012/01/11/signs-you-arent-really-building-a-minimum-viable-product/</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Rewriting the heart of our sync engine - Dropbox]]></title><description><![CDATA[Website, blog post, and paper recommendations or reviews]]></description><link>https://blog.approachai.com/rewriting-the-heart-of-our-sync-engine-dropbox/</link><guid isPermaLink="false">5e8b44cd4cea2827f8cf5dea</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Mon, 06 Apr 2020 15:03:41 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><ol>
<li>New system emphasized testing, correctness, and debuggability, encoding all of these principles in the data model; 2. redesigned the client-server protocol to have strong consistency; 3. used Rust.</li>
</ol>
<p><a href="https://dropbox.tech/infrastructure/rewriting-the-heart-of-our-sync-engine">https://dropbox.tech/infrastructure/rewriting-the-heart-of-our-sync-engine</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Timeouts, retries and backoff with jitter]]></title><description><![CDATA[Website, blog post, and paper recommendations or reviews]]></description><link>https://blog.approachai.com/timeouts-retries-and-backoff-with-jitter/</link><guid isPermaLink="false">5e856bc64cea2827f8cf59f6</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Thu, 02 Apr 2020 04:36:22 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Timeouts keep systems from hanging unreasonably long, retries can mask those failures, and backoff and jitter can improve utilization and reduce congestion on systems.</p>
<p><a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/">https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/</a></p>
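<p>As a companion to the article's idea, here is a minimal sketch of capped exponential backoff with full jitter (my own illustration, not code from the article):</p>

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    # Exponential backoff capped at `cap`, with full jitter:
    # the actual sleep is drawn uniformly from [0, capped backoff],
    # which spreads retries from many clients over time.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(n) for n in range(8)]
assert all(0.0 <= d <= 10.0 for d in delays)
```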
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Millions of Tiny Databases]]></title><description><![CDATA[Flashcards for the paper: Millions of Tiny Databases]]></description><link>https://blog.approachai.com/millions-of-tiny-databases/</link><guid isPermaLink="false">5e82c3464cea2827f8cf546d</guid><category><![CDATA[Distributed System]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Wed, 01 Apr 2020 10:15:17 GMT</pubDate><media:content url="https://blog.approachai.com/content/images/2020/04/download.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="whatproductiontobuild">what production to build</h2>
<img src="https://blog.approachai.com/content/images/2020/04/download.jpeg" alt="Millions of Tiny Databases"><p><strong>the configuration store</strong> for a high-performance cloud block storage system (Amazon EBS).</p>
<h2 id="requirement">requirement</h2>
<p>This database needs to be <strong>not only</strong> highly available, durable, and scalable <strong>but also</strong> strongly consistent. We quickly realized that the constraints on availability imposed by the CAP theorem, and the realities of operating distributed systems, meant that we didn’t want one database. We wanted <strong>millions</strong></p>
<h2 id="whatisphysalia">what is physalia</h2>
<p>Physalia is a <strong>transactional key-value store</strong>, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of <strong>transaction patterns</strong> and <strong>infrastructure design</strong> to offer both high availability and strong consistency to millions of clients. Physalia uses its knowledge of datacenter topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client.</p>
<h2 id="physaliausecases">physalia use cases</h2>
<p>distributed systems problems like control planes, configuration management, and service discovery.</p>
<h2 id="largescalecloudcontrolplanes">large-scale cloud control planes</h2>
<h2 id="transactionpatterns">transaction patterns</h2>
<h2 id="infrastructuredesign">infrastructure design</h2>
<h2 id="defectoftraditionalarchitecturesforhighlyavailablesystemsmostmodernsystemdesigns">defects of traditional architectures for highly-available systems (most modern system designs)</h2>
<p>assumption: infrastructure failures are statistically independent, and that it is extremely unlikely for a large number of servers to fail at the same time.<br>
Most modern system designs are aware of broad failure domains (data centers or availability zones), but still assume two modes of failure: a complete failure of a datacenter, or a <strong>random uncorrelated</strong> failure of a server, disk or other infrastructure. These assumptions are reasonable for most kinds of systems.</p>
<h2 id="intraditionaldatacenterenvironmentsprobabilityofseconddiskfailure">in traditional datacenter environments, probability of second disk failure</h2>
<p>while the probability of a second disk failure in a week was up to 9x higher when a first failure had already occurred, this correlation drops off to less than 1.5x as systems age. While a 9x higher failure rate within the following week indicates some correlation, it is still <strong>very rare for two disks to fail at the same time</strong>. This is just as well, because systems like RAID [43] and primary-backup failover perform well when failures are independent, but poorly when failures occur in bursts.</p>
<h2 id="howmeasuredtheavailabilityofsystem">how is the availability of a system measured</h2>
<p>a simple percentage of the time that the system is available (such as 99.95%), and set Service Level Agreements (SLAs) and internal goals around this percentage.</p>
<h2 id="awsec2availabilityzones">AWS EC2 Availability Zones</h2>
<p>named units of capacity with <strong>clear expectations and SLAs around correlated failure,</strong> corresponding to the datacenters that customers were already familiar with.</p>
<h2 id="awsthinkinghistoryonfailureandavailability">aws thinking history on failure and availability</h2>
<p>2006: measure availability<br>
2008: availability zones (correlated failure)<br>
now: blast radius and correlation of failure<br>
Not only do we work to make outages rare and short, we work to reduce the number of resources and customers that they affect [55], an approach we call blast radius reduction. This philosophy is reflected in everything from the size of our datacenters [30], to the design of our services, to operational practices.</p>
<h2 id="howawsminimizestheblastradiusoffailures">How aws minimizes the blast radius of failures</h2>
<p><a href="https://www.youtube.com/watch?v=swQbA4zub20">https://www.youtube.com/watch?v=swQbA4zub20</a></p>
<h2 id="sizingofdatabase">sizing of database</h2>
<p><a href="https://www.youtube.com/watch?v=AyOAjFNPAbA">https://www.youtube.com/watch?v=AyOAjFNPAbA</a></p>
<h2 id="ebs">ebs</h2>
<p>Amazon Elastic Block Storage (EBS) is a block storage service for use with AWS EC2, allowing customers to create block devices on demand and attach them to their AWS EC2 instances.</p>
<h2 id="volumefailuredefine">volume failure definition</h2>
<p>volumes are designed for an annual failure rate (AFR) of between 0.1% and 0.2%, where failure refers to a complete or partial loss of the volume. This is significantly lower than the AFR of typical disk drives [44].</p>
<h2 id="howebsacheivehighervolumeavalibility">how EBS achieves higher volume durability</h2>
<p>EBS achieves this higher durability through replication, implementing a chain replication scheme (similar to the one described by van Renesse, et al [54]).</p>
<p><a href="https://dl.acm.org/doi/10.5555/1251254.1251261">https://dl.acm.org/doi/10.5555/1251254.1251261</a><br>
<a href="https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/renesse/renesse.pdf">https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/renesse/renesse.pdf</a><br>
Chain Replication for Supporting High Throughput and Availability<br>
replication group??</p>
<h2 id="ebsarch">ebs arch</h2>
<p><img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-16.19.00.png" alt="Millions of Tiny Databases"><br>
In normal operation (of this simplified model), replicated data flows through the chain from client, to primary, to replica, with no need for coordination. When failures occur, such as the failure of the primary server, this scheme requires <strong>the services of a configuration master</strong>, which ensures that updates to <strong>the order and membership of the replication group</strong> occur atomically, are well ordered, and follow the rules needed to ensure durability. To fail gracefully and partially, and strongly avoid large-scale failures.</p>
<h2 id="ebssearch">ebs search</h2>
<p>Simworld aws</p>
<h2 id="thedeatailedrequirementsonconfigurationmasterphysalia">the detailed requirements on the configuration master Physalia</h2>
<p>In normal operation it handles little traffic, as replication continues to operate with no need to contact the configuration master. However, <strong>when large-scale failures</strong> (such as power failures or network partitions) happen, a large number of servers can go offline at once, requiring the master to do a burst of work. This work is latency critical, because volume IO is blocked until it is complete. It requires <strong>strong consistency</strong>, because any eventual consistency would make the replication protocol incorrect. It is also most critical at the most challenging time: during large-scale failures.<br>
The volume replication protocol lives in the EBS storage servers, not in Physalia.<br>
Physalia is a specialized database designed to play this role in EBS, and other similar systems at Amazon Web Services. Physalia offers both consistency and high availability, even in the presence of network partitions, as well as minimized blast radius of failures. It aims to fail gracefully and partially, and strongly avoid large-scale failures.</p>
<h2 id="theebscontrolplaneandon21april2011outageandblastradius">the EBS control plane and On 21 April 2011 outage and blast radius</h2>
<p><strong>Replication configuration was stored in the EBS control plane at that time.</strong><br>
When data for a volume needs to be <strong>re-mirrored</strong>, a negotiation must take place between the AWS EC2 <strong>instance</strong>, the <strong>EBS nodes</strong> with the volume data, and the <strong>EBS control plane</strong> (which acts as an authority in this process) so that only one copy of the data is designated as the primary replica and recognized by the AWS EC2 instance as the place where all accesses should be sent.<br>
This provides strong consistency of EBS volumes. As more EBS nodes continued to fail because of the race condition described above, the volume of such negotiations with the EBS control plane increased. Because data was not being successfully re-mirrored, the number of these calls increased as the system retried and new requests came in. The load caused a brown-out of the EBS control plane and again affected EBS APIs across the Region.<br>
This failure vector was the inspiration behind Physalia’s design goal of limiting the blast radius of failures, including overload, software bugs, and infrastructure failures.</p>
<h2 id="capconsistencyavailabilitypartitiontoleranceforphysalia">CAP consistency availability partition tolerance for physalia</h2>
<p>As proven by Gilbert and Lynch [22], <strong>it is not possible</strong> for a distributed system to offer both strong consistency (in the sense of linearizability [31]) and be available to all clients in the presence of network partitions. Unfortunately, all real-world distributed systems must operate in the presence of network partitions [6], so systems <strong>must choose</strong> between strong consistency and availability.<br>
Strong consistency is non-negotiable in Physalia, because it’s required to ensure the correctness of the EBS replication protocol. However, because <strong>chain replication requires a configuration change during network partitions</strong>, it is especially important for Physalia to <strong>be available during partitions</strong>.<br>
Physalia then has the goal of optimizing for availability during network partitions, while remaining strongly consistent. <strong>Our core observation</strong> is that we do not require all keys to be available to all clients. In fact, <strong>each key needs to be available at only three points in the network: the AWS EC2 instance that is the client of the volume, the primary copy, and the replica copy.</strong> Through careful placement, based on our system’s knowledge of network and power topology, we can significantly increase the probability that Physalia is available to the clients that matter for the keys that matter to those clients.</p>
<h2 id="brewersconjectureandthefeasibilityofconsistentavailablepartitiontolerantwebservices">Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services</h2>
<p><a href="https://dl.acm.org/doi/10.1145/564585.564601">https://dl.acm.org/doi/10.1145/564585.564601</a></p>
<h2 id="linearizabilityacorrectnessconditionforconcurrentobjects">Linearizability: A correctness condition for concurrent objects</h2>
<p><a href="https://dl.acm.org/doi/10.1145/78969.78972">https://dl.acm.org/doi/10.1145/78969.78972</a></p>
<h2 id="thenetworkisreliablereview">The network is reliable :review</h2>
<p><a href="https://dl.acm.org/doi/10.1145/2643130">https://dl.acm.org/doi/10.1145/2643130</a></p>
<h2 id="motivationforphysalia">motivation for physalia</h2>
<p>C vs A<br>
This is Physalia’s key contribution, and our motivation for building a new system from the ground up: infrastructure aware placement and careful system design can significantly reduce the effect of network partitions, infrastructure failures, and even software bugs.</p>
<h2 id="comparepaxosmadelive">compare paxos made live</h2>
<p><a href="https://dl.acm.org/doi/10.1145/1281100.1281103">https://dl.acm.org/doi/10.1145/1281100.1281103</a><br>
the details, choices and tradeoffs that are required to put a consensus system into production. Our concerns, notably blast radius reduction and infrastructure awareness, are significantly different from that paper</p>
<h2 id="physaliadesigngoal">Physalia design goal</h2>
<ul>
<li>Physalia’s goals of blast radius reduction and partition tolerance required careful attention in the design of <strong>the data model</strong>, <strong>replication mechanism</strong>, <strong>cluster management</strong> and even <strong>operational and deployment procedures</strong>.</li>
<li>to be easy and cheap to operate, contributing negligibly to the cost of our dataplane. We wanted its data model to be flexible enough to meet future uses in similar problem spaces, and to be easy to use correctly</li>
<li>highly scalable,able to support an entire EBS availability zone in a single installation.</li>
</ul>
<h2 id="gcmsivfullnoncemisuseresistantauthenticatedencryptionatunderonecycleperbyte">Gcm-siv: Full nonce misuse-resistant authenticated encryption at under one cycle per byte</h2>
<h2 id="designnodescellsandthecolony">design: nodes, cells and the colony</h2>
<p>Physalia’s high-level organization:<br>
each Physalia installation is a colony, made up of many cells. The cells live in the same environment: a mesh of nodes, with each node running on a single server. Each cell manages the data of a single partition key, and is implemented using a distributed state machine, distributed across seven nodes. Cells do not coordinate with other cells, but each node can participate in many cells. The colony, in turn, can consist of any number of cells (provided there are sufficient nodes to distribute those cells over).</p>
<h2 id="nodescellsandthecolonyrelationgraph">nodes, cells and the colony relation graph</h2>
<p><img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-17.02.02.png" alt="Millions of Tiny Databases"><br>
The cells live <strong>in the same environment</strong>: a mesh of nodes, with each node running on a single server. Each cell manages <strong>the data of a single partition key</strong>, and is implemented using a Paxos-based distributed state machine, distributed across <strong>seven</strong> nodes. Cells do not coordinate with other cells, but each node can participate in many cells.<br>
<strong>The division of a colony into a large number of cells</strong> is our main tool for reducing blast radius in Physalia. Each node is only used by a small subset of cells, and each cell is only used by a small subset of clients.</p>
<h2 id="cellgraph">cell graph</h2>
<p><img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-17.02.49.png" alt="Millions of Tiny Databases"><br>
a mesh of nodes holding <strong>a single Paxos-based distributed state machine</strong>, with one of the nodes playing the role of <strong>distinguished proposer</strong>.</p>
<h2 id="physaliacolony">physalia colony</h2>
<p>Each Physalia colony includes a number of control plane components. The control plane plays a critical role in maintaining system properties. When a new cell is created, the control plane uses its knowledge of the power and network topology of the datacenter (discovered from AWS’s datacenter automation systems) to choose a set of nodes for the cell. The <strong>choice of nodes</strong> balances <strong>two competing priorities.</strong> Nodes should be placed close to the clients (where close is measured in logical distance through the network and power topology) to ensure that failures far away from their clients do not cause the cell to fail. They must also be placed with sufficient diversity to ensure that small-scale failures do not cause the cell to fail. Section 3 explores the details of placement’s role in availability.</p>
<h2 id="howthecellcreationandrepair">how the cell creation and repair?</h2>
<p>The cell creation and repair workflows respond to requests to create new cells (by placing them on under-full nodes), handling cells that contain failed nodes (by replacing these nodes), and moving cells closer to their clients as clients move (by incrementally replacing nodes with closer ones).</p>
<h2 id="thenodeisthedatanodeofebs">is the node the data node of EBS??</h2>
<h2 id="thetradeoffforthecontrolplaneasaseperationsystem">the tradeoff of implementing the control plane as a separate system</h2>
<p>We could have avoided implementing a separate control plane and repair workflow for Physalia, by following the example of elastic replication [2] or Scatter [23]. We evaluated these approaches, but decided that the additional complexity, and additional communication and dependencies between shards, were at odds with our focus on blast radius. We chose to keep our cells completely independent, and implement the control plane as a separate system.<br>
<a href="https://doi.org/10.1145/2523616.2523623">https://doi.org/10.1145/2523616.2523623</a><br>
<a href="https://doi.org/10.1145/2043556.2043559">https://doi.org/10.1145/2043556.2043559</a><br>
What are the shards in Physalia?</p>
<p>Meaning: implementing the control plane alongside the cell state machines was not appropriate, so it is now implemented as a separate system?<br>
Is this implemented within the chain replication in EBS?</p>
<h2 id="designcellisfalvorofpaxos">design: cell is a flavor of Paxos</h2>
<p>The design of each cell is a straightforward consensus-based distributed state machine. Cells use Paxos [35] to create an ordered log of updates, with batching and pipelining [48] to improve throughput. Batch sizes and pipeline depths are kept small, to keep per-item work well bounded and ensure short time-to-recovery in the event of node or network failure.<br>
Physalia uses a custom implementation of Paxos, which keeps all required state both in memory and persisted to disk.<br>
The control plane tries to ensure that each node contains a different mix of cells, which reduces the probability of correlated failure due to load or poison pill transitions. In other words, if a poisonous transition crashes the node software on each node in the cell, only that cell should be lost. In the EBS deployment of Physalia, we deploy it to large numbers of nodes well-distributed across the datacenter. This gives <strong>the Physalia control plane</strong> more placement options, allowing it to optimize for widely-spread placement.</p>
<h2 id="cellpaxosimplusewhatproposalsoptimisticapproach">how does the cell Paxos implementation handle proposals: the optimistic approach</h2>
<p>proposals are accepted optimistically. All transactions given to the proposer are proposed, and at the time they are to be applied (i.e. all transactions with lower log positions have already been applied), they are committed or ignored depending on whether the write conditions pass.</p>
<h2 id="theoptimisticapproahcadvantageanddisadvantage">the optimistic approach: advantage and disadvantage</h2>
<p>The advantage of this optimistic approach is that the system always makes progress if clients follow the typical optimistic concurrency control (OCC) pattern.<br>
The disadvantage is that the system may do significant additional work during contention, passing many proposals that are never committed.</p>
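<p>A toy sketch of this optimistic flow (hypothetical names, single-node stand-in for a cell, not Physalia code): proposals always enter the log, and write conditions are only checked at apply time:</p>

```python
# Illustrative model: every proposal passes Paxos (here, just an ordered log);
# the conditional-write check happens when the transition is applied.

class Cell:
    """Toy single-node stand-in for a Physalia cell's state machine log."""
    def __init__(self):
        self.store = {}   # key -> (value, version)
        self.log = []     # ordered, accepted proposals

    def propose(self, key, expected_version, new_value):
        # Proposals are accepted optimistically: they always enter the log.
        self.log.append((key, expected_version, new_value))

    def apply_all(self):
        # At apply time, commit or ignore depending on the write condition.
        results = []
        for key, expected, new_value in self.log:
            _, current = self.store.get(key, (None, 0))
            if current == expected:
                self.store[key] = (new_value, current + 1)
                results.append("committed")
            else:
                results.append("ignored")  # wasted work under contention
        self.log.clear()
        return results

cell = Cell()
cell.propose("k", 0, "a")
cell.propose("k", 0, "b")   # stale expected version: enters the log, fails apply
print(cell.apply_all())     # ['committed', 'ignored']
```

<p>Under the OCC pattern, the client whose proposal was ignored re-reads the current version and retries.</p>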
<h2 id="optimisticconcurrencycontrolocc">optimistic concurrency control (OCC)</h2>
<h2 id="whyselectsevennodeasacellpaxos">why select seven nodes for a cell’s Paxos</h2>
<ul>
<li>Durability improves exponentially with larger cell size [29]. Seven replicas means that each piece of data is durable to at least four disks, offering durability around 5000x higher than the 2-replication used for the volume data.</li>
<li>Cell size has little impact on mean latency, but larger cells tend to have lower high percentiles because they better reject the effects of slow nodes, such as those experiencing GC pauses [17].</li>
<li>The effect of cell size on availability depends on the type of failures expected. Smaller cells offer lower availability in the face of small numbers of uncorrelated node failures, but better availability when the proportion of node failure exceeds 50%. While such high failure rates are rare, <strong>they do happen in practice</strong>, and they are a key design concern for Physalia.</li>
<li>Larger cells consume more resources, both because Paxos requires O(cell size) communication, and because a larger cell needs to keep more copies of the data. The relatively small transaction rate, and very small data, stored by the EBS use of Physalia made this a minor concern.</li>
</ul>
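<p>A back-of-envelope sketch of the first bullet (illustrative only; it assumes independent per-disk failure with probability p per window, which real, correlated failures violate):</p>

```python
# Losing data requires every durable copy's disk to fail in the same window:
# 7 Paxos replicas persist each item to at least 4 disks, versus the 2 disks
# used for the 2-replicated volume data. p is an assumed, illustrative value.

p = 0.01   # assumed per-disk failure probability in some window

loss_2_replicas = p ** 2   # volume data: both copies lost
loss_4_disks    = p ** 4   # cell data: all four durable copies lost

print(round(loss_2_replicas / loss_4_disks))   # 10000: same order as the ~5000x claim
```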
<h2 id="designdatamodelandapi">design: data model and api</h2>
<p>The core of the Physalia data model is a partition key. Each EBS volume is assigned a unique partition key at creation time, and <strong>all operations for that volume occur within that partition key</strong>. Within each partition key, Physalia offers a <strong>transactional store</strong> with a <strong>typed</strong> key-value schema, supporting strict serializable reads, writes and conditional writes over any combination of keys. It also supports simple in-place operations like atomic increments of integer variables.<br>
<img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-18.01.49.png" alt="Millions of Tiny Databases"><br>
There is one layer of partition keys, any number (within operational limitations) of string keys within a partition, and one value per key. The API can address only one <strong>partition key at a time</strong>, and offers strict serializable batch and conditional operations within the partition.</p>
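<p>A toy model of this per-partition transactional store (all names are hypothetical; this is a sketch of the semantics described above, not Physalia’s actual API):</p>

```python
# One partition key == one EBS volume. Conditions and writes over any
# combination of keys apply atomically within the partition, or not at all.

class Partition:
    """Toy in-memory model of one partition key's key-value store."""
    def __init__(self):
        self.data = {}

    def transact(self, conditions, writes):
        # Strict serializability is trivial here: one partition, one thread.
        # All conditions are checked first; writes apply atomically or not at all.
        for key, expected in conditions.items():
            if self.data.get(key) != expected:
                return False
        self.data.update(writes)
        return True

volume = Partition()   # one EBS volume's replication configuration
volume.transact({}, {"primary": "node-a", "replica": "node-b", "epoch": 1})

# Flip the primary only if the epoch we read is still current:
ok = volume.transact({"epoch": 1}, {"primary": "node-b", "epoch": 2})
print(ok)                        # True
print(volume.data["primary"])    # node-b
```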
<h2 id="thedesigngoalbalancetwocompetingconcernsofthephysaliaapi">the design goal balance two competing concerns of the physalia api</h2>
<p>balance two competing concerns</p>
<ul>
<li>The API needed to be expressive enough for clients to take advantage of the (per-cell) transactional nature of the underlying store, including expressing conditional updates, and atomic batch reads and writes.</li>
<li>Increasing API expressiveness, on the other hand, increases the probability that the system will be able to accept a transition that cannot be applied (a poison pill).<br>
The Physalia API is inspired by the Amazon DynamoDB API, which supports atomic batched and single reads and writes, conditional updates, paged scans, and some simple in-place operations like atomic increments. We extended the API by adding a compound read-and-conditional-write operation.</li>
</ul>
<h2 id="physaliaapidatafeildwhynotsupportfloatingpointdatatype">physalia api data fields: why not support floating-point data types</h2>
<p>Floating-point data types and limited-precision integers are not supported due to difficulties in ensuring that nodes will produce identical results when using different software versions and hardware (see [24] and chapter 11 of [1]).</p>
<h2 id="howsupportstringintheapidatafeild">how are strings supported in the api data fields</h2>
<p>Physalia’s data fields are strongly but dynamically typed. Supported field types include byte arrays (typically used to store UTF-8 string data), arbitrary precision integers, and booleans. Strings are not supported directly, but may be offered as a convenience in the client.<br>
As in any distributed state machine, it’s important that each node in a cell gets identical results when applying a transition.</p>
<h2 id="whynotsupportsqlintheapi">why not support sql in the api?</h2>
<p>Same reason as the floating-point types: it is hard to ensure that complex updates are applied the same way by all nodes, across all software versions.</p>
<h2 id="theapisupporttwoconsistencymodetoclients">the api supports two consistency modes for clients</h2>
<ul>
<li>In the consistent mode, read and write transactions are both linearizable and serializable, due to being serialized through the state machine log. Most Physalia clients use this consistent mode.</li>
<li>The eventually consistent mode supports only reads (all writes are consistent), and offers a consistent prefix [7] to all readers and monotonic reads [53] within a single client session. Eventually consistent reads are provided for monitoring and reporting (where the extra cost of linearizing reads is not worth it), and the discovery cache (which is eventually consistent anyway).</li>
</ul>
<h2 id="apieventuallyconsistentforwhatusecase">API: eventually consistent reads for what use cases</h2>
<p>Eventually consistent reads are provided for monitoring and reporting (where the extra cost of linearizing reads is not worth it), and the discovery cache (which is eventually consistent anyway).</p>
<h2 id="readandwritetransactionsarebothlinearizableandserializablemeaning">read and write transactions are both linearizable and serializable meaning</h2>
<h2 id="consistentprefix">consistent prefix</h2>
<p>Rethinking eventual consistency<br>
<a href="https://dl.acm.org/doi/10.1145/2463676.2465339">https://dl.acm.org/doi/10.1145/2463676.2465339</a></p>
<h2 id="monotonicreads53withinasingleclientsession">monotonic reads [53] within a single client session</h2>
<p>Session guarantees for weakly consistent replicated data<br>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.2269&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.2269&amp;rep=rep1&amp;type=pdf</a></p>
<h2 id="readafterwrite">read after write</h2>
<h2 id="apiofferfirstclasslease">api offers first-class leases</h2>
<p>lightweight time-bounded locks<br>
The lease implementation is designed to tolerate <strong>arbitrary clock skew and short pauses</strong>, but will give incorrect results if <strong>long-term clock rates are too different</strong>. In our implementation, this means that the fastest node clock is advancing at more than three times the rate of the slowest clock. Despite lease safety being highly likely, leases are only used where they are not critical for data safety or integrity.</p>
<h2 id="leasesanefficientfaulttolerantmechanismfordistributedfilecacheconsistency">Leases: An efficient fault-tolerant mechanism for distributed file cache consistency</h2>
<p><a href="https://dl.acm.org/doi/10.1145/74851.74870">https://dl.acm.org/doi/10.1145/74851.74870</a></p>
<h2 id="whateverycomputerscientistshouldknowaboutfloatingpointarithmetic">What every computer scientist should know about floating-point arithmetic</h2>
<p><a href="https://doi.org/10.1145/103162.103163">https://doi.org/10.1145/103162.103163</a><br>
IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008</p>
<h2 id="intheapihowproposerbatchedreadwriteinasingletransaction">In the API, how does the proposer batch reads/writes in a single transaction</h2>
<p>In the Physalia API, all keys used to read and write data, as well as conditions for conditional writes, are provided in the <strong>input transaction</strong>. This allows the proposer to efficiently detect which changes can be safely batched in a single transaction without changing their semantics. When a batch transaction is rejected, for example due to a conditional put failure, the proposer can remove the offending change from the batch and re-submit, or submit those changes without batching.</p>
<h2 id="designreconfigurationteachingandlearning">design: reconfiguration, teaching and learning</h2>
<p>As with our core consensus implementation, Physalia does not innovate on reconfiguration.<br>
The approach taken of storing per-cell configuration in the distributed state machine and passing a transition with the existing jury to update it follows the pattern established by Lampson [37].</p>
<h2 id="howtobuildahighlyavailablesystemusingconsensuslampsonreconfigure">How to build a highly available system using consensus Lampson -- reconfigure</h2>
<p><a href="https://courses.cs.washington.edu/courses/csep590/04wi/assignments/paxos_lampson.pdf">https://courses.cs.washington.edu/courses/csep590/04wi/assignments/paxos_lampson.pdf</a></p>
<h2 id="asignificantfactorinthecomplexityofreconfiguration">A significant factor in the complexity of reconfiguration</h2>
<p>the interaction with pipelining: configuration changes accepted at log position i must not take effect logically until position i+α, where α is the maximum allowed pipeline length (illustrated in Figure 6). Physalia keeps α small (typically 3), and so simply waits for natural traffic to cause reconfiguration to take effect (rather than stuffing no-ops into the log). This is a very sharp edge in Paxos, which doesn’t exist in either Raft [42] or Viewstamped Replication [41].<br>
<img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-18.44.29.png" alt="Millions of Tiny Databases"></p>
<h2 id="reconfigureintheraftpaxosviewstampedreplication">reconfigure in the raft /paxos/ viewstamped replication</h2>
<h2 id="raft">raft</h2>
<p>In search of an understandable consensus algorithm<br>
<a href="https://dl.acm.org/doi/10.5555/2643634.2643666">https://dl.acm.org/doi/10.5555/2643634.2643666</a><br>
<a href="https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf">https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf</a></p>
<h2 id="viewstampedreplication">viewstamped replication</h2>
<p>Viewstamped replication: A new primary copy method to support highly-available distributed systems<br>
<a href="https://dl.acm.org/doi/10.1145/62546.62549">https://dl.acm.org/doi/10.1145/62546.62549</a></p>
<h2 id="physaliareconfigurationisunusual">physalia reconfiguration is unusual</h2>
<ul>
<li>reconfiguration happens frequently. The colony-level control plane actively moves Physalia cells to be close to their clients. It does this by replacing far-away nodes with close nodes using reconfiguration. The small data sizes in Physalia make cell reconfiguration an insignificant portion of overall datacenter traffic</li>
</ul>
<h2 id="howphysaliamovenodetoclientincell">how physalia moves a cell’s nodes closer to the client</h2>
<p><img src="https://blog.approachai.com/content/images/2020/03/Screen-Shot-2020-03-31-at-18.51.16.png" alt="Millions of Tiny Databases"><br>
movement by iterative reconfiguration. The system prefers safety over speed, moving a single node at a time (and waiting for that node to catch up) to minimize the impact on durability. The small size of the data in each cell allows reconfiguration to complete quickly, typically allowing movement to complete within a minute.</p>
<h2 id="nodejoinorrejoinacellbyteaching">node join or rejoin a cell by teaching</h2>
<p>When nodes join or re-join a cell they are brought up to speed by teaching, a process we implement outside the core consensus protocol.<br>
We support three modes of teaching:</p>
<ul>
<li>In the bulk mode, most suitable for new nodes, the teacher (any existing node in the cell) transfers a bulk snapshot of its state machine to the learner.</li>
<li>In the log-based mode, most suitable for nodes re-joining after a partition or pause, the teacher ships a segment of its log to the learner. We have found that this mode is triggered rather frequently in production, due to nodes temporarily falling behind during Java garbage collection pauses. Log-based learning is chosen when the size of the missing log segment is significantly smaller than the size of the entire dataset.</li>
<li>packet loss and node failures may leave persistent holes in a node’s view of the log. If nodes are not able to find another node to teach them the decided value in that log position (or no value has been decided), they use a whack-a-mole learning mode. In whack-a-mole mode, a learner actively tries to propose a <strong>no-op transition</strong> into the vacant log position. This can have two outcomes: either the acceptors report no other proposals for that log position and the no-op transition is accepted, or another proposal is found and the learner proposes that value. This process is always safe in Paxos, but can affect liveness, so learners apply substantial jitter to whack-a-mole learning.</li>
</ul>
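<p>A sketch of the whack-a-mole mode (illustrative; full Paxos rounds and the jitter are omitted): the learner proposes a no-op into the hole, and either the no-op wins or the value discovered from the acceptors is re-proposed:</p>

```python
# Collapses a Paxos round into one step for illustration: prepare reveals any
# previously accepted value, which must then be proposed instead of the no-op.

def whack_a_mole(acceptor_state, position):
    """Fill a hole at `position`; acceptor_state maps position -> accepted value."""
    existing = acceptor_state.get(position)
    if existing is None:
        acceptor_state[position] = "no-op"   # safe: nothing was ever decided here
    else:
        acceptor_state[position] = existing  # re-propose the discovered value
    return acceptor_state[position]

acceptors = {4: "put k v"}         # position 5 is a hole, position 4 is decided
print(whack_a_mole(acceptors, 5))  # no-op
print(whack_a_mole(acceptors, 4))  # put k v
```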
<h2 id="paxospropser">paxos proposer</h2>
<h2 id="designdiscoverycache">design: discovery cache</h2>
<p>Where is the Discovery Cache??<br>
Clients find cells using a distributed discovery cache.<br>
The discovery cache is a <strong>distributed eventually-consistent cache</strong> which allows clients to discover which nodes contain a given cell (and hence a given partition key). Each cell periodically pushes updates to the cache identifying which partition key it holds and its node members. Incorrect information in the cache affects the liveness, but never the correctness, of the system.</p>
<h2 id="reducetheimpactofthediscoverycacheonavailability">reduce the impact of the discovery cache on availability</h2>
<p>client-side caching, forwarding pointers, and replication.<br>
First, it is always safe for a client to cache past discovery cache results, allowing them to refresh lazily and continue to use old values for an unbounded period on failure.<br>
Second, Physalia nodes keep long-term (but not indefinite) forwarding pointers when cells move from node to node. Forwarding pointers include pointers to all the nodes in a cell, making it highly likely that a client will succeed in pointer chasing to the current owner provided that it can get to at least one of the past owners.<br>
Finally, because the discovery cache is small, we can economically keep many copies of it, increasing the probability that at least one will be available.</p>
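<p>A toy sketch of the lookup path these mitigations give a client (hypothetical names): stale cache entries remain useful because forwarding pointers from past owners lead to the current members:</p>

```python
# Old nodes keep forwarding pointers to all members of the moved cell, so
# reaching any past owner is enough to chase pointers to a current member.

forwarding = {}      # node -> nodes the cell moved to (forwarding pointers)
current_owner = {"cell-1": ["n4", "n5", "n6"]}   # authoritative membership

def locate(cell_id, cached_nodes):
    """Find cell_id's current members, starting from possibly-stale cache data."""
    frontier = list(cached_nodes)
    seen = set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in current_owner.get(cell_id, []):
            return current_owner[cell_id]          # found a live member
        frontier.extend(forwarding.get(node, []))  # chase forwarding pointers
    return None

# The cell moved away from n1; n1 keeps pointers to all the new members.
forwarding["n1"] = ["n4", "n5", "n6"]
print(locate("cell-1", ["n1", "n2"]))   # ['n4', 'n5', 'n6']
```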
<h2 id="systemlivenessandcorrectness">system liveness and correctness</h2>
<h2 id="designsystemmodelandbyzantinefaults">design: system model and Byzantine Faults</h2>
<p>In designing Physalia, we <strong>assumed a system model</strong> where messages can be arbitrarily lost, replayed, re-ordered, and modified after transmission. Message authentication is implemented using a cryptographic HMAC on each message,<br>
<strong>guarding against corruption occurring in lower layers</strong>. Messages which fail authentication are simply discarded. Key distribution, used both for authentication and prevention of unintentional Sybil-style attacks [20] is handled by our environment (and therefore out of the scope of Physalia), optimising for frequent and low-risk <strong>key rotation</strong>.<br>
This model extends <strong>the “benign faults” assumptions of Paxos</strong> [11] slightly, but stops short of Byzantine fault tolerance 1 . While Byzantine consensus protocols are well understood, they add significant complexity to both software and<br>
system interactions, as well as testing surface area. Our approach was to keep the software and protocols simpler, and <strong>mitigate issues such as network and storage corruption with cryptographic integrity and authentication checks</strong> at these<br>
layers.<br>
1 This approach is typical of production consensus-based systems, including popular open-source projects like Zookeeper and etcd.</p>
<h2 id="sybilstyleattacks">Sybil-style attacks</h2>
<h2 id="practicalbyzantinefaulttolerance">Practical byzantine fault tolerance</h2>
<p><a href="http://pmg.csail.mit.edu/papers/osdi99.pdf">http://pmg.csail.mit.edu/papers/osdi99.pdf</a></p>
<h2 id="howbuildsystemtoleratefaultsinsinglemachinesanduncorrelatedfailuresofasmallnumberofmachines">how to build systems that tolerate faults in single machines, and uncorrelated failures of a small number of machines</h2>
<p>Use state-machine replication with consensus.<br>
In theory, systems built using this pattern can achieve extremely high availability. In practice, however, achieving high availability is challenging<br>
Studies across three decades (including Gray in 1990 [26], Schroeder and Gibson<br>
in 2005 [50] and Yuan et al in 2014 [57]) have found that <strong>software, operations, and scale</strong> drive downtime in systems designed to tolerate hardware faults. Few studies consider a factor that is especially important to cloud customers: <strong>large-scale correlated failures</strong> which affect many cloud resources at<br>
the same time.</p>
<h2 id="availabilityinconsensussystemphysaliavsthemonolith">Availability in consensus system: Physalia vs the Monolith</h2>
<p>It is well known that it is not possible to offer <strong>both all-clients availability and consistency in distributed databases</strong> due to the presence of network partitions. It is, <strong>however</strong>, possible to offer both consistency and availability to <strong>clients on the majority side of a network partition.</strong> To be as available as possible to its clients, <strong>Physalia needs to be on the same side</strong> of any network partition as them. The client, data master and data replicas are near each other on the network, so Physalia needs to be nearby too. Reducing the number of network<br>
devices between the Physalia database and its clients reduces the possibility of a network partition forming between them for the simple reason that fewer devices means that there’s less to go wrong.<br>
Physalia also optimizes for blast radius. We are not only concerned with the availability of the whole system, but want to avoid failures of the whole system entirely. When failures happen, due to any cause, they should affect as small a subset of clients as possible. <strong>Limiting the number of cells depending on a single node, and clients on a single cell</strong>, significantly reduces the effect that one failure can have on the overall system.<br>
<strong>A monolithic system has the advantage of less complexity</strong>. No need for the discovery cache, most of the control plane, cell creation, placement, etc. Our experience has shown that simplicity improves availability, so this simplification would be a boon. <strong>On the other hand, the monolithic approach loses out on partition tolerance</strong>. It needs to make a trade-off between being localized to a small part of the network (and so risking being partitioned away from clients), or being spread over the network (and so risking suffering an internal partition making some part of it unavailable). The monolith also increases blast radius: a single bad software deployment could cause a complete failure (this is similar to the node count trade-off of Figure 4, with one node).</p>
<h2 id="ebscontrolplane">EBS control plane</h2>
<p>of which the Physalia control plane is a part</p>
<h2 id="availabilityincsplacementforavailability">Availability in cs: Placement For Availability</h2>
<p>The EBS control plane (of which the Physalia control plane is a part) continuously optimizes the availability of <strong>the EBS volume P(Av) to the client AWS EC2 instance</strong>, and <strong>the EBS storage servers</strong> that store the volume.<br>
In terms of the availability of the volume (Av), and the instance (Ai), the control plane optimizes the conditional probability P(Av|Ai).<br>
The ideal solution to this problem is to entirely co-locate the volume and instance, but EBS offers the ability to detach a volume from a failed instance, and re-attach it to another instance. To make this useful, volumes must continue to be durable even if the instance suffers a failure. Placement must therefore balance the concerns of having the volume close enough for correlated availability, but far enough away for sufficiently independent durability to meet EBS’s durability promise.<br>
As an example, consider an idealized datacenter with three levels of network (servers, racks and rows) and three power domains (A, B and C). The client <strong>instance</strong> is on one rack, <strong>the primary copy</strong> on another, and <strong>replica copy</strong> on a third, all within the same row. Physalia’s placement will then ensure that all nodes for the cell are within the row (there’s no point being available if the row is down), but spread across at least three racks to ensure that the loss of one rack doesn’t impact availability. It will also ensure that the nodes are in three different power domains, with no majority in any single domain.</p>
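<p>The constraints in the idealized-datacenter example can be expressed as a simple predicate. The three-level topology and the tuple encoding below are toy assumptions taken from the example, not EBS's real placement model.</p>

```python
from collections import Counter

def placement_ok(nodes, client_row, min_racks=3):
    """Check a candidate cell placement against the example's constraints.
    `nodes` is a list of (row, rack, power_domain) tuples."""
    # All nodes in the client's row: no point being up if the row is down.
    if any(row != client_row for row, _, _ in nodes):
        return False
    # Spread over enough racks that losing one rack keeps a majority alive.
    if len({rack for _, rack, _ in nodes}) < min_racks:
        return False
    # No single power domain may hold a majority of the cell's nodes.
    majority = len(nodes) // 2 + 1
    return max(Counter(pd for _, _, pd in nodes).values()) < majority
```

<p>A seven-node cell spread over seven racks and three power domains (at most three nodes per domain) passes; concentrating the cell on one rack or giving any power domain four of seven nodes fails.</p>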
<h2 id="theplacementschemeforavailabilityfacetwochallenges">the placement scheme for availability faces two challenges</h2>
<ul>
<li>One is that real-world datacenter topology is significantly more complex</li>
<li><strong>EBS volumes</strong> move by replication, and their clients move by customers detaching their volumes from one instance and attaching them to another. The Physalia control plane continuously responds to these changes in state, <strong>moving nodes to ensure that placement constraints continue to be met</strong>.</li>
</ul>
<h2 id="availabilityincsnoninfrastructureavailabilityconcerns">Availability in cs: Non-Infrastructure Availability Concerns</h2>
<p>Another significant challenge with building <strong>high-availability distributed state machines</strong> is <strong>correlated work</strong>.<br>
In a typical distributed state machine design, each node is processing the<br>
same updates and the same messages in the same order. This leads the software on the machines to be in the same state. In our experience, this is a common cause of outages in real-world systems: redundancy does not add availability if failures<br>
are highly correlated<br>
Software deployments and configuration changes also contribute to downtime. The fault-tolerant nature of distributed state machines makes incremental deployment approaches less effective: because the system is designed to tolerate failure in less than half of hosts, failure may not be evident until new code is deployed to half of all hosts. Positive validation, where the deployment system checks that new nodes are taking traffic, reduces but does not eliminate this risk<br>
Poison pills are a particularly interesting case of software failure.<br>
All of these factors limit the availability of any single distributed state machine, as observed by its clients. To achieve maximum availability, we need many such systems spread throughout the datacenter. This was the guiding principle of<br>
Physalia: instead of one database, build millions.</p>
<h2 id="poisonpills">Poison pills</h2>
<p>a particularly interesting case of software failure. A poison pill is a transaction which passes validation and is accepted into the log, but cannot be applied without causing an error. Poison pills are typically caused by under-specification in the transaction logic (&quot;what does dividing by zero do?&quot;, &quot;what does it mean to decrement an unsigned zero?&quot;), and are fixed by fully specifying these behaviors (a change which comes with its own backward-compatibility challenges).</p>
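<p>A minimal illustration of the "decrement an unsigned zero" case: the same logged transaction either poisons the state machine or applies cleanly, depending on whether the behavior is fully specified. The function names below are invented for illustration.</p>

```python
def apply_decrement(value, amount):
    """Under-specified apply: a decrement that already passed validation and
    was accepted into the log raises on underflow -- a poison pill, because
    every replica hits the same error when replaying the log."""
    result = value - amount
    if result < 0:
        raise ValueError("unsigned underflow")
    return result

def apply_decrement_saturating(value, amount):
    """Fully-specified variant: underflow is defined to clamp at zero, so
    the logged transaction can always be applied deterministically."""
    return max(0, value - amount)
```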
<h2 id="availabilityincsoperationalpractices">Availability in cs: operational practices</h2>
<p>operations, including code and configuration deployments, routine system operations such as security patching, and scaling for increased load, are dominant contributors to system downtime, despite ongoing investments in reducing operational defect rates.<br>
Operational practices at AWS already separate operational tasks by region and availability zone, ensuring that operations are not performed across many<br>
of these units at the same time.<br>
Physalia goes a step further than this practice, by introducing the notion of colors. Each cell is assigned a color, and <strong>each cell is constructed only of nodes of the same color.</strong> The control plane ensures that colors are evenly spread around the datacenter, and color choice minimally constrains <strong>how close a cell can be to its clients.</strong><br>
When software deployments and other operations are performed, they proceed color-by-color. Monitoring and metrics are set up to look for anomalies in single colors. Colors also provide a layer of isolation against load-related and poison pill failures. Nodes of different colors don’t communicate with each other, making it significantly less likely that a poison pill or overload could spread across<br>
colors.</p>
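<p>A color-by-color rollout can be sketched as below. The callbacks <code>deploy</code> and <code>healthy</code> are hypothetical hooks for the real deployment system and its monitoring; the point is that a bad artifact halts at the first anomalous color instead of spreading.</p>

```python
def deploy_by_color(nodes_by_color, deploy, healthy):
    """Roll a change out one color at a time. Returns the colors completed
    and the color (if any) where anomalies stopped the rollout."""
    done = []
    for color, nodes in sorted(nodes_by_color.items()):
        for node in nodes:
            deploy(node)
        # Monitoring looks for anomalies in this single color before
        # touching the next one, bounding the blast radius to one color.
        if not all(healthy(node) for node in nodes):
            return done, color
        done.append(color)
    return done, None
```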
<h2 id="availabilityincsloadinsometimescoordinatingsystems">Availability in cs: Load in Sometimes-Coordinating Systems</h2>
<p>Load is another leading cause of correlated failures.<br>
a consensus-based system needs to include more than half of all nodes in each consensus decision, which means that overload can take out more than half of all nodes. Colors play a role in reducing the blast radius from load spikes from a few clients, but the load on Physalia is inherently spiky.<br>
During normal operation, load consists of a low rate of calls caused by the background rate of EBS storage server failures, and creation of <strong>new cells for new volumes</strong>. During large-scale failures, however, load can increase considerably. This is an inherent risk of sometimes-coordinating systems like EBS: recovery load is not constant, and highest during bad network or power conditions.<br>
Per-cell Physalia throughput, as is typical of Paxos-style systems, scales well up to a point, with significant wins coming from increased batch efficiency. Beyond this point, however, contention and the costs of co-ordination cause goodput to drop with increased load (as predicted by Gunther’s model [28]). To avoid getting into this reduced-goodput mode, cells reject load once their pipelines are full.<br>
Clients are expected to <strong>exponentially backoff</strong>, <strong>apply jitter,</strong> and eventually <strong>retry their rejected transactions</strong>. As the number of clients in the Physalia system is bounded, this places an absolute upper limit on load, at the cost of latency during overload.</p>
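<p>The retry policy described above is commonly implemented as capped exponential backoff with full jitter. A minimal sketch; the base and cap values are arbitrary placeholders, not the real client's configuration:</p>

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Sleep time before retry `attempt` (0-based): uniform in
    [0, min(cap, base * 2**attempt)]. The jitter spreads rejected clients
    out in time so they don't retry in lockstep and re-overload the cell."""
    return rng() * min(cap, base * 2 ** attempt)
```

<p>With a bounded client population, the cap on per-client retry rate translates into an absolute bound on offered load, at the cost of extra latency during overload.</p>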
<h2 id="jitterclient">jitter client</h2>
<h2 id="designbuildtestevaluation">design, build, test, evaluation</h2>
<h2 id="testing">testing</h2>
<p>Testing needs to cover not only the happy case, but also a wide variety of<br>
error cases<br>
jepsen test framework<br>
error handling is where many bugs hide out<br>
network partitions are rare events that easily hide bugs<br>
many consensus implementations also have bugs in the happy path.<br>
build a test harness which abstracts networking, performance, and other systems concepts (we call it a simworld).<br>
The goal of this approach is to allow developers to write distributed systems tests, including tests that simulate packet loss, server failures, corruption, and other failure cases, as unit tests in the same language as the system itself. In this case, these unit tests run inside the developer’s IDE (or with junit at build time), with no need for test clusters or other infrastructure<br>
The key to building a simworld is to build code against abstract physical layers (such as networks, clocks, and disks).</p>
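<p>The essence of a simworld is that the system is written against abstract clock and network interfaces, with deterministic in-memory fakes supplied in tests. A toy sketch of those fakes (the class names are invented, and the real harness is in Java):</p>

```python
class SimClock:
    """Deterministic simulated clock: tests advance time explicitly, so
    timeout and retry logic can be exercised without real waiting."""
    def __init__(self):
        self._now = 0.0
    def now(self):
        return self._now
    def advance(self, seconds):
        self._now += seconds

class SimNetwork:
    """In-memory 'network' whose failure behavior the test controls: a drop
    predicate simulates packet loss (reordering, duplication and corruption
    can be layered on in the same way)."""
    def __init__(self, drop=lambda msg: False):
        self._queues = {}
        self._drop = drop
    def send(self, dest, msg):
        if not self._drop(msg):
            self._queues.setdefault(dest, []).append(msg)
    def deliver_all(self, dest):
        msgs, self._queues[dest] = self._queues.get(dest, []), []
        return msgs
```

<p>Production code runs against the real clock and real sockets behind the same interfaces, so failure-injection tests run as ordinary unit tests in the IDE.</p>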
<ul>
<li>In addition to unit testing in the simworld, we adopted a number of other testing approaches. One of those approaches was <strong>a suite of automatically-generated tests</strong> which run the Paxos implementation through every combination of packet loss and reordering that a node can experience. This testing approach was inspired by the TLC model checker [56], and helped us build confidence that our implementation matched the formal specification</li>
<li>used the open source Jepsen tool [33] to test the system, and make sure that the API responses are linearizable under network failure cases. This testing, which happens at the infrastructure level, was a good complement to our lower-level tests as it could exercise some under-load cases that are hard to run in the simworld.</li>
<li>a number of game days against deployments of Physalia. A game day is a failure simulation that happens in a real production or production-like deployment of a system, an approach that has been popular at Amazon for 20 years. similar to the chaos engineering approach pioneered by Netflix [32], but typically focuses on larger-scale failures rather than component failures</li>
</ul>
<p>The Role of Formal Methods</p>
<h2 id="awssimworldjava">aws Simworld java</h2>
<h2 id="netflixchaosenginner">netflix chaos engineer</h2>
<p>Netflix Simian Army, 2011. URL:<br>
<a href="https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116">https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116</a></p>
<h2 id="jepsonio">jepsen io</h2>
<h2 id="modelcheckingtlaspecificationstlcmodelchecker">Model checking tla+ specifications TLC model checker</h2>
<p><a href="https://dl.acm.org/doi/10.5555/646704.702012">https://dl.acm.org/doi/10.5555/646704.702012</a></p>
<h2 id="testformalmethod">test formal method</h2>
<p>TLA+ [36] is a specification language that’s well suited to building formal models of concurrent and distributed systems. We use TLA+ extensively at Amazon [39], and it proved exceptionally useful in the development of Physalia<br>
<a href="https://dl.acm.org/doi/10.1145/2736348?dl=ACM&amp;coll=portal">https://dl.acm.org/doi/10.1145/2736348?dl=ACM&amp;coll=portal</a><br>
How Amazon web services uses formal methods<br>
<a href="https://dl.acm.org/doi/10.1145/2699417?dl=ACM&amp;coll=portal">https://dl.acm.org/doi/10.1145/2699417?dl=ACM&amp;coll=portal</a><br>
Our team used TLA+ in three ways: writing specifications of our protocols to check that we understand them deeply, model checking specifications against correctness and liveness properties using the TLC model checker, and writing extensively<br>
commented TLA+ code to serve as the documentation of our distributed protocols<br>
While all three of these uses added value, TLA+’s role as a sort of automatically tested (via TLC), and extremely precise, format for protocol documentation was<br>
perhaps the most useful. Our code reviews, simworld tests, and design meetings frequently referred back to the TLA+ models of our protocols to resolve ambiguities in Java code or written communication. We highly recommend TLA+ (and<br>
its PlusCal dialect) for this use.</p>
<h2 id="splitbrain">split brain</h2>
<p>quorum jury</p>
<h2 id="abaproblem">ABA problem</h2>
<h2 id="evaluationneedtodive">evaluation need to dive</h2>
<p>Evaluating the performance of a system like Physalia is challenging<br>
Performance, including throughput and latency, is important, but the most important performance metrics are how the system performs during extremely rare large-scale outages. We evaluate the performance of Physalia in production, and evaluate the design through simulations. We also use simulations to explore some particularly challenging whole-system aspects of Physalia.</p>
<h2 id="diffwithdistributedcoordinationsystem">diff with distributed co-ordination system</h2>
<p>Distributed co-ordination systems, like Zookeeper [19], Chubby [9], Boxwood [38] and etcd [14], have the goal of providing a highly-available and strongly-consistent set of basic operations that make implementing larger distributed systems easier. Physalia’s design approach is <strong>similar to some of these systems, being based on the state machine replication pattern</strong> popularized by the work<br>
of Schneider [49], Oki [40] and Lampson [37]. Physalia’s key differences from these systems are its <strong>fine-grained consensus (millions of distributed state machines, rather than a single one)</strong>, and <strong>infrastructure awareness</strong>. This makes Physalia more scalable and more resistant to network partitions, but also<br>
significantly more complex.<br>
Physalia draws ideas from both distributed co-ordination systems and distributed databases</p>
<h2 id="diffwithhighlyavailabledistributedstorageinfallibledatacenternetworks">diff with highly-available distributed storage in fallible datacenter networks</h2>
<p>The problem of providing highly-available distributed storage in fallible datacenter networks faces similar challenges to global and large-scale systems like OceanStore [34] and Farsite [3], with emphasis on moving data close to where it is expected to be used, to improve availability and latency. While the design of Physalia predates the publication of Spanner [15] and CosmosDB, Physalia takes some similar design approaches with similar motivation.</p>
<h2 id="diffwithhorizontalpartitioningofdatabase">diff with horizontal partitioning of database</h2>
<p>Systems like Dynamo [18] and its derivatives dynamically move partitions, and rely<br>
on client behavior or stateless proxies for data discovery. Dynamic discovery of high-cardinality data, as addressed by Physalia’s discovery cache and forwarding pointers, has been well explored by systems like Pastry [47] and Chord [52].<br>
Optimizing data placement for throughput and latency is also a well-established technique (such as in Tao [8], and Dabek et al [16]), but these systems are not primarily concerned with availability during partitions, and do not consider blast radius</p>
<h2 id="infrastructureawareplacement">infrastructure-aware placement</h2>
<p>reflects some techniques from software-defined networking (SDN) [21]. Another similarity with SDN (and earlier systems, like RCP [10]) is the emphasis on separating control and data planes, and allowing the data plane to consist of simple packet-forwarding elements. This reflects similar decisions to separate Physalia from the data plane of EBS, and the data and control planes of Physalia itself.<br>
Some systems (like SAUCR [4], and the model proposed by Chen et al [13]) are designed to change operating modes when infrastructure failures occur or request<br>
patterns change, but we are not aware of other databases explicitly designed to include data placement based on network topology (beyond simple locality concerns)</p>
<h2 id="conclusion">conclusion</h2>
<p>Physalia is a classic consensus-based database which takes a novel approach to availability: it is aware of the topology of datacenter power and networking, as well as the location of the clients that are most likely to need each row, and uses data placement to reduce the probability of network partitions. This<br>
approach was validated using simulation, and the gains have been borne out by our experience running it in production at high scale across over 60 datacenter-scale deployments. Its design is also optimized to reduce blast radius, reducing the<br>
impact of any single node, software, or infrastructure failure.<br>
While few applications have the same constraints that we faced, many emerging cloud patterns require strongly consistent access to local data. Having a highly-available strongly-consistent database as a basic primitive allows these systems<br>
to be simpler, more efficient, and offer better availability</p>
<h1 id="ref">ref</h1>
<p><a href="https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf">https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf</a><br>
<a href="https://www.usenix.org/sites/default/files/conference/protected-files/nsdi20_slides_brooker.pdf">https://www.usenix.org/sites/default/files/conference/protected-files/nsdi20_slides_brooker.pdf</a><br>
<a href="https://www.usenix.org/conference/nsdi20/presentation/brooker">https://www.usenix.org/conference/nsdi20/presentation/brooker</a><br>
<a href="https://www.amazon.science/blog/amazon-ebs-addresses-the-challenge-of-the-cap-theorem-at-scale">https://www.amazon.science/blog/amazon-ebs-addresses-the-challenge-of-the-cap-theorem-at-scale</a></p>
<h2 id="avalibiry">availability</h2>
<p>raid /failover /replication</p>
<h2 id="blockdiagram">block diagram</h2>
<p>what are the components, what does each of them own, and how do they communicate to other components.<br>
Getting the block diagram right helps with the design of database schemas and APIs, helps reason through the availability and cost of running the system, and even helps form the right org chart to build the design.</p>
<h2 id="controlplane">control plane</h2>
<p>when doing these design exercises is to separate components into a control plane and a data plane, recognizing the differences in requirements between these two roles.<br>
Along with the monolithic application itself, storage and load balancing are data plane concerns: they are required to be up for any request to succeed, and scale O(N) with the number of requests the system handles. On the other hand, failure tolerance, scaling and deployments are control plane concerns: they scale differently (either with a small multiple of N, with the rate of change of N, or with the rate of change of the software) and can break for some period of time before customers notice.<br>
Every distributed system has components that fall roughly into these two roles: data plane components that sit on the request path, and control plane components which help that data plane do its work. Sometimes, the control plane components aren't components at all, and rather people and processes, but the pattern is the same.<br>
<a href="https://brooker.co.za/blog/2019/03/17/control.html">https://brooker.co.za/blog/2019/03/17/control.html</a><br>
<a href="https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf">https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf</a></p>
<h2 id="whatphysaliastore">what physalia store</h2>
<p>The master stores a small amount of configuration data indicating which servers hold the data for a given volume and the order in which they replicate it, which is important for identifying up-to-date data. The replication protocol uses the configuration data to decide where application data should be stored, and it updates the configuration to point to the application data’s new location. Physalia is designed to play the role of the configuration master.</p>
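<p>A toy model of a failover update against such a configuration master; the field names below are invented for illustration, not Physalia's schema:</p>

```python
def replace_replica(config, volume, failed_server, new_server):
    """Point the volume's replication chain at a new server after a failure,
    bumping an epoch so stale readers can detect out-of-date configuration."""
    entry = config[volume]
    chain = [new_server if s == failed_server else s
             for s in entry["replica_order"]]
    config[volume] = {"replica_order": chain, "epoch": entry["epoch"] + 1}
    return config[volume]
```

<p>In the real system such an update would be a strongly consistent conditional write within the volume's partition key, so racing failover attempts cannot both succeed.</p>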
<h2 id="cellnodevsebsvolumemachines">cell node vs ebs volume machines</h2>
<p>maybe co-located, maybe separate</p>
<h2 id="formalmethodinaws">formal method in aws</h2>
<p>How Amazon web services uses formal methods<br>
<a href="https://dl.acm.org/doi/10.1145/2699417?dl=ACM&amp;coll=portal">https://dl.acm.org/doi/10.1145/2699417?dl=ACM&amp;coll=portal</a><br>
<a href="https://assets.amazon.science/fa/fb/fe142b0c4b3eab3ec3d15794e025/one-click-formal-methods.pdf">https://assets.amazon.science/fa/fb/fe142b0c4b3eab3ec3d15794e025/one-click-formal-methods.pdf</a><br>
<a href="https://www.amazon.science/publications/reachability-analysis-for-aws-based-networks">https://www.amazon.science/publications/reachability-analysis-for-aws-based-networks</a><br>
<a href="http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf">http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf</a><br>
<a href="http://lamport.azurewebsites.net/tla/tla.html">http://lamport.azurewebsites.net/tla/tla.html</a></p>
<h2 id="canwerepartitionintoanothercell">can we repartition into another cell?</h2>
<p>one EBS volume -&gt; a single cell (with 7 nodes, replicating the whole key set). what if a shard is still too big? can we repartition into another cell? would a single EBS volume key then point to two cells? Or am I missing something here?</p>
<p>%</p>
<p>since the emphasis is on 'tiny databases' I suspect that repartitioning / splits are out of scope.<br>
It's a configuration DB, so volume size doesn't matter. A single EBS volume has a single cell. That must be enough.</p>
<h2 id="whentheysaysinglekeytransactionalstoredoesmeanitsnotmvccright">When they say Single Key Transactional store, does that mean it's not “MVCC”?</h2>
<p>meaning we just don’t care about multi-key consistency in a single transaction, correct?</p>
<p>%</p>
<p>multiple keys in a single partition can be updated in a single txn (section 2.3: Within each partition key, Physalia offers a transactional store with a typed key-value schema, supporting strict serializable reads, writes and conditional writes over any combination of keys.). Here the key is not the partition key.<br>
Transactions are totally ordered, and are applied atomically, but I’m not sure there’s no need for concurrency control. Concurrency control is a property of a transaction manager in combination with a storage engine. If I understand it correctly, you need it even in a single-node database, let alone a distributed one. Can you elaborate on <strong>why you think there’s no need for it in this case</strong>? Another example of a db that has <strong>total ordering</strong> but <strong>still needs concurrency control</strong> (pessimistic, in this case) is Calvin.<br>
total order doesn’t really preclude concurrent execution. It just means that order is predetermined.<br>
I'm pretty sure you do need isolation, within a single partition, since there are multi-key general transactions. That's handled by serializing through Paxos, IIUC.<br>
You need CC if you execute your totally ordered transactions concurrently.</p>
<h2 id="totalorderoftransactionvsioslation">total order of transaction vs isolation</h2>
<h2 id="youneedccifyouexecuteyourtotallyorderedtransactionsconcurrently">You need CC if you execute your totally ordered transactions concurrently.</h2>
<p>however, that is only the case if the transactions they have are not CAS<br>
if they are CAS, I can't imagine they can get out of Paxos uncommitted</p>
<p>%</p>
<p>They're not just CAS. The only thing they do is gather their read sets at the beginning of the txn (like Calvin I think?).</p>
<p>If I recall correctly, both RAMP and Calvin do that. Not exactly same thing though: RAMP can collect writes for overlapping transactions, and Calvin collects data that satisfies the read set.</p>
<h2 id="howccimplinthephysalia">how cc impl in the physalia</h2>
<p>I still think that if they have CC, it is something very simple; considering it's a configuration DB, the write throughput is so small that maybe it's even single-threaded (no cc, just sequential application of mutations) (however, these are blind guesses)<br>
right; in any event, since most storage engines are built for transactionality, you sort of have to implement CC one way or the other, even if you don’t use it for imposing order, whether it’s LSM, B-Tree or anything else I’m aware of.<br>
I’m not arguing about the specific use-case/implementation though. Maybe they have partitioned store, maybe they’re just single-threaded. I was speaking more of a general case</p>
<h2 id="getthepictureofkindofconfigsareexactlystoredinthereaskeyvaluestorejustsdinforeplicasinfo">get the picture of what kind of configs are exactly stored in there as a key-value store? Just SD info, replica info?</h2>
<p>The master stores a small amount of configuration data indicating which servers hold the data for a given volume and the order in which they replicate it, which is important for identifying up-to-date data. from author’s blog.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Best of VIM Tips, gVIM's Key Features zzapper]]></title><description><![CDATA[网站、博客文章、论文推荐或评论]]></description><link>https://blog.approachai.com/best-of-vim-tips-gvims-key-features-zzapper/</link><guid isPermaLink="false">5e7cbfe54cea2827f8cf4f8d</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[沧海一声笑]]></dc:creator><pubDate>Thu, 26 Mar 2020 14:44:53 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Best practice for Vim</p>
<p><a href="http://zzapper.co.uk/vimtips.html">http://zzapper.co.uk/vimtips.html</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>