-
Joel Sing authored
This provides an assembly implementation of addVW for riscv64, processing up to four words per loop, resulting in a significant performance gain. On a StarFive VisionFive 2: │ addvw.1 │ addvw.2 │ │ sec/op │ sec/op vs base │ AddVW/1-4 57.43n ± 0% 41.45n ± 0% -27.83% (p=0.000 n=10) AddVW/2-4 69.31n ± 0% 48.15n ± 0% -30.53% (p=0.000 n=10) AddVW/3-4 76.12n ± 0% 54.97n ± 0% -27.79% (p=0.000 n=10) AddVW/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10) AddVW/5-4 96.16n ± 0% 62.82n ± 0% -34.67% (p=0.000 n=10) AddVW/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10) AddVW/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10) AddVW/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10) AddVW/10000-4 151.7µ ± 0% 103.7µ ± 0% -31.63% (p=0.000 n=10) AddVW/100000-4 1.523m ± 0% 1.050m ± 0% -31.03% (p=0.000 n=10) AddVWext/1-4 57.42n ± 0% 41.45n ± 0% -27.81% (p=0.000 n=10) AddVWext/2-4 69.32n ± 0% 48.15n ± 0% -30.54% (p=0.000 n=10) AddVWext/3-4 76.12n ± 0% 54.87n ± 0% -27.92% (p=0.000 n=10) AddVWext/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10) AddVWext/5-4 96.15n ± 0% 62.82n ± 0% -34.66% (p=0.000 n=10) AddVWext/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10) AddVWext/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10) AddVWext/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10) AddVWext/10000-4 150.5µ ± 0% 103.7µ ± 0% -31.10% (p=0.000 n=10) AddVWext/100000-4 1.530m ± 0% 1.049m ± 0% -31.41% (p=0.000 n=10) geomean 1.003µ 633.9n -36.79% │ addvw.1 │ addvw.2 │ │ B/s │ B/s vs base │ AddVW/1-4 132.8Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10) AddVW/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.96% (p=0.000 n=10) AddVW/3-4 300.7Mi ± 0% 416.4Mi ± 0% +38.48% (p=0.000 n=10) AddVW/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10) AddVW/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.06% (p=0.000 n=10) AddVW/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10) AddVW/100-4 684.1Mi ± 0% 1389.0Mi ± 0% +103.03% (p=0.000 n=10) AddVW/1000-4 710.9Mi ± 0% 1507.8Mi ± 0% +112.08% (p=0.000 n=10) AddVW/10000-4 503.1Mi ± 0% 735.8Mi ± 0% +46.26% (p=0.000 n=10) AddVW/100000-4 501.0Mi ± 0% 726.5Mi ± 0% +45.00% (p=0.000 n=10) AddVWext/1-4 132.9Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10) AddVWext/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.98% (p=0.000 n=10) AddVWext/3-4 300.7Mi ± 0% 417.1Mi ± 0% +38.73% (p=0.000 n=10) AddVWext/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10) AddVWext/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.05% (p=0.000 n=10) AddVWext/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10) AddVWext/100-4 684.2Mi ± 0% 1389.0Mi ± 0% +103.02% (p=0.000 n=10) AddVWext/1000-4 710.9Mi ± 0% 1507.7Mi ± 0% +112.08% (p=0.000 n=10) AddVWext/10000-4 506.9Mi ± 0% 735.8Mi ± 0% +45.15% (p=0.000 n=10) AddVWext/100000-4 498.6Mi ± 0% 727.0Mi ± 0% +45.79% (p=0.000 n=10) geomean 388.3Mi 614.3Mi +58.19% Change-Id: Ib14a4b8c1d81e710753bbf6dd5546bbca44fe3f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/595397 Reviewed-by:
Meng Zhuo <mengzhuo1203@gmail.com> Reviewed-by:
Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by:
Mark Ryan <markdryan@rivosinc.com> Reviewed-by:
Dmitri Shuralyov <dmitshur@google.com>
Joel Sing authoredThis provides an assembly implementation of addVW for riscv64, processing up to four words per loop, resulting in a significant performance gain. On a StarFive VisionFive 2: │ addvw.1 │ addvw.2 │ │ sec/op │ sec/op vs base │ AddVW/1-4 57.43n ± 0% 41.45n ± 0% -27.83% (p=0.000 n=10) AddVW/2-4 69.31n ± 0% 48.15n ± 0% -30.53% (p=0.000 n=10) AddVW/3-4 76.12n ± 0% 54.97n ± 0% -27.79% (p=0.000 n=10) AddVW/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10) AddVW/5-4 96.16n ± 0% 62.82n ± 0% -34.67% (p=0.000 n=10) AddVW/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10) AddVW/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10) AddVW/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10) AddVW/10000-4 151.7µ ± 0% 103.7µ ± 0% -31.63% (p=0.000 n=10) AddVW/100000-4 1.523m ± 0% 1.050m ± 0% -31.03% (p=0.000 n=10) AddVWext/1-4 57.42n ± 0% 41.45n ± 0% -27.81% (p=0.000 n=10) AddVWext/2-4 69.32n ± 0% 48.15n ± 0% -30.54% (p=0.000 n=10) AddVWext/3-4 76.12n ± 0% 54.87n ± 0% -27.92% (p=0.000 n=10) AddVWext/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10) AddVWext/5-4 96.15n ± 0% 62.82n ± 0% -34.66% (p=0.000 n=10) AddVWext/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10) AddVWext/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10) AddVWext/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10) AddVWext/10000-4 150.5µ ± 0% 103.7µ ± 0% -31.10% (p=0.000 n=10) AddVWext/100000-4 1.530m ± 0% 1.049m ± 0% -31.41% (p=0.000 n=10) geomean 1.003µ 633.9n -36.79% │ addvw.1 │ addvw.2 │ │ B/s │ B/s vs base │ AddVW/1-4 132.8Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10) AddVW/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.96% (p=0.000 n=10) AddVW/3-4 300.7Mi ± 0% 416.4Mi ± 0% +38.48% (p=0.000 n=10) AddVW/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10) AddVW/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.06% (p=0.000 n=10) AddVW/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10) AddVW/100-4 684.1Mi ± 0% 1389.0Mi ± 0% +103.03% (p=0.000 n=10) AddVW/1000-4 710.9Mi ± 0% 1507.8Mi ± 0% +112.08% (p=0.000 n=10) AddVW/10000-4 503.1Mi ± 0% 735.8Mi ± 0% +46.26% (p=0.000 n=10) AddVW/100000-4 501.0Mi ± 0% 726.5Mi ± 0% +45.00% (p=0.000 n=10) AddVWext/1-4 132.9Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10) AddVWext/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.98% (p=0.000 n=10) AddVWext/3-4 300.7Mi ± 0% 417.1Mi ± 0% +38.73% (p=0.000 n=10) AddVWext/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10) AddVWext/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.05% (p=0.000 n=10) AddVWext/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10) AddVWext/100-4 684.2Mi ± 0% 1389.0Mi ± 0% +103.02% (p=0.000 n=10) AddVWext/1000-4 710.9Mi ± 0% 1507.7Mi ± 0% +112.08% (p=0.000 n=10) AddVWext/10000-4 506.9Mi ± 0% 735.8Mi ± 0% +45.15% (p=0.000 n=10) AddVWext/100000-4 498.6Mi ± 0% 727.0Mi ± 0% +45.79% (p=0.000 n=10) geomean 388.3Mi 614.3Mi +58.19% Change-Id: Ib14a4b8c1d81e710753bbf6dd5546bbca44fe3f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/595397 Reviewed-by:
Meng Zhuo <mengzhuo1203@gmail.com> Reviewed-by:
Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by:
Mark Ryan <markdryan@rivosinc.com> Reviewed-by:
Dmitri Shuralyov <dmitshur@google.com>
Code owners
Assign users and groups as approvers for specific file changes. Learn more.